They remind me more of the Korean TV series Cashero. There, the main hero has strength superpowers, but the power comes by crunching through cash in his pocket.
But I use the per-token APIs for my usage, not a subscription, so I'm guessing I'm in the minority.
I also use them per-token (and strongly prefer that due to a lack of lock-in).
However, from a game theory perspective, when there's a subscription, the model makers are incentivized to maximize problem solving in the minimum amount of tokens. With per-token pricing, the incentive is to maximize problem solving while increasing token usage.
I don't think this is quite right, because it's the same model underneath. The problem can manifest more through the tooling on top, but even there it's hard to pad token usage without people catching you.
I do agree that Big AI has misaligned incentives with users, generally speaking. This is why I pay per-token and use a custom agent stack.
I suspect the game theoretic aspects come into play more with the quantizing. I have not (anecdotally) experienced this in my API based, per-token usage. I.e. I'm getting what I pay for.
Some more love for the updates page. E.g. select a subset of updates to install, be clearer that the last update time could differ if you installed updates via the CLI, that kind of thing.
Are there any good static (i.e. not runtime) type checkers for arrays and tensors? E.g. "16x64x256 fp16" in numpy, pytorch, jax, cupy, or whatever framework. Would be pretty useful for ML work.
This would be an insta-switch feature for me! Jaxtyping is a great idea, but the runtime-only aspect kills it for me - I just resort to shape assertions + comments, but it's a pretty poor solution.
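For reference, the "shape assertions + comments" workaround can be packaged into a small decorator so the checks at least live next to the signature. A minimal sketch (the `check_shapes` name and its spec format are my own invention, not a real library):

```python
import functools
import inspect
import numpy as np

def check_shapes(**specs):
    """Assert named array arguments match a shape spec at call time.

    A spec is a tuple where ints must match exactly and strings are
    symbolic dims that must be consistent across all arguments.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = inspect.signature(func).bind(*args, **kwargs)
            dims = {}  # symbolic dim name -> concrete size seen so far
            for name, spec in specs.items():
                arr = bound.arguments[name]
                assert arr.ndim == len(spec), \
                    f"{name}: rank {arr.ndim} != {len(spec)}"
                for axis, (got, want) in enumerate(zip(arr.shape, spec)):
                    if isinstance(want, int):
                        assert got == want, \
                            f"{name} axis {axis}: {got} != {want}"
                    else:
                        assert dims.setdefault(want, got) == got, \
                            f"{name} axis {axis}: {want}={dims[want]} but got {got}"
            return func(*args, **kwargs)
        return wrapper
    return decorator

@check_shapes(x=("b", 64), w=(64, 256))
def project(x, w):
    return x @ w

project(np.zeros((8, 64)), np.zeros((64, 256)))  # OK
```

It's still runtime-only, of course, so it has the same fundamental limitation as jaxtyping: nothing is caught before the code actually runs.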
A follow-up question: Google's old `tensor_annotations` library (RIP) could statically analyse operations, e.g. `reduce_sum(Tensor[Time, Batch], axis=0) -> Tensor[Batch]`. I guess that wouldn't come with static analysis for jaxtyping?
> There exist a couple of contract libraries. However, at the time of this writing (September 2018), they all required the programmer either to learn a new syntax (PyContracts) or to write redundant condition descriptions (e.g., contracts, covenant, deal, dpcontracts, pyadbc and pcd).
@icontract.require(lambda x: x > 3, "x must not be small")
def some_func(x: int, y: int = 5) -> None:
    ...
icontract with numpy array types:
import icontract
import numpy as np

@icontract.require(lambda arr: isinstance(arr, np.ndarray))
@icontract.require(lambda arr: arr.shape == (3, 3))
@icontract.require(lambda arr: np.all(arr >= 0), "All elements must be non-negative")
def process_matrix(arr: np.ndarray):
    return np.sum(arr)

invalid_matrix = np.array([[1, -2, 3], [4, 5, 6], [7, 8, 9]])
process_matrix(invalid_matrix)  # Raises icontract.ViolationError
> The result is a powerful combination that allows you to automatically test your code. Instead of writing manually the Hypothesis search strategies for a function, icontract-hypothesis infers them based on the function's precondition. This makes automatic testing as effortless as it goes.
> If you have a function with type annotations and add a contract in a supported syntax, CrossHair will attempt to find counterexamples for you: [gif]
> CrossHair works by repeatedly calling your functions with symbolic inputs. It uses an SMT solver (a kind of theorem prover) to explore viable execution paths and find counterexamples for you
Check out optype (specifically the optype.numpy namespace). If you use scipy, scipy-stubs is compatible and the developer of both is very active and responsive. There's also a new standalone stubs library for numpy called numtype, but it's still in alpha.
Jaxtyping is the best option currently - despite the name it also works for Torch and other libs. That said, I think it still leaves a lot to be desired. It's runtime-only, so unless you wire it into a typechecker it's only a hint. And, for me, the hints aren't parsed by Intellisense, so you don't see shape hints when calling a function - only when directly reading the function definition.
Personally, I also think the syntax is a little verbose: for a generic shape hint you need something like `Shaped[Array, "m n"]`. But 95% of the time I only really care about the shape "m n". It doesn't sound like much, but I recently tried hinting a codebase with jaxtyping and gave up because it was adding so much visual clutter, without clear benefits.
There have been some early proposals to add something like that, but none of them have made it very far yet. As you might imagine, it's a hard problem!
Of typical homelabs that are posted and discussed.
The online activity of the homelab community leans towards those who treat it as an enjoyable hobby as opposed to a pragmatic solution.
I'm on the other side of the spectrum. Devops is (at best) a neutral activity; I personally do it because I strongly dislike companies being able to do a rug-pull. I don't think you'll see setups like mine too often, as there isn't anything to brag about or to show off.
It still seems to have the same pitfalls as all the other image generation models. I ran it through my test prompt (wary of posting it here, lest it get trained on) - it still cannot generate something along the lines of "object A, but with feature X from Y", where that combo has never been seen in the training data. I wonder how the "astronaut riding a unicorn on the moon" problem was solved...
EDIT: after significant prompting, it actually solved it. I think it's the first one to do so in my testing.
It's a shame that it's not on OpenRouter. I hate platform lock-in, but the top-tier "deep think" models have been increasingly requiring the use of their own platform.
OpenRouter is pretty great, but I think litellm does a very good job too, and it's not a platform middleman, just a Python library. That being said, I haven't tried it with the deep think models.
Part of OpenRouter's appeal to me is precisely that it is a middle man. I don't want to create accounts on every provider, and juggle all the API keys myself. I suppose this increases my exposure, but I trust all these providers and proxies the same (i.e. not at all), so I'm careful about the data I give them to begin with.
Unfortunately that's ending with mandatory-BYOK from the model vendors. They're starting to require that you BYOK to force you through their arbitrary+capricious onboarding process.
Sweet! I love clever information theory things like that.
It goes the other way too. Given that LLMs are just lossless compression machines, I do sometimes wonder how much better they are at compressing plain text compared to zstd or similar. Should be easy to calculate...
EDIT: lossless when they're used as the probability estimator and paired with something like an arithmetic coder.
> Given that LLMs are just lossless compression machines, I do sometimes wonder how much better they are at compressing plain text compared to zstd or similar. Should be easy to calculate...
I've actually been experimenting with that lately. I did a really naive version that tokenizes the input, feeds the max context window up to the token being encoded into an LLM, and uses that to produce a distribution of likely next tokens, then encodes the actual token with Huffman Coding with the LLM's estimated distribution. I could get better results with arithmetic encoding almost certainly.
It outperforms zstd by a long shot on natural language, like Wikipedia articles or markdown documents (I haven't dedicated the compute horsepower to figuring out what "a long shot" means quantitatively, with reasonably small confidence intervals), but (using GPT-2) it's about as good as zstd, or worse, on things like files in the Kubernetes source repository.
You already get a significant amount of compression just out of the tokenization in some cases ("The quick red fox jumps over the lazy brown dog." encodes to one token per word, plus one token for the '.', with the GPT-2 tokenizer), whereas with code a lot of your tokens will just represent a single character, so the entropy coding is doing all the work. That means your compression is only as good as the accuracy of your LLM, plus the efficiency of your entropy coding.
I would need to be encoding multiple tokens per Huffman symbol to hit the entropy bounds, since Huffman coding has a minimum of one bit per symbol: if tokens mostly just represent a single byte, then I can't do better than a 12.5% compression ratio with one code per token. And doing otherwise gets computationally infeasible very fast. Arithmetic coding would do much better, especially on code, because it can encode a token with fractional bits.
I used Huffman coding for my first attempt because it's easier to implement and most libraries don't support dynamically updating the distribution throughout the process.
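The fractional-bit point is easy to see with stdlib arithmetic: Huffman assigns every symbol at least one whole bit, while the information content of a token the model assigns probability p is -log2(p) bits, which arithmetic coding can approach. (The probabilities below are illustrative numbers I made up, not measurements from a real model.)

```python
import math

# A high-probability token (say the model puts 95% on the next token):
p = 0.95
print(f"information content: {-math.log2(p):.3f} bits")  # well under 1 bit
print("Huffman floor:       1 bit")

# Over a 1000-token sequence where the model averages 90% confidence,
# compare the arithmetic-coding bound against the Huffman floor:
avg_p = 0.90
arith_bits = 1000 * -math.log2(avg_p)
print(f"arithmetic coding: ~{arith_bits:.0f} bits, Huffman: >= 1000 bits")
```

This is exactly why Huffman coding caps out on single-character tokens: each one costs at least a full bit no matter how confident the model is.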
I do not agree with the "lossless" adjective. And even if it were lossless, it is certainly not deterministic.
For example, I would not want a zip of an encyclopedia that uncompresses to unverified, approximate and sometimes even wrong text. According to this site: https://www.wikiwand.com/en/articles/Size%20of%20Wikipedia a compressed Wikipedia without media, just text, is ~24GB. What's the median size of an LLM, 10 GB? 50 GB? 100 GB? Even if it's less, it's not an accurate and deterministic way to compress text.
(To be clear, this is not me arguing for any particular merits of LLM-based compression.) You appear to have conflated one particular nondeterministic LLM-based compression scheme that you imagined with all possible such schemes. Many of those schemes would easily fit any reasonable definition of lossless and deterministic, by doing deterministic things with the probability distributions the LLM outputs at each step along the input sequence being compressed.
With a temperature of zero, LLM output will always be the same. Then it becomes a matter of getting it to output the exact replica of the input: if we can do that, it will always produce it, and the fact it can also be used as a bullshit machine becomes irrelevant.
With the usual interface it’s probably inefficient: giving just a prompt alone might not produce the output we need, or it might be larger than the thing we’re trying to compress. However, if we also steer the decisions along the way, we can probably give a small prompt that gets the LLM going, and tweak its decision process to get the tokens we want. We can then store those changes alongside the prompt. (This is a very hand-wavy concept, I know.)
There's an easier and more effective way of doing that - instead of trying to give the model an extrinsic prompt which makes it respond with your text, you use the text as input and, for each token, encode the rank of the actual token within the set of tokens that the model could have produced at that point. (Or an escape code for tokens which were completely unexpected.) If you're feeling really crafty, you can even use arithmetic coding based on the probabilities of each token, so that encoding high-probability tokens uses fewer bits.
From what I understand, this is essentially how ts_zip (linked elsewhere) works.
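A toy round trip of the rank idea, with a static character-frequency table standing in for the LLM (a real LLM would produce a fresh ranked candidate list per position from the preceding context, and you'd need the escape code for out-of-model symbols; this sketch skips both):

```python
from collections import Counter

def ranked_symbols(model_counts):
    # Most-likely-first candidate list. Counter.most_common breaks ties
    # by insertion order, so encoder and decoder agree deterministically.
    return [s for s, _ in model_counts.most_common()]

def encode(text, model_counts):
    order = ranked_symbols(model_counts)
    # Store the rank of each actual symbol within the model's prediction.
    return [order.index(ch) for ch in text]

def decode(ranks, model_counts):
    order = ranked_symbols(model_counts)
    return "".join(order[r] for r in ranks)

corpus = "the quick brown fox jumps over the lazy dog"
model = Counter(corpus)

ranks = encode(corpus, model)
assert decode(ranks, model) == corpus  # lossless round trip
print(ranks[:10])
```

The payoff is that with a good model the rank stream is dominated by small numbers (mostly 0s for a strong predictor), which a subsequent entropy coder compresses very well.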
The models are differentiable; they are trained with backprop. You can easily just run it in reverse to get the input that produces near certainty of producing the output. For a given sequence length, you can create a new optimization that takes the input sequence, passes it to the (frozen) model, and runs steps over the input sequence to reduce the "loss", which is the desired output. This will give you the optimal sequence of that length to maximize the probability of seeing the output sequence. Of course, if you're doing this to ChatGPT or another API-only model, you have no choice but to hunt around.
Of course, the optimal sequence to produce the output will be a series of word vectors (of several hundred dimensions each). You could match each to its closest word in any language (or make this a constraint during solving), or just use the vectors themselves as the compressed data value.
Ultimately, NNets of various kinds are used for compression in various contexts. There are some examples where Gaussian-splatting-like 3D scenes are created by compressing all the data into the weights of a NNet, via a process similar to what I described, to create a fully explorable 3D color scene that can be rendered from any angle.
A bit of nitpicking, a temperature of zero does not really exist (it would lead to division by zero in softmax). It's sampling (and non-deterministic compute kernels) that makes token prediction non-deterministic. You could simply fix it (assuming deterministic kernels) by using greedy decoding (argmax with a stable sort in the case of ties).
As temperatures approach zero, the probability of the most likely token approaches one (assuming no ties). So my guess is that LLM inference providers started using temperature=0 to disable sampling because people would try to approximate greedy decoding by using teensy temperatures.
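The limiting behaviour is easy to see numerically: softmax with a shrinking temperature piles essentially all the probability onto the argmax. A minimal sketch with made-up logits (not from any real model):

```python
import math

def softmax(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 0.3]  # illustrative next-token logits

for t in (1.0, 0.1, 0.01):
    probs = softmax(logits, t)
    print(t, [round(p, 4) for p in probs])
# As t -> 0 the top logit's probability approaches 1, matching greedy
# decoding (argmax). At exactly t = 0 the division is undefined, which
# is why "temperature=0" has to be special-cased to argmax.
```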
Compression can be generalized as probability modelling (prediction) + entropy coding of the difference between the prediction and the data. The entropy coding has known optimal solutions.
So yes, LLMs are nearly ideal text compressors, except for all the practical inconveniences of their size and speed (they can be reliably deterministic if you sacrifice parallel execution and some optimizations).
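The prediction-plus-entropy-coding framing gives a concrete size bound: if the model assigns probability p_i to each actual next token, an ideal entropy coder's output approaches the sum of -log2(p_i) bits, i.e. the model's cross-entropy on the text. A stdlib sketch with hypothetical per-token probabilities (made-up numbers, not from a real model):

```python
import math

# Probabilities a model hypothetically assigned to each actual
# next token in some short text:
token_probs = [0.42, 0.90, 0.15, 0.77, 0.60, 0.95, 0.33, 0.88]

# Shannon bound: total bits = sum of the information content of
# each token under the model's prediction.
bound_bits = sum(-math.log2(p) for p in token_probs)
print(f"ideal compressed size: {bound_bits:.1f} bits "
      f"for {len(token_probs)} tokens")
print(f"= {bound_bits / len(token_probs):.2f} bits/token")
```

A better predictor puts higher probability on the true tokens and so shrinks the bound, which is exactly why a stronger LLM compresses the same text further.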
LLMs are good at predicting the next token. Basically, you use them to predict the probabilities of the next token being a, b, or c, and then use arithmetic coding to store which one matched. So the LLM is used during both compression and decompression.
Yes LLMs are always lossy, unless their size / capacity is so huge they can memorize all their inputs. Even if LLMs were not resource-constrained, one would expect lossy compression due to batching and the math of the loss function. Training is such that it is always better for the model to accurately approximate the majority of texts than to approximate any single text with maximum accuracy.