Grabbed up as much RAM as they could, nearly no questions asked, at above-market rates in some cases, ramping up perceived demand and significantly shrinking supply.
Kill their Supreme Leader and 40 other leaders, destroy their Navy and Air Force, give them 30 days of B-1 and B-2 bombings night and day, and they still decide it's worth it to joke on April Fools'? :-) I have to give it to them...
This is the bet of many of the big AI companies, and why they're heavily subsidizing the calls. With the latest crackdown by the US gov, it seems Anthropic is starting to reduce those subsidies given its edge in the game. I am starting to consider local models more seriously, beyond just testing, but these days the RAM/GPU market is inflated.
Local models just don't seem that useful to me for these particular tasks yet - the most recent versions of Codex and Claude Opus are the first time I've found them particularly useful in a "real engineering" context that isn't just vibe coding.
Google's TurboQuant might help address this, but it also might just widen the gap even further.
I am far on the skeptical end when it comes to the generative AI side of ML tools, though, so do weigh my opinion accordingly.
TurboQuant is totally irrelevant compared to current quantization methods. It has been thoroughly tested by people who build inference engines for local models. It's all talk, no actual meat to it.
Their TurboQuant (TQ) paper is not new per se. It was released last year, and is largely a rehash of ideas released a year prior (RabitQ). There is also [a bit of drama](https://openreview.net/forum?id=tO3ASKZlok) there that boils down to what seems to be a bit of malpractice by Google's researchers.

TQ claims a few things: better compression quality and speed, and better KV cache handling. Currently the KV cache takes a large share of resources on top of the model itself. Many people have applied different quantization strategies to it, but the quality degradation is too apparent. Enter attention rotation. This seems to have genuinely helped KV cache compression, per [llama.cpp's latest tests](https://github.com/ggml-org/llama.cpp/pull/21038). On the other hand, [ik_llama.cpp](https://www.reddit.com/r/LocalLLaMA/comments/1s7nq6b/technic...) tested the quality of turboquant-3 against IQ4 quantized models, and the quality degradation is much worse.

So it's 2 things: KV compression -> good. TurboQuant quantization -> not good.
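For intuition on why rotation helps here: KV activations tend to have a few outlier channels that blow up the quantization scale, and mixing them through an orthogonal rotation before quantizing spreads that energy out. Here's a minimal numpy sketch of that idea (not TQ's or llama.cpp's actual algorithm; the random QR rotation stands in for the structured rotations real implementations use):

```python
import numpy as np

def quantize_dequant(x, bits=3):
    # Symmetric per-row quantization to `bits` bits, then dequantize,
    # so we can measure the round-trip error.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / levels
    q = np.clip(np.round(x / scale), -levels, levels)
    return q * scale

def random_orthogonal(d, seed=0):
    # Random orthogonal matrix via QR; a stand-in for the fast
    # Hadamard-style rotations used in practice.
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

d = 128
rng = np.random.default_rng(1)
# Fake KV-cache slice: mostly small values plus a few outlier channels,
# the pattern that makes naive low-bit quantization lossy.
kv = rng.standard_normal((512, d))
kv[:, :4] *= 20.0  # outlier channels

R = random_orthogonal(d)

err_plain = np.linalg.norm(kv - quantize_dequant(kv)) / np.linalg.norm(kv)
# Rotate, quantize, dequantize, rotate back. The rotation mixes the
# outliers across all channels, so per-row scales fit the data better.
kv_rot = quantize_dequant(kv @ R) @ R.T
err_rot = np.linalg.norm(kv - kv_rot) / np.linalg.norm(kv)

print(f"relative error, plain 3-bit:   {err_plain:.4f}")
print(f"relative error, rotated 3-bit: {err_rot:.4f}")
```

With the outlier channels present, the rotated version comes back with noticeably lower reconstruction error at the same bit width, which is the whole trick.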
Tbh, I think distillation is happening both ways. At this stage, "quality" is stagnating and the main edge is the tooling. CC's harness seems to be the best so far, and I wonder if this leak will level the playing field on usability.
It's funny you mention that. The only difference is that sometimes you need a piece of functionality without doing the plumbing yourself. At the end of the day, if you're getting the output you need, the process doesn't matter. It's an interesting analogy, but it only works if the inspector is another expert dev.
When I have such a moment and take a step back, there’s usually a strong hint that there’s a meta-problem behind those instances. And while you have to choose when to take the time to solve such a problem, it’s usually worth it.
It is amazing how you can order so many small sensors from AliExpress, around 1-2€ each, and have them delivered within a week or two. I am not sure we will have this for long.