Hey, this is super cool! I’ve been working on a similar problem, focusing on low-resource and underserved languages including the Mayan family, and have published some research and open resources around that [0, 1].
On the data side, I’ve found that the biggest bottleneck isn’t collecting text (it’s out there!) but reliable language identification. It’s often difficult or ambiguous to separate languages cleanly in datasets like Common Crawl, Fineweb, or others. I worked on improving this a bit for Fineweb 2 for my native language, which might inspire you [2].
Many of the challenges you mention seem to recur across regions and language families, so I’d love to connect and compare notes sometime. Feel free to reach me at omar [at] the labs site below.
Excellent, thank you mandeepj! Curious about your agent's language coverage and if/how you plan to eval it, if you're willing to share more.
> RAM use also increases with context window size.
The KV cache is very swappable since only a small amount is written per generated token (whereas swapping the weights would mean moving on the order of llm_active_size per token, which is way too much at scale!), so it may be possible to support long contexts with quite acceptable performance while still saving RAM.
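A rough back-of-envelope sketch of that asymmetry (the architecture numbers are illustrative placeholders, not any particular model's config): the KV cache grows by a few hundred KiB per token, while the active weights have to be touched in full every token.

```python
def kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # Each layer appends one key vector and one value vector per generated token.
    return n_layers * 2 * n_kv_heads * head_dim * dtype_bytes

def weight_bytes_per_token(active_params=3e9, dtype_bytes=2):
    # Every active parameter is read once per forward pass, i.e. per token.
    return active_params * dtype_bytes

print(f"KV cache written per token: {kv_bytes_per_token() / 1024:.0f} KiB")
print(f"Weights read per token:     {weight_bytes_per_token() / 1e9:.1f} GB")
```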
Also make sure you're using mmap to load the model parameters, especially for MoE experts. It has no detrimental effect on performance provided you have enough RAM to begin with, but it lets you scale gradually beyond that at a very limited initial cost (you're only replacing a fraction of your memory_bandwidth with much lower storage_bandwidth).
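A concept sketch of why that works (this isn't llama.cpp's internals; the file name, dtype and sizes are made up): memory-mapping means pages are only faulted in from disk when first touched, and the OS can evict cold pages under memory pressure.

```python
import numpy as np

# Hypothetical raw fp16 weight dump; in practice the runtime mmaps its own
# model format, but the paging behaviour is the same.
weights = np.memmap("model.fp16.bin", dtype=np.float16, mode="r")

# Touching a slice pages in only that region; experts you never route to
# can stay on disk and cost you nothing but address space.
first_block = weights[:1_000_000]
print(first_block.mean())
```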
Well, mmap can still cause problems if you run short on RAM, and the resulting disk access can hurt latency and overall performance. It's better than nothing, though.
Strictly speaking: no. The "forward pass" terminology does not imply that there exists a "reverse pass" that does the same kind of computation. Rather, it's describing two different kinds of computation, and the direction they occur in.
The forward pass is propagating from inputs to outputs, computing the thing the model was trained for. The reverse/backwards pass is propagating from outputs back to inputs, but it's calculating the gradients of parameters for training (roughly: how much changing each parameter in isolation affects the output, and whether it moves the output closer to the desired training output). The result of the "reverse pass" isn't a set of inputs, but a set of annotations on the model's parameters that guide their adjustment.
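A tiny PyTorch illustration of that distinction (toy model, nothing to do with any particular LLM): the backward pass leaves gradients attached to the parameters, and nothing resembling a reconstructed input comes out of it.

```python
import torch

model = torch.nn.Linear(4, 1)
x = torch.randn(8, 4)
target = torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(model(x), target)  # forward pass
loss.backward()                                         # "reverse" pass

print(model.weight.grad.shape)  # gradient annotation on the weight matrix
print(x.grad)                   # None: no inputs were recovered or produced
```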
The computations of the forward pass are not trivially reversible (e.g. they include additions, which destroy information about the operand values). As a sibling thread points out, you can still probabilistically explore what inputs _could_ produce a given output, and get some information back that way, but it's a lossy process.
And of course, you could train a "reverse" model, one that predicts the prefix of a sequence given a suffix (trivially: it's the same suffix prediction problem, but you train it on reversed sequences). But that would be a separate model trained from scratch on that task, and in that model the prefix prediction would be its forward pass.
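The data trick is literally just flipping the sequences before building the usual next-token training pairs; a minimal sketch (the token ids are arbitrary placeholders):

```python
def make_reverse_example(token_ids):
    rev = list(reversed(token_ids))
    # Standard causal LM pairs over the reversed sequence: predicting the
    # "next" token here means predicting earlier text from later text.
    return rev[:-1], rev[1:]

inputs, targets = make_reverse_example([5, 17, 902, 33, 4])
print(inputs, targets)  # [4, 33, 902, 17] [33, 902, 17, 5]
```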
I do want to see ChatGPT running upwards on my screen now, predicting earlier and earlier words in a futile attempt to explain a nonsense conclusion. We could call it ChatJeopardy.
Not as trivially as the forward direction (unsurprisingly, information is lost), but better than you might expect. See for example https://arxiv.org/pdf/2405.15012
Being 32B, the model could run in <20GB of VRAM with Q4 quantization (minimal loss of quality), or 80GB unquantized at full fidelity. The quoted 160GB is for their recommended evaluation settings.
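Back-of-envelope on the weight footprint alone (runtime buffers, KV cache and whatever the recommended evaluation settings add on top are not included):

```python
params = 32e9  # parameter count of the model in question

for name, bits in [("fp16/bf16", 16), ("Q8", 8), ("Q4", 4)]:
    gb = params * bits / 8 / 1e9
    print(f"{name:>9}: ~{gb:.0f} GB of weights")
# fp16/bf16: ~64 GB, Q8: ~32 GB, Q4: ~16 GB -- which is why the Q4 weights
# fit under 20 GB of VRAM once some context overhead is added on top.
```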
There are a few pre-quantized options[0], or you can quantize it yourself with llama.cpp[1]. You can run the resulting GGUF with llama.cpp's `llama-cli` or `llama-server`, with LM Studio, or with Ollama.
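If you'd rather drive it from Python than the CLI, the llama-cpp-python bindings (my suggestion, not one of the options above) can load the same GGUF; the model path and settings below are placeholders.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Load a quantized GGUF; model path and context size are placeholders.
llm = Llama(model_path="./model-Q4_K_M.gguf", n_ctx=8192)

out = llm("Explain KV caching in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```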
I just went through an eerily similar situation where the coding agent was able to muster some pretty advanced math (information geometry) to solve my problem at hand.
But while I was able to understand it enough to steer the conversation, I was utterly unable to make any meaningful change to the code or grasp what it was doing. Unfortunately, unlike in the case you described, chatting with the LLM didn't cut it, as the domain is challenging enough. I've been down a rabbit hole for days now, picking up the math foundations and writing the code at a slower pace, albeit one I can keep up with.
And to be honest, it's incredibly fun. Applied math with a smart, dedicated tutor and the ability to immediately see results and build your intuition is miles ahead of my memories from my formative years.
Do you have benchmarks for the SGLang vs vLLM latency and throughput question? Not to challenge your point, but I’d like to reproduce these results and fiddle with the configs a bit, also on different models & hardware combos.
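If it helps, this is the kind of minimal probe I'd start from: both vLLM and SGLang expose an OpenAI-compatible HTTP endpoint, so one script can be pointed at either server. The URL, model name and prompt below are placeholders, and this only captures end-to-end request latency, not throughput or time-to-first-token.

```python
import time
import requests

URL = "http://localhost:8000/v1/chat/completions"  # whichever server is running
payload = {
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 128,
}

latencies = []
for _ in range(20):
    t0 = time.perf_counter()
    requests.post(URL, json=payload, timeout=120).raise_for_status()
    latencies.append(time.perf_counter() - t0)

latencies.sort()
print(f"p50: {latencies[len(latencies) // 2]:.2f}s  "
      f"p95: {latencies[int(len(latencies) * 0.95)]:.2f}s")
```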
Except this is GLM 4.7 Flash, which has 32B total params and 3B active. It should fit with a decent context window of 40k or so in 20GB of RAM at 4-bit weight quantization, and you can save even more by quantizing the activations and KV cache to 8-bit.
Yes, but the parent link was to the big GLM 4.7, which had a bunch of GGUFs; the new one didn't at the time of posting, nor does it now. I'm waiting on the Unsloth guys for the 4.7 Flash.
0: https://wikilangs.org
1: https://omneitylabs.com
2: https://huggingface.co/blog/omarkamali/gherbal-multilingual-...