
This failed when I put in an Australian postal code.

Shell programming is high-density inter-language glue. You simply have more implementations to call out to, and so less to write yourself.

I can trivially combine a tool written in Rust with one written in JS/Java/C/whatever without writing bindings.
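As a sketch of what that glue looks like in practice (the Python one-liner here stands in for any Rust/JS/whatever binary on your PATH):

```shell
# Pipes are the only "binding": a Python one-liner feeds awk (written
# in C) which feeds sort (also C) -- no FFI, no wrapper code, just
# byte streams over stdin/stdout.
python3 -c 'print("\n".join(["banana", "apple", "cherry"]))' \
  | awk '{ print toupper($0) }' \
  | sort
```

Swap in ripgrep (Rust) or a Node script and nothing about the composition changes.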


First thing a public SpaceX would want to do is sell off all the non-SpaceX crap.


A public SpaceX will still be run by Musk. A public SpaceX would have to sell assets like X at a huge loss given its debt load, which would also take a propaganda machine out of Musk’s hands.

They’re stuck with those assets.


This sort of thing will be great for the SpaceX IPO :/


Especially if contracts with SpaceX start being torn up because the various ongoing investigations and prosecutions of xAI are now ongoing investigations and prosecutions of SpaceX. And then new lawsuits over the conflict of interest created by the merger.


Running llama.cpp rather than vLLM, it's happy enough to run the FP8 variant with 200k+ context using about 90 GB of VRAM.


Yeah, but what did you get for tok/sec there? Memory bandwidth is the limitation with these devices. With 4-bit I didn't get over 35-39 tok/sec, and averaged more like 30 when doing actual tool use with opencode. I can't imagine FP8 being faster.
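A rough back-of-envelope for why bandwidth dominates: if decode is purely memory-bound, each generated token has to stream the active weights through memory once. The bandwidth figure and parameter counts below are assumptions I'm plugging in for illustration, not measurements:

```python
# Toy estimate: tok/s upper bound if decode only streams the weights
# (ignores KV cache reads and all compute/overhead).
def decode_tok_s(bandwidth_gb_s: float, active_params_b: float,
                 bytes_per_param: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Assumed: ~273 GB/s unified memory; a ~5.1B-active MoE at 8 bits
# vs. a 70B dense model at ~5 bits per weight.
print(round(decode_tok_s(273, 5.1, 1.0), 1))   # MoE: only active experts stream
print(round(decode_tok_s(273, 70, 0.625), 1))  # dense: every weight, every token
```

Real numbers come in well below these bounds, but the ratio between MoE and dense tracks what people actually see.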


You can run large-ish MoE models at good speeds; gpt-oss-120b, for example, is snappy enough even with a big context.

But large and dense at the same time is a bit slow.

Running a local LLM will cost a load of money for something much slower than the API providers, though.


Makes sense regarding the MoE performance. I am not sure the cost argument holds up for high-volume workloads, though. If you are running batch jobs 24/7, the hardware pays for itself in a few months compared to API opex. It really just comes down to utilization.
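The utilization point can be made concrete. Every number below (hardware price, power draw, electricity rate, API price) is a made-up assumption, just to show the shape of the math:

```python
# Months until hardware cost equals API spend for the same token volume.
def break_even_months(hw_cost: float, power_w: float, kwh_price: float,
                      tok_s: float, api_price_per_mtok: float,
                      utilization: float = 1.0) -> float:
    tokens_per_month = tok_s * utilization * 3600 * 24 * 30
    api_cost_per_month = tokens_per_month / 1e6 * api_price_per_mtok
    power_cost_per_month = power_w / 1000 * 24 * 30 * kwh_price
    return hw_cost / (api_cost_per_month - power_cost_per_month)

# e.g. a $4000 box at 300 W, $0.30/kWh, sustaining 30 tok/s against
# a premium model priced at $10 per million output tokens:
print(round(break_even_months(4000, 300, 0.30, 30, 10.0), 1))
```

Note the result is very sensitive to the API price: against a cheap $1/Mtok model the same box takes decades to pay off, which is exactly the utilization argument cutting the other way.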


Do you have specific t/s numbers for those dense models? I'm curious just how severe the memory bandwidth bottleneck gets in practice.

I'm not sure I agree on the cost aspect though. For high-volume production workloads the API bills scale linearly and can get painful fast. If you can amortize the hardware over a year and keep the data local for privacy, the math often works out in favor of self-hosting.


For Qwen2.5-72B-Instruct-Q5_K_M at 32k context, I fed it a 26k-token file (a truncated fiction novel) and asked it to summarize; it processed the input at 224 tok/s and generated output at 3 tok/s. Not really good enough for interactive use without frustration: not just from watching it reply, but also the long wait for it to actually read the book.

On the same hardware with gpt-oss-120b at 128k context, I fed it a longer version of the input (a whole novel, 97k tokens), and it processed the input at 1650 tok/s and generated output at 27 tok/s. Just fast enough, IMO.
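Plugging those measured rates into an end-to-end latency estimate shows why one feels usable and the other doesn't. This is just the arithmetic from the two runs above, with an assumed 500-token summary as output:

```python
# End-to-end wait = prompt processing time + generation time.
def total_seconds(prompt_tok: int, prefill_tok_s: float,
                  out_tok: int, decode_tok_s: float) -> float:
    return prompt_tok / prefill_tok_s + out_tok / decode_tok_s

dense = total_seconds(26_000, 224, 500, 3)    # Qwen2.5-72B run
moe = total_seconds(97_000, 1650, 500, 27)    # gpt-oss-120b run
print(round(dense / 60), round(moe / 60))     # minutes of wall-clock wait
```

So roughly five minutes versus about one, even though the MoE run read nearly four times as much text.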


We may as well have the LLMs use the hardest, most provably correct language possible.


What is the article text, for those stopped by the paywall?



It really feels like more than 1 in 20 driving around the 101/280.


Probably because Santa Clara County has more EV sales than its neighbors, according to this map:

https://www.energy.ca.gov/data-reports/energy-almanac/zero-e...


And newer cars get driven more than old cars on average, so if 1 in 20 cars is an EV, they'll account for more than 1/20th of the miles.
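A quick illustration of that weighting; the annual-mileage figures are made up:

```python
# Share of total miles driven by EVs if they are 5% of the fleet but
# are driven more per year than the average older car.
ev_share, ev_miles, other_miles = 0.05, 14_000, 10_000
miles_share = (ev_share * ev_miles
               / (ev_share * ev_miles + (1 - ev_share) * other_miles))
print(round(miles_share, 3))  # ~6.9% of miles from 5% of cars
```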


Where are the RVA23 boards that have been hinted at for so long?

