The 4x comes from the neural accelerators (tensor cores, in NVIDIA jargon). It's 4x fp16 over the vector path (and 8x compared to M1, because at some point Apple doubled the fp16 vector path). So LLM prefill (context processing / TTFT), diffusion models (image gen), and e.g. video and photo effects that use them can be up to 4x faster. At fp16 that's the same speed at the same clock as NVIDIA, but NVIDIA additionally has 2x fp8 and 4x nvfp4.

Batch-1 token generation, the number that's usually quoted, does not benefit from this. It's purely RAM-bandwidth-limited.
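To put numbers on that (a back-of-envelope sketch; the figures are illustrative, not benchmarks): every generated token has to stream all active weights out of RAM once, so bandwidth sets the ceiling.

    # Back-of-envelope: batch-1 decode reads every active weight once per
    # token, so memory bandwidth sets the ceiling. Illustrative numbers only.
    def decode_tok_per_s(active_params_billions, bytes_per_param, bandwidth_gb_s):
        bytes_per_token = active_params_billions * 1e9 * bytes_per_param
        return bandwidth_gb_s * 1e9 / bytes_per_token

    print(decode_tok_per_s(7, 2, 400))    # 7B model, fp16, 400 GB/s -> ~29 tok/s
    print(decode_tok_per_s(7, 0.5, 400))  # same model at 4-bit -> ~114 tok/s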


AFAIA, emulators like FEX run within 30-70% of native performance; at the fringes it can be worse or better, but overall emulation seems totally fine. Plus, emulation technology could in general be used for binary optimization rather than strict instruction mapping, opening up room for further gains.


Some companies like to stress the efficiency or performance of Arm SoCs, but really this is a hedge against increasingly expensive x86 hardware. AMD has raised prices of mobile SoCs radically in recent years. I'm looking forward to more affordable SoC options for laptops, handhelds, and desktops, perhaps from MediaTek or other lower-cost vendors.

The history of the PC is one of commoditization. A fractured, multi-polar landscape is detrimental to the ecosystem and to productivity, and should ultimately fail.

x86 emulation is an important puzzle piece, and I'm happy Valve recognizes this and sponsors it.


This is for the Steam Frame


That's what it seems like. Some people here disagree with you, but I can add anecdata that my employer insisted I do no coding on such a visa.


Even team members visiting from Canada were told very explicitly not to say they're coming to the US to "work" but rather for a business trip. In practice, of course, everyone does some amount of actual work, from checking email/Slack to whiteboarding designs. If even those aren't allowed, I don't see how any in-person team meetings could be conducted.


Indeed, the memory model has a decent impact. Unfortunately it's difficult to isolate in measurements; only Apple ships support for both weak memory ordering and TSO in the same hardware.


Oh, there’s an interesting idea. Given that Linux runs on M1 and M2 Macs, would it be possible to do some kind of benchmark there, where you could turn TSO on and off at will for your test program?
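Something like this, maybe, assuming the prctl()-based memory-model switch from the Asahi Linux patches. The constant values below are guesses on my part and need to be checked against your kernel headers.

    # Hypothetical sketch: per-thread TSO toggle via prctl() on Asahi Linux.
    # PR_SET_MEM_MODEL and the mode values are ASSUMED here; verify them
    # against your kernel's <linux/prctl.h> before trusting any numbers.
    import ctypes, time

    libc = ctypes.CDLL(None, use_errno=True)
    PR_SET_MEM_MODEL = 0x6d4d444d        # assumed constant, check headers
    PR_SET_MEM_MODEL_DEFAULT = 0         # assumed: native weak ordering
    PR_SET_MEM_MODEL_TSO = 1             # assumed: x86-style TSO

    def set_tso(enabled):
        mode = PR_SET_MEM_MODEL_TSO if enabled else PR_SET_MEM_MODEL_DEFAULT
        if libc.prctl(PR_SET_MEM_MODEL, mode, 0, 0, 0) != 0:
            raise OSError(ctypes.get_errno(), "prctl failed (kernel without TSO support?)")

    def workload():
        # stand-in for a memory-ordering-sensitive benchmark
        total = 0
        for i in range(10_000_000):
            total += i
        return total

    for tso in (False, True):
        set_tso(tso)
        t0 = time.perf_counter()
        workload()
        print(f"TSO={tso}: {time.perf_counter() - t0:.3f}s")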



This means GPUs are dead for local enthusiast AI. And SoCs with big RAM are in.

Because a model with 17B active parameters should reach decent performance on a 256-bit LPDDR5X bus.
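Rough math behind that, assuming LPDDR5X-8533 and an 8-bit quant (both my assumptions):

    # 256-bit bus at 8533 MT/s = 32 bytes * 8.533 GT/s ~= 273 GB/s.
    bandwidth_gb_s = 32 * 8.533          # ~273 GB/s
    active_gb = 17 * 1.0                 # 17B active params at 8-bit ~= 17 GB/token
    print(bandwidth_gb_s / active_gb)    # ~16 tok/s; roughly double at 4-bit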


This has been the case for a while now. 3090 hoarders were always just doing it for street cred or whatever; no way are these guys computing anything of actual value.

Tenstorrent is on fire, though. For small businesses this is what matters. If 10M context is not a scam, I think we'll see SmartNIC adoption real soon. I would literally long AMD now, because their Xilinx people are probably going to own the space real soon. Infiniband is cool and all, but it's also stupid and their scale-out strategy is non-existent. This is why https://github.com/deepseek-ai/3FS came out, but of course nobody had figured it out, because they still think LLMs is like, chatbots, or something. I think we're getting to a point where it's a scheduling problem, basically. So you get lots of GDDR6 (HBM doesn't matter anymore) as L0, DDR5 as L1, and NVMe-oF as L2. Most of the time the agents will be running the code anyway...
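To sketch what I mean by the tiering (a toy model; the tier names and capacities are made up):

    # Toy sketch of a tiered KV-cache: look a block up across tiers,
    # promote on hit, and let evictions cascade downward. Illustrative only.
    from collections import OrderedDict

    class Tier:
        def __init__(self, name, capacity):
            self.name, self.capacity = name, capacity
            self.blocks = OrderedDict()              # insertion order ~= LRU

    tiers = [Tier("GDDR6 (L0)", 4), Tier("DDR5 (L1)", 16), Tier("NVMe-oF (L2)", 1024)]

    def put(key, block, level=0):
        t = tiers[level]
        t.blocks[key] = block
        if len(t.blocks) > t.capacity and level + 1 < len(tiers):
            victim, vblock = t.blocks.popitem(last=False)   # demote the coldest
            put(victim, vblock, level + 1)

    def get(key):
        for t in tiers:
            if key in t.blocks:
                block = t.blocks.pop(key)
                put(key, block)                      # promote toward L0
                return block
        return None                                  # miss: recompute or fetch remotely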

This is also why Google never really subscribed to "function calling" APIs.


I was going to buy my first GPU for DL in 2018, but crypto didn't make it easy. I waited for prices to fall, but demand kept up; then COVID happened, then LLMs happened, and used GPUs now cost more than their original new prices... as we can see from Nvidia's paper launches, the lack of competition, and 5000-series prices easily 50% above original MSRP. Demand is still here, and now we have tariffs... Folks have reasons to collect, hoard, or do whatever you think they are doing, even if it's just for street cred.


Tenstorrent


Not a hoarder per se, but I bought a 24GB card on the secondary market. My privacy is valuable. I'm okay being a half-step or full step behind in LLMs or image diffusion if it means my data never leaves my machine.


If you were really serious about privacy, you wouldn't put yourself at a disadvantage with a locked-down, six-year-old card. Tenstorrent Blackhole exists now, btw.


Be serious now. Plenty of useful, privacy-sensitive queries can be run with 24GB of VRAM, especially given the existence of e.g. Gemma 3 27B and the heavy NVIDIA-targeted optimisation work that has occurred.
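The memory math checks out, assuming a 4-bit quant (rough, illustrative figures):

    # Why a 27B model fits in 24 GB of VRAM at 4-bit. Rough numbers.
    weights_gb = 27e9 * 0.5 / 1e9        # ~13.5 GB of weights
    overhead_gb = 3.0                    # assumed: KV cache + activations
    print(weights_gb + overhead_gb)      # ~16.5 GB -> fits with headroom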

The Tenstorrent cards exist, but availability is low and the software is comparatively nonexistent. I'm excited for them too, but at the end of the day I can buy a used 3090 today and do useful work with it; the same is not yet true of TT.


I think it's disingenuous to suggest they're putting themselves at a disadvantage with an RTX 3090, especially in comparison to an inferior product that isn't even shipping yet.

RTX 3090: 24GB RAM, 936.2GB/s bandwidth

Tenstorrent p150a: 32GB RAM, 512GB/s bandwidth

An extra 8GB of RAM isn't worth nearly halving the memory bandwidth.
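Rough decode ceilings, assuming a model that fits on both cards (the model size is illustrative):

    # Batch-1 decode is bandwidth-bound, so if a ~13.5 GB model (e.g. a
    # 27B at 4-bit) fits on both cards, the ceiling tracks bandwidth:
    model_gb = 13.5
    print(936.2 / model_gb)   # RTX 3090: ~69 tok/s ceiling
    print(512.0 / model_gb)   # p150a:    ~38 tok/s ceiling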


> inferior product

The Tenstorrent p300 is coming with 64 GB and 1 TB/s, but that's not the point; even the p150a has plenty of bandwidth (512 GB/s is fine for inference) and four 800G ports. But hardware is not the problem: even if they had the hardware, they wouldn't know what to do with it. Privacy is a hobby for most people; it's about feeling good.



I might be filing for bankruptcy soon, so I'm definitely stuck with what I've got.


> Infiniband is cool and all, but it's also stupid and their scale-out strategy is non-existent.

god I love this website.


Keyword: compute-in-network


Not sure what you’re suggesting. I’m well aware that things like SHARP exist.


Overallocation has a limit. You only have so much RAM/storage; beyond that you start swapping. I could really use a hash table (or similar structure) that degrades less at high occupancy.
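Robin Hood probing is one sketch of the kind of structure I mean: it equalizes probe distances, so lookups stay short even near full occupancy (toy code, no resizing):

    # Toy Robin Hood hash table: on collision, the entry that is further
    # from its home bucket wins the slot, which keeps probe lengths flat
    # as occupancy climbs. A real table would also resize before filling.
    class RobinHood:
        def __init__(self, size=16):
            self.slots = [None] * size           # each slot: (key, value, dist)

        def _home(self, key):
            return hash(key) % len(self.slots)

        def put(self, key, value):
            i, (k, v, d) = self._home(key), (key, value, 0)
            while True:
                cur = self.slots[i]
                if cur is None:
                    self.slots[i] = (k, v, d)
                    return
                if cur[0] == k:
                    self.slots[i] = (k, v, cur[2])
                    return
                if cur[2] < d:                   # steal from the "richer" entry
                    self.slots[i], (k, v, d) = (k, v, d), cur
                i, d = (i + 1) % len(self.slots), d + 1

        def get(self, key):
            i, d = self._home(key), 0
            while self.slots[i] is not None and self.slots[i][2] >= d:
                if self.slots[i][0] == key:
                    return self.slots[i][1]
                i, d = (i + 1) % len(self.slots), d + 1
            return None

    t = RobinHood()
    t.put("a", 1); t.put("b", 2)
    print(t.get("a"), t.get("b"))                # 1 2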


For other platforms there is already https://github.com/nviennot/core-to-core-latency

This project works around a macOS limitation, namely the lack of thread pinning, and makes the measurement possible without Asahi Linux and without custom macOS kernel extensions or disabled security features.


Did they raise $6B, or are they valued at $6B without disclosing the amount raised? Probably the latter?


If that was the intent, an animation that doesn't literally crush things would have worked much better. Let it all fall into a black hole and have an iPad emerge, or whatever. The dramatic effect of a hydraulic press adds nothing positive.

