The 4x comes from the neural accelerators (tensor cores, in NVIDIA jargon). It's 4x fp16 over the vector path (and 8x compared to M1, because at some point Apple doubled the fp16 vector path). So LLM prefill (context processing / TTFT), diffusion models (image gen), and e.g. video and photo effects that use them can be up to 4x faster. At fp16 that's the same speed at the same clock as NVIDIA, but NVIDIA additionally has 2x fp8 and 4x nvfp4.

Batch-1 token generation, the number that's usually quoted, does not benefit from this. It's purely RAM-bandwidth-limited.
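To put numbers on that (a back-of-envelope sketch; the figures are illustrative, not benchmarks): every generated token has to stream all active weights out of RAM once, so bandwidth sets the ceiling.

    # Back-of-envelope: batch-1 decode reads every active weight once per
    # token, so memory bandwidth sets the ceiling. Illustrative numbers only.
    def decode_tok_per_s(active_params_billions, bytes_per_param, bandwidth_gb_s):
        bytes_per_token = active_params_billions * 1e9 * bytes_per_param
        return bandwidth_gb_s * 1e9 / bytes_per_token

    print(decode_tok_per_s(7, 2, 400))    # 7B model, fp16, 400 GB/s -> ~29 tok/s
    print(decode_tok_per_s(7, 0.5, 400))  # same model at 4-bit -> ~114 tok/s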


AFAIA, emulators like FEX run within 30-70% of native performance; at the fringes it can be worse or better, but overall emulation seems totally fine. Plus, emulation technology could in general be used for binary optimization rather than strict instruction mapping, opening up room for further gains.


Some companies like to stress the efficiency or performance of Arm SoCs, but really this is a hedge against increasingly expensive x86 hardware. AMD has raised prices of mobile SoCs radically in recent years. I'm looking forward to more affordable SoC options for laptops, handhelds, and desktops, perhaps from MediaTek or other lower-cost vendors.

The history of the PC is one of commoditization. A fractured, multi-polar landscape is detrimental to the ecosystem and to productivity, and should ultimately fail.

x86 emulation is an important puzzle piece, and I'm happy Valve recognizes this and sponsors it.


This is for the Steam Frame


That's what it seems like. Some people here disagree with you, but I can add anecdata that my employer insisted I do no coding on such a visa.


Even team members visiting from Canada were told very explicitly not to say they're coming to the US to "work" but rather for a business trip. In practice, of course, everyone does some amount of actual work, from checking email/Slack to whiteboarding designs. If even those aren't allowed, I don't see how any in-person team meetings could be conducted.


Indeed, the memory model has a decent impact. Unfortunately it's difficult to isolate in measurements; only Apple ships support for both weak memory ordering and TSO in the same hardware.


Oh, there’s an interesting idea. Given that Linux runs on M1 and M2 Macs, would it be possible to do some kind of benchmark there, where you could turn TSO on and off at will for your test program?
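Something like this, maybe, assuming the prctl()-based memory-model switch from the Asahi Linux patches. The constant values below are guesses on my part and need to be checked against your kernel headers.

    # Hypothetical sketch: per-thread TSO toggle via prctl() on Asahi Linux.
    # PR_SET_MEM_MODEL and the mode values are ASSUMED here; verify them
    # against your kernel's <linux/prctl.h> before trusting any numbers.
    import ctypes, time

    libc = ctypes.CDLL(None, use_errno=True)
    PR_SET_MEM_MODEL = 0x6d4d444d        # assumed constant, check headers
    PR_SET_MEM_MODEL_DEFAULT = 0         # assumed: native weak ordering
    PR_SET_MEM_MODEL_TSO = 1             # assumed: x86-style TSO

    def set_tso(enabled):
        mode = PR_SET_MEM_MODEL_TSO if enabled else PR_SET_MEM_MODEL_DEFAULT
        if libc.prctl(PR_SET_MEM_MODEL, mode, 0, 0, 0) != 0:
            raise OSError(ctypes.get_errno(), "prctl failed (kernel without TSO support?)")

    def workload():
        # stand-in for a memory-ordering-sensitive benchmark
        total = 0
        for i in range(10_000_000):
            total += i
        return total

    for tso in (False, True):
        set_tso(tso)
        t0 = time.perf_counter()
        workload()
        print(f"TSO={tso}: {time.perf_counter() - t0:.3f}s")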



This means GPUs are dead for local enthusiast AI. And SoCs with big RAM are in.

Because a model with 17B active parameters should reach decent performance on a 256-bit LPDDR5X bus.
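Rough math behind that, assuming LPDDR5X-8533 and an 8-bit quant (both my assumptions):

    # 256-bit bus at 8533 MT/s = 32 bytes * 8.533 GT/s ~= 273 GB/s.
    bandwidth_gb_s = 32 * 8.533          # ~273 GB/s
    active_gb = 17 * 1.0                 # 17B active params at 8-bit ~= 17 GB/token
    print(bandwidth_gb_s / active_gb)    # ~16 tok/s; roughly double at 4-bit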


This has been the case for a while now. 3090 hoarders were always just doing it for street cred or whatever; no way are these guys computing anything of actual value.

Tenstorrent is on fire, though. For small businesses this is what matters. If 10M context is not a scam, I think we'll see SmartNIC adoption real soon. I would literally long AMD now, because their Xilinx people are probably going to own the space real soon. Infiniband is cool and all, but it's also stupid and their scale-out strategy is non-existent. This is why https://github.com/deepseek-ai/3FS came out, but of course nobody had figured it out, because they still think LLMs is like, chatbots, or something. I think we're getting to a point where it's a scheduling problem, basically. So you get lots of GDDR6 (HBM doesn't matter anymore) as L0, DDR5 as L1, and NVMe-oF as L2. Most of the time the agents will be running the code anyway...
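To sketch what I mean by the tiering (a toy model; the tier names and capacities are made up):

    # Toy sketch of a tiered KV-cache: look a block up across tiers,
    # promote on hit, and let evictions cascade downward. Illustrative only.
    from collections import OrderedDict

    class Tier:
        def __init__(self, name, capacity):
            self.name, self.capacity = name, capacity
            self.blocks = OrderedDict()              # insertion order ~= LRU

    tiers = [Tier("GDDR6 (L0)", 4), Tier("DDR5 (L1)", 16), Tier("NVMe-oF (L2)", 1024)]

    def put(key, block, level=0):
        t = tiers[level]
        t.blocks[key] = block
        if len(t.blocks) > t.capacity and level + 1 < len(tiers):
            victim, vblock = t.blocks.popitem(last=False)   # demote the coldest
            put(victim, vblock, level + 1)

    def get(key):
        for t in tiers:
            if key in t.blocks:
                block = t.blocks.pop(key)
                put(key, block)                      # promote toward L0
                return block
        return None                                  # miss: recompute or fetch remotely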

This is also why Google never really subscribed to "function calling" APIs.


I was going to buy my first GPU for DL in 2018, but crypto didn't make it easy. I waited for prices to fall, but demand kept up; then COVID happened, then LLMs happened, and used GPUs now cost more than their original new prices... as we can see from Nvidia's paper launches, the lack of competition, and 5000-series prices easily 50% above original MSRP. Demand is still here, and now we have tariffs... Folks have reasons to collect, hoard, or do whatever you think they are doing, even if it's just for street cred.


Tenstorrent


Not a hoarder per se, but I bought a 24GB card on the secondary market. My privacy is valuable. I'm okay being a half-step or full step behind in LLMs or image diffusion if it means my data never leaves my machine.


If you were really serious about privacy, you wouldn't put yourself at a disadvantage with a locked-down, six-year-old card. Tenstorrent Blackhole exists now, btw.


Be serious now. Plenty of useful, privacy-sensitive queries can be run with 24GB of VRAM, especially given the existence of e.g. Gemma 3 27B and the heavy NVIDIA-targeted optimisation work that has occurred.
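The memory math checks out, assuming a 4-bit quant (rough, illustrative figures):

    # Why a 27B model fits in 24 GB of VRAM at 4-bit. Rough numbers.
    weights_gb = 27e9 * 0.5 / 1e9        # ~13.5 GB of weights
    overhead_gb = 3.0                    # assumed: KV cache + activations
    print(weights_gb + overhead_gb)      # ~16.5 GB -> fits with headroom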

The Tenstorrent cards exist, but availability is low and the software is comparatively nonexistent. I'm excited for them too, but at the end of the day I can buy a used 3090 today and do useful work with it; the same is not yet true of TT.


I think it's disingenuous to suggest they're putting themselves at a disadvantage with an RTX 3090, especially in comparison to an inferior product that isn't even shipping yet.

RTX 3090: 24GB RAM, 936.2GB/s bandwidth

Tenstorrent p150a: 32GB RAM, 512GB/s bandwidth

An extra 8GB of RAM isn't worth nearly halving the memory bandwidth.
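Rough decode ceilings, assuming a model that fits on both cards (the model size is illustrative):

    # Batch-1 decode is bandwidth-bound, so if a ~13.5 GB model (e.g. a
    # 27B at 4-bit) fits on both cards, the ceiling tracks bandwidth:
    model_gb = 13.5
    print(936.2 / model_gb)   # RTX 3090: ~69 tok/s ceiling
    print(512.0 / model_gb)   # p150a:    ~38 tok/s ceiling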


> inferior product

The Tenstorrent p300 is coming with 64 GB and 1 TB/s, but that's not the point; even the p150a has plenty of bandwidth (512 GB/s is fine for inference) and four 800G ports. But hardware is not the problem: even if they had the hardware, they wouldn't know what to do with it. Privacy is a hobby for most people; it's about feeling good.



I might be filing for bankruptcy soon, so I'm definitely stuck with what I've got.


> Infiniband is cool and all, but it's also stupid and their scale-out strategy is non-existent.

god I love this website.


Keyword: compute-in-network


Not sure what you’re suggesting. I’m well aware that things like SHARP exist.


Overallocation has a limit. You only have so much RAM/storage; beyond that you start swapping. I could really use a hash table (or similar structure) that degrades less at high occupancy.
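Robin Hood probing is one sketch of the kind of structure I mean: it equalizes probe distances, so lookups stay short even near full occupancy (toy code, no resizing):

    # Toy Robin Hood hash table: on collision, the entry that is further
    # from its home bucket wins the slot, which keeps probe lengths flat
    # as occupancy climbs. A real table would also resize before filling.
    class RobinHood:
        def __init__(self, size=16):
            self.slots = [None] * size           # each slot: (key, value, dist)

        def _home(self, key):
            return hash(key) % len(self.slots)

        def put(self, key, value):
            i, (k, v, d) = self._home(key), (key, value, 0)
            while True:
                cur = self.slots[i]
                if cur is None:
                    self.slots[i] = (k, v, d)
                    return
                if cur[0] == k:
                    self.slots[i] = (k, v, cur[2])
                    return
                if cur[2] < d:                   # steal from the "richer" entry
                    self.slots[i], (k, v, d) = (k, v, d), cur
                i, d = (i + 1) % len(self.slots), d + 1

        def get(self, key):
            i, d = self._home(key), 0
            while self.slots[i] is not None and self.slots[i][2] >= d:
                if self.slots[i][0] == key:
                    return self.slots[i][1]
                i, d = (i + 1) % len(self.slots), d + 1
            return None

    t = RobinHood()
    t.put("a", 1); t.put("b", 2)
    print(t.get("a"), t.get("b"))                # 1 2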


For other platforms there is already https://github.com/nviennot/core-to-core-latency

This project works around a macOS limitation, namely the lack of thread pinning, and makes the measurement possible without Asahi Linux and without custom macOS kernel extensions or disabled security features.


Did they raise $6B, or are they valued at $6B without disclosing the amount raised? Probably the latter?


If that was the intent, an animation that doesn't literally crush things would have worked much better. Let it all fall into a black hole and have an iPad emerge, or whatever. The dramatic effect of a hydraulic press adds nothing positive.

