Hacker News | selcuka's comments

Reminded me of the XKCD [1] that points out the problem with average scores.

[1] https://xkcd.com/937/


In general you can't without patching the app itself, statically or at runtime using something like Frida.

This is even worse. My Claude Code instance can theoretically write the same code as your instance for a similar prompt. Why should one of us be able to have the copyright?

No, they are different entities. Having the same founder does not mean much in this context. You signed a contract with 23andme, not Regeneron Pharmaceuticals.

Interesting. Qwen 3.5 0.8B failed the test for me.

Any citations? Because that was my impression, too. I want frontier model performance for my coding assistant, but "most users" could do with smaller/faster models.

ChatGPT free falls back to GPT-5.2 Mini after a few interactions.


Have you used GPT instant or mini yourself? I think it’s pretty cynical to assume that this is “good enough for most people”, even if they don’t know the difference between that and better models.

> I think it’s pretty cynical to assume that this is “good enough for most people”

It's a deduction, not an assumption. Obviously it's "good enough" for "most people". Otherwise nobody would be using the free version of ChatGPT today.

I pay for a Claude subscription, but even then I sometimes downgrade to Sonnet or even Haiku when I need a quick answer.


> Obviously it's "good enough" for "most people". Otherwise nobody would be using the free version of ChatGPT today.

I'd say it's better than nothing, which to me is not the same thing at all as "good enough".

For example, I believe most people would be better off with half the allowable queries per day, routed to a better model, but that's not an available product.


Say more. Why do you think this?

They're awful and hallucinate a lot; I couldn't imagine using them even for prompts about TV shows, much less for serious work. Repeating the question from the parent: have you tried them yourself? Even compared to ChatGPT Thinking, they're little short of useless.

They're essentially replying based on vibes, instead of grounding their responses in extensive web searches, which is what the paid models/configurations generally do. This makes them wrong more often than they're right for anything but the most trivial requests that can be easily responded to out of memorized training data.

This is all on top of the (to me) insufferable tone of the non-thinking models, but that might well be how most users prefer to be talked to, and whether that's how these models should accordingly talk is a much more nuanced question.

Regardless of that, everybody deserves correct answers, even users on the free tier. If this makes the free tier uneconomical to serve for hours on end per user per day, then I'd much rather they limit the number of turns than dial down the quality like that.


Frontier models have much better knowledge and they usually hallucinate less. It's not about the coding capabilities; it's about how much you can trust the model.

re: trust-

Have you tried the free version of ChatGPT? It is positively appalling. It’s like GPT 3.5 but prompted to write three times as much as necessary to seem useful. I wonder how many people have embarrassed themselves, lost their jobs, and been critically misinformed. All easy with state-of-the-art models but seemingly a guarantee with the bottom sub-slop tier.

Is the average person just talking to it about their day or something?


Even the paid version of ChatGPT tends to use 1000 words when 10 will do.

You can try asking it the same question as Claude and compare the answers. I can guarantee you that the ChatGPT answer won't fit on a single screen on a 32" 4k monitor.

Claude's will.


I use the free version of ChatGPT (without logging in) when I need some one-off question without a huge context. Real world prompt:

  "when hostapd initializes 80211 iface over nl80211, what attributes correspond to selected standard version like ax or be?"
It works fine and avoids falling into the trap set by the misleading question. Probably works even better for more popular technologies. Yeah, it has a higher failure rate, but it's not a dealbreaker for non-autonomous use cases.

If someone blindly submits chatbot output they deserve to be embarrassed and fired. But I don't think that's going to improve.

The free version of ChatGPT is insanely crippled, so that's not surprising.


I was once asked to migrate a Microsoft Access application to C#/MS SQL Server because it was too slow. I just added a few database indexes to make it an order of magnitude faster.

(They still wanted to go ahead with the migration, but that's a different story.)
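The kind of fix described above is easy to demonstrate in a standalone sketch (using SQLite here rather than Access or SQL Server; the `orders` table, column names, and row counts are all made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 500, i * 0.5) for i in range(10_000)],
)

query = "SELECT * FROM orders WHERE customer_id = 42"

# Without an index, the filter is a full table scan.
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()[0][-1]

# A single index turns it into a logarithmic-time lookup.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()[0][-1]

print(before)  # a SCAN over the whole table
print(after)   # a SEARCH using idx_orders_customer
```

The query plan flips from a scan to an index search with one DDL statement, which is the "order of magnitude faster" effect for far less effort than a platform migration.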


> They still wanted to go ahead with the migration, but that's a different story.

Yeah I would too lol. During Covid I found myself in the odd situation of developing a new Access DB product and man was it miserable.


Aren't LLMs commodity products these days? It's the same thing as running this on a $7 VPS that you don't "own".

I don't think switching to a different provider, or running an open one locally would affect the response quality that much.


The LLM is the key element here, not the $7 VPS... The model itself cost billions of dollars to train, and if the service shuts down or is interrupted for some reason, your fancy setup breaks like nothing.

> The model itself has cost billions of dollars to train

But that has nothing to do with this use case, right? By the same logic, millions of man-hours went into Linux, but we can use it for free on a $7 VPS.

> service shuts down or is interrupted for some reason your fancy setup breaks like nothing

No, it doesn't. That's what I meant by commodity. You can switch to another service and it will work just fine (unless you meant that all LLM providers might cease to exist).

Also note that they have a $2/day API usage cap, meaning that they are willing to spend $60+/month for the LLM use. If everything else fails, they can use those funds to upgrade the VPS and run a local model on their own hardware. It won't be Sonnet-4.6-level, but it will do. It just doesn't make sense with current dollar-per-token prices.


> millions of man-hours went into Linux, but we can use it for free on a $7 VPS.

Bad analogy. I don't need an API to run Linux.


> The LLM is the key element here

No, the key (novel) element here is the two-tiered approach to sandboxing and inter-agent communication. That’s why he spends most of the post talking about it and only a few sentences on which models he selected.


It's a race to the bottom. DeepSeek beats all the others (single-shot), and it is ~50% cheaper than even the electricity-only cost of running locally.

> DeepSeek V3.2 Reasoning 86.2% ~$0.002 API, single-shot

> ATLAS V3 (pass@1-v(k=3)) 74.6% ~$0.004 Local electricity only, best-of-3 + repair pipeline


You could use this approach with DeepSeek as well. The innovation here is that you can generate a bunch of solutions, use a small model to pick promising candidates and then test them. Then you feed errors back to the generator model and iterate. In a way, it's sort of like a genetic algorithm that converges on a solution.
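The loop described above can be sketched roughly. This is a toy stand-in, not the actual pipeline: `generate` plays the generator model (here it just draws from a fixed pool of tiny expressions instead of calling an LLM), `shortlist` plays the small candidate-picking model, and the "unit tests" are hard-coded input/output pairs:

```python
import random

random.seed(1)

# Toy "unit tests": we want a function f with f(x) == x * x.
TESTS = [(2, 4), (3, 9), (5, 25)]

def generate(feedback, n=8):
    """Stand-in for the generator model: propose n candidate programs.
    A real system would condition on the feedback; this toy ignores it."""
    ops = ["x + x", "x * x", "x ** 3", "x * 2", "x * x + 1"]
    return [f"lambda x: {random.choice(ops)}" for _ in range(n)]

def shortlist(candidates, k=4):
    """Stand-in for the small filter model: cheaply keep k promising
    candidates (here: just deduplicate and truncate)."""
    return list(dict.fromkeys(candidates))[:k]

def run_tests(src):
    """Return the failing (input, expected, got) triples for a candidate."""
    f = eval(src)
    return [(x, want, f(x)) for x, want in TESTS if f(x) != want]

def solve(rounds=40):
    feedback = None
    for _ in range(rounds):
        for cand in shortlist(generate(feedback)):
            failures = run_tests(cand)
            if not failures:
                return cand  # all tests pass: converged
            feedback = failures  # feed errors back to the generator
    return None

print(solve())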

Indeed but:

1) That is relatively very slow.

2) Can also be done, simpler even, with SoTA models over API.


Right, this works with any models. To me, the most interesting part is that you can use a smaller model that you could run locally to get results comparable to SoTA models. Ultimately, I'd far prefer running local, even if slower, for the simple reason of having sovereignty over my data.

Being reliant on a service means you have to share whatever you're working on with the service, and the service provider decides what you can do, and make changes to their terms of service on a whim.

If locally running models can get to the point where they can be used as a daily driver, that solves the problem.


Why do you need a small model to pick promising candidates? Why not a bigger one?

(And ideally you'd probably test first, or at least try to feed compiler errors back etc?)

Overall, I mostly agree.


Mostly an issue of speed and resource usage: if the model is too big, then simply running the tests will be cheaper.

I will "suffer" through $0.004 of electricity if I can run it on my own computer.

I've tested many open models; DeepSeek 3.2 is the only one comparable to SOTA.

> cheaper than the cost of local electricity only.

Can you explain what that means?


I think they mean that the DeepSeek API charges are less than it would cost for the electricity to run a local model.

Local model enthusiasts often assume that running locally is more energy efficient than running in a data center, but fail to take the economies of scale into account.


> Local model enthusiasts often assume that running locally is more energy efficient than running in a data center,

It is a well-known 101 truism in /r/Localllama that local is rarely cheaper - unless run batched, in which case it is indeed massively (10x) cheaper.

> I think they mean that the DeepSeek API charges are less than it would cost for the electricity to run a local model.

Because it is hosted in China, where energy is cheap. In the ex-USSR, where I live, it is inexpensive too, and keeping in mind that all winter I had to use a small space heater due to the inadequacy of my central heating, running local came out 100% free.


I guess it mostly comes from using the model with batch size 1 locally vs. a high batch size in a DC, since GPU power consumption doesn't grow that much with batch size.

Note that while a local chatbot user will mostly be using batch-size = 1, it's not going to be true if they are running an agentic framework, so the gap is going to narrow or even reverse.
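A back-of-the-envelope sketch of that batching effect (every number here is an illustrative assumption, not a measurement):

```python
# Illustrative assumptions: a GPU drawing ~0.4 kW roughly independent of
# batch size, while token throughput scales strongly with batch size.
GPU_KW = 0.4
PRICE_PER_KWH = 0.15  # USD; varies widely by region, as the thread notes

def electricity_cost_per_m_tokens(tokens_per_sec):
    """Electricity cost of generating one million tokens at a given rate."""
    hours = 1_000_000 / tokens_per_sec / 3600
    return GPU_KW * hours * PRICE_PER_KWH

single = electricity_cost_per_m_tokens(30)        # batch size 1: ~30 tok/s
batched = electricity_cost_per_m_tokens(30 * 40)  # batched: ~40x throughput

print(f"batch=1: ${single:.3f} per million tokens")
print(f"batched: ${batched:.4f} per million tokens")
```

With these made-up numbers the per-token electricity cost falls by the same ~40x factor as the throughput gain, which is the gap that batched data-center serving exploits over a single local user.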


Well, different parts of the world also have different electricity prices.

Usually not multiple orders of magnitude difference though.

Local enthusiasts don’t have to fear account banning.

Some of those local model enthusiasts can actually afford solar panels.

You are still incurring a cost if you use the electricity instead of selling it back to the grid.

The extent of that heavily depends on where you are. Where I live in NZ, the grid export rates are very low while the import rates are very high.

Our peak import rate is 3x higher than our solar export rate. In other words, we’d need to sell 3 kWh of energy to offset the cost of using 1 kWh at peak.

We’re currently in the process of accepting a quote for home batteries. The rates here highly incentivise maximising self-use.


Selling it back to the grid is something that is still possible but much, much less of a financially sound proposition than it was a few years ago because of regulatory capture by the utilities. In some places it is so bad that you get penalized for excess power. Local consumption is the fastest way to capitalize on this, more so if you can make money with that excess power.

Luxembourg: Purchase price = 2 x sales price, mostly due to grid costs.

And this is with no income tax or VAT on sold electricity.


Is it economies of scale, or is it unpaid externalities?

It means that the electricity you would have to pay for if you did the computations yourself would be more expensive than paying them to do it. Part of that has to do with the fact that China has cheap electricity, partly due to their massive push into renewables. Part of it is just economies of scale: a big server farm can run more efficiently than your PC on average.

Cheap electricity due to their massive push on non-renewables, rather. There has been no change in the price of electricity during the renewable shift.

Normally you'd expect that more (and cheaper) supply would drive down prices. Classic market logic.

How do you explain that this market logic ceases to exist for renewables only? A whopping ~2 TW, or ~35%, of generated power in China is renewable, and since renewable energy is roughly 1.5 to 4 times cheaper than e.g. coal per kWh produced, that ought to have some impact.

If it has not, I'd be curious to hear your explanation of the mechanism involved.


China's power price has never reflected market costs. But the reason you aren't seeing it drop with so much renewable capacity added is that your numbers are wrong. It's been about 20-25% generated from renewables over the past 5 years, and while a lot of renewable and non-renewable capacity has been added, power demand has risen to match. Looking at current demand, it's effectively infinite: whatever is available, modern industry will consume.

When we compare this to the power boom in the 2000s, they were able to build enough energy generation to meet demand and drive the price down from 30 cents rural and 17 cents city to 8 cents for both.


China has cheap electricity.

Well, also, LLM servers get much more efficient with request queue depth > 1 - tokens per second per GPU are massively higher with 100 concurrent requests than with 1 on e.g. vLLM.

Yes, but the hardware they use for inference, like the Huawei Ascend 910C, is less efficient than the Nvidia H100s used in the US, due to the difference in process node.

All those parameters and it still won't answer questions about Tiananmen Square in 1989... :(

It will. The web chat has censorship features, but the model you can download doesn't.
