Hacker News | BoppreH's comments

Undercover mode also pretends to be human, which I'm less ok with:

https://github.com/chatgptprojects/claude-code/blob/642c7f94...


You'll never win this battle, so why waste feelings and energy on it? That's where the internet is headed. There's no magical human verification technology coming to save us.

I can prove all contributions to stagex are by humans because we all belong to a 25-year-old web of trust with 5444 endorser keys, including most Red Hat, Debian, Ubuntu, and Fedora maintainers. All of our own maintainer keys are on smartcards we tap to sign every review and commit, and we do background checks on every new maintainer.

I am completely serious. We have always had a working proof-of-human system called the Web of Trust, and while everyone loves to hate on PGP (in spite of it using modern ECC crypto these days), it is the only widely deployed spec that solves this problem.

https://kron.fi/en/posts/stagex-web-of-trust/
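The transitive trust being described can be sketched as a reachability check over endorsement signatures. A toy model with hypothetical key names, not stagex's actual graph:

```python
from collections import deque

def is_trusted(key, roots, endorsements, max_hops=3):
    """Return True if `key` is reachable from a trusted root
    within `max_hops` endorsement signatures."""
    queue = deque((root, 0) for root in roots)
    seen = set(roots)
    while queue:
        current, hops = queue.popleft()
        if current == key:
            return True
        if hops == max_hops:
            continue  # too far from any root to count
        for endorsed in endorsements.get(current, ()):
            if endorsed not in seen:
                seen.add(endorsed)
                queue.append((endorsed, hops + 1))
    return False

# Hypothetical endorsement graph: signer -> keys they signed in person.
endorsements = {
    "alice": ["bob"],
    "bob": ["carol"],
    "carol": ["dave"],
}
print(is_trusted("carol", {"alice"}, endorsements))    # True: two hops
print(is_trusted("mallory", {"alice"}, endorsements))  # False: never endorsed
```

Real PGP trust models (e.g. GnuPG's marginal/full ownertrust levels) are more nuanced, but the incentive argument rests on this reachability property: a swarm of bot identities has no path from any trusted root.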


You can prove the commits were signed by a key you once verified. It is your trust in those people which allows you to extend that to “no LLM” usage, but that’s reframing the conversation as one of trust, not human / machine. Which is (charitably) GPs point: stop framing this as machine vs human — assume (“accept”) that all text can be produced by machines and go from there: what now? That’s where your proposal is one solution: strict web of trust. It has pros and cons (barrier to entry for legitimate first timers), but it’s a valid proposal.

All that to say “you’re not disagreeing with the person you’re replying to” lol xD


I can prove that code was signed by a key that was verified to belong to a single human body by lots of in-person high reputation humans.

How the code was authored, who cares, but I can prove it had multiple explicit cryptographic human signoffs before merge, and that is what matters in terms of quality control and supply chain attack resistance.


Exactly. So in the words of the comment you replied to: why are we wasting energy on worrying about Claude code impersonating humans? We have that solution you proposed.

That’s what I mean by “you agree with the person to whom you replied”


I suppose you are correct. I am agreeing that if one widely deployed the defense tactics projects like stagex use, then asshats using things like undercover will not be trusted.

Unfortunately, outside of classic Linux packaging platforms, useful web-of-trust signing is very rare, so I expect the popularity of things like undercover mode is going to make everything a lot worse before it gets better.


Your last point, I think, is why so many sibling commenters are balking at GP :)

Can't you just instruct Claude Code to use your signing keys? I understand you may say "I won't." But my point is that someone can.

The people who signed my keys trust me to be an honest human actor who chose this as the singular identity for the human body they met in person.

I -could- burn my 16+ years of reputation by letting a bot start signing commits as me, and I could also set my house on fire. I have a very strong incentive not to do so: my aggregate trust is very expensive, and the humans who signed my key would be unlikely to sign a second one if I ruined the reputation of my first.

This incentive structure is why web of trust actually works pretty well, and is the best "proof of human" we are likely ever going to have while respecting privacy and anonymity for those that need it.


You can only prove that all contributions are pushed by those humans, and you can quite explicitly/clearly not prove that those humans didn't use any AI prior to pushing.

I absolutely do not care what autocomplete tools someone used. Only that they as humans own and sign what is submitted so it is attached to their very expensive reputations they do not want to lose.

That’s great, and I also don’t care. But I think all people are saying is that by most definitions you cannot “prove all contributions to stagex are by humans”.

Or are you saying you can prove that aliens and cats didn’t make them? Because I’m not sure that’s true either.

And once you find out someone has trained their dog to commit something, how exactly do you revoke your trust?

I think if you answer these questions you’ll see pretty quickly why this solution isn’t the silver bullet you think it is.

Edit: stagex looks really, really good


It is not a silver bullet by itself, but when combined with the other tactics in stagex I believe it gives us a very strong supply chain attack defense.

I cannot prove the tools used, but I can prove multiple humans signed off on the code with keys they stake their personal reputations on, keys I have confirmed they maintain on smartcards.

While nothing involving humans is perfect I feel it is best effort with existing tools and standards and makes us one of the hardest projects to deploy a successful supply chain attack on today.

Edit: Saw your edit. Thanks!


With 5400+ people I am betting that you have at least one person in your 'web of trust' that no longer deserves that trust.

That's one of the intrinsic problems with webs of trust (and with democracy...), you extend your trust but it does not automatically revoke when the person can no longer be trusted.


Of course! There are always edge cases, but I would suspect the number of bots signed by reputable keys to be near 0%, and the honest human score in this trust graph to be well over 90%.

Compare to how much we should trust any random unsigned key signing commits, or unsigned commits, in which the trust should be 0% unless you have reviewed the code yourself.


The problem is all it really takes is one edge case to successfully break a web of trust to the point that the web of trust becomes a blind spot. Instead of distrusting everybody (which should be the default) the web of trust attempts to create a 'walled garden of trust' and behind that wall everybody can be friendly. That gives a successful attacker a massive advantage.

If we were talking about any linux distribution before stagex, I would agree with you.

Stagex, however, expects that at least one maintainer may at any time engage in reputation-ending dishonesty, or simply be threatened or coerced. This is why every single release is signed by a -quorum- of code reviewers and code reproducers that must all build locally and get identical hashes, so no single point of failure exists in our trust graph.

Our last release was signed by four geodistributed maintainers that all attest to having built the entire distribution from 180 bytes of machine code all the way up with the same hashes.

All of their keys being compromised at once gets beyond the pale.


While I appreciate all of the effort you put in this and respect that you trust this to be bulletproof I'm always going to be skeptical of silver bullets.

Your level of certainty is the thing that frightens me more than the confidence I have in the quality of your work.


I am reasonably confident it is the current industry best effort, and way beyond the status quo, not that it is perfect.

We combine many tactics for defense in depth that I strongly suspect would, if widely deployed, put a stop to the daily supply chain attack headlines.


nothing about this proves anything except that someone or something had access to the key.

Do you think it is likely that the majority of the people that spent decades building this trust graph and gaining the trust needed to be release engineers on the packages that power the whole internet are just going to hand off control of that key to a bot?

Anyone doing so would be setting their professional reputations completely on fire, and burning your in-person-built web of trust is a once in a lifetime thing.

Basically, we trust the keys belong to humans and are controlled by humans because to do otherwise would be a violation of the universally understood trust contract and would thus be reputational bankruptcy that would take years to overcome, if ever.

Even so, we assume at least one maintainer is dishonest at all times, which is why every action needs signatures from two or more maintainers.


Fatalism will also not fix anything. But I suppose death comes for us all, yes? Why do anything at all?

I feel that fatalism, especially when people treat it as some sort of personal philosophy, is kind of lazy.

It requires no effort to say "fuck this, nothing matters anyway", and then justify doing literally nothing.


> I feel that fatalism, especially when people treat it as some sort of personal philosophy, is kind of lazy.

I think a lot of fatalism is fake. It's really someone saying "I like this, and I want you to believe you can't change it so you give up."


It also makes no sense! "Fuck this, it doesn't matter - but I'll happily spend effort communicating that to others, because apparently making others not care about something I don't care about is something I do care about." Wut?!

Well, I say it makes no sense. Alternatively, it makes a lot of sense, and these people actually just wanna destroy everything we hold dear :-(


Perhaps the current societal trajectory is destroying everything that they hold dear.

I mean, just look around you.


Then do something about it. Vote for better politicians. Donate money to causes that you think are important. If you think you can do it better, and this isn't meant to be facetious, run for political office.

Being fatalistic can be a great excuse not to do anything.


>Vote for better politicians.

I cannot. I can only vote for better politicians if they exist. That is without even going into the minefield of what "better" means. My implication is that I have no confidence whatsoever in any current politician in my state.

> Donate money to causes that you think are important.

I have no money.

> If you think you can do it better, and this isn't meant to be facetious, run for political office.

I have no money, no visibility and no connections. Even if I was magically given tons of money, I would still need a strong network to attempt any real change, even without taking into consideration the strong networks already in place preventing it.

Telling random citizens "run for office" is facetious, whether you mean it or not.


> Telling random citizens "run for office" is facetious, whether you mean it or not.

Hard disagree. At least where I live, "random citizens" run for local office and succeed all the time.

Also, complaining that you "have no network" is a you problem, not a system problem. I'm truly sorry if you feel you have no friends, but you'll be better off at least trying to get some (independent of politics). And if that's something you've tried and failed at before, I do feel pity. But I don't think hope is lost for anyone. And even if it were lost, please don't actively spread the misery!


Don't spread the misery?? Wow, fucking thanks.

You are kind of proving my point. You are actively justifying doing literally nothing about what bothers you, and acting indignant and self-righteous about it.

Apathy has a striking number of motivated evangelists!

This is more cultural rather than rational.

This is the only relevant question. And it leads right to the next one which is “what is a good life?”

But humans have a huge bias for action. I think generally doing less is better.


On the other hand, if a dead person can do it better than you can, it's not that much of an accomplishment.

I didn’t mean that you should strive to do as little as possible; rather that if you have two choices, do more or do less, then I would be biased towards doing less. Of course, it's not always a realistic option.

> I think generally doing less is better.

My sedentary lifestyle is responsible for my recurrent cellulitis infections.

Just saying.


You can probably find a million situations where doing less is terrible.

I think first step would be to define for yourself what doing less actually means - it could mean taking a walk instead of chasing dopamine -> doing less but you move more.

But whatever it’s a philosophical question and there aren’t any right or true answers


I got hit by a car while out for a run. Just saying.

I think "adapt or die" is the takeaway.

It's fun to pet the cat. It's not fun to rage against an unstoppable force. Well, maybe it is for some people. But I find people often underestimate the detrimental effects.

> But I suppose death comes for us all, yes? Why do anything at all?

Wrong take. Death comes for us all, yes, so why hold back? Do you want to live forever?


> Do you want to live forever?

Yes, of course. Do you prefer to die? Those are the only two alternatives, and a decision that you don't want one is a decision that you prefer the other.


No, there is no alternative. Everything eventually dies, so you better make peace with it. The only people who believe that they won't die are religious people who believe in an afterlife (which is a preposterous position) and the people who have their heads or whole bodies frozen because they think they are so special that the future will honor their contracts and revive them.

Both of these are bound to lead to the exact same outcome, so it doesn't really matter what you believe, but accepting reality absent proof to the contrary may guide you to wiser decisions while you are alive.


s/make peace with it/make war with it/. To the last breath.

I can think of no concept more horrifying than personal immortality and if you disagree I don't think you've thought about it enough.

I'm sorry to hear that you don't want to exist in the future. I do. I have thought about it extensively, and there is literally no scenario in which I consider not-existing better than existing.

There is an essentially infinite amount of creativity and interesting complexity available in the richness of interactions with other people and the things people create. What, exactly, are you "horrified" about?


The difference between "essentially infinite" and "actually infinite". Infinity is a very long time.

Cringe.

I guess I could just curl up into fetal position and watch the world go by. But that's no fun. Why not dream big and shoot for the moon with kooky goals like, say, having an underground, community-supported internet where things are falling less to shit?

Belief in inevitability is a choice (except for maybe dying, I guess).


Why stop at one? Make more such underground community supported internets. The more the merrier. Monoculture ends with death. The only question is how long it will take.

https://hashbang.sh - "underground" for over 20 years and still running strong.

Amen brother, this one will be for me and all my homies.

IDK. I sort of like the idea that now, instead of dead internet theory being a joke, it'll be a well-known fact that a minority of people are not real and there is no point in engaging... I look forward to Social 3, where people have to meet face to face.

How quickly would that meat-space renaissance spin through our whole cyberpunk heritage, speedrunning the same authentication challenges..?

The cornucopia of gargoyles, living their best life as terminals for the machine.

The strange p-zombies who don't show their gargoyle accessories visibly, but somehow still follow the script.

Eventually the more insidious infiltrators, requiring a real Voight-Kampff test.



Magical human verification technology is called "your own private forum" in conjunction with "invite your friends"

Until your friend writes a bot.

Funny story: when I was younger, I trained a basic deep learning text predictor on all my conversations in a group chat I was in. It was surprisingly good at sounding like me, and sometimes I'd use it to generate text to submit to the chat.


I don't see what the value of this would be. Why would I want to automate talking to my friends? If I'm not interested in talking with them, I could simply not do it. It also carries the risk of not actually knowing what was talked about or said, which could come up in real life and lead to issues. If a "friend" started using a bot to talk to me, they would no longer be considered a friend. That would be the end.

I think you underestimate how many people already run their opinions and responses through LLMs, even if the LLM is not writing them wholesale. Intelligence is part of the social game, so appearing to have it matters to people. Friend groups are just social groups of a certain kind, they're not really removed from all this.

Exactly. Pick friends that do not behave like dicks.

It was for fun, to see if it were possible and whether others could detect they were talking to a bot or not, you know, the hacker ethos and all. It's not meant to be taken seriously although looks like these days people unironically have LLM "friends."

I used to leave a megahal connected to my bouncer when I wasn't around

Even if it is impossible to win, I am still feeling bad about it.

And at this point it is more about how much of the space will remain usable and how much will be bot-controlled wasteland. I'd prefer the spaces important to me to survive.


Feeling bad about something you can’t change is bad for your mental health.

Probably beats being in denial over it and pretending you like it.

And identifying a problem you dislike is a good first step to finding a strategy to solve it, at least in part.


Deciding that you can't change something is the first and last step towards failing to change it.

Which is not a problem if you choose not to worry about it.

"It's uncool to care about things" is, fortunately, not a compelling argument for people who care about things.

This tangent does not seem likely to go anywhere productive.


You can care about things, but it seems preferable to care about that which you can change

That’s a reductionist and wrong reading of the argument I made.

You said "can’t change". I observed that deciding you can't change something is self-fulfilling. Your argument from that point still relied on the assumption that you can't change it.

Before you decide not to care about something, you are supposed to make a deep assessment to see whether you can change it. It is only after you’ve determined that the thing can’t be changed that you can choose not to care about it.

and naming your feelings is the first step toward restoration

It’s certainly winnable with some legislative tweaks. These systems are all designed by humans, we can just change them.

Of course, we’d need a significant change of direction in leadership, but it’s happened many times before. French Revolution seems highly relevant


I think you're underestimating the difficulty, even for exact copies of text (which AI mostly isn't doing).

What sort of Orwellian anti-cheat system would prevent copy and paste from working? What sort of law would mandate that? There are elaborate systems preventing people from copying video but they still have an analog hole.


Human verification technology absolutely exists. Give it some time and people who sell ai today are going to shoehorn it everywhere as the solution to the problem they are busy creating now.

To feel something. To resist something bad. To stand for what is right.

Do those sentiments mean nothing to you?


Well why not head for the front lines of Ukraine? Or Russia, depending on your preference.

This is such an incredibly imbecilic comment.

Listen to this guy: "because you don't take the ultimate risk for what you believe in, you are dumb for suggesting you should do anything whatsoever".

Go away. The world doesn't need your dark resignation.


Wasting your life fighting things that can't be fought is functionally equivalent to dying sooner.


It’s where THIS internet is headed. The future may involve a lot more of them I think.

Technology won’t save us, but that doesn’t mean we shouldn’t be promoting ethics.

Nothing like throwing in the towel before a battle is ever fought. Let's just sigh and wearily march on to our world of AI slop and ever higher bug counts and latency delays while we wait for the five different phone homes and compilations through a billion different LLM's for every silly command.

I am actively building non-magical human verification technology that doesn't require you to upload your retinal scans or ID to billionaires or incompetent outsourcing firms.

Great! Let's do the CAPTCHA test: will I, as a 100% blind user, be able to complete your process?

I think so? Can you use a smartphone?

edit: can't reply, the rate-limiting is such an awful UX


Not parent poster but I am a maintainer of software powering significant portions of the internet and prove my humanity with a 16 year old PGP key with thousands of transitive trust signatures formed through mostly in-person meetings, using IETF standards and keychain smartcards, as is the case for everyone I work with.

But, I do not have an Android or iOS device as I do not use proprietary software, so a smartphone based solution would not work for me.

Why re-invent the wheel? Invest in making PGP easier and keep the decades of trust building going anchoring humans to a web of trust that long predates human-impersonation-capable AI.


So, you haven't thought about it yet? Your counterquestion is insufficient to get any further, because "use" is a very relative term. Whether I can "use" a smartphone depends on the way you code your app. If you use inherently inaccessible UI elements, I can't "use" your app, no matter whether I own a smartphone or not.

I might reply with a similarly useless question: Can you write accessible smartphone apps?


We already have it and we use it to validate the trusted human maintainer involvement behind the linux packages that power the entire internet: PGP Web Of Trust. Still works as designed and I still go to keysigning parties in person.

Say a regular human wanted to join and prove their humanhood status (expanding the web of trust). How would they go about that? What is the theoretical ceiling on the rate of expansion of this implementation?

They need to generate their key, ideally offline, with an offline CA backup and subkeys on a Nitrokey or YubiKey smartcard with the touch requirement enabled for all key operations, for safe workstation use. One can use keyfork on AirgapOS to do this safely, as a once-ever operation.

From there they set up their workstation tools to sign every SSH connection, git push, commit, merge, review, secret decryption, and release signature with their PGP smartcard, which is all very well supported. This offers massive damage control if they get malware on their system, in addition to preventing online impersonation.
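The commit-signing part of that workstation setup is plain git configuration; a sketch, with a placeholder key ID:

```shell
# Tell git which (smartcard-backed) PGP key to sign with; key ID is a placeholder.
git config --global user.signingkey 0xDEADBEEFCAFE1234
# Sign every commit and tag by default.
git config --global commit.gpgsign true
git config --global tag.gpgsign true
# Reviewers can then check any commit's signature:
git verify-commit HEAD
# And show signature status inline in history:
git log --show-signature -1
```

With the key material on a touch-gated smartcard, each of those signatures requires a physical tap, which is exactly the human-in-the-loop property being claimed.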

From there they ideally link it to all their online accounts with keyoxide to make it easy to verify as a single long lived identity, then start seeking out key signing parties locally or at tech conferences, hackerspaces etc.

We run one at CCC most years at the Church Of Cryptography.

Think of it like a long-term digital passport that requires a few signatures by an international set of human notaries before anyone significantly trusts it.

Yes it requires a manual set of human steps anchored to human reputation online and offline, which is a doorway swarms of made up AI bot identities cannot pass through.

Do I expect most humans to do this? Absolutely not. However I consider it _negligent_ for any maintainer of a widely used open source software project to _not_ do this or they risk an impersonator pushing malware to their users.

No idea on theoretical rate of expansion but all the major security conscious classic linux distros mandate this for all maintainers. There are only maybe 20k people on earth that significantly contribute to FOSS internet foundations and Linux distros, so it scales just fine there.

Note: with the exception of stagex, most modern distros like alpine and nix have a yolo wikipedia style trust model, so never ever use those in production.


The technical implementation is the easy part. The hard part is achieving mass voluntary cooperation under adverse incentive schemes.

This is true, but I think there is a sizable (and growing) appetite for human-only spaces.

how does it work?

I'm hoping to do a Show HN soon :)

>There's no magical human verification technology coming to save us.

Except for the one Sam Altman is building.



Scam Altman is not trustworthy. I hope nobody gives him their biometrics. I certainly would never.

Giving your retina scan to one of the main Slop Bros, what could possibly go wrong?

Negative sentiment towards technological destiny detected in human agent.

That's why I stopped brushing my teeth, I can't clean every crevice perfectly so what's the point?

> You'll never win this battle, so why waste feelings and energy on it?

Cool. The attitude of a bully. Thanks for the contribution!


That's an ironic comment.

Hardly. They're clearly trying to influence a group at large using a flawed logical tactic. I'm pointing that out with a device suggesting that the same futility they expect others to adopt should primarily be experienced by them instead.

If pointing out bullies is bullying then you're in a ridiculous mindset.


I assume we're heading to a place where keyboards will all have biometric sensors on every key and measure weight fluctuations in keystrokes, actually.

That’s like having your security on the frontend.

If someone owns the keyboard then they can fake those metrics and tell the server it is happening when it isn’t.

That will be easy to beat.


We already do this, and with the mouse too. It was defeated long ago.

Also unintentionally reveals something:

> Write commit messages as a human developer would — describe only what the code change does.

That's not what a commit message is for, that's what the diff is for. The commit message should explain WHY.

Sadly not doing that likely does indeed make it appear more human...


I wager that "describe only what the code change does" was someone's attempt to invert "don't add the extra crap you often try to write", not some 4d chess instruction that makes claude larp like a human writing a crappy commit message.

Yes, this is a trend I've noticed strongly with Claude code—it really struggles to explain why. Especially in PR descriptions, it has a strong bias to just summarize the commits and not explain at all why the PR exists.

The question "why" is always answered with post-hoc rationalizations. This applies to both LLMs and humans.

No, I think a lot of humans can explain why they're adding a new button to the checkout page, or why they're removing a line from the revenue reconciliation job. There's always a reason a change gets made, or else nobody would be working on it at all :)

Yeah, that was my reaction too. A shame they try to hide themselves, but even worse, the instructions for this "Fake Human" are wrong too!

But will this be released as a feature? For me it seems like it's an Anthropic internal tool to secretly contribute to public repositories to test new models etc.

I don't care who is using it, I don't want LLMs pretending to be humans in public repos. Anthropic just lost some points with me for this one.

EDIT: I just realized this might be used without publishing the changes, for internal evaluation only as you mentioned. That would be a lot better.


A benign use of this mode is developing on their own public repositories.

https://github.com/anthropics/claude-code


> Write commit messages as a human developer would — describe only what the code change does.

The undercover mode prompt was generated using AI.


All these companies use AIs for writing these prompts.

But AIs aren't actually very good at writing prompts, imo. They are superficially good, in that they seem to produce lots of vaguely accurate and specific text, and you would hope the specificity would mean it's good.

But they sort of don't capture intent very well, nor do they seem to understand the failure modes of AI. The "-- describe only what the code change does" is a good example: it is specific, but it also distinctly seems like it came from someone who doesn't actually understand what makes AI writing obvious.

If you compare that vs human written prose about what makes AI writing feel AI you would see the difference. https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing

The above actually feels like text from someone who has read and understands what makes AI writing AI.


Hey LLM, write me a system prompt that will avoid the common AI 'tells' or other idiosyncrasies that make it obvious that text or code output was generated by an AI/LLM. Use the referenced Wikipedia article as a must-avoid list, but do not consider it exhaustive. Add any derivations or modifications to these rules to catch 'likely' signals as well.

There, sorted!


Hey, LLM, take a look at these multiple hundred emails and docs in my docs folder from the last few years, before I started using AI, that I wrote personally. create a list of all of the idiosyncrasies that I have in my writing. Create a file to remember that. And then use that to write any new text that'll be published so it sounds like my authentic voice. Thank you.

All the prompts I've ever written with Claude have always worked fine the first time. Only revised if the actual purpose changes, I left something out, etc. But also I tend to only write prompts as part of a larger session, usually near the end, so there's lots of context available to help with the writing.

AI is better at writing prompts than most humans. It requires work and lots of developers don’t think getting good at prompting actually matters.

At least half of the complaints I see on HN boil down to the person's prompts suck. Or the expectation that AI can read their mind.


As someone who often fails to read subtext, I would estimate that most people expect you to participate in mind reading as a natural part of conversation.

So it is no surprise that many people have difficulty switching gears to literal mode when interacting with these models.


That's not supposed to be surprising. They're dogfooding CC to develop CC. I assume any and every line in this repo is AI generated.

This is my pet peeve with LLMs: they almost always fail to write like a normal human would, mentioning logs or other meta-things that are not at all interesting.

I had a problem to fix and one not only mentioned these "logs", but went on about things like "config", "tests", and a bunch of other unimportant nonsense words. It even went on to point me towards the "manual". Totally robotic monstrosity.

lol?

1) This seems to be strictly for Anthropic internal tooling. 2) It does not "pretend to be human"; it is instructed to "Write commit messages as a human developer would — describe only what the code change does."

Since when "describe only what the code change does" is pretending to be human?

You guys are just mining for things to moan about at this point.


1) It's not clear to me that this is only for internal tooling, as opposed to publishing commits on public GitHub repos. 2) Yes, it does explicitly say to pretend to be a human. From the link on my post:

> NEVER include in commit messages or PR descriptions:

> [...]

> - The phrase "Claude Code" or any mention that you are an AI


That's gonna need an explanation. From the ethics/safety/alignment people.

(We detached this subthread from https://news.ycombinator.com/item?id=47584683.)

Heh, this is what people who are hostile against AI-generated contributions get. I always figured it'd happen soon enough, and here it is in the wild. Who knows where else it's happening...

I pretend to be human most days. I call it the daily facade of who I want to be on a given day. Oh humanity.

Yeah, I thought Anthropic was for AI safety. Telling AI not to be honest is a bad sign.

Time to ask if the contributor know what a Capybara is as a new Turing test

The first two zips I download today were 9.887.340 bytes, why is yours 10.222.630 bytes?

I am Jacques' complete lack of surprise.

That whole “feature” is vile.

How so? Good bit of my global claude.md is dedicated to fighting the incessant attribution in git commits. It is on the same level as the "sent from my iphone" signature - I'm not okay with my commits being advertising board for anthropic.

An LLM company using regexes for sentiment analysis? That's like a truck company using horses to transport parts. Weird choice.

The difference in response time - especially versus a regex running locally - is really difficult to express to someone who hasn't made much use of LLM calls in their natural language projects.

Someone said 10,000x slower, but that's off - in my experience - by about four orders of magnitude. And that's average, it gets much worse.

Now personally I would have maybe made a call through a "traditional" ML widget (scikit, numpy, spaCy, fastText, sentence-transformer, etc) but - for me anyway - that whole entire stack is Python. Transpiling all that to TS might be a maintenance burden I don't particularly feel like taking on. And on client facing code I'm not really sure it's even possible.


So, think of it as a businessman: you don't really care if your customers swear or whatever, but you know that it'll generate bad headlines, so you gotta do something. Just like a door lock isn't designed for a master criminal, you don't need to design your filter for some master swearer; no, you design it well enough that it gives the impression that further tries are futile.

So yeah, you do what's less intensive for the CPU, but also, you do what's enough to prevent the majority of the concerns where a screenshot or log ends up showing blatantly "immoral" behavior.


This door lock doesn’t even work against people speaking French, so I think they could have tried a mite harder.

The up-side of the US market is (almost) everyone there speaks English. The down side is, that includes all the well-networked pearl-clutchers. Europe (including France) will have the same people, but it's harder to coordinate a network of pearl-clutching between some saying "Il faut protéger nos enfants de cette vulgarité!" and others saying "Η τηλεόραση και τα μέσα ενημέρωσης διαστρεβλώνουν τις αξίες μας!" even when they care about the exact same media.

For headlines, that's enough.

For what's behind the pearl-clutching, for what leads to the headlines pandering to them being worth writing, I agree with everyone else on this thread saying a simple word list is weird and probably pointless. Not just for false-negatives, but also false-positives: the Latin influence on many European languages leads to one very big politically-incorrect-in-the-USA problem for all the EU products talking about anything "black" (which includes what's printed on some brands of dark chocolate, one of which I saw in Hungary even though Hungarian isn't a Latin language but an Ugric language and only takes influences from Latin).


I just went through quite an adventure trying to translate back and forth from/to Hungarian to/from different languages to figure out which Hungarian word you meant, and arrived at the conclusion that this language is encrypted against human comprehension.

dark chocolate is "étcsokoládé" literally edible-chocolate in Hungarian.

I heard the throat-clearing "Negró" candy (marketed by a chimney sweep with a soot-covered face) was usually the one that hurt English speakers' sensibilities.


En toute honnêteté, je pense avoir dit "damn it" plus d'une fois à chat gépété avant de fermer la fenêtre dans un accès de rage

Nom de dieu de putain de bordel de merde de saloperie de connard d'enculé de ta mère.

It's like wiping your arse with silk.

There are only Americans on the internet.

Yea.. but.. in English only.

Fortunately I can swear pretty well in Spanish.


Only a native speaker can tell if you swear well in a foreign language.

And Claude can't tell at all.

That's like saying you can use a chisel for woodworking.

If it’s good enough it’s good enough, but just like there are many more options than going full blown LLM or just use a regex there are more options than transpile a massive Python stack to TS or give up.

> Someone said 10,000x slower, but that's off - in my experience - by about four orders of magnitude.

You do know that 10,000x _is_ four orders of magnitude, right? :-D


OP is saying that in their experience it is more like eight orders of magnitude

I guess I need reading glasses ... :-D

They're sending it to an llm anyway tho? Not sure why they wouldn't just add a sentiment field to the requested response shape.

because a regex on the client is free vs gpu compute is absolutely not.

BUT THEY'RE ALREADY RUNNING IT THROUGH THE LLM.

Because they want it to be executed quickly and cheaply without blocking the workflow? Doesn’t seem very weird to me at all.

They probably have statistics on it and saw that certain phrases happen over and over so why waste compute on inference.

More likely their LLM Agent just produced that regex and they didn't even notice.

The problem with regex is multi-language support, and how much the regex will bloat if you want to support even 10 languages.

Supporting 10 different languages in regex is a drop in the ocean. The regex can be generated programmatically and you can compress regexes easily. We used to have a compressed regex that could match any placename or street name in the UK in a few MB of RAM. It was silly quick.

I think it will depend on the language. There are a few non-latin languages where a simple word search likely won't be enough for a regex to properly apply.

Exactly this. Unicode is a big beast to consider in regex concats.

Woah. This is a regex use I've never heard of. I'd absolutely love to see a writeup on this approach - how it's done and when it's useful.

You can literally | together every street address or other string you want to match in a giant disjunction, and then run a DFA/NFA minimization over that to get it down to a reasonable size. Maybe there are some fast regex simplification algorithms as well, but working directly with the finite automata has decades of research and probably can be more fully optimized.
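Here's a minimal sketch of the prefix-sharing half of that idea in Python. It's illustrative only: real tools like Regexp::Trie, or full DFA/NFA minimization, also share suffixes and do it much more aggressively.

```python
import re

def trie_regex(words):
    """Build one regex matching exactly the given words, sharing
    common prefixes so the pattern stays compact."""
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker

    def emit(node):
        alts, can_end = [], False
        for ch, child in sorted(node.items()):
            if ch == "":
                can_end = True  # a word may terminate at this node
            else:
                alts.append(re.escape(ch) + emit(child))
        if not alts:
            return ""
        if len(alts) == 1 and not can_end:
            return alts[0]
        return "(?:" + "|".join(alts) + ")" + ("?" if can_end else "")

    return emit(trie)

# "ffs", "wtf", "wth" collapse into one shared-prefix pattern.
pattern = re.compile(r"\b" + trie_regex(["ffs", "wtf", "wth"]) + r"\b")
```

With the three sample words this produces `(?:ffs|wt(?:f|h))`, and the same construction scales to huge word lists like the placename example above.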

This was many moons ago, written in perl. From memory we used Regexp::Trie - https://metacpan.org/release/DANKOGAI/Regexp-Trie-0.02/view/...

We used it to tokenize search input and combined it with a solr backend. Worked really remarkably well.


We're talking about Claude Code. If you're coding and not writing or thinking in English, the agents and people reading that code will have bigger problems than a regexp missing a swear word :).

I talk to it in non-English, but have rules to keep everything in code and documentation in English; only conversation with me uses my native language. Why would that be a problem?

Because 90% of the training data was in English, and therefore the model performs best in that language.

In my experience these models work fine using another language, if it’s a widely spoken one. For example, sometimes I prompt in Spanish, just to practice. It doesn’t seem to affect the quality of code generation.

They literally just have to subtract the vector for the source language and add the vector for the target.

It’s the original use case for LLMs.
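The vector-offset arithmetic being alluded to is easy to demonstrate on a toy embedding space. The 3-d vectors below are made up for illustration; real embeddings have hundreds of dimensions and the analogy is far less clean.

```python
import math

# Made-up toy embeddings; only their relative geometry matters here.
vec = {"king": (0.9, 0.8, 0.1), "man": (0.5, 0.2, 0.1),
       "woman": (0.5, 0.2, 0.9), "queen": (0.9, 0.8, 0.9)}

def sub_add(a, b, c):
    # a - b + c, elementwise: the classic analogy offset.
    return tuple(x - y + z for x, y, z in zip(a, b, c))

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

target = sub_add(vec["king"], vec["man"], vec["woman"])
best = max(vec, key=lambda w: cos(vec[w], target))  # nearest by cosine
```

In this toy space `best` comes out as "queen"; swapping language vectors in and out works the same way in principle, which is why translation was the poster child for this kind of model.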


Thank you. +1. There are obviously differences and things getting lost or slightly misaligned in the latent space, and these do cause degradation in reasoning quality, but the decline is very small in high resource languages.

It’s just a subjective observation.

It just can't be the case, simply because of how ML works. In short, the more diverse, high-quality texts with reasoning-rich examples there were in the training set, the better the model performs in a given language.

So unless the Spanish subset had much more quality-dense examples, to make up for volume, there is no way the quality of reasoning in Spanish is on par with English.

I apologise for the rambling explanation; I'm sure someone with ML expertise here can explain it better.


I saw a curious post recently that explored this idea, and showed that it isn’t really the case. The internal layers of the model aren’t really reasoning in English, or in any human language.

Translation in/out of human languages only happens at the edges of the model.

Internal layer activations for the same concept are similar regardless of language, while activations at the top/bottom layers diverge. Meanwhile the pattern is reversed for same language different content.


So we do at least agree on the fact that the quality of the human language <-> embeddings transition depends on how well the target language is represented in the training dataset?

Even if it happens at the edge, it happens on every conversation turn, so I may assume small uncaptured subtleties of meaning can accumulate over time into significant error.


The agent may be translating to English behind the scenes. There are all kinds of hacks layered on top of the underlying models. But however exactly it works, it’s clearly not the case that these things only make use of Spanish training data when you prompt them in Spanish.

Let's take a language with orders of magnitude less training data, like Nahuatl. The following is what happens if I ask Claude Code in Nahuatl to give me an example of a simple program that I can execute inside a web browser (avoiding any direct keyword prompts like 'Javascript' or 'Hello World'). Interestingly, it responds in Spanish. I doubt that it has any relevant training data in Nahuatl. (Apologies if the Nahuatl is wrong – I just used Google Translate.)

    ¿Uelis techmakas se neskayotl tlen se programa tlen amo ouij tlen ueliskia nijchiuas ipan se navegador web?

    -----

    Entiendo tu pregunta en náhuatl. Aquí un ejemplo sencillo — una calculadora en HTML puro que puedes abrir directo en el navegador, sin servidor:

    <!DOCTYPE html>
    <html>
    <head>
        <meta charset="UTF-8">
        <title>Calculadora</title>
        <style>
        body { font-family: sans-serif; display: flex; justify-content: center; padding: 40px; }
        .calc { border: 1px solid #ccc; padding: 20px; border-radius: 8px; width: 200px; }
        input { width: 100%; margin-bottom: 10px; font-size: 1.2em; text-align: right; }
        button { width: 22%; margin: 2px; padding: 8px; font-size: 1em; cursor: pointer; }
        .wide { width: 47%; }
        </style>
    </head>
    <body>
    <div class="calc">
        <input type="text" id="pantalla" readonly>
        <div>
        <button onclick="tecla('7')">7</button>
        <button onclick="tecla('8')">8</button>
        <button onclick="tecla('9')">9</button>
        <button onclick="tecla('/')">÷</button>
    
        <button onclick="tecla('4')">4</button>
        <button onclick="tecla('5')">5</button>
        <button onclick="tecla('6')">6</button>
        <button onclick="tecla('*')">×</button>
    
        <button onclick="tecla('1')">1</button>
        <button onclick="tecla('2')">2</button>
        <button onclick="tecla('3')">3</button>
        <button onclick="tecla('-')">−</button>
    
        <button onclick="tecla('0')">0</button>
        <button onclick="tecla('.')">.</button>
        <button onclick="tecla('+')">+</button>
        <button onclick="limpiar()">C</button>
    
        <button class="wide" onclick="calcular()">=</button>
        <button class="wide" onclick="borrar()">⌫</button>
        </div>
    </div>
    
    <script>
        const pantalla = document.getElementById('pantalla');
    
        function tecla(val) {
        pantalla.value += val;
        }
    
        function calcular() {
        try {
            pantalla.value = eval(pantalla.value);
        } catch {
            pantalla.value = 'Error';
        }
        }
    
        function limpiar() { pantalla.value = ''; }
        function borrar() { pantalla.value = pantalla.value.slice(0, -1); }
    </script>
    </body>
    </html>
    
    Guarda esto como calculadora.html y ábrelo en cualquier navegador — no necesita servidor ni dependencias. Es un buen punto de partida para aprender HTML,
    CSS y JavaScript.

> it’s clearly not the case that these things only make use of Spanish training data when you prompt them in Spanish.

It’s not! And I’ve never said that.

Anyways, I'm not even sure what we are arguing about, as it's 100% fact that SOTA models perform better in English; the only interesting question here is how much better: is it negligible, or does it actually make a difference in real-world use cases?


It’s negligible as far as I can tell. If the LLM can “speak” the language well then you can prompt it in that language and get more or less the same results as in English.

In my experience agents tend to (counterintuitively) perform better when the business language is not English / does not match the code's language. I'm assuming the increased attention mitigates the higher "cognitive" load.

Claude handles human languages other than English just fine.

They only need to look at one language to get a statistically meaningful picture into common flaws with their model(s) or application.

If they want to drill down to flaws that only affect a particular language, then they could add a regex for that as well/instead.


Did you just complain about bloat, in anything using npm?

Why do you need to do it at the client side? You are leaking so much information on the client side. And considering the speed of Claude code, if you really want to do on the client side, a few seconds won't be a big deal.

Depends on what it's used by. If I recall, there's an `/insights` command/skill (built in, or whatever you want to call it) that generates an HTML file. I believe it gives you stats on when you're frustrated with it and (useless) suggestions on how to "use claude better".

Additionally, after looking at the source, it looks like a lot of Anthropic's own internal test/debug tooling (i.e. stuff stripped out at build time) is in this source mapping. There's one part that prompts their own users (or whatever) to use a report-issue command whenever frustration is detected. It's possible it's using it for this.


> a few seconds won't be a big deal

it is not that slow


It looks like it's just for logging, why does it need to block?

Better question: why would you call an LLM (expensive in compute terms) for something that a regex can do (cheap in compute terms)?

Regex is going to be something like 10,000 times quicker than the quickest LLM call, multiply that by billions of prompts
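Easy to sanity-check locally. The pattern below is illustrative, not Anthropic's actual list, and the LLM side of the comparison is an assumed ballpark (a network round-trip plus inference is hundreds of milliseconds at best):

```python
import re
import time

# Illustrative frustration pattern, in the spirit of the one under discussion.
FRUSTRATION = re.compile(r"\b(?:wtf|ffs|this is (?:so )?frustrating)\b")

prompt = "why does the build keep failing, this is so frustrating " * 10

# Time many searches to average out noise.
N = 10_000
start = time.perf_counter()
for _ in range(N):
    FRUSTRATION.search(prompt)
per_call_us = (time.perf_counter() - start) / N * 1e6

print(f"regex: ~{per_call_us:.1f} microseconds per prompt")
```

On any modern machine this lands in the low microseconds per prompt, so against a multi-hundred-millisecond LLM call the gap really is several orders of magnitude.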


This is assuming the regex is doing a good job. It is not. Also you can embed a very tiny model if you really want to flag as many negatives as possible (I don't know anthropic's goal with this) - it would be quick and free.

I think it's a very reasonable tradeoff, getting 99% of true positives at a fraction of the cost (both runtime and engineering).

Besides, they probably do a separate analysis on server side either way, so they can check a true positive to false positive ratio.


Oh it’s worse than that. This one ended up getting my account banned: https://github.com/anthropics/claude-code/issues/22284

This is a tricky problem, I mean, Pinyin also uses the English alphabet.

It is not a tricky problem because it has a simple and obvious solution: do not filter or block usage just because the input includes a word like "gun".

Wow, that's horrible.

... and closed for inactivity like basically every issue in the repo, of course.

Because they actually want it to work 100% of the time and cost nothing.

Maybe hard to believe but not everyone is speaking English to Claude

Then they made it wrong. For example, "What the actual fuck?" is not getting flagged, neither is "What the *fuck*".

It is exceedingly obvious that the goal here is to catch at least 75-80% of negative sentiment and not to be exhaustive and pedantic and think of every possible way someone could express themselves.

Classic over-engineering. Their approach is just fine 90% of the time for the use case it’s intended for.

75-80% [1], 90%, 99% [2]. In other words, no one has any idea.

I doubt it's anywhere that high because even if you don't write anything fancy and simply capitalize the first word like you'd normally do at the beginning of a sentence, the regex won't flag it.

Anyway, I don't really care, might just as well be 99.99%. This is not a hill I'm going to die on :P

[1]: https://news.ycombinator.com/item?id=47587286

[2]: https://news.ycombinator.com/item?id=47586932


It compares against lowercased input, so that doesn't matter. The rest is still valid.

Except that it's a list of English keywords. Swearing at the computer is the one thing I'll hear devs switch back to their native language for constantly

They evidently ran a statistical analysis and determined that virtually no one uses those phrases as a quick retort to a model's unsatisfying answer... so they don't need to optimize for them.

what you are suggesting would be like a truck company using trucks to move things within the truck

That’s what they do. Ever heard of a hand truck?

I never knew the name of that device.

Thanks


Depending on the region you live in, it's also frequently called a "dolly"

Isn’t a dolly a flat 4 wheeled platform thingy? A hand truck is the two wheeled thing that tilts back.

Ha! Where I'm from a "dolly" was the two-wheeled thing. The four-wheeler thing wasn't common before big-boxes took over the hardware business, but I think my dad would have called it a "cart", maybe a "hand-cart".

Grew up with two wheeled: dolly and four wheeled: piano dolly. Was an adult when I heard hand-truck. I prefer dolly. Nicer mouth feel.

Do we have a hand llm perchance?

Yeah it’s called a regex. With a lot of human assistance it can do less but fits in smaller spaces and doesn’t break down.

It’s also deterministic, unlike llms…

Well, regex doesn't hallucinate....right?

I just went to expertSexChange.com…

buttbuttination


Good to have more than a hammer in your toolbox!

Because they are engineers? The difference between an engineer and a hobbyist is that an engineer has to optimize for cost.

As they say: any idiot can build a bridge that stands, only an engineer can build a bridge that barely stands.


Cloud hosted call centers using LLMs is one of my specialties. While I use an LLM for more nuanced sentiment analysis, I definitely use a list of keywords as a first level filter.

A lot of things don't make sense until you involve scale. Regex could be good enough to give a general gist.

Asking non-deterministic software to act like deterministic software (regex) can be a significantly higher use of tokens/compute for no benefit.

Some things will be much better with inference, others won’t be.


Using some ML to derive a sentiment regex seems like a good actually?

It's more like a truck company using people to transport some parts. I could be wrong here, but I bet this happens in Volvo's factories a lot.

Don’t worry, they used an llm to generate the regex.

This just proves it's vibe coded, because LLMs love writing solutions like that. I probably have a hundred examples just like it in my history.

Actually, this could be a case where it's useful. Even if it only catches half the complaints, that's still a lot of data, far more than ordinary telemetry used to collect.

Because the impact of "WTF" might be lost in the result of the analysis if you rely solely on an LLM.

Parsing "WTF" with a regex also signifies the impact and reduces the noise in the metrics.

"Determinism > non-determinism": when you are analysing sentiment, why not make some things more deterministic?

A cool thing about this solution is that you can evaluate LLM sentiment accuracy against the regex-based approach and analyse discrepancies.


It's more like workers on a large oil tanker using bicycles to move around it, rather than trying to use another oil tanker.

Not everything done by claude-code is decided by LLM. They need the wrapper to be deterministic (or one-time generated) code?

I used regexes in a similar way, but my implementation was vibecoded. Hmmm, by your analysis, Claude Code writes code by hand.

The amount of trust and safety work that depends on Google Translate and the humble regex beggars belief.

> That's like a truck company using horses to transport parts. Weird choice.

Easy way to claim more “horse power.”


More like a car company transporting their shipments by truck. It's more efficient

LLMs cost money, regular expressions are free. It really isn't so strange.

As far as I can tell they do nothing with it. They just log it.

LLMs are good at writing complex regex, from my experience

They had the problem of sentiment analysis. They use regexes.

You know the drill.


Using regex with LLMs isn't uncommon at all.

Maybe. Could just be a pre filter.

They're searching for multiple substrings in a single pass, regexes are the optimal solution for that.

The issue isn't that regex are a solution to find a substring. The issue is that you shouldn't be looking for substrings in the first place.

This has buttbuttin energy. Welcome to the 80s I guess.


> The issue is that you shouldn't be looking for substrings in the first place.

Why? They clearly just want to log conversations that are likely to display extreme user frustration with minimal overhead. They could do a full-blown NLP-driven sentiment analysis on every prompt but I reckon it would not be as cost-effective as this.


>Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

The only time to use a regex is when searching with a human in the loop. All other uses are better handled some other way.

>They could do a full-blown NLP-driven sentiment analysis on every prompt but I reckon it would not be as cost-effective as this.

Every conversation is sent to an LLM at least a thousand times the size of GPT-2, which could one-shot this nearly a decade ago.


> Every conversation is sent to an llm at least a thousand times the size of gpt2 which could one shot this nearly a decade ago.

Yes, but that is _what the product does_. What we are talking about is _telemetry_.


Very likely vibe coded.

I've seen Claude Code go with a regex approach for a similar sentiment-related task.


My understanding of vibe coding is when someone doesn’t look at the code and just uses prompts until the app “looks and acts” correct.

I doubt you are making regex and not looking at it, even if it was AI generated.


Clbuttic!

It's fast, but it'll miss a ton of cases. This feels like it would be better served by a prompt instruction, or an additional tiny neural network.

And some of the entries are too short and will create false positives. It'll match the word "offset" ("ffs"), for example. EDIT: no it won't, I missed the \b. Still sounds weird to me.


It’s fast and it matches 80% of the cases. There’s no point in overengineering it.

> There’s no point in overengineering it.

I swear this whole thread about regexes is just fake rage at something, and I bet it'd be reversed had they used something heavier (omg, look they're using an LLM call where a simple regex would have worked, lul)...


The pattern only matches if both ends are word boundaries. So "diffs" won't match, but "Oh, ffs!" will. It's also why they had to use the pattern "shit(ty|tiest)" instead of just "shit".
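A quick check of that boundary behavior (Python's `re` here; JS `\b` semantics are the same for ASCII). I've made the suffixes optional for illustration; the pattern quoted above requires them:

```python
import re

# Same shape as the pattern under discussion: lowercase the input,
# then require word boundaries around each keyword.
pattern = re.compile(r"\b(?:ffs|shit(?:ty|tiest)?)\b")

def flagged(prompt):
    return pattern.search(prompt.lower()) is not None

assert flagged("Oh, FFS!")             # boundaries on both sides of "ffs"
assert not flagged("check the diffs")  # "ffs" inside "diffs": no boundary
assert flagged("the shittiest build")  # suffix form still ends on a boundary
```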

You're right, I missed the \b's. Thanks for the correction.

It's all regex anyways

It's like a faster-than-light spaceship company using horses. There have been infinite solutions to do this better, even CPU-only, for years lol.

hmm not a terrible idea (I think).

You have a semi-expensive process, but you want to keep particular known context out of it. So you put a quick and dirty check in front of the expensive process: instead of 'figure sentiment (20 seconds)', you have 'quick check sentiment (<1 sec)' and then 'figure sentiment v2 (5 seconds)'. Now if it were just pure regex, then your analogy would hold up just fine.

I could see me totally making a design choice like that.
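Roughly, in code (keywords and stages made up; the expensive path is stubbed):

```python
import re

CHEAP = re.compile(r"\b(?:wtf|ffs|broken again)\b")  # made-up keyword gate

def expensive_sentiment(prompt):
    # Stand-in for the slow path: a real NLP model or LLM call.
    return "negative"

def analyze(prompt):
    """Two-stage sketch: a near-free regex gate in front of an
    expensive call, so the vast majority of traffic never pays
    for the slow path."""
    if not CHEAP.search(prompt.lower()):
        return None                     # microseconds
    return expensive_sentiment(prompt)  # seconds

assert analyze("please add a unit test") is None
assert analyze("WTF, it's broken again") == "negative"
```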


It's almost as if LLMs are unreliable

Excellent project, unfortunate title. I almost didn't click on it.

I like the tradeoff offered: full access to the current directory, read-only access to the rest, copy-on-write for the home directory. With stricter modes to (presumably) protect against data exfiltration too. It really feels like it should be the default for agent systems.


Since the site itself doesn't really have a title, I probably would've gone with something like "jai - filesystem containment for AI agents"

And there's also fraud. If there's no periodic check, a single diagnosis from a corrupt doctor can give someone disability benefits for life.

This might not be the right frequency, though, and only accepting post/fax is bullshit. Doubly so for short deadlines.


The proposed solution seems to rely on a trusted compiler that generates the exact same output, bit-for-bit, as the compiler-under-test would generate if it was not compromised. That seems useful only in very narrow cases.

You have a trusted compiler written in assembly or even machine code. You then use it to compile source code you trust. That output is then used for the bit-for-bit comparison against a different binary of the compiler, to catch the hidden vulnerability.

It's assumed that in this scenario you don't have access to a trusted compiler; if you do, then there's no problem.

And the thesis linked above seems to go beyond simply "use a trusted compiler to compile the next compiler". It involves deterministic compilation and comparing outputs, for example.


Correct. The deterministic comparison is against compiler A compiling itself. Version 1 is compiler A compiling itself with a normal build of compiler A. Version 2 is compiler A compiled with a trusted toolchain. How you get that first trusted toolchain is a challenge, but, for example, you can start with a tiny C compiler (they can be quite small) that's used to compile a larger C compiler that can compile C compilers, and then finally build clang. Then you have a trusted version of clang that can be used to verify the clang binary. From there you just use clang and periodically recheck that no vulnerability has been reintroduced.
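A toy model of the comparison step, with compilers modeled as pure source-to-bytes functions (real diverse double-compiling per Wheeler adds a second self-compilation stage, so you compare outputs of the same compiler source rather than of two different toolchains):

```python
import hashlib

def make_cc(trojaned):
    """Model a compiler as a deterministic function: source -> binary.
    The trojaned variant silently rewrites compiler source before
    building it, mimicking Thompson's self-propagating backdoor."""
    def cc(source):
        if trojaned and "compiler" in source:
            source += "/*injected backdoor*/"
        return hashlib.sha256(source.encode()).digest()
    return cc

suspect = make_cc(trojaned=True)    # the binary you downloaded
trusted = make_cc(trojaned=False)   # bootstrapped from a tiny audited chain

src = "source of compiler A"
assert suspect(src) != trusted(src)   # bit-for-bit mismatch exposes the trojan
assert trusted(src) == trusted(src)   # deterministic build: stable bits
```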

Paradox of choice? It's more related to the number of choices and the impact on people's anxiety, but it's close.

I believe you only need a unique phone number to create the account, then you can use WhatsApp Web as client. Be very careful with alternative clients, as I've had an account banned in the past for this (and therefore a phone number blacklisted), even without messaging anybody. I think that clients that run WhatsApp Web in a web view (like https://github.com/rafatosta/zapzap) are safe.

I think they started banning unauthorized API users around the time that "WhatsApp For Business" was introduced, because it was competing with that product. Unfortunately WhatsApp For Business is geared toward physical products and services with registered companies, so home automation and agents are left with no options.


Those are excellent! The orange shingles are my favorite. Though I think some of them are not working on Firefox; the blue and green vortices are rendered as a single blue rectangle and a single green hexagon.

I wonder how people are using them in a way that is not distracting to the main content. I've found that high-frequency patterns (small details with sharp transitions) can be a bit distracting, but I haven't found a good solution that doesn't compromise the beauty of the backgrounds.


I think it's kind of common to have the background for the whole document and then have an overlay with a solid color (and maybe less-than-100% opacity if you're daring) on which the main content with all the text is shown. This works best for browsers that are full-screen on PC monitors, of course, where you want to limit text width anyways. On mobile or in narrow windows, you don't have a lot of space to show the background.


Thanks. I'm already doing something similar, but I feel like the background that is visible on the sides is still somewhat distracting. Might be my imagination though.


I think that keeping it fixed on scrolling and giving it low saturation should be enough to keep them from being distracting. And obviously, no animations, although a really slow one might work.


> Though I think some of them are not working on Firefox; the blue and green vortices are rendered as a single blue rectangle and a single green hexagon.

Move the sliders


I took a look in Firefox and I think it's working, but it's not obvious that you need to slide the top range slider for the full effect. It would look better if I reversed the effect; I'll have to rethink that.


Firefox (148.0.2) on Linux: the preview is not working for me.

Tried them on Chrome, where it works fine. Nice work!


Controversial opinion from a casual user, but state-of-the-art LLMs now feel to me more intelligent than the average person on the steet. This also explains why training on more average-quality data (if there's any left) is not making improvements.

But LLMs are hamstrung by their harnesses. They are doing the equivalent of providing technical support via phone call: little to no context, and limited to a bidirectional stream of words (tokens). The best agent harnesses have the equivalent of vision-impairment accessibility interfaces, and even those are still subpar.

Heck, giving LLMs time to think was once a groundbreaking idea. Yesterday I saw Claude Code editing a file using shell redirects! It's barbaric.

I expect future improvements to come from harness improvements, especially around sub agents/context rollbacks (to work around the non-linear cost of context) and LLM-aligned "accessibility tools". That, or more synthetic training data.


Steet? Do you mean street? They're smarter in the same way a search engine is smarter.


Yes, "street". Typing from my phone, sorry.

And search engines are narrow tools that can only output copies of its dataset. An LLM is capable of surprisingly novel output, even if the exact level of creativity is heavily debated.


Remixes aren't novel.


Human cultures are remixes all the way down...


If you haven’t seen that “Everything is a Remix” documentary you should. It’s so good.


No they're not


> But LLMs are hamstrung by their harnesses

Entirely so. I think Anthropic updated something about the compact algorithm recently, and it's gone from working well over long sessions to basically garbage whenever a compact happens.


It's so disrespectful to say an LLM is more intelligent than a person on the street. The LLM has nothing at stake, cares not a sausage about the consequences of what it spits out. People have all kinds of pressures, dependants, and personal issues like health. Our thoughts and actions have real consequences. It's so easy to be intelligent when you're the pretend human that gets switched on for five minutes then switched off again.


It's not a value judgement; I'm no misanthrope. But it's a fact of life that we humans must specialize, while LLMs can afford to have "studied" a staggering variety of topics. It's no different than being slower than a car, or weaker than a hydraulic press.

On a different note, LLMs are still not very wise, as displayed by all the prompt attacks and occasional inane responses like walking to the car wash.


On the contrary, it's the machines that have the luxury of specialisation.

I'm faster than a car when scrambling up a mountain. It's easy to be fast when all you do is drive on smooth roads.

I'm stronger than a hydraulic press when carrying a load over a distance. It's easy to be strong when you're limited to a few centimetres of movement.

Intelligence of LLMs is a trick; they are literally trained to sound intelligent. But it's easy to sound intelligent when all you have to do is sound intelligent. A person has to live in the real world, deal with the weather, with feelings, stress, health and, above all, consequences. We don't have the luxury of just being able to sound intelligent, or have wheels that only work on smooth roads. We have to actually be intelligent in our actions. We have to traverse difficult ground, deal with obstacles we've never encountered before. When we make mistakes, people get hurt.


We're probably thinking about this at very different levels. Here's what I meant: I can ask Claude for "a bilingual German-Russian poem about the side effects of the most common drugs used in anesthesia". I would bet my left shoe that if I asked people on the street, no one will do a better job than Claude. And to me, answering questions correctly is a very good metric for intelligence.

We can debate whether that's real intelligence, and whether the question is fair, but this is still a real, measurable capability, that just eight years ago was a pipe dream. This capability is what OP is tracking, and what I believe is impressive but hamstrung by harnesses.


>I would bet my left shoe that if I asked people on the street, no one will do a better job than Claude.

Because it's unrealistic to expect such a niche population to be out wandering around. What it sounds like you want is something that masquerades as intelligence, but is really alien.


> Without explaining that, the rest of this blog post is just rambling notes about developer ergonomics.

That's how I took it, and I enjoyed it thoroughly. If you're making a small app by yourself, sufficiently bad developer ergonomics can be the reason that the app doesn't get made at all, or the frustration makes me regret it. That's important for me.

> Maybe I'm just too young to have ever experienced the kind of stability expected here.

This could be it. I've been around many cycles of technology, and it always feels like a great waste when you have to abandon your tools and experience for something that's buggy and better in only a few small ways. I'm willing to tolerate a lot more bullshit for something that I know will be long-lived, like QT or a static website, than Microsoft's UI-framework-of-the-month.

