More

gck1 · 2026-04-07T06:59:57 1775545197

That probably matters for some scenarios, but I have yet to find one where thinking tokens didn't hint at the root cause of the failure.

All of my unsupervised worker agents have sidecars that inject messages when thinking tokens match some heuristics. For example, any time opus says "pragmatic", its instant Esc Esc > "Pragmatic fix is always wrong, do the Correct fix", also whenever "pre-existing issue" appears (it's never pre-existing).

lelanthran · 2026-04-07T09:08:14 1775552894

> For example, any time opus says "pragmatic", its instant Esc Esc > "Pragmatic fix is always wrong, do the Correct fix", also whenever "pre-existing issue" appears (it's never pre-existing).

It's so weird to see language changes like this: Outside of LLM conversations, a pragmatic fix and a correct fix are orthogonal. IOW, fix $FOO can be both.

From what you say, your experience has been that a pragmatic fix is on the same axis as a correct fix; it's just a negative on that axis.

b112 · 2026-04-07T10:03:43 1775556223

It's contextual though, and pragmatic seems different to me than correct.

For example, if you have $20 and a leaking roof, a $20 bucket of tar may be the pragmatic fix. Temporary but doable.

Some might say it is not the correct way to fix that roof. At least, I can see some making that argument. The pragmatism comes from "what can be done" vs "should be".

From my perspective, it seems viable usage. And I guess on wonders what the LLM means when using it that way. What makes it determine a compromise is required?

(To be pragmatic, shouldn't one consider that synonyms aren't identical, but instead close to the definition?)

lelanthran · 2026-04-07T11:51:55 1775562715

> It's contextual though, and pragmatic seems different to me than correct.

To me too, that's why I say they are measurements on different dimensions.

To my mind, I can draw a X/Y axis with "Pragmatic" on the Y and "Correctness" on the X, and any point on that chart would have an {X,Y} value, which is {Pragmatic, Correctness}.

If I am reading the original comment correctly, poster's experience of CC is that it is not an X/Y plot, it is a single line plot, with "Pragmatic" on the extreme left and "Correctness" on the extreme right.

Basically, any movement towards pragmatism is a movement away from correctness, while in my model it is possible to move towards Pragmatic while keeping Correctness the same.

mikkupikku · 2026-04-07T16:10:05 1775578205

I had some interesting experience to the opposite last night, one of my tests has been failing for a long time, something to do with dbus interacting with Qt segfaulting pytest. Been ignoring it for a long time, finally asked claude code to just remove the problematic test. Come back a few minutes later to find claude burning tokens repeatedly trying and failing to fix it. "Actually on second thought, it would be better to fix this test."

Match my vibes, claude. The application doesn't crash, so just delete that test!

gck1 · 2026-04-07T06:35:58 1775543758

The problem with specs for me is always with boundaries. How many specs do you have for a complex project? How do they reference each other? What happens when requirements cross boundaries?

And finally, how do you address spec drift?

gck1 · 2026-04-06T23:16:26 1775517386

I've noticed this too, but not necessarily type checkers, but more with linters. And can't really figure out if there's even a way to solve it.

If you set up restrictive linters and don't explicitly prohibit agents from adding inline allows, most LOC will be allow comments.

Based on this learning, I've decided to prohibit any inline allows. And then agents started doing very questionable things to satisfy clippy.

Recent example:

- Claude set up a test support module so that it could reuse things. Since this was not used in all tests, rust complained about dead_code. Instead of making it work, claude decided to remove test support module and just... blow up each test.

If you enable thinking summaries, you'll always see agent saying something like: "I need to be pragmatic", which is the right choice 50% of the time.

gck1 · 2026-04-05T20:20:54 1775420454

> I could just try and hack you (not even successfully) and Google will just shut down your entire life rather than attempt to work out who's right.

Had this happen to me. Fortunately the 'attacker' wasn't actually trying to do this, so damage was limited, but it's chilling when you think about what some motivated script kiddy can do with your Google account just by requesting password resets.

gck1 · 2026-04-05T01:12:52 1775351572

> And you don't have to get anyone's permission to use tmux.

I'm not so sure about that. I have my own local multi agent orchestration setup with tmux and native claude code, codex, and gemini. They can talk between each other using tmux.

I'm not sure precisely when does a wrapper around Claude Code become a "third party harness". If OpenCode is a third party harness, why is tmux not?

Aperocky · 2026-04-05T02:00:27 1775354427

on spirit, yes. programmatically speaking, not really.

And judging by intention is a slippery slope, and they'll probably end up just banning everyone who are high usage. While some smart orchestration setup will fly under the radar (i.e. trivial to rate limit the cross tmux communication).

gck1 · 2026-04-01T02:29:39 1775010579

Oh no. I pay for it, not antigravity, but gemini-cli.

The limits and response times are laughable. Remove branding and you'd think it's served by some hobbyist on a homelab server, not Google

gck1 · 2026-03-30T04:44:04 1774845844

Camoufox?

gck1 · 2026-03-30T04:40:57 1774845657

I always wondered why you even have logged out access. I'm glad I can use ChatGPT in incognito when I want a "clean room" response, but surely that's not the primary use case.

Is user base that never logs in really that significant?

pocksuppet · 2026-03-30T12:46:37 1774874797

This episode proves they know who you are, even when you're logged out. If they didn't know, they wouldn't let you use the service.

gck1 · 2026-03-28T17:41:58 1774719718

I'm curious, what does your workflow look like? I saw a plan prompt there, but no specs. When you want to change something, implement a new feature etc, do you just prompt requirements, have it write the plan and then have it work on it?

gck1 · 2026-03-28T05:41:43 1774676503

It's full VM or nothing.

I want AI to have full and unrestricted access to the OS. I don't want to babysit it and approve every command. Everything that is on that VM is a fair game and the VM image is backed up regularly from outside.

This is the only way.

adi_kurian · 2026-03-28T13:22:56 1774704176

I have a pretty insane thing where I patched the screen sharing binary and hand rolled a dummy MDN so I can have multiple profiles logged in at once on my Mac Studio. Then have screen share of diff profiles in diff "windows". Was for some ML data gathering / CV training.

It's pretty neat, screen sharing app is extremely high quality these days, I can barely notice a diff unless watching video. Almost feels like Firefox containers at OS level.

Have thought that could be a pretty efficient way to have restricted unrestricted convenient AI access. Maybe I'll get around to that one day.

gck1 · 2026-03-30T03:59:30 1774843170

> I have a pretty insane thing where I patched the screen sharing binary and hand rolled a dummy MDN so I can have multiple profiles logged in at once on my Mac Studio

I have a Studio collecting dust that I've been eyeing every time my VM crashed because of Apple's paravirtualized GPU proxy not being able to keep up with things I run in it.

This sounds exactly like what I wanted to do on my Studio and didn't know where to pull the thread from.

Do you have this method shared openly anywhere?

adi_kurian · 2026-03-31T01:08:11 1774919291

Nah but I'd be happy to share it with you over DM! (If they have DMs on here?)

griffindor · 2026-03-28T08:20:37 1774686037

I use Nix shells to give it the tools it wants.

If it wants to do system-level tests, then I make sure my project has Qemu-based tests.