They never put the parameter counts in their model names like other AI companies did, but back in the GPT-3 era (i.e. before they had PR people intermediating all their comms channels), OpenAI engineers would disclose this kind of data in their whitepapers / system cards.
IIRC, GPT-3 itself was admitted to be a 175B model, and its reduced variants were disclosed to have parameter counts like 1.3B, 6.7B, 13B, etc.
5.4 Mini's OSWorld score is a pleasant surprise. When SOTA scores were still ~30-40, models were too slow and inaccurate for realtime computer-use agents (RIP Operator/Agent). Curious if anyone's been using these in production.
People seem to dismiss OSWorld as "OpenClaw," but I think they're missing how powerful and flexible that kind of full-interaction automation is for safe workflows.
We have a legacy Win32 application, and we want to do a side-by-side comparison of interactions and responses between it and the web-converted version of the same app. Once you've taught the model that "X = Y" between the desktop and web versions, you've got yourself an automated test suite.
Is it possible to do this another way? Sure, but it isn't cost-effective as you scale the workload out to 30+ Win32 applications.
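A minimal sketch of what that comparison step could look like, assuming the agent reports its observations of each app as a flat dict of field name to value (the function and field names here are hypothetical, not from any real harness):

```python
def diff_observations(desktop: dict, web: dict) -> list[str]:
    """Compare the agent's observations of the Win32 app and its
    web port, returning human-readable mismatches (the "X = Y"
    assertions that failed)."""
    mismatches = []
    # Union of keys so fields missing on either side also show up.
    for field in sorted(desktop.keys() | web.keys()):
        d, w = desktop.get(field), web.get(field)
        if d != w:
            mismatches.append(f"{field}: desktop={d!r} web={w!r}")
    return mismatches

# Example: the 'status' text diverges between the two versions.
report = diff_observations(
    {"status": "Saved", "record_count": 12},
    {"status": "Saved!", "record_count": 12},
)
```

Each mismatch becomes a test failure; scaling to 30+ applications is then just a matter of feeding more observation pairs through the same function.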
In my experience, especially with Opus 4.6, using subagents greatly mitigates the startup context hit. 4.6 has very obviously been RL'ed on subagent usage, and it almost always spins up an Explore agent to get a feel for the codebase and produce a token-efficient summary. The 1M context version of 4.6 further alleviates this.
My original question was more along the lines of implementing things like PR review yourself. I was tinkering with an internal service that spins up ephemeral CC instances to analyze PRs, but realized this can easily generalize across arbitrary tasks. Was curious what sort of things folks could use that for.
Atlas will evolve to collect data for training. There's a bunch of context and content bots can't process or access, but a browser not only gives the mothership a closer look at all the walled-garden services and viral content a user consumes but also a residential IP address.
Curious about this too – does it use the standard context management tools that ship with Claude Code? At 200K context size (or 1M for the beta version), I'm really interested in the techniques used to run it for 30 hours.
You can use the built-in task agent. When you have a plan and are ready for Claude to implement it, just say something along the lines of “begin implementation, split each step into its own subagent, run them sequentially”.
Subagents are where Claude Code shines and Codex still lags behind. Claude Code can do some things in parallel within a single session with subagents; Codex cannot.
Yeah, in parallel. They don't call it yolo mode for nothing! I have Claude configured to commit units of work to git, and after reviewing the commits by hand, they're cleanly separated by file. The todos don't conflict in the first place, though; e.g. changes to the admin API code won't conflict with changes to the submission frontend code, so that's the limited human mechanism I'm relying on for that.
I'll admit it's a bit insane to have it make changes in the same directory simultaneously. I'm sure I could ask it to use git worktrees and work in separate directories, but I haven't needed to try that (yet), so I won't comment on how well it would actually do.
Fundamentally, there is no difference. Blocking syscalls in a Docker container is nothing new; it's one of the standard ways to achieve "sandboxing," and it can already be done right now.
The only thing that caught people's attention was that it was applied to "AI Agents".
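For instance, a Docker seccomp profile can deny individual syscalls; a minimal sketch (the choice of `ptrace` here is just illustrative) allows everything except the listed calls:

```json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["ptrace"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
```

Saved to a file (the name is arbitrary), it's applied with `docker run --security-opt seccomp=<profile>.json …`; any agent process in the container then gets an error back from that syscall instead of executing it.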
Thank you, I don't think I'm doing anything particularly special with my process. I like deep blacks and keep photos slightly underexposed, something most phone cameras avoid like the plague by brightening everything up. Cropping and composition also matter more than they seem, IMO.