They never put the parameter counts in their model names like other AI companies did, but back in the GPT-3 era (i.e. before they had PR people intermediating all their comms channels), OpenAI engineers would disclose this kind of data in their whitepapers / system cards.
IIRC, GPT-3 itself was admitted to be a 175B model, and its reduced variants were disclosed to have parameter counts like 1.3B, 6.7B, 13B, etc.
5.4 Mini's OSWorld score is a pleasant surprise. When SOTA scores were still ~30-40, models were too slow and inaccurate for realtime computer-use agents (RIP Operator/Agent). Curious if anyone's been using these in production.
People seem to dismiss OSWorld as "OpenClaw," but I think they're missing how powerful and flexible that kind of full-interaction automation is for safe workflows.
We have a legacy Win32 application, and we want to do a side-by-side comparison of interactions and responses between it and the web-converted version of the same app. Once you've taught the model that "X = Y" between the desktop and web versions, you've got yourself an automated test suite.
Is it possible to do this another way? Sure, but it isn't cost-effective as you scale the workload out to 30+ Win32 applications.
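A minimal sketch of what that comparison step could look like, assuming the agent reports its observations of each app as a flat dict of field name to value (the function and field names here are hypothetical, not from any real harness):

```python
def diff_observations(desktop: dict, web: dict) -> list[str]:
    """Compare the agent's observations of the Win32 app and its
    web port, returning human-readable mismatches (the "X = Y"
    assertions that failed)."""
    mismatches = []
    # Union of keys so fields missing on either side also show up.
    for field in sorted(desktop.keys() | web.keys()):
        d, w = desktop.get(field), web.get(field)
        if d != w:
            mismatches.append(f"{field}: desktop={d!r} web={w!r}")
    return mismatches

# Example: the 'status' text diverges between the two versions.
report = diff_observations(
    {"status": "Saved", "record_count": 12},
    {"status": "Saved!", "record_count": 12},
)
```

Each mismatch becomes a test failure; scaling to 30+ applications is then just a matter of feeding more observation pairs through the same function.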
In my experience, especially with Opus 4.6, using subagents greatly mitigates the startup context hit. 4.6 has very obviously been RL'ed on subagent usage, and it almost always spins up an Explore agent to get a feel for the codebase and produce a token-efficient summary. The 1M context version of 4.6 further alleviates this.
My original question was more along the lines of implementing things like PR review yourself. I was tinkering with an internal service that spins up ephemeral CC instances to analyze PRs, but realized this can easily generalize across arbitrary tasks. Was curious what sort of things folks could use that for.
Atlas will evolve to collect data for training. There's a bunch of context and content bots can't process or access, but a browser not only gives the mothership a closer look at all the walled-garden services and viral content a user consumes but also a residential IP address.
Curious about this too – does it use the standard context management tools that ship with Claude Code? At 200K context size (or 1M for the beta version), I'm really interested in the techniques used to run it for 30 hours.
You can use the built-in task agent. When you have a plan and are ready for Claude to implement it, just say something along the lines of “begin implementation, split each step into its own subagent, run them sequentially”.
Subagents are where Claude Code shines and Codex still lags behind. Claude Code can do some things in parallel within a single session with subagents; Codex cannot.
Yeah, in parallel. They don't call it yolo mode for nothing! I have Claude configured to commit units of work to git, and after reviewing the commits by hand, they're cleanly separated by file. The todos don't conflict in the first place, though; e.g. changes to the admin API code won't conflict with changes to the submission frontend code, so that's the limited human mechanism I'm relying on for that.
I'll admit it's a bit insane to have it make changes in the same directory simultaneously. I'm sure I could ask it to use git worktrees and work in separate directories, but I haven't needed to try that (yet), so I won't comment on how well it would actually do.
Fundamentally, there is no difference. Blocking syscalls in a Docker container is nothing new; it's one of the standard ways to achieve "sandboxing," and it can already be done right now.
The only thing that caught people's attention was that it was applied to "AI Agents".
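For instance, a Docker seccomp profile can deny individual syscalls; a minimal sketch (the choice of `ptrace` here is just illustrative) allows everything except the listed calls:

```json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["ptrace"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
```

Saved to a file (the name is arbitrary), it's applied with `docker run --security-opt seccomp=<profile>.json …`; any agent process in the container then gets an error back from that syscall instead of executing it.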
Thank you, I don't think I'm doing anything particularly special with my process. I like deep blacks and keep photos slightly underexposed, something most phone cameras avoid like the plague by brightening everything up. Cropping and composition also matter more than they seem, IMO.