Hacker News | technocrat8080's comments

Have they ever talked about their size or weights?


They never put parameter counts in their model names the way other AI companies do, but back in the GPT-3 era (i.e. before they had PR people intermediating all their comms channels), OpenAI engineers would disclose this kind of data in their whitepapers / system cards.

IIRC, GPT-3 itself was admitted to be a 175B model, and its reduced variants were disclosed to have parameter-counts like 1.3B, 6.7B, 13B, etc.


Wow, would love to see a source for this.



5.4 Mini's OSWorld score is a pleasant surprise. When SOTA scores were still ~30-40, models were too slow and inaccurate for realtime computer-use agents (RIP Operator/Agent). Curious if anyone's been using these in production.


People seem to dismiss OSWorld as "OpenClaw," but I think they're missing how powerful and flexible that type of full interaction is for safe workflows.

We have a legacy Win32 application, and we want to compare interactions + responses side by side between it and the web-converted version of the same. Once you've taught the model that "X = Y" between the desktop and web versions, you've got yourself an automated test suite.

Is it possible to do this another way? Sure, but it isn't cost-effective as you scale the workload out to 30+ Win32 applications.
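A rough sketch of what that harness could look like. Everything here is hypothetical: the two drivers and the model "judge" are stubs standing in for a real computer-use agent and an LLM equivalence call.

```python
# Hypothetical sketch of a side-by-side equivalence check between a legacy
# Win32 app and its web port. Drivers and the model "judge" are stubs.

def run_on_desktop(action: str) -> str:
    """Stub: drive the Win32 app and return its observable response."""
    return f"desktop:{action}:ok"

def run_on_web(action: str) -> str:
    """Stub: drive the web port and return its observable response."""
    return f"web:{action}:ok"

def judge_equivalent(desktop: str, web: str) -> bool:
    """Stub for the model call that decides whether 'X = Y' across UIs.
    Here we just compare the part after the platform prefix."""
    return desktop.split(":", 1)[1] == web.split(":", 1)[1]

def run_suite(actions: list[str]) -> dict[str, bool]:
    """Run each action against both apps and record agreement."""
    return {a: judge_equivalent(run_on_desktop(a), run_on_web(a))
            for a in actions}

print(run_suite(["open_invoice", "submit_form"]))
```

The point is the shape, not the stubs: once the judge generalizes "X = Y" mappings, the same loop scales across many apps.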


You can provide the screencapture CLI as a tool to Claude, and it will take screenshots (of specific windows) to verify things visually.
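One way to pre-approve that tool is via Claude Code's settings file; this is a sketch assuming the current permission-pattern syntax, so check it against your version's docs:

```json
{
  "permissions": {
    "allow": ["Bash(screencapture:*)"]
  }
}
```

macOS's `screencapture` supports `-l <windowid>` for a specific window and `-x` to suppress the shutter sound; listing window IDs requires a third-party helper, which is not built in.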


In my experience, especially with Opus 4.6, using subagents greatly mitigates the startup context hit. 4.6 has very obviously been RL'ed on subagent usage, and it almost always spins up an Explore agent to get a feel for the codebase and produce a token-efficient summary. The 1M-context version of 4.6 further alleviates this.

My original question was more along the lines of implementing things like PR review yourself. I was tinkering with an internal service that spins up ephemeral CC instances to analyze PRs, but realized this can easily generalize across arbitrary tasks. Was curious what sort of things folks could use that for.


To be clear, it's not a TEE replacement, but it does address one of the most common use cases of TEEs.


Seems pretty obvious Sky.app's functionality will land in the macOS ChatGPT app at some point. I wonder how Atlas fits into that story.


Atlas will evolve to collect data for training. There's a bunch of context and content bots can't process or access, but a browser not only gives the mothership a closer look at all the walled-garden services and viral content a user consumes, it also provides a residential IP address.


Sky is macOS only. It essentially gives an LLM access to various system APIs coupled with a floating user interface that you can access on command.


Curious about this too – does it use the standard context management tools that ship with Claude Code? At 200K context size (or 1M for the beta version), I'm really interested in the techniques used to run it for 30 hours.


Sub-agents. I've had Claude Code run a prompt for hours on end.


What kind of agents do you have set up?


You can use the built-in Task agent. When you have a plan and are ready for Claude to implement it, just say something along the lines of “begin implementation, split each step into its own subagent, run them sequentially”.


Subagents are where Claude Code shines and Codex still lags behind. Claude Code can do some things in parallel within a single session with subagents; Codex cannot.


By parallel, do you mean editing the codebase in parallel? Does it use some kind of mechanism to prevent collisions (e.g. worktrees)?


Yeah, in parallel. They don't call it yolo mode for nothing! I have Claude configured to commit units of work to git, and after reviewing the commits by hand, they're cleanly separated by file. The todos don't conflict in the first place though; e.g. changes to the admin API code won't conflict with changes to the submission frontend code, so that's the limited human mechanism I'm using for that.

I'll admit it's a bit insane to have it make changes in the same directory simultaneously. I'm sure I could ask it to use git worktrees and have it work in separate directories, but I haven't needed to try that yet, so I won't comment on how well it would actually do.
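For anyone curious, the worktree setup itself is cheap. A minimal sketch (the repo, branch, and directory names are made up for illustration):

```shell
set -e
# Sketch: one git worktree per parallel task, so simultaneous edits land in
# separate directories instead of one shared working tree.
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git -c user.name=agent -c user.email=agent@example.com \
    commit -q --allow-empty -m "init"
# Each subagent gets its own checkout on its own branch.
git worktree add -q ../wt-admin -b agent/admin-api
git worktree add -q ../wt-frontend -b agent/frontend
git worktree list
```

Each subagent then works inside its own directory, and the branches merge back normally.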


I personally do not do any writes in parallel, but parallelism works great for read operations like investigating multiple failing tests.


A bit confused, all this to say you folks use standard containerization?


Same. I didn't really understand what the difference is compared to containerization.


Fundamentally, there is no difference. Blocking syscalls in a Docker container is nothing new; it's one of the standard ways to achieve "sandboxing" and can already be done today.

The only thing that caught people's attention was that it was applied to "AI Agents".
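Concretely, Docker already does this via seccomp profiles. A minimal sketch that allows everything except two network-related syscalls (real profiles are usually deny-by-default, and the exact syscall list here is illustrative):

```json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["socket", "connect"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
```

Applied with something like `docker run --security-opt seccomp=no-net.json …` — no agent-specific machinery involved.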


What is so fundamentally different for AI agents?


Nothing, other than "AI agents" being the current popular thing; like any other program, it changes absolutely nothing.


The fact that the first thing people are going to do is punch holes in the sandbox with MCP servers?


My god, how can I take photos like that? Do you describe your setup and process somewhere?


Thank you. I don't think I'm doing anything particularly special with my process. I like blacks and keep photos slightly underexposed, something most phone cameras will avoid like the plague, brightening everything up. Cropping and composition are also more important than they seem, IMO.

