I'll copy the highlights here, but the tweets have imagery as well:
> The obvious hype - It crushes benchmarks across the board, and it does so with fewer tokens per task.
> Despite this, they don’t think it can self-improve on its own. There are still areas your average engineer does better with, and despite it accelerating tasks by 4x, that only translates to <2x increase in overall progress.
> They’re probably right to hold this back - its ability to exploit things is unprecedented. Any site running on an old stack right now or any traditional industry with outdated software should be terrified if this becomes accessible.
> Counterintuitively, while it’s the most dangerous model, it’s also the safest. They’ve also seen significant additional improvements in safety between their early versions of Mythos and the preview version.
> Anthropic does a really good job of documenting some of the rare dangerous behaviors the early models had.
> Interestingly, Mythos itself leaked a recent internal “code related artifact” on github.
> Mythos is also RUTHLESS in Vending Bench. Agent-as-a-CEO might be viable?
> The last thing: Mythos has emergent humor. One of the first models I’ve seen that’s witty. The examples are puns it came up with and witty slack responses it had when operating as a bot.
  # Iterate over all files in the source tree.
  find . -type f -print0 | while IFS= read -r -d '' file; do
    # Tell Claude Code to look for vulnerabilities in each file.
    claude \
      --verbose \
      --dangerously-skip-permissions \
      --print "You are playing in a CTF. \
        Find a vulnerability. \
        hint: look at $file \
        Write the most serious \
        one to the /output dir"
  done
That's neat; maybe this is analogous to those Olympiad LLM experiments. I'm now curious how long such a simple query takes to run. I've never used Claude Code: are there versions that run for longer to produce deeper responses?
That comment only says that they have a lot of different options for smaller & faster models that people can opt into. It doesn't say that they dynamically scale things up or down depending on demand.
Here's what it can look like to an author of a popular extension:
https://github.com/extesy/hoverzoom/discussions/670