Time doesn't mean much; what matters is what they did in those 24 hours. If all they did was talk about it, then it could be 1000 years and it wouldn't matter. What safety checks are in place?
Do they have honeypot infrastructure to launch the model in first, then wait to see whether the model destroys it? What they did in the 24 hours matters.
Agreed. I've been running autonomous LLM agents on daily schedules for weeks. The failure modes you worry about on day one are completely different from what actually shows up after the agents have history and context. 24 hours captures the obvious stuff.