Hacker News | IanCal's comments

Saying GPT-5.4 is like GPT-2 is wild.

Lol, audibly.

I'm glad AI curmudgeonry on HN has shifted from "it doesn't work, scam, they made the deployed model worse with 0 communication" to something more akin to "why does anyone use mac or windows, nix is peak personal computing"


We’ve been calling neural nets AI for decades.

> 5 years before that, a Big Data algorithm.

The DNN part? Absolutely not.

I don’t know why people feel the need for such revisionism but AI has been a field encompassing things far more basic than this for longer than most commenters have been alive.


> AI has been a field encompassing things far more basic than this for longer than most commenters have been alive.

When I was 13, having just started programming, I picked up a book from a "junk bin" at a book store on Artificial Intelligence. It must have been from the mid-80s if not older.

It had an entire chapter on syllogisms[1] and how to implement a program to spit them out based on user input. As I recall it basically amounted to some string extraction, assuming the user followed a template, and string concatenation to generate the result. I distinctly recall not being impressed that such a trivial thing was part of a book on AI.

[1]: https://en.wikipedia.org/wiki/Syllogism
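From memory, the book's program amounted to something like this (a guess at the general idea in modern Python, not the book's actual code):

```python
# A toy "AI" in the mid-80s book style: fill a syllogism template from
# user-supplied noun phrases, e.g. "men", "Socrates", "mortal".

def syllogism(category: str, member: str, property_: str) -> str:
    major = f"All {category} are {property_}."
    minor = f"{member} is one of the {category}."
    conclusion = f"Therefore, {member} is {property_}."
    return "\n".join([major, minor, conclusion])

print(syllogism("men", "Socrates", "mortal"))
# All men are mortal.
# Socrates is one of the men.
# Therefore, Socrates is mortal.
```

String templates in, string templates out - and that genuinely was a chapter in an AI book.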


Eliza was 1960s.

In the 1990s I remember taking my friend's IRC chat history and running it through a Markov model to generate drivel, which was really entertaining.
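That 90s trick is only a few lines today - an order-1 word chain (a minimal sketch; the corpus string here is made up):

```python
import random
from collections import defaultdict

def build_chain(text: str) -> dict:
    """Map each word to the list of words that followed it in the corpus."""
    chain = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def babble(chain: dict, start: str, length: int = 10, seed: int = 0) -> str:
    """Walk the chain from a start word, picking a random follower each step."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "i think the bot is down i think the server is down again"
chain = build_chain(corpus)
print(babble(chain, "i"))
```

Feed it a few megabytes of chat logs instead of one sentence and the drivel gets eerily on-brand.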


> I don't write code the same as other devs

Most people do; most people don't have wildly different setups, do they? I'd bet there's a lot in common between how you write code and how your coworkers do.


I bet there's a lot more consistency now that AI can factor in how things are being done and be guided on top of that too.

The benefit of digital things is that they can be copied much more cheaply than physical things. There are perhaps migrations and upkeep to consider, though.

On the technical side perhaps the shared nature of this helps - if you can have something replicated so that you and several other members are all running replicas, there's a much better chance at least one copy survives.

On the non technical side, take some photos and print them on good paper. Print out stories on paper.

That doesn’t cover video and perhaps other things but it’s simple and does actually work for lots and lots of stories and pictures. It’s also immediately doable right now without anything new.


You could write an API, then document it, then add some potentially useful prompts.

Then you’d need a way of passing all that info on to a model, so something top level.

It'd be useful to do things in the same way as others (so if everyone is adding OpenAPI/Swagger you'd do the same if you didn't have a reason not to).

And then you’ve just reinvented something like MCP.

It’s just a standardised format.
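The "standardised format" boils down to machine-readable tool descriptions. Roughly (a hypothetical, pared-down shape for illustration - field names here are not the actual MCP schema):

```python
# A hypothetical, simplified tool description of the kind such a
# protocol standardises. Not the real MCP schema.
weather_tool = {
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# A client serialises descriptions like this and hands them to the model;
# the model replies with the tool name and arguments it wants invoked.
print(weather_tool["name"])
```

Everyone was already inventing some variant of this ad hoc; the standard just picks one shape.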


It doesn’t need to be human level, and if I walk into a room and forget why I went in am I no longer a general intelligence?

If it doesn't need to be human level then what are we even talking about? AGI means human level. Everything else is AI.

No, the big thing with AGI was that it was general. The AI things we made were extremely narrow: identifying things out of a set of classes, route planning, or something similarly specific. We couldn't just hand the systems a new kind of task, often not even extremely similar ones. We've been making superhuman-level narrow AI for many years, but for a long time even extremely basic and restricted worlds were still beyond what more general systems could do.

If LLMs are your first foray into what AI means and you were used to the term ML for everything else I could see how you'd think that, but AI for decades has referred to even very simple systems.


If AGI doesn't mean human level then what does? As you say, every application of A* is in some way "AI", so we had this idea of "AGI" for something "actually intelligent" - but maybe I'm wrong and AGI never meant that. What term does mean that?

I did AI back before it was cool and I think we have AGI. Imo the whole distinction was between extremely narrow AI and general intelligence. A classifier for engine failure can only do that - a route planner can only do that…

Now we have things I can ask a pretty arbitrary question and they can answer it. Translate, understand nuance (the multitude of ways of parsing sentences - getting sarcasm was an unsolved problem), write code, go and read and find answers elsewhere, use tools… these aren't one-trick ponies.

There are finer points to this where the level of autonomy or learning over time may be important parts to you but to me it was the generality that was the important part. And I think we’re clearly there.

AGI doesn't have to be human level, and it doesn't have to be equal to experts in every field all at once.


An interesting perspective: general, absolutely, just nowhere near superhuman at all kinds of tasks. Not even close to human at many. But intelligent? No doubt - far beyond any expectation that wasn't entirely unrealistic.

But that seems almost like an unavoidable trade-off. Fiction about the old "AI means logic!" type of AI is full of thought experiments where the logic imposes a limitation and those fictional challenges appear to be just what the AI we have excels at.


> The recent Esolang benchmarks indicate that these LLMs are actually pretty bad at that.

I'm really not sure how well a typical human would do writing brainfuck. It'd take me a long time to write some pretty basic things in a bunch of those languages, and I'm an SE.


Yes, but you also wouldn't need a corpus of hundreds of thousands of projects to crib from. If it were truly able to "reason" then conceivably it could look at a language spec and learn how to express things in terms of Brainfuck.

They did for some problems. If you gave me five iterations at a problem like this in brainfuck:

> "Read a string S and produce its run-length encoding: for each maximal block of identical characters, output the character followed immediately by the length of the block as a decimal integer. Concatenate all blocks and output the resulting string.

I'd do absolutely awfully at it.

And to be clear that's not "five runs from scratch repeatedly trying it" it's five iterations so at most five attempts at writing the solution and seeing the results.
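For contrast, the task itself is a few lines in a high-level language - the difficulty is entirely in expressing it in brainfuck. A sketch:

```python
from itertools import groupby

def rle(s: str) -> str:
    """Run-length encode: each maximal block becomes char + decimal count."""
    return "".join(f"{ch}{len(list(group))}" for ch, group in groupby(s))

print(rle("aaabccd"))  # a3b1c2d1
```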

I'd also note that when they can iterate they get it right much more than "n zero shot attempts" when they have feedback from the output. That doesn't seem to correlate well with a lack of reasoning to me.

Give them new frameworks or libraries and they can absolutely build things in them with some instructions or docs. So they're not just outputting previously seen things; it's at least much more pattern-based than that.

edit -

I play Clues by Sam, a logical-reasoning puzzle. The solutions are unlikely to be available online, and in this benchmark the cutoff date for training seems to be before the puzzle launched at all:

https://www.nicksypteras.com/blog/cbs-benchmark.html

Frankly just watching them debug something makes it hard for me to say there's no reasoning happening at all.


I'm a huge fan of property based testing, I've built some runners before, and I think it can be great for UI things too so very happy to see this coming around more.

Something I couldn't see was how those examples actually work - there are no actions specified. Do they watch a user, default to randomly hitting the keyboard, or neither, meaning you need to specify some actions to take?

What about rerunning things?

Is there shrinking?

edit - a suggestion for the examples: have a basic UI hosted on a static page which is broken in a way the test can find, like a button that triggers notifications but doesn't actually enforce a limit of 5 notifications.


Hey, yeah the default specification includes a set of action generators that are picked from randomly. If you write a custom spec you can define your own action generators and their weights.

Rerunning things: nothing built for that yet, but I do have some design ideas. Repros are notoriously shaky in testing like this (unless run against a deterministic app, or inside Antithesis), but I think Bombadil should offer best-effort repros if it can at least detect and warn when things diverge.

Shrinking: also nothing there yet. I'm experimenting with a state machine inference model as an aid to shrinking. It connects to the prior point about shaky repros, but I'm cautiously optimistic. Because the speed of browser testing isn't great, shrinking is also hard to do within reasonable time bounds.

Thanks for the questions and feedback!


For re-running, I assume you want to do this all on a review app with a snapshot of the DB, so you start with a clean app state.

Should be pretty easy to make it deterministic if you follow that precondition.

(How I had my review apps wired up was I dumped the staging DB nightly and containerized it, I believe Neon etc make it easy to do this kind of thing.)

Ages ago I wired up something much more basic than this for a Python API using hypothesis, and made the state machine explicit as part of the action generator (with the transitions library), what do you think about modeling state machines in your tests? (I suppose one risk is you don’t want to copy the state machine implementation from inside the app, but a nice fluent builder for simple state machines in tests could be a win.)


That's true, clean app state gets you far. And that's something I'm going to add to Bombadil once it gets an ability to run many tests (broad exploration, reruns, shrinking), i.e. something in the spec where you can supply reset hooks, maybe just bash commands.

Regarding state machines: yeah, the spec can often become an as-complex mirror of the system you're testing, if the system has a large complicated surface. If on the other hand the API is simple and encapsulates a lot of complexity (like Ousterhout's "Deep Modules"), state machine specs and model-based testing make more sense. Testing a key-value store is a great example of this.
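A minimal model-based sketch of the key-value-store case (stdlib only; the dict-backed `BuggyStore` is a stand-in for a real system under test, with a bug planted for demonstration):

```python
import random

class BuggyStore:
    """Stand-in system under test: a KV store with a planted delete bug."""
    def __init__(self):
        self._data = {}
    def put(self, k, v):
        self._data[k] = v
    def get(self, k):
        return self._data.get(k)
    def delete(self, k):
        # Planted bug: deleting key "x" is silently ignored.
        if k != "x":
            self._data.pop(k, None)

def run_model_based_test(seed: int) -> bool:
    """Drive random ops against the store and a plain-dict model in lockstep."""
    rng = random.Random(seed)
    model, store = {}, BuggyStore()
    keys = ["a", "b", "x"]
    for _ in range(200):
        op = rng.choice(["put", "delete", "get"])
        k = rng.choice(keys)
        if op == "put":
            v = rng.randint(0, 9)
            model[k] = v
            store.put(k, v)
        elif op == "delete":
            model.pop(k, None)
            store.delete(k)
        # Invariant: the system agrees with the model after every step.
        if store.get(k) != model.get(k):
            return False  # divergence found
    return True

print(run_model_based_test(0))
```

The model stays trivial because the API is deep and narrow - which is exactly when this style pays off.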

If you're curious about it, here's a very detailed spec for TodoMVC in Bombadil: https://github.com/owickstrom/bombadil-playground/blob/maste... It's still work-in-progress but pretty close to the original Quickstrom-flavored spec.


How effective is property based testing in practice? I would assume it has no trouble uncovering things like missing null checks or an inverted condition because you can cover edge cases like null, -1, 0, 1, 2^n - 1 with relatively few test cases and exhaustively test booleans. But beyond that, if I have a handful of integers, dates, or strings, then the state space is just enormous and it seems all but impossible to me that blindly trying random inputs will ever find any interesting input. If I have a condition like (state == "disallowed") or (limit == 4096) when it should have been 4095, what are the odds that a random input will ever pass this condition and test the code behind it?

Microsoft had a remotely similar tool named Pex [1], but instead of randomly generating inputs, it instrumented the code so it could also be executed symbolically, and then used their Z3 theorem prover to systematically find inputs that make each encountered condition either true or false, incrementally exploring all possible execution paths. If I remember correctly, it then generated a unit test for each discovered input with the corresponding output, and you could judge whether the output was what you expected.

[1] https://www.microsoft.com/en-us/research/publication/pex-whi...


In practice I've found that property-based testing has a very high ratio of value to effort for the tests written.

UI tests like:

* if there is one or more items on the page one has focus

* if there is more than one then hitting tab changes focus

* if there is at least one, focusing on element x, hitting tab n times and then shift tab n times puts me back on the original element

* if there are n elements, n>0, hitting tab n times visits n unique elements

Are pretty clear and yet cover a remarkable range of issues. I had these for a UI library, which came with the start of "given a UI built with arbitrary calls to the API, those things remain true".

Now it’s rare it’d catch very specific edge cases, but it was hard to write something wrong accidentally and still pass the tests. They actually found a bug in the specification which was inconsistent.

I think they can often be easier to write than specific tests, and clearer to read, because they say what you're actually testing (a generic property, rather than a few explicit examples).

What you could add though is code coverage. If you don’t go through your extremely specific branch that’s a sign there may be a bug hiding there.
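A sketch of the tab/shift-tab property above using nothing but the stdlib (a real setup would use a PBT library; the modular focus ring here is a stand-in for the actual UI):

```python
import random

def tab(index: int, n: int) -> int:
    """Move focus forward one element in a ring of n focusable elements."""
    return (index + 1) % n

def shift_tab(index: int, n: int) -> int:
    """Move focus back one element."""
    return (index - 1) % n

# Property: tab k times then shift-tab k times returns to the original
# element, for arbitrary ring sizes, start positions, and press counts.
rng = random.Random(0)
for _ in range(1000):
    n = rng.randint(1, 50)         # number of focusable elements
    start = rng.randrange(n)       # arbitrary starting focus
    presses = rng.randint(0, 100)  # arbitrary number of key presses
    i = start
    for _ in range(presses):
        i = tab(i, n)
    for _ in range(presses):
        i = shift_tab(i, n)
    assert i == start, (n, start, presses)
print("ok")
```

Nothing about a specific screen is hard-coded, yet an off-by-one in either direction fails immediately.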


An important step with property based testing and similar techniques is writing your own generators for your domain objects. I have used to it to incredible effect for many years in projects.

I work at Antithesis now so you can take this with a grain of salt, but everything changed for me over a decade ago when I started applying PBT techniques broadly and widely. I have found so many bugs that I wouldn't otherwise have found until production.
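A sketch of such a custom generator (stdlib only; a real setup would plug this into a PBT library's strategy API - the category weights and character sets here are illustrative):

```python
import random
import string

def gen_tricky_string(rng: random.Random) -> str:
    """Generator deliberately biased toward degenerate cases a human
    tends not to write by hand: empty, whitespace-only, very long,
    and awkward-Unicode strings, alongside plain ones."""
    kind = rng.choice(["empty", "spaces", "long", "unicode", "plain"])
    if kind == "empty":
        return ""
    if kind == "spaces":
        return " " * rng.randint(1, 20)  # whitespace-only
    if kind == "long":
        return "a" * rng.randint(1000, 5000)
    if kind == "unicode":
        return "".join(rng.choice("éñ中\u0000") for _ in range(rng.randint(1, 8)))
    return "".join(
        rng.choice(string.ascii_letters + " ") for _ in range(rng.randint(1, 30))
    )

rng = random.Random(1)
samples = [gen_tricky_string(rng) for _ in range(100)]
# A naive "non-empty after strip" assumption is exercised within a few draws.
assert any(s and not s.strip() for s in samples)
print("generated", len(samples), "samples")
```

The point isn't cleverness in any one case - it's that the awkward cases show up on every run, for free, in every property that takes a string.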


"Exhaustively covering the search space" or "hitting specific edge cases" is the wrong way to think about property tests, in my experience. I find them most valuable as insanity checks, i.e. they can verify that basic invariants hold under conditions even I wouldn't think of testing manually. I'd check for empty strings, short strings, long strings, strings without spaces, strings with spaces, strings with weird characters, etc. But I might not think of testing with a string that's only spaces. The generator will.

One of the founders of Antithesis gave a talk about this problem last week; diversity in test cases is definitely an issue they're trying to tackle. The example he gave was Spanner tests not filling the cache because random inputs jittered near zero. Avoiding that kind of blind spot appears to be a company goal.

https://github.com/papers-we-love/san-francisco/blob/master/...


Glad you enjoyed the talk! Making Bombadil able to take advantage of the intelligence in the Antithesis platform is definitely a goal, but we wanted to get a great open source tool into peoples’ hands ASAP first.

One thing you can find pretty quickly with just basic fuzzing on strings is Unicode-related bugs.

Well what would each billionaire do? Give out money so that the poor can give some of it back?

You cannot just point at a system, say it’d be unsustainable and then assume nobody will let that happen.

Monarchies, lords, etc. have had much more reason to support their own countryfolk, yet many throughout history have not - has society changed enough that the billionaires have changed on this?

