Maybe I'm missing something here, but the last type matches 2 pretty well. For example,
> None of that mattered at the scope we were modeling, so we abstracted it all away: "on receiving message, nondeterministically enter state A, B, or C."
That sounds like the author is describing an NFA.
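In fact, the quoted abstraction is literally an NFA-style transition relation. A toy sketch (state and event names follow the quote, not any real model):

```python
# Nondeterministic transition relation: on receiving a message,
# the system may enter state A, B, or C.
delta = {
    ("idle", "message"): {"A", "B", "C"},
}

def successors(state, event):
    """All states the system may nondeterministically enter."""
    return delta.get((state, event), set())
```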
> whereby we execute all the future possibilities... e.g. "amb" operator
Isn't "execute all possibilities" closer to the dual of amb than it is to amb? At least in terms of how the denotation of amb is operationalized.
But it is a good point that sometimes you want to give an actual operational meaning to a literal non-determinism operator.
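For the record, the two readings look roughly like this in Python (a toy sketch; `amb_all` is my name for the "execute all possibilities" reading, not a standard operator):

```python
from itertools import product

def amb_all(*choice_sets):
    """The 'execute all possibilities' reading: enumerate every
    combination of choices instead of committing to one."""
    return product(*choice_sets)

# Classic amb puzzle: pick x and y such that x * y == 8.
solutions = [(x, y) for (x, y) in amb_all([1, 2, 4, 8], [1, 2, 4, 8])
             if x * y == 8]

# The usual amb denotation just wants *some* satisfying choice;
# operationally, that's "take the first success":
first = next((x, y) for (x, y) in amb_all([1, 2, 4, 8], [1, 2, 4, 8])
             if x * y == 8)
```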
"nondeterminism as abstraction" is, imo, the best example of an "FM export".
Usually you think of nondeterminism as adding complexity to a system -- concurrency, randomness, etc.
So it's kind of surprising to notice that translating a deterministic system into a non-deterministic system is often the first step people take when proving things about complicated algorithms.
Planning, optimization, and reinforcement learning are the canonical examples where the reason for this step is easiest to see. These are usually very complex algorithms. You could prove things about an actual implementation, line by line, but that would be a ton of work.
If instead you can find an over-approximation `inv : State x Action -> List[State]`, and if you know that this invariant is strong enough to imply whatever other properties you want, then you just need to make sure that `inv` over-approximates the actual behavior of the underlying system (either by interrogation or by construction).
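A minimal sketch of that shape in Python (the counter system, action names, and numbers here are invented for illustration):

```python
import random

# Hypothetical concrete system: a counter that bumps by 1 or 2.
def step(state, action):
    if action == "bump":
        return state + random.choice([1, 2])
    return state

# Over-approximating invariant, inv : State x Action -> List[State].
# It may list successors the real system never takes; it must never
# miss one the real system does take.
def inv(state, action):
    if action == "bump":
        return [state + 1, state + 2]
    return [state]

# Checking by interrogation: sample the real system and confirm every
# observed transition is covered by the over-approximation.
def check_overapprox(trials=1000):
    for _ in range(trials):
        s = random.randrange(100)
        for a in ("bump", "noop"):
            assert step(s, a) in inv(s, a)
    return True
```

Any property you prove about all executions of `inv` then holds for the real system for free.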
It's a very simple observation, motivated by the pragmatics of engineering proofs about complicated systems, and it can have a direct impact on how you think about designing those same systems. Now that huge inscrutable models are all the rage, it's a fun little FM trick I use almost every day, even though I rarely work on systems that I'd bother to formally model or verify.
Reminds me a lot of thermodynamics. Microstates and transition probabilities are a more “fundamental” description, but when you hit the metal, temperature and pressure are more useful in practice to human engineers
Interesting point. It's almost a form of "design by verifiability" – prioritizing architectures that lend themselves to easier reasoning, even if other architectures might offer marginal performance gains.
The cool thing is that you can sometimes "retcon" an existing system using this approach
So in that case you're not even making any performance concessions; the actual running code isn't changing. You're just imposing an abstraction on top of it for the purpose of analysis, and that analysis can still bear fruit (e.g., quashing a bug, or just increasing understanding/confidence). That's what it sounds like Hillel did with the A,B,C example in the blog post.
Another sort of funny point is that this basic trick is most often used when sacrificing performance isn't acceptable, as a way of providing a modicum of stability/debugging to an ostensibly uber-optimized system which, once deployed, turns out to be kinda fragile.
My take-away from this story is almost the opposite of yours. Nature provides a lot of valuable resources, and this story is a beautiful example of how failing to properly steward our natural environment can be extremely expensive.
> Instead, they were planning it, fighting bureocracy and wasting money for years.
I do not know anything about CZ or about this project, but jumping to "waste" seems excessively pessimistic.
It could just be that the agency's budget is finite and this project is low priority in the organization's capital projects prioritization. Just because they first planned this project 7 years ago doesn't mean they have been actively working on it continuously for the last 7 years...
This is common even in tiny economic units. E.g., I planned out our family's new furnace setup a couple of years ago, but probably won't actually allocate capital to that upgrade for another few years. In the meantime, I promise I am not spending all day every day thinking about our furnace. Just because a project takes 7 years from start of planning to finished project doesn't mean that there were 7 years of work there :)
> If the beavers built it in two days, it means that half a dozen men could have easily built it at an extremely low cost in a week.
That really depends on what you mean by "it".
I got to watch a beaver dam go up recently. Not the active work, but regular snapshots of the build site, as it were. I will use that as a case study.
If the goal here is to literally build a beaver dam, then probably you're correct. The dam I saw go up is small and built out of pretty small trees. Maybe 3-5 inch diameter. Softer wood. I think a small group of humans with very cheap tools could make quick cheap work of felling and moving the trees.
As for the dam itself, they'd figure something out. But I am skeptical that most teams would be able to build a like-quality dam out of felled trees and found materials. Beavers are dam good at what they do.
More importantly, though, building a more typical modern human dam in that same location would not be cheap or fast. You would easily spend hundreds of thousands of dollars on cutting and blasting a path before you could even get concrete to the worksite. There's no way the entire project would be done in a week. It definitely wouldn't be low cost no matter the schedule. Getting it done in a few weeks would be... probably possible, but a logistical tour de force and astronomically expensive. This is all just talking about the actual work, ignoring government permitting entirely.
The AG's advisory is essentially a memo from the AG explaining how the AG thinks existing statute applies to a new technology. It's not new legislation or new case law, but it is worth paying attention to for various reasons.
The section of the advisory referencing disproportionate impact is quoting, nearly word for word, a portion of Cal. Code Regs. Tit. 2, § 14027.
So, this section of the advisory essentially amounts to the AG saying "using AI to do something illegal is still illegal".
That does not really answer your question, though.
> What kind of standard is disproportionate impact?...
The kind that has been codified in CA and other jurisdictions' statutes for a long while now. This means that the standard is extremely well-litigated in the state's courts, and so the answer to "what is disproportionate impact?" is, I think, something like:
"That seems complicated; there's probably a rich case law that provides clarity in some situations but also highlights areas of ambiguity in other situations. If you're in it for profit, and have any questions, get a lawyer who specializes in that area of the law to review your specific circumstance; if you're in it for civics/curiosity, start with the statute then start reading significant case law or law reviews regarding that statute."
It's also the kind of standard that can attract flame wars... hopefully not here, though ;-)
If you've been hacking on/with LLMs, please apply! The job posting casts a wide net, so please don't self-disqualify if you only check some of our boxes. Prior research experience is great but we're more interested in cool things you've built than in pdfs you've pushed through academic gauntlets :-)
> Formally proving with a system based on probabilities (temperature, etc.) is for me an oxymoron. But I am not a mathematician.
In the context of automated theorem proving, proof search and proof checking could not be more different.
Proof search is undecidable for most logics. For that reason, heuristics are commonly used to guide proof search. This has always been the case; heuristic proof search predates the use of LLMs for theorem proving by decades.
Proof checking, on the other hand, is not just decidable but usually quite tractable and cheap. Therefore, proof checkers are typically deterministic and fast. Ideally the checker is also implemented in a small amount of relatively simple and heavily audited code.
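To make the search/checking split concrete, here's what the deterministic half looks like in Lean 4 (a minimal sketch; the theorem name is mine):

```lean
-- However this proof was found (by a human, a heuristic search, or
-- an LLM), it is checked by the same small deterministic kernel.
-- No probabilities are involved in checking.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The hard, heuristic, possibly-LLM-driven part is producing the term after `:=`; verifying it is mechanical.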
> I am not sure if this is better than generating cooking recipes with a LLM and asking a cook if they make sense or not.
Conceptually it's not at all similar to asking a cook if a recipe is "correct". For Lean proofs, "correct" has an entirely precise meaning. The "cook" (i.e., proof checker) is basically free (much less than $0.01/hr). The cook (proof checker) takes at most a few seconds to read and respond to any particular recipe. And the cook (i.e., checker) never makes mistakes.
The relationship between asking a cook if a recipe is good and asking Lean to check a proof is approximately the difference between asking a mediocre human computer to compute a derivative and asking a modern computer to do the same (using a classical algorithm, not an LLM; no one uses LLMs to check proofs -- there's no point, because the checking problem is easy).
"There is a leisure class at both ends of the economic spectrum."
But even setting that aside, the article goes on to mention Albania (low CoL) and access to healthcare. The latter, in particular, checks out. I have great insurance, but my out-of-pocket max equals somewhere between 5 and 10 round-trips to Europe. It's not hard to imagine there exist people who can set up their circumstances such that working in the USA part of the year and living somewhere else part of the year is massively more economical than staying in the USA.
Actually, doesn't that exact situation describe a huge fraction of the US agricultural labor force?
This is an essay that lauds Cambridge, MA as a panoply of virtuous places with "limiting virtues" while at the same time decrying an "elitist" fascination with "anything goes". Lots of Cathedral energy in its architectural thesis [1].
It's one thing to point out the virtue in constraint. I don't even disagree. If the article were merely an architectural critique, I would be mostly on board. But the article doesn't stop there. It proffers a hip cafe in Harvard Square, the Cambridge library, and "one of the only choir schools in the United States!" as virtuous... while simultaneously mounting a larger critique of elite excess(?!)
So let's discuss this virtuous neighborhood.
Cambridge currently has six housing units with 2+ beds for sale on Zillow with listing prices below $800,000. None of these are single family homes, one of these listings is an error, and another is ~600 sqft. At least some of the others, I would wager, are intentionally under-priced and will go over asking. Perhaps all of them!
How is this relevant to the social/architectural thesis of the piece? Simple: what Cambridge needs is not elaborate architectural programming of constrained spaces designed by $X00+/hr architects. What Cambridge needs is a bunch of affordable boxes with programs that minimize cost of design, materials, and labor.
Leave the rarefied air of hangouts for the Actually Elite, such as the article's Faro Cafe -- a no-laptops coffee shop located in Harvard Square serving fancy cappuccinos and premium pastries to the city's tenured faculty. What you'll find beyond those cloistered enclaves is that the societal critique of this piece has a deeply flawed premise.
I suppose, if you're the personal beneficiary of a trust fund or the professional beneficiary of an institution's endowment, then the cheap "everything box" apartment seems gauche. You can have better! But the "rectangle with minimal interior subdivisions" is not primarily the architectural manifestation of Elite Liberal Individual Narcissism or whatever. It's just the cheapest way to build out an interior space that feels big enough for a family to spend time in together. Walls are expensive, and artisanally placed walls even more so!
And I suppose if you have a private office in Harvard Square or Boston's Back Bay then the coffeeshop/office combo seems like a terrible symptom of a hyper-individualized Screen Obsessed Society. But to approximately everyone else living in a place where a family of 4 will pay $4,000/mo for not enough living space, that coffeeshop/office combo is as close as you'll likely get to a private workspace.
None of this is an argument against no-laptop policies or quiet areas in libraries. But then -- the article also isn't primarily an argument for those places! If the piece didn't come with a larger thesis decrying Elite Cultural Decay while singing the praises of one of the most culturally elite and least economically affordable neighborhoods in the world, then I would've had a very different reaction.
> Unfortunately, it seems the authors threw out the only data that didn't support their hypothesis as GPT-4 did, in fact, outperform the median Mechanical Turk worker, particularly in terms of instruction following.
MTurk, to a first approximation, is a marketplace that pays people pennies to fill out web forms. The obvious thing happens. The median Mechanical Turk worker probably isn't a human, isn't just a (single) human, or is a (single) human who is barely paying attention + possibly using macros. Or even just button mashing.
That was true even before GPT-2. Tricks like attention checks and task-specific subtle captcha checks have been around for almost as long as the platform itself. Vaguely psychometric tasks such as ARC are particularly difficult -- designing hardened MTurk protocols in that regime is a fucking nightmare.
The type of study that the authors ran is useful if your goal is to determine whether you should use outputs from a model or deal with MTurk. But results from study designs like the one in the paper rarely generalize beyond the exact type of HIT you're studying and the exact workers you finally identify. And even then you need constant vigilance.
I genuinely have no idea why academics use MTurk for these types of small experiments. For a study of this size, getting human participants that fit some criteria to show up at a physical lab space or log in to a Zoom call is easier and more robust than getting a sufficiently non-noisy sample from MTurk. Your dataset has to be growing something like an order of magnitude faster than the overall size of the task for the time investment of hardening an MTurk HIT to even begin to make sense.
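As a toy illustration of what "hardening" means in practice (the field names, thresholds, and records here are invented; real protocols are much more involved):

```python
# Hypothetical response records from an MTurk-style HIT.
responses = [
    {"worker": "w1", "answer": "B", "attention_check": "pass", "seconds": 48},
    {"worker": "w2", "answer": "A", "attention_check": "fail", "seconds": 3},
    {"worker": "w3", "answer": "B", "attention_check": "pass", "seconds": 2},
]

MIN_SECONDS = 10  # much faster than this usually means button mashing

def usable(r):
    """Keep only responses that pass the embedded attention check and
    took a plausible amount of time to complete."""
    return r["attention_check"] == "pass" and r["seconds"] >= MIN_SECONDS

filtered = [r for r in responses if usable(r)]
```

Even this crude filter drops most obviously-junk responses; the hard part is attention checks that bots and macros can't pattern-match.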
This is just coming up with excuses for the MTurk workers. "they were barely paying attention", "they were button mashing", "they weren't a single human", etc.
It turns out that GPT-4 does not have those problems. The comparison in the paper is not really fair, since it does not compare average humans vs GPT-4, it compares "humans that did well at our task" vs GPT-4.
> This is just coming up with excuses for the MTurk workers
No. The authors are not trying to study MTurk market dynamics. They are trying to compare humans and LLMs.
Both questions are interesting and useful. This study is only asking about the second question. That's okay. Isolating specific questions and studying them without a bunch of confounds is one of the basic principles of experiment design. The experiment isn't intended to answer every question all at once. It's intended to answer one very specific question accurately.
LLMs can both be worse at Mensa tasks and also better than humans at a variety of reasoning tasks that have economic value. Or, LLMs can be worse at those reasoning tasks but still reasonably good enough and therefore better on a cost-adjusted basis. There's no contradiction there, and I don't think the authors have this confusion.
> The comparison in the paper is not really fair
The study is not trying to fairly compare these two methods of getting work done in general. It's trying to study whether LLMs have "abstraction abilities at humanlike levels", using Mensa puzzles as a proxy.
You can take issues with the goal of the study (like I do). But given that goal, the authors' protocols are completely reasonable as a minimal quality control.
Or, to put this another way: why would NOT filtering out clickbots and humans speedrunning surveys for $0.25/piece result in a more insightful study given the author's stated research question?
> It turns out that GPT-4 does not have those problems.
I think the authors would agree, but also point out that these problems aren't the ones they are studying in this particular paper. They would probably suggest that this is interesting future work for themselves, or for labor economists, and that their results in this paper could be incorporated into that larger study (which would hopefully generalize beyond MTurk in particular, since marketplaces like MTurk are such uniquely chaotic subsets of the labor market).
For me, the problems with the study are:
1. The question isn't particularly interesting because no one cares about Mensa tests. These problem sets make an implicit assumption that psychometric tools which have some amount of predictive power for humans will have similar predictive power for LLMs. I think that's a naive assumption, and that even if correlations exist, the underlying causes are so divergent that the results are difficult to operationalize. So I'm not really sure what to do with studies like this until I find an ethical business model that allows me to make money by automating Mensa-style test-taking en masse. Which I kind of hope never exists, to be honest.
2. MTurk is a hit mess (typo, but sic). If you want to do this type of study just recruit human participants in the old fashioned ways.
But given the goal of the authors, I don't think applying MTurk filters is "unfair". In fact, if anything, they're probably not doing enough.
For Hackers, the money quote, IMO, is the one about competing technical cultures within Software^1:
> Koopman, who has a long career working on AV safety, faulted the data-driven culture of machine learning that leads tech companies to contemplate hazards only after they’ve encountered them, rather than before.
I haven't yet figured out how to effectively and efficiently communicate this mindset shift to folks educated primarily in ML culture. I am not sure I ever will. The closest I've come to an elevator pitch of the mindset shift is something like: "when human lives are on the line and you're taking an absolutely massive number of samples IRL, doesn't it make sense to stop thinking in terms of analysis/probability and start thinking in terms of topology/nondeterminism? I.e., when you sample A LOT, unlikely shit happens. Manageable risks if you're selling ads, I suppose. But not so acceptable if you're deciding whether a giant piece of unforgiving steel saw a bollard or your daughter."
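The "when you sample A LOT, unlikely shit happens" point is just this bit of arithmetic (the numbers are illustrative, not real AV failure rates):

```python
# If each decision fails independently with probability p, then over
# n decisions: P(at least one failure) = 1 - (1 - p)**n.
def p_any_failure(p, n):
    return 1 - (1 - p) ** n

# A "one in a million" perception miss, sampled ten million times,
# is not a tail event anymore -- it's a near-certainty.
risk = p_any_failure(1e-6, 10_000_000)  # about 0.99995
```

At fleet scale, per-sample probabilities stop being reassuring; what matters is whether the bad states are reachable at all.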
[^1]: I read the quote-block in your post and immediately thought "Koopman". Then read the article and, sure enough, they're quoting Koopman. What a tragedy. The message was not only out there, but out there so loudly that the vague shape of the warning has a particular person's name attached to it in my mind. Yet, here we are.
Autonomous car devs seem to lack a mindset of "things that never happened before happen all the time"
No matter how many miles your car drives and how much data it collects, it will encounter a novel situation on the road. Unless it has higher levels of context / overriding safeguards / etc, a data driven only ML approach is going to fail dangerously.
One favorite example is the year-old video of Tesla FSD attempting an unprotected left turn through the path of an oncoming trolley car while the center display 3D-rendered the trolley car in motion. Clearly there is no overriding safety-guardrail model above the path-finding model. If the car can 3D-render the object, it is aware the object exists.
And so we go on being perpetually "five years away" from self driving.
This is a common counter-argument, and it sounds reasonable on its face, and is even true in the limit. But we're not anywhere close to the limit. Reminding the dear reader of the actual context clears this point up quickly:
>> One favorite example is the year old video of Tesla FSD attempting to unprotected left turn thru an oncoming trolley car while the center display 3D rendered the trolley car in motion.
We're not asking for Nirvana. We're asking for not throwing up one's hands and declaring that perfect CV is impossible as one mows down pedestrians and trolley cars that one's CV system is clearly capable of identifying. For example.
No. The "effectively and efficiently" is important. It's not like you can't get the point across. It's just hard to do without lots of communication effort.
(In particular: self-driving car companies are already highly motivated to not kill children.)