I was thinking the same thing, but then I decided to embrace the frustration of the image. It's reminding us that the pictures we have in our heads are kind of fragile. They don't prepare us for a live encounter with Earth from some random angle in space.
Hmm. Folks trying to discover the elegant core of data frame manipulation by studying... pandas usage patterns. When R's dplyr solved this over a decade ago, mostly by respecting SQL and following its lead.
The pandas API feels like someone desperately needed a wheel and had never heard of a wheel, so they made a heptagon, and now millions of people are riding on heptagon wheels. Because it's locked in now, everyone uses heptagon wheels, what can you do? And now a category theorist comes along, studies the heptagon, and says hey look, you could get by on a hexagon. Maybe even a square or a triangle. That would be simpler!
No. Stop. Data frames are not fundamentally different from database tables [1]. There's no reason to invent a completely new API for them. You'll get within 10% of optimal just by porting SQL to your language. Which dplyr does, and then closes most of the remaining optimality gap by going beyond SQL's limitations.
You found a small core of operations that generates everything? Great. Also, did you know Brainfuck is Turing-complete? Nobody cares. Not all "complete" systems are created equal. A great DSL is not just about getting down to a small number of operations. It's about getting down to meaningful operations that are grammatically composable. The relational algebra that inspired SQL already nailed this. Build on SQL. Don't make up your own thing.
Like, what is "drop duplicates"? What are duplicates? Why would anyone need to drop them? That's a pandas-brained operation. You want the distinct keys defined by a select set of key columns, like SQL and dplyr provide.
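To make the contrast concrete, here is a minimal sketch (hypothetical toy data) of the two framings side by side in pandas terms, "drop duplicates" versus "the distinct keys defined by a set of key columns":

```python
import pandas as pd

# Hypothetical toy data: order events with repeated (customer, sku) pairs.
df = pd.DataFrame({
    "customer": ["a", "a", "b", "b"],
    "sku": [1, 1, 2, 3],
    "note": ["x", "y", "z", "w"],
})

# pandas framing: "drop duplicates", phrased as removing offending rows.
deduped = df.drop_duplicates(subset=["customer", "sku"])

# SQL/dplyr framing: ask for the distinct keys over chosen key columns,
# i.e. SELECT DISTINCT customer, sku FROM orders / distinct(df, customer, sku)
keys = df[["customer", "sku"]].drop_duplicates()

print(deduped)
print(keys)
```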
Who needs a separate select and rename? Select is already using names, so why not do your name management there? One flexible select function can do it all. Again, like both SQL and dplyr.
Who needs a separate difference operation? There's already a type of join, the anti-join, that gets that done more concisely and flexibly, and without adding a new primitive, just a variation on the concept of a join. Again, like both SQL and dplyr.
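As a sketch of that point, an anti-join ("rows of the left table with no match in the right") can be expressed even in pandas with its own `merge`, using hypothetical toy frames, rather than a separate difference primitive:

```python
import pandas as pd

# Hypothetical toy frames.
left = pd.DataFrame({"id": [1, 2, 3], "v": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3]})

# Anti-join: keep left rows whose key never appears in right.
merged = left.merge(right, on="id", how="left", indicator=True)
anti = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(anti)  # only the row with id == 1
```

In SQL this is a `LEFT JOIN ... WHERE right.id IS NULL` (or `NOT EXISTS`), and in dplyr it is `anti_join(left, right)`: a variation on the join concept, not a new primitive.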
Props to pandas for helping so many people who have no choice but to do tabular data analysis in Python, but the pandas API is not the right foundation for anything, not even a better version of pandas.
[1] No, row labels and transposition are not a good enough reason to regard them as different. They are both just structures that support pivoting, which is vastly more useful, and again, implemented by both R and many popular dialects of SQL.
I guess I have pandas brain because I definitely want to drop duplicates. 100% of the time I'm worried about duplicates, and 99% of the time the only thing I want to do with duplicates is drop them. When you've got 19 columns it's _really fucking annoying_ if the tool you're using doesn't have an obvious way to say `select distinct on () from my_shit`. A close second, at say 98% of the time: I want to get a count of duplicates as a sanity check, because I know to expect a certain amount of them. Pandas makes that easy too, in a way SQL makes really fucking annoying. There are a lot of parts of pandas that made me stop using it long ago, but first-class duplicates handling is not among them.
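For concreteness, the kind of sanity check I mean, as a minimal pandas sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data with a known amount of duplication.
df = pd.DataFrame({"k": [1, 1, 2, 2, 2, 3], "v": list("aabbbc")})

# How many rows repeat an earlier row?
n_dupes = df.duplicated().sum()

# Multiplicity of each distinct row, as a quick eyeball check.
counts = df.value_counts(["k", "v"])

print(n_dupes)   # 3
print(counts)
```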
And the API is vastly superior to SQL in some respects from a user perspective, despite being all over the place in others. Dataframe select/filtering, e.g. `df = df[df.duplicated(keep='last')]`, is simple, expressive, obvious, and doesn't result in bleeding fingers. The main problem is that the rest of the language around it, with all the indentation, newlines, loops, functions and so on, can be too terse or too dense and much harder to read than SQL.
Duplicates in source data are almost always a sign of bad data modeling, or of analysts and engineers disregarding a good data model. But I agree that this ubiquitous antipattern that nobody should be doing can still be usefully made concise. There should be a select distinct * operation.
And FWIW I personally hate writing raw SQL. But the problem with the API is not the data operations available, it's the syntax and lack of composability. It's English rather than ALGOL/C-style. Variables and functions, to the extent they exist at all, are second-class, making abstraction high-friction.
But seriously, I'm not always in control of upstream data. I get stuff thrown over to my side of the fence by an organization that just needs data jiggled around for one-off ops purposes. They are communicating with me via CSV files scraped from Excel files in their Shared Drive, kind of thing.
Do what you gotta do, but most of my job for the past decade has been replacing data pipelines that randomly duplicate data with pipelines that solve duplication at the source, and my users strongly prefer it.
Of course, a lot of one-off data analysis has no rule but "get a quick answer that no one will complain about"!
I updated my OG comment for context. As an org we also help clients come up with pipelines but it's just unrealistic to do a top-down rebuild of their operations to make one-off data exports appeal to my sensibilities.
I agree, sometimes data comes to you in a state that is beyond the point where rigor is helpful. And for some people that kind of data is most of their job!
Duplicates are a sign of reality. Only where you have the resources to have dedicated people clean and organize data do you have well modeled data. Pandas is a power tool for making sense of real data.
> Duplicates in source data are almost always a sign of bad data modeling
Nope. Duplicates in source data (INPUT) are natural, correct, and MUST be supported, or almost all data becomes impossible to work with.
The actual problem is the OUTPUT. Duplicates in the OUTPUT need to be controlled and explicit. In general, we need each OUTPUT row to be unique by an N-column key, but probably don't need it to be unique across the rest, so, in the relational model, you need uniqueness for a combination of columns (rarely, for ALL of them).
I always warn people (particularly junior people) though that blindly dropping duplicates is a dangerous habit because it helps you and others in your organization ignore the causes of bad data quickly without getting them fixed at the source. Over time, that breeds a lot of complexity and inefficiency. And it can easily mask flaws in one's own logic or understanding of the data and its properties.
When I'm in pandas (or was, I don't use it anymore) I'm always downstream of some weird data process that ultimately exported to a CSV from a team that I know has very lax standards for data wrangling, or it is just not their core competency. I agree that duplicates are a smell but they happen often in the use-cases that I'm specifically reaching to pandas for.
On reflection I think it's possible I may have missed the potential positive value of the post a bit. Maybe analyzing pandas gets you down to a set of data frame primitives that is helpful to build any API. Maybe the API you start with doesn't matter. I don't know. When somebody works hard to make something original, you should try to see the value in it, even if the approach is not one you would expect to be helpful.
I stand by my warnings against using pandas as a foundation for thinking about tabular data manipulation APIs, but maybe the work has value regardless.
> There's no reason to invent a completely new API for them
Yes there is: SQL is one of many possible ways to interact with tabular data, why should it be the only one? R data frames literally pioneered an alternative API. Dplyr is fantastic for many reasons, one of those being that people like the verb-based approach
Furthermore I argue that dplyr is not particularly similar to SQL in the way you actually use it and how it's actually interpreted/executed.
As for the rest I feel like you're just stating your preferences as fact.
You make it sound like writing an SQL parser and query engine is a trivial task. Have you ever looked at the implementation of a query engine to see what’s actually involved? You can’t just ‘build on SQL’, you have to build a substantial library of functions to build SQL on top of.
Also it's not like dplyr is anything close to a "port" of SQL. You could in theory collect dplyr verbs and compile them to SQL, sure. That's what ORMs typically do, and what the Spark API does (and its descendants such as Polars).
"Porting" SQL to your language usually means inventing a new API for relational and/or tabular data access that feels ergonomic in the host language, and then either compiling it to SQL or executing it in some kind of array processing backend, or DataFusion if you're fancy like that.
dplyr straightforwardly transpiles to SQL through the dbplyr package, so it's semantically pretty close to a port, even though the syntax is a bit different (better).
The author takes the 4 operations below and discusses some 3-operation thing from category theory. Not worth it, and not as clear as dplyr.
> But I kept looking at the relational operators in that table (PROJECTION, RENAME, GROUPBY, JOIN) and thinking: these feel related. They all change the schema of the dataframe. Is there a deeper relationship?
I couldn’t agree more. But at the same time I try to stay quiet about it because SQL is the diamond in the rough that 95% of engineers toss into the trash. And I want minimal competition in a tight job market.
SQL only works on well defined data sets that obey relational calculus rules. Pandas is a power tool for dealing with data as you find it. Without Pandas you are stuck with tools like Excel.
> Transformers are the dominant architecture in AI, yet why they work remains poorly understood. This paper offers a precise answer: a transformer is a Bayesian network.
Why would being a Bayesian network explain why transformers work? Bayesian networks existed long before transformers and never achieved their performance.
A Bayesian network is a really general concept. It applies to any multidimensional probability distribution. It's a graph that encodes independence between variables. Ish.
I have not taken the time to review the paper, but if the claim stands, it means we might have another tool to our toolbox to better understand transformers.
The problem with these bromides is not that they're wrong, it's that they're not even wrong. They're predictive nulls.
What observable differences can we expect between an entity with True Understanding and an entity without True Understanding? It's a theological question, not a scientific one.
I'm not an AI booster by any means, but I do strongly prefer we address the question of AI agent intelligence scientifically rather than theologically.
We've tested this in the small with AI art. When people believe they're viewing human-made art which is later revealed to be AI art, they feel disappointed. The actual content is incidental, the story that supports it is more important than the thing itself.
It's the same mechanism behind artisanal food, artist struggles, and luxury goods. It is the metaphysical properties we attach to objects or the frames we use to interpret strips of events. We author all of these and then promptly forget we've done so, instead believing they are simply reality.
Well said. That's exactly what has been rubbing me the wrong way with all those "LLMs can never *really* think, ya know" people. Once we pass some level of AI capability (which we perhaps already did?), it essentially turns into an unfalsifiable statement of faith.
Agreed. We should be asking what the machines measurably can or can't do. If it can't be measured, then it doesn't matter from an engineering standpoint. Does it have a soul? Can't measure it, so it doesn't matter.
That's a bit too pessimistic. Often you can productively find some measurable proxy for the thing you care about but can't measure. Turing's test is a famous example of that.
Sometimes you only have a one-sided proxy. Eg I can't tell you whether Claude has a soul, but I'm fairly sure my dishwasher ain't.
When push came to shove, it turns out nobody really cared about the Turing test and immediately found excuses to discount it as soon as machines blew through that goalpost. It's fundamentally theological, but the thing is, it doesn't matter. It has no impact on what the machines can demonstrably do.
There are already people dealing with AI intelligence scientifically. That's what benchmarks do.
It's the "it's just a stochastic parrot!" camp that's doing the theological work. (and maybe also those in the Singularity camp...)
That said, I do think there's value in having people understand what "Understanding" means, which is kinda a theological (philosophical :D) question. IMHO, in every-day language there's a functional part (that can be tested with benchmarks), and there's a subjective part (i.e. what does it feel like to understand something?). Most people without the appropriate training simply mix up these two things, and together with whatever insecurities they have with AI taking over the world (which IMHO is inevitable to some extent), they just express their strong opinions about it online...
This AI-written post is part of an insight porn genre that attempts to draw a sharp distinction between two words that mean basically the same thing in real life. We read it, we politely agree that sure, you could use those two words to represent those two different concepts, then we go back to everyday life and continue to use them interchangeably.
If you read the post and actually believed what it said, you would tell people "your presentation convinced but did not persuade, that's why leadership isn't doing what you said." This doesn't make sense to a typical English speaker.
I looked up persuade and convince in the thesaurus and dictionary, based on the title, and then came to say the same thing. But then I got a little curious about the source of the title’s claim, and looked up Chaim Perelman. He really did try to make a distinction between convince and persuade in his influential book from sixty years ago, so the body of the blog post is accurate in a sense - this is a concept that came from an historically important philosopher. Perelman was dissecting argumentation and cataloguing the techniques for strong and persuasive arguments. The problem with this blog post is taking Perelman’s argument out of context and stating Perelman’s rhetorical distinction as though it’s a fact and then arguing logically for it. That leaves out all the ethos and pathos that Perelman was trying to convey is necessary for a good argument, and it also misses slightly on the logos as well.
Interesting. The post would have benefitted a lot from talking about this background instead of just name-dropping Perelman once and by last name only(!!).
That's the sort of sloppiness you get when you have a conversation with an AI, ask the AI to make a blog post based on the conversation, and then copy-paste that straight into your Substack without reading to see if a fresh reader would understand what you are talking about.
If the author insists on posting more unedited AI text, asking a fresh AI session to critique the post from scratch would probably catch this kind of mistake and lead to a much better result.
That's why LLMs exist: you can go there, paste the link, and specify the number of words you want for the synthesis. Or just take a few minutes and read it entirely. Nobody forces you.
> This doesn't make sense to a typical English speaker
You read the words like they were generated by an AI, i.e. empty words which did not make sense to you, i.e. you did not derive meaning from them.
But for me, I derived meaning from this essay. Regardless of how the text was generated, I was able to relate it to some of my own insights.
For instance..
> Pascal used to distinguish between geometric and subtle minds (sensibilities). The geometric mind works through axioms and deductions, step by step, like a compiler. The subtle mind grasps a situation holistically, reading context, feeling the weight of unspoken constraints, sensing what a room will and won't tolerate.
I read this as saying that as humans we have a "subtle" sense that can perceive the chaos out there and then derive patterns from that chaos and be able to symbolize them to derive theories of that chaos via the geometric mind. For example, the sense or feeling of space is symbolized into euclidean geometry. To me, this is a deep insight, and I did not know Pascal called it out. So, I learned something from this essay.
“Insight porn” is a new term for me but it seems to fit so well.
I think a key part of it is not just the simplification of complicated issues, but the willingness to oversimplify them in a way even if it perverts the message. It’s a cousin to “just blame immigrants” or “all cops are bastards” or “____ considered harmful.”
How fun to see that the most common insult in 2026 is that something is AI-generated. Is this comment too? Can you provide the prompt you use and the automation you have to increase your karma points?
And the people who repeat such statements uncritically to their reports will also get mildly annoyed when they have no Earthly clue what that actually means.
The evidence "actually supports the null" over what alternative?
In a Bayesian analysis, the result of an inference, e.g. about the fairness of a coin as in Lindley's paradox, depends completely on the distribution of the alternative specified in the analysis. The frequentist analysis, for better and worse, doesn't need to specify a distribution for the alternative.
The classic Lindley's paradox uses a uniform alternative, but there is no justification for this at all. It's not as though a coin is either perfectly fair or has a totally random heads probability. A realistic bias will be subtle and the prior should reflect that. Something like this is often true of real-world applications too.
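To make the prior sensitivity concrete, here is a minimal sketch of the Bayes factor for a fair coin under two different Beta alternatives, using hypothetical counts and the beta-binomial marginal likelihood computed directly:

```python
from math import lgamma, log, exp

def log_beta(a, b):
    # log of the Beta function via log-gamma, for numerical stability
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bf01(k, n, a, b):
    """Bayes factor for H0: p = 0.5 against a Beta(a, b) alternative."""
    log_m0 = n * log(0.5)                                  # marginal under H0
    log_m1 = log_beta(k + a, n - k + b) - log_beta(a, b)   # marginal under H1
    return exp(log_m0 - log_m1)

# Hypothetical data: 527 heads in 1000 tosses (a borderline-"significant" excess).
k, n = 527, 1000
uniform = bf01(k, n, 1, 1)     # flat Beta(1, 1) alternative, as in the paradox
subtle  = bf01(k, n, 50, 50)   # alternative concentrated near fairness

print(uniform, subtle)
```

With the flat alternative the Bayes factor favors the null, while a prior that encodes "any real bias will be subtle" gives a much less decisive answer; the inference turns on the choice of alternative, which is the point above.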
Thank you. The main problem with Bayesian statistics is that if the outcome depends on your priors, then your priors, not the data, determine the outcome.
Bayesian supporters often like to say they are just using more information by encoding it in priors, but if they had data to support their priors, they would be frequentists.
If they were doing frequentist inference they wouldn’t be using priors at all and there is nothing frequentist in using previous data to construct prior distributions.
Not true. In frequentist statistics, from the perspective of Bayesians, your prior is a point distribution derived empirically. It doesn't have the same confidence / uncertainty intervals but it does have an unnecessarily overconfident assumption of the nature of the data generating process.
Not true. In frequentist statistics, from the perspective of Bayesians and non-Bayesians alike, there are no priors.
---
Dear ChatGPT, are there priors in frequentist statistics? (Please answer with a single sentence.)
No — unlike Bayesian statistics, frequentist statistics do not use priors, as they treat parameters as fixed and rely solely on the likelihood derived from the observed data.
There are always priors; they're just "flat", uniform priors (for maximum-likelihood methods). But what "flat" means is determined by the parameterization you pick for your model, which is more or less arbitrary. Bayesians would call this an uninformative prior. And you can most likely account for stronger, more informative priors within frequentist statistics by resorting to so-called "robust" methods.
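A minimal numeric sketch of both claims, assuming a binomial model with hypothetical counts: under a flat prior on p the posterior mode is exactly the MLE, while a prior that is flat on the log-odds instead shifts the mode:

```python
import math

# Hypothetical data: 7 heads in 10 tosses.
k, n = 7, 10

def loglik(p):
    # binomial log-likelihood (dropping the constant binomial coefficient)
    return k * math.log(p) + (n - k) * math.log(1 - p)

grid = [i / 10000 for i in range(1, 10000)]

# "Flat" prior on p: the posterior mode coincides with the MLE, k/n.
map_p = max(grid, key=loglik)

# "Flat" prior on the log-odds: by change of variables this is the density
# 1/(p(1-p)) on p, and the posterior mode shifts to (k-1)/(n-2).
map_logit = max(grid, key=lambda p: loglik(p) - math.log(p * (1 - p)))

print(round(map_p, 3), round(map_logit, 3))  # 0.7 vs 0.75
```

The same "uninformative" prior gives different answers under different parameterizations, which is exactly the point about "flat" being parameterization-dependent.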
First, there is no such thing as an "uninformative" prior; it's a misnomer. They can change drastically based on your parameterization (cf. change of variables in integration).
Second, I think the nod to robust methods is what's often called regularization in frequentist statistics. There are cases where regularization and priors lead to the same methodology (cf. L1-regularized fits and exponential priors), but the interpretation of the results is different. Bayesians claim they get stronger results, but that's because they make what are ultimately unjustified assumptions. My point is that if those assumptions were fully justified, they would have to use frequentist methods.
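A minimal sketch of that correspondence for a one-parameter least-squares fit with hypothetical data: the lasso objective is, term by term, the negative log-posterior under Gaussian noise and a Laplace prior:

```python
# Hypothetical 1-D data near y = x.
xs = [1.0, 2.0, 3.0]
ys = [0.9, 2.1, 2.8]
lam = 2.0  # lasso penalty strength; corresponds to a Laplace(0, 1/lam) prior

def lasso_objective(w):
    # squared-error term == Gaussian negative log-likelihood (sigma = 1)
    # lam * |w| term   == Laplace negative log-prior, up to constants
    return 0.5 * sum((y - w * x) ** 2 for x, y in zip(xs, ys)) + lam * abs(w)

# Grid search for the minimizer (MAP estimate == lasso solution).
grid = [i / 10000 for i in range(-20000, 20001)]
w_map = min(grid, key=lasso_objective)

# Closed-form check for w > 0: 14*w - 13.5 + lam = 0  =>  w = 11.5 / 14
print(round(w_map, 4))  # 0.8214
```

The methodologies coincide; whether you read the penalty as a prior or as regularization is the interpretive difference being argued about.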
One standard way to get uninformative priors is to make them invariant under the transformation groups which are relevant given the symmetries in the problem.
It’s not true that “there are always priors”. There are no priors when you calculate the area of a triangle, because priors are not a thing in geometry. Priors are not a thing in frequentist inference either.
You may do a Bayesian calculation that looks similar to a frequentist calculation but it will be conceptually different. The result is not really comparable: a frequentist confidence interval and a Bayesian credible interval are completely different things even if the numerical values of the limits coincide.
Frequentist confidence intervals as generally interpreted are not even compatible with the likelihood principle. There's really not much of a proper foundation for that interpretation of the "numerical values".
What does “as generally interpreted” mean? There is one valid way to interpret confidence intervals. The point is that it’s not based on a posterior probability and there is no prior probability there either.
If you want to say that when you do a frequentist analysis which doesn’t include any concept of prior you get a result that has a similar form to the result of a completely different conceptually Bayesian analysis which uses a flat prior (definitely not “a point distribution derived empirically”) that may be correct. It remains true that there is no prior in the frequentist analysis because they are not part of frequentist inference at all.
Priors are not used in construction of frequentist approaches, but that does not mean that the analyses aren't isomorphic in theory.
Point distribution <=> point estimate as a sample from an initially flat distribution. A priori vs a posteriori perspectives, which are equivalent if we are to take your description of frequentist statistics into account ;)
It’s not my description of frequentist statistics. It’s the frequentist statisticians’ description. This is from Wasserman’s All of Statistics:
The statistical methods that we have discussed so far are known as frequentist (or classical) methods. The frequentist point of view is based on the following postulates:
F1 […]
F2 Parameters are fixed, unknown constants. Because they are not fluctuating, no useful probability statements can be made about parameters.
> something happened to the model in training (RLHF?) that forcefully degraded its reasoning performance
I've been seeing more people speculating like this and I don't understand why. What evidence do we have for RLHF degrading performance on a key metric like reasoning? Why would this be tolerated by model developers?
Can someone point to an example of an AI researcher saying "oops, RLHF forcefully degrades reasoning capabilities, oh well, nothing we can do"?
It strikes me as conspiracist reasoning, like "there's a car that runs on water but they won't sell it because it would destroy oil profits".
The most obvious way would simply be excessive agreeableness. Users rate responses more highly if they affirm the user's thinking, but a general tendency to affirm would presumably result in the model being more inclined to affirm its own mistakes in a reasoning chain.
There was some research about it early on that was shared widely and shaped the folklore perception around it, such as the graph in https://static.wixstatic.com/media/be436c_84a7dceb0d834a37b3... from the GPT-4 whitepaper which shows that RLHF destroyed its calibration (ability to accurately estimate the likelihood that its guesses are correct). Of course the field may have moved on in the 2+ years that have passed since then.