> If you're ok with the looming threat of total annihilation.
Don't you have that problem with any energy-dense fuel? It's just that it doesn't get more dense than that, so you can be very space- and weight-efficient.
It's like everybody saying that a hydrogen car is a rolling bomb because of the energy stored in the hydrogen. Well, sure, but gasoline has just as much energy stored. Which is the whole point of fuel: to store energy. It's not like you're bringing 100x as much energy with you just because it's hydrogen. So that doesn't make an ICE car any less of a bomb...
The difference is that antimatter annihilates with any normal matter that it comes into contact with. This means you can't just put it in a tank, the way you can with hydrogen. You can't e.g. combine it with some metal to make a metal hydride to make it safer to store, the way you can with hydrogen.
At an absolute minimum, you need extremely strong magnetic confinement and an extremely hard vacuum. And even then, you're going to get collisions with stray atoms and annihilation events which release gamma rays and other radiation products - although shielding is probably the least of your worries in this scenario.
A typical research lab at a university or large corporation can't make a vacuum strong enough to store even tiny quantities of antimatter for more than a few minutes, and they can't produce the magnetic confinement strength required to store macro quantities of it, either.
So the question with an antimatter-powered car is not whether it's going to destroy the surrounding region and bathe it in hard radiation, but how many milliseconds (or less) it will take before that inevitably happens.
But probably luckily for us, this is all moot, because we have no way of producing enough antimatter for this to be an issue. If all the antimatter ever created by humans annihilated simultaneously, only scientists monitoring sensitive instruments closely would notice, because it's such a microscopic amount.
Edit: for perspective, you'd need about 7 billion times the 92 antiprotons transported in the truck in the story to match the energy released by a single grain of gunpowder.
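The "7 billion" figure can be sanity-checked with a quick back-of-the-envelope calculation. This sketch assumes black powder at roughly 3 MJ/kg and takes "grain" to mean the 64.8 mg unit of mass; neither assumption comes from the story itself, so treat the result as order-of-magnitude only:

```python
# Each antiproton annihilates with one proton, releasing both rest masses (E = 2 * m_p * c^2).
M_P_C2_J = 938.272e6 * 1.602e-19    # proton rest energy: 938.272 MeV in joules (~1.5e-10 J)
N_ANTIPROTONS = 92

e_annihilation = N_ANTIPROTONS * 2 * M_P_C2_J

GRAIN_KG = 64.8e-6                  # one grain = 64.8 mg (assumed unit)
GUNPOWDER_J_PER_KG = 3e6            # rough specific energy of black powder (assumed)
e_gunpowder = GRAIN_KG * GUNPOWDER_J_PER_KG

print(f"92 antiprotons annihilating: {e_annihilation:.2e} J")  # ~2.8e-8 J
print(f"one grain of gunpowder:      {e_gunpowder:.0f} J")     # ~194 J
print(f"ratio:                       {e_gunpowder / e_annihilation:.1e}")  # ~7e9
```

The ratio comes out around 7e9, consistent with the "about 7 billion times" estimate above.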
How is it possible to make as hard a vacuum as they did? I assume it's not perfect, so what's the trick? Does the magnet setup create a volume that's simultaneously high-probability for antimatter and low for everything else?
For this antimatter transport experiment, they only transported 92 antiprotons. To store and transport that, the requirements for the magnetic field and vacuum are many orders of magnitude lower than what would be needed for macro-scale quantities.
Also, if there were an accident and all those antiprotons annihilated, the consequences would be unnoticeable except to sensitive instruments. The energy involved is about one seven-billionth of the energy in a single grain of gunpowder.
Liquid gasoline does not spontaneously explode like in an action movie. You can put a match in the fuel tank and (presuming infinite oxygen availability) it'd just start a small fire. Heck, it may even just give a little puff and then put out the match.
Antimatter in any sufficient fuel quantity, the moment it breaks confinement, will completely annihilate and release ALL its energy in a single moment, setting off a chain reaction through the remaining antimatter. It's like sitting on an armed nuclear bomb, where you rely on electrified, highly sophisticated containment equipment never failing a single time for months to years... in a radiation-heavy environment known for causing sophisticated electronics to have errors.
And, yes, hydrogen cars were looked at critically because of the perception they can Hindenburg (I'm unsure if it's true or not). Which is a good example because you don't particularly see any hydrogen blimps anymore - we made them illegal because they're dangerous.
Any compressed gas fuel is inherently dangerous. There's a video of a CNG-fueled bus falling off a lift and sending a fireball through the maintenance facility.
Batteries have some of these same risks: they store a lot of energy and it can be released very quickly under the wrong circumstances.
This provides a great cover for intelligence agencies to avoid disclosing their actual data source. Just point to Strava and hand-wave a little. Nobody will suspect that you actually had an in via a close associate of the target.
It’s called parallel construction in many related circles and is used on a daily basis even in communities like yours.
For example, do you have information obtained from illegal surveillance technology to know of an illegal activity happening in a house? Well, why not just ask very forcefully of someone facing inflated jail time, whether they happen to remember… after thinking really hard about it… having seen that illegal activity in that particular house they definitely have been in, to get the warrant approved by a judge.
> but in this case the works in question were released under a license that allowed free duplication and distribution so no harm was caused.
FSF licenses contain attribution and copyleft clauses. It's "do whatever you want with it provided that you X, Y and Z". Just taking the first part without the second part is a breach of the license.
It's like renting a car without paying and then claiming "well you said I can drive around with it for the rest of the day, so where is the harm?" while conveniently ignoring the payment clause.
You may be confusing this with a "public domain" license.
If what you do with a copyrighted work is covered by fair use it doesn't matter what the license says - you can do it anyway. The GFDL imposes restrictions on distribution, not copying, so merely downloading a copy imposes no obligation on you and so isn't a copyright infringement either.
I used to be on the FSF board of directors. I have provided legal testimony regarding copyleft licenses. I am excruciatingly aware of the difference between a copyleft license and the public domain.
> I am excruciatingly aware of the difference between a copyleft license and the public domain.
Then why did you say "no harm was caused"? Clearly the harm of "using our copylefted work to create proprietary software" was caused. Do you just mean economic harm? If so, I think that's where the parent comment's confusion originates.
> The GFDL imposes restrictions on distribution, not copying, so merely downloading a copy imposes no obligation on you and so isn't a copyright infringement either.
The restrictions fall not only on verbatim distribution, but on derivative works too. I am not aware whether model outputs are settled to be or not to be (hehe) derivative works in a court of law, but that question is at the very least very much valid.
> the district court ruled that using the books to train LLMs was fair use but left for trial the question of whether downloading them for this purpose was legal.
The pipeline is something like: download material -> store material -> train models on material -> store models trained on material -> serve output generated from models.
These questions focus on the inputs to the model training; the question I have raised focuses on the outputs of the model. If [certain] outputs are considered derivative works of input material, then we have a cascade of questions about which parts of the pipeline are covered by the license requirements. Even if any of the upstream parts of this simplified pipeline are considered legal, it does not imply that the rest of the pipeline is compliant.
Consider the net effect and the answer is clear. When these models are properly "trained", are people going to look for the book or a derivative of it, with proper attribution?
Or is the LLM going to regurgitate the same content with zero attribution, and shift all the traffic away from the original work?
When viewed in this frame, it is obvious that the work is derivative and then some.
That is your opinion, but the judge disagreed with you. The decision may have been overturned on appeal, but as it stands, in that courtroom, the training was fair use.
I can memorize a song and it will be fair use too, but it won't be anymore once I start performing it publicly. Training itself is quite obviously fair use, what matters is what happens next.
This is also, unfortunately, the only way this can be settled. Making LLM output legally a derivative work would murder the AI gold rush, and nobody wants that.
I'm also skeptical that it's impossible to get an LLM to reproduce some code verbatim. Google had that paper a while back about getting diffusion models to spit out images that were essentially raw training data, and I wouldn't be surprised if the same is possible for LLMs.
Stack Overflow has verbatim copied GPL code in some of its questions and answers. As presented by SO, that code is not under the GPL license (this also applies to other licenses - the BSD advertising clause and the original JSON license will cause similar problems).
Arguably, the use of the code in the Stack Overflow question and answer is fair use.
The problem occurs not when someone reads the Q&A with the improperly licensed code but rather when they then copy that code verbatim into their own non-GPL product and distribute that without adherence to the GPL.
It's the last step - some human distributing the improperly licensed software - that is the violation of the GPL.
This same chain of what is allowed and what is not is equally applicable to LLMs. Providing examples from GPL licensed material to answer a question isn't a license violation. The human copying that code (from any source) and pasting it into their own software is a license violation.
---
Some while back I had a discussion with a Swiss developer about the indefinite article used before "hobbit" in a text game. They used "an hobbit" and in the discussion of fixing it, I quoted the first line of The Hobbit. "In a hole in the ground there lived a hobbit." That cleared it up and my use of it in that (and this) discussion is fair use.
If someone listening to that conversation (or reading this one) thought that the bit that I quoted would be great on a T-shirt and then printed that up and distributed it - that would be a copyright violation.
The Ninth Circuit did, however, overturn the district court's decision that Google's thumbnail images were unauthorized and infringing copies of Perfect 10's original images. Google claimed that these images constituted fair use, and the circuit court agreed. This was because they were "highly transformative."
If I was to then take those thumbnails from a google image search and distribute that as an icon library, I would then be guilty of copyright infringement.
I believe that Stack Overflow, Google Images, and LLM models and their output constitutes an example of transformative fair use. What someone does with that output is where copyright infringement happens.
My claim isn't that AI vendors are blameless but rather that in the issue of copyright and license adherence it is the human in the process that is the one who has agency and needs to follow copyright (and for AI agents that were unleashed without oversight, it is the human that spun them up or unleashed them).
That's really interesting. I'm a lawyer, and I had always interpreted the license like a ToS between the developers. That (in my mind) meant that the license could impose arbitrary limitations above the default common law and statutory rules and that once you touched the code you were pregnant with those limitations, but this does make sense. TIL. So, thanks.
Does the reasoning in the cases where people to whom GPL software was distributed could sue the distributor for source code, rather than relying on the copyright holder suing for breach of copyright, strengthen the argument that arbitrary limitations are enforceable?
Licenses != contracts, and well, the FSF's position has always been that the GPL isn't a contract, and contracts are what allow you to impose arbitrary limitations. Most EULAs are actually contracts.
Yes... a license can be granted via contract. I think the question here is whether posting a LICENSE.md file in a public github repo forms a contract (offer, acceptance, consideration) when a developer uses it. If so, I'm back to being unclear how "public domain" can really play a role. The developer is bound by the terms of that contract.
Unrelated question regarding this part, since you seem to be an expert on this:
> If what you do with a copyrighted work is covered by fair use it doesn't matter what the license says - you can do it anyway.
How is it that contracts can prohibit trial by jury but can't prohibit fair use of copyrighted work? Is there a list of things a contract is and isn't allowed to prohibit, and explanations/reasons for them?
The general answer is because there is a statute or court opinion that says so for one thing and a different one that says something else for the other thing.
It's also relevant that copyright (and fair use) is federal law, contracts are state law and federal law preempts state law.
It sounds that way a bit from the one sentence. But that’s not the case at all.
> 4. MODIFICATIONS
> You may copy and distribute a Modified Version of the Document under the conditions of sections 2 and 3 above, provided that you release the Modified Version under precisely this License, with the Modified Version filling the role of the Document, thus licensing distribution and modification of the Modified Version to whoever possesses a copy of it. In addition, you must do these things in the Modified Version:
Etc etc.
In short, it is a copyleft license. You must also license derivative works under this license.
Just fyi, the gnu fdl is (unsurprisingly) available for free online - so if you want to know what it says, you can read it!
And the judgement said that the training was fair use, but that the duplication might be an infringement. The GFDL doesn't restrict duplication, only distribution, so if training on GFDLed material is fair use and not the creation of a derivative work then there's no damage.
Right. I can publish the work in whole without asking permission. That’s unrestricted duplication.
However, as I read it, an LLM spitting out snippets from the text is not "duplicating" the work. That would fall under modifications. From the license:
> A "Modified Version" of the Document means any work containing the Document or a portion of it, either copied verbatim, or with modifications and/or translated into another language.
I read that pretty clearly as: any work containing text from a gnu fdl document is a modification, not a duplication.
1) Obtaining the copyrighted works used for training. Anthropic did this without asking for the copyright holders' permission, which would be a copyright violation for any work that isn't under a license that grants permission to duplicate. The GFDL does, so no issue here.
2) Training the model. The case held that this was fair use, so no issue here.
3) Whether the output is a derivative work. If so then you get to figure out how the GFDL applies to the output, but to the best of my knowledge the case didn't ask this question so we don't know.
For this to stand up in court you'd need to show that an LLM is distributing "a modified version of the document".
If I took a book and cut it up into individual words (or partial words even), and then used some of the words with words from every other book to write a new book, it'd be hard to argue that I'm really "distributing the first book", even if the subject of my book is the same as the first one.
This really just highlights how the law is a long way behind what's achievable with modern computing power.
You’re just describing transformative use. I’m not a lawyer, but an example from music: taking a single drum hit from a James Brown song is apparently not transformative. Taking a vibe from another song is also maybe not transformative, e.g. Robin Thicke and Pharrell’s “Blurred Lines” was found to legally take the “feel” from Marvin Gaye’s “Got to Give It Up”.
Which is all to say that the law is actually really bad at determining what is right and wrong, and our moral compasses should not defer to the law. Unfortunately, moral compasses are often skewed by money - like how normal compasses are skewed by magnets.
Presumably, a suitable prompt could get the LLM to produce whole sections of the book which would demonstrate that the LLM contains a modified version.
I am distributing an svg file. It’s a program that, when run, produces an image of mickey mouse.
By your description of the law, this svg file is not infringing on disney’s copyright - since it’s a program that when run creates an infringing document (the rasterized pixels of mickey mouse) but it is not an infringing document itself.
I really don’t think my “I wrote a program in the svg language” defense would hold up in court. But I wonder how many levels of abstraction before it’s legal? Like if I write the mickey-mouse-generator in python, does that make it legal? If it generates a variety of randomized images of mickey mouse, is that legal? If it uses statistical analysis of many drawings of mickey to generate an average mickey mouse, is that legal? Does it have to generate different characters if asked before it is legal? Can that be an if statement, or does it have to use statistical calculations to decide what character I want?
The SVG file is a representation of mickey mouse thus possibly touches Disney copyright (depends on exactly what form of Mickey it represents, as I believe some went public domain equivalent recently). It's not capable of being something else without substantial rework. Therefore it is a derivative work.
Generally, to pass the test of not being a derivative work, it would need to be generic enough that it creates non-copyrighted works as well; then the responsibility shifts over. Can the program exist without a given copyrighted work (not the general idea, but specific copyrighted works)? Then it's quite probably not derivative.
If that would work reliably then you could apply that to human-produced code too. But nothing like that has shown to work, so I wouldn't put money on it working for LLM output.
Except one is an employee and the other one is an ex employee. The bias this introduces is not just a minor nuance, it's what fuels the public conflict and causes everybody else to double check their popcorn reserves.
Of course technical discussions happen all the time at companies between competent people. But you don't do that in public, nor is this a technical debate: "I don't recall talking to you about it" - "I do, I did xyz then you ignored me" - "<changes subject>"
Important distinction yes. It also means I can't go back and check the thread on what was said and when. Nor do I want to.
Always good to talk face to face if you have strong feelings about something. When I said "talk" I meant literally face to face.
Spend a decade or so on lkml and everyone develops a thick skin. But mix that with the corporate environment of Facebook circa 2011, and being an ex-employee adds even more to the drama.
Having read through the comments here, I'm still of the opinion that any HW changes had a secondary effect and the primary contributor was a change in how HHVM/jemalloc interacted with MADV.
One more suggestion: evaluate more than one app and company wide profiling data to make such decisions.
One of the challenges in doing so is the large contingent of people who don't have an understanding of CPU uarch/counters and yet have a negative opinion of their usefulness for making decisions like this.
So the only tool you're left with is running large-scale rack-level tests in a close-to-prod environment, which has its own set of problems and benefits.
Perf counters are only indicative of certain performance characteristics at the uarch level; improving one or more aspects of them does not necessarily correlate with measurable performance gains in E2E workloads deployed on a system.
That said, one of the comments above suggests that the HW change was a switch to Ivy Bridge, when zeroing memory became cheaper, which is a bit unexpected (to me). So you might be more right when you say that the improvement was the result of memory allocation patterns and jemalloc.
Is anybody else annoyed by all this focus on em-dash use to detect AI? In no time, the bad guys will tell their BS machines to avoid em-dashes and "it's not X, it's Y" and whatever else people use as tell-tale signs, and eventually the training data will have picked up on that too. And people who genuinely use em-dashes for taste reasons, or who otherwise use expressions considered typical of AI, are getting a bad rep.
This is all just demonstrating the helplessness that's coming to our society w.r.t. dealing with gen AI output. Looking for em-dashes is not the solution and distracts from actually having to deal with the problem. (Which is not a technical but a social one. You can't solve it with tech.)
This is turning out to be a huge issue for me, as my frequent use of em-dashes makes my remarks trigger people, effectively disrupting attempts to communicate. Maybe my communication needs to change, or maybe these objections are yet another red flag to watch for.
> Anybody else being annoyed by all this focus on em-dash use to detect AI?
Yes, the “AI detectives” can be quite annoying, as the comments are always the same. No substance, just “has X, it’s AI”. The em-dash detectives tend to be the worst, because they often refuse to understand that em-dashes are actually trivial to type (depending on the OS) and that people have been using them on purpose since before LLMs.
Mind you, using em-dashes as one signal to reinforce what you already perceive as LLM writing is valid, it’s only annoying when it’s used as the sole or principal indicator.
I keep reading that students are learning to intentionally write worse so that their work doesn't get flagged as AI-generated. I think it's a systemic problem that won't be solved in the short term, unfortunately.
It's hilarious that em dashes and "it's not X; it's Y" and other trivial things are the best way for humans to spot AI now. Like if AI robots infiltrated us, at first we'd be like "ooh, he has long ears, he's a robot". And after a while the robots will learn to keep their ears shorter. Then what? When we're out of tell-tale signs?
I can imagine that this will be similar to the "Emacs/Vim in the AI age" article - it will just be considered to matter less in the AI age. Why spend 3-5 years of your life with a sometimes frustrating experience to obtain this PhD degree if you have powerful models at your disposal that will just be able to solve everything for you? (Similar to why learn Elisp/VimScript/...) Especially considering the current trajectory, expecting where things will be in 5 or 15 years. It will just feel less and less appealing to get an in-depth education, especially a formal one.
Which is quite ironic, considering who wrote the article.
LLMs fall victim to "garbage in, garbage out." Claude can solve open problems if you know what you're doing, but it can also incorrectly convince you it's right if you don't know what you're doing.
A PhD teaches you how to think, how to learn, and how to question the world. That's a vital set of skills no matter what tool exists.
I don’t really know how to optimize for a world where AIs would be smarter than everyone and able to do everything.
If that comes to pass, I guess there won’t be any economic cost to having done my PhD because the entire economy will be AI driven and we’ll hopefully just be their happy pets.
If that doesn’t come to pass, and AIs just remain good at summarizing and remixing ideas, I guess people with experience generating research will still be useful.
Because you may have fun working in a scientific environment and doing research.
I liked my job at the university - independent of the final PhD. I enjoyed what I was doing. Most of the time I also enjoyed writing my dissertation, since I was given the opportunity to write about my stuff. And mostly I could write it in a way how I felt things are supposed to be explained.
Why spend your life doing anything at all? I'm biased on the topic since I'm writing up atm, but it was, if nothing else, a very interesting way to spend 4 years of my life.
I find it very fulfilling to do a PhD and did so myself. More people should. What I mean is that I'm expecting the general view on it to evolve as described.
Ah. I did indeed misunderstand. Also, as I said, I've got a personal stake, right at the tail end of the PhD, looking for jobs, so I guess I'm feeling pretty defensive. I certainly hope the general public doesn't feel this way, but I've seen plenty of people say similar things about college degrees now, so it kind of makes sense.
C dev wasn't an issue back in the 1 GB or 256 MB or 16 MB days either. You just didn't use to have a Chrome tab open that by itself is eating 345 MB just to show a simple tutorial page.
C dev wasn't a problem with MSDOS and 640K either. With CP/M and 64K it was a challenge I think. Struggling to remember the details on that and too lazy to research it right now.
If you believe that then you have hardly scratched the surface of what your editor is capable of doing.