The idea is good, but the project isn't open. So I assume a Rust fork will come out under MIT with these ideas, which can become the version the wider community adopts.
Possibly. Amazon and Google have also made it possible for smaller startup-based DB companies to go that route, with things like ValKey and OpenSearch. LLMs have made it super easy to transpile the ideas into whatever programming language you please, though; you just have to put in the time.
All this nonsense because of a silly law well past its due date. Demolish copyright, and let's move on to a world where ideas worth sharing are shared without any blocking. And if someone feels their "ideas" should not be shared without their consent, well, they can keep them to themselves; we're a society, after all.
Generalize much? How would you feel if code-as-craft people were called out as anti-social nerds who spent their time on the umpteenth rewrite and refactor and didn't care what impact that had on the actual users they were building for?
I'd thank you for the laugh and assume you worked in project management, marketing, or some other low-information industry segment. I've worked with hundreds of developers over my career, and the only time I've brushed up against anyone who even approximates what you are describing would be in HN comment threads. The craft-oriented women and men I've had the pleasure of working with have, without exception, held user experience and the future sanity of other developers interacting with the code they wrote as core requirements, every project, every line of code. Getting it right the first time tends to cut down significantly on refactoring.
Copyright is not a blacklist but an allowlist of things kept aside for the holder. Everything else is fair game. LLM ingestion comes under fair use, so no worries. If someone can get their hands on it, nothing in the law stops it from being ingested for training.
We can debate whether this law is moral. Like the GP, I too agree that public data in -> public domain out is what's right for society. Copyright as an artificial concept has gone on for long enough.
I don't think so. It is nowhere near "limited use". The entirety of the source code is ingested for training the model. In other words, it meets the bar of the "heart of the work" being used for training. There are other factors as well, such as not harming the owner's ability to profit from the original work.
This hasn't gone to the Supreme Court yet. And this is just the USA. Courts in the rest of the world will also have to take a call. It is not as simple as you make it out to be. Developers are spread across the world, with the majority living outside the USA. Jurisdiction matters in these things.
Copyright's ambit has been pretty much defined and run by the US for over a century.
You're holding out for some grace on this from the wrong venue. The right avenue would be lobbying for new laws to regulate LLMs and their use, not trying to find shelter in an archaic and increasingly irrelevant bit of legalese.
I don't disagree. However, your assertion that copyright was initially defined by the US (which is not a fact: England came up with it, and it was adopted by the Commonwealth, of which the US was a part until its independence) does not mean the jurisdiction is the US. Even if the US Supreme Court rules one way or the other, it doesn't settle much, as the rest of the world has its own definitions and legalese that need to be scrutinized and modernized.
Alsup absolutely did not vindicate Anthropic's conduct as "fair use".
> Instead, it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies. [0]
It was only fair use in the cases where they already had a license to the information at hand.
These are crappy arguments. The author is seeking to re-litigate "piracy of IP is bad" and "AI is bad".
If those are your axioms, then you will find the old world is already in the rear-view mirror, and those who hold them want to pull every other project back to stay with them in that world.
AI is here. Free software succeeded - make as much as you want. This technology is a force multiplier.
You can debate its morality, but most people just want to do their work.
I get (incorrectly) accused of writing undisclosed sponsored content pretty often, so I'm actually hoping that the visible sponsor banner will help people resist that temptation because they can see that the sponsorship is visible, not hidden.
That's actually a cleaner editorial standard than most publications follow. The major risk in tech journalism isn't disclosed sponsorships — it's the undisclosed access journalism where coverage tone shifts to maintain relationships. Visible banners beat invisible influence every time.
Honestly, after his ~23 years of writing online, I think he's fairly earned the title of independent researcher. He added those sponsorships three days ago; perhaps hold off on raising alarm bells until he actually writes about a sponsor.
I can't offer an example of code, but considering researchers were able to cause models to reproduce literary works verbatim, it seems unlikely that a git repository would be materially different.
These arguments absolutely infuriate me. Your code is not that unique. Lots of people write the same snippet every day and have no idea that somebody else just wrote the same thing.
It's such a crock that you can somehow claim you're the only person who can write that snippet and now everyone else owes you something. No. No they don't. Get over it.
Writing a book is different. Lifting pages or chapters is different, because it's much harder for two people to write the exact same thing. Code is code: it follows a formula, and everyone uses that formula.
Assuming that even works from a researcher's perspective, it's working backward from a specific goal. There's 0 actual instances (and I've been looking) where verbatim code has been spat out.
It's a convenient criticism of LLMs, but a wrong one. We need to do better.
> There's 0 actual instances (and I've been looking) where verbatim code has been spat out.
That’s not true. I’ve seen it happen and remember reports where it was obvious it happened (and trivial to verify) because the LLM reproduced the comments with source information.
Either way, plagiarism doesn’t require one to copy 100% verbatim (otherwise every plagiarist would easily be off the hook). It still counts as plagiarism if you move a space or rename a variable.
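That point about renames can be made concrete. A minimal sketch (Python, illustrative only, not any real plagiarism detector): collapse every identifier to a placeholder and drop comments and layout, so a copy that merely renames variables still compares equal.

```python
import io
import tokenize

def normalized_tokens(source: str) -> list[str]:
    """Return the token stream with all names collapsed and
    comments/layout dropped, so only the code's shape remains."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            out.append("ID")  # rename-proof: every identifier becomes ID
        elif tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
                          tokenize.INDENT, tokenize.DEDENT):
            continue          # ignore comments and whitespace structure
        else:
            out.append(tok.string)
    return out

a = "total = sum(x * x for x in values)\n"
b = "acc = sum(v * v for v in items)  # renamed copy\n"
# The two snippets normalize to the same token sequence.
```

Real detectors go much further (fingerprinting, structural matching), but even this trivial normalization defeats the rename-a-variable defense.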
You should take your findings to the large media organizations, including the NYT, who have been trying to prove this for years now. Your discovery is probably going to win them their case.
I don't know of code examples, but this tracks for me. Anytime I have an agent write something "obvious" and crazy hard -- say, a new compiler for a new language? Golden. I ask it to write a fairly simple stack-invariant version of an old algorithm using a novel representation (topology) and a novel construction (free module)... zip. It's 200 LOC, and after 20+ attempts, I've given up.
It happens often enough that the company I work for has set up a presubmit to check all of the AI-generated and AI-assisted code for plagiarism (which they call "recitation"). I know they're checking the code for similarity to anything on GitHub, but they could also be checking against the model's training corpus.
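A check like that doesn't need anything exotic at its core. A toy sketch (illustrative names and thresholds, not the actual presubmit described above): index line n-grams of a known corpus and score a submission by how many of its n-grams already appear there.

```python
def line_ngrams(code: str, n: int = 3) -> set:
    """Set of n-line windows (stripped, blank lines removed)."""
    lines = [ln.strip() for ln in code.splitlines() if ln.strip()]
    return {tuple(lines[i:i + n]) for i in range(max(0, len(lines) - n + 1))}

def recitation_score(candidate: str, corpus_index: set, n: int = 3) -> float:
    """Fraction of the candidate's n-grams found in the corpus index."""
    grams = line_ngrams(candidate, n)
    if not grams:
        return 0.0
    return len(grams & corpus_index) / len(grams)

known = """
def add(a, b):
    return a + b

def mul(a, b):
    return a * b
"""
index = line_ngrams(known)

copied = """
def add(a, b):
    return a + b

def mul(a, b):
    return a * b
"""
fresh = """
class Stack:
    def __init__(self):
        self.items = []
"""
# A presubmit would reject when recitation_score exceeds some threshold.
```

Production systems add identifier normalization and fingerprinting (e.g. winnowing) so renames and reformatting don't slip through, but the gate itself is just a similarity score and a threshold.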