My tinfoil hat may be tighter than usual this morning, but I'm looking at migrating my startup's tooling to self-hosted. GitHub to self-hosted Git is among the first. The risk of my code ending up as LLM training fodder is starting to outweigh the benefits.
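For what it's worth, the Git part of the move is mechanically simple. A minimal sketch, assuming a bare repo on your own server (the hostname and paths here are placeholders, not real infrastructure):

```shell
# On the self-hosted server: create a bare repository to receive the mirror.
git init --bare /srv/git/myproject.git

# On a workstation: clone with --mirror to copy ALL refs
# (branches, tags, notes), not just the default branch.
git clone --mirror git@github.com:me/myproject.git
cd myproject.git

# Repoint the mirror at its new home and push every ref.
git remote set-url origin ssh://git@git.internal.example/srv/git/myproject.git
git push --mirror
```

A `--mirror` clone/push round-trips the full ref namespace, so the self-hosted copy is a byte-for-byte replacement rather than a partial checkout. Issues, PR history, and wiki content are separate and need their own export.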
After all, LLM training on customer data is already happening. Slack is a recent example. [0] OpenAI is another -- they train on your private chats unless you opt out. [1]
My code has been in a GitHub private repo for years, so clearly, I have some trust in Microsoft not to outright steal it. But LLM training is different. Companies have been giving that a pass.
As a bootstrapped startup founder, having my code in some LLM's "knowledge update" could sink me if it could produce enough of it, even with alterations (hey, knowing my coding abilities, they'd probably be improvements). The same goes for docs, processes, and chats.
Copilot currently says it doesn't train on user data, but EULAs change. Some "we updated our terms" email could be a gloss on "heads up, we now train on your private code."
Is anyone else doing/thinking this, or do I need to take in my hat for alterations?
[0] https://www.techspot.com/news/103055-slack-has-siphoning-user-data-train-ai-models.html
[1] https://help.openai.com/en/articles/7730893-data-controls-faq
When it comes to IT companies and the cloud, trust is really the wrong thing to rely on, no matter who or what it is.
AI is just making it easier for those companies to abuse that trust even further. People share their code hoping AI will make it better, share their ideas hoping AI will help them get somewhere, and so on.
And everyone willingly ignores that this information can be stored, analyzed and used against them later.
How naive can you be?