Hacker News
Ask HN: Is anyone else migrating to self-hosted tooling because of LLMs?
3 points by brushfoot on May 28, 2024 | 4 comments
My tinfoil hat may be tighter than usual this morning, but I'm looking at migrating my startup's tooling to self-hosted alternatives. GitHub to self-hosted Git is among the first. The risk of my code ending up as LLM training fodder is starting to outweigh the benefits.
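For what it's worth, the Git half of that migration is small: a bare repository on a box you control, plus a remote URL change on each machine. A minimal sketch, where the server path and hostname (git.example.internal) are made up for illustration:

```shell
# On the server: create a bare repository to push to.
mkdir -p /srv/git
git init --bare /srv/git/myapp.git

# On each developer machine: repoint the existing clone over SSH.
git remote set-url origin git@git.example.internal:/srv/git/myapp.git
git push origin main
```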

After all, LLM training on customer data is already happening. Slack is a recent example. [0] OpenAI is another -- they train on your private chats unless you opt out. [1]

My code has been in a GitHub private repo for years, so clearly, I have some trust in Microsoft not to outright steal it. But LLM training is different. Companies have been giving that a pass.

As a bootstrapped startup founder, having my code in some LLM's "knowledge update" could sink me if it could produce enough of it, even with alterations (hey, knowing my coding abilities, they'd probably be improvements). The same goes for docs, processes, and chats.

Copilot currently says it doesn't train on user data, but EULAs change. Some "we updated our terms" email could be a gloss on "heads up, we now train on your private code."

Is anyone else doing/thinking this, or do I need to take in my hat for alterations?

[0] https://www.techspot.com/news/103055-slack-has-siphoning-user-data-train-ai-models.html

[1] https://help.openai.com/en/articles/7730893-data-controls-faq



You people are so naive. How can you store any code in the cloud at all and believe that nobody is going to look at it? Or run your business off cloud-based systems? Running a webshop is one thing, but having your whole company in the cloud is another ballpark entirely.

When it comes to IT companies and the cloud, trust is really the wrong approach, no matter who or what is involved.

AI is just making it easier for those companies to abuse that trust even more. People are sharing their code in the hope that AI will improve it, sharing their ideas in the hope that AI will help them get somewhere, and so on.

And everyone willingly ignores that this information can be stored, analyzed and used against them later.

How naive can you be?


MSFT and Google have a vested interest in not training public models on private data. They would lose customers en masse. Instead, they offer private models for your private data.

Self-hosting all the things will distract you from the core value add your startup provides. It will also be a lower quality experience internally.


Maybe, but I was shocked to find out that OpenAI had been training on my private chats. I was sending it entire files of my source code on the assumption that my private chats were just that, private. Now big swaths of my source code have ended up in their training data.

I opted out a week or two ago, but the past code is there. If a competitor asks it how to solve a problem that I already solved, the model will be that much better at solving it than it was before. I effectively just commodified part of my code base.

The same goes for Slack: the fact that Slack started training on private chats alarmed me. Conversations encode the patterns of a business's processes. And Slack isn't exactly a fly-by-night operation. (Edit: In their blog post, they at least say they aren't using the data to train LLMs that will be public, unlike what OpenAI does by default with ChatGPT chats.)

These companies wouldn't release unmodified customer data into the wild, but the zeitgeist seems to be that trained models are different.

That said, I agree with your point about my startup's core value. I don't want this to become a distraction that impedes business. So it isn't a top priority by any means. But it's definitely on my radar now as a rainy-day project.

I also agree that a nonstandard experience could well lower the quality internally, but I might just consider that a cost of doing business until there's more regulation around this.


I think people are overly worried about the training: because of a few examples, they assume these models can memorize everyone's content and reproduce it at will, and that some nefarious actor will use it to undermine their business. It's hyperbole if you ask me.

I use Google for my business because they do security and privacy better than any other company. Note that this relationship is very different from the free consumer offerings.



