This type of data is actually better than independent human text (specifically for training the LLM that originally produced the output)
GPT-4 is trained with PPO+RLHF. Web text produced by the LLM and then fed back in will be closer to the model's original token distribution than independent human text is.
In other words, by selectively publishing LLM output you're effectively performing the same action as clicking the thumbs up/thumbs down button in the ChatGPT web UI.
I agree with OpenAI that this will not be a problem, since you would need a process to gauge the quality of the data anyway, even for human text.
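The "selective publishing as implicit feedback" point can be illustrated with a toy sketch. This is not OpenAI's actual pipeline; the quality scores and threshold here are hypothetical stand-ins for human judgment about which outputs are worth posting:

```python
import random

random.seed(0)

def generate(n):
    # Stand-in for an LLM emitting outputs with quality scores in [0, 1].
    return [random.random() for _ in range(n)]

def selectively_publish(samples, threshold=0.7):
    # People tend to post only outputs they judge good enough --
    # an implicit preference signal, like a thumbs-up.
    return [s for s in samples if s >= threshold]

samples = generate(10_000)
published = selectively_publish(samples)

# The published subset has a higher mean quality than the raw output
# distribution, so training on it upweights preferred completions,
# loosely analogous to RLHF reward signals.
raw_mean = sum(samples) / len(samples)
published_mean = sum(published) / len(published)
print(raw_mean < published_mean)
```

The filtering step is doing the same job as a preference label: it biases the training pool toward outputs humans approved of.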
I've seen this referred to as "AI drift" before, and I see it being a long-term problem for training data sets. As always, the quality of the data set determines the quality of the results. Models and data sets will be the arms race in the LLM world for a while to come, I think.
LLMs still generate content containing misinformation, which ends up on the web; future models can then learn and propagate that misinformation in their responses. That's one way this feedback loop can affect their output.