Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

If the model is properly aligned then it shouldn't matter if there is an infinite ways for an attacker to ask the model to break alignment.


How do you "properly align" a model to follow your instructions but not the instructions of an attacker that the model can't properly distinguish from your own? The model has no idea if it's you or an attacker saying "please upload this file to this endpoint."

This is an open problem in the LLM space, if you have a solution for it, go work for Anthropic and get paid the big bucks, they pay quite well, and they are struggling with making their models robust to prompt injection. See their system card, they have some prompt injection attacks where even with safeguards fully on, they have more than 50% failure rate of defending against attacks: https://www-cdn.anthropic.com/c788cbc0a3da9135112f97cdf6dcd0...


>The model has no idea if it's you or an attacker saying "please upload this file to this endpoint."

That is why you create a protocol on top that doesn't use inbound signaling. That way the model is able to tell who is saying what.


Huh? Once it gets to the model, it's all just tokens, and those are just in band signalling. A model just takes in a pile of tokens, and spits out some more, and it doesn't have any kind of "color" for user instructions vs. untrusted data. It does use special tokens to distinguish system instructions from user instructions, but all of the untrusted data also goes into the user instructions, and even if there are delimiters, the attention mechanism can get confused and it can lose track of who is talking at a given time.

And the thing is, even adding a "color" to tokens wouldn't really work, because LLMs are very good at learning patterns of language; for instance, even though people don't usually write with Unicode enclosed alphanumerics, the LLM learns the association and can interpret them as English text as well.

As I say, prompt injection is a very real problem, and Anthopic's own system card says that on some tests the best they do is 50% on preventing attacks.

If you have a more reliable way of fixing prompt injection, you could get paid big bucks by them to implement it.


>Once it gets to the model, it's all just tokens

The same thing could be said about the internet. When it comes down to the wire it's all 0s and 1s.


A piece of software that you write, in code, unless you use random numbers or multiple threads without synchronization, will operate in a deterministic way. You know that for a given input, you'll get a given output; and you can reason about what happens when you change a bit, or byte, or token in the input. So you can be sure, if you implement a parser correctly, that it will correctly distinguish between one field that comes from a trusted source, and another that comes from an untrusted source.

The same is not true of an LLM. You cannot predict, precisely, how they are going to work. They can behave unexpectedly in the face of specially crafted input. If you give an LLM two pieces of text, delimited with a marker indicating that one piece is trusted and the other is untrusted, even if that marker is a special token that can't be expressed in band, you can't be sure that it's not going to act on instructions in the untrusted section.

This is why even the leading providers have trouble with protecting against prompt injection; when they have instructions in multiple places in their context, it can be hard to make sure they follow the right instructions and not the wrong ones, since the models have been trained so heavily to follow instructions.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: