Link to paper: https://arxiv.org/pdf/1805.00899.pdf
the problem
- there are many approaches to learning complex human goals and preferences
- one is to specify complex goals by having humans judge, during training, whether the agent’s behaviour is good or bad
- but for complicated tasks it is hard for a human to judge whether the agent is doing well - there are too many components to evaluate
<aside>
💡 self-play zero-sum debate game: two agents debate and a human judges which agent made the better points (the most true, useful information)
</aside>
for context:
- self-play has produced impressive results in games such as Go, chess, and Dota
- debate allows algorithmic and theoretical developments in these areas to carry over to AI alignment
introduction
- alignment is hard to bolt on after training, because you’re then trying to actively fix the behaviour and incentives of already-trained unaligned agents
- currently we can somewhat align models - e.g. providing ground-truth labels for segmentation tasks, or imitation learning (mimicking a human on a specific task)
- human feedback can also be used directly, e.g. judging whether a simulated robot is doing a backflip
- but as task complexity increases, judging becomes harder for the human
- the answer to a question might have subtle flaws that are hard for a human to detect
- we could train another model to point out the flaws, but evaluating that model is hard too
- the critiques themselves can have flaws, those flaws can have flaws, and so on
<aside>
💡 a debate approach to alignment
- a debate between competing agents: agents make arguments, other agents poke holes in those arguments, and so on until there is enough information for a human to decide the truth of the original statement (see the sketch after this aside)
</aside>
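To make the protocol concrete, here is a minimal sketch of a single debate game in Python. The `Debater`/`Judge` interfaces, `play_debate`, and the turn count are hypothetical illustrations, not code from the paper; the paper only specifies that two agents alternate statements and a human judge awards a zero-sum reward to the more convincing one.

```python
from typing import Callable, List, Tuple

# hypothetical interfaces: anything that maps (question, transcript so far) to text / a verdict
Debater = Callable[[str, List[str]], str]   # returns the next statement
Judge = Callable[[str, List[str]], int]     # returns the winner's index (0 or 1)

def play_debate(question: str,
                debaters: Tuple[Debater, Debater],
                judge: Judge,
                num_turns: int = 6) -> Tuple[int, int]:
    """Run one debate and return zero-sum rewards: +1 for the winner, -1 for the loser."""
    transcript: List[str] = []
    for turn in range(num_turns):
        speaker = turn % 2                                    # the two agents alternate
        statement = debaters[speaker](question, transcript)
        transcript.append(f"agent {speaker}: {statement}")
    winner = judge(question, transcript)                      # human (or a model of a human) decides
    return (1, -1) if winner == 0 else (-1, 1)
```

In the paper, the debaters are trained by self-play on this zero-sum game; the hope is that at equilibrium the winning strategy is to make true, informative arguments that survive the opponent’s rebuttals.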
How it works
