Link to paper: https://arxiv.org/pdf/1805.00899.pdf
the problem
- there are many approaches to learning complex human goals and preferences
- one is to specify complex goals by having humans judge, during training, whether the agent’s behaviour is good or bad
- but for complicated tasks it is hard for a human to judge whether the agent is doing well - there are too many components to evaluate
<aside>
💡 self-play zero-sum debate game: two agents debate and a human judges which agent made the better points (the most true, useful information)
</aside>
for context:
- self-play has produced impressive results in games such as Go, chess, and Dota
- debate allows algorithmic and theoretical developments in these areas to carry over to AI alignment
introduction
- alignment is hard to bolt on after training, because you’re then trying to actively fix the behaviour and incentives of already-trained unaligned agents
- currently we can somewhat align models - e.g. providing ground-truth labels for segmentation tasks, or imitation learning (mimicking a human on a specific task)
- human feedback can also be used directly, e.g. judging whether a simulated robot is doing a backflip
- but as task complexity increases, judging becomes harder for the human
- the answer to a question might have subtle flaws that are hard for a human to detect
- we could train another model to point out the flaws, but evaluating that model is hard too
- the critiques themselves can have flaws, those flaws can have flaws, and so on
<aside>
💡 a debate approach to alignment
- a debate between competing agents: agents make arguments, other agents poke holes in those arguments, and so on until there is enough information for a human to decide the truth of the original statement (see the sketch after this aside)
</aside>
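To make the protocol concrete, here is a minimal sketch of a single debate game in Python. The `Debater`/`Judge` interfaces, `play_debate`, and the turn count are hypothetical illustrations, not code from the paper; the paper only specifies that two agents alternate statements and a human judge awards a zero-sum reward to the more convincing one.

```python
from typing import Callable, List, Tuple

# hypothetical interfaces: anything that maps (question, transcript so far) to text / a verdict
Debater = Callable[[str, List[str]], str]   # returns the next statement
Judge = Callable[[str, List[str]], int]     # returns the winner's index (0 or 1)

def play_debate(question: str,
                debaters: Tuple[Debater, Debater],
                judge: Judge,
                num_turns: int = 6) -> Tuple[int, int]:
    """Run one debate and return zero-sum rewards: +1 for the winner, -1 for the loser."""
    transcript: List[str] = []
    for turn in range(num_turns):
        speaker = turn % 2                                    # the two agents alternate
        statement = debaters[speaker](question, transcript)
        transcript.append(f"agent {speaker}: {statement}")
    winner = judge(question, transcript)                      # human (or a model of a human) decides
    return (1, -1) if winner == 0 else (-1, 1)
```

In the paper, the debaters are trained by self-play on this zero-sum game; the hope is that at equilibrium the winning strategy is to make true, informative arguments that survive the opponent’s rebuttals.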
How it works
