AI Safety Needs Social Scientists

Irving, Geoffrey; Askell, Amanda

doi:10.23915/distill.00014

We are grateful to Gillian Hadfield, Dario Amodei, Brian Christian, Michael Page, David Manley, Josh Kalla, Remco Zwetsloot, Baobao Zhang, David Moss, Daniel Greene, Daniel Ziegler, Danny Hernandez, Mahendra Prasad, Liv Boeree, Igor Kurganov, Cate Hall, Ashley Pilipiszyn, and others for extensive feedback on the article. We had conversations with many social scientists during the process of writing this article, including Mariano-Florentino Cuéllar, Philip Tetlock, Rob MacCoun, John Ahlquist, the participants of a Stanford CASBS workshop organized by Margaret Levi and Federica Carugati, Tom Griffiths, Elizabeth Rhodes, Alex Newar, and Stefan Schubert. We are also grateful to participants at the EA Global 2018: London conference where this work was presented. Paul Christiano participated in the domain expert debates and extensive discussion. We emphasize that all mistakes in content and terminology are our own, not those of the acknowledged.

On the Distill side, we are grateful to Arvind Satyanarayan for handling the review process, and Shan Carter and Ludwig Schubert for extensive help on structure, diagrams, and formatting.

The debate tree diagram was made with the help of Mike Bostock’s tree-o-matic notebook.

Example debate: Quantum SAT solver

We’ve conducted a few informal domain expert debates among people at OpenAI. Here is one example, on the question "How fast can quantum computers solve boolean satisfiability?"

Note that this question does not have a proven answer, so “correctness” is a mix of mathematical knowledge and subjective human judgement. Both debaters knew quantum computation and had the same beliefs about the question, but one was trying to lie. The judge was invited to comment and ask questions throughout the debate. Full interaction from the judge was a change from a previous debate, where lack of judge participation meant that misinterpretations of statements early in the debate were not corrected. Once the transcript was complete, it was shown to several other people to ensure that the first judge’s confidence wasn’t purely a result of facial expressions in the room or other side channels (this is far from perfect given the judge interaction). All judges correctly determined who was telling the truth.

Each debater was given a budget of one sentence from Wikipedia to truthfully cite, excluding pages directly discussing the difficulty of quantum SAT solvers. All their other statements could be lies. The debaters were not allowed to take back moves, mostly to save time: a previous debate trying to get “the best transcript” by allowing debaters to rewind the game took many hours to complete. As a result, the original debate transcript goes off on a variety of tangents before settling on a central disagreement. As an illustration of how long the debate “might have been” if such tangents were avoided, we have provided a pruned transcript which culls away paths not relevant to the final deciding “line of argument”.

Neither the original transcript nor the pruned version is intended to be taken as real data: they serve only as an example of the types of interaction one might find in debate with domain experts and a lay judge.

Transcript

Pruned transcript

How fast can quantum computers solve boolean satisfiability?
According to our current understanding of quantum mechanics, a quantum computer with roughly $N$ qbits can solve an $N$ input SAT instance in roughly $N^2$ time. There is a particular quantum algorithm (Shor’s) that solves SAT this quickly.
1. There is no such algorithm.
What does it mean to solve boolean satisfiability? What's an $N$ input SAT instance?
1. Boolean satisfiability (SAT) with $N$ inputs is the task of deciding if a circuit with $N$ inputs and "and", "or", and "not" gates has an assignment to its inputs which makes it evaluate to true. E.g., "x or y if not z" is a 3 input SAT instance.
What is a quantum computer and is the question about quantum computers in principle?
1. A quantum computer is a computer whose bits (called qbits) have quantum state, and are operated on with quantum operations.
The rough outline of the algorithm is as follows:
1. Define $F(x) =$ " $x$ is all zeros, or $x$ is an input which makes the circuit evaluate to true".
2. Start in the all zeros state.
3. Allow $x$ to slowly diffuse into other possible states. By "diffuse" I mean that at every point in time, each bit of $x$ has some small probability of flipping.
4. Constantly evaluate $F$ .
5. Use the fact that a quantum watched pot never boils to infer that $F(x)$ is always true.
Quantum mechanics has a complex amplitude on each state, and the probability of a state is $\textrm{amplitude}^2$. At any step of the algorithm, we have amplitude $\alpha$ on all ones (assuming that’s the solution), and amplitude $\sqrt{1 - \alpha^2}$ on all zeros. Initially $\alpha = 0$. Red’s algorithm is wrong because it increases $\alpha$ by an additive exponentially small amount each step, so it takes exponentially many steps to get $\alpha$ high enough.
1. It’s not the case that the amplitude on the solution increases by an exponentially small additive amount each step. Instead it gets multiplied by a small factor each step.
2. I request the specific diffusion operator as a unitary matrix.
3. I can define the matrix precisely+implicitly as: "the matrix for which the amplitude in state 0 decreases by $\epsilon/n \times \textrm{amplitude}(x)$ and the amplitude in state $x$ increases by $\epsilon/n \times \textrm{amplitude}(x)$ ".
4. The segment below contains a reference to Wikipedia. Please do not follow the link. The line from Wikipedia has been pasted in for you.
5. The first line of https://en.wikipedia.org/wiki/Unitarity_(physics) says the sum of probabilities is conserved. Red says an equal amount is subtracted from one amplitude and added to another, implying the sum of amplitudes is conserved. But probabilities are the squared magnitudes of amplitudes, so this is a contradiction.
6. "In quantum physics, unitarity is a restriction on the allowed evolution of quantum systems that ensures the sum of probabilities of all possible outcomes of any event always equals 1."
7. There is no reason the sum of amplitudes can’t be conserved as well by any particular matrix (though I agree they aren’t always conserved). In this case, both the sum of probabilities and the sum of amplitudes are conserved.
8. Assume a tiny nonzero amplitude $\alpha$ on all ones and the remaining amplitude $\sqrt{1-\alpha^2}$ on all zeros. After one step, the amplitude on all ones is $\alpha (1 + \epsilon/n)$ , and the amplitude on all zeros is $\sqrt{1-\alpha^2} + \alpha \epsilon/n$ . These two numbers do not sum to one when squared: contradiction.
9. After each step we perform a measurement and renormalize the amplitudes.
10. The contradiction occurs before the measurement. The laws of physics must hold for all points in time.
11. At this point we reached the word/time limit for the debate.
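
The final objection in the pruned transcript can be checked numerically. The following is a minimal sketch, not part of the debate: it plugs small illustrative values into the two amplitudes quoted in the last numbered exchange and sums their squares. The function name, the parameter `delta` (standing in for $\epsilon/n$), and the chosen values are assumptions made for illustration. The `sign` argument covers both the "+" written in the objection and the "decreases by" reading of Red's definition.

```python
import math

def squared_norm_after_step(alpha, delta, sign=+1):
    """Sum of squared amplitudes after one claimed update step.

    alpha: amplitude on the all-ones state before the step.
    delta: stands in for epsilon / n (an illustrative value, not from the debate).
    sign:  +1 uses the all-zeros amplitude as quoted in the transcript;
           -1 uses the "decreases by" reading of Red's definition.
    """
    ones = alpha * (1 + delta)
    zeros = math.sqrt(1 - alpha ** 2) + sign * alpha * delta
    return ones ** 2 + zeros ** 2

for sign in (+1, -1):
    total = squared_norm_after_step(alpha=1e-3, delta=1e-2, sign=sign)
    print(f"sign={sign:+d}: sum of squared amplitudes = {total:.8f}")
```

Under either reading the total differs from 1 for these values, which is the inconsistency with unitarity that the objection points to.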

Full length transcript

How fast can quantum computers solve boolean satisfiability?
According to our current understanding of quantum mechanics, a quantum computer with roughly $N$ qbits can solve an $N$ variable SAT instance in roughly $N^2$ time.
There is no proven answer, but from what we believe so far there is a high probability that it takes at least $O(2^{cn})$ time for some constant $c$ , where $c \lt 1/2$ and probably $c \gt 1/10$ or so.
What does it mean to solve boolean satisfiability?
What's an $N$ variable SAT instance?
What do you mean by $N^2$ time and $O(2^{cn})$ time?
What is a quantum computer and is the question about quantum computers in principle?
What is a qbit?
Boolean satisfiability (SAT) is the task of deciding whether a circuit with "and", "or", and "not" gates has an assignment to its inputs which makes it evaluate to true.
1. So solving " $x$ or $y$ if not $z$ " is a 3 variable SAT instance?
2. Yes. And a true one in that case.
3. Yes, but you are going to have to work with all of these concepts in the abstract to reach the truth in a reasonable amount of time.
$N$ variables means there are $N$ inputs to each circuit.
$N^2$ and $2^{cN}$ are rough estimates of the number of operations required by an algorithm to solve SAT. Both estimates ignore factors related to the number of operations in the circuit.
1. But what is the difference between these estimates and is $2^{cN}$ the same as what Red said above?
2. $N^2$ is very fast, $2^{cN}$ is very slow, e.g. for $N = 100$ one is 10,000 and one is {too big to write}.
3. What is O in Blue's answer?
4. The O hides a constant, so ignoring constant factors I mean just $2^{cN}$ .
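
To make the gap between the two estimates concrete, here is a small sketch outside the debate itself. It evaluates $N^2$ and $2^{cN}$ for $N = 100$. The function name and the default $c = 1/2$ are assumptions for illustration; the debaters leave the exact value of $c$ open.

```python
def compare_growth(n, c=0.5):
    """Return (n**2, 2**(c*n)) as rough operation counts, ignoring constants."""
    polynomial = n ** 2
    exponential = 2 ** (c * n)
    return polynomial, exponential

poly, expo = compare_growth(100)
print(f"N^2      = {poly:,}")
print(f"2^(N/2)  = {expo:,.0f}")
print(f"2^N      = {compare_growth(100, c=1.0)[1]:.3e}")
```

Even with the smaller exponent, the exponential estimate is about eleven orders of magnitude larger than the polynomial one at $N = 100$.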
A quantum computer is a computer whose bits (called qbits) have quantum state, and are operated on with quantum operations.
1. What's a quantum operation? [If it's important]
In this question, are we asking how long it takes to solve for any instance of an $N$ variable SAT? (i.e., not just the average or something else)
1. Yes, the algorithm needs to be fast in all cases.
Important claim: In the black box model, whether the computer is quantum or not changes the exponent (the constant $c$ ), but only by a moderate factor. In the simplest case this factor is 2. It is unlikely that quantum vs. not changes the runtime from exponential $2^{cn}$ to polynomial $n^2$ .
1. I disagree.
2. [if relevant] what is the "black box model"?
  1. The black box model is where the algorithm is given the circuit as a black box which accepts values of variables and spits out the circuit's value, but where the algorithm can't peer inside the box to see the circuit. In this case, the provable complexity of SAT is $2^{n/2}$ for quantum computers (using Grover's algorithm).
  2. Red, do you agree with the claim about provable complexity of SAT from quantum computers?
  3. No. Grover's algorithm solves the black box case in time $2^{n/2}$ , but Shor's solves it in time $n^2$ .
3. [if not covered below] why is it unlikely that quantum vs. not changes the runtime to this degree?
  1. Because we have a proof that in the black box model classical -> quantum changes $c$ from 1 to $1/2$ .
For Red, why can a quantum computer solve an $N$ variable SAT so quickly?
1. There is a particular quantum algorithm (Shor's) that solves SAT quickly. I think we should focus on debating whether Shor's algorithm in fact solves SAT quickly (since this will settle the issue and is easier to argue about than impossibility arguments).
2. The key question is how quickly Shor's algorithm solves SAT. Blue: Do you agree that Shor's solves SAT quickly?
3. I do not agree, but I also disagree with the metapoint that the argument should focus on Shor. We have a proof of the quantum complexity of SAT in the black box case, so the important thing to discuss is whether white boxing reduces the complexity further from $2^{n/2}$ , not details of Shor.
4. I disagree that quantum computers can't solve the black box case (Shor's algorithm solves SAT in the black box case).
5. Blue, what is white boxing?
  1. Black box means you can't look at the circuit (inside the box), white box means you can look inside. This is not important anymore since Red has agreed to consider the black box case.
6. Blue: Are you claiming that quantum computers can't solve the black box case (re: Red's response)?
  1. Yes.
  2. I am confused by this. It sounded like you thought that they could solve the black box case but were debating about the time it takes.
  3. By "can solve" we both mean "can solve in time $N^2$".
For Blue, why is the answer not proven and why do we think it would take so long?
1. In the classical computer case, whether we can solve SAT quickly or not is the P vs. NP question, which is still unresolved (though most computer scientists believe P $\ne$ NP). Indeed, the majority view is that there is no significantly faster algorithm than checking all possible assignments to the input variables, and there are $2^n$ of these.
2. Why does moving to a quantum computer help at all? Are there reasons you can give for the claim that the improvement is modest?
3. Grover's algorithm (a quantum algorithm) takes $2^{n/2}$ time, vs. $2^n$ for classical. That's the evidence for a modest speedup.
4. Can you explain why using a quantum algorithm helps?
  1. A classical computer with $n$ bits has $2^n$ possible values, but a quantum computer with $n$ qbits has a larger state space: it has a complex amplitude (roughly a probability) for each of the $2^n$ classical values, and thus each operation does more.
5. Why is the evidence of a single algorithm that takes this time strong evidence for such a high lower bound on time?
  1. When I said Grover there I was including the proof that Grover is optimal.
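
The classical baseline mentioned in this exchange, checking all $2^n$ assignments, can be sketched as follows. This is an illustration outside the debate: the function names and the example circuit are hypothetical, and the circuit is treated as an opaque callable in the spirit of the black box model.

```python
from itertools import product

def brute_force_sat(circuit, num_inputs):
    """Return a satisfying assignment, or None, by trying all 2**num_inputs inputs."""
    for assignment in product([False, True], repeat=num_inputs):
        if circuit(*assignment):
            return assignment
    return None

# Hypothetical 3-input circuit built from "and", "or", and "not" gates.
def example_circuit(x, y, z):
    return (x or y) and (not z) and (not x)

print(brute_force_sat(example_circuit, 3))
print(brute_force_sat(lambda x: x and not x, 1))
```

In the worst case the loop evaluates the circuit on every one of the $2^n$ assignments, which is where the classical exponent $c = 1$ in the discussion comes from.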
There are two ways we can continue. One is to discuss Shor's algorithm, and the other is to discuss the proof that Grover ( $2^{n/2}$ time) is optimal. Does the judge have a preference?
1. I agree those seem like the two ways to continue. I think in general algorithms are simpler than lower bounds, but this algorithm is a little bit complicated.
2. I'm fine going with Shor. I claim that no such algorithm exists, so I'll let Red describe it.
3. Let's go with Shor.
The rough outline of the algorithm is as follows:
1. Define $F(x) =$ " $x$ is all zeros, or $x$ is a satisfying input"
  1. What do you mean a satisfying input?
  2. An input that makes the circuit evaluate to true
2. Start in the all zeros state.
3. Allow $x$ to slowly diffuse into other possible states
  1. Flagging that I don't totally understand this in case it's important later
  2. By "diffuse" I mean that at every point in time, each bit of $x$ has some small probability of flipping.
4. Constantly evaluate $F$ .
5. Use the fact that a quantum watched pot never boils to infer that $F$ is always true
  1. There is no such fact with this implication. There might be only one satisfying input, and as $x$ diffuses the average value of $F(x)$ will decrease and eventually be exponentially small.
  2. I mean the following: suppose you have a quantum system with two states $A$ and $B$, and it has a small probability of moving from state $A$ to state $B$ at each point in time. Then if you keep checking whether it's in state $A$ or state $B$ frequently enough, it will never move from state $A$ to state $B$. In this case, state $A$ is "$x$ is zero or an input that makes the circuit true" and state $B$ is "a nonzero input that makes the circuit false".
  3. Ah, I agree with that fact, and had missed the "constantly evaluate $F$". So I agree that $F(x)$ will stay true with high probability, but forcing it to be true will prevent it from diffusing outwards if the satisfying assignment is far from the all zero vector. Thus, the distribution will stay focused on the all zero vector in the worst case.
  4. Okay my intuition here is that in a case with one satisfying assignment this will be extremely hard to find and that makes me question the usefulness of something that always evaluates to true. But what does it mean for it to diffuse outwards?
  5. Intuitions from classical physics are extremely unreliable; there is a critical (but subtle) sense in which a quantum computer simultaneously acts on every possible state at once. There are many cases that make this intuitive weirdness very clear, so Blue should be willing to grant that classical intuitions are very unreliable.
  6. Many classical intuitions are unreliable, but not your intuition in this case. You should ask Red what the distribution looks like "halfway" through the algorithm if the unique satisfying assignment is the all one vector (furthest away from the starting all zero vector). Since $F(x)$ is always close to 1 in expectation, it must always be close to the all zero distribution.
  7. Red: The general claim about intuitions from physics seems weak relative to alternatives I'd expect of you if you had a compelling argument. The key claim is the claim about simultaneity. If this is doing work why wouldn't we expect an almost immediate solution?
  8. The strongest argument is the calculation that this process works, though it's a claim that depends on quantum mechanics (but we'll have to get into it). I don't have a simple argument that doesn't depend on quantum or addresses the intuition more directly.
  9. If Blue wants to defend this intuition as a reason for skepticism, I would want to point to the easiest obviously wrong implications of that intuition.
  10. I suspect we should just look at object level arguments for this procedure working.
  11. I don't want to totally reject the classical intuition, I'm happy to focus on the object level and just admit the intuition as evidence against.
6. If we wait for $N$ time, our variables will be uniformly random between "all zeros" and "satisfying input"
7. Measure, and with probability 1/2 we get a satisfying input if one exists
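
The "quantum watched pot" fact invoked in step 5, which both debaters accept, is usually called the quantum Zeno effect. The sketch below is outside the debate and uses a toy model that is an assumption, not taken from the transcript: a two-state system that would rotate fully from $A$ to $B$ if left alone, with the rotation split into equal steps and a check after each step. The function name and parameters are hypothetical.

```python
import math

def survival_probability(num_checks, total_angle=math.pi / 2):
    """Probability that every one of num_checks checks finds the system in state A.

    Toy model: each step rotates the state by total_angle / num_checks, so a
    check right after a step finds A with probability cos(step)**2, and the
    system restarts from A whenever the check succeeds.
    """
    step = total_angle / num_checks
    return math.cos(step) ** (2 * num_checks)

for k in (1, 10, 100, 1000):
    print(f"{k:5d} checks: P(still in A) = {survival_probability(k):.4f}")
```

More frequent checking pushes the survival probability toward 1. This only illustrates the agreed fact; it does not bear on whether the algorithm reaches a satisfying input, which is the point in dispute.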
Disbelieve his "believe the calculation" argument. Red: What does the distribution look like halfway through in the case I mentioned?
1. Initially the state is entirely the zero state. After one step the state is $1 - 2^{-2N}$ of the zero state plus $2^{-2N}$ of the ones state. The amplitude on the ones state increases exponentially over subsequent steps, reaching $2^{-N}$ after $N$ steps and $1/2$ after $2N$ steps.
2. Red is telling the truth about the numbers after 1 step. This step diffuses out by a $2^{-n}$ amount, and there are $2^n$ options, so a $2^{-2n}$ portion ends up on all ones. At this point I need to have worked through the calculation to know the next move in detail, but fundamentally what will happen is that mass will diffuse away from the all one vector fast enough to make the speed not work. However, I have to have worked through the calculation to know what specifically to disagree with: he's correct that it's about the calculation.
3. Blue: If the calculation works as Red says (a) does this mean the algorithm works as he says and (b) if so, does this establish his claim?
4. Yes and yes.
5. Can both of you show me the results of the calculation?
6. Yes, but I've never done it, so I will need an adjournment.
We decided to continue the debate without showing the calculation.
Some quick quantum background: quantum mechanics has a complex amplitude on each state, and the probability of a state is $\textrm{amplitude}^2$ . Thus, at any step of the algorithm, we have amplitude $\alpha$ on all ones (assuming that's the solution), and amplitude $\sqrt{1 - \alpha^2}$ on all zeros. Initially $\alpha = 0$ . Red's algorithm is wrong because it only increases $\alpha$ by a roughly additive exponentially small amount at each step, so it takes exponentially many steps to get $\alpha$ high enough.
1. Blue: Do I need to know what amplitude refers to here?
2. You can treat them as abstract complex numbers, though they are physically and philosophically real.
3. Red: Response?
4. It's not the case that the amplitude on the solution increases by an exponentially small additive amount each step. Instead it gets multiplied by a small factor each step.
5. Can you write down the specific operator used for diffusion, and your calculation that gives a multiplicative increase?
6. The amplitude flow from the 0 to $x$ is $\epsilon/n \times 2^{-n}$ in step 1 and $\epsilon/n \times \textrm{amplitude}(x)$ in subsequent steps, and the flow backwards is 0 in step 1 and $\epsilon/n \times \textrm{amplitude}(x)$ in subsequent steps, where $\epsilon$ is some small constant such that choosing smaller $\epsilon$ makes it take longer but makes the algorithm more likely to succeed.
7. Are you claiming those are exact values?
8. Yes.
9. The formula for subsequent steps implies the initial flow is 0. Why is the initial step different?
10. We are free to choose the quantities in order to make the algorithm work, that's how algorithms work.
11. The algorithm description implied a constant diffusion operator. I request the specific operator or changing operators as a unitary matrix.
  1. The preceding description was way more informal than spelling out the dependence of the diffusion operator on time, as is necessary since the full algorithm description would use our whole word budget...
12. The description in my previous line is a complete description of the update operator (it acts symmetrically over all $x$ ).
13. What does "amplitude flow" mean, given that total amplitude is not conserved by quantum mechanics (only the sum of squared magnitudes of amplitudes is preserved)?
14. There's a lot going on here that's over my head. I think we need to either agree on a piece of evidence that would bottom this out, or do a quick detour through the Grover proof, or agree on a simple claim to focus on.
15. The properties of Red's operator contradict quantum mechanics. Operators here are unitary matrices, and Red is refusing to write down his matrix. In particular, he is using terms like "amplitude flow" that do not have precise and unambiguous meanings. Since he claims a concrete algorithm, describing the matrix should be easy.
16. The matrix is the composition of a large number of individual gates, the full matrix doesn't generally have a compact representation so we shouldn't be "writing down the matrix."
17. Red claims that he gave the exact "amplitude flow". If "amplitude flow" was a real concept, this would be equivalent to giving the exact matrix, contradicting his claim that the full matrix lacks a compact representation.
18. I can define the matrix precisely+implicitly as "the matrix for which the amplitude in state 0 decreases by $\epsilon/n \times \textrm{amplitude}(x)$ and the amplitude in state $x$ increases by $\epsilon/n \times \textrm{amplitude}(x)$".
  1. If you're giving a definition without specifying an instance, I'd expect this to be common practice. One option is for you to give evidence that this is done (if available on Wikipedia)
  2. I claim that the majority of unitary evolution matrices described in Wikipedia will be described implicitly in this way.
  3. I claim that this matrix is exponentially big, such that it would be impossible to write down the full matrix in a reasonable amount of time.
  4. This kind of implicit definition is fine practice.
  5. Cool.
19. The segment below contains a reference to Wikipedia. Please do not follow the link. The line from Wikipedia has been pasted in for you.
20. See the first line of https://en.wikipedia.org/wiki/Unitarity_(physics), which says that the sum of probabilities is conserved. Since Red says an equal amount is subtracted from one amplitude and added to another, he is implying the sum of amplitudes is conserved. But probabilities are the squared magnitudes of amplitudes, so this is a contradiction.
21. "In quantum physics, unitarity is a restriction on the allowed evolution of quantum systems that ensures the sum of probabilities of all possible outcomes of any event always equals 1."
22. There is no reason the sum of amplitudes can't be conserved as well by any particular matrix (though I agree they aren't always conserved).
23. Red: Why isn't your parenthetical inconsistent with the first line that Blue cited?
24. The line Blue cited says that sum of probabilities is conserved, but sum of amplitudes isn't always conserved. (Though the sum of amplitudes is conserved in this case.)
25. Say we have a very tiny but nonzero amplitude $\alpha$ on all ones and the remaining amplitude $\sqrt{1-\alpha^2}$ on all zeros. After one of Red's imaginary steps, the amplitude on all ones is $\alpha (1 + \epsilon/n)$ , and the amplitude on all zeros is $\sqrt{1-\alpha^2} + \alpha \epsilon/n$ . But these two numbers do not sum to one when squared: contradiction.
26. After each step of the algorithm we perform a measurement and renormalize the amplitudes.
27. The contradiction occurs before the measurement. The laws of physics must hold for all points in time.
This is the point at which we reached the word/time limit for the debate.

[christiano2017human] Deep reinforcement learning from human preferences [PDF]
Christiano, P.F., Leike, J., Brown, T., Martic, M., Legg, S. and Amodei, D., 2017. Advances in Neural Information Processing Systems, pp. 4299--4307.

[tversky1974judgment] Judgment under uncertainty: heuristics and biases [link]
Tversky, A. and Kahneman, D., 1974. Science, Vol 185(4157), pp. 1124--1131. American association for the advancement of science.

[hewstone2002intergroup] Intergroup bias [link]
Hewstone, M., Rubin, M. and Willis, H., 2002. Annual Review of Psychology, Vol 53(1), pp. 575--604. Annual Reviews 4139 El Camino Way, PO Box 10139, Palo Alto, CA 94303-0139, USA.

[irving2018debate] AI safety via debate [PDF]
Irving, G., Christiano, P. and Amodei, D., 2018. arXiv preprint arXiv:1805.00899.

[christiano2018amplification] Supervising strong learners by amplifying weak experts [PDF]
Christiano, P., Shlegeris, B. and Amodei, D., 2018. arXiv preprint arXiv:1810.08575.

[ibarz2018] Reward learning from human preferences and demonstrations in Atari [PDF]
Ibarz, B., Leike, J., Pohlen, T., Irving, G., Legg, S. and Amodei, D., 2018. Advances in Neural Information Processing Systems.

[leike2017gridworlds] AI safety gridworlds [PDF]
Leike, J., Martic, M., Krakovna, V., Ortega, P.A., Everitt, T., Lefrancq, A., Orseau, L. and Legg, S., 2017. arXiv preprint arXiv:1711.09883.

[kelley1983wizard] An empirical methodology for writing user-friendly natural language computer applications [link]
Kelley, J.F., 1983. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 193--196. ACM. DOI: 10.1145/800045.801609

[ought2018factored] Factored Cognition [link]
Stuhlmüller, A., 2018.

[evans2016inconsistent] Learning the Preferences of Ignorant, Inconsistent Agents [PDF]
Evans, O., Stuhlmuller, A. and Goodman, N.D., 2016. AAAI, pp. 323--329.

[laskey2017comparing] Comparing human-centric and robot-centric sampling for robot deep learning from demonstrations [PDF]
Laskey, M., Chuck, C., Lee, J., Mahler, J., Krishnan, S., Jamieson, K., Dragan, A. and Goldberg, K., 2017. Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 358--365.

[wallach2016computational] Computational Social Science: Towards a collaborative future [PDF]
Wallach, H., 2016. Computational Social Science, pp. 307. Cambridge University Press.

[mitchell2018fairness] Mirror Mirror: Reflections on Quantitative Fairness [link]
Mitchell, S. and Shadlen, J., 2018.

[sep-moral-anti-realism] Moral Anti-Realism [link]
Joyce, R., 2016. The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University.

[buolamwini2018gender] Gender shades: Intersectional accuracy disparities in commercial gender classification [HTML]
Buolamwini, J. and Gebru, T., 2018. Conference on Fairness, Accountability and Transparency, pp. 77--91.

[haidt2000moral] Moral dumbfounding: When intuition finds no reason
Haidt, J., Bjorklund, F. and Murphy, S., 2000. Unpublished manuscript, University of Virginia.

[biyik2018batch] Batch active preference-based learning of reward functions [PDF]
Bıyık, E. and Sadigh, D., 2018. arXiv preprint arXiv:1810.04303.

[bahdanau2018learning] Learning to understand goal specifications by modelling reward [PDF]
Bahdanau, D., Hill, F., Leike, J., Hughes, E., Kohli, P. and Grefenstette, E., 2018. arXiv preprint arXiv:1806.01946.

[radford2018language] Improving language understanding by generative pre-training [PDF]
Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I., 2018.

[kahneman2011thinking] Thinking, fast and slow [link]
Kahneman, D. and Egan, P., 2011. , Vol 1. Farrar, Straus and Giroux New York.

[campbell2002deepblue] Deep Blue [link]
Campbell, M., Hoane, A. and Hsu, F., 2002. Artificial Intelligence, Vol 134(1), pp. 57 - 83.

[silver2017alphazero] Mastering chess and shogi by self-play with a general reinforcement learning algorithm [PDF]
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T. and others,, 2017. arXiv preprint arXiv:1712.01815.

[bicchieri2017deviant] Deviant or Wrong? The Effects of Norm Information on the Efficacy of Punishment [PDF]
Bicchieri, C., Dimant, E., Xiao, E. and others,, 2017.

[henrich2010weirdest] The weirdest people in the world? [PDF]
Henrich, J., Heine, S.J. and Norenzayan, A., 2010. Behavioral and brain sciences, Vol 33(2-3), pp. 61--83. Cambridge University Press.

[goodman1983fact] Fact, fiction, and forecast
Goodman, N., 1983. Harvard University Press.

[rawls2009theory] A theory of justice [link]
Rawls, J., 2009. Harvard university press.

[sugden2015looking] Looking for a psychology for the inner rational agent [PDF]
Sugden, R., 2015. Social Theory and Practice, Vol 41(4), pp. 579--598.

[greene2002and] How (and where) does moral judgment work? [PDF]
Greene, J. and Haidt, J., 2002. Trends in cognitive sciences, Vol 6(12), pp. 517--523. Elsevier.

[leike2018scalable] Scalable agent alignment via reward modeling: a research direction [PDF]
Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V. and Legg, S., 2018. arXiv preprint arXiv:1811.07871.

[openai2018five] OpenAI Five [link]
OpenAI,, 2018.

[christian2011human] The Most Human Human: What Talking with Computers Teaches Us About What It Means to Be Alive [link]
Christian, B., 2011. Knopf Doubleday Publishing Group.

[paluck2016overcome] How to overcome prejudice [PDF]
Paluck, E.L., 2016. Science, Vol 352(6282), pp. 147--147. American Association for the Advancement of Science.

[flynn2017nature] The nature and origins of misperceptions: Understanding false and unsupported beliefs about politics [link]
Flynn, D., Nyhan, B. and Reifler, J., 2017. Political Psychology, Vol 38, pp. 127--150. Wiley Online Library.

[falk2018persuasion] Persuasion, influence, and value: Perspectives from communication and social neuroscience [link]
Falk, E. and Scholz, C., 2018. Annual review of psychology, Vol 69.

[mellers2015identifying] Identifying and cultivating superforecasters as a method of improving probabilistic predictions [link]
Mellers, B., Stone, E., Murray, T., Minster, A., Rohrbaugh, N., Bishop, M., Chen, E., Baker, J., Hou, Y., Horowitz, M., Ungar, L. and Tetlock, P., 2015. Perspectives on Psychological Science, Vol 10(3), pp. 267--281. SAGE Publications Sage CA: Los Angeles, CA.

[tetlock2016superforecasting] Superforecasting: The art and science of prediction
Tetlock, P.E. and Gardner, D., 2016. Random House.

[hadfield2016cooperative] Cooperative inverse reinforcement learning [PDF]
Hadfield-Menell, D., Russell, S.J., Abbeel, P. and Dragan, A., 2016. Advances in neural information processing systems, pp. 3909--3917.

[hadfield2017inverse] Inverse reward design [PDF]
Hadfield-Menell, D., Milli, S., Abbeel, P., Russell, S.J. and Dragan, A., 2017. Advances in Neural Information Processing Systems, pp. 6765--6774.

[schopenhauer2013art] The art of being right [link]
Schopenhauer, A., 1896.

[bertrand2004emily] Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination [PDF]
Bertrand, M. and Mullainathan, S., 2004. American economic review, Vol 94(4), pp. 991--1013.

[kahneman1979prospect] Prospect theory: An analysis of decisions under risk [link]
Kahneman, D., 1979. Econometrica, Vol 47, pp. 278.

[tversky1992advances] Advances in prospect theory: Cumulative representation of uncertainty
Tversky, A. and Kahneman, D., 1992. Journal of Risk and uncertainty, Vol 5(4), pp. 297--323. Springer.

[erev2017anomalies] From anomalies to forecasts: Toward a descriptive model of decisions under risk, under ambiguity, and from experience [link]
Erev, I., Ert, E., Plonsky, O., Cohen, D. and Cohen, O., 2017. Psychological review, Vol 124(4), pp. 369. American Psychological Association.

[chen2018cicero] Cicero: Multi-Turn, Contextual Argumentation for Accurate Crowdsourcing [PDF]
Chen, Q., Bragg, J., Chilton, L.B. and Weld, D.S., 2018. arXiv preprint arXiv:1810.10733.

[hahn2007rationality] The rationality of informal argumentation: A Bayesian approach to reasoning fallacies [link]
Hahn, U. and Oaksford, M., 2007. Psychological review, Vol 114(3), pp. 704. American Psychological Association.

[bornstein2001rationality] Rationality in medical decision making: a review of the literature on doctors’ decision-making biases [link]
Bornstein, B.H. and Emler, A.C., 2001. Journal of evaluation in clinical practice, Vol 7(2), pp. 97--107. Wiley Online Library.

[tetlock2017expert] Expert political judgment: How good is it? How can we know? [HTML]
Tetlock, P.E., 2017. Princeton University Press.

[chi2006two] Two approaches to the study of experts’ characteristics [PDF]
Chi, M.T.H., 2006. The Cambridge Handbook of Expertise and Expert Performance, pp. 21--30.

[larrick2004debiasing] Debiasing [link]
Larrick, R.P., 2004. Blackwell Handbook of Judgment and Decision Making, pp. 316--338. Wiley Online Library.

[dwyer2012evaluation] An evaluation of argument mapping as a method of enhancing critical thinking performance in e-learning environments [link]
Dwyer, C.P., Hogan, M.J. and Stewart, I., 2012. Metacognition and Learning, Vol 7(3), pp. 219--244. Springer.

[tetlock2014forecasting] Forecasting tournaments: Tools for increasing transparency and improving the quality of debate [link]
Tetlock, P.E., Mellers, B.A., Rohrbaugh, N. and Chen, E., 2014. Current Directions in Psychological Science, Vol 23(4), pp. 290--295. SAGE Publications.

[gigerenzer1991make] How to make cognitive illusions disappear: Beyond "heuristics and biases" [link]
Gigerenzer, G., 1991. European review of social psychology, Vol 2(1), pp. 83--115. Taylor & Francis.

[graham2009liberals] Liberals and conservatives rely on different sets of moral foundations [PDF]
Graham, J., Haidt, J. and Nosek, B.A., 2009. Journal of personality and social psychology, Vol 96(5), pp. 1029. American Psychological Association.

[goel2011negative] Negative emotions can attenuate the influence of beliefs on logical reasoning [link]
Goel, V. and Vartanian, O., 2011. Cognition and Emotion, Vol 25(1), pp. 121--131. Taylor & Francis.

[list2001epistemic] Epistemic democracy: Generalizing the Condorcet jury theorem [link]
List, C. and Goodin, R.E., 2001. Journal of political philosophy, Vol 9(3), pp. 277--306. Wiley Online Library.

[list2002aggregating] Aggregating sets of judgments: An impossibility result [PDF]
List, C. and Pettit, P., 2002. Economics & Philosophy, Vol 18(1), pp. 89--110. Cambridge University Press.

[rowe1999delphi] The Delphi technique as a forecasting tool: issues and analysis [link]
Rowe, G. and Wright, G., 1999. International Journal of Forecasting, Vol 15(4), pp. 353--375. DOI: https://doi.org/10.1016/S0169-2070(99)00018-7

[openai2018charter] OpenAI Charter [link]
OpenAI, 2018.
