Sometime around April 7, 2023, as part of a back-and-forth with Stephen Wolfram, Don Knuth posed 20 questions to ChatGPT (GPT3.5) and subsequently critiqued its responses. Knuth reported that the results ranged from good (e.g., "Of course these are extremely impressive responses, sometimes astonishingly so") to not so good ("Answer #10 reads as though it's the best answer yet. But it's almost totally wrong!").
We transformed Knuth's questions into a benchmark that can be applied to newer AI models, allowing us to observe how their responses improve over time. Additionally, the latest answers are scored by GPT4 as POOR / GOOD / GREAT, with an explanation for each score.
This benchmark was generated on May 23, 2023 using OpenAI's GPT4 model with temperature = 1, top_p = 1, and frequency_penalty = 0.
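For reference, here is a minimal sketch of how each question might be re-asked under those settings, using the pre-v1 `openai` Python library that was current in May 2023. The API-key handling is an assumption; the question shown is one of Knuth's originals.

```python
import openai

openai.api_key = "sk-..."  # assumption: in practice the key comes from env/config

# Re-ask one of Knuth's 20 questions with the settings stated above.
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a sonnet that is also a haiku."}],
    temperature=1,
    top_p=1,
    frequency_penalty=0,
)
print(response.choices[0].message.content)
```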
More information about this project is available here.
Feedback? Contact us via Twitter or GitHub issues.
Scores for the answers to Knuth's questions are shown below:
q# | Knuth | GPT4 | Human |
---|---|---|---|
1 | GOOD | GREAT | GREAT |
2 | GOOD | GREAT | GREAT |
3 | POOR | GREAT | GOOD |
4 | GOOD | GREAT | GOOD |
5 | GOOD | GREAT | GREAT |
6 | POOR | GOOD | GOOD |
7 | GREAT | GOOD | GREAT |
8 | GREAT | GREAT | GREAT |
9 | POOR | GREAT | POOR |
10 | POOR | GOOD | POOR |
11 | GOOD | GOOD | POOR |
12 | POOR | GREAT | POOR |
13 | POOR | GREAT | GREAT |
14 | GREAT | GREAT | GREAT |
15 | GOOD | GREAT | GOOD |
16 | POOR | GOOD | GOOD |
17 | POOR | GOOD | GOOD |
18 | POOR | GREAT | GREAT |
19 | GOOD | GREAT | GREAT |
20 | POOR | GREAT | GOOD |
Knuth is our interpretation of how Knuth would have scored the original GPT3.5 answer, based on his critique mapped to POOR / GOOD / GREAT. This score should not change over time.
GPT4 is GPT4's self-score of its most recent answer when asked (paraphrasing), "How well did the updated answer address Knuth's critique?"
Human is our interpretation of how well the most recent GPT4-generated answer addressed Knuth's criticisms, where applicable.
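As a rough illustration, the self-scoring step might look like the sketch below. The prompt wording is our paraphrase rather than the project's exact prompt, and the helper name is hypothetical.

```python
import openai

# Hypothetical helper illustrating the GPT4 self-scoring step described above.
def score_answer(question: str, knuth_critique: str, updated_answer: str) -> str:
    prompt = (
        f"Question: {question}\n\n"
        f"Knuth's critique of the original GPT3.5 answer: {knuth_critique}\n\n"
        f"Updated GPT4 answer: {updated_answer}\n\n"
        "How well did the updated answer address Knuth's critique? "
        "Reply with one of POOR, GOOD, or GREAT, followed by a brief explanation."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```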