About NovelQA
News
Data Description
Each data point in our dataset is represented as a dictionary with the following keys:
[
{
"QID": the question ID, which remains unchanged so that updates (made only when necessary) can be tracked,
"Aspect": the question classification by aspect, e.g., "times",
"Complexity": the question classification by complexity, e.g., "mh",
"Question": the input question,
"Options": {
"A": Option A,
"B": Option B,
"C": Option C (not applicable in several yes/no questions),
"D": Option D (not applicable in several yes/no questions)
}
},
...
]
Here is an example of a data point:
[
{
"QID": "Q0148",
"Aspect": "times",
"Complexity": "mh",
"Question": "How many times has Robert written letters to his sister?",
"Options": {
"A": "11",
"B": "9",
"C": "12",
"D": "10"
}
},
...
]
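
As a rough illustration of how a data point in this format might be consumed, the sketch below parses one record and renders it as a multiple-choice prompt. The names `RAW`, `REQUIRED_KEYS`, and `format_prompt` are hypothetical, the record is a hand-copied subset of the example's fields, and the prompt layout is an assumption for illustration rather than the format used in the official evaluation.

```python
import json

# Hypothetical in-memory copy of the example data point (subset of fields).
RAW = """
[
  {
    "QID": "Q0148",
    "Aspect": "times",
    "Question": "How many times has Robert written letters to his sister?",
    "Options": {"A": "11", "B": "9", "C": "12", "D": "10"}
  }
]
"""

# Keys this sketch relies on; an assumption, not an official schema check.
REQUIRED_KEYS = ("QID", "Aspect", "Question", "Options")


def format_prompt(record: dict) -> str:
    """Render one data point as a multiple-choice prompt string."""
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise KeyError(f"record is missing keys: {missing}")
    # Some yes/no questions carry only options A and B, so iterate over
    # whatever options are present instead of assuming four.
    lines = [record["Question"]]
    for label in sorted(record["Options"]):
        lines.append(f"{label}. {record['Options'][label]}")
    return "\n".join(lines)


records = json.loads(RAW)
prompt = format_prompt(records[0])
print(prompt)
```

Iterating over the option labels that are actually present keeps the same function usable for questions that only have options A and B.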
Contributors
Cunxiang Wang*, Ruoxi Ning*, Boqi Pan, Tonghui Wu, Qipeng Guo, Cheng Deng, Guangsheng Bao, Qian Wang, and Yue Zhang
License
This dataset is released under the Apache-2.0 License.
Citation
@misc{wang2024novelqa,
title={NovelQA: A Benchmark for Long-Range Novel Question Answering},
author={Cunxiang Wang and Ruoxi Ning and Boqi Pan and Tonghui Wu and Qipeng Guo and Cheng Deng and Guangsheng Bao and Qian Wang and Yue Zhang},
year={2024},
eprint={2403.12766},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

| Rank | Model | Acc | Context Window | Text Input by | Team |
|---|---|---|---|---|---|
| - | Human Performance | 90.00 | - | - | - |
| 🥇 1 | claude-3-sonnet-20240229-v1:0 | 53.66 | 200K | Long-context | Claude |
| 🥈 2 | gpt-4-0125-preview | 46.88 | 128K | Long-context | OpenAI |
| 🥉 3 | claude-v2:1 | 46.04 | 200K | Long-context | Claude |
| 4 | InternLM-20b | 32.37 | 200K | Long-context | InternLM |
| 5 | InternLM-7b | 30.90 | 200K | Long-context | InternLM |
| 6 | - | - | - | - | - |
| 7 | - | - | - | - | - |

| Rank | Model | Acc | Context Window | Text Input by | Team |
|---|---|---|---|---|---|
| - | Human Performance | 97.00 | - | - | - |
| 🥇 1 | gpt-4-0125-preview | 71.80 | 128K | Long-context | OpenAI |
| 🥈 2 | claude-3-sonnet-20240229-v1:0 | 71.11 | 200K | Long-context | Claude |
| 🥉 3 | GPT-4 RAG Langchain | 67.89 | 128K | RAG Langchain | Sayash Kapoor (sayashk@princeton.edu) and Benedikt Ströbl (stroebl@princeton.edu) |
| 4 | claude-v2:1 | 66.84 | 200K | Long-context | Claude |
| 5 | GPT-3.5 RAG Langchain | 56.94 | 128K | RAG Langchain | Sayash Kapoor (sayashk@princeton.edu) and Benedikt Ströbl (stroebl@princeton.edu) |
| 6 | InternLM-20b | 49.18 | 200K | Long-context | InternLM |
| 7 | InternLM-7b | 43.51 | 200K | Long-context | InternLM |

| Rank | Model | Acc | Context Window | Text Input by | Team |
|---|---|---|---|---|---|
| - | Human Performance | 90.00 | - | - | - |
| 🥇 1 | claude-3-sonnet-20240229-v1:0 | 46.96 | 200K | Long-context | Claude |
| 🥈 2 | gpt-4-0125-preview | 45.76 | 128K | Long-context | OpenAI |
| 🥉 3 | claude-v2:1 | 44.32 | 200K | Long-context | Claude |
| 4 | InternLM-20b | 30.04 | 200K | Long-context | InternLM |
| 5 | InternLM-7b | 28.07 | 200K | Long-context | InternLM |
| 6 | - | - | - | - | - |
| 7 | - | - | - | - | - |

| Rank | Model | Acc | Context Window | Text Input by | Team |
|---|---|---|---|---|---|
| - | Human Performance | 97.00 | - | - | - |
| 🥇 1 | gpt-4-0125-preview | 70.44 | 128K | Long-context | OpenAI |
| 🥈 2 | claude-3-sonnet-20240229-v1:0 | 67.15 | 200K | Long-context | Claude |
| 🥉 3 | claude-v2:1 | 65.92 | 200K | Long-context | Claude |
| 4 | InternLM-20b | 45.87 | 200K | Long-context | InternLM |
| 5 | InternLM-7b | 40.89 | 200K | Long-context | InternLM |