About NovelQA
Data Description
Each data point in our dataset is represented as a dictionary with the following keys:
    [
      {
        "QID": the question ID, which stays fixed so that updates can be tracked (updates happen only when necessary),
        "Aspect": the question's classification by aspect, e.g., "times",
        "Complexity": the question's classification by complexity, e.g., "mh",
        "Question": the input question,
        "Options": {
          "A": Option A,
          "B": Option B,
          "C": Option C (not applicable to some yes/no questions),
          "D": Option D (not applicable to some yes/no questions)
        }
      },
      ...
    ]
Here is an example of a data point:
    [
      {
        "QID": "Q0148",
        "Aspect": "times",
        "Complex": "mh",
        "Question": "How many times has Robert written letters to his sister?",
        "Options": {
          "A": "11",
          "B": "9",
          "C": "12",
          "D": "10"
        }
      },
      ...
    ]
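The record layout above can be checked and turned into a multiple-choice prompt with a few lines of Python. This is a minimal sketch, not the official loader: the helper names `validate_item` and `format_prompt` are hypothetical, and because the key list says "Complexity" while the example uses "Complex", the sketch assumes either spelling may appear and accepts both.

```python
from typing import Any

# Keys that every data point is expected to carry, per the description above.
REQUIRED_KEYS = ("QID", "Aspect", "Question", "Options")


def validate_item(item: dict[str, Any]) -> dict[str, Any]:
    """Check one data point and return a normalized copy (hypothetical helper)."""
    missing = [key for key in REQUIRED_KEYS if key not in item]
    if missing:
        raise ValueError(f"missing keys: {missing}")

    # Assumption: the complexity label may be stored under either key name.
    complexity = item.get("Complexity", item.get("Complex"))

    options = item["Options"]
    # Yes/no questions may omit options C and D, so only A and B are required.
    if not {"A", "B"} <= set(options):
        raise ValueError("options A and B are required")

    return {
        "qid": item["QID"],
        "aspect": item["Aspect"],
        "complexity": complexity,
        "question": item["Question"],
        "options": dict(options),
    }


def format_prompt(item: dict[str, Any]) -> str:
    """Render the question followed by its options, one option per line."""
    lines = [item["Question"]]
    for label in sorted(item["Options"]):
        lines.append(f"{label}. {item['Options'][label]}")
    return "\n".join(lines)


example = {
    "QID": "Q0148",
    "Aspect": "times",
    "Complex": "mh",
    "Question": "How many times has Robert written letters to his sister?",
    "Options": {"A": "11", "B": "9", "C": "12", "D": "10"},
}

record = validate_item(example)
prompt = format_prompt(example)
```

Calling `print(prompt)` shows the question followed by the four labeled options; a yes/no item with only options A and B passes through the same code unchanged.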
Contributors
Cunxiang Wang*, Ruoxi Ning*, Boqi Pan, Tonghui Wu, Qipeng Guo, Cheng Deng, Guangsheng Bao, Qian Wang, and Yue Zhang
License
This dataset is released under the Apache-2.0 License.
Citation
    @misc{wang2024novelqa,
      title={NovelQA: A Benchmark for Long-Range Novel Question Answering},
      author={Cunxiang Wang and Ruoxi Ning and Boqi Pan and Tonghui Wu and Qipeng Guo and Cheng Deng and Guangsheng Bao and Qian Wang and Yue Zhang},
      year={2024},
      eprint={2403.12766},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }

| Rank | Model | Acc | Context Window | Text Input by | Team |
|---|---|---|---|---|---|
| - | Human Performance | 90.00 | - | - | - |
| 🥇 1 | claude-3-sonnet-20240229-v1:0 | 53.66 | 200K | Long-context | Claude |
| 🥈 2 | gpt-4-0125-preview | 46.88 | 128K | Long-context | OpenAI |
| 🥉 3 | claude-v2:1 | 46.04 | 200K | Long-context | Claude |
| 4 | InternLM-20b | 32.37 | 200K | Long-context | InternLM |
| 5 | InternLM-7b | 30.90 | 200K | Long-context | InternLM |
| 6 | - | - | - | - | - |
| 7 | - | - | - | - | - |

| Rank | Model | Acc | Context Window | Text Input by | Team |
|---|---|---|---|---|---|
| - | Human Performance | 97.00 | - | - | - |
| 🥇 1 | gpt-4-0125-preview | 71.80 | 128K | Long-context | OpenAI |
| 🥈 2 | claude-3-sonnet-20240229-v1:0 | 71.11 | 200K | Long-context | Claude |
| 🥉 3 | GPT-4 RAG Langchain | 67.89 | 128K | RAG Langchain | Sayash Kapoor (sayashk@princeton.edu) and Benedikt Ströbl (stroebl@princeton.edu) |
| 4 | claude-v2:1 | 66.84 | 200K | Long-context | Claude |
| 5 | GPT-3.5 RAG Langchain | 56.94 | 128K | RAG Langchain | Sayash Kapoor (sayashk@princeton.edu) and Benedikt Ströbl (stroebl@princeton.edu) |
| 6 | InternLM-20b | 49.18 | 200K | Long-context | InternLM |
| 7 | InternLM-7b | 43.51 | 200K | Long-context | InternLM |

| Rank | Model | Acc | Context Window | Text Input by | Team |
|---|---|---|---|---|---|
| - | Human Performance | 90.00 | - | - | - |
| 🥇 1 | claude-3-sonnet-20240229-v1:0 | 46.96 | 200K | Long-context | Claude |
| 🥈 2 | gpt-4-0125-preview | 45.76 | 128K | Long-context | OpenAI |
| 🥉 3 | claude-v2:1 | 44.32 | 200K | Long-context | Claude |
| 4 | InternLM-20b | 30.04 | 200K | Long-context | InternLM |
| 5 | InternLM-7b | 28.07 | 200K | Long-context | InternLM |
| 6 | - | - | - | - | - |
| 7 | - | - | - | - | - |

| Rank | Model | Acc | Context Window | Text Input by | Team |
|---|---|---|---|---|---|
| - | Human Performance | 97.00 | - | - | - |
| 🥇 1 | gpt-4-0125-preview | 70.44 | 128K | Long-context | OpenAI |
| 🥈 2 | claude-3-sonnet-20240229-v1:0 | 67.15 | 200K | Long-context | Claude |
| 🥉 3 | claude-v2:1 | 65.92 | 200K | Long-context | Claude |
| 4 | InternLM-20b | 45.87 | 200K | Long-context | InternLM |
| 5 | InternLM-7b | 40.89 | 200K | Long-context | InternLM |