NovelQA

A Benchmark for Long-Range Novel Question Answering

Data Description

Each data point in our dataset is represented as a dictionary with the following keys:

[
{
    "QID": the QID which remains unchanged for tracking updates (only happen if necessary),
    "Aspect": the question classification in 'aspect', e.g., "times",
    "Complexity": the question classification in complexity, e.g., "mh",
    "Question": the input question,
    "Options": {
        "A": Option A,
        "B": Option B,
        "C": Option C (not applicable in several yes/no questions),
        "D": Option D (not application in several yes/no questions)
    },
},
...
]
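
For illustration, the schema above can be mirrored with Python type hints. This is a minimal sketch rather than an official interface; the field and class names below simply follow the description above.

from typing import NotRequired, TypedDict  # NotRequired needs Python 3.11+

class AnswerOptions(TypedDict):
    A: str
    B: str
    C: NotRequired[str]  # not applicable in several yes/no questions
    D: NotRequired[str]  # not applicable in several yes/no questions

class NovelQAItem(TypedDict):
    QID: str         # stable question ID, e.g., "Q0148"
    Aspect: str      # aspect classification, e.g., "times"
    Complexity: str  # complexity classification, e.g., "mh"
    Question: str    # the input question
    Options: AnswerOptions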
           

Here is an example of a data point:

[
{
    "QID": "Q0148",
    "Aspect": "times",
    "Complex": "mh",
    "Question": "How many times has Robert written letters to his sister?",
    "Options": {
        "A": "11",
        "B": "9",
        "C": "12",
        "D": "10"
    },
},
...
]
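
Assuming the data is distributed as a JSON file (the file name "novelqa_demo.json" below is a placeholder for illustration, not an official file name), a data point can be loaded and turned into a multiple-choice prompt roughly as follows:

import json

# Placeholder file name for illustration; substitute the actual data file.
with open("novelqa_demo.json", encoding="utf-8") as f:
    data = json.load(f)

item = data[0]
# Assemble a simple multiple-choice prompt from one data point.
option_lines = "\n".join(f"{label}. {text}" for label, text in item["Options"].items())
prompt = f"{item['Question']}\n{option_lines}\nAnswer with the letter of the correct option."
print(item["QID"], item["Aspect"], item["Complexity"])
print(prompt)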
            

Contributors

Cunxiang Wang*, Ruoxi Ning*, Boqi Pan, Tonghui Wu, Qipeng Guo, Cheng Deng, Guangsheng Bao, Qian Wang, and Yue Zhang

License

This dataset is released under the Apache-2.0 License.

Citation

@misc{wang2024novelqa,
  title={NovelQA: A Benchmark for Long-Range Novel Question Answering}, 
  author={Cunxiang Wang and Ruoxi Ning and Boqi Pan and Tonghui Wu and Qipeng Guo and Cheng Deng and Guangsheng Bao and Qian Wang and Yue Zhang},
  year={2024},
  eprint={2403.12766},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
  
Leaderboard - Generative Setting (All novels)

Rank  Model                           Acc (%)  Context Window  Text Input     By Team
-     Human Performance               90.00    -               -              -
1     claude-3-sonnet-20240229-v1:0   53.66    200K            Long-context   Claude
2     gpt-4-0125-preview              46.88    128K            Long-context   OpenAI
3     claude-v2:1                     46.04    200K            Long-context   Claude
4     InternLM-20b                    32.37    200K            Long-context   InternLM
5     InternLM-7b                     30.90    200K            Long-context   InternLM
Leaderboard - MultiChoice Setting (All novels)

Rank  Model                           Acc (%)  Context Window  Text Input     By Team
-     Human Performance               97.00    -               -              -
1     gpt-4-0125-preview              71.80    128K            Long-context   OpenAI
2     claude-3-sonnet-20240229-v1:0   71.11    200K            Long-context   Claude
3     GPT-4 RAG Langchain             67.89    128K            RAG Langchain  Sayash Kapoor (sayashk@princeton.edu) and Benedikt Ströbl (stroebl@princeton.edu)
4     claude-v2:1                     66.84    200K            Long-context   Claude
5     GPT-3.5 RAG Langchain           56.94    128K            RAG Langchain  Sayash Kapoor (sayashk@princeton.edu) and Benedikt Ströbl (stroebl@princeton.edu)
6     InternLM-20b                    49.18    200K            Long-context   InternLM
7     InternLM-7b                     43.51    200K            Long-context   InternLM
Leaderboard - Generative Setting (Public domain novels)

Rank  Model                           Acc (%)  Context Window  Text Input     By Team
-     Human Performance               90.00    -               -              -
1     claude-3-sonnet-20240229-v1:0   46.96    200K            Long-context   Claude
2     gpt-4-0125-preview              45.76    128K            Long-context   OpenAI
3     claude-v2:1                     44.32    200K            Long-context   Claude
4     InternLM-20b                    30.04    200K            Long-context   InternLM
5     InternLM-7b                     28.07    200K            Long-context   InternLM
Leaderboard - MultiChoice Setting (Public domain novels)

Rank  Model                           Acc (%)  Context Window  Text Input     By Team
-     Human Performance               97.00    -               -              -
1     gpt-4-0125-preview              70.44    128K            Long-context   OpenAI
2     claude-3-sonnet-20240229-v1:0   67.15    200K            Long-context   Claude
3     claude-v2:1                     65.92    200K            Long-context   Claude
4     InternLM-20b                    45.87    200K            Long-context   InternLM
5     InternLM-7b                     40.89    200K            Long-context   InternLM
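
For reference, the Acc column above is accuracy in percent over the questions of the respective split. Below is a minimal scoring sketch for the multiple-choice setting; the gold answers and predictions are hypothetical placeholders, since gold labels are not part of the public data description above.

# Hypothetical example: gold option letters and model predictions keyed by QID.
gold = {"Q0001": "A", "Q0002": "C", "Q0003": "D", "Q0004": "B"}
predictions = {"Q0001": "A", "Q0002": "B", "Q0003": "D", "Q0004": "B"}

correct = sum(1 for qid, answer in gold.items() if predictions.get(qid) == answer)
accuracy = 100.0 * correct / len(gold)
print(f"Accuracy: {accuracy:.2f}%")  # 3 of 4 correct -> 75.00%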