NovelQA

A Benchmark for Long-Range Novel Question Answering

Data Description

Each data point in our dataset is represented as a dictionary with the following keys:

[
{
    "QID": the QID which remains unchanged for tracking updates (only happen if necessary),
    "Aspect": the question classification in 'aspect', e.g., "times",
    "Complexity": the question classification in complexity, e.g., "mh",
    "Question": the input question,
    "Options": {
        "A": Option A,
        "B": Option B,
        "C": Option C (not applicable in several yes/no questions),
        "D": Option D (not application in several yes/no questions)
    },
},
...
]
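
For illustration, the schema above can be mirrored with Python type hints. This is a minimal sketch rather than an official interface; the field and class names below simply follow the description above.

from typing import NotRequired, TypedDict  # NotRequired needs Python 3.11+

class AnswerOptions(TypedDict):
    A: str
    B: str
    C: NotRequired[str]  # not applicable in several yes/no questions
    D: NotRequired[str]  # not applicable in several yes/no questions

class NovelQAItem(TypedDict):
    QID: str         # stable question ID, e.g., "Q0148"
    Aspect: str      # aspect classification, e.g., "times"
    Complexity: str  # complexity classification, e.g., "mh"
    Question: str    # the input question
    Options: AnswerOptions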
           

Here is an example of a data point:

[
{
    "QID": "Q0148",
    "Aspect": "times",
    "Complex": "mh",
    "Question": "How many times has Robert written letters to his sister?",
    "Options": {
        "A": "11",
        "B": "9",
        "C": "12",
        "D": "10"
    },
},
...
]
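
Assuming the data is distributed as a JSON file (the file name "novelqa_demo.json" below is a placeholder for illustration, not an official file name), a data point can be loaded and turned into a multiple-choice prompt roughly as follows:

import json

# Placeholder file name for illustration; substitute the actual data file.
with open("novelqa_demo.json", encoding="utf-8") as f:
    data = json.load(f)

item = data[0]
# Assemble a simple multiple-choice prompt from one data point.
option_lines = "\n".join(f"{label}. {text}" for label, text in item["Options"].items())
prompt = f"{item['Question']}\n{option_lines}\nAnswer with the letter of the correct option."
print(item["QID"], item["Aspect"], item["Complexity"])
print(prompt)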
            

Contributors

Cunxiang Wang*, Ruoxi Ning*, Boqi Pan, Tonghui Wu, Qipeng Guo, Cheng Deng, Guangsheng Bao, Qian Wang, and Yue Zhang

License

This dataset is released under the Apache-2.0 License.

Citation

@misc{wang2024novelqa,
  title={NovelQA: A Benchmark for Long-Range Novel Question Answering}, 
  author={Cunxiang Wang and Ruoxi Ning and Boqi Pan and Tonghui Wu and Qipeng Guo and Cheng Deng and Guangsheng Bao and Qian Wang and Yue Zhang},
  year={2024},
  eprint={2403.12766},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
  
Leaderboard - Generative Setting (All novels)

Rank  Model                           Acc (%)  Context Window  Text Input     By Team
-     Human Performance               90.00    -               -              -
1     claude-3-sonnet-20240229-v1:0   53.66    200K            Long-context   Claude
2     gpt-4-0125-preview              46.88    128K            Long-context   OpenAI
3     claude-v2:1                     46.04    200K            Long-context   Claude
4     InternLM-20b                    32.37    200K            Long-context   InternLM
5     InternLM-7b                     30.90    200K            Long-context   InternLM
Leaderboard - MultiChoice Setting (All novels)

Rank  Model                           Acc (%)  Context Window  Text Input     By Team
-     Human Performance               97.00    -               -              -
1     gpt-4-0125-preview              71.80    128K            Long-context   OpenAI
2     claude-3-sonnet-20240229-v1:0   71.11    200K            Long-context   Claude
3     GPT-4 RAG Langchain             67.89    128K            RAG Langchain  Sayash Kapoor (sayashk@princeton.edu) and Benedikt Ströbl (stroebl@princeton.edu)
4     claude-v2:1                     66.84    200K            Long-context   Claude
5     GPT-3.5 RAG Langchain           56.94    128K            RAG Langchain  Sayash Kapoor (sayashk@princeton.edu) and Benedikt Ströbl (stroebl@princeton.edu)
6     InternLM-20b                    49.18    200K            Long-context   InternLM
7     InternLM-7b                     43.51    200K            Long-context   InternLM
Leaderboard - Generative Setting (Public domain novels)

Rank  Model                           Acc (%)  Context Window  Text Input     By Team
-     Human Performance               90.00    -               -              -
1     claude-3-sonnet-20240229-v1:0   46.96    200K            Long-context   Claude
2     gpt-4-0125-preview              45.76    128K            Long-context   OpenAI
3     claude-v2:1                     44.32    200K            Long-context   Claude
4     InternLM-20b                    30.04    200K            Long-context   InternLM
5     InternLM-7b                     28.07    200K            Long-context   InternLM
Leaderboard - MultiChoice Setting (Public domain novels)

Rank  Model                           Acc (%)  Context Window  Text Input     By Team
-     Human Performance               97.00    -               -              -
1     gpt-4-0125-preview              70.44    128K            Long-context   OpenAI
2     claude-3-sonnet-20240229-v1:0   67.15    200K            Long-context   Claude
3     claude-v2:1                     65.92    200K            Long-context   Claude
4     InternLM-20b                    45.87    200K            Long-context   InternLM
5     InternLM-7b                     40.89    200K            Long-context   InternLM
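
For reference, the Acc column above is accuracy in percent over the questions of the respective split. Below is a minimal scoring sketch for the multiple-choice setting; the gold answers and predictions are hypothetical placeholders, since gold labels are not part of the public data description above.

# Hypothetical example: gold option letters and model predictions keyed by QID.
gold = {"Q0001": "A", "Q0002": "C", "Q0003": "D", "Q0004": "B"}
predictions = {"Q0001": "A", "Q0002": "B", "Q0003": "D", "Q0004": "B"}

correct = sum(1 for qid, answer in gold.items() if predictions.get(qid) == answer)
accuracy = 100.0 * correct / len(gold)
print(f"Accuracy: {accuracy:.2f}%")  # 3 of 4 correct -> 75.00%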