Model Submission

Automatic Metrics

Summary Length: the number of words in the summary.
Novelty: the percentage of summary words that do not appear in the document.
Compression Ratio: the ratio of the number of words in the article to the number of words in the summary.
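
To make these three statistics concrete, here is a minimal sketch that computes them with a plain whitespace tokenizer; the exact tokenization used for the leaderboard may differ.

```python
def word_stats(document: str, summary: str) -> dict:
    """Word-level statistics; whitespace tokenization is an assumption."""
    doc_words = document.split()
    sum_words = summary.split()
    doc_vocab = set(doc_words)
    novel = [w for w in sum_words if w not in doc_vocab]
    return {
        "summary_length": len(sum_words),                        # words in summary
        "novelty": 100.0 * len(novel) / max(len(sum_words), 1),  # % words not in document
        "compression_ratio": len(doc_words) / max(len(sum_words), 1),
    }
```
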
ROUGE: We use the Python implementation provided by Google Research.
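
The Google Research implementation is distributed on PyPI as `rouge-score`; a minimal usage sketch is shown below (the ROUGE variants and stemming flag are illustrative choices, not necessarily the exact configuration used here).

```python
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
prediction = "a cat was sitting on the mat"

# ROUGE-1/2/L with stemming; scorer.score(target, prediction) returns
# precision/recall/F1 for each requested variant.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, prediction)
print(scores["rougeL"].fmeasure)
```
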
Factual Consistency: We compute this at two levels, following Nan et al., 2021.
  • Entity-level: the percentage of named entities in the summary that are also found in the document. Partial entities are matched to their longer counterparts in the document when they share parts of the entity (see the sketch after this list).
  • Relation-level: the percentage of relations in the summary (extracted using Stanford OpenIE) that are also found in the document. Since the reference is treated as just another model, we only compute precision with respect to the source document.
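
A rough sketch of the entity-level precision follows; spaCy and its `en_core_web_sm` model, as well as the substring-based partial match, are assumptions for illustration. The relation-level variant would analogously compare relation triples extracted by Stanford OpenIE from the summary against those from the document.

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed NER model, not necessarily the one used here

def entity_precision(document: str, summary: str) -> float:
    """Fraction of summary entities supported by the document (exact or partial match)."""
    doc_ents = {e.text.lower() for e in nlp(document).ents}
    sum_ents = [e.text.lower() for e in nlp(summary).ents]
    if not sum_ents:
        return 1.0  # nothing to verify
    supported = sum(
        1 for ent in sum_ents
        # exact match, or a partial match against a longer document entity
        if ent in doc_ents or any(ent in d or d in ent for d in doc_ents)
    )
    return supported / len(sum_ents)
```
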
N-gram Abstractiveness: We compute n-gram abstractiveness up to 4-grams, following Gehrmann et al., 2019. It is a normalized novelty score that tracks which parts of the summary are already covered by the n-grams it shares with the document.
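
One simplified reading of this idea, assuming whitespace tokenization: mark every summary token covered by a 1- to 4-gram that also appears in the document, and report the fraction of uncovered (novel) tokens. The sketch below follows that reading and is not the exact scoring script.

```python
def ngram_abstractiveness(document: str, summary: str, max_n: int = 4) -> float:
    """Fraction of summary tokens not covered by any n-gram (n <= max_n) shared
    with the document; a simplified interpretation of Gehrmann et al., 2019."""
    doc_toks, sum_toks = document.split(), summary.split()
    doc_ngrams = {
        tuple(doc_toks[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(doc_toks) - n + 1)
    }
    covered = [False] * len(sum_toks)
    for n in range(1, max_n + 1):
        for i in range(len(sum_toks) - n + 1):
            if tuple(sum_toks[i:i + n]) in doc_ngrams:
                for j in range(i, i + n):
                    covered[j] = True
    return sum(not c for c in covered) / max(len(sum_toks), 1)
```
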