Data Science by ODS.ai 🦜

MLSUM: The Multilingual Summarization Corpus

The first large-scale MultiLingual SUMmarization dataset, comprising over 1.5M article/summary pairs in French, German, Russian, Spanish, and Turkish. Its complementary nature to the CNN/DM summarization dataset for English.

For each language, they selected an online newspaper from 2010 to 2019 which met the following requirements:
0 being a generalist newspaper: ensuring that a broad range of topics is represented for each language allows minimizing the risk of training topic-specific models, a fact which would hinder comparative cross-lingual analyses of the models.
1 having a large number of articles in their public online archive.
2 Providing human written highlights/summaries for the articles that can be extracted from the HTML code of the web page.

Also, in this paper, you can remember about similar other datasets

paper: https://arxiv.org/abs/2004.14900
github: https://github.com/recitalAI/MLSUM
Instructions and code will soon.

#nlp #corpus #dataset #multilingual

15.3K views08:41

👎🏿 3 👍🏿 42

About

Blog

Apps

Platform