Textual paraphrase dataset for deep language modeling

The project gathers a large dataset of Finnish paraphrase pairs (target: 100,000 pairs) and a comparatively small dataset of Swedish paraphrases. The paraphrases are selected and classified manually, so as to minimize lexical overlap and to provide pairs that differ as much as possible in both structure and vocabulary. The primary application of the dataset is the development and evaluation of deep language models, and representation learning more generally. The project is funded by the European Language Grid (August 2020 - July 2021) and the Academy of Finland (2021-2023).

The project team: Filip Ginter, Jenna Saarni, Jemina Kilpeläinen, Li-Hsin Chang, Otto Tarkka, Jenna Kanerva, Maija Sevón, Hanna-Mari Kupari. Not pictured: Valtteri Skantsi.