Textual paraphrase dataset for deep language modeling
The project is building a large dataset of Finnish paraphrase pairs (target: 100,000) and a comparatively small dataset of Swedish paraphrases. The paraphrases are selected and classified manually so as to minimize lexical overlap, yielding pairs that differ as much as possible in both structure and vocabulary. The primary application of the dataset is the development and evaluation of deep language models, and representation learning more generally. The project is funded by the European Language Grid (August 2020 - July 2021) and the Academy of Finland (2021-2023).
- Project poster
- Annotation tools: candidate selection, candidate annotation, statistics
- Data: the first release of 53,000 paraphrases is expected by the end of February or early March 2021
