Finnish NLP

Finnish Parser

Finnish dependency parser pipeline, which includes tokenization, sentence splitting, lemmatization, morphological tagging, and dependency parsing.

More information on the parser project

Contact: Filip Ginter (figint@utu.fi), Jenna Kanerva (jmnybl@utu.fi) or the GitHub issue tracker

UD_Finnish treebank (Turku Dependency Treebank)

UD_Finnish treebank is a broad-coverage dependency-annotated treebank of general Finnish annotated in the Universal Dependencies scheme. The treebank has about 200,000 words and 15,000 sentences.

The original Turku Dependency Treebank (TDT) was annotated in the Stanford Dependencies scheme and later converted into the UD scheme. The UD version is the main, maintained version of the treebank and is distributed through the UD project (data). The original version (TDT) is no longer maintained but is available upon request. Additional data on the annotation process (double annotations, timestamps, etc.) is also available for the original treebank.
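Treebanks released through the UD project use the tab-separated CoNLL-U format, with ten columns per token and a blank line between sentences. The following is a minimal sketch of reading such data in Python without any external library. The function name read_conllu and the sample sentence are illustrative assumptions; the sample is hand-written for this example and is not taken from the treebank.

```python
from typing import Dict, Iterable, Iterator, List

# The ten CoNLL-U columns, in the order defined by the UD format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry metadata such as sent_id and text.
            continue
        cols = line.split("\t")
        # Keep only ordinary tokens; multiword ranges ("1-2") and
        # empty nodes ("1.1") have non-integer IDs and are skipped here.
        if len(cols) != len(FIELDS) or not cols[0].isdigit():
            continue
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


# Hand-written illustration only, not an excerpt from UD_Finnish.
SAMPLE = (
    "# text = Koira nukkuu.\n"
    "1\tKoira\tkoira\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tnukkuu\tnukkua\tVERB\t_\tMood=Ind|Number=Sing|Person=3\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

for tokens in read_conllu(SAMPLE.splitlines()):
    for tok in tokens:
        print(tok["id"], tok["form"], tok["lemma"], tok["upos"], tok["head"], tok["deprel"])
```

To read a downloaded treebank file, the same function can be given an open file handle, since iterating over a file yields its lines.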

Online treebank search: Finnish (UDv2.0) treebank in http://bionlp-www.utu.fi/dep_search/

License: CC BY-SA 4.0

Main references: Haverinen, K.; Nyblom, J.; Viljanen, T.; Laippala, V.; Kohonen, S.; Missilä, A.; Ojala, S.; Salakoski, T.; Ginter, F.: Building the essential resources for Finnish: the Turku Dependency Treebank. Language Resources and Evaluation. 2013. DOI: 10.1007/s10579-013-9244-1

Pyysalo, Sampo; Kanerva, Jenna; Missilä, Anna; Laippala, Veronika; Ginter, Filip: Universal Dependencies for Finnish. Proceedings of NoDaLiDa 2015 https://aclweb.org/anthology/W/W15/W15-1821.pdf

Contact: Filip Ginter (figint@utu.fi), Jenna Kanerva (jmnybl@utu.fi)

Finnish Internet Parsebank

A mass-scale corpus of Internet Finnish with automatic syntactic analysis. Currently includes about 3.7 billion tokens.

The project has three aims: 1) the creation of a language resource with automatic morphological and syntactic analyses, 2) the classification of the entire Parsebank into coherent subcorpora, and 3) the creation of an online user interface.

Online search of morpho-syntactic features: Finnish Internet Parsebank corpus in Dep Search

Online search for words and their contexts: Finnish Internet Parsebank corpus in NoSketchEngine

Semantic similarity of words: An online demo which lets you query semantically similar words using a word2vec model trained on the Parsebank data, http://epsilon-it.utu.fi/wv_demo/. Embedding models in binary form are available here.
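Querying for semantically similar words with word embeddings usually comes down to ranking the vocabulary by cosine similarity to the query word's vector. The following is a minimal sketch of that idea in plain Python. The three-dimensional vectors in TOY_VECTORS are invented for illustration and have nothing to do with the actual Parsebank model, and the function names are assumptions rather than part of the demo.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word: str,
                 vectors: Dict[str, Sequence[float]],
                 topn: int = 3) -> List[Tuple[str, float]]:
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented toy vectors; a real word2vec model has hundreds of dimensions.
TOY_VECTORS = {
    "koira": [0.9, 0.1, 0.0],   # dog
    "kissa": [0.8, 0.2, 0.1],   # cat
    "auto":  [0.0, 0.1, 0.9],   # car
}

for neighbour, score in most_similar("koira", TOY_VECTORS):
    print(f"{neighbour}\t{score:.3f}")
```

With real embeddings, the brute-force loop over the whole vocabulary is normally replaced by a vectorized matrix product, but the ranking criterion stays the same.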

Word frequency list: A Finnish word frequency list can be downloaded at http://dl.turkunlp.org/finnish-parsebank/. The list is computed from the Parsebank data and sorted by descending frequency. If you use this in your research, please cite us.
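A frequency list of this kind can be loaded with a few lines of Python. The sketch below assumes that each line holds a count followed by a word, separated by whitespace; this layout is an assumption and the downloadable file may be organized differently, so check it before reuse. The function names and the sample counts are likewise invented for illustration.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def parse_frequency_lines(lines: Iterable[str],
                          limit: Optional[int] = None) -> List[Tuple[str, int]]:
    """Parse lines of the ASSUMED form '<count> <word>' into (word, count) pairs.

    Lines that do not match the assumed layout are skipped. Because the
    list is sorted in descending order, `limit` keeps the most frequent words.
    """
    entries: List[Tuple[str, int]] = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        entries.append((parts[1], int(parts[0])))
        if limit is not None and len(entries) >= limit:
            break
    return entries


def relative_frequencies(entries: List[Tuple[str, int]]) -> Dict[str, float]:
    """Share of each word among the loaded entries (not the whole corpus)."""
    total = sum(count for _, count in entries)
    return {word: count / total for word, count in entries}


# Invented counts, for illustration only.
SAMPLE_LINES = ["1000 ja", "600 on", "400 ei"]

entries = parse_frequency_lines(SAMPLE_LINES)
for word, share in relative_frequencies(entries).items():
    print(f"{word}\t{share:.2f}")
```

Note that the shares are relative to the entries actually loaded, so they change if the list is truncated with limit.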

Main references: J. Luotolahti; J. Kanerva; V. Laippala; S. Pyysalo; F. Ginter. Towards Universal Web Parsebanks. Proceedings of the International Conference on Dependency Linguistics (Depling’15). 2015

Contact: Filip Ginter (figint@utu.fi), Veronika Laippala (mavela@utu.fi)

Search Tool for Dependency Graphs (dep_search)

Tool for searching for morpho-syntactic constructions in dependency graphs.

Beta version of the tool’s new demo: http://depsearch-depsearch.rahtiapp.fi/ds_demo/

Main references: J. Luotolahti; J. Kanerva; S. Pyysalo; F. Ginter. SETS: Scalable and Efficient Tree Search in Dependency Graphs. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. 2015

J. Luotolahti; J. Kanerva; F. Ginter. Dep_search: Efficient Search Tool for Large Dependency Parsebanks. Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa). 2017

Contact: Filip Ginter (figint@utu.fi), Juhani Luotolahti (mjluot@utu.fi)

Finnish Propbank

Finnish PropBank is a collection of predicates annotated with their sense and arguments. The data builds on top of the Turku Dependency Treebank (TDT), more specifically its Universal Dependencies version.

More information on the propbank’s project page: http://turkunlp.github.io/Finnish_PropBank/

License: CC BY-SA 4.0

Main references: Haverinen, K.; Kanerva, J.; Kohonen, S.; Missilä, A.; Ojala, S.; Viljanen, T.; Laippala, V.; Ginter, F. The Finnish Proposition Bank. Language Resources and Evaluation. 2015

Contact: Filip Ginter (figint@utu.fi), Jenna Kanerva (jmnybl@utu.fi)

Finnish BERT (FinBERT)

BERT model trained from scratch on Finnish.

More information on the FinBERT project page: https://github.com/TurkuNLP/FinBERT

License: CC BY 4.0

Main references: A. Virtanen; J. Kanerva; R. Ilo; J. Luoma; J. Luotolahti; T. Salakoski; F. Ginter; S. Pyysalo. Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076. 2019

Contact: GitHub issue tracker

Finnish GPT (FinGPT)

A series of GPT-3 models trained on Finnish.

More information on the project page: https://turkunlp.org/gpt3-finnish

Contact: Sampo Pyysalo (sampo.pyysalo@utu.fi)

Finnish ChatGPT (FinGPT)

ChatGPT-like models for Finnish.

More information on the project page: https://turkunlp.org/chatgpt-finnish

Contact: Sampo Pyysalo (sampo.pyysalo@utu.fi)

Finnish NER

A state-of-the-art Finnish named entity recognition (NER) system.

More information and online demo on the NER’s project page

Main references: Jouni Luoma, Miika Oinonen, Maria Pyykönen, Veronika Laippala, Sampo Pyysalo. 2020. A Broad-coverage Corpus for Finnish Named Entity Recognition. In Proceedings of The 12th Language Resources and Evaluation Conference (LREC’2020).

Contact: Sampo Pyysalo (sampo.pyysalo@utu.fi)

Finnish paraphrase

The project "Textual paraphrase dataset for deep language modeling" gathers a large dataset of Finnish and Swedish paraphrases.

More information on the paraphrase project

Contact: Filip Ginter (figint@utu.fi)

Finnish Ice Hockey Data2Text

The Turku Hockey Data2Text dataset for ice hockey news generation is a collection of ice hockey statistics combined with Finnish natural language descriptions of the game events.

More information on the data2text page

Main references: Jenna Kanerva, Samuel Rönnqvist, Riina Kekki, Tapio Salakoski, Filip Ginter. 2019. Template-free Data-to-Text Generation of Finnish Sports News. In Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa’19).

Contact: Jenna Kanerva (jmnybl@utu.fi), Filip Ginter (figint@utu.fi)