Research Projects
Finnish Internet Parsebank
PI: Filip Ginter (TY), Veronika Laippala (TY)
2014-2017, funded by the Kone Foundation (Koneen Säätiö)
The Finnish Internet Parsebank is a joint project of the Department of Information Technology and the School of Languages and Translation Studies. It aims to produce a mass-scale corpus of Internet Finnish using automatic syntactic analysis and document classification.
Universal Dependency Parser
PI: Filip Ginter (TY)
2016-2020, funded by the Academy of Finland
Automated syntactic analysis is one of the fundamental tasks of language technology, utilized in many applications such as search engines and machine translation. For a small number of well-resourced and well-researched languages, like English, the accuracy of the automatic parsers approaches human level. But for many others, the accuracy is significantly worse and the vast majority of languages cannot be automatically parsed at all, because the necessary data to train the parsers has not been created. In this project, we are finding ways in which the parser training data of many languages could be pooled using techniques of vector space representation of words. This will result in a “Universal Parser” which operates in a language-independent manner and can, for instance, use Czech training data to improve its performance on Finnish.
U-bot: News Generation Using Advanced Language Technology Methods
PI: STT, Partners: Filip Ginter (TY), Namia Oy
2018-2019, funded by the Google Digital News Initiative (DNI)
U-bot uses advanced language technology to automatically generate news text based on facts from a data source and previous news stories in the same area. Most organisations producing automatically generated text use simple template slot filling. These templates usually have to be written by journalists, which takes time and effort. With U-bot, text generation is driven by advanced language technology methods, removing the need for time-consuming hand-written templates and letting the machine do the work.
DeepFin: Deep language models for Finnish
PI: Filip Ginter (TY), Sampo Pyysalo (TY)
2019, CSC Grand Challenge Pilot
Recent advances in deep learning have made it possible to model human language at unprecedented levels of accuracy, leading to breakthroughs in machine translation, question answering, natural language dialogue, and more. However, due to the computational cost of training state-of-the-art methods on billions of words of text, large-scale applications of this technology have been primarily limited to large multinational companies working on English. Using web-scale Finnish language resources compiled by the University of Turku Natural Language Processing group and newly introduced CSC computational resources, the DeepFin Grand Challenge project will create a deep language model of Finnish that will be comparable in coverage and quality to the best language models available today for any language. This critical language resource will be made freely available under Open Data licensing to accelerate further advances in Finnish natural language processing.
A piece of news, an opinion or something else? Different texts and their detection from the multilingual Internet
PI: Veronika Laippala (TY)
2019-2021, funded by the Emil Aaltonen Foundation
The project combines linguistics, natural language processing and machine learning in order to analyse and automatically detect the different text varieties that are used on the internet. By distinguishing between, e.g., user manuals, news articles on recent events and texts that also feature the author’s opinion, the project reveals crucial information on language use on the internet and improves the accessibility of the massive amount of information available online. As a practical outcome, the project detects the different text varieties used in a collection of billions of words of online texts written in Finnish, Swedish, French and English that were compiled by the research group during previous research efforts. This significantly improves the usability of the collections.
Structuring language use across multilingual web corpora
Partners: Veronika Laippala (TY), Jesse Egbert and Douglas Biber, Northern Arizona University, USA
2018-2019, funded by the Cultural Foundation of Finland and Fulbright
The project combines corpus linguistics and machine learning to study registers, i.e. language varieties such as user manuals, news articles or film reviews used on the Internet. The objective is to define the linguistic characteristics of these registers and develop methods to automatically detect them from very large, web-crawled corpora, such as the Finnish Internet Parsebank and similar collections in other languages. This will improve the usability of such collections, because users can then focus on particular registers. In the long term, the project will also enhance the availability of information on the Internet, because the results can be used to detect the origins of web documents. As a consequence, a Google search, for instance, could be asked to focus on specific registers, such as news or product reviews.
Computational History and the Transformation of Public Discourse in Finland, 1640–1910
PI: Hannu Salmi (TY), Partners: Kimmo Kettunen (HY), Tapio Salakoski (TY), Mikko Tolonen (HY)
2016-2019, funded by the Academy of Finland
The consortium Computational History and the Transformation of Public Discourse in Finland, 1640–1910 is based on the shared expertise of the Faculty of Humanities at the University of Helsinki, the Departments of Cultural History and Information Technology at the University of Turku, and the Centre for Preservation and Digitisation of the National Library of Finland. Its objective is to reassess the scope, nature and transnational connections of public discourse in Finland, 1640–1910. Two complementary approaches will be used, one based on library catalogue metadata and the other on full-text mining of all digitized Finnish newspapers and journals up to 1910. The consortium will analyze how language barriers, elite culture and popular debate, text reuse, and different publication channels interacted. As a key methodological innovation, the consortium introduces the concept of open data analytical ecosystems.
Citizen Mindscapes – Detecting Social, Emotional and National Dynamics in Social Media
PI: Jussi Pakkasvirta (HY), Partners: Juha Alho (HY), Filip Ginter (TY), Juho Saari (TAY), Jaakko Suominen (TY)
2016-2018, funded by the Academy of Finland
Mindscapes24 builds a research frontier for social media analysis by focusing on Suomi24, Finland’s largest topic-centric social media site and one of the world’s largest non-English online discussion fora. We bring together researchers from social sciences, digital culture, welfare sociology, language technology, and statistical data analysis, developing new ways of exploring social and political interaction. We tackle Suomi24 from three perspectives: (1) the digital culture that produces social media, (2) novel visual tools and analysis methods for studying the digital content, and (3) a small number of spearhead research questions, such as characterizing the types of micro-interaction, how heated debates might turn into political movements, and how to detect emotional waves. In addition to an open data set made available through the Language Bank, the results will include a book on digital culture, visual tools for social scientists, and an international conference.
Profiling Premodern Authors
PI: Marjo Kaartinen (TY), Partners: Sampo Pyysalo (TY)
2016-2019, funded by the Academy of Finland
The consortium Profiling Premodern Authors (PROPREAU) applies and develops new computational methods based on machine learning to explore several fundamental and unresolved questions of authorship in classical and medieval Latin texts, ranging from Roman grammarians to the papal court and the works of the inquisitors. Despite the unsurpassed cultural importance of Latin, many essential texts remain anonymous. Identifying their authors requires analysis and comparison of large quantities of text, often characterized by imitation of earlier sources. Computational models and machine learning have the potential to significantly alter our view of premodern authorship by allowing a much wider look at textual material than is attainable by conventional methods and a single human reader. The consortium expects to offer new, well-grounded answers to questions of authorship that were previously considered unsolvable, as well as guidelines for future endeavors to identify the authors of anonymous premodern texts.
Massively multilingual modeling of registers in web-scale corpora
PI: Veronika Laippala (TY)
2021-2024, funded by the Academy of Finland
This project combines the long traditions of corpus linguistics and the latest innovations of natural language processing (NLP) to explore web registers (situationally defined Internet text varieties such as news, blogs or how-to pages) on a massively multilingual scale. Specifically, the project (1) analyzes language-specific differences of registers and creates a data-driven description of the full range of web registers in six languages, (2) develops machine learning methods for large-scale register modeling and identification in massively cross-lingual and multilingual settings, and (3) automatically identifies registers in Universal Parsebanks, a language resource spanning 100 billion words and 64 languages. The project thereby provides critical knowledge about online communication, as well as tools for developing web data from simple masses of raw text into organized resources with rich metadata.