A piece of news, an opinion or something else? Different texts and their detection from the multilingual Internet
Uutinen, mielipide vai jotain muuta? Erilaiset tekstit ja niiden automaattinen tunnistus monikielisestä internetistä
Funded by the Emil Aaltonen Foundation
The project combines linguistics, natural language processing and machine learning in order to analyse and automatically detect the different text varieties that are used on the internet. By distinguishing between e.g. user manuals, news articles on recent events and texts that also feature the author’s opinion, the project reveals crucial information on language use on the internet and improves the accessibility of the massive amount of information available online. As a practical outcome, the project detects the different text varieties used in a collection of billions of words of online texts written in Finnish, Swedish, French and English that were compiled by the research group during previous research efforts.
Papers and talks:
At the 12th Web-as-Corpus Workshop (held remotely): Veronika Laippala, Samuel Rönnqvist, Saara Hellström, Juhani Luotolahti, Liina Repo, Anna Salmela, Valtteri Skantsi, Sampo Pyysalo: From Web Crawl to Clean Register-Annotated Corpora. See https://www.aclweb.org/anthology/2020.wac-1.3/
In Research data and the humanities in Oulu: Veronika Laippala: From bits and numbers to explanations - doing research on Internet-based big data (Plenary talk). See slides at https://github.com/mavela/Papers-and-talks/blob/master/explaining_numbers_rdhum.pdf
In Corpus Linguistics 2019 in Cardiff: Veronika Laippala, Aki-Juhani Kyröläinen, Filip Ginter, Jesse Egbert and Douglas Biber: Generating online text types: A cluster analysis of predicted document embeddings. See slides at https://github.com/mavela/Papers-and-talks/blob/master/Generating%20online%20text%20types.pdf
In Proceedings of Nodalida 2019 in Turku: Veronika Laippala, Roosa Kyllönen, Jesse Egbert, Douglas Biber, Sampo Pyysalo: Toward Multilingual Identification of Online Registers. (In September 2019)