About

This project holds the dependency parsing pipeline being developed by the University of Turku NLP group. It is still a work in progress, but a version of this pipeline has been successfully applied to corpora of several billion tokens.

Newer and substantially better version of the parser

A new take on the trusty old Finnish-dep-parser, with pretrained models for more than 50 languages, is available at https://turkunlp.github.io/Turku-neural-parser-pipeline/. The new pipeline is fully neural and has substantially better accuracy in all layers of prediction (segmentation, morphological tagging, syntax, lemmatization).

Download (old version)

Choose whichever option suits you best:

Clone the repository: git clone https://github.com/TurkuNLP/Finnish-dep-parser.git
Download the current source code using the Download ZIP link of the project GitHub repository

Installation and prerequisites

On most systems, all you need to do is run the install.sh script, which will download and test all of the necessary prerequisites. You’ll need to have Java and Python 2.X installed. The script downloads the external tools needed to run the pipeline:

OpenNLP for sentence splitting and tokenization
OMorFi and HFST optimized lookup for morphological analysis
MarMoT for morphological disambiguation (tagging)
mate-tools for dependency parsing

These are all Java programs and tend to work fine anywhere with a sane Java installation.

Parsing plain text

The following command will run the entire pipeline (sentence splitting, tokenization, tagging, parsing) on a text file. All programs throughout the pipeline expect UTF-8 as text encoding.

cat sometext.txt | ./parser_wrapper.sh > output.conllu
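Because every stage expects UTF-8, it can help to check the input before starting a long run. The following is a minimal sketch, not part of the pipeline; the function name is_valid_utf8 is hypothetical, and the code is written for Python 3 on the assumption that it is run separately from the pipeline's own Python 2 scripts.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For a file, read it in binary mode first, for example is_valid_utf8(open("sometext.txt", "rb").read()).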

Parsing plain text with possible comments

If you need to preserve metadata in the input, you can include comments (lines starting with ###C:) in the input. These lines will be passed through the pipeline unchanged and preserved in the output.

cat sometext.txt | ./split_text_with_comments.sh | ./parse_conll.sh > output.conllu
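One way to prepare such input is to write each piece of metadata on its own ###C: line directly before the text it describes. The sketch below assumes that layout; the function name build_input and the example metadata string are hypothetical, and only the ###C: prefix comes from the description above.

```python
COMMENT_PREFIX = "###C:"  # prefix described above for pass-through comment lines


def build_input(documents):
    """Join (metadata, text) pairs into one plain-text input string.

    Each metadata string becomes a single comment line placed before its
    text. Hypothetical helper for illustration; not part of the pipeline.
    """
    parts = []
    for metadata, text in documents:
        if "\n" in metadata:
            # A comment must stay on one line to keep its prefix.
            raise ValueError("metadata must be a single line")
        parts.append(COMMENT_PREFIX + metadata)
        parts.append(text)
    return "\n".join(parts) + "\n"
```

The resulting string can be written to a file and piped through split_text_with_comments.sh as shown above.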

Parsing CoNLL-U formatted input

cat input.conllu | ./parse_conll.sh > output.conllu

Note that comments (lines in the CoNLL-U file that start with #) will be preserved and passed through the pipeline unchanged. Make sure the comments immediately precede the next sentence, i.e. there should not be an empty line between the comments and the first token of the sentence.
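If an input file does have empty lines between comments and the sentence they belong to, it can be cleaned up before parsing. The following is a minimal sketch under the assumption that every line starting with # is a comment and that any empty line directly after a comment is unwanted; the function name attach_comments is hypothetical.

```python
def attach_comments(lines):
    """Drop empty lines that directly follow a comment line.

    Empty lines that follow token lines are kept, since they mark the end
    of a sentence. Hypothetical helper; not part of the pipeline.
    """
    out = []
    prev_was_comment = False
    for line in lines:
        stripped = line.rstrip("\n")
        if stripped == "" and prev_was_comment:
            # Skip the gap so the comment immediately precedes its sentence.
            continue
        out.append(stripped)
        prev_was_comment = stripped.startswith("#")
    return out
```

The function returns the lines without trailing newlines, so join them with "\n" when writing the cleaned file.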

Visualizing trees

Parser output trees can be visualized by using the following command and opening the resulting .html file in a modern web browser. Use the --max_sent parameter to limit the number of trees shown.

cat output.conllu | python visualize.py > output.html

Testing

The data directory contains the file wiki-test.txt which is a small piece of text from the Finnish Wikipedia. You can parse it as follows:

cat data/wiki-test.txt | ./parser_wrapper.sh > wiki-test-parsed.conllu
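A quick sanity check on the result is to count the sentences and tokens in the output file. This is a rough sketch that relies only on general CoNLL-U conventions (empty lines separate sentences, comment lines start with #, token lines begin with an integer ID); the function name is hypothetical and no particular counts are implied for wiki-test.txt.

```python
def count_sentences_and_tokens(lines):
    """Return (sentence_count, token_count) for CoNLL-U formatted lines.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            in_sentence = False  # an empty line ends the current sentence
            continue
        if line.startswith("#"):
            continue  # comment lines carry no tokens
        if not in_sentence:
            sentences += 1
            in_sentence = True
        # Count only lines whose first column is a plain integer ID.
        if line.split("\t", 1)[0].isdigit():
            tokens += 1
    return sentences, tokens
```

For example, count_sentences_and_tokens(open("wiki-test-parsed.conllu")) returns the two counts for the parsed test file.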

Other features

Splitting sentences into clauses

cat output.conllu | python split_clauses.py > output_clauses.conllu

This script uses the last two columns of the CoNLL-U format to mark the clause boundaries.
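To inspect those markings, the two final columns can be read from each token line. In standard CoNLL-U a token line has ten tab-separated columns, and the last two are DEPS and MISC. The sketch below only extracts them; what values split_clauses.py writes there is not described here, so the code makes no attempt to interpret them. The function name clause_columns is hypothetical.

```python
def clause_columns(line):
    """Return (form, ninth_column, tenth_column) for one CoNLL-U token line.

    Hypothetical helper for illustration; the meaning of the two values is
    defined by split_clauses.py and is not interpreted here.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 tab-separated columns, got %d" % len(cols))
    return cols[1], cols[8], cols[9]
```

Skip empty lines and lines starting with # before calling the function, since those are not token lines.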

Visualizing clauses

Separate clauses can be visualized by using the following command and opening the resulting .html file in a modern web browser. Use the --max_sent parameter to limit the number of trees shown.

cat output_clauses.conllu | python visualize_clauses.py > output_clauses.html