Turku-neural-parser-pipeline

A neural parsing pipeline for segmentation, morphological tagging, dependency parsing and lemmatization with pre-trained models for more than 50 languages. Top ranker in the CoNLL-18 Shared Task.

Docker

Docker on OSX and Windows

Docker on OSX and Windows ships with a tight default memory limit, which needs to be increased. When the limit is reached, the Docker container hangs indefinitely. See issue #15.
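One way to see how much memory Docker currently has at its disposal is the MemTotal field of docker info, which reports a byte count. The sketch below converts it to GiB; the function name docker_mem_gib is hypothetical and not part of the parser.

```shell
# Hypothetical helper: print the total memory visible to Docker, in GiB.
docker_mem_gib() {
    # MemTotal is reported in bytes; divide by 1024^3 to get GiB.
    docker info --format '{{.MemTotal}}' |
        awk '{ printf "%.1f\n", $1 / (1024 * 1024 * 1024) }'
}
```

Running docker_mem_gib before and after changing the Docker memory setting shows whether the new limit took effect.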

Input encoding on Windows

If you get encoding issues in Windows PowerShell, try to enforce UTF-8 as shown in issue #21.

We provide Docker images for both cpu and gpu architectures. When launching a container based on these images, you can select:

Whether to run the parser in a one-shot stdin-stdout streaming mode or a server mode which does not reload the model on every request
Which of the language models included with the image to use
Which pipeline from the model to run

Note that in order to run the gpu images, you need nvidia-docker installed and working. Please make sure you can run it successfully before trying to run the parser gpu images.

Ready-made images

Ready-made Docker images are published on the TurkuNLP Docker Hub, where Docker can find them automatically. Currently there are images with the base parser environment for cpu and gpu, as well as an image with Finnish, Swedish, and English models, again for both cpu and gpu. To list the models and pipelines available in a particular image, you can run:

docker run --entrypoint ./list_models.sh turkunlp/turku-neural-parser:latest-fi-en-sv-cpu 

Streaming mode - one-off parsing of text

This is the simplest way to run the parser and is useful for one-off parsing of text. It is unsuitable for repeated requests, because every run pays a major startup cost of about one minute while the parser loads the large model. To parse using one of the pre-made images with Finnish, Swedish and English models:

echo "Minulla on koira." | docker run -i turkunlp/turku-neural-parser:latest-fi-en-sv-cpu stream fi_tdt parse_plaintext > parsed.conllu

or, if you have NVIDIA-enabled Docker, you can run the gpu version:

echo "Minulla on koira." | docker run --runtime=nvidia -i turkunlp/turku-neural-parser:latest-fi-en-sv-gpu stream fi_tdt parse_plaintext > parsed.conllu

And for English (the only change being that we specify the en_ewt model instead of fi_tdt):

echo "I don't have a goldfish." | docker run -i turkunlp/turku-neural-parser:latest-fi-en-sv-cpu stream en_ewt parse_plaintext > parsed.conllu
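To take a quick look at the result, the following sketch prints a few columns per token. It assumes that parsed.conllu follows the standard 10-column, tab-separated CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), as the file extension suggests; the function name conllu_tokens is hypothetical.

```shell
# Hypothetical helper: print FORM, LEMMA, UPOS, HEAD and DEPREL for each token.
# Usage: conllu_tokens FILE...
conllu_tokens() {
    # Comment lines start with '#'; blank lines separate sentences.
    # Only lines with exactly 10 tab-separated fields are token lines.
    awk -F '\t' '!/^#/ && NF == 10 { print $2 "\t" $3 "\t" $4 "\t" $7 "\t" $8 }' "$@"
}
```

For example, conllu_tokens parsed.conllu lists one token per line with its lemma, part of speech, head index and dependency relation.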

The general command to run the parser in this mode is:

docker run -i [image] stream [language_model] [pipeline]
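Because streaming mode reads from stdin and writes to stdout, a whole file can be parsed with plain shell redirection. The wrapper below is a minimal sketch of that idea; the function name parse_file and its argument order are hypothetical.

```shell
# Hypothetical helper: parse a plain-text file in streaming mode.
# Usage: parse_file IMAGE MODEL PIPELINE INPUT_FILE OUTPUT_FILE
parse_file() {
    # -i keeps stdin open so the container can read the redirected input file.
    docker run -i "$1" stream "$2" "$3" < "$4" > "$5"
}
```

For instance, parse_file turkunlp/turku-neural-parser:latest-fi-en-sv-cpu fi_tdt parse_plaintext input_text.txt parsed.conllu runs the Finnish pipeline on input_text.txt. Each call still pays the full model loading time.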

Server mode - repeated requests

In this mode, the parser loads the model once and can subsequently respond to repeated requests over HTTP. For example, using the gpu version:

docker run --runtime=nvidia -d -p 15000:7689 turkunlp/turku-neural-parser:latest-fi-en-sv-gpu server en_ewt parse_plaintext

and on cpu

docker run -d -p 15000:7689 turkunlp/turku-neural-parser:latest-fi-en-sv-cpu server en_ewt parse_plaintext

Either command starts the parser in server mode with the English en_ewt model and the parse_plaintext pipeline. Once the model has loaded, the parser listens for requests on local port 15000. Note: There is nothing magical about the port number 15000; you can map the container's port 7689 to any suitable local port. You can query the running parser as follows:

curl --request POST --header 'Content-Type: text/plain; charset=utf-8' --data-binary "This is an example sentence, nothing more, nothing less." http://localhost:15000 > parsed.conllu

or

curl --request POST --header 'Content-Type: text/plain; charset=utf-8' --data-binary @input_text.txt http://localhost:15000 > parsed.conllu
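Since the server only answers once the model has loaded, a request sent right after docker run may fail. The following sketch retries the same curl request until it succeeds or a retry budget is used up. The function name parse_when_ready and the default of 30 attempts with 5 seconds between them are assumptions, not values from the parser.

```shell
# Hypothetical helper: POST a text file to the parser, retrying while it starts up.
# Usage: parse_when_ready URL INPUT_FILE [MAX_TRIES] [DELAY_SECONDS]
parse_when_ready() {
    local url="$1" input="$2" tries="${3:-30}" delay="${4:-5}" i=0
    while [ "$i" -lt "$tries" ]; do
        # --fail makes curl exit non-zero on HTTP errors as well as on
        # connection failures, so any unsuccessful attempt is retried.
        if curl --silent --fail --request POST \
                --header 'Content-Type: text/plain; charset=utf-8' \
                --data-binary "@$input" "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "parser at $url did not respond after $tries attempts" >&2
    return 1
}
```

A typical call would be parse_when_ready http://localhost:15000 input_text.txt > parsed.conllu.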

Images for other languages

Building a language-specific image is straightforward. First, choose one of the available language models from here. These models correspond to the various treebanks available at Universal Dependencies. Let us choose French and the GSD treebank model. The model name is then fr_gsd, and to parse plain text documents you would use the parse_plaintext pipeline. The hardware build parameter controls whether you build a gpu or cpu image.

Build the Docker image like so:

git clone https://github.com/TurkuNLP/Turku-neural-parser-pipeline.git
cd Turku-neural-parser-pipeline
docker build -t "my_french_parser" --build-arg models=fr_gsd --build-arg hardware=cpu -f Dockerfile-lang .

And then you can parse French like so:

echo "Les carottes sont cuites" | docker run -i my_french_parser stream fr_gsd parse_plaintext

In case you want to build an image with several language models included, you can specify several models when building the image, e.g. as follows: --build-arg models="fr_gsd en_ewt fi_tdt"
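Put together with the build command shown earlier, a multi-model build could be wrapped as in the sketch below. The function name build_parser_image and its argument order are hypothetical; it has to be run from the root of the cloned repository, where Dockerfile-lang lives.

```shell
# Hypothetical helper: build a parser image containing one or more models.
# Usage: build_parser_image TAG HARDWARE MODEL...
build_parser_image() {
    local tag="$1" hardware="$2"
    shift 2
    # "$*" joins the remaining arguments with spaces, giving a single
    # build argument such as models="fr_gsd en_ewt fi_tdt".
    docker build -t "$tag" \
        --build-arg models="$*" \
        --build-arg hardware="$hardware" \
        -f Dockerfile-lang .
}
```

For example, build_parser_image my_multi_parser cpu fr_gsd en_ewt fi_tdt builds a cpu image with the French, English and Finnish models.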