View on GitHub

Turku-neural-parser-pipeline

A neural parsing pipeline for segmentation, morphological tagging, dependency parsing and lemmatization with pre-trained models for more than 50 languages. Top ranker in the CoNLL-18 Shared Task.

Training

Training new models — short version

The default command for training the parser pipeline is:

python3 train/train_models.py --name name --train_file /path/to/train.conllu --devel_file /path/to/devel.conllu --embeddings /path/to/embeddings.vectors

This will train the tagger, parser and lemmatizer models, and save everything into a directory models_%name. Training and development data must be in the CoNLL-U format, and pre-trained word embedding must be in the word2vec text format. Default parameters for each component are defined in the train/templates/ directory, you are free to tune the hyperparameter values in these config files.

GPU: With default parameters, lemmatizer (pytorch) is using GPU with id 0, remove -gpu_rank 0 parameter from train/templates/lemmatizer.yaml if you wish to run on CPU (but note that CPU training may be very slow).

Pre-trained word embeddings: By default, the model expects to get 100-dimensional pre-trained word embeddings. If you need to use any other dimensionality, change the embed_size pararameter in train/templates/tagger.cfg and train/templates/parser.cfg.

Skipping a component: If you need to skip a component, you can add --tagger False, --parser False or --lemmatizer False.

Tokenizer

Currently, the tokenizer model must be trained separately by using UDPipe v1.2.0, see UDPipe installation instructions here. After downloading/installing UDPipe, a tokenizer model can be trained with:

/path/to/udpipe --train model.output train.conllu --heldout dev.conllu --tokenizer="dimension=64;epochs=100;initialization_range=0.1;batch_size=50;learning_rate=0.005;dropout=0.1;early_stopping=1" --tagger=none --parser=none

After training, the tokenizer model must be copied into the models_%name/Tokenizer/tokenizer.udpipe.

cp model.output models_%name/Tokenizer/tokenizer.udpipe

Running the parser with your new models

Now you should be able to run the pipeline with your new models (use --gpu -1 for running on CPU):

echo "I have a dog." | python3 full_pipeline_stream.py --conf models_%name/pipelines.yaml parse_plaintext > myfile.conllu

Training new models — long version

In case of any problems with the train_models.py, here is a list of required steps for training new models:

TODO