
This page still pertains to UD version 1.

Tokenization

Tokenization of the Slovenian UD Treebank reflects the following principles:

Space is the principal separator for tokens.

During tokenization, all non-space material is assigned to one of two categories: words (W) and characters (C). W covers alphanumeric strings between spaces, while C covers punctuation and symbol characters.

If a string between two spaces includes C characters, it is usually split into several tokens (e.g. AC/DC and Micro$oft are each split into three tokens: AC / DC and Micro $ oft).

However, the following exceptions apply, in which C characters become parts of W tokens:

Dot becomes part of a W token if it is:

URLs and e-mail addresses: all C characters become part of a single W token in strings recognized as URLs or addresses using a regular expression.

Whether a token is followed by a space (e.g. d.o.o. vs. d. o. o.) is indicated with the SpaceAfter=Yes feature in the MISC column.
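The principles above can be illustrated with a minimal Python sketch. It is not the tokenizer used for the treebank: the URL and e-mail pattern is a simplified stand-in for the regular expression mentioned above (which is not given on this page), the exceptions for dots are left out because their conditions are not listed here, and writing `_` in MISC for tokens not followed by a space is an assumption. The names `tokenize`, `URL_OR_EMAIL` and `PIECE` are hypothetical.

```python
import re

# Assumed, simplified stand-in for the URL/e-mail regular expression;
# the pattern actually used for the treebank is not given in the text.
URL_OR_EMAIL = re.compile(r"https?://\S+|www\.\S+|[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# A run of alphanumeric characters (W) or a single punctuation/symbol character (C).
PIECE = re.compile(r"[^\W_]+|[\W_]")


def tokenize(text):
    """Return a list of (form, misc) pairs for a raw text string."""
    tokens = []
    # Space is the principal separator: start from space-delimited chunks.
    for match in re.finditer(r"\S+", text):
        chunk = match.group()
        if URL_OR_EMAIL.fullmatch(chunk):
            # Exception: URLs and e-mail addresses stay a single W token.
            pieces = [chunk]
        else:
            # Default: split the chunk at every C character.
            # The dot exceptions are NOT implemented in this sketch.
            pieces = PIECE.findall(chunk)
        # The chunk is followed by whitespace unless it ends the text.
        space_follows = match.end() < len(text)
        for i, piece in enumerate(pieces):
            is_last = i == len(pieces) - 1
            # Assumption: "_" is written when no space follows the token.
            misc = "SpaceAfter=Yes" if (is_last and space_follows) else "_"
            tokens.append((piece, misc))
    return tokens


if __name__ == "__main__":
    for form, misc in tokenize("AC/DC and Micro$oft"):
        print(form, misc, sep="\t")
```

Under these assumptions, `tokenize("AC/DC")` yields the three forms AC, / and DC, and only the last token of a space-delimited chunk can carry SpaceAfter=Yes.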

Note that the current version of the Slovenian UD Treebank does not yet comply with the universal guidelines' recommendation to split fused words, such as combinations of prepositions and pronouns, e.g. name “on me”, zanj “for him”, vase “in/to oneself”. Instead, these tokens are currently marked as pronouns with the feature Variant=Bound.
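Unsplit fused forms of this kind could be located in a CoNLL-U file with a sketch like the following. It assumes the standard ten-column layout (UPOS in column 4, FEATS in column 6) and that "marked as pronouns" corresponds to the UPOS tag PRON; the helper names `parse_feats` and `is_bound_pronoun` are hypothetical, and the sample line is invented for illustration rather than taken from the treebank.

```python
def parse_feats(feats):
    """Parse a FEATS string such as 'A=b|C=d' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def is_bound_pronoun(conllu_line):
    """Return True if a CoNLL-U token line is a pronoun with Variant=Bound."""
    cols = conllu_line.rstrip("\n").split("\t")
    # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    upos, feats = cols[3], cols[5]
    return upos == "PRON" and parse_feats(feats).get("Variant") == "Bound"


if __name__ == "__main__":
    # Invented sample line; only the FORM, UPOS and FEATS columns are filled in.
    sample = "1\tzanj\t_\tPRON\t_\tVariant=Bound\t_\t_\t_\t_"
    print(is_bound_pronoun(sample))
```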