Tokenization

Tokenization of the Slovenian UD Treebank reflects the following principles:

Space is the principal separator for tokens.

Sequences of words that can be written either with or without a space without changing their meaning (e.g. kdorkoli, kdor koli “anybody, any body”) follow the same principle and become either one or two tokens, depending on the use of space.

During tokenization, all tokens are divided into two categories: words (W) and characters (C). W tokens are alphanumeric strings between spaces, while C tokens are punctuation and symbol characters.

C tokens are recognized on the basis of a predefined list of punctuation and symbol characters included in the tokenizer.
A C token may include only one punctuation or symbol character. Sequences of two or more such characters (e.g. ?!) are treated as sequences of separate C tokens.

If a string of alphanumeric characters between two spaces includes C characters, it is usually split into several tokens (e.g. AC/DC and Micro$oft are each split into three tokens: AC / DC and Micro $ oft).
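The default behaviour described so far (split on spaces, then isolate every C character) can be sketched as follows. This is a minimal illustration, not the treebank's actual tokenizer: the `C_CHARS` set and the `split_basic` function are hypothetical names, and the character list is an abbreviated assumption standing in for the tokenizer's predefined list.

```python
# Hypothetical, abbreviated list of C characters (an assumption);
# the tokenizer's real predefined list is not reproduced here.
C_CHARS = set(".,;:!?/$%&()[]\"'-")


def split_basic(text):
    """Split on whitespace, then emit every C character as its own token."""
    tokens = []
    for chunk in text.split():
        word = ""
        for ch in chunk:
            if ch in C_CHARS:
                # Close the W token collected so far, if any.
                if word:
                    tokens.append(word)
                    word = ""
                # Each C character is a separate, single-character token.
                tokens.append(ch)
            else:
                word += ch
        if word:
            tokens.append(word)
    return tokens


print(split_basic("AC/DC Micro$oft"))
print(split_basic("Kaj?!"))
```

The sketch deliberately ignores the exceptions in which a C character stays inside a W token.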

However, the following exceptions apply, in which C characters become parts of W tokens:

Apostrophe becomes part of a W token if used without space on both sides (e.g. O’Brian, mor’va “O’Brian, we have to”).
Comma and colon become part of a W token if used without space on both sides and if the string contains only digits (e.g. 30:00, 200,000,000).
Hyphen becomes part of a W token if used without space on both sides and if:
- the left part is an acronym (in capital letters), a single letter or a digit
- the right part is an affix or an inflectional ending; a finite list of possible affixes and endings is integrated in the tokenizer
Examples: OZN-ovski “similar to United Nations”, a-ju “to the letter a”, 15-i “the 15th”.
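The apostrophe, comma/colon and hyphen exceptions could be approximated with a few regular expressions. The sketch below is only an illustration under stated assumptions: `SUFFIXES` is a hypothetical, abbreviated stand-in for the tokenizer's finite list of affixes and endings, the function name `is_single_w_token` is invented here, and the hyphen rule is read as requiring both the left-part and the right-part condition.

```python
import re

# Hypothetical, abbreviated suffix list (an assumption); the tokenizer's
# actual finite list of affixes and endings is not reproduced here.
SUFFIXES = {"ovski", "ju", "i", "ja", "jev"}

# Apostrophe with word characters (no space) on both sides.
APOSTROPHE_RE = re.compile(r"^\w+['’]\w+$")
# Digits only, grouped by commas or colons.
NUMERIC_RE = re.compile(r"^\d+(?:[,:]\d+)+$")
# Left part: capital-letter acronym, a single letter, or digits.
HYPHEN_RE = re.compile(r"^([A-ZČŠŽ]+|[^\W\d_]|\d+)-(\w+)$")


def is_single_w_token(chunk):
    """Return True if the space-delimited chunk should stay one W token."""
    if APOSTROPHE_RE.match(chunk) or NUMERIC_RE.match(chunk):
        return True
    m = HYPHEN_RE.match(chunk)
    # The right part must be a listed affix or inflectional ending.
    return bool(m) and m.group(2) in SUFFIXES


for chunk in ["O’Brian", "30:00", "200,000,000", "OZN-ovski", "a-ju", "15-i", "AC/DC"]:
    print(chunk, is_single_w_token(chunk))
```

Chunks for which the function returns False would fall back to the default splitting of C characters into separate tokens.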

Dot becomes part of a W token if it is:

- used without space on both sides and the string contains only digits (e.g. 1.2)
- used without space on the left and is part of an abbreviation or ordinal number (e.g. dr., 4., IV.); a finite list of possible abbreviations is integrated in the tokenizer
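The two dot conditions can be illustrated in the same style. This is a rough sketch only: `ABBREVIATIONS` is a hypothetical, abbreviated stand-in for the tokenizer's finite abbreviation list, and the function name `dot_stays_in_token` is invented for the example. It also does not attempt to distinguish an ordinal such as 4. from a number followed by a sentence-final full stop, which a real tokenizer would need context for.

```python
import re

# Hypothetical, abbreviated abbreviation list (an assumption).
ABBREVIATIONS = {"dr.", "mag.", "npr.", "itd."}

# Digits on both sides of every dot, e.g. 1.2
DECIMAL_RE = re.compile(r"^\d+(?:\.\d+)+$")
# Arabic or Roman numeral followed by a dot, e.g. 4. or IV.
ORDINAL_RE = re.compile(r"^(?:\d+|[IVXLCDM]+)\.$")


def dot_stays_in_token(chunk):
    """Return True if the dot(s) in the chunk remain part of one W token."""
    return bool(
        DECIMAL_RE.match(chunk)
        or ORDINAL_RE.match(chunk)
        or chunk.lower() in ABBREVIATIONS
    )


for chunk in ["1.2", "dr.", "4.", "IV.", "konec."]:
    print(chunk, dot_stays_in_token(chunk))
```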

URLs and e-mail addresses: all C characters become part of a single W token in strings recognized as URLs or addresses using a regular expression.
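The regular expression used by the tokenizer is not given here, so the patterns below are simplified assumptions meant only to show the idea: a chunk recognized as a URL or an e-mail address is kept whole, C characters included. The names `URL_RE`, `EMAIL_RE` and `is_url_or_email` are hypothetical.

```python
import re

# Simplified, assumed patterns; real-world URL and e-mail matching is
# considerably more permissive than this.
URL_RE = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$")


def is_url_or_email(chunk):
    """Return True if the chunk should be kept as a single W token."""
    return bool(URL_RE.match(chunk) or EMAIL_RE.match(chunk))


for chunk in ["https://universaldependencies.org/sl/", "info@example.com", "AC/DC"]:
    print(chunk, is_url_or_email(chunk))
```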

Information on whether a token is followed by a space (e.g. d.o.o. vs. d. o. o.) is recorded in the MISC column: a token that is not followed by a space carries the SpaceAfter=No feature.
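In the CoNLL-U format, the absence of a following space is marked with SpaceAfter=No, while tokens followed by a space carry no such feature. The sketch below shows one way such values could be derived from the raw text and its token list. The function name `misc_column` is hypothetical, the example assumes that d., o. and o. are each recognized as abbreviation tokens, and treating the last token of the text as unmarked is likewise an assumption.

```python
def misc_column(text, tokens):
    """Return one MISC value per token: 'SpaceAfter=No' or '_'."""
    misc = []
    pos = 0
    for tok in tokens:
        # Locate the token in the raw text, scanning left to right.
        start = text.index(tok, pos)
        pos = start + len(tok)
        # Assumption: the final token of the text is left unmarked.
        followed_by_space = pos >= len(text) or text[pos].isspace()
        misc.append("_" if followed_by_space else "SpaceAfter=No")
    return misc


print(misc_column("d.o.o.", ["d.", "o.", "o."]))
print(misc_column("d. o. o.", ["d.", "o.", "o."]))
```

With this information the original spacing of the text can be restored from the token sequence.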

Note that the current version of the Slovenian UD Treebank does not yet comply with the universal guidelines' recommendation to split fused words, such as combinations of prepositions and pronouns, e.g. name “on me”, zanj “for him”, vase “in/to oneself”. Instead, these tokens are currently marked as pronouns with the feature Variant=Bound.