Tokenization

The low-level tokenization of the Belarusian UD treebank generally adopts the RNC standard.

In general, tokens are delimited by whitespace. The regexp [A-zА-яЁёЎўІі-]+ usually corresponds to one token.
Punctuation (recognized by the corresponding Unicode property) that is conventionally written adjacent to the preceding or following word is separated during tokenization.
Each punctuation mark is treated as a single token, e.g. the sequence )”, - becomes four tokens: ) , ” , , and - . Exceptions are conventional multi-character punctuation marks: -- , ... , ?! , etc., and emojis and smileys: :) , ^_^ , etc.
Conventional non-Cyrillic multi-character terms are tokenized as single tokens: °С, км2.
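
The general rules above can be sketched in Python as follows. This is a minimal illustration, not the treebank's actual tokenizer: the names KEEP_WHOLE, WORD and tokenize are hypothetical, the list of multi-character marks is only a sample, and the word pattern (letters and digits, optionally joined by a hyphen or an apostrophe) is an assumed simplification of the regexp given above. Any character that is neither part of a word nor part of a listed mark is emitted as its own token, which covers single punctuation marks.

```python
import re

# Marks kept whole; an illustrative subset, not the treebank's full inventory.
KEEP_WHOLE = ["...", "--", "?!", ":)", "^_^"]

# Letters/digits, optionally joined by a hyphen or apostrophe (assumed simplification).
WORD = re.compile(r"[^\W_]+(?:[-'’][^\W_]+)*")

def tokenize(text):
    """Split on whitespace, then separate punctuation from adjacent words."""
    tokens = []
    for chunk in text.split():
        i = 0
        while i < len(chunk):
            whole = next((m for m in KEEP_WHOLE if chunk.startswith(m, i)), None)
            word = WORD.match(chunk, i)
            if whole:
                # Conventional multi-character mark: keep as one token.
                tokens.append(whole)
                i += len(whole)
            elif word:
                tokens.append(word.group())
                i = word.end()
            else:
                # Any other character (e.g. a punctuation mark) is its own token.
                tokens.append(chunk[i])
                i += 1
    return tokens

print(tokenize("(так)”, - не?!"))
```

Checking the multi-character marks before the word pattern keeps sequences such as ?! from being split into single punctuation tokens.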

Some special cases worth mentioning:

Numerical expressions including decimal numbers, such as 245, 3,14, are treated as single tokens.
Time expressions like 20:55 are split into separate tokens (in this case, three: { 20 , : , 55 }).
Dates like 20.04.2012 are split into separate tokens (in this case, five: { 20 , . , 04 , . , 2012 }).
Special symbols before and after numerical expressions, as in $500 , 2,67% , +27°С , are tokenized separately (so the tokens are { $ , 500 } , { 2,67 , % } , { + , 27 , °С }).
Numerical expressions with a hyphen and Cyrillic endings (e.g. 1-ый “1st”, 3-м “3rd.Ins”) as well as adjectives and other non-numerals which contain digits (e.g. 79-гадовы “79 year old”, 500-годдзе “500th anniversary”) are treated as single tokens.
Other words with a hyphen are treated as single tokens, except for cases where the first part is inflected. Examples: { з-за } “because of”, { зялёна-шэрых } “green-gray”, { Санкт-Пецярбург } “St. Petersburg”, but { Ростове , - , на , - , Дону } “(in) Rostov on Don”.
Abbreviations are treated as single tokens; whitespace splits an abbreviation into separate tokens.
Abbreviations marked by a period, as in стр. “p. (page)”, П. “P. (for Peter)”, are treated as single tokens. If the abbreviation's period coincides with the end-of-sentence period, it is written once as a separate token (denoting the end of the sentence), e.g. { 1914 , г , . } “year 1914”.
Abbreviations cannot contain an internal period, i.e. patterns like і т.д. “and so on”, да т.п. “and so forth” are split into three tokens: { і , т. , д. }, { да , т. , п. }.
Email addresses, URLs, and tweet-style names are treated as single tokens: { no@mail.ru }, { https://github.com }, { @anna_li }.
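
Several of the special cases above (numbers, times, dates, special symbols, digit-plus-ending forms, email addresses, URLs and tweet-style names) can be approximated with a single ordered regular expression. The sketch below is only an illustration under stated assumptions: TOKEN and tokenize_special are hypothetical names, the URL and email patterns are deliberately loose, and abbreviations with a period are not handled because they would need a lexicon that the text does not describe.

```python
import re

# Alternatives are tried left to right, so more specific patterns come first.
TOKEN = re.compile(r"""
    https?://\S+                      # URL (loose: runs to the next whitespace)
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+      # email address (loose approximation)
  | @\w+                              # tweet-style name
  | \d+-[^\W\d_]+                     # digits, hyphen, letters: 1-ый, 79-гадовы
  | \d+,\d+                           # decimal number with a comma: 3,14
  | \d+                               # plain integer
  | °[CС]                             # degree unit kept whole (Latin or Cyrillic letter)
  | [^\W\d_]+(?:-[^\W\d_]+)*          # word, possibly hyphenated
  | \S                                # any other single symbol
""", re.VERBOSE)

def tokenize_special(text):
    """Return all tokens found in text; whitespace is skipped."""
    return TOKEN.findall(text)

print(tokenize_special("20:55"))
print(tokenize_special("$500 2,67% +27°С"))
```

Because a colon or a period is matched only by the final single-symbol alternative, times and dates fall apart into digits and separators, while a comma between digits is consumed by the decimal pattern.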

The Belarusian UD treebank does not contain multiword tokens.

Indefinite pronouns and adverbs

хто-небудзь “someone, somebody” = single token
сёй-той , Gen. сяго-таго “someone” = three tokens: { сёй , - , той }, { сяго , - , таго }
хтось , хтосьці “someone, somebody” = single token (orthographic rule)
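
The text does not say how forms such as сёй-той are identified, so the following sketch simply assumes a hand-maintained exception list. SPLIT_HYPHENATED and split_hyphenated are hypothetical names, and the list contains only the two forms cited above.

```python
# Hypothetical exception list: hyphenated forms whose parts are tokenized separately.
SPLIT_HYPHENATED = {"сёй-той", "сяго-таго"}

def split_hyphenated(token):
    """Split a listed hyphenated form into its parts and hyphens; keep others whole."""
    if token.lower() not in SPLIT_HYPHENATED:
        return [token]
    out = []
    for i, part in enumerate(token.split("-")):
        if i:
            out.append("-")  # the hyphen becomes a token of its own
        out.append(part)
    return out

print(split_hyphenated("сёй-той"))
print(split_hyphenated("хто-небудзь"))
```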

Verb forms, analytical grammatical forms, negation

reflexive verbs are kept as a single token (orthographic rule): з’яўляецца “is”.
the forms of the subjunctive mood, analytical passive, complex future tense, complex comparatives, etc. are split according to the orthographic principle: { маглі , б } “would be able, could”, { былі , зафіксаваныя } “were recorded”, { будзе , трымацца } “will be held”, { больш , сур’ёзныя } “more serious”.
не and ні used as negation markers with verbs, pronouns and other words are tokenized according to the orthographic rules: { не , рэагуючы } “not reacting”, { ні , ў , каго } “at no one”, but { непапраўнай } “irrecoverable”, { незавершаны } “not finished”, { ніякіх } “no one”.
паў- and напаў- “half” are never kept separate: паўбеспрацоўны “half-unemployed”, напаўзабыты “half-forgotten”.

Character set

-,;:!?.’’”“”()/&#%°+0123456789aAábdDeěfFghHiIjkKlLmn№oOpPrRsStTuvVwWXyаАбБвВгГдДеЕёЁжЖзЗіІйкКлЛмМнНоОпПрРсСтТуУўфФхХцЦчЧшШыьэЭюЮяЯ