Tokenization
The low-level tokenization of the Belarusian UD treebank generally adopts the RNC standard.
- In general, tokens are delimited by whitespace. A match of the regexp [A-zА-яЁёЎўі-]+ usually corresponds to one token.
- Punctuation (recognized by the corresponding Unicode property) that is conventionally written adjacent to the preceding or following word is separated during tokenization.
- Each punctuation mark is treated as a single token, e.g. the sequence )”, - becomes four tokens: ) , ” , , , and - . Exceptions are conventional multi-character punctuation marks: – , … , ?! , etc., and emojis and smileys: :) , ^_^, etc.
- Conventional non-Cyrillic multi-character terms are tokenized as single tokens: °С, км2.
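The basic rules above can be illustrated with a minimal Python sketch. It is not the treebank's actual tokenizer: the function name tokenize, the KEEP_WHOLE list and the WORD_RE pattern are assumptions made for this example, and decimal numbers, abbreviations and the other special cases are deliberately left out.

```python
import re

# Sequences kept whole. This list is an illustrative assumption, not the
# treebank's actual inventory; "\u00b0\u0421" is the degree sign plus Cyrillic С.
KEEP_WHOLE = ["^_^", "?!", ":)", "\u00b0\u0421"]

# A word: letters or digits, optionally joined by a hyphen or an apostrophe.
WORD_RE = re.compile(r"[^\W_]+(?:[-'\u2019][^\W_]+)*")


def tokenize(text):
    """Split on whitespace, then separate punctuation from adjacent words."""
    tokens = []
    for chunk in text.split():
        i = 0
        while i < len(chunk):
            # 1. Conventional multi-character marks stay together.
            whole = next((s for s in KEEP_WHOLE if chunk.startswith(s, i)), None)
            if whole is not None:
                tokens.append(whole)
                i += len(whole)
                continue
            # 2. A run of letters/digits (with inner hyphens) is one token.
            match = WORD_RE.match(chunk, i)
            if match:
                tokens.append(match.group())
                i = match.end()
            else:
                # 3. Any other character (punctuation, symbol) is its own token.
                tokens.append(chunk[i])
                i += 1
    return tokens


print(tokenize("слова)\u201d, -"))
```

Checking the exception list before the generic single-character fallback is what keeps ?! and :) from being split into their component marks.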
Some special cases worth mentioning:
- Numerical expressions including decimal numbers, such as 245, 3,14, are treated as single tokens.
- Time expressions like 20:55 are split into separate tokens (in this case, three: { 20 , : , 55 }).
- Dates like 20.04.2012 are split into separate tokens (in this case, five: { 20 , . , 04 , . , 2012 }).
- Special symbols before and after numerical expressions, as in $500 , 2,67% , +27°С , are tokenized separately (so the tokens are { $ , 500 } , { 2,67 , % } , { + , 27 , °С }).
- Numerical expressions with a hyphen and Cyrillic endings (e.g. 1-ый “1st”, 3-м “3rd.Ins”) as well as adjectives and other non-numerals which contain digits (e.g. 79-гадовы “79 year old”, 500-годдзе “500th anniversary”) are treated as single tokens.
- Other words with a hyphen are treated as single tokens, except when the first part is inflected. Examples: { з-за } “because of”, { зялёна-шэрых } “green-gray”, { Санкт-Пецярбург } “St. Petersburg”, but { Ростове , - , на , - , Дону} “(in) Rostov on Don”.
- Abbreviations are treated as single tokens; whitespace inside an abbreviation splits it into separate tokens.
- Abbreviations marked by a period, as in стр. “p. (page)”, П. “P. (for Peter)”, are treated as single tokens. If the abbreviation's period coincides with the sentence-final period, it is written once, as a separate token marking the end of the sentence, e.g. { 1914 , г , . } “year 1914”.
- An abbreviation token cannot contain an internal period, i.e. patterns like і т.д. “and so on”, да т.п. “and so forth” are split into three tokens: { і , т. , д. }, { да , т. , п. }.
- Email addresses, URLs, and tweet-style names are treated as single tokens: {no@mail.ru}, {https://github.com}, {@anna_li}
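Several of the special cases above can be approximated with one ordered regular expression, where earlier alternatives take priority. The following Python sketch is only an illustration under stated assumptions: TOKEN_RE and tokenize_special are hypothetical names, the URL and email patterns are deliberately crude, and abbreviations and hyphenated words with an inflected first part are not handled because they need lexical knowledge.

```python
import re

# Ordered alternatives: the first one that matches at a position wins.
TOKEN_RE = re.compile(
    r"""
      https?://\S+                            # URL, kept whole
    | [\w.+-]+@[\w-]+(?:\.[\w-]+)+            # email address, kept whole
    | @\w+                                    # tweet-style name
    | \d+-[^\W\d_]+                           # digits + hyphen + letters
    | \d+(?:,\d+)?                            # integer or decimal with comma
    | \u00b0\u0421                            # degree sign + Cyrillic letter
    | [^\W\d_]+(?:[-'\u2019][^\W\d_]+)*       # ordinary or hyphenated word
    | \S                                      # any other single character
    """,
    re.VERBOSE,
)


def tokenize_special(text):
    """Return the tokens of text; whitespace never matches and is skipped."""
    return TOKEN_RE.findall(text)


print(tokenize_special("20:55"))        # time: split on the colon
print(tokenize_special("20.04.2012"))   # date: split on the periods
print(tokenize_special("2,67%"))        # decimal kept whole, % separate
```

Because the decimal pattern only accepts a comma between digits, 3,14 stays together while the periods and colon in dates and times fall through to the single-character alternative. A URL immediately followed by punctuation would absorb that punctuation in this sketch, so a real implementation would need a stricter URL pattern.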
The Belarusian UD treebank does not contain multiword tokens.
Indefinite pronouns and adverbs
- хто-небудзь “someone, somebody” = single token
- сёй-той , Gen. сяго-таго “someone” = three tokens: { сёй , - , той }, { сяго , - , таго }
- хтось , хтосьці “someone, somebody” = single token (orthographic rule)
Verb forms, analytical grammatical forms, negation
- reflexive verbs are kept as a single token (orthographic rule): з’яўляецца “is”.
- the forms of subjunctive mood, analytical passive, complex future tense, complex comparatives, etc. are split according to the orthographic principle: { маглі , б } “would be able, could”, { былі , зафіксаваныя } “were recorded”, { будзе , трымацца } “will be held”, { больш , сур’ёзныя } “more serious”
- не and ні used as negation markers with verbs, pronouns and other words are tokenized according to the orthographic rules: { не , рэагуючы } “not reacting”, { ні , ў , каго } “at no one”, but { непапраўнай } “irrecoverable”, { незавершаны } “not finished”, { ніякіх } “no one”.
- паў- and напаў- “half” are never kept separate: паўбеспрацоўны “half-unemployed”, напаўзабыты “half-forgotten”.
Character set
-,;:!?.’’”“”()/&#%°+0123456789aAábdDeěfFghHiIjkKlLmn№oOpPrRsStTuvVwWXyаАбБвВгГдДеЕёЁжЖзЗіІйкКлЛмМнНоОпПрРсСтТуУўфФхХцЦчЧшШыьэЭюЮяЯ