Tokenization
The low-level tokenization of the Russian UD treebanks generally adopts the RNC standard.
- In general, tokens are delimited by whitespace. The regexp [А-zА-яЁё-]+ usually corresponds to one token.
- Punctuation (recognized by the corresponding Unicode property) that is conventionally written adjacent to the preceding or following word is separated during tokenization.
- Each punctuation mark is treated as a single token, e.g. the sequence )”, - becomes four tokens: ) , ” , , and - . Exceptions are conventional multi-character punctuation marks: – , … , ?! , etc., and emojis and smileys: :) , ^_^ , etc.
- Conventional non-Cyrillic multi-character terms are tokenized as single tokens: °С.
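The basic rules above can be sketched with a single regular expression. This is only an illustration, not the tokenizer used to build the treebanks: the name `tokenize_basic` and the short list `MULTI_CHAR_MARKS` are hypothetical, and the sketch treats every non-word, non-space character as a one-character token instead of checking the Unicode punctuation property. Word-internal hyphens are kept here; numbers, particles and other special cases are not handled.

```python
import re

# Hypothetical, non-exhaustive list of multi-character marks kept whole.
MULTI_CHAR_MARKS = ["?!", ":)", "^_^"]

# Letters and digits, with optional word-internal hyphens (e.g. "из-за").
WORD = r"[A-Za-zА-Яа-яЁё0-9]+(?:-[A-Za-zА-Яа-яЁё0-9]+)*"

TOKEN_RE = re.compile(
    "|".join(re.escape(mark) for mark in MULTI_CHAR_MARKS)  # kept whole
    + "|" + WORD                                            # one word
    + r"|\S"                                                # any other single character
)


def tokenize_basic(text):
    """Split text into tokens; whitespace only separates tokens."""
    return TOKEN_RE.findall(text)


print(tokenize_basic("Да )”, - нет?!"))
```

The alternatives are tried in order, so a multi-character mark wins over its individual characters, and a hyphen is only kept inside a token when it has word characters on both sides.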
Some special cases worth mentioning:
- Numerical expressions including decimal numbers, such as 245, 3,14, are treated as single tokens.
- Time expressions like 20:55 are split into separate tokens (in this case, three: { 20 , : , 55 }).
- Dates like 20.04.2012 are split into separate tokens (in this case, five: { 20 , . , 04 , . , 2012 }).
- Special symbols before and after numerical expressions, as in $500 , 2,67% , +27°С , are tokenised separately (so, the tokens are { $ , 500 } , { 2,67 , % } , { + , 27 , °С }).
- Numerical expressions with a hyphen and Cyrillic endings (e.g. 1-ый “1st”, 3-м “3rd.Ins”) as well as adjectives and other non-numerals which contain digits (e.g. 79-летний “79 year old”, 500-летие “500th anniversary”) are treated as single tokens.
- Other words with a hyphen are treated as single tokens, except for the cases when the first part is inflected. Examples: { из-за } “because of”, { зелено-серых } “green-gray”, { Санкт-Петербург } “St. Petersburg”, but { Ростове , - , на , - , Дону } “(in) Rostov on Don”.
- The discourse particles -то and -с are tokenised separately, e.g. Вася-то { Вася , - , то }. Exception: indefinite pronouns and adverbs, see below.
- Abbreviations are treated as single tokens; whitespace always splits an abbreviation into separate tokens.
- Abbreviations marked by a period, as in стр. “p. (page)”, П. “P. (for Peter)”, are treated as single tokens. If the period overlaps with the end of sentence period then it is written once as a separate token (denoting end-of-sentence), e.g. { 1914 , г , . } “year 1914”.
- Abbreviations cannot contain a period inside, i.e. patterns like и т.д. “and so on”, и т.п. “and so forth” are split into three tokens: { и , т. , д. }, { и , т. , п. }.
- Email addresses, URLs, and tweet-style names are treated as single tokens: {no@mail.ru}, {https://github.com}, {@anna_li}
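Several of the special cases above (numbers, times, dates, adjacent symbols, digit-plus-ending words, email addresses, URLs and tweet-style names) can be illustrated with an ordered regular expression. This is a rough sketch under simplifying assumptions: `tokenize_special` and the individual patterns are hypothetical, the URL pattern simply runs to the next whitespace, and abbreviations, inflected first parts (Ростове-на-Дону) and the particles -то and -с are not handled, since they need lexical or sentence-level information.

```python
import re

CYR = "А-Яа-яЁё"

SPECIAL_RE = re.compile(
    r"https?://\S+"                         # URL, kept whole
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"        # email address, kept whole
    r"|@\w+"                                # tweet-style name, kept whole
    rf"|\d+-[{CYR}]+"                       # 1-ый, 79-летний: single token
    r"|\d+,\d+"                             # decimal number with a comma
    r"|\d+"                                 # integer
    rf"|[A-Za-z{CYR}]+(?:-[A-Za-z{CYR}]+)*" # word, possibly hyphenated
    r"|\S"                                  # any other single character
)


def tokenize_special(text):
    """Tokenize text, trying the more specific patterns first."""
    return SPECIAL_RE.findall(text)


for sample in ["3,14", "20:55", "20.04.2012", "$500", "2,67%", "79-летний"]:
    print(sample, tokenize_special(sample))
```

Because a colon or period is not part of any number pattern, times and dates fall apart into digit runs and single punctuation tokens, while the decimal comma is absorbed by the `\d+,\d+` alternative.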
The Russian UD treebanks do not contain multiword tokens. (The UD_Russian-Syntagrus treebank v.1.3 and v.1.4 contained multiword tokens following the Syntagrus standard.)
Pronouns and adverbs
- Indefinite pronouns and adverbs like кто-нибудь, кто-либо, кто-то, кое-кто “someone, somebody”, etc. are treated as single tokens.
- In prepositional phrases, the pronouns with кое- are split into three tokens: кое к кому { кое , к , кому } “to someone”.
- Negative pronouns, adverbs and adverbial predicatives are tokenized as single tokens, e.g. никто “no one”, никакой “no, neither”, нигде “nowhere”, негде “there is no place”. However, in prepositional phrases the negative pronouns and predicatives are split into three tokens: ни к кому { ни , к , кому } “to no one”, не от кого { не , от , кого } “there is no one”.
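The contrast between the discourse particles -то and -с (split off) and indefinite pronouns and adverbs in -то (kept whole) cannot be decided from the surface form alone. One possible approach, shown here only as a sketch, is a lexicon lookup; the function `split_particle` and the deliberately tiny set `INDEFINITE_BASES` are hypothetical and far from complete.

```python
# Hypothetical, incomplete set of pronoun and adverb forms that build
# indefinite pronouns/adverbs with -то (кто-то, кого-то, где-то, ...).
INDEFINITE_BASES = {
    "кто", "кого", "кому", "кем", "ком",
    "что", "чего", "чему", "чем",
    "где", "куда", "когда", "как",
}

PARTICLES = {"то", "с"}


def split_particle(word):
    """Split a trailing -то/-с particle unless the word is a listed indefinite form."""
    base, hyphen, last = word.rpartition("-")
    if hyphen and last in PARTICLES and base.lower() not in INDEFINITE_BASES:
        return [base, "-", last]
    return [word]


print(split_particle("Вася-то"))
print(split_particle("кто-то"))
```

Forms with -нибудь, -либо and кое- never reach the lookup, because their last hyphenated part is not one of the two particles, so they stay single tokens.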
Verb forms, analytical grammatical forms, negation
- reflexive verbs are kept as a single token (orthographic rule): моется “washes oneself”, пройтись “to have a walk”.
- the forms of subjunctive mood, analytical passive, complex future tense, complex comparatives, etc. are split according to the orthographic principle: { могли , бы } “would be able, could”, { были , зафиксированы } “were recorded”, { будет , держаться } “will be held”, { более , серьезные } “more serious”.
- не and ни used as negation markers with verbs, pronouns and other words are tokenized according to the orthographic rules: { не , реагируя } “not reacting”, { ни , в , какую } “in no way”, but { непоправимый } “irrecoverable”, { незавершенный } “not finished”, { никаких } “no one”.
- пол- and полу- “half” are never kept separate: поллитра “half a liter”, пол-яблока “half an apple”, полубезработный “almost unemployed”.