PROPN: proper noun
Definition
A proper noun is a noun that is the name of a specific individual, place, or object. Czech proper nouns are always written starting with an uppercase letter. Note that names of days of the week (pondělí, úterý, středa, čtvrtek, pátek, sobota, neděle) and names of months (leden, únor, březen, duben, květen, červen, červenec, srpen, září, říjen, listopad, prosinec) are not capitalized (unlike in English) and are not considered proper nouns.
Single-word named entities should be tagged PROPN even if they originate
from a common noun (Zajíc, Huť) or an adjective (Veselý, Teplá).
Even if they were originally adjectives and inflect according to adjectival
paradigms, they behave syntactically as nouns. For instance, Teplá
(a river and city in western Bohemia) is originally the feminine form of the
adjective teplý “warm” but as a geographical name, it is a noun.
It denotes a concrete location (rather than a property of somebody/something)
and its feminine gender is fixed (while adjectives have forms in all three
genders).
Note that names of languages (čeština, angličtina)
and adjectives derived from geographical names (český, anglický “Czech, English”)
are written in lowercase and are not tagged PROPN.
Personal names are typically treated as a sequence of proper nouns
(one or more given names and one or more surnames).
If the name contains prepositions, conjunctions or articles (foreign names
and old Czech names), these are tagged as ADP, CCONJ and DET, respectively.
Czech (and other Slavic) multi-word named entities have internal syntactic
structure, which is preserved in the annotation. The headword is always a noun
and there may be other nouns involved. They will be tagged either PROPN or
NOUN and possible ambiguities must be resolved individually.
Modifying adjectives are never tagged PROPN. Even if an adjective is the
first word of a multi-word name, and thus it starts with an uppercase letter,
it is still tagged ADJ.
Similarly, function words in named entities retain their normal tags.
These rules are less strict for foreign named entities where the original
part of speech is hidden for a Czech speaker.
Examples
- Bečov.PROPN nad.ADP Teplou.PROPN is a city. Bečov is the head and the nad Teplou part refers to the river flowing through the city, to distinguish it from other Bečovs.
- Červený.ADJ Újezd.PROPN is a village. Újezd is the head and it is tagged PROPN although it originates in the common noun újezd “district, riding”. There are many locations named Újezd and the noun is perceived as a proper noun in current Czech. Červený is an adjective meaning “red” and it is tagged ADJ.
- Červená.ADJ řeka.NOUN “Red River”. Even though the two words together are a name of a particular river, řeka is a common noun and is tagged as such.
- Organizace.NOUN spojených.ADJ národů.NOUN “United Nations Organization” consists of three words, none of which is a proper noun. However, the acronym OSN “UNO” is a single-token name and is tagged PROPN.
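The form.TAG notation used in the examples above can be read into (form, tag) pairs for inspection. The following is a minimal sketch; the helper name `parse_tagged` is hypothetical and not part of any UD tool.

```python
def parse_tagged(text):
    """Split whitespace-separated 'form.TAG' tokens into (form, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the last dot so that forms containing dots stay intact.
        form, _, tag = token.rpartition(".")
        pairs.append((form, tag))
    return pairs


examples = [
    "Bečov.PROPN nad.ADP Teplou.PROPN",
    "Červený.ADJ Újezd.PROPN",
    "Červená.ADJ řeka.NOUN",
    "Organizace.NOUN spojených.ADJ národů.NOUN",
]

for example in examples:
    pairs = parse_tagged(example)
    # Collect only the words tagged PROPN in each multi-word name.
    proper = [form for form, tag in pairs if tag == "PROPN"]
    print(pairs, "PROPN words:", proper)
```

Note that the last two names contain no PROPN word at all, which is consistent with the rule that adjectives and common nouns keep their ordinary tags inside a name.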
Conversion from the Prague Dependency Treebank
The PDT set of morphological (part-of-speech) tags does not distinguish
common and proper nouns. However, lemmas in PDT contain additional features
that also encode types of named entities. When converting the PDT annotation
to UD, these lemma features are removed, the PROPN tag is used and the feature
cs-feat/NameType is added to the universal features to preserve the type.
Only nouns are treated this way.
Foreign adjectives are not converted to PROPN despite the fact
that they entered Czech as parts of foreign names and their lemmas contain
the name type feature.
The following table lists the name types together with the most frequent examples. See http://ufal.mff.cuni.cz/techrep/tr27.pdf, page 8, section 2.1 (Lemma structure) for more details.
PDT lemma feature | Name type | Most frequent examples | English equivalents |
_;Y | given name | Jan, Jiří, Václav, Petr, Josef | “Jan, Jiří, Václav, Petr, Josef” |
_;S | surname | Klaus, Havel, Němec, Jelcin, Svoboda | “Klaus, Havel, Němec, Yeltsin, Svoboda” |
_;E | member of a particular nation, inhabitant of a particular territory | Němec, Čech, Srb, Američan, Slovák | “German, Czech, Serbian, American, Slovak” |
_;G | geographical name | Praha, ČR, Evropa, Německo, Brno | “Prague, CR, Europe, Germany, Brno” |
_;K | company, organization, institution | ODS, OSN, Sparta, ODA, Slavia | “ODS, UN, Sparta, ODA, Slavia” |
_;R | product | LN, Mercedes, Tatra, PC, MF | “LN, Mercedes, Tatra, PC, MF” |
_;m | other proper name: names of mines, stadiums, guerilla bases etc. | US, PVP, Prix, Rapaport, Tour | “US, PVP, Prix, Rapaport, Tour” |
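The conversion described above (remove the name-type marker from the lemma, keep the type as a NameType feature) can be sketched as follows. This is an illustration under stated assumptions, not the actual conversion script: the function name `convert_lemma` is hypothetical, the pairing of PDT codes with the NameType values Giv, Sur, Nat, Geo, Com, Pro and Oth is inferred from the table and the feature values listed on this page, and the sketch assumes that several markers may be stacked on one lemma. Other PDT lemma annotations are left untouched.

```python
import re

# Assumed correspondence between PDT name-type codes and UD NameType values.
PDT_TO_NAMETYPE = {
    "Y": "Giv",  # given name
    "S": "Sur",  # surname
    "E": "Nat",  # member of a nation, inhabitant of a territory
    "G": "Geo",  # geographical name
    "K": "Com",  # company, organization, institution
    "R": "Pro",  # product
    "m": "Oth",  # other proper name
}

NAME_MARKER = re.compile(r"_;([YSEGKRm])")


def convert_lemma(pdt_lemma):
    """Return (bare lemma, NameType feature or None) for a PDT-style lemma."""
    codes = NAME_MARKER.findall(pdt_lemma)
    lemma = NAME_MARKER.sub("", pdt_lemma)
    # Multiple values are joined with commas in alphabetical order.
    values = sorted({PDT_TO_NAMETYPE[code] for code in codes})
    feature = "NameType=" + ",".join(values) if values else None
    return lemma, feature


print(convert_lemma("Praha_;G"))
print(convert_lemma("Němec_;S_;E"))
print(convert_lemma("řeka"))
```

A lemma without any name-type marker yields no feature, which mirrors the statement that only nouns carrying such markers are converted to PROPN.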
Diffs
Prague Dependency Treebank
Articles in foreign names (the, die, le) are tagged ADJ, not DET. Otherwise, the morphological analysis usually includes the original part of speech of foreign words.
References
Treebank Statistics (UD_Czech)
There are 14162 PROPN lemmas (25%), 20294 PROPN types (17%) and 74824 PROPN tokens (6%).
Out of 17 observed tags, the rank of PROPN is: 2 in number of lemmas, 3 in number of types and 6 in number of tokens.
The 10 most frequent PROPN lemmas: Praha, ČR, Evropa, LN, Jan, Jiří, Německo, Brno, USA, ODS
The 10 most frequent PROPN types: Praha, ČR, Praze, LN, USA, ODS, J, Jiří, Jan, OSN
The 10 most frequent ambiguous lemmas: J (PROPN 381, ADJ 24), M (PROPN 219, NOUN 8), V (PROPN 183, NUM 21, NOUN 7, ADJ 5), A (PROPN 158, NOUN 8, ADJ 8), York (PROPN 145, ADJ 4), P (PROPN 127, ADJ 4, NOUN 2), čt (NOUN 4, PROPN 2), S (PROPN 108, ADJ 11, NOUN 2), r (NOUN 5, PROPN 1, ADV 1), F (PROPN 85, ADJ 10, NOUN 9)
The 10 most frequent ambiguous types: J (PROPN 381, ADJ 24, NOUN 3), M (PROPN 219, NOUN 35), V (ADP 3315, PROPN 183, NUM 21, NOUN 15, ADJ 6), A (CCONJ 933, PROPN 158, NOUN 54, ADJ 19, X 1), Rusko (PROPN 132, ADJ 3), P (PROPN 127, ADJ 15, NOUN 10, ADP 1), Německo (PROPN 119, ADJ 2), S (ADP 414, PROPN 109, NOUN 25, ADJ 13), r (NOUN 351, ADV 1, PROPN 1), Plzeň (PROPN 89, NOUN 20)
- J
- M
- V
- A
- CCONJ 933: A skutečně přišel s návrhem na zjednodušený model řízení .
- PROPN 158: Přenosová rychlost : ( A 4 / sec ) 12
- NOUN 54: Našeho čtenáře bude zajímat licence A .
- ADJ 19: A O Travel se srazit nedala
- X 1: Písně na sebe upozorňují výraznými uvolněnými refrény ( ve Without A Trace krátce přebírá refrénový motiv zpěváka i kytara ) , do nichž se koncentruje energetická síla hudebníků .
- Rusko
- P
- Německo
- S
- r
- Plzeň
Morphology
The form / lemma ratio of PROPN is 1.432990 (the average of all parts of speech is 2.162583).
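The reported ratio appears to be the number of PROPN types divided by the number of PROPN lemmas given in the UD_Czech statistics above. A short check under that assumption (variable names are illustrative only):

```python
# Counts copied from the UD_Czech statistics for PROPN.
propn_lemmas = 14162
propn_types = 20294

# Assumption: "form / lemma ratio" means distinct word forms per lemma.
ratio = propn_types / propn_lemmas
print(f"{ratio:.6f}")
```

The same calculation with the UD_Czech-CAC counts (4315 types, 3405 lemmas) gives the ratio reported for that treebank.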
The 1st highest number of forms (11) was observed with the lemma “Čech”: ČECH, ČEŠI, Čech, Čecha, Čechem, Čechovi, Čechy, Čechů, Čechům, Češi, Češích.
The 2nd highest number of forms (10) was observed with the lemma “Jan”: JAN, JANA, Jan, Jana, Janem, Janovi, Janové, Janu, Jany, Janů.
The 3rd highest number of forms (10) was observed with the lemma “Němec”: NĚMCI, NĚMCŮ, NĚMEC, Němce, Němcem, Němci, Němcích, Němců, Němcům, Němec.
PROPN occurs with 9 features: cs-feat/NameType (74824; 100% instances), cs-feat/Polarity (74824; 100% instances), cs-feat/Gender (73077; 98% instances), cs-feat/Number (61287; 82% instances), cs-feat/Case (59253; 79% instances), cs-feat/Animacy (43715; 58% instances), cs-feat/Abbr (11545; 15% instances), cs-feat/Foreign (3218; 4% instances), cs-feat/Style (136; 0% instances)
PROPN occurs with 46 feature-value pairs: Abbr=Yes, Animacy=Anim, Animacy=Inan, Case=Acc, Case=Dat, Case=Gen, Case=Ins, Case=Loc, Case=Nom, Case=Voc, Foreign=Yes, Gender=Fem, Gender=Masc, Gender=Neut, NameType=Com, NameType=Com,Geo, NameType=Com,Giv, NameType=Com,Giv,Sur, NameType=Com,Nat, NameType=Com,Pro, NameType=Com,Sur, NameType=Geo, NameType=Geo,Giv, NameType=Geo,Giv,Sur, NameType=Geo,Oth, NameType=Geo,Pro, NameType=Geo,Sur, NameType=Giv, NameType=Giv,Nat, NameType=Giv,Oth, NameType=Giv,Pro, NameType=Giv,Pro,Sur, NameType=Giv,Sur, NameType=Nat, NameType=Nat,Sur, NameType=Oth, NameType=Pro, NameType=Pro,Sur, NameType=Sur, Number=Plur, Number=Sing, Polarity=Pos, Style=Arch, Style=Coll, Style=Expr, Style=Rare
PROPN occurs with 590 feature combinations.
The most frequent feature combination is Animacy=Anim|Case=Nom|Gender=Masc|NameType=Sur|Number=Sing|Polarity=Pos (12654 tokens).
Examples: Klaus, Havel, Svoboda, Jelcin, Zeman, Mečiar, Němec, Jelínek, Novák, John
Relations
PROPN nodes are attached to their parents using 25 different relations: cs-dep/nmod (23813; 32% instances), cs-dep/nsubj (13133; 18% instances), cs-dep/flat (12090; 16% instances), cs-dep/conj (6872; 9% instances), cs-dep/obl (5992; 8% instances), cs-dep/root (4816; 6% instances), cs-dep/dep (2495; 3% instances), cs-dep/obj (2396; 3% instances), cs-dep/appos (1202; 2% instances), cs-dep/iobj (635; 1% instances), cs-dep/orphan (450; 1% instances), cs-dep/flat:foreign (373; 0% instances), cs-dep/nsubj:pass (322; 0% instances), cs-dep/advcl (145; 0% instances), cs-dep/xcomp (29; 0% instances), cs-dep/cc (17; 0% instances), cs-dep/vocative (15; 0% instances), cs-dep/ccomp (7; 0% instances), cs-dep/case (6; 0% instances), cs-dep/amod (5; 0% instances), cs-dep/acl (4; 0% instances), cs-dep/parataxis (3; 0% instances), cs-dep/csubj (2; 0% instances), cs-dep/csubj:pass (1; 0% instances), cs-dep/punct (1; 0% instances)
Parents of PROPN nodes belong to 15 different parts of speech: NOUN (23369; 31% instances), PROPN (23293; 31% instances), VERB (19956; 27% instances), ROOT (4816; 6% instances), ADJ (2496; 3% instances), NUM (342; 0% instances), ADV (328; 0% instances), DET (94; 0% instances), PRON (87; 0% instances), PART (17; 0% instances), ADP (11; 0% instances), SYM (7; 0% instances), CCONJ (4; 0% instances), INTJ (2; 0% instances), PUNCT (2; 0% instances)
30012 (40%) PROPN nodes are leaves.
24315 (32%) PROPN nodes have one child.
11487 (15%) PROPN nodes have two children.
9010 (12%) PROPN nodes have three or more children.
The highest child degree of a PROPN node is 29.
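As a consistency check, the four child-count groups above should add up to the total number of PROPN tokens in UD_Czech (74824), and the percentages should follow from the counts. A small sketch, assuming the page rounds shares to the nearest whole percent:

```python
# Counts of PROPN nodes by number of children, copied from the figures above.
child_counts = {
    "leaves": 30012,
    "one child": 24315,
    "two children": 11487,
    "three or more children": 9010,
}

total = sum(child_counts.values())
# Share of each group, rounded to a whole percent (assumed rounding rule).
shares = {label: round(100 * count / total) for label, count in child_counts.items()}

print(total)
print(shares)
```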
Children of PROPN nodes are attached using 29 different relations: cs-dep/punct (17318; 21% instances), cs-dep/case (14526; 18% instances), cs-dep/flat (12112; 15% instances), cs-dep/nmod (11822; 14% instances), cs-dep/conj (7337; 9% instances), cs-dep/amod (4696; 6% instances), cs-dep/cc (3211; 4% instances), cs-dep/dep (2970; 4% instances), cs-dep/nummod (1492; 2% instances), cs-dep/flat:foreign (1401; 2% instances), cs-dep/acl (1387; 2% instances), cs-dep/appos (1200; 1% instances), cs-dep/advmod:emph (1084; 1% instances), cs-dep/orphan (455; 1% instances), cs-dep/xcomp (343; 0% instances), cs-dep/mark (298; 0% instances), cs-dep/det (97; 0% instances), cs-dep/advmod (74; 0% instances), cs-dep/parataxis (69; 0% instances), cs-dep/cop (61; 0% instances), cs-dep/obl (57; 0% instances), cs-dep/nsubj (56; 0% instances), cs-dep/nummod:gov (37; 0% instances), cs-dep/advcl (9; 0% instances), cs-dep/obj (9; 0% instances), cs-dep/aux (3; 0% instances), cs-dep/det:numgov (3; 0% instances), cs-dep/det:nummod (3; 0% instances), cs-dep/ccomp (1; 0% instances)
Children of PROPN nodes belong to 16 different parts of speech: PROPN (23293; 28% instances), PUNCT (17320; 21% instances), ADP (14628; 18% instances), NOUN (11603; 14% instances), ADJ (5983; 7% instances), CCONJ (3525; 4% instances), NUM (2347; 3% instances), VERB (1646; 2% instances), ADV (962; 1% instances), SCONJ (308; 0% instances), DET (219; 0% instances), PART (160; 0% instances), AUX (64; 0% instances), PRON (43; 0% instances), SYM (22; 0% instances), INTJ (8; 0% instances)
Treebank Statistics (UD_Czech-CAC)
There are 3405 PROPN lemmas (12%), 4315 PROPN types (7%) and 9676 PROPN tokens (2%).
Out of 16 observed tags, the rank of PROPN is: 4 in number of lemmas, 4 in number of types and 11 in number of tokens.
The 10 most frequent PROPN lemmas: Praha, KSČ, ROH, SSSR, ÚJČ, SSM, ČSAV, ČSSR, Československo, Škoda
The 10 most frequent PROPN types: KSČ, Praze, ROH, SSSR, ÚJČ, SSM, ČSAV, ČSSR, Praha, Škoda
The 10 most frequent ambiguous lemmas: hora (NOUN 24, PROPN 19), VB (PROPN 22, NOUN 4), Vyšehrad (PROPN 8, NOUN 1), KRB (PROPN 6, NOUN 1), Janský (PROPN 5, ADJ 3), most (NOUN 42, PROPN 1), KS (PROPN 3, NOUN 3), MP (PROPN 3, NOUN 1), SRPŠ (PROPN 3, NOUN 1), NVP (PROPN 2, NOUN 2)
The 10 most frequent ambiguous types: Praha (PROPN 100, NOUN 1), Škoda (PROPN 66, NOUN 4), Země (PROPN 29, NOUN 6), VB (PROPN 22, NOUN 4), Slunce (PROPN 13, NOUN 2), Svoboda (PROPN 10, NOUN 1), horách (PROPN 5, NOUN 2), Králík (PROPN 9, NOUN 3), Měsíce (PROPN 9, NOUN 4), Karpaty (PROPN 8, NOUN 1)
- Praha
- Škoda
- Země
- VB
- Slunce
- Svoboda
- PROPN 10: Oldřich Svoboda ze # třídy nasbíral # * a Václav Lhotský ze # # * bylin .
- NOUN 1: Poslední díla , která připravila Helena Lisická pro malé i velké čtenáře s názvem Z hradů , zámků a tvrzí a Z českých hradů , zámků a tvrzí , byla oceněna pražským nakladatelstvím Svoboda Výroční cenou za rok # za společensky angažovanou tvorbu .
- horách
- Králík
- Měsíce
- Karpaty
Morphology
The form / lemma ratio of PROPN is 1.267254 (the average of all parts of speech is 2.180683).
The 1st highest number of forms (6) was observed with the lemma “hora”: Hora, hor, horami, hory, horách, horám.
The 2nd highest number of forms (5) was observed with the lemma “Honza”: Honza, Honzou, Honzovi, Honzové, Honzy.
The 3rd highest number of forms (5) was observed with the lemma “Jan”: Jan, Jana, Janem, Janovi, Janu.
PROPN occurs with 9 features: cs-feat/NameType (9676; 100% instances), cs-feat/Polarity (9676; 100% instances), cs-feat/Gender (9665; 100% instances), cs-feat/Number (7746; 80% instances), cs-feat/Case (7693; 80% instances), cs-feat/Animacy (5351; 55% instances), cs-feat/Abbr (1858; 19% instances), cs-feat/Foreign (37; 0% instances), cs-feat/Style (7; 0% instances)
PROPN occurs with 39 feature-value pairs: Abbr=Yes, Animacy=Anim, Animacy=Inan, Case=Acc, Case=Dat, Case=Gen, Case=Ins, Case=Loc, Case=Nom, Case=Voc, Foreign=Yes, Gender=Fem, Gender=Masc, Gender=Neut, NameType=Com, NameType=Com,Geo, NameType=Com,Giv, NameType=Com,Pro, NameType=Com,Sur, NameType=Geo, NameType=Geo,Giv, NameType=Geo,Oth, NameType=Geo,Sur, NameType=Giv, NameType=Giv,Oth, NameType=Giv,Pro, NameType=Giv,Sur, NameType=Nat, NameType=Nat,Sur, NameType=Oth, NameType=Pro, NameType=Pro,Sur, NameType=Sur, Number=Plur, Number=Sing, Polarity=Pos, Style=Arch, Style=Coll, Style=Expr
PROPN occurs with 239 feature combinations.
The most frequent feature combination is Animacy=Anim|Case=Nom|Gender=Masc|NameType=Sur|Number=Sing|Polarity=Pos (1574 tokens).
Examples: Fučík, Erben, Horálek, Němec, Lenin, Záveský, Kraus, Huxley, Gottwald, Vávra
Relations
PROPN nodes are attached to their parents using 19 different relations: cs-dep/nmod (4198; 43% instances), cs-dep/nsubj (1503; 16% instances), cs-dep/conj (1329; 14% instances), cs-dep/flat (840; 9% instances), cs-dep/obl (757; 8% instances), cs-dep/obj (274; 3% instances), cs-dep/orphan (195; 2% instances), cs-dep/root (185; 2% instances), cs-dep/dep (162; 2% instances), cs-dep/appos (113; 1% instances), cs-dep/iobj (48; 0% instances), cs-dep/nsubj:pass (25; 0% instances), cs-dep/advcl (15; 0% instances), cs-dep/xcomp (12; 0% instances), cs-dep/flat:foreign (9; 0% instances), cs-dep/vocative (8; 0% instances), cs-dep/amod (1; 0% instances), cs-dep/cc (1; 0% instances), cs-dep/csubj (1; 0% instances)
Parents of PROPN nodes belong to 12 different parts of speech: NOUN (4160; 43% instances), PROPN (2604; 27% instances), VERB (2184; 23% instances), ADJ (395; 4% instances), ROOT (185; 2% instances), ADV (51; 1% instances), SYM (38; 0% instances), DET (19; 0% instances), PRON (19; 0% instances), NUM (17; 0% instances), ADP (2; 0% instances), CCONJ (2; 0% instances)
3713 (38%) PROPN nodes are leaves.
3645 (38%) PROPN nodes have one child.
1477 (15%) PROPN nodes have two children.
841 (9%) PROPN nodes have three or more children.
The highest child degree of a PROPN node is 95.
Children of PROPN nodes are attached using 24 different relations: cs-dep/case (2257; 23% instances), cs-dep/nmod (1915; 19% instances), cs-dep/punct (1407; 14% instances), cs-dep/conj (1391; 14% instances), cs-dep/flat (841; 8% instances), cs-dep/amod (638; 6% instances), cs-dep/cc (577; 6% instances), cs-dep/advmod:emph (184; 2% instances), cs-dep/appos (179; 2% instances), cs-dep/acl (149; 1% instances), cs-dep/orphan (149; 1% instances), cs-dep/xcomp (63; 1% instances), cs-dep/dep (59; 1% instances), cs-dep/mark (46; 0% instances), cs-dep/nummod (34; 0% instances), cs-dep/det (32; 0% instances), cs-dep/flat:foreign (29; 0% instances), cs-dep/advmod (8; 0% instances), cs-dep/cop (8; 0% instances), cs-dep/nsubj (7; 0% instances), cs-dep/obl (7; 0% instances), cs-dep/obj (6; 0% instances), cs-dep/parataxis (5; 0% instances), cs-dep/nummod:gov (1; 0% instances)
Children of PROPN nodes belong to 15 different parts of speech: PROPN (2604; 26% instances), ADP (2242; 22% instances), PUNCT (1408; 14% instances), NOUN (1348; 13% instances), ADJ (695; 7% instances), CCONJ (596; 6% instances), SYM (575; 6% instances), ADV (162; 2% instances), VERB (158; 2% instances), NUM (55; 1% instances), DET (50; 1% instances), SCONJ (46; 0% instances), PART (40; 0% instances), AUX (8; 0% instances), PRON (5; 0% instances)