Release checklist
This checklist is meant to provide guidance for teams contributing treebank data for a new release of Universal Dependencies. It was created for release v1.2 and applies, unless otherwise noted, to any upcoming release.
Executive summary
- Make sure your repository has the right files and correct metadata in the README, and that data being prepared for the next release lives on the dev branch.
- Make sure your data and repository pass the format validation: direct link to the format validator output.
- Make sure your data does not show major deviations in the content validation: direct link to the content validator output.
Repository and files
Every language has its own GitHub repository called UD_Language, where Language is the name of the language. For example, the repository for Finnish is called UD_Finnish. Make sure to create the repository for your language if it does not already exist. Some languages have more than one treebank, and the additional treebanks have their own repositories with a -Treebank identifier after the language name. For example, UD_Finnish-FTB is the repository for the FinnTreeBank, while the plain UD_Finnish holds the Turku Dependency Treebank.
Every language repository should contain the following five files (where xx is the ISO code for the given language; if this is not the first treebank for the language, use xx_y instead, where y is the lowercased treebank identifier):
- xx-ud-train.conllu
- xx-ud-dev.conllu
- xx-ud-test.conllu (temporary rule for UD 2.0: do not publish the test set! Validate it offline and then send it by e-mail to ud.conll.shared.task.2017@gmail.com.)
- README.txt or README.md
- LICENSE.txt
The first three files contain the treebank data split into training, development and test sets. These should be in CoNLL-U format and conform to the universal guidelines. They need to be validated as described below.
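The naming rules above can be sketched in a few lines of Python. This is only an illustration: the helper names repository_name and data_file_names are hypothetical and are not part of the UD tools repository.

```python
from typing import List, Optional


def repository_name(language: str, treebank: Optional[str] = None) -> str:
    """Build the repository name, e.g. UD_Finnish or UD_Finnish-FTB."""
    name = "UD_" + language
    if treebank:
        # Additional treebanks carry a -Treebank identifier after the language.
        name += "-" + treebank
    return name


def data_file_names(iso_code: str, treebank: Optional[str] = None) -> List[str]:
    """Build the three data file names for a treebank.

    The prefix is xx for the first treebank of a language and xx_y
    (y = lowercased treebank identifier) for additional treebanks.
    """
    prefix = iso_code
    if treebank:
        prefix += "_" + treebank.lower()
    return [f"{prefix}-ud-{part}.conllu" for part in ("train", "dev", "test")]
```

For example, data_file_names("fi", "FTB") yields fi_ftb-ud-train.conllu, fi_ftb-ud-dev.conllu and fi_ftb-ud-test.conllu.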
If the treebank consists of more than 20,000 words, make the test set and dev set at least 10,000 words each, even if it leaves you with training data smaller than development data (that is necessary for the CoNLL 2017 shared task). There is no upper limit on the size of dev/test. If you cannot reach 10,000 words of test data, use a more typical split, e.g. 80-10-10% (but the treebank will not be included in the shared task).
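As a rough illustration of the sizing rule above, the following sketch computes target sizes in words. The function name plan_split is hypothetical, and the sketch makes two assumptions: it assigns exactly the 10,000-word minimum to dev and test for large treebanks (larger dev/test sets are also allowed), and it applies the 80-10-10% split otherwise. A real split must also respect sentence and document boundaries, so actual sizes will differ slightly.

```python
from typing import Tuple

MIN_EVAL_WORDS = 10_000        # minimum size of dev and of test for large treebanks
LARGE_TREEBANK_WORDS = 20_000  # threshold above which the minimum applies


def plan_split(total_words: int) -> Tuple[int, int, int]:
    """Return target (train, dev, test) sizes in words."""
    if total_words > LARGE_TREEBANK_WORDS:
        # Dev and test get 10,000 words each, even if training ends up smaller.
        dev = test = MIN_EVAL_WORDS
    else:
        # Fallback: a more typical 80-10-10% split.
        dev = test = total_words // 10
    train = total_words - dev - test
    return train, dev, test
```

With 25,000 words, this gives 5,000 words of training data and 10,000 words each for dev and test, which shows how the training set can end up smaller than the development set.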
The training-development-test data split should be stable across releases. A sentence that was once part of the training data should never appear in the test data, and vice versa (except for sentences that are naturally occurring duplicates in independent texts). We want to prevent misleading results of experiments where people take a parser trained on UD 1.1 and apply it to test data from UD 1.2. We decided to make an exception to this rule for UD 2.0 where it is needed to achieve 10K test or dev, on the grounds that v2 annotation is not backward-compatible anyway.
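One way to check split stability is to compare the word forms of sentences in the previously released training data with those in the new test data. The sketch below is a minimal illustration, not an official tool: the function names are hypothetical, and it assumes that two sentences are "the same" when their sequences of FORM values are identical.

```python
from typing import List, Set, Tuple

Sentence = Tuple[str, ...]


def read_sentences(conllu_text: str) -> List[Sentence]:
    """Extract each sentence as a tuple of word forms from CoNLL-U text."""
    sentences: List[Sentence] = []
    forms: List[str] = []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if forms:
                sentences.append(tuple(forms))
                forms = []
            continue
        if line.startswith("#"):
            continue  # comment line
        columns = line.split("\t")
        # Keep only regular words; skip multiword token ranges (1-2)
        # and empty nodes (1.1), whose IDs are not plain integers.
        if len(columns) >= 2 and columns[0].isdigit():
            forms.append(columns[1])
    if forms:
        sentences.append(tuple(forms))
    return sentences


def leaked_sentences(old_train: str, new_test: str) -> Set[Sentence]:
    """Return sentences present in both the old training and the new test data."""
    return set(read_sentences(old_train)) & set(read_sentences(new_test))
```

Any sentence reported here still needs manual review, because naturally occurring duplicates in independent texts are permitted.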
The README.txt file contains basic documentation of the treebank and machine-readable metadata for the UD main page (see below), and the LICENSE.txt file specifies under what license the treebank is made available.
Repositories of released treebanks also contain a stats.xml file, which is generated as part of the release-building process, using the script conllu-stats.pl available from the tools repository. Data providers do not have to care about this file.
The README file
The README file is distributed together with the data and summarizes information about the treebank for its users; the contents of the file are also displayed by GitHub when readers land on the GitHub homepage of the treebank repository. At the same time, certain pre-defined parts of the README file are automatically copied to the UD website, to the places where individual treebanks are described. In these cases, the contents are interpreted as Markdown and you can use the Markdown syntax to add a little formatting (but please remember that some users will read the README file directly, so it should stay reasonably human-readable). The last part of the README file contains machine-readable metadata (described below), where selected vital information must be provided in a fixed, pre-defined way.
Markdown source files usually have the .md extension (README.md), but for historical reasons it is also possible to name the file README.txt.
The README file should minimally contain the following information:
- A description of the treebank and its origin (creation method, data sources, etc.)
- A description of how the data was split into training, development and test sets
- If there are multiple genres/domains, can they be told apart by sentence ids? Does the treebank consist of complete documents, or just randomly shuffled sentences?
- Acknowledgments and references that should be cited when using the treebank
- A changelog section for treebanks that will be released for the second (or subsequent) time
- A machine-readable section with treebank metadata. This is described below.
Markdown uses the # character to mark section headings. Several sections with fixed names are expected in every README and will be searched for by various scripts. Use the following template (from Swedish) to adjust your README.
The first section, called Summary, should be rather short (one or two lines), so that it can appear in an index page listing all treebanks. An automatically generated treebank page in the UD documentation will take over the sections Summary, Introduction and Acknowledgments.
# Summary

UD Swedish-TP is a conversion of the Prose section of Talbanken, originally annotated in the MAMBA annotation scheme, and consisting of a variety of informative text genres, including textbooks, information brochures and newspaper articles.

# Introduction

UD Swedish-TP is a conversion of the Prose section of Talbanken (Einarsson, 1976), originally annotated…

# Acknowledgments

The new conversion has been performed by Joakim Nivre and Aaron Smith at Uppsala University. We thank everyone who…

# (possibly any number of extra sections)

…

# Changelog

* 2015-05-15 v1.1
  * Added lemmas
  * Corrected tokenization in sentences 123 and 456
* 2015-01-15 v1.0
  * First release in UD

=== Machine-readable metadata (DO NOT REMOVE!) ================================
(described in more detail below)
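Before pushing, it can help to check that the fixed section names are present in your README. The sketch below is only an illustration and does not reproduce the scripts that actually process the README: the function names are hypothetical, and it assumes that the three sections taken over by the documentation page (Summary, Introduction, Acknowledgments) are the ones to look for, each written as a top-level # heading.

```python
import re
from typing import List

# Sections that the generated treebank page takes over from the README.
REQUIRED_SECTIONS = ("Summary", "Introduction", "Acknowledgments")


def section_headings(readme_text: str) -> List[str]:
    """Return the titles of all top-level '# ' headings in the README."""
    pattern = re.compile(r"^#(?!#)[ \t]*(.+?)[ \t]*$", flags=re.MULTILINE)
    return [match.group(1) for match in pattern.finditer(readme_text)]


def missing_sections(readme_text: str) -> List[str]:
    """Return the required section names that do not occur as headings."""
    present = set(section_headings(readme_text))
    return [name for name in REQUIRED_SECTIONS if name not in present]
```

An empty result from missing_sections means that all three sections were found.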
Treebank metadata
The table on the front page is automatically generated from special lines (metadata) in the README.txt or README.md file for every treebank. The metadata are used for various other automated tasks as well; for example, the list of contributors to every UD release is collected from the READMEs.
The metadata describe individual treebanks, and there are often multiple treebanks per language. If we want to work on UD documentation for a new language without having actual data, we must still create a GitHub repository for the future treebank and fill in the metadata so that the language appears on the front page. The names of the contributors to the documentation should be listed among the treebank contributors; otherwise they will not be included in the overall UD list of contributors.
Here is an example of the treebank metadata block from the Czech README file:
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v1.0
License: CC BY-NC-SA 3.0
Includes text: yes
Genre: news
Lemmas: converted from manual
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: converted from manual
Contributors: Zeman, Daniel; Hajič, Jan
Contributing: elsewhere
Contact: zeman@ufal.mff.cuni.cz
===============================================================================
This block should be the last thing in the README file. The properties are as follows:
- Data available since: can be UD v1.0, UD v1.1, UD v1.2, UD v1.3, UD v1.4, UD v2.0, UD v2.1 etc. Pick the number of the first release where this treebank appears. Do not change it when the treebank is released the next time.
- License: anything containing the string BY-NC-SA will be given the CC non-commercial logo, BY-SA or BY the CC logo, and GNU the GNU logo. To add any other license, please provide a suitable icon to ginter@cs.utu.fi and zeman@ufal.mff.cuni.cz.
- Includes text: Most treebanks should say yes here. But there are a few instances where the license of the underlying text does not allow redistribution. Here, the UD repository contains only the annotation without words and lemmas, but with a merging script that the user can run and merge the annotation with the corpus that they obtained through another channel. Such treebanks should say no here.
- Genre: this is simply a space-separated list of genres which gets mapped into symbols in the table. The possible genres are listed in this file in the repository. If you don’t see yours, just edit the file on GitHub and add your genre, choosing one of the symbols from the FontAwesome list. Please make sure you get the syntax right, since this is a machine-readable JSON file. It is also possible to not add the genre to the genre_symbols.json file, in which case the default symbol will be used automatically. The genre name will still remain visible in the mouse-over tooltip.
- Source of annotation of lemmas, POS tags, morphological features and dependency relations. There are several possible values:
  - manual native … means that the annotation was done manually, directly in the UD annotation scheme. Note that manual verification of automatic annotation (e.g. you pre-parse the text before you give it to humans) counts as manual annotation.
  - converted from manual … means that it was originally annotated in a non-UD scheme, then converted to UD by a program, but the converted annotation has not been verified by a human annotator.
  - converted with corrections … significant spot-checking and manual corrections occurred after the conversion; however, it does not qualify as full manual annotation because not all words were visited systematically. This is an intermediate level between “converted from manual” and “manual native”.
  - automatic … means that the annotation was predicted by a program such as a tagger or parser.
  - automatic with corrections … significant spot-checking and manual corrections occurred after the automatic prediction; however, it does not qualify as full manual annotation because not all words were visited systematically. This is an intermediate level between “automatic” and “manual native”.
  - not available … means that this type of annotation is not present.
- Note that some values are available only for some types of annotation. UPOS tags and relations must always be available and cannot be automatic.
  - Lemmas … manual native | converted from manual | converted with corrections | automatic | automatic with corrections | not available
  - UPOS … manual native | converted from manual | converted with corrections
  - XPOS … manual native | automatic | automatic with corrections | not available
  - Features … manual native | converted from manual | converted with corrections | automatic | automatic with corrections | not available
  - Relations … manual native | converted from manual | converted with corrections
- Contributors: the list of contributors to be included with the data release and in the LINDAT download page. This is a semicolon-separated list where every name is in the Last, First form, and the README file should be UTF-8 encoded to make sure special characters are preserved correctly.
- Contributing:
  - here … The changes are done directly in the dev branch of the UD repository. Bugs can be fixed via pull requests.
  - here source … The changes are done in the UD repository but not directly in the final CoNLL-U files. Instead, there is a folder called not-to-release where source files have to be located and fixed.
  - elsewhere … Do not submit pull requests; create issues. Main development happens somewhere else. If there is a bug, either the original data or the conversion procedure must be fixed.
  - to be adopted … The treebank currently lacks a maintainer. If you know the language, please consider adopting the treebank.
- Contact: please add an e-mail address where the current maintainer of the data can be contacted. You can also include several e-mail addresses separated by commas.
If you want to see what web content will be generated from your README file, run the generate_treebank_hub.pl script from the tools repository on your treebank folder, e.g.
tools/generate_treebank_hub.pl UD_Czech > for_web.md
Repository branches
While the official UD release is always through Lindat, many users of UD source their data from the GitHub language repositories. Therefore, the master branch of every language should contain the latest officially released version of the data for the given language. The development in between releases should happen on the dev branch of the repository.
Although it is currently not locked, treebank maintainers should never touch the master branch; they should always push to dev. At release time, the release task force will take care of merging the contents of the dev branch into master.
Please do not submit pull requests from the dev branch (or from anywhere else) to the master branch. This is not needed for the release merge to take place, and if someone overlooks the destination branch and accepts the pull request, it will again result in a commit to the master branch at the wrong time.
(To make things a bit more confusing, this policy of data repositories does not apply to some other repositories that we use. In the docs repository you must work with the pages-source branch. That is done automatically if you edit the documentation in your browser via the edit page link. You will also need to access the tools repository and upload the deprel and feat_val files specific for your treebank. In this case, please use the master branch.)
If you have no previous experience with Git, here is a quick tutorial on how to deal with the branches. Please refer to the online documentation of Git and GitHub for more details. The tutorial assumes that you are communicating with GitHub from a Linux shell. The interface may be different if your OS is Windows. If you are working only with the GitHub web interface, you are not dependent on your operating system, but you must remember to switch the Branch: master drop-down menu (left-hand side of the page) to Branch: dev; it always starts in master by default. In contrast, when you want to clone the repository to your local system, you need the address that is hidden under Clone or download on the right-hand side of the page, and that address is common to all branches. Our example is the Italian repository. Here is how you clone the repo to your system (git clone is the command, the remainder is the address copied from the GitHub web page):
git clone git@github.com:UniversalDependencies/UD_Italian.git
Cloning into 'UD_Italian'...
remote: Counting objects: 215, done.
remote: Total 215 (delta 0), reused 0 (delta 0), pack-reused 215
Receiving objects: 100% (215/215), 6.98 MiB | 4.55 MiB/s, done.
Resolving deltas: 100% (134/134), done.
Checking connectivity... done.
Then enter the cloned folder and switch to (“checkout”) the dev branch. Your copy of the repository knows that such a branch exists on the server but it only creates your local copy of that branch once you ask for it. You may subsequently want to call git pull to make sure that you have the latest contents of the dev branch from the server:
$ cd UD_Italian
$ git checkout dev
Branch dev set up to track remote branch dev from origin.
Switched to a new branch 'dev'
$ git branch
* dev
  master
$ git pull
Already up-to-date.
Once you do this, you are all set. Your copy will stay switched to the dev branch unless you call git checkout master (or other git checkout) again. You will probably mostly need just the git status, git diff, git add, git commit, git push and git pull commands. All pushes and pulls will be done against the remote dev branch.
Validation
Data format and repository
Up-to-date automatic validation runs of the repositories are available here. These are based on the dev branch of the data and use the validate.py script described below.
The final data validation is an important step and each file released in the project is expected to validate as conforming to the basic requirements on the data and format. For this purpose, there is a validation script in the tools repository.
$ git clone git@github.com:UniversalDependencies/tools.git
$ cd tools
$ python validate.py -h
In general, you validate the data like so:
python validate.py --lang=xx [file.conllu]
for example for Finnish:
$ python validate.py --lang=fi ../UD_Finnish/fi-ud-dev.conllu
*** PASSED ***
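If you validate all three data files regularly, a small wrapper can save typing. The sketch below is an illustration under stated assumptions: the function names are hypothetical, it uses only the --lang option shown above, and it decides success by looking for the *** PASSED *** line in the combined output rather than relying on the exit code of validate.py, which this checklist does not describe.

```python
import subprocess
from pathlib import Path
from typing import Dict, List


def validation_commands(treebank_dir: str, lang: str) -> List[List[str]]:
    """Build one validate.py command per data file of the treebank."""
    commands = []
    for part in ("train", "dev", "test"):
        data_file = Path(treebank_dir) / f"{lang}-ud-{part}.conllu"
        commands.append(["python", "validate.py", f"--lang={lang}", str(data_file)])
    return commands


def run_validation(tools_dir: str, treebank_dir: str, lang: str) -> Dict[str, bool]:
    """Run the commands inside the tools folder; map each file to pass/fail."""
    results = {}
    for command in validation_commands(treebank_dir, lang):
        completed = subprocess.run(
            command,
            cwd=tools_dir,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # the report may go to either stream
            text=True,
        )
        results[command[-1]] = "*** PASSED ***" in completed.stdout
    return results
```

For example, run_validation("tools", "../UD_Finnish", "fi") would validate the three Finnish files, with the treebank path given relative to the tools folder.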
Among other items, the script also validates the language-specific set of tags and relations, and therefore it needs to know about these. The language-specific lists are stored in data/deprel.xx (language-specific relations) and data/feat_val.xx (language-specific features). In addition, data/*.ud stores the UD tag lists. Before you can validate data for a given language, you need to produce and commit the necessary tag lists. You can make the initial lists like so:
$ python conllu-stats.py --deprels=langspec path_to_your_data/*.conllu > data/deprel.xx
$ python conllu-stats.py --catvals=langspec path_to_your_data/*.conllu > data/feat_val.xx
This will gather the language-specific lists in descending order by their frequency. It is important to check the resulting files for correctness, because otherwise the validation would of course be a no-op. Once you have checked the lists manually, you can add them to the repository:
$ git add data/deprel.xx data/feat_val.xx
$ git commit -m "Adding language-specific data for xx."
$ git push
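When checking the generated lists by hand, it can be useful to see where each entry comes from. The following sketch is not a reimplementation of conllu-stats.py; it only illustrates the idea of a frequency-ordered list. It assumes that a language-specific relation is one that carries a subtype after a colon (for example nmod:poss), and the function name is hypothetical.

```python
from collections import Counter
from typing import List, Tuple

DEPREL_COLUMN = 7  # zero-based index of DEPREL in a 10-column CoNLL-U line


def language_specific_relations(conllu_text: str) -> List[Tuple[str, int]]:
    """Count relations with a subtype and return them by descending frequency."""
    counts: Counter = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # sentence boundary or comment
        columns = line.split("\t")
        if len(columns) != 10:
            continue  # not a well-formed token line
        deprel = columns[DEPREL_COLUMN]
        if ":" in deprel:
            counts[deprel] += 1
    return counts.most_common()
```

A relation that occurs only once or twice in such a list is a good candidate for a closer look, since it may be a typo rather than an intended subtype.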
Since the v2.0 release, whitespace is allowed in the FORM and LEMMA fields under conditions specified here. This is supported in the validator through the UD-wide file data/tokens_w_space.ud and its language-specific variants data/tokens_w_space.xx. In these files, each line is a Python regular expression defining the permissible forms and lemmas that can contain whitespace.
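The following sketch shows how such a list of regular expressions could be applied to a form or lemma. It is an illustration, not the validator's actual code: the function names are hypothetical, the pattern used in the usage note is made up, and the sketch assumes that an expression must match the whole value.

```python
import re
from typing import Iterable, List, Pattern


def load_space_patterns(lines: Iterable[str]) -> List[Pattern[str]]:
    """Compile one regular expression per non-empty line."""
    patterns = []
    for line in lines:
        expression = line.rstrip("\n")
        if expression:
            patterns.append(re.compile(expression))
    return patterns


def space_allowed(value: str, patterns: List[Pattern[str]]) -> bool:
    """Accept values without whitespace, or those fully matched by a pattern."""
    if not any(char.isspace() for char in value):
        return True
    # Assumption: the expression has to cover the entire form or lemma.
    return any(pattern.fullmatch(value) for pattern in patterns)
```

With a hypothetical pattern such as [0-9]+( [0-9]+)* for numbers written with spaces, the value 10 000 would be accepted, while New York would be rejected unless another line permits it.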
Syntax
For the v1.3 release, we have created a number of additional tests which try to uncover possible logical inconsistencies in the treebank data. Automatic validation runs for this syntax validation are available here. Unlike the data format and repository validation, this validation machinery is not streamlined enough to be distributed for offline use; therefore it is important to regularly push your data to the dev branch of the repository.
The tests are specified in the file gen_index/stests.yaml and rely on the query language of the SETS search interface.
Language-specific guidelines
Every treebank should be accompanied by a set of language-specific guidelines at http://universaldependencies.org/. These guidelines should minimally specify the following:
- Tokenization: How was word segmentation performed? Does the treebank include multiword tokens?
- POS tags: What universal POS tags (if any) are not used?
- Features: What universal features are not used? What language-specific features/values have been added?
- Relations: What universal relations are not used? What language-specific subtypes have been added?
There are more detailed guidelines for language-specific documentation. Also see the general guidelines about how to contribute (which covers the conventions used in writing UD documentation, such as how to format examples).
Building the release
Documentation of the steps to be taken by the release task force is on a separate page.