Welcome to University of Turku cell line name recognition and normalization
The project is a result of collaboration of University of Turku, Gent University and Textimi. The aim is to create corpora that is suitable for training and evaluating machine learning systems to recognize and normalize established cell line names from text. The full detail is discussed in this article.
Gellus and CLL Corpora
We created two manually annotated corpora, Gellus and CLL. Gellus is suitable for the training of any machine learning systems in recognizing cell line name mentions while CLL is for evaluating the systems in recognizing the Cellosaurus cell line names.
Cell line names dictionary
We prepared a dictionary of cell line names derived from the Cellosaurus resource (version 6.5) separating information into synonyms, resources and categories. We also gathered human cancer cell line mutation information from CCLE and Cosmic.
NERsuite model and corpora
We used NERsuite as our machine learning system trained on Gellus corpus and supplemented with cell line name dictionary to achieve state-of-the-art performance. In addition to NERsuite model, we provide the derived corpora used in our studies for those who want to train their own systems.
- NERsuite model
- dictionary: To use this dictionary with NERsuite, you can skip the compiling step (command: nersuite_dic_compiler) and use it directly with command nersuite_dic_tagger.
- JNPBA-CL
- CellFinder-CL
Tagged and normalized cell line name documents (updated 20 September 2018)
Originally we provided text documents and cell line names annotation in standoff format in roughly 24M documents, including both PubMed abstracts and PMC-Open Access (PMCOA) full texts.
Since then, there are newly published articles available from those two literature database, currently >27M PubMed abstracts and >2M PMCOA full texts. We instead provide the following link to our weekly-update large-scale named entity recognition project, where the cell line annotations and text documents are in TEES-xml format.
- Syntactic parses and named entity recognition for PubMed abstracts and PubMed Central full documents
Authors and Contributors
- Suwisa Kaewphan, University of Turku
- Sofie Van Landeghem, Gent University
- Tomoko Ohta, Textimi
- Yves Van de Peer, Gent University
- Filip Ginter, University of Turku
- Sampo Pyysalo, University of Turku
Support or Contact
Please contact sukaew@utu.fi or sampo.pyysalo@gmail.com for further information or questions.