CRPC-Named Entity Recognizer
CRPC-NER
Handle: https://hdl.handle.net/21.11129/0000-000E-5CB3-1 (persistent URL to this page)
An NER classifier based on memory-based learning, trained on the CINTIL dataset, a corpus that includes part of the Corpus de Referência do Português Contemporâneo (CRPC, the Reference Corpus of Contemporary Portuguese).
https://portulanclarin.net/repository/browse/cintil-corpus-internacional-do-portugues/fe32ebf2485511e2a2aa782bcb074135aa0fdcd287ac45e7b67de9c36d8d2890/
http://clul.ulisboa.pt/en/projeto/crpc-reference-corpus-contemporary-portuguese
Availability
The tool is freely available on the PORTULAN CLARIN infrastructure.
https://portulanclarin.net/repository/search/
Annotation
Categories
EVT - Event
LOC - Location
ORG - Organization
PER - Person
WRK - Work
MSC - Miscellaneous (remaining cases)
The tool applies a tag to each token:
/O indicates that the token is not (part of) a named entity
/B indicates that the token is the first unit of a named entity
/I indicates that the token is the middle or last unit of a named entity
For /B and /I, the entity category follows after a hyphen, e.g. /B-PER or /I-WRK.
The output has one sentence per line, with each token followed by a slash and its tag:
De_/O a/O parte/O de_/O a/O tarde/O ,*/O Maria/B-PER Cristina/B-PER Portugal/I-PER ,*/O advogada/O ,*/O moderou/O o/O painel/O \*"/O Restrições/B-WRK a_/I-WRK o/I-WRK Conteúdo/I-WRK de_/I-WRK a/I-WRK Publicidade/I-WRK "/O ,*/O em/O que/O se/O abordaram/O duas/O temáticas/O <utt>
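The output format described above can be read back with a few lines of Python. The following is a minimal sketch, not part of the CRPC-NER distribution: the function names `parse_tagged_line` and `extract_entities` and the sample sentence are hypothetical, and it assumes that the tag is whatever follows the last slash of each whitespace-separated item and that items without a slash (such as `<utt>`) carry no tag.

```python
def parse_tagged_line(line):
    """Split one output line into (token, tag) pairs.

    Assumption: the tag follows the LAST slash of each item, so tokens
    that themselves contain a slash are kept intact.
    """
    pairs = []
    for item in line.split():
        if "/" not in item:
            continue  # items such as <utt> carry no tag
        token, tag = item.rsplit("/", 1)
        pairs.append((token, tag))
    return pairs


def extract_entities(pairs):
    """Group B-/I- tagged tokens into (category, [tokens]) entities."""
    entities = []
    current = None  # entity currently being extended, if any
    for token, tag in pairs:
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            current = (tag[2:], [token])
            entities.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            # An I- tag extends the open entity of the same category.
            current[1].append(token)
        else:
            # An O tag (or a stray I- tag) closes any open entity.
            current = None
    return entities


# Hypothetical sample line, written in the same format as the tool output.
sample = "Maria/B-PER Silva/I-PER vive/O em/O Lisboa/B-LOC ,*/O <utt>"
print(extract_entities(parse_tagged_line(sample)))
# → [('PER', ['Maria', 'Silva']), ('LOC', ['Lisboa'])]
```

Because every B- tag opens a new entity in this sketch, two adjacent B-PER tokens are returned as two separate PER entities.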
Evaluation
The NER tool was evaluated by splitting the CINTIL corpus into 50k for training and for testing.
This gave the following accuracy, precision, and recall scores on the held-out test set:
processed 211479 tokens with 10631 phrases; found: 10628 phrases; correct: 10409.
accuracy: 99.72%; precision: 97.94%; recall: 97.91%; FB1: 97.93
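The precision, recall, and FB1 figures can be recomputed from the phrase counts reported above. The sketch below assumes the standard phrase-level definitions (precision = correct / found, recall = correct / gold phrases, FB1 = their harmonic mean); the function name `phrase_scores` is hypothetical. The accuracy figure is computed per token and cannot be derived from these three counts.

```python
def phrase_scores(correct, found, gold):
    """Return (precision, recall, f1) as fractions from phrase counts."""
    precision = correct / found if found else 0.0
    recall = correct / gold if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the evaluation output above:
# 10631 phrases in the test set, 10628 found, 10409 correct.
p, r, f = phrase_scores(correct=10409, found=10628, gold=10631)
print(f"precision: {p:.2%}; recall: {r:.2%}; FB1: {100 * f:.2f}")
# → precision: 97.94%; recall: 97.91%; FB1: 97.93
```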