CRPC-Named Entity Recognizer

CRPC-NER

A named entity recognition (NER) classifier based on memory-based learning and trained on the CINTIL dataset, a corpus that contains part of the Corpus de Referência do Português Contemporâneo - CRPC (Reference Corpus of Contemporary Portuguese).
https://portulanclarin.net/repository/browse/cintil-corpus-internacional-do-portugues/fe32ebf2485511e2a2aa782bcb074135aa0fdcd287ac45e7b67de9c36d8d2890/

http://clul.ulisboa.pt/en/projeto/crpc-reference-corpus-contemporary-portuguese

Availability
The tool is freely available on the PORTULAN CLARIN infrastructure.
https://portulanclarin.net/repository/search/

Annotation

Categories
EVT - Event
LOC - Location
ORG - Organization
PER - Person
WRK - Work
MSC - Miscellaneous (remaining cases)

The tool applies a tag to each token:
/O indicates that the token is not (part of) a named entity
/B indicates that the token is the first unit of a named entity
/I indicates that the token is the middle or last unit of a named entity
The /B and /I tags are followed by a hyphen and the category, for example /B-PER or /I-WRK.
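The tagging scheme above can be sketched in a few lines of Python. This is only an illustration and not part of the tool: the names CATEGORIES and split_tag are hypothetical, and the sketch assumes that every B or I tag carries one of the six category codes listed above.

```python
# Category codes as listed in the Categories section.
CATEGORIES = {
    "EVT": "Event",
    "LOC": "Location",
    "ORG": "Organization",
    "PER": "Person",
    "WRK": "Work",
    "MSC": "Miscellaneous",
}


def split_tag(tag):
    """Split a tag such as 'B-PER' into (position, category).

    'O' (outside any entity) has no category, so None is returned for it.
    Hypothetical helper; the tool itself does not ship this function.
    """
    if tag == "O":
        return ("O", None)
    position, _, category = tag.partition("-")
    if position not in ("B", "I") or category not in CATEGORIES:
        raise ValueError(f"unexpected tag: {tag!r}")
    return (position, category)


if __name__ == "__main__":
    for tag in ("O", "B-PER", "I-WRK"):
        print(tag, "->", split_tag(tag))
```

Rejecting unknown tags with an exception, rather than silently mapping them to O, makes a typo such as /0 (zero) instead of /O (letter) visible immediately.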

The output has one sentence per line, with each token followed by a slash and its tag:

De_/O a/O parte/O de_/O a/O tarde/O ,*/O Maria/B-PER Cristina/B-PER Portugal/I-PER ,*/O advogada/O ,*/O moderou/O o/O painel/O \*"/O Restrições/B-WRK a_/I-WRK o/I-WRK Conteúdo/I-WRK de_/I-WRK a/I-WRK Publicidade/I-WRK "/O ,*/O em/O que/O se/O abordaram/O duas/O temáticas/O <utt>
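A minimal sketch of how such an output line could be read back into entity spans. The function names parse_line and extract_entities are hypothetical and not part of the tool. The sketch assumes that the tag is whatever follows the last slash of each whitespace-separated item, and that items without a slash (such as the <utt> marker) can be skipped.

```python
def parse_line(line):
    """Split one output line into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        if "/" not in item:
            continue  # assumed to be a marker such as <utt>, not a token
        token, tag = item.rsplit("/", 1)  # tag follows the last slash
        pairs.append((token, tag))
    return pairs


def extract_entities(pairs):
    """Group B/I-tagged tokens into (category, text) entities."""
    entities = []
    tokens, category = [], None
    for token, tag in pairs:
        position, _, cat = tag.partition("-")
        if position == "B":
            # A B tag always starts a new entity.
            if tokens:
                entities.append((category, " ".join(tokens)))
            tokens, category = [token], cat
        elif position == "I" and tokens and cat == category:
            # An I tag continues the currently open entity.
            tokens.append(token)
        else:
            # O tag, or an I tag with no matching open entity: close any span.
            if tokens:
                entities.append((category, " ".join(tokens)))
            tokens, category = [], None
    if tokens:
        entities.append((category, " ".join(tokens)))
    return entities


if __name__ == "__main__":
    sample = ('o/O painel/O Restrições/B-WRK a_/I-WRK o/I-WRK Conteúdo/I-WRK '
              'de_/I-WRK a/I-WRK Publicidade/I-WRK "/O <utt>')
    print(extract_entities(parse_line(sample)))
```

The entity text keeps the tokenizer's own forms (for example the split contractions de_ a), since undoing the tokenization is a separate step.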

Evaluation

The NER tool was evaluated by splitting the CINTIL corpus into 50k for training and for testing.
This gave the following accuracy, precision, recall and F-score (FB1) on the held-out test set:

processed 211479 tokens with 10631 phrases; found: 10628 phrases; correct: 10409.
accuracy:  99.72%; precision:  97.94%; recall:  97.91%; FB1:  97.93
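The precision, recall and FB1 figures can be rechecked from the phrase counts in the report. The sketch below assumes the usual phrase-level definitions (precision = correct / found, recall = correct / phrases in the test set, FB1 = their harmonic mean); the function name prf is hypothetical. Accuracy is token-level and cannot be recomputed from these counts alone.

```python
def prf(correct, found, gold):
    """Return (precision, recall, F1) as fractions from phrase counts.

    correct: phrases found by the tool that match the reference
    found:   phrases proposed by the tool
    gold:    phrases in the reference annotation
    """
    precision = correct / found if found else 0.0
    recall = correct / gold if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Counts taken from the evaluation report above.
    p, r, f = prf(correct=10409, found=10628, gold=10631)
    print(f"precision: {100 * p:.2f}%; recall: {100 * r:.2f}%; FB1: {100 * f:.2f}")
    # prints: precision: 97.94%; recall: 97.91%; FB1: 97.93
```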
 
