LexMan-ChunkerTokenizer is a tokenizer and sentence splitter tool. Marks sentence boundaries, multi-word boundaries. Size: Lemmas verbs: 12 995; Lemmas nouns and adj: 38 180; Lemmas adverbs: 7 250; Compound words: 35 201. Language: Portuguese.
MARv-POS is a part-of-speech tagger tool (probabilistic POS annotation module). MARv4's architecture comprehends two submodules: a set of linguistically-oriented disambiguation rules module and a probabilistic disambiguation module. The linguistic-oriented is no longer used in the STRING chain be...
Grafone-Tool is a tool for conversion from grapheme to phoneme for European Portuguese. The converter works with the Portuguese spelling, both prior to and after the Orthographic Agreement of 1990.
RudriCo-POS is a part-of-speech disambiguation tool that performs 188 morphological disambiguation rules.
LX-UTagger is a POS tagger for Portuguese that adopts the Universal Part-of-Speech tagset (UPOS), related to the Universal Dependency framework, with an initial performance of 99.06% under a ten-fold cross validation scheme. It is described in this article: António Branco, João Ricardo Silv...
A NER-classifier based on memory-based learning, trained on the CINTIL dataset, a corpus that contains part of the Corpus de Referência do Português Contemporâneo - CRPC (Reference Corpus of Contemporary Portuguese). https://portulanclarin.net/repository/browse/cintil-corpus-internacional-do-por...
Lince is a multi-platform stand-alone application that updates the textual contents of documents in a range of popular formats to the spelling prescribed by the 1990 Portuguese language reform. It works with both previously existing Portuguese language orthographic standards (1943, previously val...
LX-UDParser is a UD parser for Portuguese, which adopts the Universal Dependency framework, with an initial performance of 90.87 for UAS and 88.01 for LAS under a ten-fold cross validation scheme. It is described in this article: António Branco, João Ricardo Silva, Luís Gomes and João Rodri...
CSTParser is a multi-document discourse parser. Based on machine learning techniques and hand-crafted rules, the system identifies a set of relations predicted by CST (Cross-document Structure Theory) among sentences of different texts on the same topic.
Based on the MXPOST part of speech tagger and UNITEX dictionaries for Portuguese, this tool produces the lemmas of the words of a text stored in a plain text file. The source code is also provided.