LexMan-ChunkerTokenizer is a tokenizer and sentence splitter tool. Marks sentence boundaries, multi-word boundaries. Size: Lemmas verbs: 12 995; Lemmas nouns and adj: 38 180; Lemmas adverbs: 7 250; Compound words: 35 201. Language: Portuguese.
RudriCo-TOK is a tokenizer tool that splits contractions. De-contraction rules: 178.
LX-Stopwords resource is a manual list of words from Portuguese composed by 2631 words of 51 types. The words are grouped in three big classes, arranged according to their morpho-syntactic category and inflectional feature value (closed classes, open classes, and multi-word units). This list was ...
CSTParser is a multi-document discourse parser. Based on machine learning techniques and hand-crafted rules, the system identifies a set of relations predicted by CST (Cross-document Structure Theory) among sentences of different texts on the same topic.
The lexicon of discourse markers for European Portuguese contains 252 pairs of discourse marker/rhetorical sense. The lexicon covers conjunctions, prepositions, adverbs, adverbial phrases and alternative lexicalizations with a connective function, as in the PDTB (Prasad et al., 2008; Prasad et al...
A collection of language resources for the evaluation of distributional semantic models of Portuguese: LX-SimLex-999: http://metashare.metanet4u.eu/go2/lx-simlex-999 LX-Rare Word Similarity Data set: http://metashare.metanet4u.eu/go2/lx-rare-word-similarity-dataset LX-WordSim-353: h...
Web service created by exporting UIMA-based workflow from the U-Compare text mining system. Functionality: Identifies sentences and tokens in plain text. Tools in workflow: Freeling sentence splitter web service (service provided by the PANACEA project), LX-Tokenizer (web service provided by th...
RudriCo-POS is a part-of-speech disambiguation tool that performs 188 morphological disambiguation rules.
MARv-DISAMB is a part-of-speech disambiguation tool (probabilistic disambiguation module).
The LX-SimLex-999 was created from SimLex-999 (Hill et al., 2015) which, in turn, was based in the University of South Florida Free Association Database (USF) (Nelson et al., 2014). There were strict guidelines to create SimLex-999. Both words in each pair have the same morphosyntactic category ...