CIPM-POS is a set of historical, religious, notarial, and literary texts in prose and verse, written in medieval Portuguese. It contains around 88000 words.
The Computational Linguistics Toolset is a set of tools for computational linguistics. It contains re-usable code for cleaning, splitting, refining, and taking samples from corpora (ICE, Penn, and a native one), for tagging them using the TnT-tagger, for doing permutation statistics on N-grams (u...
This resource includes a spoken corpus of approximately 300,000 words, covering both formal (152,755 words) and informal (165,838 words) speech, with aligned sound and orthographic transcription and POS-tag information.
Corpus PEAPL2 - transc level 1 is a text corpus of elicited written texts produced by Portuguese SL learners between May 2009 and May 2010. It contains 119 381 tokens (text). Transcriptions of hand-written learner texts.
Corpus PEAPL2 - transc level 2 is a text corpus of elicited written texts produced by Portuguese SL learners between May 2009 and May 2010. It contains 30000 tokens (text). Transcriptions of hand-written learner texts.
CSTParser is a multi-document discourse parser. Based on machine learning techniques and hand-crafted rules, the system identifies a set of relations predicted by CST (Cross-document Structure Theory) among sentences of different texts on the same topic.
The Complex Word (CW) Corpus contains 731 sentences, each with one annotated CW. These simplifications were mined from Simple Wikipedia edit histories. Each entry gives an example of a sentence requiring simplification by means of a single lexical edit. This resource is primarily designed for t...
Technical Description: http://qtleap.eu/wp-content/uploads/2015/05/Pilot1_technical_description.pdf http://qtleap.eu/wp-content/uploads/2015/05/TechnicalDescriptionPilot2_D2.7.pdf http://qtleap.eu/wp-content/uploads/2016/11/TechnicalDescriptionPilot3_D2.10.pdf