Parallel corpora

Parallel corpora is a set of parallel texts in the domain of Law and Health, with 1 G per language. Languages: cs-pt, de-pt, en-pt, es-pt, fr-pt, it-pt, and pt-sk.

Resource Type:Corpus
Media Type:Text
Languages:Arabic
Chinese
Czech
English
French
German
Portuguese
Spanish; Castilian
YamCha: Yet Another Multipurpose CHunk Annotator

YamCha is a generic, customizable, and open source text chunker oriented toward a lot of NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and Text Chunking. We used it for NP chunking.

Resource Type:Tool / Service
Biographies of Portuguese People

This is a set of 11.361 biographies of Portuguese people. The compilation of the data involved the biography collection from wikipedia and data conversion. Several filters were applied to remove entries that were mostly empty or non applicable content. Format: JSON (conversion from HTML) ...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
Brands.Br – a Portuguese Reviews Corpus

The Brands.Br corpus was built from a fraction of B2W-Reviews01 corpus. We use a set of 252 samples selected by B2W to be enriched. In Brands.Br corpus we want to solve two main challenges in product reviews corpus. The first: it is very common to find customer reviews referring to distinct thing...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
ExtraGLUE

ExtraGLUE is a Portuguese dataset obtained by the automatic translation of some of the tasks in the GLUE and SuperGLUE benchmarks. Two variants of Portuguese are considered, namely European Portuguese and American Portuguese. The 14 tasks in extraGLUE cover different aspects of language unders...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
UIMA/U-Compare OpenNLP Sentence Detector

This is a UIMA wrapper for the OpenNLP Sentence Detector tool. It splits English text into individual sentences. The tool forms part of the in-built library of components provided with the U-Compare platform (see separate META-SHARE record) for building and evaluating text mining workflows. ...

Resource Type:Tool / Service
Language:English
Czech to English Machine translation module

Technical Description: http://qtleap.eu/wp-content/uploads/2015/05/Pilot1_technical_description.pdf http://qtleap.eu/wp-content/uploads/2015/05/TechnicalDescriptionPilot2_D2.7.pdf http://qtleap.eu/wp-content/uploads/2016/11/TechnicalDescriptionPilot3_D2.10.pdf

Resource Type:Tool / Service
Languages:Czech
English
Basque Postedition corpus

Corpus of raw and manual post-edited translations (50.204 words). It was created by manual post-editing of the Basque outputs given by Matxin RBMT system translating 100 entries from the Spanish Wikipedia.

Resource Type:Corpus
Media Type:Text
Language:Basque
English to Czech Machine translation module

Technical Description: http://qtleap.eu/wp-content/uploads/2015/05/Pilot1_technical_description.pdf http://qtleap.eu/wp-content/uploads/2015/05/TechnicalDescriptionPilot2_D2.7.pdf http://qtleap.eu/wp-content/uploads/2016/11/TechnicalDescriptionPilot3_D2.10.pdf

Resource Type:Tool / Service
Languages:Czech
English
UIMA/U-Compare GENIA Tagger

The GENIA tagger analyzes English sentences and outputs the base forms, part-of-speech tags, chunk tags, and named entity tags. The tagger is specifically tuned for biomedical text such as MEDLINE abstracts. The tool is provided as a UIMA component, which forms part of the in-built library of...

Resource Type:Tool / Service
Language:English

Order by:

Filter by:

Text (446)
Audio (18)
Image (1)