ACOPOST - A Collection of POS Taggers

ACOPOST is a free and open source collection of four part-of-speech taggers (t3, met, tbt, and et). In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up the words in a text (corpus) as co...

Resource Type:Tool / Service
askIT Dataset

Collection of dialogues extracted from subreddits related to Information Technology (IT) and extracted with RDET (Reddit Dataset Extraction Tool). It is composed of around 61,842,638 tokens in 179,358 dialogues.

Resource Type:Corpus
Media Type:Text
Language:English
A Terminological Inventory for Biodiversity

In order to construct the inventory, we firstly compiled a species name dictionary by combining all of the names available in Catalogue of Life (CoL), Encyclopedia of Life (EoL) and Global Biodiversity Information Facility (GBIF). The terms contained in this dictionary were then located within ...

Resource Type:Lexical / Conceptual
Media Type:Text
Language:English
Basic English-Maltese Dictionary

Bilingual wordlist, consisting of alphabetically ordered English lemmas with their Maltese translation and Maltese pronunciation (transcribed in ad-hoc system by the original author).

Resource Type:Lexical / Conceptual
Media Type:Text
Languages:English
Maltese
Basque-English ParDeepBank

This resource is part of Deliverable 4.6 of the QTLeap FP7 project (Contract number 610516). In its current development (15% of the intended goal of the project), it is composed of 150 sentences (1,416 English tokens and 1,275 Basque tokens). The sentences are excerpts from journalistic text from...

Resource Type:Corpus
Media Type:Text
Languages:Basque
English
Basque to English Machine translation module

Technical Description: http://qtleap.eu/wp-content/uploads/2015/05/Pilot1_technical_description.pdf http://qtleap.eu/wp-content/uploads/2015/05/TechnicalDescriptionPilot2_D2.7.pdf http://qtleap.eu/wp-content/uploads/2016/11/TechnicalDescriptionPilot3_D2.10.pdf

Resource Type:Tool / Service
Languages:Basque
English
BioLexicon

The BioLexicon is a large-scale, wide-coverage computational lexicon covering the biomedical domain. A large part of the lexicon is concerned with covering biomedical terms and their variants. Entries for domain-specific verbs include syntactic and semantic information. The lexicon includes entri...

Resource Type:Corpus
Media Type:Text
Language:English
Bulgarian-English Wikipedia WSD/NED corpus

Bulgarian-English Wikipedia WSD/NED corpus is composed of articles from the Bulgarian version of Wikipedia and their English counterparts.

Resource Type:Corpus
Media Type:Text
Languages:Bulgarian
English
CINTIL-Corpus Internacional do Português

CINTIL-Corpus Internacional do Português is a linguistically interpreted corpus of Portuguese. At present it is composed of 1 Million annotated tokens, verified by human expert annotators. The annotation comprises information on part-of-speech, open classes lemma and inflection, multi-word expres...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
CINTIL-DeepBank 1.3

CINTIL-DeepBank (Branco et al., 2010) is a corpus of Portuguese texts annotated with deep grammatical information. This document refers to version 1.3 of the corpus, delivered in September of 2015, which adds over 2,000 annotated sentences to the previous version from March 2015. The current vers...

Resource Type:Corpus
Media Type:Text
Language:Portuguese

Order by:

Filter by:

Text/xml (16)
Wav (3)
Text (2)
Xml (2)
20 (1)
Sgml (1)