Search and Browse – PORTULAN CLARIN

F_Mona_1/ Spoken Newspaper

108 WAV files of spoken Maltese newspaper texts, subdivided into 12 directories with a variable number of sentences (sometimes: clauses) each. They come together with transcriptions and tables of phoneme durations.

Resource Type:	Corpus
Media Type:	Audio
Language:	Maltese

HIMERA Corpus

The HIMERA annotated corpus contains a set of published historical medical documents that have been manually annotated with semantic information that is relevant to the study of medical history and public health. Specifically, annotations correspond to seven different entity types and two differe...

Resource Type:	Corpus
Media Type:	Text
Language:	English

MalToBi/SPAN Corpus

Audio corpus: 8 subfolders with .wav files Each containing : • 2 sound files containing a read story (“The sun and the wind”, each by speaker A and speaker B) • 2 sound files containing each 30 read sentences (each by speaker A and speaker B) • 2 x each of the 30 sentences as a single sound f...

Resource Type:	Corpus
Media Type:	Audio
Language:	Maltese

GENIA POS & Term Corpus

A corpus of 2,000 MEDLINE abstracts, collected using the three MeSH terms human, blood cells and transcription factors. The corpus is available in three formats: 1) A text file containing part-of-speech (POS) annotation, based on the Penn Treebank format, 2) An XML file containing inline POS anno...

Resource Type:	Corpus
Media Type:	Text
Language:	English

SpeakerID

SpeakerID is a corpus of 100 spoken sentences and pseudosentences in European Portuguese (PT) and Mandarin Chinese (CH) designed to enable research on speaker identity. The utterances were recorded by five male speakers of European Portuguese (Speakers A-E) and five male speakers of Mandarin Chi...

Resource Type:	Corpus
Media Types:	Text
Media Types:	Audio
Languages:	Chinese
Languages:	Portuguese

CIPM-POS

CIPM-POS is a set of historical, religious, notarial, literary texts in prose and verse, written is medieval portuguese. It contains around 88000 words.

Resource Type:	Corpus
Media Type:	Text
Language:	Portuguese

HESITA database

The HESITA database is a corpus consisting of television daily news collected over a month and was annotated regarding to hesitation events, acoustical environments, speaking styles, speaker characteristics and respiratory events, among other characteristic sounds.

Resource Type:	Corpus
Media Types:	Text
Media Types:	Audio
Language:	Portuguese

Arquivo Dialetal CLUP - Áudio

Arquivo Dialetal CLUP - Áudio is an audio corpus of spontaneous speech, mainly from Northern Portugal.

Resource Type:	Corpus
Media Type:	Audio
Language:	Portuguese

Multilingual corpus from the Publications Office of the EU on the medical domain v.2

277780 sentence pairs (in 23 EN-X language pairs in total) extracted from the Publications Office of the EU on the medical domain. These are sourced from laws, studies, EC announcements, etc. labelled with concepts like epidemiology, epidemic, disease surveillance, health control, public hygiene,...

Resource Type:	Corpus
Media Type:	Text
Languages:	Bulgarian
	Croatian
	Czech
	Danish
	Dutch; Flemish
	English
	Estonian
	Finnish
	French
	German
	Greek, Modern (1453-)
	Hungarian
	Irish
	Italian
	Latvian
	Lithuanian
	Maltese
	Polish
	Portuguese
	Romanian
	Slovak
	Slovenian
	Spanish; Castilian
	Swedish

Romanian - English news corpus (Processed)

This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Bilingual Romanian – English news corpus built from Sout...

Resource Type:	Corpus
Media Type:	Text
Languages:	English
Languages:	Romanian

Order by:

Filter by: