Search and Browse – PORTULAN CLARIN

RudriCo-TOK is a tokenizer tool that splits contractions. De-contraction rules: 178.

Resource Type:	Tool / Service
Language:	Portuguese

RudriCo-POS

RudriCo-POS is a part-of-speech disambiguation tool that performs 188 morphological disambiguation rules.

Resource Type:	Tool / Service
Language:	Portuguese

Romanian Ombudsman archive (Processed)

This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Parallel aligned corpus in tmx format built from the Rom...

Resource Type:	Corpus
Media Type:	Text
Languages:	English
Languages:	Romanian

Romanian - English news corpus (Processed)

This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Bilingual Romanian – English news corpus built from Sout...

Resource Type:	Corpus
Media Type:	Text
Languages:	English
Languages:	Romanian

Romanian - English New Criminal Procedure Code (Processed)

This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. The New Civil Procedure Code in Romanian and English (bi...

Resource Type:	Corpus
Media Type:	Text
Languages:	English
Languages:	Romanian

Romanian - English literature corpus (Processed)

This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Bilingual Romanian – English literature corpus built fro...

Resource Type:	Corpus
Media Type:	Text
Languages:	English
Languages:	Romanian

Romanian-English corpus with studies, reports and statistical data in the field of culture from the National Institute for Cultural Research and Training website (Processed)

This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Romanian-English corpus with studies, reports and statis...

Resource Type:	Corpus
Media Type:	Text
Languages:	Moldavian; Moldovan
Languages:	Romanian

RedditPT Dataset

This dataset is a collection of dialogues extracted from the Portugal subreddit with RDET (Reddit Dataset Extraction Tool). It is composed of around 58,964,715 tokens in 218,550 dialogues.

Resource Type:	Corpus
Media Type:	Text
Language:	Portuguese

Reddit Dataset Extraction Tool

Reddit Dataset Extraction Tool (RDET) is a tool that takes advantage of the resources available at 'pushshift.io' that relate to Reddit comments and submissions and generates new datasets based on any given subreddit.

Resource Type:	Tool / Service

Radio Bulgaria WSD/NED corpus

Radio Bulgaria WSD/NED corpus is composed of texts from Bulgarian and English articles from the website of Radio Bulgaria.

Resource Type:	Corpus
Media Type:	Text
Languages:	Bulgarian
Languages:	English

Order by:

Filter by: