Alignment of Parallel Texts from Cyrillic to Latin

The novel Sania (Eng. The Sledge), written by Ion Druță in 1955 and originally printed in Cyrillic script, served as the training corpus. We obtained an electronic version of the Cyrillic text using our previously developed recognition technology and specialized lexicons. We applied the same procedure to the Latin-script variant of the same text, which had been transliterated manually by expert linguists. This allowed us to align the Cyrillic variant automatically with the contemporary Latin variant at the word/expression level. The overall process was semi-automated: it relied on heuristics for the transcription of letters, followed by validation by expert linguists. The corpus is annotated at the sentence and word levels and provides morpho-lexical information obtained with the UAIC Romanian Part of Speech Tagger (Simionescu, 2011).
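To illustrate how letter-transcription heuristics could drive word-level alignment, the following minimal Python sketch transliterates each Cyrillic token and compares it with the Latin token at the same position. It is not our actual implementation: the letter table `CYR2LAT` is a simplified, context-free assumption, the functions `transliterate` and `align` and the `threshold` value are hypothetical, and the sketch assumes both texts are already tokenized into the same number of words. Pairs that fall below the threshold are the ones that would be passed to the expert linguists for validation.

```python
from difflib import SequenceMatcher

# Simplified, context-free letter table (an assumption for illustration only;
# real heuristics must handle context-dependent letters and digraphs).
CYR2LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "j", "з": "z", "и": "i", "й": "i", "к": "c", "л": "l",
    "м": "m", "н": "n", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "у": "u", "ф": "f", "х": "h", "ц": "ț", "ш": "ș",
    "э": "ă", "ы": "î", "ь": "i", "ю": "iu",
}


def transliterate(word):
    """Map each Cyrillic letter to a Latin candidate; unknown characters pass through."""
    return "".join(CYR2LAT.get(ch, ch) for ch in word.lower())


def align(cyr_words, lat_words, threshold=0.6):
    """Pair tokens position by position and mark each pair as accepted or not.

    A pair is accepted when the transliterated Cyrillic token is similar
    enough to the Latin token; rejected pairs are candidates for manual review.
    """
    pairs = []
    for cyr, lat in zip(cyr_words, lat_words):
        score = SequenceMatcher(None, transliterate(cyr), lat.lower()).ratio()
        pairs.append((cyr, lat, score >= threshold))
    return pairs


if __name__ == "__main__":
    for cyr, lat, accepted in align(["масэ", "карте"], ["masă", "carte"]):
        print(cyr, lat, "accepted" if accepted else "needs review")
```

Position-by-position pairing is the simplest possible choice; alignment at the expression level, where one token may correspond to several, would need a sequence alignment over the token lists instead.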


