The merits of a parallel corpus and how to get the most out of it

Alexandr Rosen

InterCorp, a multilingual parallel component of the Czech National Corpus, has been online since 2008, growing steadily to its present size of 1.7 billion words in 40 languages. A substantial share of fiction is complemented by legal and journalistic texts, parliamentary proceedings, film subtitles and the Bible. The texts are sentence-aligned, tagged and lemmatized. After a brief presentation of the corpus design, content and access options, we will see how useful it can be in linguistic and literary studies, and for practical tasks in fields such as lexicography, teaching and translation. Finally, we will look at the issue of language-specific morphosyntactic annotation, which turns a multilingual corpus into a tagset Babylon, and present some solutions.
