Corpora & tools - LETRINT project

The LETRINT corpora: infographic

The LETRINT corpora are four trilingual textual datasets: one comparable corpus and three parallel corpora. Their scope and features are determined by the goals of the project. They comprise documents published in English, French and Spanish in 2005, 2010 and 2015 by the four main European Union institutions (the Commission, the Council, the Parliament and the Court of Justice), the United Nations and its International Court of Justice, and the World Trade Organization.

To ensure representativeness and balance, each corpus in the building sequence from LETRINT 0 to LETRINT 1+ (i.e., all corpora derived from LINST, the first all-inclusive compilation) is composed of texts selected from the previous one according to quantitative and qualitative criteria. As a result, each corpus is a smaller subset of its predecessor, while the level of text processing and the amount of available metadata increase at each step. These corpora have provided empirical data for several analyses of discourse features, translation patterns and quality indicators, including adequacy and consistency (for further details, see project outputs per workstream).

The following infographic presents the composition and methodological details of each corpus.

"The LETRINT corpora" infographic

LETRINT-Q: the LETRINT corpus query tool

LETRINT-Q is a corpus query interface that enables users to explore the LETRINT 1 and LETRINT 1+ corpora (for further details, see Prieto Ramos, Cerutti & Guzmán 2019) through monolingual and parallel queries in English, French and Spanish. It was developed for the project on the basis of the corpus-querying application ParaVoz. Users can perform “basic” queries (i.e., by token, lexeme or grammatical tag) or use the CQP query language, and can restrict queries by the following parameters: organization, main legal function and functional sub-category of the text, year, textual genre, and document code (assigned during compilation). The platform renders results in several formats (e.g., lists or charts) and allows data to be downloaded as XLSX or TSV files.
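For readers unfamiliar with CQP, the following Python sketch shows how queries by token, lemma or grammatical tag are typically written in that language. It is an illustration only: the attribute names (word, lemma, pos), the example values and the helper functions cqp_token and cqp_sequence are assumptions, and the attribute names and tagset actually exposed by LETRINT-Q may differ.

```python
def cqp_token(attribute: str, value: str, ignore_case: bool = False) -> str:
    """Build one CQP token expression, e.g. [lemma="court"]."""
    if '"' in value:
        # Quote escaping is deliberately left out of this sketch.
        raise ValueError("value must not contain double quotes")
    flag = " %c" if ignore_case else ""  # %c asks CQP to ignore case
    return f'[{attribute}="{value}"{flag}]'


def cqp_sequence(*tokens: str) -> str:
    """Join token expressions into a query for adjacent tokens."""
    return " ".join(tokens)


# Attribute names and values below are hypothetical examples.
by_token = cqp_token("word", "Court", ignore_case=True)
by_lemma = cqp_token("lemma", "adopt")
by_tag = cqp_token("pos", "N.*")  # value is a regular expression
phrase = cqp_sequence(cqp_token("lemma", "legal"), cqp_token("lemma", "basis"))

for query in (by_token, by_lemma, by_tag, phrase):
    print(query)
```

The strings printed by this sketch could be pasted into a CQP query field, provided the attribute names are adapted to those used by the platform.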

By default, LETRINT-Q shows results in a contextual view. Additionally, users can query the corpora through the following functionalities:

Frequency: shows the raw frequency of a query item together with its frequency per million words.
Collocations: renders the item’s collocations using various association scores (llr, i.e., log-likelihood ratio; mi, i.e., mutual information; t-score; z-score; dice; and mi3, i.e., cubed mutual information), as well as the raw frequency of each collocate.
N-gram: offers a list of n-grams with their corresponding frequencies.
Distribution: helps create tables and charts of the distribution of the queried item according to multiple variables, such as the organization, the legal function, the year of publication or the textual genre. Users can visualize the results as raw or relative frequencies.
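
To make the figures reported by these functionalities easier to interpret, the sketch below computes a frequency per million words, several of the collocation scores listed above, and n-gram counts. It relies on common textbook definitions and is not a description of LETRINT-Q's internal calculations, which are not documented here. The function names are hypothetical, the expected co-occurrence frequency ignores the size of the collocation window, and llr is omitted for brevity.

```python
import math
from collections import Counter


def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalize a raw frequency to occurrences per million tokens."""
    return raw_count * 1_000_000 / corpus_size


def association_scores(observed: int, f_node: int, f_collocate: int,
                       corpus_size: int) -> dict:
    """Common association measures for a node/collocate pair.

    observed:    co-occurrence frequency of node and collocate (must be > 0)
    f_node:      corpus frequency of the node
    f_collocate: corpus frequency of the collocate
    Assumption: expected frequency is f_node * f_collocate / corpus_size,
    i.e. no correction for the collocation window size.
    """
    if observed <= 0:
        raise ValueError("observed co-occurrence frequency must be positive")
    expected = f_node * f_collocate / corpus_size
    return {
        "mi": math.log2(observed / expected),
        "mi3": math.log2(observed ** 3 / expected),
        "t-score": (observed - expected) / math.sqrt(observed),
        "z-score": (observed - expected) / math.sqrt(expected),
        "dice": 2 * observed / (f_node + f_collocate),
    }


def ngram_counts(tokens: list, n: int) -> Counter:
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Invented figures, for illustration only.
print(per_million(250, 5_000_000))
print(association_scores(16, 1000, 1000, 1_000_000))
print(ngram_counts(["the", "court", "of", "the", "court"], 2).most_common(1))
```

Relative frequencies such as the per-million figure make results comparable across subcorpora of different sizes, which is why the Distribution view offers them alongside raw counts.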

LETRINT-Q is an open-access resource. To use the platform, please fill in the form below and we will send you your access credentials.

Request for access to LETRINT-Q

If you already have your credentials, click here to access LETRINT-Q:

LETRINT-Q