Documentation - CHEU-lex

This section provides details on the pipeline used to process texts during corpus compilation.

Text retrieval, structural and contextual annotation, and corpus cleansing

Texts were retrieved from Fedlex in January 2020 and subsequently processed with a Perl script that embedded structural and contextual annotations. The output, including metadata and structural tags, was verified and corrected manually. When the original version of a text was available only in PDF format (rather than HTML), the text was retrieved and annotated manually.

Segmentation and alignment

Texts were segmented and aligned at the sentence level using the InterText editor (Vondřička 2014). For the purposes of this project, a segment is defined as a sequence of words starting with a capital letter and ending with a strong punctuation mark. Given the legal nature of the compiled texts, some exceptions were made for textual elements such as lists, titles, subtitles and preambles, as well as for headings and closings of Exchanges of Letters. Within these elements, the content is grouped in a single segment to preserve contextual meaning.
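
The segment definition above can be approximated with a simple rule. The following is a minimal Python sketch for illustration only, not the procedure used by the alignment tool: the function name `split_segments` is hypothetical, "strong punctuation" is assumed to mean `.`, `!` and `?`, and the exceptions for lists, titles and preambles are not handled.

```python
import re

# Assumption: "strong punctuation" is taken to mean ".", "!" and "?" only.
STRONG_PUNCT = ".!?"

# Candidate boundary: whitespace that directly follows a strong punctuation mark.
_BOUNDARY = re.compile(rf"(?<=[{re.escape(STRONG_PUNCT)}])\s+")


def split_segments(text: str) -> list[str]:
    """Split text into segments that start with a capital letter
    and end with a strong punctuation mark."""
    segments: list[str] = []
    current = ""
    for chunk in _BOUNDARY.split(text.strip()):
        if not chunk:
            continue
        if current and chunk[0].isupper():
            # The next chunk opens with a capital letter: close the segment.
            segments.append(current)
            current = chunk
        else:
            # Otherwise (e.g. after "cfr." or "art.") keep accumulating.
            current = f"{current} {chunk}".strip()
    if current:
        segments.append(current)
    return segments


sample = "La presente ordinanza si applica. Cfr. art. 3 della legge. Essa entra in vigore."
for segment in split_segments(sample):
    print(segment)
```

A rule of this kind still splits wrongly after an abbreviation that is followed by a capitalised word, which is one reason manual checking of segment boundaries remains necessary.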

POS tagging, lemmatisation and dependency parsing

POS tagging and lemmatisation of Italian and French texts were carried out with TreeTagger (with the corresponding parameter files; see Baroni et al. 2007 and Schmid 1994). For German texts, we used RFTagger (Schmid and Laws 2008). The raw output was manually revised and corrected. Tagsets were modified to ensure accurate annotation of textual elements typical of legal texts: tags for abbreviations, foreign words, list markers and abrogated elements were added.
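
When run with its `-token` and `-lemma` options, TreeTagger writes one token per line as tab-separated word form, tag and lemma. The sketch below shows one way such output could be loaded for revision. It is an illustration under assumptions, not part of the project's pipeline: the names `Token` and `parse_tagger_output` are hypothetical, the sample lines and tags are invented, and single-column lines starting with `<` are assumed to be markup passed through by the tagger.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    tag: str
    lemma: str


def parse_tagger_output(raw: str) -> list[Token]:
    """Parse tab-separated 'form<TAB>tag<TAB>lemma' lines into Token records."""
    tokens: list[Token] = []
    for line_no, line in enumerate(raw.splitlines(), start=1):
        if not line.strip():
            continue  # skip empty lines
        # Assumption: single-column lines starting with "<" are markup.
        if line.startswith("<") and "\t" not in line:
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 tab-separated fields, got {len(fields)}"
            )
        tokens.append(Token(*fields))
    return tokens


# Invented sample; the tags are illustrative, not the CHEU-lex tagset.
raw_output = "<s>\nLa\tDET:def\til\nordinanza\tNOM\tordinanza\n</s>\n"
for token in parse_tagger_output(raw_output):
    print(token.form, token.tag, token.lemma)
```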

The CHEU-lex tagsets are available below:

Dependency parsing was carried out using spaCy (Honnibal et al. 2020). The Italian and French tags are based on Universal Dependencies (UD Italian ISDT v2.5 and UD French Sequoia v2.5, respectively), whereas the German ones are based on the TIGER Corpus. The output of the spaCy parser did not undergo any manual revision. The dependency-parsed version of the corpus is available upon request.

Metadata (corpus filters)

Corpus queries can be filtered using the following contextual information, which is embedded in each text.

Table 1. CHEU-lex metadata

Filter	Description
Text type	Type of text (law or agreement)
Macro-topic	Topic section within the Systematic Compilation of Swiss Federal Legislation
Micro-topic	Subtopic section within the Systematic Compilation of Swiss Federal Legislation
Date: Entry date (precise date)	Date of entry into force of the law/agreement
Date: Entry date (Decade)	Decade of entry into force of the law/agreement
Date: Signature date	Date of signature of the law/agreement
Date: Status date	Date of entry into force of the last partial revision of the law/agreement
Is source text?	Original text (Y) vs. translation (N)
Text ID	SR/RS code of the law/agreement
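
To illustrate how these filters combine, here is a minimal Python sketch. It is not the corpus interface: the `TextMetadata` record, its field names, the `filter_texts` helper and the sample values are all hypothetical stand-ins for some of the metadata in Table 1.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class TextMetadata:
    """Hypothetical record mirroring some of the filters in Table 1."""
    text_id: str       # SR/RS code of the law/agreement
    text_type: str     # "law" or "agreement"
    entry_date: date   # date of entry into force
    is_source: bool    # True for an original text, False for a translation

    @property
    def entry_decade(self) -> int:
        """Decade of entry into force, e.g. 2008 -> 2000."""
        return self.entry_date.year // 10 * 10


def filter_texts(
    texts: Iterable[TextMetadata],
    *,
    text_type: Optional[str] = None,
    entry_decade: Optional[int] = None,
    is_source: Optional[bool] = None,
) -> list[TextMetadata]:
    """Keep the texts that match every filter that is not None."""
    selected = []
    for t in texts:
        if text_type is not None and t.text_type != text_type:
            continue
        if entry_decade is not None and t.entry_decade != entry_decade:
            continue
        if is_source is not None and t.is_source != is_source:
            continue
        selected.append(t)
    return selected


# Invented sample records; the IDs are placeholders, not real SR/RS codes.
SAMPLE = [
    TextMetadata("ID-A", "law", date(2008, 12, 12), True),
    TextMetadata("ID-B", "agreement", date(1999, 6, 1), False),
    TextMetadata("ID-C", "agreement", date(2004, 3, 1), True),
]

agreements_2000s = filter_texts(SAMPLE, text_type="agreement", entry_decade=2000)
print([t.text_id for t in agreements_2000s])  # ['ID-C']
```

Deriving the decade from the precise entry date, as in this sketch, keeps the two date filters consistent with each other.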

Structures

Corpus queries can be limited to specific sections of laws and agreements, based on structural annotation.

Table 2. CHEU-lex structural annotations

Section	Subsections (if any)	Description
Title	--	Title of law/agreement
Title info	--	General information on the law/agreement (e.g. relevant dates, other references)
Preamble	--	Introductory paragraph or section of the law/agreement, setting out its intention, scope, etc.
Body	--	Body of the law/agreement, encompassing its articles
	Article title	Title of the article (e.g. “Art. 1 Campo d’applicazione”)
	Article text	Text of the article (e.g. “1 La presente ordinanza si applica in quanto gli Accordi di associazione…”)
Annex	--	Section devoted to the annexes of a law/agreement
	Annex title	Title of the annex (e.g. “Allegati 1 e 2”)
	Annex text	Text of the annex (e.g. “Gli Accordi di associazione alla normativa di Dublino comprendono gli accordi seguenti:”)
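
As an illustration of restricting a search to one structural section, the sketch below extracts the text of a given section from an annotated document. The element names (`text`, `title`, `body`, `article_title`, `article_text`, `annex`, `annex_title`) and the sample document are hypothetical: the actual markup used in CHEU-lex is not specified here, so this shows only the general idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the real CHEU-lex element names may differ.
SAMPLE = """<text id="example">
  <title>Titolo di esempio</title>
  <body>
    <article_title>Art. 1 Campo d'applicazione</article_title>
    <article_text>1 La presente ordinanza si applica ...</article_text>
  </body>
  <annex>
    <annex_title>Allegati 1 e 2</annex_title>
  </annex>
</text>"""


def section_texts(xml_source: str, section: str) -> list[str]:
    """Return the whitespace-normalised text of every element named `section`."""
    root = ET.fromstring(xml_source)
    return [
        " ".join("".join(element.itertext()).split())
        for element in root.iter(section)
    ]


print(section_texts(SAMPLE, "article_title"))
print(section_texts(SAMPLE, "annex_title"))
```

Selecting a parent section such as `body` returns the text of its subsections as well, mirroring the Section/Subsection hierarchy in Table 2.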

Known issues

Some noise is still present in the corpus. Users may encounter the following issues when consulting it:

1. Incorrect formatting of elements within tables.
2. Occasional encoding problems with non-Latin characters, inherited from the original HTML files.
3. Discrepancies among parallel segments due to:

3.1. Differences in translation (e.g. lists ordered alphabetically in each language, or differing sentence order).
3.2. Minor segmentation errors.

References

Baroni M., Schmid H., Zanchetta E., Stein A. (2007). The Enriched TreeTagger System. Proceedings of the Evalita Workshop (10th Congress of Italian Association for Artificial Intelligence, AI*IA 2007). University of Roma "Tor Vergata", Rome, Italy.

Honnibal M., Montani I., Van Landeghem S., Boyd A. (2020). spaCy: Industrial-strength Natural Language Processing in Python. doi:10.5281/zenodo.1212303.

Schmid H. (1994). Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of the International Conference on New Methods in Language Processing. Manchester, UK.

Schmid H., Laws F. (2008). Estimation of Conditional Probabilities with Decision Trees and an Application to Fine-Grained POS Tagging. In COLING '08: Proceedings of the 22nd International Conference on Computational Linguistics (pp. 777-784). Volume 1. Manchester: Association for Computational Linguistics.

Vondřička P. (2014). Aligning parallel texts with InterText. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk and S. Piperidis (Eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14) (pp. 1875-1879). Reykjavik: European Language Resources Association (ELRA).