Automatic Anonymization of Italian Legal Textual Documents using Deep Learning
Licari D; Romano MF; Comandé G
2022-01-01
Abstract
The dissemination of judicial decisions not only provides a valuable source of decision support for judges and legal practitioners but also strengthens public confidence in the judicial system. However, the nature of the data raises privacy concerns, as the documents include personal and often sensitive data, such as information on health, finances, religious beliefs, and sexual orientation. In recent years, especially since the introduction of the GDPR, the international scientific community has paid considerable attention to privacy and automatic anonymization tools, but nothing has been done in the Italian legal context. In this paper, we present a first solution for the automatic anonymization of documents in the Italian National Jurisprudential Archive (Archivio Giurisprudenziale Nazionale) domain, based on pre-trained Transformer embeddings (Clark et al., 2020; Devlin et al., 2019) and spaCy’s transition-based parsing for entity recognition (Honnibal and Montani, 2017). It achieves more than 94.7% recall (>99% for Person and ID entities) and supports several anonymization methods that can be applied to the text depending on the purpose of anonymization.
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.