AI-nonyme: AUTOMATED ANONYMIZATION TOOL FOR COURT DECISIONS
The application for anonymizing court decisions, currently under patent, was invented by Tana Corp. to meet the specific need for automated anonymization of unstructured information.
Total project duration: 4 months
The total cost of the project was 300 man-days.
Details of the project:
To automate the application of the CNIL’s deliberation of 29 November 2001 “on the recommendation on the dissemination of personal data on the Internet by case law databases”, the Council of State wished to acquire an anonymization software tool. Manual anonymization is complex and error-prone because of the many anonymization rules recommended for the personal information “of parties to the proceedings and witnesses”. It is also very time-consuming, since it covers an existing stock of 3 million documents, to which 240,000 documents are added each year.
The uniqueness of the application is due to several factors:
- Innovative and unique solution: no competitor in the field of automated anonymization of unstructured information without syntax analysis.
- The application does not require a precise document structure to anonymize a document; it handles documents written in free form.
- Application entirely dedicated to the anonymization of court decisions: members of the court panel and lawyers are not anonymized.
- Full compliance with the rules of the CNIL (Commission Nationale de l’Informatique et des Libertés) applicable to the publication of official administrative documents.
- The solution can be integrated into an existing workflow without any specific modification.
- Scalability: thanks to its modular, multi-agent architecture, the application can easily be deployed on a multi-server farm to absorb an increase in processing load.
- Robustness: system guaranteeing the automatic resumption of processing in the event of an unexpected technical or application incident.
- Processing speed: between 10,000 and 200,000 documents processed per hour (depending on the file format) on a single server.
- Performance: more than 95% of documents correctly anonymized without human intervention.
- Autonomy: fully automated system that can operate without human intervention indefinitely.
- Centralized administration: easy administration of anonymization rule settings.
The project has several phases:
1/ A development phase for the anonymization tool, which consists of two application modules:
- The anonymization engine
- The curation interface
The anonymization engine takes the form of a batch process that is integrated into an existing EAI-type chain and runs nightly. The tool receives as input a list of documents in several formats (.doc, .txt, etc.) and provides as output a list of anonymized documents in the same or different formats (.doc, .txt, .xml, etc.). New input and output formats can easily be added. To identify surnames, first names and addresses, the engine does not rely on any structured, database-type information, only on an exhaustive syntactic analysis of each decision.
The curation interface allows different users to control the anonymization process by scheduling and tracking the execution of batches of documents to be anonymized. It also allows users to consult the anonymization results, compare a document in its initial and anonymized forms, correct the anonymization of a document, and relaunch the anonymization of a document or a batch of documents. Access to the curation interface is restricted to authorized users, and advanced profile management defines user access to features and to documents from different jurisdictions. In terms of administration, in addition to the management of users and profiles, the interface allows the anonymization engine to evolve by enriching the dictionaries it uses.
The tool meets the following objectives:
- Anonymize the input documents consistently (remove individuals' personal information while preserving the meaning of the document)
- Automatically anonymize more than 90% of the documents to be processed
- Allow the supervision of the anonymization operation from a dedicated interface
- Identify the documents that could not be anonymized, or about which a doubt remains, and allow their manual anonymization from a dedicated interface.
- Compare documents in their initial and anonymized versions
- Process approximately 600 documents in less than 3 hours
- Offer near 100% uptime, including maintenance
2/ A deployment, production start-up and training phase
3/ A TMA (third-party application maintenance) phase, which includes:
- Corrective and adaptive maintenance
- Implementation of changes relative to the initial scope of the project
Seven people were assigned to the project:
- 1 project manager
- 1 project manager (technical and functional lead)
- 1 technical architect
- 3 development engineers
- 1 graphic designer
This contract was carried out entirely with Tana Corp. resources.
Main characteristics of the client
The French Conseil d’État is a long-established public institution, created by Napoleon Bonaparte in Year VIII of the Republican calendar (Consulate, 1799). It has been based at the Palais-Royal in Paris since 1875.
The Conseil d’État has two historic missions: it advises the Government on the preparation of bills, decrees, etc., and it is the supreme administrative court, ruling on disputes relating to the acts of the administration. The Conseil d’État is also responsible for managing the entire administrative court system.
Approximately 380 people, both civil servants and contract employees, support the smooth running of the Conseil d’État and the rest of the administrative court system.
Conduct of the project
Genuine support capability: the project organization gave primary importance to the quality of client support, so that the client benefited from the expertise and know-how Tana Corp has built up on similar projects.
Consulting strength: our solid experience with statistical applications and fixed-price projects enabled us to bring the client the best market practices in this field. In addition, Tana Corp proactively proposed functional options likely to add significant functional value.
Methodology proposed to validate the stages of the project from a customer’s point of view
Proposed project reporting
A regular steering committee and a weekly project review allowed us to steer this long-term project in close collaboration with the client.
Tools chosen for project management
Project management: MS Project
Incident tracking: BugX (based on Mantis, http://www.mantisbt.org/)
Version and configuration tracking: Subversion (http://subversion.tigris.org/)
Business processes covered by the solution
- Automatic Anonymization of Documents
- Authentication and access rights: management of the three profiles (Administrator, Supervisor and Corrector) and of all the associated rights
- Management and follow-up of document batches: creation and, if needed, manual launch of a batch
- Viewing the list of results with the anonymization status (correctly anonymized, doubtful, in error), with built-in filtering and sorting, and with the option to launch the Correction module.
- Reading and acting on a decision: the two documents can be read comfortably side by side and the correction can be edited directly. From this screen the user can also propose keywords to enrich the application dictionaries; these requests must be accepted by the System Administrator before they take effect.
- Configuration and administration of anonymization rules and thresholds, and consultation of anonymization statistics
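The dictionary enrichment process listed above (a user proposes keywords, which only take effect once the System Administrator accepts them) can be illustrated with a minimal Python sketch. This is not the product's code, which according to this document is written in C#. The class name, method names and data layout are hypothetical, and the sketch assumes that the "System Administrator" corresponds to the Administrator profile.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordDictionary:
    """Hypothetical dictionary whose additions require administrator approval."""

    active: set = field(default_factory=set)     # keywords the engine may use
    pending: dict = field(default_factory=dict)  # proposed keyword -> proposing user

    def propose(self, keyword: str, user: str) -> None:
        """Record a proposal; it has no effect on the active dictionary yet."""
        if keyword not in self.active:
            self.pending[keyword] = user

    def accept(self, keyword: str, profile: str) -> None:
        """Activate a pending keyword (assumption: Administrator profile only)."""
        if profile != "Administrator":
            raise PermissionError("only an Administrator can accept proposals")
        if keyword not in self.pending:
            raise KeyError(keyword)
        del self.pending[keyword]
        self.active.add(keyword)


dictionary = KeywordDictionary()
dictionary.propose("Dupont", user="corrector-1")
print("Dupont" in dictionary.active)   # False: still awaiting acceptance
dictionary.accept("Dupont", profile="Administrator")
print("Dupont" in dictionary.active)   # True
```

Keeping proposals in a separate pending area means that a mistaken suggestion cannot change the behavior of the nightly batch before it has been reviewed.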
What were the most difficult features to implement?
The anonymization engine has to meet very strict qualitative and quantitative criteria. It must process more than 600 documents every 3 hours with an automatic anonymization rate of more than 90%, while applying a large number of anonymization rules.
The solution also includes an approximate search for surnames using Wagner-Fischer type algorithms, in order to catch possible typing errors. A doubt is flagged when similar words are found, and the similarity threshold can be configured by the system administrator. The number of occurrences of each word is also taken into account in order to grade the level of doubt.
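As an illustration of the approximate search described above, here is a minimal Python sketch of the Wagner-Fischer dynamic program for edit distance, followed by a simple doubt-flagging step. It is not the product's implementation. The function names, the `detected_surnames` input, the default threshold of 1 and the way occurrence counts are reported are all assumptions made for the example.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the Wagner-Fischer dynamic program."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def doubtful_words(words, detected_surnames, max_distance=1):
    """Flag words that are close to, but not equal to, a detected surname.

    Returns {word: (closest surname, distance, occurrences)}. The threshold
    max_distance stands in for the administrator-configurable setting.
    """
    counts = Counter(word.lower() for word in words)
    surnames = {name.lower() for name in detected_surnames}
    flagged = {}
    for word, occurrences in counts.items():
        if word in surnames:
            continue  # exact match: no doubt to raise
        best = min(((edit_distance(word, s), s) for s in surnames), default=None)
        if best is not None and best[0] <= max_distance:
            flagged[word] = (best[1], best[0], occurrences)
    return flagged


print(edit_distance("dupont", "dupond"))  # 1
print(doubtful_words(["Dupond", "Dupont", "tribunal", "Dupond"], ["Dupont"]))
# {'dupond': ('dupont', 1, 2)}
```

The occurrence count is returned alongside the distance so that a caller could weight the doubt by how often the word appears; how the real system combines the two is not stated in this document.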
Tana Corp successfully met this challenge, with results far exceeding the specifications: a success rate of more than 95% and a throughput of 30,000 documents every 3 hours (50 times the required 600 documents).
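The figures quoted in this document can be checked with a few lines of Python. All inputs come from the text above; the last two results are simple derivations that assume the reported rate is sustained, and they are not measurements.

```python
WINDOW_HOURS = 3
required_docs = 600        # specification: about 600 documents per 3-hour window
achieved_docs = 30_000     # reported result for the same window
stock_docs = 3_000_000     # existing stock of decisions
yearly_docs = 240_000      # documents added each year

speedup = achieved_docs / required_docs           # 50.0
achieved_per_hour = achieved_docs / WINDOW_HOURS  # 10000.0 documents per hour
backlog_hours = stock_docs / achieved_per_hour    # 300.0 hours for the whole stock
yearly_hours = yearly_docs / achieved_per_hour    # 24.0 hours for one year's intake

print(speedup, achieved_per_hour, backlog_hours, yearly_hours)
```

The resulting 10,000 documents per hour matches the lower bound of the processing speed listed among the application's features.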
Technical platform elements
Software package, software or development languages used
After analyzing all the constraints, Tana Corp built an information system based entirely on a Microsoft stack:
- SQL Server 2008 R2
- Microsoft .NET Framework, ASP.NET MVC 4, C#
Reasons for the choice
- SQL Server 2008 R2 Standard: reliable and powerful database server,
- ASP.NET MVC 4: web application framework allowing the development of web interfaces with strong ergonomics,
- Microsoft .NET Framework with C#: development platform offering an excellent balance of cost, delivery time and quality.
Share of pre-existing developments used on which the Supplier has capitalized to complete the project
This project was carried out entirely from scratch.
Challenges and key success factors
- Ergonomics: daily use of the application in a production context,
- Diversity of the user population: more than 500 users from more than 100 organizations throughout France.
- Duration of the implementation phase: 4 months
- Duration of the production start-up phase: 2 months
- Duration of the third-party application maintenance phase: 1 year
The implementation and production start-up phases of the project cost 300 man-days.
Main customer benefits
- Return on investment in 6 months, achieved by replacing manual anonymization with automated anonymization,
- Excellent performance, both in processing speed and accuracy,
- Verification and correction of anonymization using a curation interface,
- Ergonomics widely praised by users,
- Ease of administration.