Privacy Protection Algorithms
The objective of this work is to understand automated text anonymization systems that protect the personal information of users while keeping the anonymized text relevant in syntactic and semantic terms, without losing its conveyed meaning. Anonymized data can be used in many tasks, such as data mining, machine learning, and analysis, without revealing the identity of the entities involved in creating the data, and can still be useful for improving the accuracy of the applied data analysis tool.
Beyond simply recognizing entities, variation in the types of documents (e.g., financial or medical) and in the types of identifying information poses a challenge for automated systems.
One common approach is to develop domain-specific anonymization tools, where knowledge about the structure and information content of the documents can be used to construct high-quality anonymization models; the other is to build general models.
Because different people have different writing styles, domain-specific models depend on some predefined keywords and tokens.
Also, we should anonymize the data only to the extent that it does not lose its syntactic and semantic meaning. Only those parts of the data that are personal to someone, and that could help trace the entity if revealed, should be converted.
There are hundreds of entity types, including names (of a person, organization, place, etc.), numbers (phone number, street address, price tag), and events (party, meeting, time-based entities, etc.).
Taking named entities as an example, the problem is not to find every named entity in the text, but to find those that are particular to the data and do not refer to something outside it.
HIPAA lists 18 identifiers that should be removed before private data is exposed to the public.
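Some of the HIPAA identifiers, such as telephone numbers, email addresses, and Social Security numbers, have a regular surface form and can be caught with simple patterns. The following is a minimal sketch under that assumption; the regular expressions, the US-style formats they expect, and the `redact` helper are illustrative choices, not part of HIPAA or of the approach described here.

```python
import re

# Illustrative patterns for three identifier types. They assume US-style
# formats and will miss many real-world variants.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text):
    """Replace every match of each pattern with the pattern's label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(label, text)
    return text


print(redact("Call 352-555-0100 or mail jane@example.org (id 123-45-6789)."))
```

Identifiers without a fixed format, such as names, cannot be handled this way and need the entity-based stages described next.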
This approach will use a hybrid of the techniques described above, consisting of four stages:
- Preprocessing: Feature extraction techniques such as normalization, tokenization, POS tagging, and NER are applied in this step, and the results are stored for use in the later stages. Many open-source tools are available for this.
- Pattern detection: This stage uses the output of the previous stage to detect the entities to be replaced. It detects patterns in them, which are then used to find tokens that were not detected in the earlier stage. It also generalizes the entities to be replaced, so that the meaning of the text remains interpretable.
- Coreference resolution: In simple terms, this groups together all mentions of the same entity and replaces them with a single token. It covers the same entity being mentioned in different ways, such as with a title added or removed, as an acronym, or through possessive terms, prepositional expressions, etc.
- Anonymization: This is the main step, which combines the results of the previous steps to produce a relevant anonymized document, along with the replacements. We will use a combination of entity substitution, random substitution, and generalization. Entity substitution replaces an entity with its token. Random substitution replaces the tokens of the same entity with a similar non-existent entity, while generalization replaces the entity with the general entity or superclass it belongs to.
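The three substitution strategies named in the last stage can be sketched as follows. This is a rough illustration, not the actual implementation: the `ENTITIES` table stands in for the output of the earlier stages and is written by hand, the fake names are invented for the example, and all function names are hypothetical.

```python
import random
import re

# Hypothetical output of the earlier stages: surface form -> (entity id,
# NER label, superclass). Mentions that share an id are coreferent.
ENTITIES = {
    "University of Florida": ("e1", "PLACE", "INSTITUTE"),
    "UF": ("e1", "PLACE", "INSTITUTE"),
    "Mumbai": ("e2", "PLACE", "CITY"),
}

# Made-up replacement names, for illustration only. A real system would
# have to check that they do not collide with existing entities.
FAKE_NAMES = {
    "INSTITUTE": ["Quillhaven Institute", "Brambleford College"],
    "CITY": ["Zorvale", "Quenmouth"],
}


def replace_mentions(text, mapping):
    """Replace every key of `mapping` found in `text` in a single pass."""
    keys = sorted(mapping, key=len, reverse=True)  # prefer longer mentions
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keys)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(0)], text)


def entity_substitution(text):
    """Replace each mention with its NER label."""
    return replace_mentions(text, {s: e[1] for s, e in ENTITIES.items()})


def generalization(text):
    """Replace each mention with the superclass it belongs to."""
    return replace_mentions(text, {s: e[2] for s, e in ENTITIES.items()})


def random_substitution(text, seed=None):
    """Replace each entity with a fake name, the same one for all its mentions."""
    rng = random.Random(seed)
    pools = {sup: rng.sample(names, len(names)) for sup, names in FAKE_NAMES.items()}
    chosen = {}  # entity id -> fake name
    mapping = {}
    for surface, (eid, _label, sup) in ENTITIES.items():
        if eid not in chosen:
            chosen[eid] = pools[sup].pop()  # raises IndexError if the pool runs out
        mapping[surface] = chosen[eid]
    return replace_mentions(text, mapping)


text = "UF is short for University of Florida, and Mumbai has floods."
print(entity_substitution(text))
print(generalization(text))
print(random_substitution(text, seed=0))
```

Keeping the entity id separate from the surface form is what lets random substitution give "UF" and "University of Florida" the same replacement.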
Text: “University of Florida is a great institution for higher education”
Anonymized: “PLACE is a great institution for higher education”
Text: “Mumbai has floods this year”
Anonymized: “PLACE has floods this year”
A typical NER system labels both entities as places, and replacing them with that token loses the general meaning of the sentences.
Instead, they should be converted to a more general form of the object, or to the superclass they belong to. In this case, “University of Florida” can be replaced by “INSTITUTE” and “Mumbai” by “CITY”.
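One way to picture this is as a small type hierarchy in which each entity points to its immediate superclass, and the coarse NER label sits one level higher. The sketch below assumes such a hierarchy is available; the `PARENT` table and both function names are hypothetical and written by hand for these two examples.

```python
import re

# Hypothetical hierarchy: each key points to its immediate superclass.
# A real system would need a curated taxonomy or knowledge base.
PARENT = {
    "University of Florida": "INSTITUTE",
    "Mumbai": "CITY",
    "INSTITUTE": "PLACE",
    "CITY": "PLACE",
}


def generalize(entity, levels=1):
    """Climb `levels` steps up the hierarchy, stopping at the root."""
    current = entity
    for _ in range(levels):
        if current not in PARENT:
            break
        current = PARENT[current]
    return current


def anonymize(text, entities, levels=1):
    """Replace each listed entity with its generalization in one pass."""
    keys = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(map(re.escape, keys)))
    return pattern.sub(lambda m: generalize(m.group(0), levels), text)


entities = ["University of Florida", "Mumbai"]
print(anonymize("University of Florida is a great institution for higher education", entities))
print(anonymize("Mumbai has floods this year", entities))
```

With `levels=1` the output keeps the distinction between an institute and a city; with `levels=2` both collapse to PLACE, which reproduces the plain NER result.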
Text: “Bill had $5. Bill gave $2 to Jack. Bill was left with $3”
Anonymized: “PERSON had $5. PERSON gave $2 to PERSON. PERSON was left with $3”
We can see that the anonymized version above is also not correct: because both people are replaced by the same token, the text loses the information about who did what in the events it describes.
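One possible remedy, assumed here rather than prescribed by the text, is to give every distinct entity its own numbered token, so that coreferent mentions share a token and different entities do not. In the sketch below the `mentions` table is a hand-written stand-in for the output of NER and coreference resolution, and `number_entities` is a hypothetical helper.

```python
import re


def number_entities(text, mentions):
    """Replace each mention with LABEL_k, where k identifies the entity.

    `mentions` maps a surface form to (entity id, label). Surface forms
    that share an entity id are treated as coreferent.
    """
    tokens = {}    # entity id -> assigned token
    counters = {}  # label -> number of entities seen so far
    keys = sorted(mentions, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keys)) + r")\b")

    def assign(match):
        entity_id, label = mentions[match.group(0)]
        if entity_id not in tokens:
            counters[label] = counters.get(label, 0) + 1
            tokens[entity_id] = f"{label}_{counters[label]}"
        return tokens[entity_id]

    return pattern.sub(assign, text)


mentions = {"Bill": ("p1", "PERSON"), "Jack": ("p2", "PERSON")}
text = "Bill had $5. Bill gave $2 to Jack. Bill was left with $3"
print(number_entities(text, mentions))
# PERSON_1 had $5. PERSON_1 gave $2 to PERSON_2. PERSON_1 was left with $3
```

Tokens are numbered in order of first appearance, so the output still shows which person gave money to whom without exposing either name.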
Converting “My name is John” to “My name is PERSON” is correct, but converting “Bill Clinton was the best President” to “PERSON was the best President” is not correct.
Text: “Sherlock lives at 221-B, Baker Street, London.”
Anonymized (using NER): “PERSON lives at NUMBER-B, PLACE, PLACE.”
Anonymized (using proposed approach): “PERSON lives at ADDRESS.”
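The difference between the two outputs comes from treating the house number, street, and city as one unit before substitution. A minimal sketch of that idea, assuming a simple regular expression is enough to recognize the address, is shown below; the pattern, the list of street keywords, and the `anonymize_address` helper are illustrative and would not cover real-world address formats.

```python
import re

# Illustrative pattern: house number, street name, street keyword, and any
# capitalized locality names that follow, separated by commas.
ADDRESS_PATTERN = re.compile(
    r"\d+(?:-?[A-Za-z])?,?\s+"        # house number, e.g. 221-B
    r"(?:[A-Z][a-z]+\s+)+"            # one or more street name words
    r"(?:Street|Road|Avenue|Lane)"    # street type keyword
    r"(?:,\s+[A-Z][a-z]+)*"           # optional trailing localities
)


def anonymize_address(text, persons=()):
    """Collapse whole addresses into ADDRESS, then replace known persons."""
    text = ADDRESS_PATTERN.sub("ADDRESS", text)
    for name in persons:  # `persons` stands in for NER output
        text = re.sub(r"\b" + re.escape(name) + r"\b", "PERSON", text)
    return text


print(anonymize_address("Sherlock lives at 221-B, Baker Street, London.", persons=["Sherlock"]))
# PERSON lives at ADDRESS.
```

Running the address pattern before the per-token substitution is the design choice that matters here: once the span is merged, its parts are no longer replaced one by one.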