Posts

Privacy Protection Algorithms

Objective: The objective of this work is to understand automated text anonymization systems that protect users' personal information while keeping the anonymized text syntactically and semantically coherent, so that the conveyed meaning of the text is not lost. Anonymized data can then be used in many tasks such as data mining, machine learning, and analysis without revealing the identity of the entities involved in creating the data, and can still be useful for improving the accuracy of the applied data analysis tool. Approach: Beyond simply recognizing entities, variation in document types (e.g., financial or medical) and in the kinds of identifying information poses a challenge for automated systems. One common approach is to develop domain-specific anonymization tools, which exploit knowledge about the structure and information content of the documents to construct high-quality anonymization models; the other is to build general models. As different people have different writing styles, domain s…
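To make the idea concrete, here is a minimal rule-based sketch (an illustration, not the system discussed in the post): identifying spans are found with hand-written patterns and replaced by placeholder tags, so the sentence stays syntactically intact. The patterns and the toy name list are assumptions standing in for a real named-entity recognizer.

```python
import re

# Hand-written patterns for two kinds of identifiers (illustrative only).
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}
# A toy name list stands in for a real named-entity recognizer.
KNOWN_NAMES = {"Alice", "Bob"}

def anonymize(text):
    # Replace each identifying span with a placeholder tag, preserving
    # the sentence structure around it.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    for name in KNOWN_NAMES:
        text = re.sub(rf"\b{name}\b", "[PERSON]", text)
    return text

print(anonymize("Alice mailed bob@example.com and called 555-123-4567."))
# -> [PERSON] mailed [EMAIL] and called [PHONE].
```

A real system would swap the pattern dictionary for a trained entity recognizer, which is exactly where the domain-specific versus general-model trade-off above comes in.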

Automating the Machine Learning Workflow - AutoML

Motivation: Why automate machine learning?
- Using machine learning should not require expert knowledge.
- All machine learning tasks follow the same basic workflow.
- It is difficult to find the best-fitting hyperparameters.
- Hand-crafted features are hard to make.
- It's fun.

Automatic machine learning in progress: My motivation to write a blog post on this topic was Google's new project, AutoML.
Google's AutoML project focuses on deep learning, a technique that involves passing data through layers of neural networks. Creating these layers is complicated, so Google’s idea was to create AI that could do it for them.
There are many other open source projects, like AutoML and Auto-SKLEARN working towards a similar goal.
Goal:The goal is to design the perfect machine learning “black box” capable of performing all model selection and hyper-parameter tuning without any human intervention. 
AutoML draws on many disciplines of machine learning, prominently including
Bayesian optimization - It is a sequential design strategy for global optimization of black box …
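Bayesian optimization needs a surrogate model of the objective, which is beyond a short excerpt; as a simpler stand-in, the sketch below uses plain random search to show the same sequential evaluate-the-black-box-and-keep-the-best loop that AutoML frameworks automate. The toy objective is an assumption standing in for a model's validation error as a function of one hyperparameter.

```python
import random

def objective(learning_rate):
    # Toy stand-in for validation error; minimized at learning_rate = 0.1.
    return (learning_rate - 0.1) ** 2

def random_search(objective, low, high, n_trials=200, seed=0):
    # Sequentially propose a candidate, evaluate the black box, keep the best.
    rng = random.Random(seed)
    best_x, best_y = None, float("inf")
    for _ in range(n_trials):
        x = rng.uniform(low, high)   # propose a hyperparameter value
        y = objective(x)             # one expensive black-box evaluation
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y

best_lr, best_err = random_search(objective, 0.0, 1.0)
print(f"best learning_rate ~ {best_lr:.3f}")
```

Bayesian optimization improves on this loop by fitting a probabilistic surrogate to the observed (x, y) pairs and proposing the next candidate where the surrogate predicts the most promise, instead of sampling uniformly at random.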

Research Paper Summary - Detecting Outliers in Categorical Data

Original Paper: An Optimization Model for Outlier Detection in Categorical Data
Authors: Zengyou He, Xiaofei Xu, Shengchun Deng
What are outliers? An outlier is an observation that lies at an abnormal distance from other values in a random sample from a population. Before abnormal observations can be singled out, it is necessary to characterize normal observations. There are many ways to detect outliers in continuous variables, but only a few techniques can detect outliers in categorical variables. Example:
Suppose you have 1,000 people choosing between apples and oranges. If 999 choose oranges and only one person chooses an apple, I would say that person is an outlier.
With continuous data we use measurement (distance) to detect anomalies. With categorical data you instead have to explain why choosing an apple is considered an anomaly: that data point does not behave like the remaining 99.9% of the population.
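The apple example can be captured by the simplest baseline, frequency counting (a sketch of the general idea, not the paper's optimization model): a categorical value is flagged when its relative frequency falls below a chosen threshold. The threshold here is an illustrative choice.

```python
from collections import Counter

def rare_values(observations, threshold=0.01):
    # Flag categorical values whose relative frequency is below the threshold.
    counts = Counter(observations)
    total = len(observations)
    return {value for value, count in counts.items() if count / total < threshold}

fruit = ["orange"] * 999 + ["apple"]
print(rare_values(fruit))  # the lone apple-chooser stands out
# -> {'apple'}
```

The paper's contribution is to go beyond per-value frequencies and formulate outlier detection over combinations of categorical attributes as an optimization problem.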
One technique to detect outliers in categorical variables is using an optimization model call…

Natural Language Interface for Relational Databases - What Is It and How Can It Be Made?

Motivation. Database management systems are used for accessing and manipulating information. The data is manipulated using a set of keywords that follow syntax rules: to perform operations on the database, one has to learn the Structured Query Language (SQL). Hence, a user who does not know SQL cannot directly access information in the database. Natural Language Interface for Databases (NLIDB). What is it? It is a proposed solution to this problem: accessing information in a database using a natural language like English, with no technical knowledge of database languages like SQL. It is a tool that understands a user's query in natural language and converts it into an appropriate SQL query, so that the user can get the required information from the database.

Various approaches for building an NLIDB. Symbolic or rule-based approach. This is the approach in which the translation from natural language to SQL is done using a human-crafted and curated set…
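The rule-based approach above can be sketched in a few lines (a toy illustration, not a real NLIDB): each hand-written rule maps one English question shape to a SQL template. The question patterns and the table and column names are hypothetical, assumed for illustration.

```python
import re

# Each rule pairs a question pattern with a SQL template; captured words
# fill the template's table/column slots (hypothetical schema).
RULES = [
    (re.compile(r"how many (\w+) are there", re.I),
     "SELECT COUNT(*) FROM {0};"),
    (re.compile(r"show all (\w+) from (\w+)", re.I),
     "SELECT {0} FROM {1};"),
]

def to_sql(question):
    for pattern, template in RULES:
        match = pattern.search(question)
        if match:
            return template.format(*match.groups())
    return None  # no rule matched this question shape

print(to_sql("How many employees are there?"))  # SELECT COUNT(*) FROM employees;
print(to_sql("Show all names from employees"))  # SELECT names FROM employees;
```

The weakness is visible immediately: every new question shape needs a new hand-written rule, which is why learning-based NLIDB approaches exist alongside the symbolic one.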