Posts

Replication Explained - Designing Distributed Systems (Part 2)

For Part 1, check out Designing Distributed File Storage Systems - Explained with Google File System (GFS).

Overview: Replication in distributed systems means keeping multiple copies of data at different locations to guard against unpredictable hardware failures, thereby ensuring the availability of the system as a whole. Replication rests on the basic assumption that machines fail independently (uncorrelated failures); otherwise, replication will not help in any way. Whether to replicate at all, how many replicas to keep, and so on all depend on the use case and on how much it would cost you to lose the data and compute power at a given point in time. Intuitively, replication is achieved when we have two servers, a primary and a replica, kept in sync so that if the primary fails at any point in time, the replica has everything it needs t…
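To make the primary/replica idea concrete, here is a minimal sketch assuming a single synchronous replica; the classes and method names are illustrative only and are not taken from the post or from GFS.

```python
# Minimal sketch of primary/replica synchronization (illustrative only).
# The primary applies each write locally and ships it to the replica
# before acknowledging, so the replica can take over if the primary fails.

class Replica:
    def __init__(self):
        self.store = {}

    def apply(self, key, value):
        # Apply a replicated write shipped from the primary.
        self.store[key] = value


class Primary:
    def __init__(self, replica):
        self.store = {}
        self.replica = replica

    def put(self, key, value):
        # Write locally, then replicate before acknowledging the client.
        self.store[key] = value
        self.replica.apply(key, value)
        return "ack"


replica = Replica()
primary = Primary(replica)
primary.put("user:42", "alice")
assert replica.store["user:42"] == "alice"  # replica stays in sync with the primary
```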

Designing Distributed File Storage Systems - Explained with Google File System (GFS) - Part 1

For Part 2, check out Replication Explained - Designing Distributed Systems.

Overview: Original paper: The Google File System. Authors: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. A distributed file system (DFS) is a file system that lets you access files stored across multiple hosts. In contrast to other data stores like MySQL or NoSQL databases, which can also be made distributed, data in a DFS is stored in raw form as files rather than as tables or fixed-format documents. The use cases for a DFS are vast: from media content such as YouTube videos and Instagram images to blogs like this one, anything that can be stored as a file, is large, and is not a good fit for other data stores generally makes a DFS the preferred choice. Key points to consider while designing a DFS: from the perspective of using a distributed file system, there are some characteristics that need to be considered while choosing the data store and tweaked according to…

Privacy Protection Algorithms

Objective: The objective of this work is to understand automated text anonymization systems that protect users' personal information; after anonymization, the text should still remain relevant in syntactic and semantic terms, without losing its conveyed meaning. Anonymized data can be used in many tasks such as data mining, machine learning, and analysis without revealing the identity of the entities involved in creating the data, while still being useful for improving the accuracy of the applied data analysis tool.

Approach: Beyond simply recognizing entities, variation in document types (e.g., financial or medical) and in the kinds of identifying information poses a challenge for automated systems. One common approach is to develop domain-specific anonymization tools, where knowledge about the structure and information content of the documents can be used to construct high-quality anonymization models; the other is to build general models. As different people have different wri…
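As a rough illustration of the general (non-domain-specific) approach, the sketch below replaces entities found by a named-entity recognizer with placeholder labels. spaCy and the `en_core_web_sm` model are my own choices for the example, not tools named in the post.

```python
# Minimal sketch of NER-based text anonymization (illustrative only).
# Detected entities (names, organizations, locations, dates, ...) are
# replaced with their entity-type label so the text keeps its structure.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def anonymize(text):
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        out.append(text[last:ent.start_char])  # text before the entity
        out.append(f"[{ent.label_}]")          # placeholder keeps the sentence readable
        last = ent.end_char
    out.append(text[last:])
    return "".join(out)

print(anonymize("John Smith visited Acme Corp in Berlin on 3 May 2020."))
# e.g. "[PERSON] visited [ORG] in [GPE] on [DATE]."
```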

Automating the Machine Learning Workflow - AutoML

Motivation: using machine learning will not require expert knowledge; all machine learning tasks follow the same basic flow; it is difficult to find the best-fit hyperparameters; handcrafted features are hard to make; and it is fun.

Automatic machine learning in progress: My motivation to write a post on this topic was Google's new project, AutoML. Google's AutoML project focuses on deep learning, a technique that involves passing data through layers of neural networks. Creating these layers is complicated, so Google's idea was to create AI that could do it for them. There are many other open-source projects, like AutoML and auto-sklearn, working towards a similar goal.

Goal: The goal is to design the perfect machine learning "black box" capable of performing all model selection and hyperparameter tuning without any human intervention. AutoML draws on many disciplines of machine learning, prominently including Bayesian optimization, a sequential design strategy for gl…
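To give a flavour of the "black box" idea, here is a minimal sketch using scikit-learn's RandomizedSearchCV. It uses plain random search rather than Bayesian optimization, and the dataset and parameter ranges are arbitrary choices of mine, not part of Google's AutoML or auto-sklearn.

```python
# Minimal sketch of automated model/hyperparameter search (illustrative only).
# Random search over a parameter grid stands in for the more sophisticated
# Bayesian optimization used by systems like auto-sklearn.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=10,   # number of sampled configurations
    cv=5,        # 5-fold cross-validation for each configuration
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)  # best configuration found
print(search.best_score_)   # its cross-validated accuracy
```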

Research Paper Summary - Detecting Outliers in Categorical Data

Original Paper: An Optimization Model for Outlier Detection in Categorical Data. Authors: Zengyou He, Xiaofei Xu, Shengchun Deng.

What are outliers? An outlier is an observation that lies at an abnormal distance from the other values in a random sample from a population. Before abnormal observations can be singled out, it is necessary to characterize normal observations. There are many ways to detect outliers in continuous variables, but only a few techniques can detect outliers in categorical variables.

Example: Suppose you have 1,000 people choosing between apples and oranges. If 999 choose oranges and only one person chooses an apple, I would say that person is an outlier. With continuous data we use measurements to detect anomalies; with categorical data you have to explain why choosing an apple is considered an anomaly (that data point does not behave like the remaining 99.9% of the population). One technique to detect outliers in categorical variables is using a…
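The excerpt is cut off, but a common frequency-based baseline for the apples-and-oranges example goes like this: flag values of a categorical attribute whose relative frequency falls below a small threshold. The sketch below is my own illustration, not the optimization model from the paper.

```python
# Minimal sketch of frequency-based outlier detection for categorical data
# (illustrative baseline, not the paper's optimization model).
from collections import Counter

def rare_values(values, threshold=0.01):
    """Return the values whose relative frequency is below `threshold`."""
    counts = Counter(values)
    n = len(values)
    return {v for v, c in counts.items() if c / n < threshold}

choices = ["orange"] * 999 + ["apple"]
print(rare_values(choices))  # -> {'apple'}: 1/1000 = 0.1%, below the 1% threshold
```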