

In this article, you will understand what categorical data encoding is, and learn different encoding techniques and when to use them.

The performance of a machine learning model depends not only on the model and its hyperparameters but also on how we process and feed different types of variables to it. Since most machine learning models only accept numerical variables, preprocessing the categorical variables becomes a necessary step. We need to convert these categorical variables to numbers so that the model can understand them and extract valuable information.

A typical data scientist spends 70 to 80% of their time cleaning and preparing data, and converting categorical data is an unavoidable part of that work. Done well, it not only improves model quality but also helps with feature engineering. Now the question is, how do we proceed? Which categorical data encoding method should we use?

I will be explaining various types of categorical data encoding methods, with implementation in Python. If you would rather learn data science concepts in video format, check out our course, Introduction to Data Science.

Since we are going to be working with categorical variables in this article, here is a quick refresher with a few examples. Categorical variables are usually represented as 'strings' or 'categories' and are finite in number:

The city where a person lives: Delhi, Mumbai, Ahmedabad, Bangalore, etc.
The department a person works in: Finance, Human resources, IT, Production.
The highest degree a person has: High school, Diploma, Bachelors, Masters, PhD.
The grades of a student: A+, A, B+, B, B-, etc.

In the above examples, the variables can take only a fixed set of possible values. Further, we can see there are two kinds of categorical data: ordinal, where the categories have an inherent order (grades, highest degree), and nominal, where they do not (city, department).
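To make the conversion from categories to numbers concrete, here is a minimal sketch in plain Python. The records, the field names, and the ordering chosen for the degree levels are assumptions made up for illustration; they are not taken from a real dataset. The sketch maps an ordered variable to integer ranks and expands an unordered variable into 0/1 indicator fields.

```python
# Illustrative records; field names and values are hypothetical.
records = [
    {"city": "Delhi", "degree": "Masters"},
    {"city": "Mumbai", "degree": "High school"},
    {"city": "Delhi", "degree": "PhD"},
]

# Ordered categories: map each level to its rank (assumed ordering).
DEGREE_ORDER = ["High school", "Diploma", "Bachelors", "Masters", "PhD"]
degree_rank = {level: rank for rank, level in enumerate(DEGREE_ORDER)}

# Unordered categories: collect the distinct values seen in the data.
cities = sorted({r["city"] for r in records})


def encode(record):
    """Return a purely numeric feature dict for one record."""
    features = {"degree": degree_rank[record["degree"]]}
    # One 0/1 indicator per distinct city.
    for city in cities:
        features[f"city_{city}"] = 1 if record["city"] == city else 0
    return features


encoded = [encode(r) for r in records]
for row in encoded:
    print(row)
```

Every value in `encoded` is now an integer, which is the form most models expect. In practice, libraries such as pandas and scikit-learn provide ready-made encoders, so hand-written mappings like this are mainly useful for understanding what those encoders do.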
