Complete Guide on Data Cleaning in Python

Data cleaning and Python are each well known in their own right. What is interesting is that the two can be put together to take on the tedious task of data cleaning. So, we have prepared this guide where you will learn all about data cleaning in Python and how to run a Python program as well.

For instance, let’s consider that we have a list of tasks to be done, be it a household chore or a deadline to be met in the office. To get through them, we make sure that the tasks are done in an orderly manner, don’t we?

Well, the same applies when handling data, because the way we handle our data decides how effective our results are going to be. Python has become a popular way to get well-filtered data. So, in this guide, we will learn about the importance of data cleaning, how to do it with Python, and how to run a Python program in cmd and in Windows too.

What is Data Cleaning In Python?

The meaning is simpler than you might think. Just as the two words suggest, data that has been collected for analysis is cleaned to get the relevant information out of it. The process of removing data that is incorrect, incomplete or duplicated, and that can affect the end results of the analysis, is called data cleaning.

This does not mean that data cleaning is only about the removal of certain kinds of irrelevant data. It is a process for ensuring dependability and increasing the accuracy of the data which has been collected.

“Data scientists claim that 80% of their time is consumed by the hectic process of data cleaning.”

In today’s technically advanced world, fields such as machine learning depend on the accuracy of their data, so accuracy becomes an important parameter to be met. What’s interesting here is that data cleaning is nowadays done with tools and languages like Python.

Yes, Python programs can be written and executed to create data sets that are standardized and uniform, ready to be used by data analytics tools. So along with handling data and cleaning it, there is also the aspect of how to run a Python program, which will be covered in the subsequent sections, so continue reading.

How To Do Data Cleaning in Python?

Let’s take the example of a survey in which a particular form is filled in by a number of people. This data now has to be processed, and there is a good chance that some of it is irrelevant or incomplete because fields were left blank or forms were not filled in at all.

But the data collected has to be processed, and programs are written to avoid degrading it any further. One of the most preferred languages for the task is Python, so let’s get back to the forms from our example and learn how to run a Python program. This will help us understand how to do data cleaning in Python much better.

Now, in a programming language, there are certain parameters to be filled and certain dependencies to be met to make sure the process is time-efficient as well. Already counting the factors in the picture, right?

So the parameters of the programming languages are called data types. Just like we categorize matter into solid, liquid and gas, Python also categorizes data entered into data types like integer, float, Boolean and others.
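As a minimal sketch of these data types, the snippet below assigns a few values and asks Python which type each one has. The variable names and values are made up purely for illustration.

```python
# Illustrative values only; the variable names are hypothetical.
bedrooms = 3            # whole number  -> int
area = 84.5             # decimal value -> float
is_rental = False       # yes/no value  -> bool
street = "PUTNAM"       # text          -> str

# type() reports the data type Python has assigned to each value.
for value in (bedrooms, area, is_rental, street):
    print(repr(value), "is of type", type(value).__name__)
```

Notice that the type is never written out by hand; Python works it out from the value itself.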

Once this classification is done, the first step towards building a Python program is completed. But are you wondering how the rest of the work gets done? Basic types such as integer and Boolean are built into Python itself, and anything more specialised comes from dependencies. These are generally called libraries, and they contain ready-made definitions that a Python program can use.

Another aspect that comes into play while creating a program is its size. Think of reading a book: would it be easier to follow divided into chapters or as one continuous text?

Similarly, the code for data cleaning in Python can be split across several files, each of which is called a module, and then run with tools like Eclipse or Jupyter. The Python interpreter reads the instructions in the program and applies them to the collected data to produce reliable data.

[Image: a sample Python program]

Given all this information, we have now understood the importance of data cleaning in python and the basic flow of how to run a python program that is centered on data cleaning.

What is Python and why a Python program for data cleaning?

Like the many other programming languages in the technical world, Python is a major contributor to its advancement and is a preferred language among developers.

The main factors behind its popularity are its ease of learning, simple syntax and enhanced readability, which in turn reduce the cost of maintenance. Given all these advantages, Python is an ideal choice for beginners in data cleaning.

So, before we look at how beginners can do data cleaning in Python and write a Python program for cleansing data, let us understand the elements that are prerequisites for writing the logic to carry out such a process.

Later on, we will also learn how to run a Python program in cmd and how to run a Python program in Windows.

Python is favored for its simple syntax because of the way the language is designed and because much of its functionality is packed into modules called libraries. These libraries behave like an encyclopedia: whatever a Python program uses from them is looked up in the definitions written there.

It can also be said that these are just like the libraries we visit, which we keep going back to for information as and when required. Now, apart from declaring variables, the next advantage that comes from these Python libraries is the inbuilt functions.

We know these two words may not mean much to you if you are not from a technical background, but give it another minute and you’ll get to know what they are. Let’s take the example of a calculator.

Understanding Python with an example

In a calculator, we enter a set of numbers and then press the button for some common mathematical operation like addition, subtraction, multiplication or division, and the result is generated in a fraction of a second.

Just like a calculator, Python is capable of performing such operations, and higher-level ones, with the help of its inbuilt methods. Say you want to add two numbers ‘a’ and ‘b’: the simple expression ‘a+b’ will give you the desired result.

That happens because the ‘+’ symbol has been given a particular job in Python itself. For more advanced work, the details are present in libraries. All you have to do is import them into your program and make use of them to build even better logic, say, to calculate the value of Pi to thousands of places.
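Here is a small sketch of both ideas: adding two numbers with ‘+’, and importing a library to get something extra. It uses Python's standard `math` library, which is our choice for illustration; note that `math.pi` holds Pi to roughly 15 decimal places, so going to thousands of places would need a different tool.

```python
import math  # standard library module with common mathematical definitions

a = 7
b = 5

# '+' is built into Python, so no import is needed for plain addition.
total = a + b
print("a + b =", total)

# math.pi comes from the imported library.
radius = 2
circle_area = math.pi * radius ** 2
print("Pi is approximately", math.pi)
print("Area of a circle with radius 2:", circle_area)
```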

But just as a calculator has a mini screen of its own to display the result of an operation, where do you think the output of a Python program will be displayed?

And we know you are troubled by the question of how to run a Python program. To do that, you need to save your file with the extension ‘py’ and run it in cmd, through an interpreter, or through tools like Jupyter. For more clarification, continue reading on how to run a Python program in cmd and how to run a Python program in Windows.

How to Run a Python Program in cmd?

To run your Python program in cmd, first of all, make sure Python (python.exe) is installed on your machine. After that, open “Run” by pressing the Windows key + R, type cmd and hit Enter. A terminal window will open; type or paste the path to your python.exe into it and press Enter. This terminal window will now act as the place where you run your program.

If that doesn’t suit you, or your program is very large, you can also pass the path to your script in the terminal. Once again, open the terminal and type: C:\python27\python.exe Z:\code\hw01\script.py where the former is the path to the Python executable and the latter is the path to the file in which the program is written.
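To make this concrete, here is a minimal sketch of what a script file such as script.py might contain. The function name `build_message` is hypothetical and chosen only for this illustration. When the file is run from the terminal, the printed line appears in that same terminal window.

```python
# Contents of a script file such as script.py (illustrative only).
import sys


def build_message(script_path):
    """Return the line this script prints when it is run."""
    return f"Hello from {script_path}"


if __name__ == "__main__":
    # sys.argv[0] holds the script path as it was passed to the interpreter.
    print(build_message(sys.argv[0]))
```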

If you don’t find this method feasible for any reason then take a deep breath and relax because we have got you covered with the steps on how to run a python program in windows.

How to run a Python Program in Windows?

To run a Python program on Windows, you can rely on setting the environment variable on your machine. For this, go to Computer → Properties → Advanced System Settings → Environment Variables → Path.

There will be a long list of entries against this variable. Just add the path to your Python installation, which by default looks like ‘C:\Python27’ once you’ve installed Python on your machine. Click on save.

Now open Run, type cmd and press Enter. In the window that opens, type python followed by the path to your program file to see the output. The path could be anything like ‘C:\Users\Username\Desktop\my_python_script.py’, wherever you have saved the file with the extension py.

Please note, the extension ‘py’ is what helps your machine understand that a file is a Python program. Sometimes the small mistake of saving your file with any other extension is the reason your steps to run a Python program in Windows don’t work at all.

Data Cleaning in Python

So far, we have understood what data cleaning in Python is, how to do it, why it is important, what Python is, and how to run a Python program in cmd and in Windows. The next and main milestone of our guide is to use the two of them together.

To understand how the two work together, we will get back to the example of data collection through form filling. To keep things simple, we will pick just a few of the many fields filled in on a form.

We are taking fields such as house number, street name, occupancy of the house and number of bedrooms in the house. What we have collected through the forms are the details of certain people: the address of their houses, whether they live there themselves or it is a rental, and how many bedrooms there are in their houses.

But given the long descriptions of these fields, it will become difficult to use them again and again. So just like a program that has some variables defined, we will also define these fields, meaning we give them simpler and shorter names.

For house number it will be hnum, for street name it will be sname, for owner occupancy it will be occupancy and for number of bedrooms it will be Num_bedrooms. Please note that you are free to give almost any name to a variable in a Python program. It can be ‘a’, ‘b’, ‘c’ or like the ones we have chosen.

Moving on to the types of data stored against these fields in the form. As a common understanding, the house number ‘hnum’ will be a number like 1104 and will be listed as data type integer in Python, while the street name ‘sname’ will consist of names, so it will be alphabetical and will be listed as String.

Since the occupancy only deals with whether the owner is living there or it is a rental, it can be a yes or no question and will be under data type Boolean, and the number of bedrooms is again an integer.
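The field-to-type mapping described above can be sketched as a single form entry. The field names are the ones chosen in this guide; the values (apart from the house number 1104) are invented for illustration, and storing the entry in a dictionary is just one convenient choice.

```python
# One hypothetical form entry, using the short field names from this guide.
record = {
    "hnum": 1104,         # house number       -> integer
    "sname": "PUTNAM",    # street name        -> string (made-up value)
    "occupancy": True,    # owner lives there? -> Boolean
    "Num_bedrooms": 3,    # number of bedrooms -> integer (made-up value)
}

# Print each field with its value and the data type Python assigned to it.
for field, value in record.items():
    print(f"{field}: {value!r} ({type(value).__name__})")
```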

This means that so far we have learned about 3 distinct data types in Python and how to categorise a field in Python as well. You can assume that your collection of data will be represented somewhat like this:

[Image: data cleaning with Python example]

The process of Data Cleaning in Python for Beginners with an Example

If you look at this table carefully, you’ll notice that certain fields are either blank or have been filled in as NA. There can be many reasons for that, but it hampers our purpose of collecting the data, because the data is not completely reliable until we rule out the unnecessary information from it.

So, our next step would be to read this data through the Python program so that we can process it. A code like the one in the image below would be apt and helpful:

[Image: code snapshot and its output table]

Notice the output table carefully: it is exactly the same as the table we had in the first place, with all values filled in or left blank. So here’s what you can gather from this one. First of all, there are two imports, Pandas and NumPy. These are two libraries widely used in Python for working with data.

Another thing to notice is the short alias given to each of them. This saves writing time and space when creating large modules and is considered good practice. ‘df’ is the variable which has been used to hold the data read from the table in a file named ‘property data.csv’.

This reading is done with the help of the inbuilt method ‘read_csv’, and the result is printed on the screen with the help of the print command and the head method. For the fields which were not filled in or were left blank in the table, the value NaN (not a number) has been filled in.

So isn’t it great, that with just one method you are able to read the entire table at once and process the output as desired?
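The reading step can be sketched roughly as follows. Since the original file is not available here, the sketch reads a small made-up table from a text string instead of from disk; the rows and the `csv_text` variable are assumptions for illustration, while the column names are the ones used in this guide. It requires the pandas library to be installed.

```python
import io

import pandas as pd  # 'pd' is the conventional short alias for pandas

# Hypothetical stand-in for the survey file; every value here is invented.
csv_text = """hnum,sname,occupancy,Num_bedrooms
104,PUTNAM,Y,3
197,LEXINGTON,N,
,LEXINGTON,N,2
201,BERKELEY,,NA
"""

# read_csv accepts a file path or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first rows; blank and 'NA' cells show up as NaN.
print(df.head())
```

With a real file on disk, the path to that file would be passed to `read_csv` in place of the `StringIO` object.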

Let’s take an easy example to learn how data cleaning works in Python. Consider the field Num_bedrooms; we will figure out how many of its entries have been left blank. A code snapshot for this is given below:

[Image: code snapshot and its output]

If you observe the lines of code, the field ‘Num_bedrooms’ is printed first. After that, the method isnull has been used to determine whether each value is null, blank or NaN as per Python. If it is blank, the method gives the Boolean value True, otherwise False.
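A rough, self-contained version of this check is sketched below. The column values are invented for illustration, and the variable name `missing` is our own.

```python
import numpy as np
import pandas as pd

# Made-up values for the Num_bedrooms field; np.nan marks a blank entry.
df = pd.DataFrame({"Num_bedrooms": [3, np.nan, 2, np.nan]})

# Print the field itself, then the True/False result of isnull().
print(df["Num_bedrooms"])
missing = df["Num_bedrooms"].isnull()
print(missing)

# True counts as 1 when summed, so this gives the number of blank entries.
print("Blank entries:", missing.sum())
```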

So you can easily match the two outputs and see that for each null value, True has been printed. Given this output, you can go on to write logic that reads it and, wherever it contains True, leaves that entire row out when generating the results of the data collection. And this is how your data can be considered clean.
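One possible way to write that logic is sketched here, again on made-up rows. It keeps only the rows where isnull gave False; the names `missing` and `clean_df` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows; the second one has no value for Num_bedrooms.
df = pd.DataFrame({
    "hnum": [104, 197, 203],
    "sname": ["PUTNAM", "LEXINGTON", "BERKELEY"],
    "Num_bedrooms": [3, np.nan, 1],
})

# True marks the rows where Num_bedrooms is blank.
missing = df["Num_bedrooms"].isnull()

# '~' flips True and False, so only rows with a value are kept.
clean_df = df[~missing]
print(clean_df)
```

pandas also offers `df.dropna(subset=["Num_bedrooms"])`, which gives the same result in a single call.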

Conclusion

We understand that was a lot of information in one go, but it is enough to get started with data cleaning in Python as a beginner. Once you have understood this clearly, learn more about data cleaning in Python with the Data Science Using Python Course. In case you have any doubts, just let us know and we are here to help you.

Sugandha Singh
She is a person with an interest in reading, exploring places and trying new food outlets. Writing has a special place in her heart and gets her going every day.
