Data forms the backbone of any data analytics you do. Regarding data, there are many things to go wrong – be it the construction, arrangement, formatting, spellings, duplication, extra spaces, and so on. To perform the data analytics properly we need various data cleaning techniques so that our data is ready for analysis. It’s commonly said that,
“Data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time actually analyzing it.”
Thus, it is important to grow accustomed to the process of data cleaning techniques and all of the data cleansing tools that are related to data cleansing methods. This post provides a very basic introduction to data cleansing techniques in Excel.
This post covers the following data cleaning steps in Excel along with data cleansing examples:
- Get Rid of Extra Spaces
- Select and Treat All Blank Cells
- Convert Numbers Stored as Text into Numbers
- Remove Duplicates
- Highlight Errors
- Change Text to Lower/Upper/Proper Case
- Spell Check
- Delete all Formatting
What is Data Cleansing?
Data cleansing or data cleaning is the process of identifying and removing (or correcting) inaccurate records from a dataset, table, or database and refers to recognizing unfinished, unreliable, inaccurate, or non-relevant parts of the data and then restoring, remodeling, or removing the dirty or crude data.
Data cleaning techniques may be performed as batch processing through scripting or interactively with data cleansing tools.
After cleaning, a dataset should be uniform with other related datasets in the operation. The discrepancies identified or eliminated may have been basically caused by user entry mistakes, by corruption in storage or transmission, or by various data dictionary descriptions of similar items in various stores.
Are Data Cleaning Techniques Essential?
Data cleaning techniques are not only an essential part of the data science process – it’s also the most time-consuming part. As the New York Times reported in a 2014 article called “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”,
“Data scientists … spend from 50 per cent to 80 per cent of their time mired in this more mundane labour of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”
Unfortunately, Data Cleaning techniques are generally not spoken about in the media nor is it taught in most intro Data Scientist Course because it is not as important as training a neural network or identifying images but to perform those things data cleaning plays a very important role.
Without the data cleaning techniques, the neural networks and image identification modules will not be as efficient as we want them to be.
With the rise of big data, data cleaning methods have become more important than ever before. Every industry – banking, healthcare, retail, hospitality, education – is now navigating in a large ocean of data.
And as the data pool is getting bigger, the variables of things going wrong too are getting larger. Each fault becomes difficult to find when you can’t just look at the whole dataset in a spreadsheet on your computer. In fact, this could be true for a variety of reasons.
Data Cleansing Examples and Data Cleaning Methods in Excel
In this post, I will show you various ways to clean data in Excel with data cleansing examples & data cleansing techniques.
1. Data Cleaning Techniques-Get Rid of Extra Spaces
Here I have the text Welcome To Digital Vidya written in four different ways.
welcome to digital vidya welcome to digital vidya welcome to digital vidya welcome to digital vidya
The first one is the regular way with only one space between words, in the second case I have more than one space between words, in a third case I have some leading spaces along with a couple of spaces between words, and in the fourth case I have trailing spaces, you can see there is a couple of space after the last word.
Now, this could typically be the case if you get this data from a colleague or you get it from a text file or imported from a database. So to clean this data and get rid of these extra spaces you can use the function trim.
Syntax: =TRIM(Text)
The trim function takes one single argument which could either be the text which you type manually or it could be the cell reference, in this case, I will take the cell reference A1 and what this function does is it would remove all the leading spaces and trailing spaces and extra spaces between words except one single space that is allowed.
So if I drag this down you would see that it has corrected all these texts. It has removed the extra space here between welcome into it has removed the leading spaces and trailing spaces.
2. Data Cleaning Techniques-Select and Treat All Blank Cells
If you just need to use the text you can convert it into values by using paste special I have student names here and their marks in three subjects. You can see that there are gaps in this dataset which could be because the student could not appear in the exam.
Now you may not want to leave this data set with blanks, you may want to type not appear in all these cells which are blank. So to do that you can either go and select each cell manually and type not appear. But if you have a huge data set that because could be very tiresome. So to do it at one go,
- Select the entire data set,
- Go to find and select and select this option Go to Special this opens the go-to special dialog box. You can also use the keyboard shortcut F5 and when you do this it opens the go-to dialog box here you have the special button, click on it and it again opens it equal to a special dialogue box
- Click blanks and click okay, this would select all the blank cells in your data set at the same time
- So now you have these cells in grey and the first cell is in white because this is the active cell so to type not appear in all these cells just start typing not appear and hit ctrl+enter and as soon as you hit ctrl+enter this gets entered in all the cells.
3. Data Cleaning Techniques-Convert Numbers Stored as Text into Numbers
Here I have this number entered in three different ways,
123 123 '123
In the first case, it is a number as you can see it is aligned to the right of the cell numbers are always aligned to the right while text gets aligned to the left of a cell and in the other two cases, you can see these are text format because these are aligned to the left. now to convert all these three back into numbers.
The first one is already a number but to convert these two back into a number there are two ways to do it. The first one is I would go to the formatting box and I would type general and when I hit enter the second one gets converted back into a number because in this case it was merely in the text format but the third case is a little more difficult because it has been entered by using a leading apostrophe and a lot of people do this, a lot of people enter numbers for starting with an apostrophe so that it gets converted into a text and this could create some problems.
A very foolproof method is to type in any of the blank cells go to the cell and copy this now select these cells go to paste -> paste special and this opens the paste special dialog box. Here you have operation category within this select multiply and click okay so what this does is this multiplies this number with one and any number multiplied by one is unchanged but this also takes care of the apostrophe so now all these three numbers get converted back into a number format.
4. Data Cleaning Techniques-Remove Duplicates
Here I have a data set of students and their marks in three subjects and there are duplicates in this data so you can see there is a duplicate for Bill and a duplicate for Phil now if you want to remove these duplicate values there are two ways to do it first is using conditional formatting,
- So you can select the data set
- Go to home -> conditional formatting
- Highlight cell rules -> duplicate values and as soon as you select this it gives you the option to highlight duplicates and the formatting. I will keep the formatting as a red fill with dark red text and when I hit OK you can see that this has been highlighted and all those numbers and names that appear more than once it highlighted in red.
I can manually see that Phil repeats twice and Bill repeats twice so I can select this data and manually delete this.
The other way to remove duplicates & get cleaned data is by selecting the entire data set going to data and here I have the option remove duplicates I click on this option and it opens the remove duplicates dialog box here make sure that if your data has headers which in this case it has this option is selected if this is not selected then this is also counted as a part of your data it should not be the case when you have selected this option these names are the names of the columns so I can see that there is a student column math column physics column in chemistry column I can then select ok to remove all those rows or all the data set which is duplicate but in this case, it would not remove a number which repeats again rather the entire row has to be an exact duplicate so.
For example, Jack and these three marks have to meet exactly the numbers and name here Jack and these three marks and if that is the case then this row gets deleted similar is the case with Jill and these numbers so now when I click OK it says one duplicate values found and removed unique values remain the reason being that in case of Jill you can see that the marks do not match so this entire row is not an exact match and hence it remains but since in case of Bill it was an exact match then the row was removed.
5. Data Cleaning Techniques-Highlight Errors
Here I have a dataset for five companies. I have their revenue number for three years and net income numbers for three years and using these numbers I have calculated the net income margin which is net income by revenue.
Now you can see that there are errors in data for Company X and Company Z, the reason being that there is no revenue number for these companies in and hence I get a division error because I try to divide their net income by nothing. Now this is a small data set and you can visually spot these errors but if you have a huge data set these errors could be difficult to spot so to do that you can use two methods first is using conditional formatting,
So select this entire dataset go-to home -> conditional formatting and select new rule within new formatting rule dialog box select Format only cells that contain and from this drop-down select errors when you select errors you would get the option to format the cells which have an error, in this case, let me select red and I click OK and as soon as I do this all the cells that have errors in it get highlighted in red
We control Z to go back the other way to do this would be to select those cells which have errors and you can do this by using the go-to special dialog box so to do that press F5 this opens the go-to dialog box here you have the special button, click on it, this opens the go-to special dialog box.
Here select formulas and within formulas as soon as you selected all these four options to get available, deselect the first three options and only keep the errors option selected and now click OK. When you do this all those cells which have an error in them get selected now you can manually either delete all these cells or type something like not available and hit control enter so that it gets entered in all the cells which have an error in.
6. Data Cleaning Techniques-Change Text to Lower/Upper/Proper Case
Here I have names written in different ways you can see either it could be all caps, it could be all lowercase and in some cases, it’s a mix-and-match of uppercase lowercase so to make it all consistent you can use one of these three formulas,
SYNTAX: LOWER() – Converts all text into Lower Case. ex. mary jane UPPER() – Converts all text into Upper Case. ex. MARY JANE PROPER() – Converts all Text into Proper Case. ex. Mary Jane
LOWER() formula takes one argument, it could be either the text that you type in or you can use a cell reference in this case if I’ll use the cell reference A1 and when I hit ctrl enter this gives me mary jane the name but all the alphabets have been converted into lowercase and when I drag this down this is the case for all these names all these names now look consistent in lowercase you may want these all in the uppercase so in that case you can use the formula UPPER() and you can see this these are all in uppercase now as I drag this down the most used way is proper case because it would keep the first alphabet of your name as in capital and the rest all would be in the lowercase may show you I will select PROPER() and I hit ctrl enter and you can see M of Mary and J of Jane is in caps and rest all the alphabets are in the lowercase and now I drag it down so these are three formulas that can very quickly make your text consistent this could be the case when you are sharing a worksheet or you get it from a text file where a lot of people enter it in different ways these formulas can quickly make these consistent.
7. Data Cleaning Techniques-Spell Check
If you have huge data set and you want to only extract a part of it in the form of cleaned data while Microsoft PowerPoint and Microsoft Word have a feature where it would underline if there are any errors grammatical errors or spelling errors Microsoft Excel does not have that feature however you can still run spellcheck and correct these errors.
So to do that select the data and press F7 and when you do that it runs the spellcheck for you and it is the same thing that you see in Microsoft Word or PowerPoint it will show you the text that it thinks is a spelling error and it will show you the suggestions as well so you can change these and once it is done it will show you that spellcheck is complete and you are good to go.
8. Data Cleaning Techniques-Delete all Formatting
If you have a worksheet where there is a lot of formatting and you need to clear all the formatting you can quickly do that by,
- Selecting the entire data
- Go to Home –> Clear –> Clear Formats
- You can also use clear all this would remove everything from your sheet including the content you can only clear the content would remain the formatting would remain intact you can clear the comments and the hyperlinks.
Data Cleansing Tools
Here are some interesting Data Cleansing tools relating to data cleaning techniques, analysis and modeling of data,
JASP – Open Source statistical software similar to SPSS with support of COS
Rattle – GUI for user-friendly machine learning with R
RapidMiner – Another point and click machine learning package
Orange – Open Source GUI for user-friendly machine learning with Python
Talend data preparation – Data cleaning, preparation tool with smarts
Trifacta Wrangler – Data cleaning, preparation tool with the match by example feature
They are all open-source or have free versions focusing on cleaning, analyzing, and modeling data.
Conclusion
Data cleaning is an inherent part of the data science process to get cleaned data. In simple terms, you might divide data cleaning techniques down into four stages: collecting the data, cleaning the data, analyzing/modeling the data, and publishing the results to the relevant audience.
If you try to skip the data cleaning steps, you’ll often run into problems getting the raw data & not cleaned to work with traditional data cleansing tools for analysis in, say, R or Python. Thus, it becomes important to take into consideration, the data cleaning steps & data cleaning methods.
If you are looking forward to building a career in the Data Science domain, enroll in our Data Science Using Python Course.
—
Image and concept credits
trumpexcel.com | P E X E L S
Aaron Tay from Blog – Musings about librarianship
Wonderful read. Such a useful and very interesting stuff to do in every research and data analysis you wanna do! Thank you very much for the very organized data analysis tips I learned a lot from it. I really loved this write up, You Nailed It.
Wowww amazing
very well explained …….Thank u very much
i am inspired to begin my data science course. Thank you for this explicit lesson.