Introduction
Data is everywhere. It is part and parcel of every business and of the processes that keep a business running; put simply, it is an inexhaustible resource. Yet very few people realized its importance before the current surge towards data-intensive applications. Let’s learn data visualization in Python.
The reason most often cited for the late arrival of this data age, in which data is seen as the new oil, is the lack of computational power.
Many of the most widely used and powerful machine learning algorithms had been developed by the late 1990s, but the computational power needed to apply them where their impact could be felt was not yet available.
Another major roadblock was the lack of efficient open-source languages that could handle different algorithms and let a single user delve into the world of data and harness it to build amazing applications.
Among the languages that let people make a dent in this data-driven world, Python has certainly taken the cake, given its use both as a general-purpose language and for the different facets of data science, be it computer vision, natural language processing, predictive modelling or, of course, visualization.
Python is not only used to apply powerful ML algorithms; its visualization libraries also make it possible to tell stories with data. These visualizations surface insights and help decision-makers understand the data well and implement changes effectively.
To explore the visualization side of Python, let us tell a story of our own, using different Python libraries around the famous and popular Iris dataset.
Why is Visualization important?
Much of the information we receive is stored as tables and text, where we can look up the exact numbers behind it. Why, then, do we need to visualize it? Data visualization in Python is very important; let’s learn why.
The answer lies in the inner workings of the human mind: it is well known that our brain recognizes images faster than text. A baby recognizes the image of his or her mother months before being able to understand what the word mommy means. We understand images instantly, whereas we have to work to process text.
By widely quoted figures, the brain processes images 60,000 times faster than it does text. It is also more accustomed to processing images: ninety percent of the information sent to the brain is said to be visual, and 93% of all human communication is said to be visual. None of this is new or recent; the human brain has always processed images far faster than words.
It therefore pays to visualize data and build stories around it that explain it intuitively. Visual information lets us derive insights quickly, which is why visualization is an important part of working with data and of providing insights that help us take effective business decisions.
Iris dataset
The Iris dataset includes three iris species with 50 samples each, as well as some properties of each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.
The columns in this dataset are:
- Id
- SepalLengthCm
- SepalWidthCm
- PetalLengthCm
- PetalWidthCm
- Species
Visualization libraries in Python
We will look at the main Python libraries used to create exciting plots that let us draw inferences from data. The main ones are as follows.
Matplotlib
Matplotlib is the backbone of Python data visualization libraries. Despite being over a decade old, it’s still the most widely used library for plotting in the Python community. It was designed to closely resemble MATLAB, a proprietary programming language developed in the 1980s.
Matplotlib was one of the earliest Python data visualization libraries, so many other libraries are built on top of it or designed to work in tandem with it during analysis. Some libraries, like pandas and Seaborn, are wrappers over matplotlib: they give access to a number of matplotlib’s methods with less code.
Seaborn
Seaborn harnesses the power of matplotlib to create beautiful charts in only a few lines of code. The key difference is Seaborn’s default styles and color palettes, which are designed to be more aesthetically pleasing and modern. Since Seaborn is built on top of matplotlib, you will need to know matplotlib to tweak Seaborn’s defaults.
Geoplotlib
Geoplotlib is used for creating maps and plotting geographical data. We can use it to create a variety of map types, like choropleths and heat maps. We must have Pyglet (an object-oriented programming interface) installed to use geoplotlib. It is a single library dedicated to plotting geographical data, and it is efficient at that task.
Plots in Python
To fully appreciate the power of visualization, there is no better way than to visualize data in engaging ways and let the data speak for itself. We are going to use the Iris dataset to explore the resourcefulness of Python and make the data talk.
Loading libraries
We start by loading all the necessary Python libraries.
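The import cell itself is not reproduced in the text. A minimal sketch of what it could look like, assuming pandas, matplotlib and seaborn are installed (the whitegrid style is only an illustrative choice, not something the article specifies):

```python
# Core libraries for the plots discussed in this article.
import matplotlib.pyplot as plt   # low-level plotting backbone
import pandas as pd               # tabular data handling and pandas.plotting helpers
import seaborn as sns             # statistical plots built on matplotlib

# Optional: a light grid background for all seaborn plots (assumed preference).
sns.set_style("whitegrid")
```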
Now let us have a look at the dataset we need to work on.
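The data file is not shown in the text, so the sketch below rebuilds an equivalent table from the copy of iris bundled with scikit-learn (an extra dependency assumed here) and renames the columns to match the column list given earlier. The helper name load_iris_frame and the "Iris-" label prefix are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Hypothetical helper: iris from scikit-learn, renamed to the article's columns."""
    bunch = load_iris(as_frame=True)
    frame = bunch.data.copy()
    frame.columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
    frame.insert(0, "Id", range(1, len(frame) + 1))
    frame["Species"] = ["Iris-" + str(bunch.target_names[i]) for i in bunch.target]
    return frame


iris = load_iris_frame()
print(iris.head())                     # first five rows
print(iris["Species"].value_counts())  # number of samples per species
```

If the data is available as a CSV file instead, pd.read_csv on that file would replace the helper.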
As expected, we have the dataset with all the necessary variables. Now let us create different plots one by one.
Scatterplot
A scatterplot is mostly used to plot two continuous variables against each other; each observation appears as a point, so the data shows up as points scattered across the plot.
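A minimal sketch of such a scatterplot using the pandas plotting interface. The choice of SepalLengthCm and SepalWidthCm as the two variables is taken from the discussion in this section; load_iris_frame is a hypothetical helper that stands in for reading the article's data file and relies on scikit-learn's bundled copy of iris.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Hypothetical helper: iris from scikit-learn, renamed to the article's columns."""
    bunch = load_iris(as_frame=True)
    frame = bunch.data.copy()
    frame.columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
    frame.insert(0, "Id", range(1, len(frame) + 1))
    frame["Species"] = ["Iris-" + str(bunch.target_names[i]) for i in bunch.target]
    return frame


iris = load_iris_frame()

# One point per flower: sepal length on the x-axis, sepal width on the y-axis.
ax = iris.plot(kind="scatter", x="SepalLengthCm", y="SepalWidthCm")
```

In a script, matplotlib.pyplot.show() displays the figure; in a notebook it renders automatically.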
The scatterplot of SepalLengthCm against SepalWidthCm shows that the highest value of SepalWidthCm occurs at a SepalLengthCm close to 5.70, and that the highest SepalLengthCm values correspond to SepalWidthCm values close to 3.75.
Joint plot
A Joint plot shows bivariate scatterplots and univariate histograms in the same plot. We can also use the seaborn library to create a joint plot.
The points depict the same scenario as described above, while the additional histograms show the count of values along each axis. The peak of each histogram marks the maximum concentration of points; in other words, these are the values of SepalWidthCm and SepalLengthCm that are most frequent in the dataset.
In order to distinguish the species, we will use seaborn’s FacetGrid to color the points by species.
Coloring by species gives us an idea of how SepalWidthCm and SepalLengthCm vary for a particular species. Evidently, Iris-Setosa has the highest values of SepalWidthCm, Iris-Virginica has the highest values of SepalLengthCm, and Iris-Versicolor has moderate values for both variables.
Boxplot
Boxplots are generally used to detect outliers and to relate a categorical variable to a continuous one. They show how a variable is distributed across its quartiles.
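A minimal sketch of a boxplot relating the categorical Species column to the continuous PetalLengthCm column, using seaborn's boxplot. load_iris_frame is a hypothetical helper that stands in for reading the article's data file.

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Hypothetical helper: iris from scikit-learn, renamed to the article's columns."""
    bunch = load_iris(as_frame=True)
    frame = bunch.data.copy()
    frame.columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
    frame.insert(0, "Id", range(1, len(frame) + 1))
    frame["Species"] = ["Iris-" + str(bunch.target_names[i]) for i in bunch.target]
    return frame


iris = load_iris_frame()

# One box per species; the box spans the 25th to 75th percentile of petal length.
ax = sns.boxplot(x="Species", y="PetalLengthCm", data=iris)
```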
Boxplots are best suited to understanding variability. Here, the Iris-Setosa species has the least variability in PetalLengthCm, as its IQR (75th percentile minus 25th percentile) is the smallest. Iris-Virginica has high variability along the y-axis.
Put simply, variability tells us about the spread of the data, or its range of values. For the best assessment we consider only the middle 50 percent (the IQR), as it is not affected by outliers.
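The IQR comparison can also be checked numerically. The sketch below computes the 25th and 75th percentiles of PetalLengthCm per species with pandas and subtracts them; load_iris_frame is a hypothetical helper that stands in for reading the article's data file.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Hypothetical helper: iris from scikit-learn, renamed to the article's columns."""
    bunch = load_iris(as_frame=True)
    frame = bunch.data.copy()
    frame.columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
    frame.insert(0, "Id", range(1, len(frame) + 1))
    frame["Species"] = ["Iris-" + str(bunch.target_names[i]) for i in bunch.target]
    return frame


iris = load_iris_frame()

petal_by_species = iris.groupby("Species")["PetalLengthCm"]
q1 = petal_by_species.quantile(0.25)   # 25th percentile per species
q3 = petal_by_species.quantile(0.75)   # 75th percentile per species
iqr = q3 - q1                          # height of each box in the boxplot

print(iqr.sort_values())               # smallest spread first
```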
In order to see outliers, we will add a layer of points using seaborn’s stripplot.
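A sketch of this layering: the boxplot is drawn first, and stripplot then draws the individual observations on the same axes. The jitter, color and size settings are illustrative assumptions, and load_iris_frame is a hypothetical helper that stands in for reading the article's data file.

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Hypothetical helper: iris from scikit-learn, renamed to the article's columns."""
    bunch = load_iris(as_frame=True)
    frame = bunch.data.copy()
    frame.columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
    frame.insert(0, "Id", range(1, len(frame) + 1))
    frame["Species"] = ["Iris-" + str(bunch.target_names[i]) for i in bunch.target]
    return frame


iris = load_iris_frame()

# Base layer: one box per species.
ax = sns.boxplot(x="Species", y="PetalLengthCm", data=iris)

# Second layer on the same axes: every observation as a point, with a little
# horizontal jitter so that overlapping values remain visible.
ax = sns.stripplot(x="Species", y="PetalLengthCm", data=iris,
                   jitter=True, color="black", size=3, ax=ax)
```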
This variation lets us see the outliers present in the plot: within every species there are some values that lie beyond the whisker lines, in either direction.
Violin plot
A violin plot combines the advantages of the plots above in a simple form. Denser regions are fatter and sparser regions are thinner.
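A minimal sketch of a violin plot of PetalLengthCm per species with seaborn's violinplot, together with the per-species medians that the plot marks. load_iris_frame is a hypothetical helper that stands in for reading the article's data file.

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Hypothetical helper: iris from scikit-learn, renamed to the article's columns."""
    bunch = load_iris(as_frame=True)
    frame = bunch.data.copy()
    frame.columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
    frame.insert(0, "Id", range(1, len(frame) + 1))
    frame["Species"] = ["Iris-" + str(bunch.target_names[i]) for i in bunch.target]
    return frame


iris = load_iris_frame()

# Each violin is a mirrored density estimate with a small boxplot inside it.
ax = sns.violinplot(x="Species", y="PetalLengthCm", data=iris)

# The medians marked inside the violins, computed directly for comparison.
medians = iris.groupby("Species")["PetalLengthCm"].median()
print(medians)
```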
The following image offers a simple way to understand the violin plot.
The violin plot indicates that the median value (the white dot in the middle) is lowest for the Iris-Setosa species and highest for the Iris-Virginica species. The greater width of the Iris-Setosa violin indicates that its values are concentrated in a narrow range, so a value near that range is more probable than for the other species.
Kdeplot
A final seaborn-based plot is the kdeplot, which estimates the kernel density of the underlying feature. Here it estimates the kernel density of PetalLengthCm for each Species.
Pair plot
A pair plot shows the bivariate relation between each pair of features. It shows that the Iris-Setosa species sits apart from the other two species, as its values are somewhat different from theirs. In other words, this species appears to form its own separate cluster.
We can further use the Seaborn library to tweak the plot.
Andrews curves
Andrews curves are available through pandas and use the attributes of each sample as the coefficients of a Fourier series. Samples of the same species produce curves that lie close to each other, so the plot groups them together and easily reveals clusters in the dataset.
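A minimal sketch using pandas.plotting.andrews_curves. Dropping the Id column is an assumption made here so that only the four measurements act as coefficients; load_iris_frame is a hypothetical helper that stands in for reading the article's data file.

```python
import pandas as pd
from pandas.plotting import andrews_curves
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Hypothetical helper: iris from scikit-learn, renamed to the article's columns."""
    bunch = load_iris(as_frame=True)
    frame = bunch.data.copy()
    frame.columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
    frame.insert(0, "Id", range(1, len(frame) + 1))
    frame["Species"] = ["Iris-" + str(bunch.target_names[i]) for i in bunch.target]
    return frame


iris = load_iris_frame()
features = iris.drop(columns="Id")

# Each flower (x1, x2, x3, x4) becomes one curve over t in [-pi, pi]:
#   f(t) = x1 / sqrt(2) + x2 * sin(t) + x3 * cos(t) + x4 * sin(2t)
# Curves are colored by the class column, here Species.
ax = andrews_curves(features, "Species")
```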
Another pandas feature is parallel coordinates, where each feature is placed on its own vertical axis and each sample is drawn as a line connecting its values. Here we group the species even further on the basis of the different variables.
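A minimal sketch using pandas.plotting.parallel_coordinates, again leaving out the Id column as an assumption. load_iris_frame is a hypothetical helper that stands in for reading the article's data file.

```python
import pandas as pd
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Hypothetical helper: iris from scikit-learn, renamed to the article's columns."""
    bunch = load_iris(as_frame=True)
    frame = bunch.data.copy()
    frame.columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
    frame.insert(0, "Id", range(1, len(frame) + 1))
    frame["Species"] = ["Iris-" + str(bunch.target_names[i]) for i in bunch.target]
    return frame


iris = load_iris_frame()
features = iris.drop(columns="Id")

# One vertical axis per measurement; each flower is a line across the four axes,
# colored by its species.
ax = parallel_coordinates(features, "Species")
```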
End Notes
In this article, we looked at how effective visualization in Python can be by drawing different varieties of plots. It is equally important to understand which types of variables suit which plots and how to combine variables effectively for better insights. Visualization of data is a domain in itself, and it is necessary for understanding data fundamentally and communicating results effectively.