Introduction
Data Visualization turns data into images that nearly anyone can understand making them invaluable for explaining the significance of digits to people who are more visually oriented
~Jonsen Carmack
Raw numbers do not always mean much to the people working with them. This is where Data Visualization comes in: it encodes those numbers as images, which makes it far easier to draw meaningful insights from them. It is an essential step in nearly every Data Science process.
Don’t worry if Data Visualization is a new term for you. We’ll talk about Data Visualization in Python throughout this blog.
If you are a beginner in Python, I recommend reading this blog before proceeding further, in case you haven’t :).
Why Python for Data Visualization?
Although many tools are available for Data Visualization, Python has some of the best libraries, and they make visualization practical for almost any dataset, large or small. Several online courses focus solely on Data Visualization with Python, and on Matplotlib in particular, which is very useful for creating and presenting visualizations.
Popular Libraries For Data Visualization in Python:
Some of the most popular Libraries for Python Data Visualizations are:
- Matplotlib
- Seaborn
- Pandas
- Plotly
- and many more
Next, we’ll create different types of Python Visualizations using these libraries.
Types of Python Visualization:
Let us explore different techniques for Python Visualization. We’ll use a Jupyter notebook with Python for all the code.
First, we’ll import the Python Visualization libraries using the following code.
Remember, %matplotlib inline works only in Jupyter notebooks. If you are using another editor, call plt.show() at the end of your plotting commands to have the figure pop up in a separate window.
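As a rough sketch, the imports could look like the block below. The aliases plt, pd, and sns are the usual conventions. The Agg backend line and the tiny test plot are assumptions added here so the snippet also runs as a plain script, and are not part of the original notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this also runs as a script; skip it in Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# In a Jupyter notebook, run this magic once so figures render inline:
# %matplotlib inline

# Quick check that plotting works (illustrative values only).
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])
ax.set_title("Quick check")

# In a script with a GUI backend, remove the matplotlib.use("Agg") line
# and finish with plt.show() to open the figure in a window.
```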
Now, we’ll import the built-in iris dataset from the Seaborn library and use it to create the various Python Visualizations that follow.
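In a notebook, the dataset is usually loaded with sns.load_dataset("iris"), which fetches the data on first use. So that the sketch below runs offline, it assumes a small stand-in DataFrame instead: nine illustrative rows that use the same column names as Seaborn’s iris data.

```python
import pandas as pd

# In the notebook you would normally write:
#   import seaborn as sns
#   iris = sns.load_dataset("iris")
# Assumption: the nine illustrative rows below stand in for that call.
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "sepal_width":  [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
    "petal_length": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "petal_width":  [0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 2.5, 1.9, 2.1],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

print(iris.head())                      # first five rows
species_counts = iris["species"].value_counts()
print(species_counts)                   # rows per species
```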
1.) Scatterplot:
A scatter plot is used to find relationships in bivariate data. It is most commonly used to find correlations between two continuous variables. Here, we’ll draw a scatter plot of Petal Length against Petal Width using Matplotlib.
We can see that the relationship between the two variables is linear and positive.
We used plt.title to add a title to our plot, plt.xlabel to label the x-axis, and plt.ylabel to label the y-axis. There are plenty of other options for adding to or modifying plots; you can refer to the Matplotlib documentation for a complete guide.
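The scatter plot described above might be written roughly as follows. The two columns are a few illustrative iris-style rows (an assumption, so the sketch runs without downloading anything), and the title and label strings are just examples.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for two iris columns.
iris = pd.DataFrame({
    "petal_length": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "petal_width":  [0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 2.5, 1.9, 2.1],
})

plt.figure()
plt.scatter(iris["petal_length"], iris["petal_width"])
plt.title("Petal Length vs Petal Width")
plt.xlabel("Petal Length")
plt.ylabel("Petal Width")
ax = plt.gca()  # keep a handle on the axes for later tweaks
```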
2.) Histogram:
A histogram shows the distribution of a continuous variable. In a univariate analysis, it reveals the frequency distribution of a single variable.
Here we’ll plot a histogram of sepal width to check its frequency distribution.
We observe that the distribution is approximately normal. The bins argument divides the entire range of values into a series of intervals.
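A minimal sketch of such a histogram with plt.hist is shown below. The nine sepal-width values are illustrative stand-ins (an assumption), so the shape will not look as smooth as it does with the full dataset, and bins=5 is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed in Jupyter
import matplotlib.pyplot as plt

# Illustrative sepal-width values.
sepal_width = [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0]

plt.figure()
# bins=5 splits the range from the smallest to the largest value into 5 equal intervals.
counts, bin_edges, _ = plt.hist(sepal_width, bins=5)
plt.xlabel("Sepal Width")
plt.ylabel("Frequency")
```

plt.hist returns the count in each bin and the bin edges, which is handy when you want to check the numbers behind the picture.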
3.) Bar Chart:
A Bar Chart or Bar Plot represents categorical data with vertical or horizontal bars. Seaborn’s bar plot is a general plot that aggregates a numeric variable within each category using some function, by default the mean.
Here we’ll plot a Bar Chart for the three Species with Sepal Length using Seaborn.
Notice that the y-axis shows the mean Sepal Length for the three classes of Species, namely Setosa, Versicolor, and Virginica. Depending on your Seaborn version, the three bars may also be drawn in different colors, one per species.
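The bar chart could be sketched as below, assuming a few illustrative iris-style rows in place of the full dataset. The groupby line is an addition that computes the same per-species means the bars display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative stand-in for two iris columns.
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

plt.figure()
# One bar per species; bar height is the mean sepal length (the default estimator).
ax = sns.barplot(x="species", y="sepal_length", data=iris)

# The same aggregation done by hand, for comparison with the bar heights.
means = iris.groupby("species")["sepal_length"].mean()
print(means)
```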
4.) Pie Chart:
A Pie Chart represents the proportion of each category in categorical data. The whole pie is divided into slices, one for each category.
The three slices in the above chart represent the three categories of species. We used explode to separate the three slices. As in the bar chart, the slices have different colors so that each category is uniquely identified.
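One possible way to draw such a pie chart with plt.pie is sketched here. The species column is an illustrative stand-in with three rows per species, and the explode offsets and autopct format are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative species column.
species = pd.Series(["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3)
counts = species.value_counts()

plt.figure()
wedges, texts, autotexts = plt.pie(
    counts,
    labels=counts.index,
    explode=[0.05, 0.05, 0.05],  # push each slice slightly away from the center
    autopct="%1.1f%%",           # print each slice's share as a percentage
)
plt.title("Share of each species")
```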
5.) Countplot:
A Countplot is similar to a bar plot, except that we pass only the x-axis variable; the y-axis shows the number of occurrences. Each bar represents the count for one category of species.
Here, we’ll plot a Countplot for the three categories of species using Seaborn.
We can observe that the three bars represent the counts for the three categories of species.
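A count plot along these lines might look like the sketch below. It assumes an illustrative species column with three rows per category, so every bar has height 3.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative species column.
iris = pd.DataFrame({
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

plt.figure()
# Only x is passed; seaborn counts the rows in each category for the y-axis.
ax = sns.countplot(x="species", data=iris)
bar_heights = [patch.get_height() for patch in ax.patches]
```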
6.) Boxplot:
A Boxplot shows the distribution of a variable. It is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum.
Here, we’ll plot a Boxplot to check the distribution of Sepal Length.
A box plot also shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable.
Here, we’ll plot a Boxplot to compare the distribution of Sepal Length for each level of Species.
We can also plot a Boxplot for the entire dataset with horizontal orientation.
We can observe that each plot summarizes the distribution through its quartiles: the box spans the first to the third quartile, and the line inside it marks the median. The whiskers extend to the smallest and largest values that are not flagged as outliers, while the dots beyond the whiskers represent outliers.
7.) Heatmap:
A Heatmap is a type of matrix plot that displays data as a color-encoded matrix. It is often used to spot multicollinearity in a dataset.
To plot a heatmap, your data should already be in matrix form; the heatmap basically just colors it in for you.
Here, we’ll plot a heatmap to find the correlation between the variables of the iris dataset. First, we’ll create a correlation matrix for the iris dataset.
Now, we’ll plot the heatmap for the above correlation matrix.
Here, we can observe that the correlations are shown as a color-coded matrix. Correlation values range from -1 to 1. cmap changes the color map, and annot displays the correlation values in the plot.
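Both steps, building the correlation matrix and coloring it in, might be sketched as below. The data are nine illustrative iris-style rows, the "coolwarm" color map is an arbitrary choice, and the species column is left out of the stand-in because correlations are computed on numeric columns only.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative stand-in for the numeric iris columns.
iris_numeric = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "sepal_width":  [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
    "petal_length": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "petal_width":  [0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 2.5, 1.9, 2.1],
})

# Step 1: pairwise Pearson correlations as a 4x4 matrix.
corr = iris_numeric.corr()

# Step 2: color the matrix; vmin/vmax pin the color scale to the full -1..1 range.
plt.figure()
ax = sns.heatmap(corr, vmin=-1, vmax=1, cmap="coolwarm", annot=True)
```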
8.) Distplot:
The Distplot shows the distribution of univariate data. Note that recent Seaborn releases deprecate sns.distplot in favor of sns.histplot and sns.displot.
Here, we’ll use Distplot to check distribution for Sepal Width.
We can observe that the distribution is approximately normal. To remove the density curve, we can pass kde=False.
9.) Jointplot:
A Jointplot shows the relationship between two variables together with the distribution of each one. To be more specific, Jointplot basically lets you match up two Distplots for bivariate data.
Here, we’ll plot a Jointplot for petal length and sepal length.
Grids, Style, and Color
Grids are general types of plots that let you map plot types to the rows and columns of a grid. This helps you create similar plots separated by features.
First, we’ll create a subplot grid for plotting pairwise relationships in a dataset using PairGrid. Then we’ll map the pairwise relationships onto that grid.
Here, sns.PairGrid() creates a pairwise grid of the variables in a dataset, and the map function draws the relationship between each pair of variables onto that grid.
We can also use map_upper, map_lower, and map_diag to map different types of plots onto the upper, lower, and diagonal cells.
Now, we’ll briefly see how to control figure aesthetics in Seaborn.
We’ll see how to change the grid style or color using Seaborn.
There are five preset Seaborn themes: darkgrid, whitegrid, dark, white, and ticks. darkgrid is the default for Seaborn. For all the plots above, we used the whitegrid style. You can set the defaults using sns.set().
We can also change the grid style in Seaborn using sns.set_style().
Try the different grid options available in Seaborn and notice how the grid style changes.
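Switching styles could look roughly like the sketch below. The species column is an illustrative stand-in, and the axes_style() lookups are an addition that shows how to check whether a given style draws grid lines.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative species column.
iris = pd.DataFrame({
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

# Pick one of: "darkgrid", "whitegrid", "dark", "white", "ticks".
sns.set_style("whitegrid")
plt.figure()
ax = sns.countplot(x="species", data=iris)

# Inspect the style settings: the active style, and a named style without applying it.
grid_enabled = sns.axes_style()["axes.grid"]
ticks_has_grid = sns.axes_style("ticks")["axes.grid"]
```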
sns.despine() removes the borders from the top and right sides of the figure. We can also remove the left and bottom borders by passing left=True and bottom=True.
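A small sketch of both calls, side by side, is given below. The plotted values are illustrative, and passing ax= to target one subplot at a time is a choice made here so the two results can be compared in a single figure.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed in Jupyter
import matplotlib.pyplot as plt
import seaborn as sns

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot([1, 2, 3], [1, 4, 9])
ax2.plot([1, 2, 3], [1, 4, 9])

# Default: remove only the top and right borders (spines).
sns.despine(ax=ax1)

# Remove the left and bottom borders as well.
sns.despine(ax=ax2, left=True, bottom=True)
```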
We can use Matplotlib’s plt.figure(figsize=(width, height)) to change the size of most Seaborn plots. For some Seaborn plots, we can also control the size and aspect ratio by passing the height (called size in older Seaborn releases) and aspect parameters.
Now, let’s have a look at an example.
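A minimal sketch of resizing a plot with plt.figure follows. The 12 by 3 inch size and the illustrative species column are assumptions chosen only to make the change easy to see.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative species column.
iris = pd.DataFrame({
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

# figsize is (width, height) in inches; the seaborn plot draws into this figure.
fig = plt.figure(figsize=(12, 3))
ax = sns.countplot(x="species", data=iris)
width, height = fig.get_size_inches()
```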
We can see that the width and height of the plot have changed according to the parameters passed. For some plots, we can also pass these parameters inside the Seaborn plotting call itself.
For example:
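One such plot is sns.lmplot, sketched here under the assumption of Seaborn 0.9 or newer, where the parameter is named height. The data are illustrative iris-style rows, and ci=None simply skips the confidence band. The figure ends up height * aspect inches wide and height inches tall.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative stand-in for two iris columns.
iris = pd.DataFrame({
    "petal_length": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "petal_width":  [0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 2.5, 1.9, 2.1],
})

# height is the facet height in inches; aspect is width divided by height.
g = sns.lmplot(x="petal_length", y="petal_width", data=iris,
               height=2, aspect=4, ci=None)
width, height = g.axes[0, 0].figure.get_size_inches()
```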
The set_context() allows you to override default parameters in order to scale the plot:
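As a rough sketch, switching to the "poster" context scales fonts and lines up for large displays. The font_scale value and the illustrative species column are assumptions, and the plotting_context() lookups are added only to show the effect on the base font size.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative species column.
iris = pd.DataFrame({
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

# Font size defined by the "notebook" context, looked up without applying it.
notebook_size = sns.plotting_context("notebook")["font.size"]

# Preset contexts: "paper", "notebook", "talk", "poster".
sns.set_context("poster", font_scale=1.2)
poster_size = sns.plotting_context()["font.size"]  # font size now in effect

plt.figure(figsize=(8, 5))
ax = sns.countplot(x="species", data=iris)
```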
Conclusions:
We have now covered most of the basics of Python Visualization using Seaborn and Matplotlib. I hope this article gives you a head start for diving into Python Visualization. You can also refer to the official Matplotlib and Seaborn documentation for further reference and a deeper understanding.