How to Read File in Python
If you are working as a data analyst or in any field of information technology, you always have challenges in front of you about execution of different data types. In this extended field of data types you have a lot to learn yet so that you could move in the industry with ease. You would have to learn the things like how to read file in Python. Different systems have different formats and different compressions.
People working in data industry usually get a complex data that is not easy to understand. They hardly get any neat data in a tabular form. SO, it is very important for any data analyst o know the different types of file formats, challenges in reading different files and different techniques to handle them in real world. This article is going to cover all the points you need to know about how to read file in Python or how to create a file in Python.
Before getting started to know file read in Python, you should know about the file format. A file format is a method with the help of which information in the files is stored. It also lets you know about the file type i.e. it is binary or ASCII file.
If you want to identify the file format, simply see the extension of the file and you will come to know the file format of the file. This will help to a great extent to read file in Python also.
Some of the file formats used in Python is:
- CSV (Comma Separated Value)
- Txt (Plain Text)
- ZIP File
- Hierarchical Data Format
Different ways of how to read file in Python with different file formats:
1.) Comma Separated Values (CSV)
Comma Separated Values are the files with extension CSV. These files are created in spreadsheets in the form of cells. The cells here are organized by rows and columns. The spreadsheet can be of any type whether its string type, a data type or an integer type. Commonly used spreadsheet file formats are csv, xls and xlsx.
To read csv file in Python, you need to know that a CSV file is represented as an observation known as a record. A record includes one or more fields separated by a comma. In order to read csv file in Python, you should use pandas library from the different libraries used in Python. In order to grab data from a CSV file, one should apply the reader function to create a reader object. The reader function helps to get every single line of the file and create a list of all columns. After creating the list, select the column that need to be changed into variable data. This is how you can read csv file in Python.
XLSX file format as we discussed further is a XML-based spreadsheet file format from Microsoft Excel. It was initiated by Microsoft Office 2007. As in CSV file format, the data is organized in XLSX in the form of cells and columns in a spreadsheet. The file can have more than one sheet in a single workbook. You can add multiple sheets in a workbook.
Now the file with XLSX format can be read in easily in Excel but how to read file in Python with XLSX file format. Let us study about how to read file in Python. For reading file in python, you need to use Pandas library.
df = pd.read_excel(“management.xlsx”, sheetname = “Management – B”)
This code will load “Management – B” sheet from “management.xlsx” file in DataFrame df and you will be able to read file in Python.
(b.) TXT file format
In txt file format, the text is written in plain format. The text in txt format is basically unstructured and is not associated with any meta-data. It can be read with an ease by all the programs. The computer program finds it difficult to interpret it.
Here’s the example of plain text file.
In order to read the above file named “plain text file.txt” you should use this code:
(c.) ZIP Files
ZIP file format is the format of an archive file. An Archive file is the compilation of multiple files with different metadata. You can collect multiple files into a single file just by compressing the files to make them occupy less space for storage.
Different compression algorithms are used for compressing the files. A ZIP file can be identified by the .zip extension.
ZIP file read in Python
Reading zip file can be done by importing a package called “zipfile”. The Python code used to read the “management.csv” file present in the “M.zip” in Python.
(d.) JSON file format
JSON or Java Script Object Notation is designed to exchange the data on the internet. It is a text-based file. Being a language independent data format, any programming language can read JSON file format and is highly used for transferring ordered data on the web.
Let us throw some light on how to read file in Python if it is in JSON file format. In this case also, you need to use pandas library from Python so as to load data from JSON file.
The code to use is
df = pd.read_json(“/home/love/downloads/management_files/management.json”)
XML or Extensible Markup Language is a markup language. Some definite rules are to be used for encoding data. XML file format can be read by human as well as machines. XML is an expressive language that is designed to send information on the web. It is much comparable to HTML with little differences in it. One of the difference in XML and HTML is predefined tags are not used in XML but are used in HTML.
The XML file always initiates with “<?xml version=”1.0”?>”.
If you are looking for a way of how to read file in Python if it is of XML file format, here you go –
import xml.etree.ElementTree as ET
tree = ET.parse('/home/sunilray/Desktop/2 sigma/management.xml')
root = tree.getroot()
3.) HTML File Format
HTML or Hyper Text Markup Language is a standard markup language commonly used for web page creation. The structure of the web pages can be described using this language. HTML and XML have same tags but in case of XML the tags are predefined. HTML document can be recognized by tags like <head> identifies the heading HTML file and <p> identifies with the paragraph. HTML is not case sensitive and you can use both upper case and lower case while using tags.
Let us see an example of HTML file.
<title>Title of the page</title>
<p>Beginning of the paragraph</p>
The tags in HTML document are enclosed by angular brackets. The tag <!DOCTYPE html> represents that the file is an HTML file. The root tag is <html> in the document. The <head> tag is for heading of the document, the <title> tag for title, <body> tag for body of the page, <h1> for sub-heading and <p> for paragraph of the document.
Let me show you how to read file in Python (HTML File).
Python contains several libraries one of from which is BeautifulSoup library. Using this library, you can read file in Python.
PDF or Portable Document Format is a very helpful file for analysis and representation of text documents that also includes incorporated graphics. It has a special characteristic of applying security password for security purposes.
To read a PDF file, you need to use the library PDFMiner. In order to read a PDF file in Python, you have to download the PDFMiner and install it. After installation process simply use the code below:
5.) DOCX file
DOCX file format is the format for word document used for data based on text. The features of word document includes inline insertion of tables, pictures and hyperlinks that helps in making the word document an important file with an important file format i.e. docx.
You can edit or modify the word document at any time you want unlike PDF file. Its format can also be changed to another file format.
If you wish to know about how to read file in Python (DOCX file), you should use the library python-docx2txt. This library can be installed via pip using the code
pip install docx2txt
Using the below code, you can easily read a docx file in Python.
text = docx2txt.process(“file.docx”)
MP3 or multimedia file format is one of the complex file formats in Python. Any kind of data whether its text image, video, audio or graphics can be stored in MP3 format and the text data can be stored in Rich Text Format (RTF) and not in ASCII data.
MP3 file format allows the audio to compress the data to reduce the size of the file. It helps in saving a lot of space in the drive.
The format of MP3 files are created by various frames. Each frame is partitioned into a header and a data block. The series of frames are known to be elementary stream.
How to read a MP3 file in Python
A library named PyMedia in Python is used to read or manipulate multimedia files in Python.
MP4 file format is usually found in storing videos and movies. Multiple images are contained in the videos that are called by the name of frames. Each image is played in the form of a video for a definite time.
How to read file in Python (MP4 file)
MP4 can be read and edited by using community built library known as MoviePy. The library can be installed from the internet and can be used to open a file in Python and read it using the below code:
From moviepy.editor import VideoFileClip
clip = VideoFileClip(‘<video_file>.mp4’)
Image files are quite an interesting file formats in data analytics. Image processing s required to nake any vision application in computer. You must be aware of different file formats used for images.
Usually the image files are 3-dimensional with RGB values. They also come in 2-Dimensional i.e. grayscale image or 4-Dimensional i.e. of high intensity.
The image contains frames of pixels. The frame is created by 2-dimensional array of pixel values. Pixel values have different intensities.
To read the image, you have to first check the type of the image and shape. Use the code below to check the type and shape of the image.
type(f) , f.shape
numpy.ndarray, (768, 1024, 3)
8.) Hierarchical Data Format
HDF or Hierarchical Data Format is the best and easy option for you if you want to store huge amount of data. It does not only store high volumes of data but also used to store simple and small volumes of data.
The most useful thing that HDF has is it is flexible and has efficient data storage along with high I/Q. It is also supported by several formats. Different HDF formats are available in the field of data industry but HDF5 is the version that is mostly used by the data analysts. Its features are similar to that of XML.
How to read HDF file format?
In order to read HDF file format you need to use pandas library in Python. The code to be used for reading the HDF file format is:
m = pd.read_hdf(‘management.h5’)
where management.h5 is the file to be loaded into the HDF file format “m”.
The above discussed are the file formats used frequently in data science and the ways to read those files. These codes can be better understood with the practical application of the codes used in the above article. The coding needs practice. Once you start practicing, you can excel in the field of data industry in no time. You can also achieve excellence in Python Data Science Course as well while using reading these articles.