
Document Classification Using Python and Machine Learning


Document classification, or document categorization, is a problem in information science and computer science: assigning a document to one or more classes or categories. This can be done either manually or with algorithms.

Manual classification, also called intellectual classification, has been used mostly in library science, while algorithmic classification is used in information and computer science. The problems the two approaches solve differ but overlap, which is why there is interdisciplinary research on document classification.

Why is Document Classification Useful and Why do We Employ it?

Classification can help an organisation to meet legal and regulatory requirements for retrieving specific information in a set timeframe, and this is often the motivation behind implementing data classification.

However, data strategies differ greatly from one organisation to the next, as each generates different types and volumes of data.

Andy Whitton, a partner in Deloitte’s data practice, says:

“Full data classification can be a very expensive activity that very few organisations do well. Certified database technologies can tag every data item but, in our experience, only governments do this because of the cost implications.”

Instead, Whitton says, companies need to choose certain types of data to classify, such as account data, personal data, or commercially valuable data. He adds that the starting point for most companies is to classify data in line with their confidentiality requirements, adding more security for increasingly confidential data.

“If it goes wrong, this could be the most externally damaging – and internally sensitive. For example, everyone is very protective over salary data,” says Whitton.

Types of Document Classification and Techniques

  1. Supervised Document Classification
  2. Unsupervised Document Classification

(i) Supervised Document Classification:

In supervised classification, an external mechanism (such as human feedback) provides correct information on the classification of documents.

[Figure: Supervised Classification]

(ii) Unsupervised Document Classification:

Unsupervised document classification, also called document clustering, is classification done entirely without reference to external information. Document clustering involves the use of descriptors and descriptor extraction: descriptors are sets of words that describe the contents within a cluster. Document clustering is generally considered to be a centralized process. Examples of document clustering include web document clustering for search users.

[Figure: Supervised vs Unsupervised Classification]

Applications of Document Clustering

The applications of document clustering can be categorized into two types, online and offline. Online applications are usually constrained by efficiency problems when compared to offline applications. Text clustering may be used for different tasks, such as grouping similar documents (news, tweets, etc.), analysing customer or employee feedback, and discovering meaningful implicit topics across all documents.

Algorithms

In general, there are two common algorithms.

(i) The first is the hierarchical algorithm, which includes single linkage, complete linkage, group average, and Ward’s method. By aggregating or dividing, documents can be clustered into a hierarchical structure, which is suitable for browsing. However, such an algorithm usually suffers from efficiency problems.

(ii) The other algorithm is developed using the K-means algorithm and its variants. Generally, hierarchical algorithms produce more in-depth information for detailed analyses, while algorithms based around variants of the K-Means algorithm are more efficient and provide sufficient information for most purposes. These algorithms can further be classified as hard or soft clustering algorithms.

Hard clustering computes a hard assignment: each document is a member of exactly one cluster. The assignment of soft clustering algorithms is soft: a document’s assignment is a distribution over all clusters, so a document has fractional membership in several clusters. Dimensionality reduction methods can be considered a subtype of soft clustering; for documents, these include latent semantic indexing (truncated singular value decomposition on term histograms) and topic models.
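
To make the soft-membership idea concrete, here is a minimal sketch of latent semantic indexing with scikit-learn's TruncatedSVD; the three toy documents are made up for illustration, and each document ends up with fractional weights over a small number of latent topics rather than a single cluster label:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (hypothetical) turned into tf-idf vectors
docs = ["the team won the football match",
        "parliament passed the new budget",
        "the striker scored a late goal"]
tfidf = TfidfVectorizer().fit_transform(docs)

# Truncated SVD on the tf-idf matrix is latent semantic indexing;
# each row holds a document's fractional weights over the latent topics
lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
print(lsi)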

In practice, document clustering often takes the following steps:

1. Tokenization

Tokenization is the process of parsing text data into smaller units (tokens) such as words and phrases.
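
As a minimal sketch, NLTK's word_tokenize can do this (the sample sentence is made up, and the 'punkt' tokenizer models need a one-time download):

import nltk
nltk.download('punkt')                      # one-time download of the tokenizer models
from nltk.tokenize import word_tokenize

print(word_tokenize("Document clustering groups similar documents together."))
# ['Document', 'clustering', 'groups', 'similar', 'documents', 'together', '.']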

2. Stemming  and Lemmatization

Different tokens might carry similar information (e.g. tokenization and tokenizing). You can avoid computing similar information repeatedly by reducing all tokens to their base form using stemming and lemmatization dictionaries.
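
A small sketch with NLTK's Porter stemmer and WordNet lemmatizer (the example words are illustrative; 'wordnet' needs a one-time download):

import nltk
nltk.download('wordnet')                    # one-time download for the lemmatizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# "tokenization" and "tokenizing" collapse to the same base form
print(stemmer.stem("tokenization"), stemmer.stem("tokenizing"))   # token token
print(lemmatizer.lemmatize("documents"))                          # document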

3. Removing Stop Words and Punctuation

Some tokens are less important than others. For instance, common words such as “the” might not be very helpful for revealing the essential characteristics of a text. So usually it is a good idea to eliminate stop words and punctuation marks before doing further analysis.
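
For example (a small sketch; the token list is made up and the rules can be adapted to your data):

import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = ['the', 'striker', 'scored', 'a', 'late', 'goal', '.']

# keep tokens that are neither stop words nor punctuation
filtered = [t for t in tokens if t.lower() not in stop_words and t not in string.punctuation]
print(filtered)   # ['striker', 'scored', 'late', 'goal']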

4. Computing term frequencies or tf-idf

After pre-processing the text data, you can then proceed to generate features. For document clustering, one of the most common ways to generate features for a document is to calculate the term frequencies of all its tokens. Although not perfect, these frequencies can usually provide some clues about the topic of the document.
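
A minimal sketch with scikit-learn's TfidfVectorizer (the documents are made up; get_feature_names_out assumes scikit-learn 1.0 or newer):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the team won the football match",
        "parliament passed the new budget",
        "the striker scored a late goal"]

vectorizer = TfidfVectorizer(stop_words='english')
features = vectorizer.fit_transform(docs)        # sparse matrix: documents x terms
print(features.shape)
print(vectorizer.get_feature_names_out())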

5. Clustering

You can then cluster different documents based on the features that have been generated.
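
For instance, a hard clustering of tf-idf features with K-means might look like this (a sketch with made-up documents and an arbitrary choice of two clusters):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the team won the football match",
        "the striker scored a late goal",
        "the election results were announced",
        "parliament passed the new budget"]

features = TfidfVectorizer(stop_words='english').fit_transform(docs)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(features)
print(labels)    # one cluster id per document, e.g. [0 0 1 1]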

6. Evaluation and Visualization

Finally, the clustering models can be assessed by various metrics, and it is sometimes helpful to visualize the results by plotting the clusters in a low-dimensional (for example, two-dimensional) space.
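
A small sketch of both ideas, reusing the toy documents and K-means clustering from above: the silhouette score summarises cluster separation, and a two-dimensional projection (here via TruncatedSVD, since the tf-idf matrix is sparse) lets you plot the clusters:

from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

docs = ["the team won the football match",
        "the striker scored a late goal",
        "the election results were announced",
        "parliament passed the new budget"]

features = TfidfVectorizer(stop_words='english').fit_transform(docs)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(features)
print(silhouette_score(features, labels))           # closer to 1 means better-separated clusters

points = TruncatedSVD(n_components=2, random_state=0).fit_transform(features)
plt.scatter(points[:, 0], points[:, 1], c=labels)   # project documents to 2-D, colour by cluster
plt.show()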

Automatic Document Classification Techniques Include:

  • Expectation maximization (EM)
  • Naive Bayes classifier
  • Instantaneously trained neural networks
  • Latent semantic indexing
  • Support vector machines (SVM)
  • Artificial neural network
  • K-nearest neighbour algorithms
  • Decision trees such as ID3 or C4.5
  • Concept Mining
  • Rough set-based classifier
  • Soft set-based classifier
  • Multiple-instance learning
  • Natural language processing approaches

[Figure: Overview of a Document Classification Application]

Document Classification Using Python

Text classification is one of the most important tasks in Natural Language Processing. It is the process of classifying text strings or documents into different categories, depending upon the contents of the strings.

Text classification has a variety of applications, such as detecting user sentiment from a tweet, classifying an email as spam or ham, classifying blog posts into different categories, automatic tagging of customer queries, and so on.

Here, Python and scikit-learn will be used to analyse the problem, in this case sentiment analysis. Scikit-learn is a Python library used for machine learning and data analysis.

Following are the steps required to create a text classification model in Python:

  1. Importing Libraries
  2. Importing The dataset
  3. Text Preprocessing
  4. Converting Text to Numbers
  5. Training and Test Sets
  6. Training Text Classification Model and Predicting Sentiment
  7. Evaluating The Model
  8. Saving and Loading the Model

Here, I will perform a series of steps required to predict sentiments from reviews of different movies. These steps can be used for any text classification task. Moreover, I will use Python’s Scikit-Learn library for machine learning to train a text classification model.

(i) Importing Libraries

Execute the following script to import the required libraries:

import numpy as np  
import re  
import nltk  
from sklearn.datasets import load_files  
nltk.download('stopwords')  
import pickle  
from nltk.corpus import stopwords

(ii) Importing the Dataset

I will use the load_files function from the sklearn.datasets module to import the dataset into the application. Execute the following script to see the load_files function in action:

movie_data = load_files(r"D:\txt_sentoken")   # each sub-folder of txt_sentoken is treated as a class
X, y = movie_data.data, movie_data.target     # raw documents and their numeric labels

(iii) Text Preprocessing

Once the dataset has been imported, the next step is to preprocess the text. Text may contain numbers, special characters, and unwanted spaces. Depending upon the problem you face, you may or may not need to remove these special characters and numbers from text.
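
As a rough sketch (the exact rules are up to you; this assumes X holds the raw byte strings returned by load_files above), you could lowercase the text and strip out everything except letters:

import re

def clean_document(text):
    text = re.sub(r'[^a-zA-Z]', ' ', text)    # replace numbers and special characters with spaces
    text = re.sub(r'\s+', ' ', text)          # collapse repeated whitespace
    return text.strip().lower()

documents = [clean_document(doc.decode('utf-8', errors='ignore')) for doc in X]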

(iv) Converting Text to Numbers

Machines, unlike humans, cannot understand raw text; they can only see numbers. In particular, statistical techniques such as machine learning can only deal with numbers. Therefore, you need to convert the text into numerical features.
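
One way to do this is with scikit-learn's TfidfVectorizer applied to the cleaned documents from the previous step; the parameter values below (at most 1500 terms, ignoring very rare and very common words) are illustrative rather than prescriptive:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7,
                             stop_words=stopwords.words('english'))

# overwrite X with the numeric feature matrix so the train/test split below works on numbers
X = vectorizer.fit_transform(documents).toarray()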

(v) Training and Testing Sets

Like any other supervised machine learning problem, you need to divide the data into training and testing sets. To do so, I will use the train_test_split utility from the sklearn.model_selection library. Execute the following script:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)   # hold out 20% for testing

(vi) Training Text Classification Model and Predicting Sentiment

Take a look at the following script:

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
classifier.fit(X_train, y_train)

Finally, to predict the sentiment for the documents in the test set you can use the predict method of the RandomForestClassifier class as shown below:

y_pred = classifier.predict(X_test)

(vii) Evaluating the Model

To evaluate the performance of a classification model such as the one that you just trained, you can use metrics such as the confusion matrix, F1 score, and accuracy.
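
For example, scikit-learn's metrics module provides ready-made implementations of all three:

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))    # per-class precision, recall and F1
print(accuracy_score(y_test, y_pred))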

(viii) Saving and Loading the Model

You can save your model as a pickle object in Python. To do so, execute the following script:

with open('text_classifier', 'wb') as picklefile:
    pickle.dump(classifier, picklefile)

Once you execute the above script, you can see the text_classifier file in your working directory. This means that you have saved the trained model and can use it later for directly making predictions, without training.

To load the model, you can use the following code:

with open('text_classifier', 'rb') as training_model:  
    model = pickle.load(training_model)  
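
The reloaded object behaves just like the original classifier, so you can call predict on it directly:

y_pred2 = model.predict(X_test)   # same predictions as the classifier saved earlier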

Document Classification Machine Learning

Text documents are one of the richest sources of data for businesses: whether in the shape of customer support tickets, emails, technical documents, user reviews or news articles. They all contain valuable information that can be used to automate slow manual processes, better understand users, or find valuable insights.

However, traditional algorithms struggle at processing these unstructured documents, and this is where machine learning comes to the rescue!

Here, I will show how off-the-shelf ML tools can be used to automatically label news articles. The approach I’ll describe can be used in any task related to processing text documents, and even to other types of ML tasks.

You will also learn how data can be extracted and pre-processed, how to make some initial observations about it, how to build ML models and, last but not least, how to evaluate and interpret them.

Data Extraction & Exploration

Loading Data

Data is an essential resource for any ML project.

Data Analysis

Before diving head-first into training machine learning models, you should become familiar with the structure and characteristics of the dataset: these properties might inform the problem-solving approach. First, it’s always useful to look at the number of documents per class:

[Figure: Number of articles per category. Source: Google Cloud]

Here, you will see that the number of articles per class is roughly balanced, which is helpful! If the dataset were imbalanced, you would need to carefully configure the model or artificially balance the dataset, for example by undersampling or oversampling each class.

To further analyze the dataset, you need to transform each article’s text into a feature vector, a list of numerical values representing some of the text’s characteristics.

This is because most ML models cannot process raw text, instead only dealing with numerical values.

One common approach for extracting features from text is to use the bag-of-words model: a model where, for each document (an article, in this case), the presence and often the frequency of words is taken into consideration, but the order in which they occur is ignored.

Specifically, for each term in the dataset, a measure called Term Frequency, Inverse Document Frequency (tf-idf) will be calculated. This statistic represents a word’s importance in each document. A word’s frequency is used as a proxy for its importance: if “football” is mentioned 25 times in a document, it is probably more important than if it were mentioned only once.
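
As a toy illustration of the weighting (all numbers here are hypothetical), the tf-idf score grows with a term's frequency in the document and shrinks as the term appears in more documents across the corpus:

from math import log

tf = 25       # "football" appears 25 times in this document
N = 1000      # total number of documents in the corpus
df = 40       # number of documents that contain "football"

tfidf = tf * log(N / df)   # high when the term is frequent here but rare elsewhere
print(round(tfidf, 2))     # 80.47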

[Figure: Term Frequency, Inverse Document Frequency (tf-idf). Source: Google Cloud]

Model Training and Evaluation

With this initial data exploration done, you are now more familiar with the way the data is represented and relatively confident that machine learning is a good fit for the classification problem. You are now ready to experiment with different machine learning models, evaluate their accuracy, and tweak the model to avoid any potential issues. Scikit-learn provides implementations for a large number of machine learning models, spanning a few different families:

(i) Linear Models: Linear Regression, Logistic Regression, …

(ii) Ensemble Models: Random Forest, Gradient Boosting Trees, AdaBoost, …

(iii) Bayesian Models: (Multinomial/Gaussian/…) Naive Bayes, Gaussian Processes, …

(iv) Others: Support Vector Machines, k-Nearest Neighbours, and various other models

It is common practice to split the data into three parts:

  • A training set that the model will be trained on.
  • A validation set used for finding the optimal parameters (as discussed previously).
  • A test set to evaluate the model’s performance.

Since a hyperparameter search is not being performed, only a train/test split will be used. To evaluate each model, we will use the K-fold cross-validation technique: iteratively training the model on different subsets of the data, and testing against the held-out data.
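
A minimal sketch of this evaluation with scikit-learn's cross_val_score, assuming X_features is the tf-idf matrix and y the article labels (both hypothetical names here), and trying two of the model families listed above:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

for model in (LogisticRegression(max_iter=1000), MultinomialNB()):
    scores = cross_val_score(model, X_features, y, cv=5)       # five validation folds
    print(type(model).__name__, round(scores.mean(), 3))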

Next, the models are evaluated using this technique (with five validation folds) to obtain the following results:

[Figure: Model evaluation results. Source: Google Cloud]

Model Interpretation

It’s not enough to have a model that performs well according to a given metric: you must also have a model that you can understand and whose results can be explained. Start by training the model on part of the dataset, and then analyze the main sources of misclassification on the test set. One way to identify sources of error is to look at the confusion matrix, a matrix that shows the discrepancies between predicted and actual labels.
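
For instance, assuming you already have the true labels y_test and the model's predictions y_pred (hypothetical names), scikit-learn 1.0+ can plot the confusion matrix directly:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(y_test, y_pred)   # rows: actual labels, columns: predicted
plt.show()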

[Figure: Confusion matrix. Source: Google Cloud]

Using off-the-shelf tools and simple models, you solved a complex task, document classification, which might have seemed daunting at first! To do so, follow these steps:

  1. Load and pre-process data.
  2. Analyze patterns in the data, to gain insights.
  3. Train different models, and rigorously evaluate each of them.
  4. Interpret the trained model.

Final Thoughts

Document classification is exceptionally important when it comes to retrieving specific information in a set timeframe. It is a worthwhile tool that can reduce the cost and time of searching for and retrieving the information that matters. Hopefully, this article has given you everything you need to get started. To build a promising career in Data Science, join the Data Science Master Course.

Muneeb Ahmad is currently studying Computer Science & Engineering at Islamic University of Science & Technology, Kashmir. He is keen to learn new things and technologies and is solution driven. The main areas he likes to work in are Machine Learning, Data Science, AI, and Big Data. He also knows Android and web development. He is a hardcore Manchester United fan and likes to read books on diverse topics.
