Document classification, or document categorization, is a problem in information science and computer science: assigning a document to one or more classes or categories. This can be done manually or algorithmically.
Manual classification, also called intellectual classification, has been used mostly in library science, whereas algorithmic classification is used in information and computer science. The problems each approach addresses differ but overlap, which is why document classification is the subject of interdisciplinary research.
Why is Document Classification Useful and Why do We Employ it?
Classification can help an organization to meet legal and regulatory requirements for retrieving specific information in a set timeframe, and this is often the motivation behind implementing data classification.
However, data strategies differ greatly from one organization to the next, as each generates different types and volumes of data.
Andy Whitton, a partner in Deloitte’s data practice, says:
“Full data classification can be a very expensive activity that very few organisations do well. Certified database technologies can tag every data item but, in our experience, only governments do this because of the cost implications.”
Instead, Whitton says, companies need to choose certain types of data to classify, such as account data, personal data, or commercially valuable data. He adds that the start point for most companies is to classify data in line with their confidentiality requirements, adding more security for increasingly confidential data.
“If it goes wrong, this could be the most externally damaging – and internally sensitive. For example, everyone is very protective over salary data,” says Whitton.
Types of Document Classification and Techniques
- Supervised Document Classification
- Unsupervised Document Classification
(i) Supervised Document Classification:
In supervised classification, an external mechanism (such as human feedback) provides correct information on the classification of documents.
(ii) Unsupervised Document Classification:
Unsupervised document classification, also called document clustering, must be done entirely without reference to external information. Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents within the cluster. Document clustering is generally considered to be a centralized process. Examples of document clustering include web document clustering for search users.
Applications of Document Clustering
The application of document clustering can be categorized into two types, online and offline. Online applications are usually constrained by efficiency problems when compared to offline applications. Text clustering may be used for different tasks, such as grouping similar documents (news, tweets, etc.) and the analysis of customer/employee feedback, discovering meaningful implicit subjects across all documents.
Algorithms
In general, there are two common families of algorithms.
(i) The first is hierarchical clustering, which includes single linkage, complete linkage, group average and Ward’s method. By aggregating or dividing, documents can be clustered into a hierarchical structure, which is suitable for browsing. However, such algorithms usually suffer from efficiency problems.
(ii) The second is K-means and its variants. Generally, hierarchical algorithms produce more in-depth information for detailed analyses, while algorithms based on variants of K-means are more efficient and provide sufficient information for most purposes. These algorithms can further be classified as hard or soft clustering algorithms.
Hard clustering computes a hard assignment – each document is a member of exactly one cluster. The assignment of soft clustering algorithms is soft – a document’s assignment is a distribution over all clusters. In a soft assignment, a document has fractional membership in several clusters. Dimensionality reduction methods can be considered a subtype of soft clustering; for documents, these include latent semantic indexing (truncated singular value decomposition on term histograms) and topic models.
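To make the hard/soft distinction concrete, here is a minimal sketch on made-up 2-D vectors. It assumes scikit-learn is installed. Using KMeans for the hard assignment and GaussianMixture for the soft one is an illustrative choice, not something the text above prescribes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Toy 2-D "document vectors" (made-up numbers) forming two loose groups.
points = np.array([
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
    [3.0, 3.1], [3.2, 2.9], [2.9, 3.3],
])

# Hard clustering: every point receives exactly one cluster label.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
hard_labels = kmeans.labels_

# Soft clustering: every point receives a probability for each cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(points)
soft_memberships = gmm.predict_proba(points)

print(hard_labels)                 # one integer per point
print(soft_memberships.round(3))   # one row per point, each row sums to 1
```

Each row of soft_memberships is a distribution over the clusters, whereas hard_labels keeps only a single cluster index per point.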
In practice, document clustering often takes the following steps:
1. Tokenization
Tokenization is the process of parsing text data into smaller units (tokens) such as words and phrases.
2. Stemming and Lemmatization
Different tokens might carry similar information (e.g. tokenization and tokenizing). You can avoid calculating similar information repeatedly by reducing all tokens to their base form using various stemming and lemmatization dictionaries.
3. Removing Stop Words and Punctuation
Some tokens are less important than others. For instance, common words such as “the” might not be very helpful for revealing the essential characteristics of a text. So usually it is a good idea to eliminate stop words and punctuation marks before doing further analysis.
4. Computing term frequencies or tf-idf
After pre-processing the text data, you can then proceed to generate features. For document clustering, one of the most common ways to generate features for a document is to calculate the term frequencies of all its tokens. Although not perfect, these frequencies can usually provide some clues about the topic of the document.
5. Clustering
You can then cluster different documents based on the features that have been generated.
6. Evaluation and Visualization
Finally, the clustering models can be assessed by various metrics. It is sometimes helpful to visualize the results by plotting the clusters in a low-dimensional (for example, two-dimensional) space.
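The steps above can be sketched end to end with scikit-learn. This is a rough illustration on a made-up six-sentence corpus, not a tuned pipeline. TfidfVectorizer handles tokenization, lowercasing and stop-word removal but does not stem or lemmatize, so step 2 is skipped here. The silhouette score is one possible evaluation metric among many.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Made-up corpus with two obvious topics.
docs = [
    "The striker scored a late goal in the football match",
    "The football team won the match with a late goal",
    "A late goal decided the football match",
    "The central bank raised interest rates again",
    "Interest rates and inflation worry the central bank",
    "The central bank expects interest rates to fall",
]

# Steps 1, 3 and 4: tokenize, drop English stop words, compute tf-idf features.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(docs)

# Step 5: cluster the documents into two groups.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(features)

# Step 6: evaluate; the silhouette score ranges from -1 (poor) to 1 (well separated).
score = silhouette_score(features, labels)

for doc, label in zip(docs, labels):
    print(label, doc)
print("silhouette:", round(score, 3))
```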
Automatic Document Classification Techniques Include:
- Expectation maximization (EM)
- Naive Bayes classifier
- Instantaneously trained neural networks
- Latent semantic indexing
- Support vector machines (SVM)
- Artificial neural network
- K-nearest neighbour algorithms
- Decision trees such as ID3 or C4.5
- Concept Mining
- Rough set-based classifier
- Soft set-based classifier
- Multiple-instance learning
- Natural language processing approaches
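As a small illustration of one technique from this list, the sketch below trains a Naive Bayes classifier on a made-up spam/ham corpus. It assumes scikit-learn is installed. The training sentences and labels are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up labelled examples: the "external mechanism" of supervised learning.
train_docs = [
    "win a free prize now",
    "free money, claim your prize",
    "cheap loans, win cash now",
    "meeting agenda for monday",
    "please review the project report",
    "lunch with the project team on monday",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The pipeline converts text to tf-idf features, then fits the classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

predictions = model.predict([
    "claim your free cash prize",
    "project meeting on monday",
])
print(list(predictions))
```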
Document Classification Using Python
Text classification is one of the most important tasks in Natural Language Processing. It is the process of classifying text strings or documents into different categories, depending upon the contents of the strings.
Text classification has a variety of applications, such as detecting user sentiment from a tweet, classifying an email as spam or ham, classifying blog posts into different categories, automatic tagging of customer queries, and so on.
Here, Python and scikit-learn are used to work through one such problem: sentiment analysis. Scikit-learn is a Python library for machine learning and data analysis.
Following are the steps required to create a text classification model in Python:
- Importing Libraries
- Importing The dataset
- Text Preprocessing
- Converting Text to Numbers
- Training and Test Sets
- Training Text Classification Model and Predicting Sentiment
- Evaluating The Model
- Saving and Loading the Model
Here, I will perform a series of steps required to predict sentiments from reviews of different movies. These steps can be used for any text classification task. Moreover, I will use Python’s Scikit-Learn library for machine learning to train a text classification model.
(i) Importing Libraries
Execute the following script to import the required libraries:
    import numpy as np
    import re
    import nltk
    from sklearn.datasets import load_files
    nltk.download('stopwords')
    import pickle
    from nltk.corpus import stopwords
(iii) Importing the Dataset
I will use the load_files function from the sklearn.datasets module to import the dataset into the application. Execute the following script to see the load_files function in action:
    # Each sub-folder of the dataset directory is treated as one category.
    movie_data = load_files(r"D:\txt_sentoken")
    X, y = movie_data.data, movie_data.target
(iv) Text Preprocessing
Once the dataset has been imported, the next step is to preprocess the text. Text may contain numbers, special characters, and unwanted spaces. Depending upon the problem you face, you may or may not need to remove these special characters and numbers from text.
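A minimal sketch of such cleaning, using only the standard re module. The preprocess function and the sample sentence are hypothetical, and the sketch assumes the text has already been decoded to a Python str. Which characters to strip depends on your task.

```python
import re

def preprocess(text):
    """Lowercase text and strip characters that rarely help classification."""
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # drop digits, punctuation, symbols
    text = re.sub(r"\b[a-zA-Z]\b", " ", text)  # drop stray single letters
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
    return text.lower()

cleaned = preprocess("This movie was GREAT!!! 10/10, I'd watch it   again :)")
print(cleaned)  # this movie was great watch it again
```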
(v) Converting Text to Numbers
Machines, unlike humans, cannot understand the raw text. Machines can only see numbers. Particularly, statistical techniques such as machine learning can only deal with numbers. Therefore, you need to convert the text into numbers.
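One common way to do this conversion is tf-idf weighting. The sketch below uses scikit-learn's TfidfVectorizer on two made-up reviews. It is an illustration of the idea, not necessarily the exact feature extraction used for the movie dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up reviews standing in for real documents.
reviews = [
    "a great movie with a great cast",
    "a dull movie with a weak plot",
]

vectorizer = TfidfVectorizer()
# Sparse matrix: one row per document, one column per distinct term.
X_numeric = vectorizer.fit_transform(reviews)

# Columns are ordered alphabetically by term.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(X_numeric.toarray().round(2))
```

With the default settings, single-character tokens such as "a" are dropped, so they do not appear in the vocabulary.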
(vi) Training and Testing Sets
Like any other supervised machine learning problem, you need to divide the data into training and testing sets. To do so, I will use the train_test_split utility from the sklearn.model_selection library. Execute the following script:
    from sklearn.model_selection import train_test_split

    # Hold out 20% of the documents for testing.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
(vii) Training Text Classification Model and Predicting Sentiment
Take a look at the following script:
    from sklearn.ensemble import RandomForestClassifier

    classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
    classifier.fit(X_train, y_train)
Finally, to predict the sentiment for the documents in the test set you can use the predict method of the RandomForestClassifier class as shown below:
y_pred = classifier.predict(X_test)
(viii) Evaluating the Model
To evaluate the performance of a classification model such as the one that you just trained, you can use metrics such as the confusion matrix, F1 measure, and the accuracy.
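A short sketch of these metrics using sklearn.metrics. The two label lists below are invented stand-ins for y_test and y_pred so that the snippet runs on its own.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up true and predicted labels (1 = positive review, 0 = negative).
true_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(true_labels, predicted_labels)
acc = accuracy_score(true_labels, predicted_labels)
f1 = f1_score(true_labels, predicted_labels)

print(cm)
print("accuracy:", acc)
print("F1:", f1)
```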
(ix) Saving and Loading the Model
You can save your model as a pickle object in Python. To do so, execute the following script:
    with open('text_classifier', 'wb') as picklefile:
        pickle.dump(classifier, picklefile)
Once you execute the above script, you can see the text_classifier file in your working directory. This means that you have saved the trained model and can use it later for directly making predictions, without training.
To load the model, you can use the following code:
    with open('text_classifier', 'rb') as training_model:
        model = pickle.load(training_model)
Document Classification Machine Learning
Text documents are one of the richest sources of data for businesses: whether in the shape of customer support tickets, emails, technical documents, user reviews or news articles. They all contain valuable information that can be used to automate slow manual processes, better understand users, or find valuable insights.
However, traditional algorithms struggle to process these unstructured documents, and this is where machine learning comes to the rescue! This makes it very important for an aspiring Data Scientist to learn Machine Learning.
Here, I will show how off-the-shelf ML tools can be used to automatically label news articles. The approach I’ll describe can be used in any task related to processing text documents, and even to other types of ML tasks.
Moreover, you will also learn how data can be extracted and pre-processed, how you can make some initial observations about it, how to build ML models, and—last but not least—how to evaluate and interpret them.
Data Extraction & Exploration
Loading Data
Data is an essential resource for any ML project.
Data Analysis
Before diving head-first into training machine learning models, you should become familiar with the structure and characteristics of the dataset: these properties might inform the problem-solving approach. First, it’s always useful to look at the number of documents per class:
Here, you will see that the number of articles per class is roughly balanced, which is helpful! If the dataset were imbalanced, you would need to carefully configure the model or artificially balance the dataset, for example by undersampling or oversampling each class.
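Counting documents per class takes only a few lines. The sketch below uses a made-up list of labels. With a real dataset you would pass its label column instead.

```python
from collections import Counter

# Hypothetical category labels, one per article.
labels = [
    "sport", "business", "politics", "sport", "tech",
    "business", "sport", "politics", "tech", "business",
]

counts = Counter(labels)
for category, n in counts.most_common():
    print(f"{category:<10} {n}")

# A ratio close to 1 suggests a roughly balanced dataset.
imbalance_ratio = max(counts.values()) / min(counts.values())
print("largest / smallest class:", imbalance_ratio)
```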
To further analyze the dataset, you need to transform each article’s text into a feature vector, a list of numerical values representing some of the text’s characteristics.
This is because most ML models cannot process raw text, instead only dealing with numerical values.
One common approach for extracting features from the text is to use the bag of words model: a model where for each document, an article, in this case, the presence (and often the frequency) of words is taken into consideration, but the order in which they occur is ignored.
Specifically, for each term in the dataset, a measure called Term Frequency, Inverse Document Frequency (abbreviated to tf-idf) will be calculated. This statistic represents the word’s importance in each document. A word’s frequency is used as a proxy for its importance: if “football” is mentioned 25 times in a document, it might be more important than if it was only mentioned once. The inverse document frequency then down-weights words that appear in many documents, since such words do little to distinguish one document from another.
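To show how the two factors interact, here is a hand-rolled sketch of the textbook formulation, tf × log(N / df), on a made-up three-document corpus. The tf_idf helper is hypothetical. Libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

# Made-up corpus, already tokenized.
docs = [
    "football match football goal".split(),
    "election vote football".split(),
    "election vote parliament".split(),
]

def tf_idf(term, doc, corpus):
    """Textbook tf-idf: relative term frequency times log(N / document frequency)."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# "football" is frequent in the first document but also appears in another one.
football_score = tf_idf("football", docs[0], docs)
# "goal" appears once, yet only in the first document.
goal_score = tf_idf("goal", docs[0], docs)

print(round(football_score, 3), round(goal_score, 3))
```

In this toy corpus "goal" ends up with the higher score even though "football" occurs more often in the document, because "goal" is unique to it.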
Model Training and Evaluation
With this initial data exploration achieved, you are now more familiar with the way data is represented, and relatively confident that machine learning is a good fit to solve the classification problem. You are now ready to experiment with different machine learning models, evaluate their accuracy, and tweak the model to avoid any potential issues. Scikit-learn provides implementations for a large number of machine learning models, spanning a few different families:
(i) Linear Models: Linear Regression, Logistic Regression, …
(ii) Ensemble Models: Random Forest, Gradient Boosting Trees, Adaboost, …
(iii) Bayesian Models: (Multinomial/Gaussian/…) Naive Bayes, Gaussian Processes, …
(iv) Other Models: Support Vector Machines, k-Nearest Neighbors, and various others
It is common practice to split the data into three parts:
- A training set that the model will be trained on.
- A validation set used for finding the optimal hyperparameters.
- A test set to evaluate the model’s performance.
Since a hyperparameter search is not being performed, only a train/test split will be used. To evaluate each model, we will use the K-fold cross-validation technique: iteratively training the model on different subsets of the data, and testing against the held-out data.
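A sketch of K-fold cross-validation with scikit-learn's cross_val_score. The corpus is generated from two made-up word lists, and the two models compared are illustrative choices. Putting the vectorizer inside the pipeline means tf-idf is fitted on the training folds only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Build a tiny synthetic corpus: 10 documents per class, 3 words each.
sport_words = ["football", "goal", "match", "team", "striker"]
business_words = ["market", "shares", "profit", "bank", "investor"]
docs, labels = [], []
for i in range(10):
    docs.append(" ".join(sport_words[(i + j) % 5] for j in range(3)))
    labels.append("sport")
    docs.append(" ".join(business_words[(i + j) % 5] for j in range(3)))
    labels.append("business")

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multinomial Naive Bayes": MultinomialNB(),
}

# Five folds: each model is trained five times, each time tested on a different fifth.
mean_accuracy = {}
for name, estimator in models.items():
    pipeline = make_pipeline(TfidfVectorizer(), estimator)
    scores = cross_val_score(pipeline, docs, labels, cv=5, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.2f} (+/- {scores.std():.2f})")
```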
Next, the models are evaluated using this technique (with five validation folds) to obtain the following results:
Model Interpretation
It’s not enough to have a model that performs well according to a given metric: you must also have a model that you can understand and whose results can be explained. Start by training the model on part of the dataset, and then analyze the main sources of misclassification on the test set. One way to eliminate sources of error is to look at the confusion matrix, a matrix used to show the discrepancies between predicted and actual labels.
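One simple way to read a confusion matrix is to list the most frequent wrong (actual, predicted) pairs. The sketch below does this with made-up labels and the standard library only.

```python
from collections import Counter

# Made-up actual and predicted categories for eight articles.
actual = ["sport", "tech", "business", "tech", "politics", "business", "tech", "sport"]
predicted = ["sport", "business", "business", "business", "politics", "tech", "tech", "sport"]

# Count every (actual, predicted) pair; this is the confusion matrix in sparse form.
confusion = Counter(zip(actual, predicted))

# Keep only the off-diagonal entries, i.e. the misclassifications.
errors = Counter({pair: n for pair, n in confusion.items() if pair[0] != pair[1]})
for (true_label, guess), n in errors.most_common():
    print(f"{true_label} predicted as {guess}: {n}")

worst_pair, worst_count = errors.most_common(1)[0]
```

Here the most common mistake is a tech article labelled as business, which would be the first place to look for overlapping vocabulary.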
Using off-the-shelf tools and simple models, you solved a complex task, that of document classification, which might have seemed daunting at first! To do so, you followed these steps:
- Load and pre-process data.
- Analyze patterns in the data, to gain insights.
- Train different models, and rigorously evaluate each of them.
- Interpret the trained model.
Final Thoughts
Document classification is exceptionally important when specific information must be retrieved within a set timeframe. It is a worthwhile tool that can reduce the cost and time of searching for and retrieving the information that matters. Hopefully, I was able to provide you with everything you need to get started.