
Data Analytics Blog


What is Text Mining? – The Complete Beginner’s Guide


Text mining helps answer specific research questions. So what is text mining? Say you want to answer the question “Why do cats sit on mats?” It would be impossible to read all the millions of research articles on the topic yourself. This is where text mining can help: it filters large amounts of research and extracts the relevant information you need.

WHY CATS SIT ON MATS? (Image Credits: Elsevier)

Let me explain the question “What is Text Mining?” with an example. In the sentence “Why cats sit on mats”, the program would identify ‘cats’ as a noun, ‘sit’ as a verb and ‘on’ as a preposition. But it is not just a search tool: it can also understand that a ‘cat’ is an animal, ‘sit’ is an action and a ‘mat’ is an object. It then identifies and maps patterns and trends across the millions of articles. For example, it could even tell us whether most of the cats that sit on mats come from cold climates. This detailed, relevant information helps us determine what additional research is needed to answer our question, so we can go back into the lab with a head start. Although it might seem easy, text mining requires many different tools and resources to make this work. Read on to find out more.

Analysing Results (Image credits: Elsevier)

What is Text Mining and How does it work?

Researchers can use text mining to answer specific research questions. You start by collecting the content you want to mine, for example a set of academic articles. You then apply a text mining tool, which helps extract the information you need from large amounts of content. The tool learns how to find information in each article, examining complex research content that contains specialised language, abbreviations, codes and symbols. Researchers end up with a long list of extracted words and sentences. The text mining tool also understands how the words relate to one another and can analyze the results. This lets researchers see emerging trends and patterns, which would be impossible if they had to read all the content themselves. The result is new insights that help answer their research questions. After text mining, researchers can share their results with the research community as a new article or as a resource such as a searchable database.

Solve specific Research Questions (Image Credits: Elsevier)

The Concept of Text Mining

Text mining helps turn raw text into clean data. In essence, text mining techniques clean up unstructured data so that it is ready for text analytics.

In terms of framework, text mining is similar to ETL (i.e. Extract, Transform, Load): the text has to pass through a series of steps before it can be inserted into a database. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, sentiment analysis, document summarization, production of granular taxonomies and entity relation modelling.

Understand how the words relate to one another (Image credits: Elsevier)

Areas of Text Mining

Information Extraction -> Data Mining -> Natural Language Processing -> Information Retrieval

Information Extraction (IE) – IE is the process of automatically obtaining structured data from unstructured data. This process relies on Natural Language Processing.

Data Mining (DM) – Data mining looks for patterns in data. It can be described as the retrieval of hidden information from data. Combined with data mining tools, text mining can help predict future responses and trends. It enables businesses to make knowledge-based decisions and answer business questions.

Natural Language Processing (NLP) – In text mining, the purpose of NLP is to supply the linguistic analysis of the text that the knowledge retrieval phase takes as its input.

Information Retrieval (IR) – IR is considered an extension of document extraction. IR systems help narrow down the set of records associated with a specific problem. Text mining involves applying complicated algorithms to large collections of documents, so by reducing the number of documents, IR can speed up the analysis significantly.
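The narrowing-down step described for IR can be sketched in a few lines of Python. This is only a minimal illustration, assuming a plain keyword match; the `retrieve` function and the sample documents are hypothetical, and real IR systems use indexes and ranking rather than a linear scan.

```python
def retrieve(documents, query):
    """Keep only the documents that contain every word of the query."""
    terms = set(query.lower().split())
    return [doc for doc in documents if terms <= set(doc.lower().split())]

# Hypothetical mini-collection, made up for illustration.
documents = [
    "Cats sit on mats",
    "Dogs sleep on sofas",
    "Cold climates and cats on mats",
]

hits = retrieve(documents, "cats mats")
print(hits)  # only the documents mentioning both 'cats' and 'mats'
```

Only the documents in `hits` would then be passed on to the heavier text mining steps.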


How to perform Text Mining?

Python and R are the most popular tools for text mining.

The following steps apply to text mining in both Python and R:

Information Retrieval | Data Preparation and Cleaning | Segmentation | Tokenization | Stop-word, number and punctuation removal | Stemming | Convert to lowercase | POS tagging | Create text corpus | Term-Document matrix

Tokenization

Tokenization is the process of splitting the whole data (corpus) into smaller chunks, usually single words. The resulting tokens feed representations such as the N-gram model or the bag-of-words model.
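As a rough sketch of tokenization, the snippet below splits a sentence into word tokens and then builds bigrams (N-grams with N = 2) from them. It uses only the Python standard library; the function names `tokenize` and `ngrams` are our own, and the regular expression is a deliberately simple assumption about what counts as a word.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Why cats sit on mats?")
bigrams = ngrams(tokens, 2)

print(tokens)   # ['why', 'cats', 'sit', 'on', 'mats']
print(bigrams)  # pairs of neighbouring words, starting with ('why', 'cats')
```

Libraries such as NLTK provide more careful tokenizers that handle contractions, hyphens and sentence boundaries.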

Stemming and Lemmatization

We apply stemming or lemmatization to prevent data duplication by linking each word to its root word. For instance, the words [big, bigger and biggest] all share the root ‘big’; treating them as separate terms would cause data redundancy.
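To make the idea concrete, here is a toy suffix-stripping stemmer. It is a crude sketch under our own assumptions (a short suffix list and a doubled-consonant fix), not the Porter stemmer or a real lemmatizer, and it will get many English words wrong.

```python
def simple_stem(word):
    """Toy stemmer: strip one common suffix and undo a doubled final consonant."""
    word = word.lower()
    for suffix in ("est", "er", "ing", "s"):
        if suffix == "s" and word.endswith("ss"):
            continue  # leave words like 'glass' alone
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # 'bigg' -> 'big': collapse the doubled consonant left behind.
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

roots = [simple_stem(w) for w in ["big", "bigger", "biggest"]]
print(roots)  # ['big', 'big', 'big']
```

After stemming, all three forms count towards a single term instead of three separate ones.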

Stop-word, number and punctuation removal

To go from raw text to fitting a deep learning model, we have to clean the text first. That means splitting it into words and handling punctuation and case (by converting to lower case).

Stop Words: Stop words are very common words such as “the”, “a”, “an”, “in”, “is” and “and”. Search engines are programmed to ignore them, both when indexing entries and when retrieving results, and they are of little use in analytics.
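A minimal cleaning step might look like the sketch below. The stop-word set contains only the six examples named in the text, which is an assumption made for brevity; real stop-word lists (for example the one shipped with NLTK) are much longer. The sample sentence is made up.

```python
import re

# Only the stop words named in the text; real lists are far longer.
STOP_WORDS = {"the", "a", "an", "in", "is", "and"}

def clean(text):
    """Lowercase, drop digits and punctuation, then remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # anything that is not a letter or space
    return [word for word in text.split() if word not in STOP_WORDS]

cleaned = clean("The cat is in the house, and 2 dogs sit on a mat!")
print(cleaned)  # ['cat', 'house', 'dogs', 'sit', 'on', 'mat']
```

Note that “on” survives here only because it is not in the tiny stop-word set used for this example.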

Sample text with Stop Words (Image Credits: geeksforgeeks.org)

POS-tagging

POS-tagging stands for Part-of-Speech tagging, which is part of NLP. It is one of the main components of almost any NLP analysis. POS-tagging simply means labelling each word with its Part of Speech (Noun, Adjective, Verb, Pronoun, Adverb, …).
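The labelling idea can be illustrated with a toy lookup tagger for the cat sentence used earlier. The `LEXICON` dictionary and its tag names are hand-written assumptions for this one sentence; real taggers, such as `nltk.pos_tag`, are trained on annotated corpora and use context to resolve ambiguous words.

```python
# Hand-written lexicon covering only this example sentence (an assumption).
LEXICON = {
    "why": "ADV",
    "cats": "NOUN",
    "sit": "VERB",
    "on": "PREP",
    "mats": "NOUN",
}

def tag(tokens, default="UNK"):
    """Label each token with its part of speech, or `default` if unknown."""
    return [(token, LEXICON.get(token.lower(), default)) for token in tokens]

tagged = tag("Why cats sit on mats".split())
print(tagged)
# [('Why', 'ADV'), ('cats', 'NOUN'), ('sit', 'VERB'), ('on', 'PREP'), ('mats', 'NOUN')]
```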

POS-tagging (Image credits: Learn Steps)

Part of Speech Tagging (Image Credits: Bogdan from NLPFH)

POS-tagging – python code snippet

Create Text Corpus

Text Corpus

Term-Document matrix

Term-Document Matrix (Image Credits: SPE3DLab)
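A term-document matrix has one row per term and one column per document, and each cell holds how often the term occurs in that document. The sketch below builds one with the standard library only; the three mini-documents are made up, and the text is assumed to be already cleaned and tokenizable by whitespace.

```python
from collections import Counter

# Hypothetical, already-cleaned documents.
docs = {
    "doc1": "cats sit on mats",
    "doc2": "dogs sit on rugs",
    "doc3": "cats chase dogs",
}

# Word counts per document.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Sorted vocabulary across all documents.
vocab = sorted(set().union(*counts.values()))

# One row per term, one column per document (in the order of `docs`).
matrix = {term: [counts[name][term] for name in docs] for term in vocab}

print("term".ljust(8), *docs)
for term, row in matrix.items():
    print(term.ljust(8), *row)
```

In practice, libraries such as scikit-learn build the same structure as a sparse matrix, which matters once the vocabulary grows to many thousands of terms.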

Association Mining Analysis – A real-world application of text mining

One real-world application is a model called the “Adverse Drug Event Probabilistic model”. With this model, we can check which adverse events are associated with taking a particular medicine.
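The general idea behind association mining can be sketched with the two classic measures, support and confidence. This is not the probabilistic model mentioned above, only a generic illustration; the records, drug names and events below are entirely made up.

```python
# Hypothetical records: each set lists a drug and the events reported with it.
records = [
    {"drug_a", "nausea"},
    {"drug_a", "nausea", "headache"},
    {"drug_a", "headache"},
    {"drug_b", "rash"},
    {"drug_b", "nausea"},
]

def support(itemset):
    """Fraction of records that contain every item in `itemset`."""
    return sum(itemset <= record for record in records) / len(records)

def confidence(antecedent, consequent):
    """Of the records containing `antecedent`, the share also containing `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

conf_nausea = confidence({"drug_a"}, {"nausea"})
print(f"support(drug_a) = {support({'drug_a'}):.2f}")
print(f"confidence(drug_a -> nausea) = {conf_nausea:.2f}")
```

Here two of the three `drug_a` records also mention nausea, so the rule “drug_a -> nausea” has a confidence of about two thirds in this toy data.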

Association Mining Analysis (Image credits: educba.com)

Inputs, Excerpts and Image Credits:

Why cats sit on mats? – an original concept by Elsevier.

                               

YouTube videos on text mining published by the official Elsevier

www.geeksforgeeks.org/removing-stop-words-nltk-python/

machinelearningmastery.com/clean-text-machine-learning-python/

www.learnsteps.com/part-of-speech-tagging-noun-phrases-sentences-and-tokenization-for-natural-language-processing/

www.speedlab.io/en/2017/03/08/creation-semantic-variables-based-document-term-matrix-dtm/

People’s go-to IT guy, Blogger, Pune University survivor, an Engineer.
