Introduction
In today's data-intensive world, almost every business and process generates a large amount of data every day. It becomes necessary to take this data into account and channel it to gain a competitive edge, thereby solving business problems efficiently. Let’s learn text analysis in Python.
In the pursuit of solving business problems effectively with data, many factors play a crucial role. One of the major problems is dealing with unstructured, messy data, and you wouldn’t be surprised to know that 70 percent of this data is in text format. That does not make the problem easier, because cleaning text and making sense of it is considerably harder than handling integers arranged in rows or images lined up in a stack in Python.
As a popular saying goes, “If it is not tough, then it is not worth it”. The incentives for making sense of text data and utilizing it are huge, as text sources are unlimited and have widespread applications, from exams in a university to drug reports to tweets and messages on social media. A lot of useful analysis can be done on this data, and a diverse set of applications can be built from it.
NLP - Natural Language Processing
To make sense of textual data, a new field of study has developed, with new research appearing every day. It is referred to as natural language processing.
It is defined as a field of artificial intelligence that enables computers to analyze and understand human language. Natural Language Processing (NLP) was formulated to build software that generates and understands natural languages, so that a user can have natural conversations with a computer instead of communicating through programming or artificial languages like Java or C.
The aim of NLP is to enable machines to understand and interact in human language, including its emotions, workings, and real-world meanings, so comprehensively that they can outdo humans at major tasks. These range from simple tasks such as answering queries as a chatbot to leafing through thousands of documents at once, understanding them, and drawing conclusions on their own.
NLP is being used in diverse areas to achieve different goals. Let us look at some of the most sought-after ones, starting with social media, which generates a huge amount of data every day.
The internet never sleeps. It is one of the most active areas for the generation and usage of data, and it is too important to ignore given the amount of information that can be derived from it. To give you an idea:
How much? In any given minute, 277,000 tweets are published on Twitter, 216,000 photos are sent to Instagram and 8,333 videos are shared on Vine, and we’re just getting started. Over that same 60-second period, 347,222 photos are sent on WhatsApp, 416,667 swipes are made on Tinder and 3,472 images are pinned on Pinterest.
And if you think that’s impressive, Google receives 4 million search queries, Facebook users share 2.46 million pieces of content and 204 million email messages are sent each and every minute of the day.
Applications of NLP
Sentiment analysis
Most customers who shop online provide feedback, and this feedback is then classified into categories such as positive, negative, and neutral. This helps other users make better decisions about a product and helps the company pick out flaws from negative reviews so that the product can be improved. A quick look at the product review pages of Amazon or Flipkart shows this in practice.
Chatbots
Customers often face problems with products or with services related to them. Although customer service exists for most products, it is not effectively available 24x7, because most people want their complaints addressed during normal working hours, and that creates a rush. The need for a service that is available at all times, can handle the customer load effectively, understands customers in their own language, and reduces manpower costs has led us to chatbots.
They are being developed for different industries, be it customer service for banks or replacing receptionists in hotels, and are even slowly taking over repetitive tasks such as replying to specific personalized emails after analyzing them.
Identifying sock puppets
One of the most advanced problem areas in social media, and one that affects loyal users on a regular basis, is fake, mischievous accounts that post inappropriate content and undermine the sanctity of a platform. A lot of advanced research is being done on this topic, and popular sites such as Quora have banned accounts over these issues.
NLTK
In all of the above applications, Python is at the forefront, giving access to advanced libraries, particularly the Natural Language Toolkit (NLTK). It possesses the functions and modules needed to pre-process text so that information can be derived from it.
Installing NLTK
Python users can install NLTK with pip directly from their Jupyter notebook, or follow the instructions on the official site http://www.nltk.org/install.html.
Running the NLTK downloader then opens a window that downloads the important data packages one by one.
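The original command is not preserved in this text, so the following is only a minimal sketch of a typical setup check. It assumes installation through pip; the variable name `nltk_installed` is introduced here for illustration.

```python
import importlib.util

# Check whether the nltk package can be imported in this environment.
nltk_installed = importlib.util.find_spec("nltk") is not None

if nltk_installed:
    import nltk
    print("NLTK version:", nltk.__version__)
    # nltk.download() opens the interactive downloader window.
    # It is left commented out because it blocks until the window is closed.
    # nltk.download()
else:
    # In a Jupyter notebook cell the install command is usually written as:
    #   !pip install nltk
    print("NLTK is not installed; run 'pip install nltk' first.")
```

Individual data packages can also be fetched by name, for example `nltk.download("stopwords")`, which avoids the window entirely.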
Loading NLTK and the sample text to be preprocessed.
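The author's sample text is not available here, so this sketch uses a hypothetical single sentence that contains a "$" sign and the words "fishing" and "playing", which the later steps refer to.

```python
import nltk  # the toolkit used in the preprocessing steps

# Hypothetical sample sentence (an assumption, not the author's original text).
text = (
    "John is fishing and playing by the lake, "
    "and he sold each fish for $20 at the local market."
)
print(text)
```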
Preprocessing steps
We consider the following preprocessing steps to get started with text data.
1. Noise removal
We can specifically remove certain keywords that do not make sense in the context of the text, because the provided stopword list contains only a limited set of words. We have saved the words we don’t want in not_required.
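The filtering described above could look roughly like this. The sample sentence, the contents of `not_required`, and the helper name `remove_noise` are assumptions made for this sketch.

```python
# Hypothetical sample sentence and noise list (assumptions for illustration).
text = (
    "John is fishing and playing by the lake, "
    "and he sold each fish for $20 at the local market."
)
not_required = ["each", "local"]


def remove_noise(text, noise_words):
    """Drop every whitespace-separated word that appears in noise_words."""
    noise = {word.lower() for word in noise_words}
    kept = [word for word in text.split() if word.lower() not in noise]
    return " ".join(kept)


cleaned = remove_noise(text, not_required)
print(cleaned)
# John is fishing and playing by the lake, and he sold fish for $20 at the market.
```

Because the split is on whitespace only, a noise word with punctuation attached (such as "lake,") would not match; running this step after tokenization avoids that.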
The specified noise words are removed from the text, which enables a better analysis of it.
2. Tokenization
It is the process of converting text into tokens (the words or entities present in the text), which makes the other preprocessing steps easier to perform.
Sentence tokenizer
It tokenizes text into sentences.
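A minimal sketch of sentence tokenization with NLTK, using a hypothetical sample sentence. `sent_tokenize` needs the pretrained Punkt model; as an assumption for this sketch, the code falls back to an untrained `PunktSentenceTokenizer` when that model has not been downloaded.

```python
from nltk.tokenize import sent_tokenize
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical sample sentence (an assumption for illustration).
text = (
    "John is fishing and playing by the lake, "
    "and he sold each fish for $20 at the local market."
)

try:
    # Usual entry point; requires the pretrained Punkt model
    # (nltk.download("punkt"), or "punkt_tab" on newer NLTK releases).
    sentences = sent_tokenize(text)
except LookupError:
    # Fallback: an untrained Punkt tokenizer, which needs no downloaded data
    # but knows no abbreviations, so it is less accurate on real text.
    sentences = PunktSentenceTokenizer().tokenize(text)

print(len(sentences))
print(sentences)
```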
Because the sample text contains only a single sentence, the tokenizer returns it as one complete sentence.
Word tokenizer
It further tokenizes the text into individual words.
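Word tokenization can be sketched with `word_tokenize`. The sample sentence is hypothetical, and passing `preserve_line=True` is a choice made for this sketch: it skips the sentence-splitting stage so the call does not depend on the downloaded Punkt model. Once the model is available, the argument can be dropped.

```python
from nltk.tokenize import word_tokenize

# Hypothetical sample sentence (an assumption for illustration).
text = (
    "John is fishing and playing by the lake, "
    "and he sold each fish for $20 at the local market."
)

# preserve_line=True treats the input as one sentence and goes straight
# to word-level tokenization.
tokens = word_tokenize(text, preserve_line=True)
print(tokens)
```

Punctuation marks and the "$" sign come out as separate tokens, which is what makes the later cleaning steps straightforward.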
The word tokenizer returns the complete sentence as a list of words, which makes the following preprocessing steps easier to perform.
3. Converting to Lowercase
To make the generated tokens consistent, we need to bring them all to a single case. Generally, we convert them to lowercase.
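As a short sketch, lowercasing a list of tokens needs only a list comprehension. The token list below is hypothetical.

```python
# Hypothetical tokens with mixed case (an assumption for illustration).
tokens = ["John", "is", "Fishing", "and", "playing", "by", "the", "Lake"]

# Convert every token to lowercase so that "Fishing" and "fishing" match.
lower_tokens = [token.lower() for token in tokens]
print(lower_tokens)
# ['john', 'is', 'fishing', 'and', 'playing', 'by', 'the', 'lake']
```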
4. Removing Punctuation
Punctuation and symbols do not provide any valuable information about the text. They only add noise, so we need to remove them.
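One simple way to do this on a token list, sketched here with hypothetical tokens, is to keep only the tokens that consist entirely of letters.

```python
# Hypothetical tokens, including a symbol, a number and a full stop
# (an assumption for illustration).
tokens = ["he", "sold", "each", "fish", "for", "$", "20", "at", "the", "market", "."]

# str.isalpha() is True only when every character is a letter, so this
# filter drops "$" and "." and also the number "20".
words = [token for token in tokens if token.isalpha()]
print(words)
# ['he', 'sold', 'each', 'fish', 'for', 'at', 'the', 'market']
```

Note that numbers disappear along with the punctuation, which may or may not be wanted depending on the analysis.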
The “$” sign is removed because this step drops every token that is not alphabetic.
We can also remove punctuation using string operations in Python.
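A minimal sketch of the string-based approach, using the standard library's `string.punctuation` with `str.translate` on a hypothetical sample sentence.

```python
import string

# Hypothetical sample sentence (an assumption for illustration).
text = (
    "John is fishing and playing by the lake, "
    "and he sold each fish for $20 at the local market."
)

# Build a translation table that deletes every ASCII punctuation character.
table = str.maketrans("", "", string.punctuation)
stripped = text.translate(table)
print(stripped)
# John is fishing and playing by the lake and he sold each fish for 20 at the local market
```

Unlike the alphabetic filter, this keeps the number 20 and only strips the "$" in front of it.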
This also removes the punctuation present. The effectiveness of the two approaches depends on the type of text, so both can be tried to see which one is more useful.
5. Removing stopwords
Stopwords are words that don’t contribute to the deeper meaning of the text.
They include very common words such as “is” and “are”.
NLTK provides a list of stopwords that can be removed from the text right away, leaving only the important words.
This list can be printed to inspect which words NLTK treats as stopwords.
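The following sketch loads NLTK's English stopword list and filters a hypothetical token list with it. The helper `load_english_stopwords` and the small offline stand-in set are assumptions added so the sketch still runs when the corpus cannot be downloaded; they are not part of NLTK.

```python
import nltk
from nltk.corpus import stopwords


def load_english_stopwords():
    """Return NLTK's English stopwords, downloading the corpus on first use."""
    try:
        return set(stopwords.words("english"))
    except LookupError:
        # Corpus not on disk yet: try to fetch it once.
        if nltk.download("stopwords", quiet=True):
            return set(stopwords.words("english"))
        # Offline stand-in (an assumption for this sketch, NOT the NLTK list).
        return {"is", "are", "and", "the", "by", "he", "for", "at"}


stop_words = load_english_stopwords()
print(sorted(stop_words)[:10])  # peek at a few entries

# Hypothetical lowercased, punctuation-free tokens (an assumption).
tokens = ["john", "is", "fishing", "and", "playing", "by", "the", "lake",
          "and", "he", "sold", "each", "fish", "for", "at", "the",
          "local", "market"]

# Keep only the tokens that are not stopwords.
filtered = [token for token in tokens if token not in stop_words]
print(filtered)
```

Converting the list to a set makes each membership test fast, which matters on longer documents.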
Many words are removed, and we are left with specific words that are not very common and thus contribute to our goal of analysis.
6. Stemming
Stemming refers to the process of reducing each word to its root or base form.
For example, “playing”, “play”, and “played” are all reduced to the single word “play”.
There are many stemming algorithms, although a popular and long-standing method is the Porter Stemming algorithm. This method is available in NLTK via the PorterStemmer class.
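A short sketch of the Porter stemmer in NLTK, applied to a hypothetical word list. The expected output assumes NLTK's default stemmer mode.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical words to stem (an assumption for illustration).
words = ["fishing", "fish", "playing", "play", "played"]

# Reduce each word to its stem.
stems = [stemmer.stem(word) for word in words]
print(stems)
# ['fish', 'fish', 'play', 'play', 'play']
```

The stemmer needs no downloaded data, so it works straight after installation.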
With words such as “fishing” and “playing” added to the text, the stemmer converts both “fishing” and “fish” to “fish”.
End Notes
Text analysis, though challenging, offers large incentives, as it can pave the way for machines to understand our language and perform our tasks more efficiently. At present, the field of NLP suffers from certain limitations, such as understanding context from text, recognizing that different sentences can convey the same meaning, detecting sarcasm, and many more. It is nevertheless a very hot topic, ongoing research is closing these gaps, and hopefully the day is not far off when machines will be able to understand human languages with utmost efficiency and apply that understanding.