1. Language identification
The first step in text analytics is identifying what language the text is written in.
Spanish? Russian? Arabic? Chinese?
Lexalytics supports text analytics for more than 25 languages and dialects.
Together, these languages include a complex tangle of alphabets, abjads and
logographies. Each language has its own idiosyncrasies and unique rules of
grammar. So, as basic as it might seem, language identification determines how every other text analytics function is applied.
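As a rough illustration (not Lexalytics’ own detector), language identification can be prototyped with the open-source langdetect package; the package choice and sample strings below are assumptions made just for this sketch.

```python
# Minimal language-identification sketch using the open-source langdetect
# package (pip install langdetect). This is a stand-in illustration, not
# the detector Lexalytics uses in production.
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # langdetect is probabilistic; fix the seed for repeatable results

samples = [
    "The quick brown fox jumps over the lazy dog.",     # English
    "El zorro marrón salta sobre el perro perezoso.",   # Spanish
    "Быстрая лиса прыгает через ленивую собаку.",        # Russian
]

for text in samples:
    print(detect(text), "->", text)  # prints ISO 639-1 codes such as en, es, ru
```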
2. Tokenization
Fig. 2 - Most text analytics systems rely on rules-based algorithms to
tokenize alphabetic languages, but logographic languages require the
use of complex machine learning algorithms
Tokenization is the process of breaking apart a sentence or phrase into its
component pieces. Tokens are usually words or numbers. Depending on the type of
unstructured text you’re processing, however, tokens can also be:
• Punctuation (exclamation points amplify sentiment)
• Hyperlinks (https://…)
• Possessive markers (apostrophes)
Tokenization is language-specific, so it’s important to know which language you’re
analyzing. Most alphabetic languages use whitespace and punctuation to denote
tokens within a phrase or sentence. Logographic (character-based) languages such
as Chinese, however, use other systems.
Lexalytics uses rules-based algorithms to tokenize alphabetic languages, but
logographic languages require the use of complex machine learning algorithms.
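For a rough sense of that difference, the sketch below tokenizes an English sentence with NLTK’s rules-oriented word tokenizer and a Chinese sentence with the open-source jieba segmenter. Both are open-source stand-ins, not Lexalytics’ tokenizers, and the sample sentences are invented for the example.

```python
# Tokenization sketch: NLTK's word_tokenize for an alphabetic language,
# jieba for a logographic one.
import nltk    # pip install nltk
import jieba   # pip install jieba

nltk.download("punkt", quiet=True)  # pre-trained tokenizer models

english = "Lexalytics supports more than 25 languages!"
print(nltk.word_tokenize(english))
# ['Lexalytics', 'supports', 'more', 'than', '25', 'languages', '!']
# note the exclamation point survives as its own token

chinese = "我喜欢自然语言处理"  # "I like natural language processing"
print(list(jieba.cut(chinese)))
# e.g. ['我', '喜欢', '自然语言', '处理'] (segmentation depends on jieba's dictionary)
```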
3. Sentence breaking
Small text documents, such as tweets, usually contain a single sentence. But longer documents require sentence breaking to separate each unique statement. Sentences are usually separated by punctuation marks, but some punctuation doesn’t mark the end of a statement (like the period in “Dr.”).
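A quick way to see this handled is NLTK’s pre-trained Punkt sentence tokenizer, which recognizes common abbreviations like “Dr.”. Again, this is an open-source stand-in rather than Lexalytics’ sentence breaker, and the sample text is made up.

```python
# Sentence-breaking sketch with NLTK's Punkt tokenizer.
import nltk

nltk.download("punkt", quiet=True)

text = "Dr. Smith loved the product. She ordered two more. Will she come back?"
for sentence in nltk.sent_tokenize(text):
    print(sentence)
# Dr. Smith loved the product.
# She ordered two more.
# Will she come back?
```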
4. Part of Speech tagging
Part of Speech tagging (or PoS tagging) is the process of determining the part of
speech of every token in a document, and then tagging it as such. Most languages
follow some basic rules and patterns that can be written into a basic Part of Speech
tagger. When shown a text document, the tagger figures out whether a given token
represents a proper noun or a common noun, or if it’s a verb, an adjective, or
something else entirely.
Accurate part of speech tagging is critical for reliable sentiment analysis. By identifying adjective-noun combinations, a sentiment analysis system gains its first
clue that it’s looking at a sentiment-bearing phrase. At Lexalytics, due to our
breadth of language coverage, we’ve had to train our systems to understand 93
unique Part of Speech tags.
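As a point of comparison, NLTK’s off-the-shelf tagger uses the Penn Treebank tag set (a few dozen tags, far fewer than the 93 mentioned above). The sketch below tags the example sentence used in the chunking section that follows; it is an illustration, not Lexalytics’ tagger.

```python
# Part of Speech tagging sketch with NLTK's averaged-perceptron tagger and
# the Penn Treebank tag set (DT = determiner, JJ = adjective, NN = noun,
# VB* = verb forms, RB = adverb, IN = preposition).
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The tall man is going to quickly walk under the ladder.")
print(nltk.pos_tag(tokens))
# Roughly (exact output can vary by NLTK version):
# [('The', 'DT'), ('tall', 'JJ'), ('man', 'NN'), ('is', 'VBZ'), ('going', 'VBG'),
#  ('to', 'TO'), ('quickly', 'RB'), ('walk', 'VB'), ('under', 'IN'),
#  ('the', 'DT'), ('ladder', 'NN'), ('.', '.')]
```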
5. Chunking
Chunking refers to a range of sentence-breaking systems that splinter a sentence
into its component phrases (noun phrases, verb phrases, and so on).
Chunking in text analytics is different from Part of Speech tagging:
• PoS tagging means assigning parts of speech to tokens
• Chunking means assigning PoS-tagged tokens to phrases
For example, take the sentence:
The tall man is going to quickly walk under the ladder.
PoS tagging will identify man and ladder as nouns and walk as a verb.
Chunking will return: [the tall man]_np [is going to quickly walk]_vp [under the
ladder]_pp
(np stands for “noun phrase,” vp stands for “verb phrase,” and pp stands for
“prepositional phrase.”)
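One lightweight way to reproduce part of that output is NLTK’s RegexpParser with a single noun-phrase rule. A production chunker (Lexalytics’ included) covers verb and prepositional phrases as well, so this is only a sketch of the idea.

```python
# Chunking sketch: group determiner + adjectives + nouns into noun phrases (NP)
# on top of the PoS tags from the previous step.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tagged = nltk.pos_tag(nltk.word_tokenize(
    "The tall man is going to quickly walk under the ladder."))

grammar = "NP: {<DT>?<JJ>*<NN.*>+}"  # optional determiner, any adjectives, one or more nouns
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))
# (S (NP The/DT tall/JJ man/NN) is/VBZ going/VBG to/TO quickly/RB walk/VB
#    under/IN (NP the/DT ladder/NN) ./.)
```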
6. Syntax parsing
Syntax parsing is the analysis of how a sentence is formed. Syntax parsing is a
critical preparatory step in sentiment analysis and other natural language
processing features.
The same sentence can have multiple meanings depending on how it’s structured:
• Apple was doing poorly until Steve Jobs…
• Because Apple was doing poorly, Steve Jobs…
• Apple was doing poorly because Steve Jobs…
In the first sentence, Apple is negative, whereas Steve Jobs is positive.
In the second, Apple is still negative, but Steve Jobs is now neutral.
In the final example, both Apple and Steve Jobs are negative.
Syntax parsing is one of the most computationally intensive steps in text analytics.
At Lexalytics, we use special unsupervised machine learning models, based on
billions of input words and complex matrix factorization, to help us understand
syntax just like a human would.
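The sketch below shows what a dependency parse of the second example looks like in spaCy’s small English model. The sentence ending (“returned as CEO”) is a hypothetical completion added only for the example, and spaCy is an open-source stand-in for the models described above.

```python
# Syntax (dependency) parsing sketch with spaCy.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Because Apple was doing poorly, Steve Jobs returned as CEO.")

for token in doc:
    # word, its dependency label, and the word it attaches to
    print(f"{token.text:10} {token.dep_:10} -> {token.head.text}")

# "Because" is labeled as a subordinating marker and the "Apple was doing
# poorly" clause attaches to the main verb, which is what tells a sentiment
# engine that the negativity belongs to Apple rather than Steve Jobs.
# (Exact labels can vary with the model version.)
```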
7. Sentence chaining
The final step in preparing unstructured text for deeper analysis is sentence
chaining. Sentence chaining uses a technique called lexical chaining to connect
individual sentences based on their association to a larger topic. Take the
sentences:
• I like beer.
• Miller just launched a new pilsner.
• But I only drink Belgian ale.
Even if these sentences don’t appear near each other in a body of text, they are still
connected to each other through the topics of beer->pilsner->ale. Lexical
chaining allows us to make these kinds of connections. The “score” of a lexical chain
is directly related to the length of the chain and the relationships between the
chaining nouns (same words, antonyms, synonyms, homonyms, meronyms,
hypernyms or holonyms).
Lexical chains flow through the document and help a machine detect over-arching
topics and quantify the overall “feel”. Lexalytics uses sentence chaining to weight
individual themes, compare sentiment scores and summarize long documents.
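The relationships behind those chains can be inspected directly in WordNet: both “pilsner” and “ale” sit under “beer” in the hypernym hierarchy, which is exactly the kind of link a lexical chainer scores. The sketch below only checks that relationship with NLTK’s WordNet interface; it is not Lexalytics’ chaining algorithm.

```python
# Lexical-relationship sketch: confirm that "pilsner" and "ale" are both
# kinds of "beer" by walking their WordNet hypernym paths.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

beer = wn.synset("beer.n.01")
for word in ("pilsner", "ale"):
    synset = wn.synsets(word, pos=wn.NOUN)[0]  # first (most common) noun sense
    ancestors = {h for path in synset.hypernym_paths() for h in path}
    print(word, "is a kind of beer:", beer in ancestors)
# pilsner is a kind of beer: True
# ale is a kind of beer: True
```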
Basic applications of text mining and natural
language processing
Text mining and natural language processing technologies add powerful historical
and predictive analytics capabilities to business intelligence and data analytics
platforms. The flexibility and customizability of these systems make them
applicable across a wide range of industries, such as hospitality, financial services,
pharmaceuticals, and retail.
Broadly speaking, applications of text mining and NLP fall into three categories:
Voice of Customer
Customer Experience Management and Market Research
It can take years to gain a customer, but only minutes to lose them. Business
analysts use text mining tools to understand what consumers are saying about their
brands, products and services on social media, in open-ended experience surveys,
and around the web. Through sentiment analysis, categorization and other natural
language processing features, text mining tools form the backbone of data-driven
Voice of Customer programs.
Read more about text analytics for Voice of Customer
Social Media Monitoring
Social Listening and Brand Management
Social media users generate a goldmine of natural-language content for brands. But social comments are usually riddled with spelling errors and laden with abbreviations, acronyms, and emoticons.
abbreviations, acronyms, and emoticons. The sheer volume poses a problem, too.
On your own, analyzing all this data would be impossible. Business Intelligence
tools like the Lexalytics Intelligence Platform use text analytics and natural
language processing to quickly transform these mountains of hashtags, slang, and
poor grammar into useful data and insights into how people feel, in their own
words.
Read more about text analytics for Social Media Monitoring
Voice of Employee
Workforce Analytics and Employee Satisfaction
The cost of replacing a single employee can range from 20% to 30% of their salary. But companies struggle to attract and retain good talent. Structured employee satisfaction surveys rarely give people the chance to voice their true opinions. And by the time you’ve identified the factors that reduce productivity and drive employees to leave, it’s too late. Text analytics tools help human resources
professionals uncover and act on these issues faster and more effectively, cutting off
employee churn at the source.
Read more about text analytics for Voice of Employee
Further reading
Try text analytics and text mining for free
Text analytics and NLP in action:
• Try our web demo for a quick sample of Lexalytics’ own text analytics and NLP features
• Sign up for an interactive demo of the Lexalytics Intelligence Platform with sample data and dashboards
• Contact us for a live demo with your data, or to discuss our on-premise and cloud APIs
Build your own text analytics system:
• Start working with the Stanford NLP or Natural Language Toolkit (NLTK) open source distributions
• Browse this Predictive Analytics list of 27 free text analytics toolkits
• Take a Coursera course on Text Mining and Analytics
Learn more about text analytics
• Read about sentiment analysis and other natural language processing features
• Explore the difference between machine learning and natural language processing
• Dive deep with this practitioner’s guide to NLP on KDnuggets