Lec 2 | Text Analytics | Lexalytics

Download PDf directly

Or Read Directly

Text Analytics | Lexalytics 1. Language identification The first step in text analytics is identifying what language the text is written in. Spanish? Russian? Arabic? Chinese? Lexalytics supports text analytics for more than 25 languages and dialects. Together, these languages include a complex tangle of alphabets, abjads and logographies. Each language has its own idiosyncrasies and unique rules of grammar. So, as basic as it might seem, language identification determines the whole process for every other text analytics function. 2. Tokenization Fig. 2 - Most text analytics systems rely on rules-based algorithms to tokenize alphabetic languages, but logographic languages require the use of complex machine learning algorithms Tokenization is the process of breaking apart a sentence or phrase into its component pieces. Tokens are usually words or numbers. Depending on the type of unstructured text you’re processing, however, tokens can also be: • Punctuation (exclamation points amplify sentiment) • Hyperlinks (https://…) • Possessive markers (apostrophes) Tokenization is language-specific, so it’s important to know which language you’re analyzing. Most alphabetic languages use whitespace and punctuation to denote tokens within a phrase or sentence. Logographic (character-based) languages such as Chinese, however, use other systems. Lexalytics uses rules-based algorithms to tokenize alphabetic languages, but logographic languages require the use of complex machine learning algorithms. 3. Sentence breaking Small text documents, such as tweets, usually contain a single sentence. But longer documents require sentence breaking to separate each unique statement. In some documents, each sentence is separated by a punctuation mark. But some sentences contain punctuation marks that don’t mean the end of the statement (like the period in “Dr.”) 4. Part of Speech tagging Part of Speech tagging (or PoS tagging) is the process of determining the part of speech of every token in a document, and then tagging it as such. Most languages follow some basic rules and patterns that can be written into a basic Part of Speech tagger. When shown a text document, the tagger figures out whether a given token represents a proper noun or a common noun, or if it’s a verb, an adjective, or something else entirely. Accurate part of speech tagging is critical for reliable sentiment analysis. Through identifying adjective-noun combinations, a sentiment analysis system gains its first clue that it’s looking at a sentiment-bearing phrase. At Lexalytics, due to our breadth of language coverage, we’ve had to train our systems to understand 93 unique Part of Speech tags. 5. Chunking Chunking refers to a range of sentence-breaking systems that splinter a sentence into its component phrases (noun phrases, verb phrases, and so on). Chunking in text analytics is different than Part of Speech tagging: • PoS tagging means assigning parts of speech to tokens • Chunking means assigning PoS-tagged tokens to phrases For example, take the sentence: The tall man is going to quickly walk under the ladder. PoS tagging will identify man and ladder as nouns and walk as a verb. Chunking will return: [the tall man]_np [is going to quickly walk]_vp [under the ladder]_pp (np stands for “noun phrase,” vp stands for “verb phrase,” and pp stands for “prepositional phrase.”) 6. Syntax parsing Syntax parsing is the analysis of how a sentence is formed. Syntax parsing is a critical preparatory step in sentiment analysis and other natural language processing features. The same sentence can have multiple meanings depending on how it’s structured: • Apple was doing poorly until Steve Jobs… • Because Apple was doing poorly, Steve Jobs… • Apple was doing poorly because Steve Jobs… In the first sentence, Apple is negative, whereas Steve Jobs is positive. In the second, Apple is still negative, but Steve Jobs is now neutral. In the final example, both Apple and Steve Jobs are negative. Syntax parsing is one of the most computationally-intensive steps in text analytics. At Lexalytics, we use special unsupervised machine learning models, based on billions of input words and complex matrix factorization, to help us understand syntax just like a human would. 7. Sentence chaining The final step in preparing unstructured text for deeper analysis is sentence chaining. Sentence chaining uses a technique called lexical chaining to connect individual sentences based on their association to a larger topic. Take the sentences: • I like beer. • Miller just launched a new pilsner. • But I only drink Belgian ale. Even if these sentences don’t appear near each other in a body of text, they are still connected to each other through the topics of beer->pilsner->ale. Lexical chaining allows us to make these kinds of connections. The “score” of a lexical chain is directly related to the length of the chain and the relationships between the chaining nouns (same words, antonyms, synonyms, homonyms, meronyms, hypernyms or holonyms). Lexical chains flow through the document and help a machine detect over-arching topics and quantify the overall “feel”. Lexalytics uses sentence chaining to weight individual themes, compare sentiment scores and summarize long documents. Basic applications of text mining and natural language processing Text mining and natural language processing technologies add powerful historical and predictive analytics capabilities to business intelligence and data analytics platforms. The flexibility and customizability of these systems make them applicable across a wide range of industries, such as hospitality, financial services, pharmaceuticals, and retail. Broadly speaking, applications of text mining and NLP fall into three categories: Voice of Customer Customer Experience Management and Market Research It can take years to gain a customer, but only minutes to lose them. Business analysts use text mining tools to understand what consumers are saying about their brands, products and services on social media, in open-ended experience surveys, and around the web. Through sentiment analysis, categorization and other natural language processing features, text mining tools form the backbone of data-driven Voice of Customer programs. Read more about text analytics for Voice of Customer Social Media Monitoring Social Listening and Brand Management Social media users generate a goldmine of natural-language content for brands to mine. But social comments are usually riddled with spelling errors, and laden with abbreviations, acronyms, and emoticons. The sheer volume poses a problem, too. On your own, analyzing all this data would be impossible. Business Intelligence tools like the Lexalytics Intelligence Platform use text analytics and natural language processing to quickly transform these mountains of hashtags, slang, and poor grammar into useful data and insights into how people feel, in their own words. Read more about text analytics for Social Media Monitoring Voice of Employee Workforce Analytics and Employee Satisfaction The cost of replacing a single employee can range from 20-30% of salary. But companies struggle to attract and retain good talent. Structured employee satisfaction surveys rarely give people the chance to voice their true opinions. And by the time you’ve identified the causes of the factors that reduce productivity and drive employees to leave, it’s too late. Text analytics tools help human resources professionals uncover and act on these issues faster and more effectively, cutting off employee churn at the source. Read more about text analytics for Voice of Employee Further reading Try text analytics and text mining for free Text analytics and NLP in action: Try our web demo for a quick sample of Lexalytics’ own text analytics and NLP features Signup for an interactive demo of the Lexalytics Intelligence Platform with sample data and dashboards Contact us for a live demo with your data, or to discuss our on-premise and cloud APIs Build your own text analytics system: Start working with the Stanford NLP or Natural Language Toolkit (NLTK) open source distributions Browse this Predictive Analytics list of 27 free text analytics toolkits Take a Coursera course on Text Mining and Analytics Learn more about text analytics Read about sentiment analysis and other natural language processing features Explore the difference between machine learning and natural language processing Dive deep with this practitioner’s guide to NLP on KDnuggets

Post a Comment

0 Comments