Text Mining: Fundamentals


Spring 2017

Today
● Discussion of lab exercise
● Syllabus notes
● Fundamentals of text analysis: terms, tokens, unigrams, n-grams, tokenization, documents, bag of words, term frequencies, the bag-of-words model, collocations, concordances, keyword-in-context, and (briefly) entities, stemming, and lemmatization
● Introduction to Jupyter Notebooks
● Next week

Kiki Watch
Current age: still 6 months
Current height: still short
Current trick: finds her ball on command

Voyant

Administrative Notes

Office hours
Using iSchool drop-in hours: go.ischool.illinois.edu/dropin
When you're in the room, you can select one or more people, right-click, and select "New Breakout Room"

Lab Marking and Late Policy
● There are 10 lab exercises, together worth 30% of your mark (3% each)
● Labs are marked out of 10. Sometimes just doing the task earns the full 10/10; other times the marks are divided across tasks.
● Due: 1 hour before the following week's class (Thursdays @ 5:15pm)

Late Policy
● Lose 10% per day, up to 50%. Late is better than never.
● 2 late 'freebies': we won't count late marks for two labs. Sometimes life gets in the way.
● Last day to submit late labs: May 3rd. Other assignments: May 8th.

Study Group
Are you interested in study group hours? i.e. a chat room for students, paired with meeting hours for working on labs. This is up to you!

Fundamentals of Text Mining

I have a dream that one day this nation will rise up and live out the true meaning of its creed: "We hold these truths to be self-evident, that all men are created equal." I have a dream that one day on the red hills of Georgia, the sons of former slaves and the sons of former slave owners will be able to sit down together at the table of brotherhood. I have a dream that one day even the state of Mississippi, a state sweltering with the heat of injustice, sweltering with the heat of oppression, will be transformed into an oasis of freedom and justice.
I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character.

Token
The unit of measurement chosen to split up a text. In this course, our tokens will almost always be TERM TOKENS, but tokens can also be characters, sentences, phonemes, syllables, or phrases.

Tokenization
● The process of splitting up a text into tokens

speech.split()
['I', 'have', 'a', 'dream', 'that', 'one', 'day', 'this', 'nation', 'will', 'rise', 'up', 'and', 'live', 'out', 'the', 'true', 'meaning', 'of', 'its', 'creed:', '"We', 'hold', 'these', 'truths', 'to', 'be', 'self-evident,', 'that', 'all', 'men', 'are', 'created', 'equal."', 'I', 'have', 'a', 'dream', 'that', 'one', 'day', 'on', 'the', 'red', 'hills', 'of', 'Georgia,', 'the', 'sons', 'of', 'former', 'slaves', 'and', 'the', 'sons', 'of', 'former', 'slave', 'owners', 'will', 'be', 'able', 'to', 'sit', 'down', 'together', 'at', 'the', 'table', 'of', 'brotherhood.', 'I', 'have', 'a', 'dream', 'that', 'one', 'day', 'even', 'the', 'state', 'of', 'Mississippi,', 'a', 'state', 'sweltering', 'with', 'the', 'heat', 'of', 'injustice,', 'sweltering', 'with', 'the', 'heat', 'of', 'oppression,', 'will', 'be', 'transformed', 'into', 'an', 'oasis', 'of', 'freedom', 'and', 'justice.', 'I', 'have', 'a', 'dream', 'that', 'my', 'four', 'little', 'children', 'will', 'one', 'day', 'live', 'in', 'a', 'nation', 'where', 'they', 'will', 'not', 'be', 'judged', 'by', 'the', 'color', 'of', 'their', 'skin', 'but', 'by', 'the', 'content', 'of', 'their', 'character.']
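Notice that plain `split()` only breaks on whitespace, so punctuation stays glued to the neighbouring word. A quick check (a minimal sketch on a fragment of the speech, not part of the lab code):

```python
# Whitespace splitting keeps punctuation attached to words.
text = 'live out the true meaning of its creed: "We hold these truths'
tokens = text.split()

print(tokens[7])  # creed:  (the colon is part of the token)
print(tokens[8])  # "We    (the quotation mark is part of the token)
```

This is exactly why tokenizers such as NLTK's, shown next, separate punctuation into its own tokens.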
from nltk.tokenize import word_tokenize
word_tokenize(speech)
['I', 'have', 'a', 'dream', 'that', 'one', 'day', 'this', 'nation', 'will', 'rise', 'up', 'and', 'live', 'out', 'the', 'true', 'meaning', 'of', 'its', 'creed', ':', '``', 'We', 'hold', 'these', 'truths', 'to', 'be', 'self-evident', ',', 'that', 'all', 'men', 'are', 'created', 'equal', '.', "''", 'I', 'have', 'a', 'dream', 'that', 'one', 'day', 'on', 'the', 'red', 'hills', 'of', 'Georgia', ',', 'the', 'sons', 'of', 'former', 'slaves', 'and', 'the', 'sons', 'of', 'former', 'slave', 'owners', 'will', 'be', 'able', 'to', 'sit', 'down', 'together', 'at', 'the', 'table', 'of', 'brotherhood', '.', 'I', 'have', 'a', 'dream', 'that', 'one', 'day', 'even', 'the', 'state', 'of', 'Mississippi', ',', 'a', 'state', 'sweltering', 'with', 'the', 'heat', 'of', 'injustice', ',', 'sweltering', 'with', 'the', 'heat', 'of', 'oppression', ',', 'will', 'be', 'transformed', 'into', 'an', 'oasis', 'of', 'freedom', 'and', 'justice', '.', 'I', 'have', 'a', 'dream', 'that', 'my', 'four', 'little', 'children', 'will', 'one', 'day', 'live', 'in', 'a', 'nation', 'where', 'they', 'will', 'not', 'be', 'judged', 'by', 'the', 'color', 'of', 'their', 'skin', 'but', 'by', 'the', 'content', 'of', 'their', 'character', '.']
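`word_tokenize` requires the NLTK package (and its tokenizer models) to be installed. The core idea it illustrates, keeping words whole while emitting punctuation as separate tokens, can be roughly approximated with just the standard library; `simple_tokenize` below is a hypothetical helper, not part of NLTK, and it does not reproduce NLTK details such as the `` `` `` / `''` quote tokens:

```python
import re

def simple_tokenize(text):
    """Rough sketch of word-level tokenization: keep hyphenated
    words together, emit each punctuation mark as its own token."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

print(simple_tokenize('self-evident, that all men are created equal.'))
# ['self-evident', ',', 'that', 'all', 'men', 'are', 'created', 'equal', '.']
```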
from nltk.tokenize import sent_tokenize
sent_tokenize(speech)
['I have a dream that one day this nation will rise up and live out the true meaning of its creed: "We hold these truths to be self-evident, that all men are created equal."', 'I have a dream that one day on the red hills of Georgia, the sons of former slaves and the sons of former slave owners will be able to sit down together at the table of brotherhood.', 'I have a dream that one day even the state of Mississippi, a state sweltering with the heat of injustice, sweltering with the heat of oppression, will be transformed into an oasis of freedom and justice.', 'I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character.']
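N-grams, discussed next, can be built from any of these token lists with a few lines of plain Python; `ngrams` here is a hypothetical helper for illustration (NLTK also provides `nltk.ngrams`):

```python
def ngrams(tokens, n):
    """Return the n-token sequences in order, joined as strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['I', 'have', 'a', 'dream']
print(ngrams(tokens, 2))  # ['I have', 'have a', 'a dream']
print(ngrams(tokens, 3))  # ['I have a', 'have a dream']
```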
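Once a text is tokenized, term frequencies (counts of tokens by type, covered below) are a one-liner with the standard library; a minimal sketch on a made-up fragment:

```python
from collections import Counter

tokens = 'i have a dream that one day i have a dream'.split()
counts = Counter(tokens)

print(counts['dream'], counts['i'])  # 2 2
```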
Ngrams
● Number of tokens collected together
○ Unigram: ['I', 'have', 'a', 'dream']
○ Bigram: ['I have', 'have a', 'a dream']
○ Trigram: ['I have a', 'have a dream']
○ After that: 4-grams, 5-grams, ...

Ngram Viewer
Data alert: you can download all the info!

Word Cloud

Frequencies
● Counts of tokens by type

Stopping
Stopping: the process of removing words from analysis by using a list of uninteresting words

Bag of Words
● A simplifying assumption about texts that ignores positional information.

Problems with the BOW Model
● Missing context! Meaning isn't solely in words:
○ The cat chased the dog.
○ The dog chased the cat.
■ Both reduce to the same bag: cat: 1, chased: 1, dog: 1, the: 2

So why would we use it?
● Good enough! BOW simplifies the mathematical assumptions and the complexity of our models.
● Quality is unintuitively robust in cases such as information retrieval (unigram models), classification (naive Bayes), and topic modelling (latent Dirichlet allocation).

Better Term Tokens
CHICAGO vs. NEW YORK CITY
No reason to let notions of "words" drag us down. Why not allow entity phrases when we can find them?

Entity Extraction
● The process of identifying named entities in a text and their types. Commonly: locations, organizations, persons.
● NER: Named Entity Recognizers
● Part of a body of information extraction techniques that we'll cover in the next three weeks: learning more about the word
○ Part-of-speech tagging

Concordances; also Keyword-in-Context (KWIC)

Collocation
A collocation is the name given to a pair, group, or sequence of words that occur together more often than would be expected by chance. (Croft, Metzler and Strohman)

Stemming and Lemmatization
Word-level transformations to normalize different word forms
● Stemming: heuristic (i.e. rule-based) normalization.
○ E.g. 'fish', 'fishes', 'fishing' → 'fish'
● Lemmatization: more complex: morphological analysis, dictionaries, sometimes algorithmically trained
○ E.g. goose/geese, mouse/mice
More in our NLP lecture, but good to know the terms.

Introduction to Jupyter Notebooks

Homework

Lab Task
This week's lab task is posted externally, so you can see the code and output nicely formatted. There is no writing portion, just questions, so it is posted as a quiz (whereas last week's was a forum response).

Questions?
