Text Mining: Fundamentals
Spring 2017
Today
● Discussion of lab exercise
● Syllabus notes
● Fundamentals of text analysis: terms, tokens, unigrams, n-grams,
tokenization, documents, term frequencies, the bag-of-words model,
collocations, concordances, keyword-in-context, and (briefly)
entities, stemming, and lemmatization
● Introduction to Jupyter Notebooks
● Next Week
Kiki Watch
Current age: Still 6 months
Current height: Still short
Current trick: find her ball on command
Voyant
Administrative Notes
Office hours
Using iSchool drop-in hours: go.ischool.illinois.edu/dropin
When you're in the room, you can select one or more people, right-click, and
select "New Breakout Room"
Lab Marking and Late Policy
● There are 10 lab exercises, worth 30% of your mark (3% each)
● Labs are marked out of 10. Sometimes simply completing the task earns the full
10/10; other times the marks are divided across tasks.
● Due: 1 hour before the following week's class (Thurs @ 5:15pm)
Late Policy
● Lose 10% per day, up to 50%. Late is better than never.
● 2 late 'freebies': We won't count late marks for two labs. Sometimes life gets in the
way.
● Last day to submit late labs: May 3rd. Other assignments: May 8th.
Study Group
Are you interested in study group hours?
i.e., a chat room for students, paired with meeting hours for working on labs.
This is up to you!
Fundamentals of Text Mining
I have a dream that one day this nation will rise up and live out the true meaning of its creed: "We hold
these truths to be self-evident, that all men are created equal."
I have a dream that one day on the red hills of Georgia, the sons of former slaves and the sons of former
slave owners will be able to sit down together at the table of brotherhood.
I have a dream that one day even the state of Mississippi, a state sweltering with the heat of injustice,
sweltering with the heat of oppression, will be transformed into an oasis of freedom and justice.
I have a dream that my four little children will one day live in a nation where they will not be judged by
the color of their skin but by the content of their character.
Token
The unit of measurement chosen to split up a text.
In this course, our tokens will almost always be:
TERM TOKENS
But can be:
Characters, sentences, phonemes, syllables, phrases
Tokenization
● The process of splitting up a text into tokens
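The examples on the next few slides assume a variable named speech holding the excerpt quoted above. This setup line is ours, not part of the original slides:

speech = """I have a dream that one day this nation will rise up and live out the
true meaning of its creed: "We hold these truths to be self-evident, that all men
are created equal." I have a dream that one day on the red hills of Georgia, the
sons of former slaves and the sons of former slave owners will be able to sit down
together at the table of brotherhood. I have a dream that one day even the state
of Mississippi, a state sweltering with the heat of injustice, sweltering with the
heat of oppression, will be transformed into an oasis of freedom and justice. I
have a dream that my four little children will one day live in a nation where they
will not be judged by the color of their skin but by the content of their
character."""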
speech.split()

['I', 'have', 'a', 'dream', 'that', 'one', 'day', 'this', 'nation', 'will', 'rise', 'up', 'and', 'live', 'out', 'the',
'true', 'meaning', 'of', 'its', 'creed:', '"We', 'hold', 'these', 'truths', 'to', 'be', 'self-evident,', 'that',
'all', 'men', 'are', 'created', 'equal."', 'I', 'have', 'a', 'dream', 'that', 'one', 'day', 'on', 'the', 'red',
'hills', 'of', 'Georgia,', 'the', 'sons', 'of', 'former', 'slaves', 'and', 'the', 'sons', 'of', 'former', 'slave',
'owners', 'will', 'be', 'able', 'to', 'sit', 'down', 'together', 'at', 'the', 'table', 'of', 'brotherhood.', 'I',
'have', 'a', 'dream', 'that', 'one', 'day', 'even', 'the', 'state', 'of', 'Mississippi,', 'a', 'state',
'sweltering', 'with', 'the', 'heat', 'of', 'injustice,', 'sweltering', 'with', 'the', 'heat', 'of', 'oppression,',
'will', 'be', 'transformed', 'into', 'an', 'oasis', 'of', 'freedom', 'and', 'justice.', 'I', 'have', 'a', 'dream',
'that', 'my', 'four', 'little', 'children', 'will', 'one', 'day', 'live', 'in', 'a', 'nation', 'where', 'they', 'will',
'not', 'be', 'judged', 'by', 'the', 'color', 'of', 'their', 'skin', 'but', 'by', 'the', 'content', 'of', 'their',
'character.']
from nltk.tokenize import word_tokenize
word_tokenize(speech)

['I', 'have', 'a', 'dream', 'that', 'one', 'day', 'this', 'nation', 'will', 'rise', 'up', 'and', 'live', 'out', 'the',
'true', 'meaning', 'of', 'its', 'creed', ':', '``', 'We', 'hold', 'these', 'truths', 'to', 'be', 'self-evident', ',',
'that', 'all', 'men', 'are', 'created', 'equal', '.', "''", 'I', 'have', 'a', 'dream', 'that', 'one', 'day', 'on',
'the', 'red', 'hills', 'of', 'Georgia', ',', 'the', 'sons', 'of', 'former', 'slaves', 'and', 'the', 'sons', 'of',
'former', 'slave', 'owners', 'will', 'be', 'able', 'to', 'sit', 'down', 'together', 'at', 'the', 'table', 'of',
'brotherhood', '.', 'I', 'have', 'a', 'dream', 'that', 'one', 'day', 'even', 'the', 'state', 'of', 'Mississippi',
',', 'a', 'state', 'sweltering', 'with', 'the', 'heat', 'of', 'injustice', ',', 'sweltering', 'with', 'the', 'heat',
'of', 'oppression', ',', 'will', 'be', 'transformed', 'into', 'an', 'oasis', 'of', 'freedom', 'and', 'justice', '.',
'I', 'have', 'a', 'dream', 'that', 'my', 'four', 'little', 'children', 'will', 'one', 'day', 'live', 'in', 'a', 'nation',
'where', 'they', 'will', 'not', 'be', 'judged', 'by', 'the', 'color', 'of', 'their', 'skin', 'but', 'by', 'the',
'content', 'of', 'their', 'character', '.']

Note how, unlike split(), word_tokenize separates punctuation into its own tokens (e.g. 'creed', ':', ',', '.').
from nltk.tokenize import sent_tokenize
sent_tokenize(speech)

['I have a dream that one day this nation will rise up and live out the true meaning of its
creed: "We hold these truths to be self-evident, that all men are created equal."',
'I have a dream that one day on the red hills of Georgia, the sons of former slaves and the
sons of former slave owners will be able to sit down together at the table of brotherhood.',
'I have a dream that one day even the state of Mississippi, a state sweltering with the heat of
injustice, sweltering with the heat of oppression, will be transformed into an oasis of freedom
and justice.',
'I have a dream that my four little children will one day live in a nation where they will not be
judged by the color of their skin but by the content of their character. ']
Ngrams
● An n-gram is a sequence of n tokens collected together (see the sketch after this list)
○ Unigrams: ['I', 'have', 'a', 'dream']
○ Bigrams: ['I have', 'have a', 'a dream']
○ Trigrams: ['I have a', 'have a dream']
○ And beyond: 4-grams, 5-grams, ...
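A minimal sketch of generating bigrams by hand (nltk.util.ngrams offers the same thing), assuming tokens is the whitespace-split speech from the earlier slide:

tokens = speech.split()
# Pair each token with its successor to form bigrams
bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
bigrams[:3]   # ['I have', 'have a', 'a dream']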
Ngram Viewer
Data alert: you can download all the info!
Word Cloud
Frequencies
● Counts of tokens by type
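One common way to count token frequencies, sketched here with Python's collections.Counter over the tokenized speech:

from collections import Counter
from nltk.tokenize import word_tokenize

# Lowercase first so 'The' and 'the' count as the same type
counts = Counter(word_tokenize(speech.lower()))
counts.most_common(5)   # the five most frequent token types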
Stopping
Stopping: the process of removing words from analysis by using a list of uninteresting words (stopwords)
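A sketch using NLTK's built-in English stoplist (requires nltk.download('stopwords')):

from nltk.corpus import stopwords

stops = set(stopwords.words('english'))
# Keep only the tokens that do not appear on the stoplist
content = [t for t in speech.split() if t.lower() not in stops]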
Bag of Words
● A simplifying assumption about texts that ignores positional information: a document is treated as an unordered bag of its tokens.
Problems with BOW Model
● Missing context!
● Meaning isn't solely in words
○ The cat chased the dog.
○ The dog chased the cat.
■ cat:1
■ chased:1
■ dog:1
■ the:2
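Both sentences reduce to exactly the same bag of counts; a quick sketch showing the model cannot tell them apart:

from collections import Counter

a = Counter('the cat chased the dog'.split())
b = Counter('the dog chased the cat'.split())
a == b   # True: identical bags, despite opposite meanings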
So why would we use it?
● Good enough! BOW simplifies the mathematical assumptions and the complexity of our models.
● Quality is unintuitively robust in cases such as information retrieval (unigram models), classification (Naive Bayes), and topic modelling (Latent Dirichlet Allocation).
Better term tokens
CHICAGO
vs.
NEW YORK CITY
No reason to let notions of "words" drag us down. Why not allow entity phrases
when we can find them?
Entity Extraction
● Process of identifying what type of noun a word is. Commonly: locations,
organizations, persons.
● NER: Named Entity Recognizers
● Part of a body of information extraction techniques that we'll cover in the
next three weeks: learning more about each word
○ Part-of-speech tagging
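A minimal NER sketch with NLTK's ne_chunk (requires several nltk.download() models, e.g. 'punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', and 'words'); the example sentence is our own:

from nltk import word_tokenize, pos_tag, ne_chunk

# ne_chunk labels POS-tagged tokens with entity types (PERSON, GPE, ORGANIZATION, ...)
tree = ne_chunk(pos_tag(word_tokenize('Martin Luther King spoke in Washington.')))
print(tree)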
Concordances; also Keyword-in-Context (KWIC)
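A concordance lists every occurrence of a keyword centered in its surrounding context. A minimal sketch with NLTK's Text wrapper, again assuming speech from earlier:

from nltk.text import Text
from nltk.tokenize import word_tokenize

text = Text(word_tokenize(speech))
text.concordance('dream')   # prints each occurrence of 'dream' with its context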
Collocation
A collocation is the name given to a pair, group, or sequence of words that occur together more often than would be expected by chance. (Croft, Metzler, and Strohman)
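A sketch of finding collocations with NLTK's bigram finder, scoring pairs by pointwise mutual information:

from nltk.tokenize import word_tokenize
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(speech.lower()))
finder.nbest(measures.pmi, 5)   # the five most strongly associated bigrams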
Stemming and Lemmatization
Word-level transformations to normalize different word forms
● Stemming: heuristic (i.e. rule-based) normalization.
○ E.g. 'fish', 'fishes', and 'fishing' all reduce to the stem 'fish'
● Lemmatization: more complex: morphological analysis, dictionaries, sometimes algorithmically trained.
○ E.g. geese → goose, mice → mouse
More in our NLP lecture, but good to know the terms
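A sketch of both, using NLTK's Porter stemmer and WordNet lemmatizer (the latter requires nltk.download('wordnet')):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
[stemmer.stem(w) for w in ['fish', 'fishes', 'fishing']]   # ['fish', 'fish', 'fish']

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('geese')   # 'goose' (nouns are the default part of speech)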
Introduction to Jupyter Notebooks
Homework
Lab Task
This week's lab task is posted externally, so you can see the code and output nicely formatted.
There is no writing portion, just questions, so it is posted as a quiz (whereas last week's was a forum response).
Questions?