Tokenization¶
A token is the technical name for a sequence of characters — such as car, his, or :) — that we want to treat as a group. Tokenization is breaking up a text into these tokens.
The most common Python module for natural language processing is nltk, an acronym for the Natural Language Toolkit. There are several tokenization commands that can be imported with from nltk.tokenize import <cmd>, where <cmd> is one of the token commands in the table below; a short usage sketch follows the table.
token commands | description | explanation |
---|---|---|
sent_tokenize | tokenize a document into sentences | breaks up an article into sentences |
word_tokenize | tokenize a document into words | breaks up an article into words |
regexp_tokenize | tokenize based on a regex | lets you decide how you want to break up the text using a regular expression |
TweetTokenizer | tokenizer for Tweets | tokenizes while keeping #hashtags, @mentions, and smileys intact; word_tokenize would split on #, @, punctuation, symbols, etc. |
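As a quick illustration, here is a minimal sketch using a made-up example sentence (it assumes nltk and its punkt data are already installed; the downloads are shown in the next section):
from nltk.tokenize import sent_tokenize, word_tokenize
text = "I love NLP. Tokenization is fun!" # hypothetical example text
print(sent_tokenize(text)) # ['I love NLP.', 'Tokenization is fun!']
print(word_tokenize(text)) # ['I', 'love', 'NLP', '.', 'Tokenization', 'is', 'fun', '!']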
Importing modules¶
import os # manipulation of file directories
import itertools # tools for iterator functions
import nltk # natural language toolkit module
import re # regex module
nltk.download('punkt') # for sentence tokenization
nltk.download('popular') # downloads the popular collection of corpora and models
from nltk.tokenize import sent_tokenize # sentence tokenization
from nltk.tokenize import word_tokenize # word tokenization
from nltk.tokenize import regexp_tokenize # regex tokenization
Reading a file¶
The article below was downloaded from aeon.co, a great website that I read whenever I can! The article, entitled How materialism became an ethos of hope for Jewish reformers, is used as the example text in this exercise.
currDir = os.getcwd()
fileName = 'aeon.txt'
readFile = os.path.join(currDir, 'inputs', fileName) # build the path in an OS-independent way
with open(readFile, 'r') as f: # the context manager closes the file automatically
    article = f.read()
print(article)
Break up article into sentences¶
sentences = sent_tokenize(article)
Print out different sentences¶
print(sentences[0])
print(sentences[1])
print(sentences[3])
regexp_tokenize¶
regexp_tokenize is powerful because it lets you decide exactly how tokens should be extracted from your string: you can tokenize based on specific patterns, or even emojis.
my_string = "Morpheus: This is your last chance. After this, there is no turning back. You take the blue pill - the story ends, you wake up in your bed and believe whatever you want to believe. You take the red pill - you stay in Wonderland and I show you how deep the rabbit-hole goes. Do you want to go ?!55"
pattern1 = r"\w+(\?!)" # word characters immediately followed by ?!
print(regexp_tokenize(my_string,pattern1))
pattern2 = r'(\w+|#\d|\?|!)' # words OR # followed by a digit OR ? OR !
regexp_tokenize(my_string,pattern2)
pattern3 = r'(#\d\w+\?!)' # # followed by a digit, word characters, and ?!
regexp_tokenize(my_string,pattern3)
pattern4 = r'\s+' # one or more whitespace characters
regexp_tokenize(my_string,pattern4)
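One thing to note is that regexp_tokenize (with its default settings) behaves like re.findall: if the pattern contains a capturing group, only the text captured by the group is returned rather than the whole match. A minimal sketch on a made-up string:
from nltk.tokenize import regexp_tokenize
sample = "wow?! nice?! ok" # hypothetical sample string
print(regexp_tokenize(sample, r"\w+\?!")) # no group: whole matches, ['wow?!', 'nice?!']
print(regexp_tokenize(sample, r"\w+(\?!)")) # capturing group: only the group is returned, ['?!', '?!']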
Tokenize sentence 3 into words¶
word_sent3 = word_tokenize(sentences[3])
print(word_sent3)
Extract the set of unique tokens from the entire article¶
uniq_words = set(word_tokenize(article))
print(uniq_words)
for i, val in enumerate(itertools.islice(uniq_words, 10)): # show 10 of the unique tokens (sets are unordered)
    print(i, val)
Searching and matching¶
re.match checks for a match only at the beginning of the string, while re.search checks for a match anywhere in the string.
pattern = r'J\w+' # a word starting with a capital J
re.search(pattern,article) # finds the first such word anywhere in the article
pattern = r"\d+s" # one or more digits followed by the letter s
re.search(pattern,article)
pattern = r'\w+' # word characters
re.match(pattern,article) # only checks the beginning of the article
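To see the difference explicitly, here is a minimal sketch on a made-up string: search finds the pattern anywhere, while match returns None because the pattern does not occur at the start.
import re
test = "take the red pill" # hypothetical test string
print(re.search(r"pill", test)) # a match object: 'pill' occurs later in the string
print(re.match(r"pill", test)) # None: 'pill' is not at the beginning of the string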
Tokenizing Tweets¶
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer
tweet_sample = ['iPython jupyter for #NLP is awesome #18 all the best to @rand',
                '#nlp by @rand is good :)',
                'We should :-( use #python for ^^ analytics :(',
                '#python is good for qualitative analysis <3',
                'I always use #PYthon']
Read all #hashtags from a tweet
pattern_hashtag = r"#\w+"
regexp_tokenize(tweet_sample[0],pattern_hashtag)
Read all #hashtags and @mentions
pattern_hash_mention = r"[@#]\w+"
regexp_tokenize(tweet_sample[0],pattern_hash_mention)
Using TweetTokenizer means that the tokenizer will keep the following intact as single tokens:
- # (hashtags)
- @ (mentions)
- :) :-( <3 (smileys)
tweetTknzr = TweetTokenizer()
tokenList = [tweetTknzr.tokenize(tweet) for tweet in tweet_sample]
print(tokenList)
If we just use word_tokenize, we find that it does not recognize hashtags, mentions, smileys, etc. as single tokens.
tokenList = [word_tokenize(tweet) for tweet in tweet_sample]
print(tokenList)