Tokenization

A token is the technical name for a sequence of characters — such as car, his, or :) — that we want to treat as a group. Tokenization is breaking up a text into these tokens.

One of the most widely used Python modules for natural language processing is nltk, short for the Natural Language Toolkit. It provides several tokenizers, each of which can be imported with from nltk.tokenize import <cmd>, where <cmd> is one of the commands below:

| Command | Description | Explanation |
|---|---|---|
| sent_tokenize | tokenize a document into sentences | breaks up an article into sentences |
| word_tokenize | tokenize a document into words | breaks up an article into words |
| regexp_tokenize | tokenize based on a regex | lets you decide how to break up the text using a regular expression |
| TweetTokenizer | tokenizer for Tweets | tokenizes but keeps #hashtags, @mentions and smileys intact; word_tokenize would split on #, @, punctuation, symbols, etc. |

Importing modules

In [31]:
import os # manipulation of file directories
import itertools # tools for iterator functions

import nltk # natural language toolkit module
import re # regex module

nltk.download('punkt') # for sentence tokenization
nltk.download('popular')
from nltk.tokenize import sent_tokenize # sentence tokenization
from nltk.tokenize import word_tokenize # word tokenization
from nltk.tokenize import regexp_tokenize # regex tokenization
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\cmudict.zip.
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\gazetteers.zip.
[nltk_data]    | Downloading package genesis to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\genesis.zip.
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\gutenberg.zip.
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\movie_reviews.zip.
[nltk_data]    | Downloading package names to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\names.zip.
[nltk_data]    | Downloading package shakespeare to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\shakespeare.zip.
[nltk_data]    | Downloading package stopwords to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    |   Package stopwords is already up-to-date!
[nltk_data]    | Downloading package treebank to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\treebank.zip.
[nltk_data]    | Downloading package twitter_samples to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\twitter_samples.zip.
[nltk_data]    | Downloading package omw to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\omw.zip.
[nltk_data]    | Downloading package wordnet to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    |   Package wordnet is already up-to-date!
[nltk_data]    | Downloading package wordnet_ic to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\wordnet_ic.zip.
[nltk_data]    | Downloading package words to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    |   Package words is already up-to-date!
[nltk_data]    | Downloading package maxent_ne_chunker to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    |   Package maxent_ne_chunker is already up-to-date!
[nltk_data]    | Downloading package punkt to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    |   Package punkt is already up-to-date!
[nltk_data]    | Downloading package snowball_data to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | 
[nltk_data]  Done downloading collection popular

Reading a file

The article below was downloaded from aeon.co, a great website that I read whenever I can. It is entitled How materialism became an ethos of hope for Jewish reformers, and I use it as the example text in this exercise.

In [14]:
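currDir = os.getcwd()
fileName = 'aeon.txt'

# build the path with os.path.join so it is not tied to Windows separators
readFile = os.path.join(currDir, 'inputs', fileName)

# the with-block closes the file automatically, even if reading fails
with open(readFile, 'r') as f:
    article = f.read()
print(article)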
Be ‘a man in the street and a Jew in the home’: a common piece of advice that liberal Jews often gave their co-religionists in the 19th century. If Jewishness was kept invisible and private, they wagered, then Jews could become citizens and professionals, and be granted equal access to the material resources made available to any other member of society. There was plenty of Christian bias to combat, encapsulated by images of Jewish avarice and materialism such as Shylock’s greedy hands and Rothschild’s beard in the form of snake-like tentacles. If only Jews could fit into the spiritual boxes established by the European Protestant elite, they would be accepted, or at least tolerated in the public sphere as Frenchmen, Germans or Englishmen. Though compelling in theory, the deal became more fraught as rampant anti-Semitic violence in eastern Europe continued to remind Jews that, no matter how much they tried to look like ‘everyone else’, their bodies were marked as Jewish. 
In the 1870s European Judaism underwent an intellectual revolution. Around then, a group of young Russian Jewish radicals began to identify Judaism with materialism, and to theorise about what they called – whether in Russian, German, Yiddish or Hebrew – the ‘material’ (material’nii, materiell, gashmi, ?omri) aspects of the Universe. For many Jews living in this period, ‘materialism’ was a worldview that brought into focus latent Jewish ideas and beliefs about the physical world. The materialists claimed that a theory of Judaism, defined by the way people related to land, labour and bodies, had been lying dormant within Jewish literature – in Hasidic texts, the Bible, Spinoza’s philosophy – and could now be clearly recognised and fully articulated. Jewish particularity was based on specific historical economic differences between Jews and others. What made Jews different was a certain socioeconomic dynamic that distinguished them from their neighbours.
The Jewish revolutionaries in 1870s Russia who embraced the idea of materialism shared a number of critical assumptions. They all rejected the notion that Judaism was based on abstract metaphysical theories (Scholasticism), rituals (Hasidism), study (Mitnagdim), and ethics and reason (Enlighteners). Judaism was not a religion, like Protestantism. Instead it was something attached to their bodies and expressed through one’s relationship to land, labour and resources. The materialists had also given up hope that the state could protect them and ensure their economic wellbeing. And finally, they no longer believed that history was headed in a positive direction. Over no amount of time would Jews living in Russia ever be granted greater rights and opportunities. Therefore, only a radical reclaiming of the physical world on the part of Jews could ensure that they would be protected and given a fair and equal share of resources.
Soon, the Jewish materialism of the Russians could be found among western European Jews residing in England and Germany. Only half-jokingly, the German anarchist Gustav Landauer claimed in 1921 that what distinguished ‘the modern “conscious” Jew from a German was that when the latter writes about … the conservation of energy, … he writes about the conservation of energy, but when the conscious Jew writes about the conservation of energy, he writes about the conservation of energy and Judaism’ (emphasis mine). Eventually, there would be those, such as the Englishman Israel Zangwill, who considered themselves adherents to ‘a religion of pots and pans’, and others who identified Judaism as a faith based on ‘bagels and lox’. Over the course of the 20th century, Jews would increasingly come to believe that ‘there is nothing purely spiritual that stands on its own … Everything spiritual requires a necessary material basis.’
Updates on everything new at Aeon.
Top of Form
Bottom of Form
JJjjjJJ  Jewish materialists were despised not only by staunch liberals but also by ‘defenders of the faith’. Moses Leib Lilienblum, who would go on to found the Zionist movement in Russia, wrote a novel in which he described his youthful yeshiva education as one long masturbatory experience – for this, he was denounced by rabbis and communal leaders who forced him to flee his hometown in fear for his life. The future Russian revolutionary Hasia Schur was pelted with stones and jeered at by the townspeople of Mohilev for going on a Sabbath walk hand-in-hand with her boyfriend, the socialist Eliezer Tsukerman: the rabbis were up in arms that two young people had dared to touch one another in public. Jewish materialists were cast as upstarts, deviants, social provocateurs and, of course, with providing Jew-haters with excuses to promote anti-Semitism.
But the Jewish materialists’ deviancies reflected a radically new kind of Jewish identity, one focused on their bodies and the physical world. The Jewish body they imagined would offer a contrast to both the hunchbacked, traditional Jewish Torah scholar incapable of supporting his family, and the muscular gentile male whose energies were directed at conquering and dominating the physical world. The new Jewish body would be shaped in the image of a healthy traditional Jewish woman who laboured to provide for her family’s material wellbeing while her husband spent his day in the house of study: by tending to the material aspects of existence, Jews’ needs and desires would now be seen as the primary feature of Judaism. The material Jewish identity set the stage for Jews’ involvement in 20th-century politics: Zionism, Bundism (the Jewish labour movement), the Minority Rights movement, and Jewish forms of communism all assumed that the organising structure of Jewish identity was a Jewish body, and not a Judaism of the heavens or the heart. Jewish materialism made Jews political without them possessing their own state or even citizenship in a host country.
Though the idea of the Jewish body as the locus of collective identity would always be suspect in western Europe, it would, however, become the basis of a new kind of Jewish identity most commonly witnessed in Israel and the United States. Jewish immigrants to Palestine at the turn of the century saw in Zion the actualisation of materialism as first imagined in the 1870s. The Marxist Ber Borchov’s students, such as future leaders of Israel Yitzhak Ben-Zvi and David Ben-Gurion, identified Palestine as a response to the crisis of the fork and the knife (a pithy phrase meant to capture the economic challenges of Russian Jews in the 1870s) originally theorised by the Jewish materialist Aaron Shemuel Lieberman in the 1870s. They envisioned a new kind of Jew – the ?aluts (pioneer) – who was attached to the physical world. As described by the 20th-century Zionist poet Avraham Shlonsky, a former Hasidic Jew, the ?aluts would be the embodiment of the idea that ‘a human being is meat, and he toils here in the sacred/and the land/bread’. The people of the book had now become a people of labour, land and the body.
In the US, eastern European Jews established large-scale defence organisations directed at protecting Jewish bodies and providing a platform for Jews to speak as a distinct ethnic minority in the American public sphere. From the poet Emma Lazarus to the American rabbi Mordecai Kaplan to the philosopher Horace Kallen, American Jews in the early 20th century developed political programmes and established organisations rooted around the physical aspects of Jewish life.
Jewish materialism remains the defining element of most American Jews’ identity. Following the Second World War, the influx of another wave of Jewish immigrants from Russian lands gave rise to a new brand of US literature that placed the Jewish body front and centre. The late US novelist Phillip Roth might have been familiar only in passing with the name Moses Lilienblum. But it was Lilienblum who put into circulation the Jewish genre of overbearing parents, unrealisable social expectations, failed sexual encounters, silly rabbis, bankrupt synagogues and God-fearing charlatans encased in a narrative about masturbation. Whether he knew it or not, when Roth wrote his novel Portnoy’s Complaint(1969), he was channelling the same tradition first articulated by Lilienblum a century earlier.
Roth took those commitments to his grave when he died on 22 May 2018. While the grandmaster of late-20th-century American letters asked to be interred next to Jews, he strictly prohibited the performance of any Jewish rituals at his funeral. His final requests, allegedly, were inspired by a desire ‘to have someone to talk to’. His corpse did not need a rabbi to eulogise it, or a perfunctory kaddish (or hymn) to kasher it; it was simply Jewish – nothing more and nothing less. Indeed, it was a fitting conclusion to the life of a Jewish materialist.




Break up article into sentences

In [15]:
sentences = sent_tokenize(article)
In [16]:
print(sentences[0])
print(sentences[1])
print(sentences[3])
Be ‘a man in the street and a Jew in the home’: a common piece of advice that liberal Jews often gave their co-religionists in the 19th century.
If Jewishness was kept invisible and private, they wagered, then Jews could become citizens and professionals, and be granted equal access to the material resources made available to any other member of society.
If only Jews could fit into the spiritual boxes established by the European Protestant elite, they would be accepted, or at least tolerated in the public sphere as Frenchmen, Germans or Englishmen.

regexp_tokenize

regexp_tokenize is powerful because it lets you define, with a regular expression, exactly what counts as a token. You can tokenize on specific patterns, including emoticons.

In [17]:
my_string = "Morpheus: This is your last chance. After this, there is no turning back. You take the blue pill - the story ends, you wake up in your bed and believe whatever you want to believe. You take the red pill - you stay in Wonderland and I show you how deep the rabbit-hole goes.  Do you want to go ?!55"
pattern1 = r"\w+(\?!)" # a word immediately followed by ?! (no match here: a space separates "go" and "?!")
print(regexp_tokenize(my_string,pattern1))
[]
In [18]:
pattern2 = r'(\w+|#\d|\?|!)' # words OR #digit OR ? OR !
regexp_tokenize(my_string,pattern2)
Out[18]:
['Morpheus',
 'This',
 'is',
 'your',
 'last',
 'chance',
 'After',
 'this',
 'there',
 'is',
 'no',
 'turning',
 'back',
 'You',
 'take',
 'the',
 'blue',
 'pill',
 'the',
 'story',
 'ends',
 'you',
 'wake',
 'up',
 'in',
 'your',
 'bed',
 'and',
 'believe',
 'whatever',
 'you',
 'want',
 'to',
 'believe',
 'You',
 'take',
 'the',
 'red',
 'pill',
 'you',
 'stay',
 'in',
 'Wonderland',
 'and',
 'I',
 'show',
 'you',
 'how',
 'deep',
 'the',
 'rabbit',
 'hole',
 'goes',
 'Do',
 'you',
 'want',
 'to',
 'go',
 '?',
 '!',
 '55']
In [19]:
pattern3 = r'(#\d\w+\?!)' # a hash, one digit, then word characters followed by ?! (nothing like this in the string)
regexp_tokenize(my_string,pattern3)
Out[19]:
[]
In [20]:
pattern4 = r'\s+' # one or more whitespace characters; the tokens returned are the whitespace runs themselves
regexp_tokenize(my_string,pattern4)
Out[20]:
[' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '  ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ']
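
A note on the parentheses in pattern1 to pattern3: as far as I know, when regexp_tokenize is not in gaps mode, recent NLTK versions pass the pattern to the regex engine's findall, so a capturing group changes what is returned. The sketch below uses the standard-library re.findall as a stand-in (an assumption, made so the snippet runs without NLTK) on a made-up string in which a word does touch ?!.

```python
import re

# Hypothetical sample in which a word is immediately followed by "?!"
sample = "Do you want to go?! Really?!"

# Capturing group: findall returns only the text captured by the group
with_group = re.findall(r"\w+(\?!)", sample)

# Non-capturing group (?:...): findall returns the whole match
without_group = re.findall(r"\w+(?:\?!)", sample)

print(with_group)     # ['?!', '?!']
print(without_group)  # ['go?!', 'Really?!']
```

So even when pattern1 does match, the capturing parentheses would keep only the ?! part; a non-capturing group is the safer choice when the whole token is wanted.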

Tokenize sentences[3] into words

In [21]:
word_sent3 = word_tokenize(sentences[3])
print(word_sent3)
['If', 'only', 'Jews', 'could', 'fit', 'into', 'the', 'spiritual', 'boxes', 'established', 'by', 'the', 'European', 'Protestant', 'elite', ',', 'they', 'would', 'be', 'accepted', ',', 'or', 'at', 'least', 'tolerated', 'in', 'the', 'public', 'sphere', 'as', 'Frenchmen', ',', 'Germans', 'or', 'Englishmen', '.']

Extract the set of unique tokens from the entire article

In [22]:
uniq_words = set(word_tokenize(article))
print(uniq_words)
for i, val in enumerate(itertools.islice(uniq_words, 10)):
    print(i,val)
{'rabbis', 'Rights', 'Israel', 'From', 'form', 'matter', 'silly', 'theory', 'recognised', '1870s', 'feature', 'World', '’', 'sacred/and', 'energies', 'or', 'rampant', 'shared', 'purely', 'Second', 'took', 'bagels', 'placed', 'based', 'They', 'necessary', 'Shlonsky', 'own', 'deviants', 'Christian', 'others', 'for', 'Zion', 'focused', 'lands', 'Kallen', 'Shylock', 'land/bread', 'bankrupt', 'jeered', 'snake-like', 'a', 'defined', 'existence', 'among', 'omri', 'Bottom', 'which', 'David', 'sexual', 'German', 'woman', 'centre', 'Bible', ':', 'only', 'citizenship', 'poet', 'ethics', 'another', 'developed', 'related', 'marked', 'Jew-haters', '?', 'course', 'primary', 'whose', 'Form', 'Yitzhak', 'heavens', 'piece', 'compelling', 'structure', 'fit', '”', '1969', 'requests', 'reflected', 'avarice', 'positive', 'For', 'spent', 'Horace', 'grave', 'invisible', 'wave', ';', 'set', 'co-religionists', 'Jews', 'dormant', 'embodiment', 'Ber', 'Spinoza', 'In', 'distinct', 'basis', 'him', 'Aaron', 'texts', 'anti-Semitic', 'reclaiming', 'violence', 'Indeed', 'identity', 'United', 'muscular', 'allegedly', 'be', 'Zangwill', 'movement', 'grandmaster', 'Instead', 'direction', 'now', 'Soon', 'Lilienblum', 'kaddish', 'ever', 'gashmi', 'half-jokingly', 'no', 'beard', 'look', 'way', 'any', 'residing', 'dared', 'seen', 'while', 'radically', 'Germany', 'hand-in-hand', 'directed', 'granted', 'Marxist', 'heart', 'crisis', 'hymn', 'townspeople', 'saw', 'wagered', 'continued', 'protect', 'pans', 'unrealisable', 'The', 'contrast', 'greater', 'programmes', 'Avraham', 'found', 'body', 'late', 'Around', 'professionals', 'host', 'always', 'began', 'imagined', 'tending', 'living', 'their', 'is', 'Tsukerman', 'someone', 'finally', 'Eventually', 'lying', 'Yiddish', 'circulation', 'with', 'actualisation', 'pots', 'possessing', 'were', 'Scholasticism', 'Roth', 'tradition', 'first', 'earlier', 'fork', 'relationship', 'period', 'as', 'around', 'Universe', 'modern', 'healthy', 'England', 'aluts', 'how', 
'Updates', 'Torah', 'this', 'Complaint', 'Lieberman', 'rituals', 'idea', 'rooted', 'revolutionary', 'all', 'Zionist', 'masturbatory', 'laboured', 'turn', 'life', 'materialist', 'encased', 'philosophy', 'And', '‘', 'land', 'energy', 'also', 'European', ',', 'forced', 'offer', 'charlatans', 'fear', 'reason', 'them', 'There', 'overbearing', 'on', 'historical', 'Emma', 'socioeconomic', 'passing', 'theories', 'Everything', 'Jewish', 'western', 'notion', 'genre', 'political', 'fully', 'What', 'socialist', 'something', 'would', 'Frenchmen', 'distinguished', 'social', 'longer', 'Top', 'assumptions', 'who', 'its', 'advice', 'encapsulated', 'providing', 'States', 'talk', 'Hebrew', 'brand', 'had', 'have', 'different', 'materiell', 'when', 'envisioned', 'eulogise', 'early', 'time', 'But', 'Phillip', 'conclusion', 'witnessed', 'ensure', 'might', 'Whether', 'Eliezer', '–', 'Russian', 'become', 'walk', 'touch', 'neighbours', 'sphere', 'provocateurs', 'politics', 'to', 'American', 'Therefore', 'been', 'citizens', 'narrative', 'they', 'Schur', 'Over', 'suspect', 'Jew', 'Enlighteners', 'wellbeing', 'remind', 'greedy', '20th', 'response', 'theorised', 'study', 'denounced', 'scholar', 'Protestant', 'pioneer', 'ethnic', 'Moses', 'image', 'here', 'faith', 'forms', 'boyfriend', 'hope', 'new', 'Englishmen', 'Ben-Gurion', 'materialists', 'materialism', 'within', 'expectations', 'Russians', 'economic', 'increasingly', '2018', 'specific', 'Hasidism', 'by', 'book', 'Though', 'Following', 'Landauer', 'experience', 'stage', 'literature', 'the', 'name', 'influx', 'religion', 'conservation', 'described', 'come', 'long', 'Hasia', 'commonly', 'God-fearing', 'involvement', 'element', 'asked', 'worldview', 'hunchbacked', 'conquering', 'Russia', 'private', 'originally', 'yeshiva', 'plenty', 'Englishman', 'lox', 'bodies', 'youthful', 'much', 'US', 'available', 'liberal', 'themselves', 'stands', 'former', 'May', 'speak', '19th', 'man', 'share', 'street', 'education', 'His', 'simply', 'latter', 'in', 
'supporting', 'put', 'Zionism', 'philosopher', 'channelling', 'organisations', 'knew', 'through', 'rabbi', 'everything', 'traditional', 'novel', 'intellectual', 'ideas', 'particularity', 'without', 'immigrants', 'material', 'whether', 'perfunctory', 'claimed', 'radicals', 'between', 'male', 'go', 'else', 'anti-Semitism', 'focus', 'clearly', 'least', 'familiar', 'not', 'Aeon', 'incapable', 'Ben-Zvi', 'called', 'beliefs', 'deal', 'did', 'mine', 'resources', 'parents', 'Protestantism', 'Germans', 'up', 'toils', 'even', 'those', 'promote', 'rise', 'boxes', 'Minority', 'established', 'Mohilev', 'final', 'assumed', 'hometown', 'died', 'cast', 'kept', '“', '22', 'one', 'collective', 'pelted', 'of', 'home', 'prohibited', 'platform', 'group', 'gentile', 'more', 'Europe', 'deviancies', 'commitments', 'front', 'tolerated', 'he', '20th-century', 'emphasis', 'Palestine', 'going', 'such', 'shaped', 'elite', 'certain', 'Mitnagdim', 'provide', 'meant', 'into', 'opportunities', 'liberals', 'made', 'arms', 'most', 'images', 'critical', 'identified', 'husband', 'however', 'being', 'failed', 'given', 'performance', 'tentacles', 'was', 'next', 'basis.', 'despised', 'then', '(', 'encounters', 'leaders', 'defence', 'articulated', 'nothing', 'Sabbath', 'aspects', 'human', 'society', 'remains', 'same', 'metaphysical', 'believed', 'need', 'attached', 'about', 'staunch', 'defining', 'revolution', 'two', 'number', 'Rothschild', 'Mordecai', 'latent', 'While', '…', 'fair', 'dynamic', 'wrote', 'As', 'excuses', 'everyone', 'capture', 'future', 'Portnoy', 'other', 'rights', 'people', 'history', 'masturbation', 'brought', 'headed', 'inspired', 'state', 'Jewishness', 'common', 'nii', 'what', 'Hasidic', 'his', 'upstarts', 'underwent', 'kasher', '1921', 'believe', 'needs', ')', 'hands', 'gave', 'radical', 'it', 'day', 'minority', 'like', 'public', 'Kaplan', 'eastern', 'Judaism', 'tried', 'War', 'kind', 'locus', 'embraced', 'bias', 'If', 'protecting', 'large-scale', 'flee', 'that', 'physical', 
'rejected', 'Bundism', 'late-20th-century', 'fitting', 'member', 'communal', 'country', 'from', 'accepted', 'world', 'desires', 'letters', 'interred', '.', 'phrase', 'less', 'part', 'defenders', 'considered', 'Only', 'identify', 'theorise', 'anarchist', 'both', 'and', 'amount', 'JJjjjJJ', 'Borchov', 'revolutionaries', 'requires', 'differences', 'novelist', 'equal', 'meat', 'synagogues', 'at', 'organising', 'challenges', 'Lazarus', 'many', 'expressed', 's', 'conscious', 'Shemuel', 'combat', 'became', 'century', 'her', 'knife', 'an', 'dominating', 'abstract', 'access', 'communism', 'often', 'stones', 'but', 'family', 'there', 'spiritual', 'students', 'corpse', 'could', 'fraught', 'Gustav', 'adherents', 'desire', 'strictly', 'house', 'Leib', 'labour', 'pithy', 'Be', 'writes', 'young', 'funeral', 'protected'}
0 rabbis
1 Rights
2 Israel
3 From
4 form
5 matter
6 silly
7 theory
8 recognised
9 1870s
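
Because a Python set is unordered, the ten tokens listed above can differ from one run to the next, and 'The' and 'the' are stored as two separate entries. The sketch below shows one possible way to get a reproducible, case-insensitive listing. It is only an illustration: the sentence is a short excerpt, and re.findall(r"\w+") is a crude stand-in for word_tokenize so that the snippet runs without NLTK.

```python
import itertools
import re

# Short excerpt; re.findall(r"\w+") is a crude stand-in for word_tokenize
text = "The people of the book had now become a people of labour"
tokens = re.findall(r"\w+", text)

# Lowercase before de-duplicating so 'The' and 'the' count as one token
uniq_words = {token.lower() for token in tokens}

# sorted() gives a reproducible order, unlike iterating over the set directly
for i, val in enumerate(itertools.islice(sorted(uniq_words), 5)):
    print(i, val)
```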

Searching and matching

re.match checks for a match only at the beginning of the string, while re.search scans the string and returns the first match anywhere in it.

In [23]:
pattern = r'J\w+'
re.search(pattern,article)
Out[23]:
<_sre.SRE_Match object; span=(30, 33), match='Jew'>
In [24]:
pattern = r"\d+s"
re.search(pattern,article)
Out[24]:
<_sre.SRE_Match object; span=(993, 998), match='1870s'>
In [25]:
pattern = r'\w+'
re.match(pattern,article)
Out[25]:
<_sre.SRE_Match object; span=(0, 2), match='Be'>
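
To make the difference concrete, the sketch below runs the same pattern through both functions on the opening words of the article (typed here with straight quotes, which is an assumption about the input and does not affect the positions).

```python
import re

# Opening words of the article, with straight quotes for simplicity
text = "Be 'a man in the street and a Jew in the home'"
pattern = r"J\w+"

at_start = re.match(pattern, text)   # only tries to match at position 0
anywhere = re.search(pattern, text)  # scans forward for the first match

print(at_start)          # None, because the text starts with "Be"
print(anywhere.group())  # Jew
print(anywhere.span())   # (30, 33)
```

Since re.match returns None when there is no match, it is worth checking the result before calling .group() on it.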

Tokenizing Tweets

In [26]:
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

tweet_sample = [
    'iPython jupyter for #NLP is awesome #18 all the best to @rand',
    '#nlp by @rand is good :)',
    'We should :-( use #python for ^^ analytics :(',
    '#python is good for qualitative analysis <3',
    'I always use #PYthon',
]

Read all #hashtags from a tweet

In [27]:
pattern_hashtag = r"#\w+"
regexp_tokenize(tweet_sample[0],pattern_hashtag)
Out[27]:
['#NLP', '#18']

Read all #hashtags and @mentions

In [28]:
pattern_hash_mention = r"[@#]\w+"
regexp_tokenize(tweet_sample[0],pattern_hash_mention)
Out[28]:
['#NLP', '#18', '@rand']
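
The same hashtag pattern can be applied to every tweet rather than just the first. The sketch below is one possible way to do that; it uses the standard-library re.findall as a stand-in for regexp_tokenize (an assumption, so the snippet runs without NLTK) and lowercases the tags so that #python and #PYthon are counted together.

```python
import re
from collections import Counter

tweet_sample = [
    'iPython jupyter for #NLP is awesome #18 all the best to @rand',
    '#nlp by @rand is good :)',
    'We should :-( use #python for ^^ analytics :(',
    '#python is good for qualitative analysis <3',
    'I always use #PYthon',
]
pattern_hashtag = r"#\w+"

# Collect the hashtags of every tweet, normalised to lowercase
hashtags = [
    tag.lower()
    for tweet in tweet_sample
    for tag in re.findall(pattern_hashtag, tweet)
]

counts = Counter(hashtags)
print(counts.most_common())  # [('#python', 3), ('#nlp', 2), ('#18', 1)]
```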

TweetTokenizer keeps the following together as single tokens:

  1. # (hashtags)
  2. @ (mentions)
  3. :) :-( <3 (smileys)
In [29]:
tweetTknzr = TweetTokenizer()
tokenList = [tweetTknzr.tokenize(tweet) for tweet in tweet_sample]
print(tokenList)
[['iPython', 'jupyter', 'for', '#NLP', 'is', 'awesome', '#18', 'all', 'the', 'best', 'to', '@rand'], ['#nlp', 'by', '@rand', 'is', 'good', ':)'], ['We', 'should', ':-(', 'use', '#python', 'for', '^', '^', 'analytics', ':('], ['#python', 'is', 'good', 'for', 'qualitative', 'analysis', '<3'], ['I', 'always', 'use', '#PYthon']]

If we use word_tokenize instead, hashtags, mentions and smileys are not recognized: they are split into separate symbols such as '#', 'NLP' and ':', ')'.

In [30]:
tokenList = [word_tokenize(tweet) for tweet in tweet_sample]
print(tokenList)
[['iPython', 'jupyter', 'for', '#', 'NLP', 'is', 'awesome', '#', '18', 'all', 'the', 'best', 'to', '@', 'rand'], ['#', 'nlp', 'by', '@', 'rand', 'is', 'good', ':', ')'], ['We', 'should', ':', '-', '(', 'use', '#', 'python', 'for', '^^', 'analytics', ':', '('], ['#', 'python', 'is', 'good', 'for', 'qualitative', 'analysis', '<', '3'], ['I', 'always', 'use', '#', 'PYthon']]
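
To see why the two results differ, it can help to write a small regex that keeps these pieces together. The sketch below is a simplified illustration only, not how TweetTokenizer is actually implemented: the pattern is my own assumption, it covers just a handful of smileys, and it silently drops symbols such as ^^. It uses the standard-library re.findall so that it runs without NLTK.

```python
import re

# Alternatives are tried left to right, so the special cases come before \w+
tweet_pattern = r"[@#]\w+|[:;]-?[()DP]|<3|\w+"
#                 ^ tags   ^ smileys     ^ heart ^ plain words

tweets = [
    '#nlp by @rand is good :)',
    'We should :-( use #python for ^^ analytics :(',
]

tokens = [re.findall(tweet_pattern, tweet) for tweet in tweets]
print(tokens[0])  # ['#nlp', 'by', '@rand', 'is', 'good', ':)']
print(tokens[1])  # ['We', 'should', ':-(', 'use', '#python', 'for', 'analytics', ':(']
```

A hand-written pattern like this quickly becomes hard to maintain, which is a good reason to prefer TweetTokenizer for real tweets.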
