Regular expressions (regex)

This post explores regular expressions (also known as 'regex') in Python using the re module. A regular expression is a sequence of characters that defines a search pattern. Applications of regular expressions include (i) searching for certain file names in a directory, (ii) string processing such as search and replace, (iii) syntax highlighting, (iv) data/web scraping, and (v) natural language processing.

The task list for this post is as follows:

  • [x] Understanding common regex patterns.
  • [ ] Understanding common regex commands.
  • [ ] Exploring commonly used re commands.

Important notes on using the re module:

  1. re commands have two main inputs: (i) a pattern and (ii) a string.
  2. Make sure your patterns start with an r, as this makes Python treat them as raw strings and leave the backslashes untouched.
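
To see why the r prefix matters, here is a minimal sketch; the sample text and the variable names are made up for illustration. Without the r, Python interprets escapes such as \b itself before the re module ever sees the pattern.

```python
import re

text = "a word in a sentence"

plain = '\bword\b'    # normal string: \b becomes the backspace character
raw = r'\bword\b'     # raw string: backslashes survive, so re sees \b (word boundary)

print(len(plain), len(raw))      # 6 8
print(re.findall(plain, text))   # []
print(re.findall(raw, text))     # ['word']
```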

We can't possibly learn all the commands and patterns in re; however, we should at least understand the most commonly used ones. These are shown in the tables below:

1. Common regex patterns
Pattern    Matches             Example
\w+        word                'Ditto'
\d         digit               8
\s         space               ' '
.*         wildcard            'any456'
+ or *     greedy              'zzzzz'
\S         not space           'non-space'
[a-z]      lowercase group     'abcxyz'
[A-Z]      uppercase group     'ABCXYZ'
2. Common regex commands
Commands       Description                                   Example
|              OR                                            To combine two different groups/ranges
( )            Group                                         Used for an explicit set of characters
[ ]            Character ranges                              Used for a range of characters
(\d+|\w+)      Matches digits and words                      '12' 'abc'
[A-Za-z]+      Matches uppercase and lowercase characters    'AASDASDlasdasd'
[0-9]          Numbers from 0 to 9                           9
[A-Za-z-.]+    Upper & lowercase characters with . and -     'My-website.com'
(a-z)          a, -, and z (literal text, not a range)       'a-z'
(\s+|,)        Spaces or comma                               ','
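
A few rows of this table can be checked directly. The snippet below is a small sketch using re.findall on made-up sample strings; in the character class the hyphen is placed last so that it is unambiguously a literal character.

```python
import re

# Alternation inside a group: a run of digits OR a run of word characters
digits_or_words = re.findall(r'(\d+|\w+)', "12 abc")
print(digits_or_words)      # ['12', 'abc']

# Character class with two ranges plus a literal '.' and '-'
site = re.findall(r'[A-Za-z.-]+', "My-website.com")
print(site)                 # ['My-website.com']

# Parentheses do not create a range: (a-z) matches the literal text 'a-z'
literal = re.findall(r'(a-z)', "a-z and abc")
print(literal)              # ['a-z']

# One or more whitespace characters OR a single comma
separators = re.findall(r'(\s+|,)', "a, b")
print(separators)           # [',', ' ']
```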
3. Commonly used re commands
Command    Description
split      split a string
findall    find all patterns in a string
search     search for a pattern within the string
match      match a pattern against the string, looking only at the very start of the string
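
As a quick orientation, the four commands can be compared side by side. This is a minimal sketch on a made-up string, not part of the original notebook.

```python
import re

text = "red, green;blue"

# split: break the string wherever the pattern matches
parts = re.split(r'[,;]\s*', text)
# findall: return every non-overlapping match as a list
words = re.findall(r'\w+', text)
# search: first match anywhere in the string (or None)
found = re.search(r'g\w+', text)
# match: succeeds only if the pattern matches at position 0
at_start = re.match(r'g\w+', text)

print(parts)           # ['red', 'green', 'blue']
print(words)           # ['red', 'green', 'blue']
print(found.group())   # green
print(at_start)        # None
```
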
Importing the re module
In [2]:
import re
test_string1 = "All       students' need to learn how to use RegEx!"
test_string2 = "   We all like using Python.  Python is awesome!  Why don't you try using it? I've used it for C++ for 5 years, Matlab for 6 years, and Python for 15 years!   "
test_string3 = "We all like using Python.  Python is awesome!"

The pattern matches a single whitespace character.
So if you have a series of spaces, each one is listed separately.

In [4]:
pattern = r'\s'
re.findall(pattern,test_string1)
Out[4]:
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

The pattern looks for 'greedy' spaces: the + matches one or more whitespace characters in a row. Thus the long run of spaces is treated as one match.

In [5]:
pattern = r'\s+'
re.findall(pattern,test_string1)
Out[5]:
['       ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
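
The same pattern is handy for the search-and-replace use case mentioned in the introduction. The sketch below uses re.sub, a standard re function that is not covered in the tables above, to collapse each run of whitespace into a single space.

```python
import re

test_string1 = "All       students' need to learn how to use RegEx!"

# Replace every run of whitespace with exactly one space
collapsed = re.sub(r'\s+', ' ', test_string1)
print(collapsed)   # All students' need to learn how to use RegEx!
```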

The pattern looks for 'greedy' words, i.e. runs of one or more word characters (letters, digits and underscore).

In [6]:
pattern = r'\w+'
re.findall(pattern,test_string2)
Out[6]:
['We',
 'all',
 'like',
 'using',
 'Python',
 'Python',
 'is',
 'awesome',
 'Why',
 'don',
 't',
 'you',
 'try',
 'using',
 'it',
 'I',
 've',
 'used',
 'it',
 'for',
 'C',
 'for',
 '5',
 'years',
 'Matlab',
 'for',
 '6',
 'years',
 'and',
 'Python',
 'for',
 '15',
 'years']

The pattern matches each word character individually.

In [7]:
pattern = r'\w'
re.findall(pattern,test_string2)
Out[7]:
['W',
 'e',
 'a',
 'l',
 'l',
 'l',
 'i',
 'k',
 'e',
 'u',
 's',
 'i',
 'n',
 'g',
 'P',
 'y',
 't',
 'h',
 'o',
 'n',
 'P',
 'y',
 't',
 'h',
 'o',
 'n',
 'i',
 's',
 'a',
 'w',
 'e',
 's',
 'o',
 'm',
 'e',
 'W',
 'h',
 'y',
 'd',
 'o',
 'n',
 't',
 'y',
 'o',
 'u',
 't',
 'r',
 'y',
 'u',
 's',
 'i',
 'n',
 'g',
 'i',
 't',
 'I',
 'v',
 'e',
 'u',
 's',
 'e',
 'd',
 'i',
 't',
 'f',
 'o',
 'r',
 'C',
 'f',
 'o',
 'r',
 '5',
 'y',
 'e',
 'a',
 'r',
 's',
 'M',
 'a',
 't',
 'l',
 'a',
 'b',
 'f',
 'o',
 'r',
 '6',
 'y',
 'e',
 'a',
 'r',
 's',
 'a',
 'n',
 'd',
 'P',
 'y',
 't',
 'h',
 'o',
 'n',
 'f',
 'o',
 'r',
 '1',
 '5',
 'y',
 'e',
 'a',
 'r',
 's']

Pattern only takes lowercase characters from f to z

In [8]:
pattern = r'[f-z]'
re.findall(pattern,test_string2)
Out[8]:
['l',
 'l',
 'l',
 'i',
 'k',
 'u',
 's',
 'i',
 'n',
 'g',
 'y',
 't',
 'h',
 'o',
 'n',
 'y',
 't',
 'h',
 'o',
 'n',
 'i',
 's',
 'w',
 's',
 'o',
 'm',
 'h',
 'y',
 'o',
 'n',
 't',
 'y',
 'o',
 'u',
 't',
 'r',
 'y',
 'u',
 's',
 'i',
 'n',
 'g',
 'i',
 't',
 'v',
 'u',
 's',
 'i',
 't',
 'f',
 'o',
 'r',
 'f',
 'o',
 'r',
 'y',
 'r',
 's',
 't',
 'l',
 'f',
 'o',
 'r',
 'y',
 'r',
 's',
 'n',
 'y',
 't',
 'h',
 'o',
 'n',
 'f',
 'o',
 'r',
 'y',
 'r',
 's']

Split a paragraph into sentences. The strip() method removes white space before and after the text in the string.

In [9]:
pattern = r'[?.!]'
sentences = re.split(pattern,test_string2)
[sentence.strip() for sentence in sentences] # remove leading and trailing white space from each sentence
Out[9]:
['We all like using Python',
 'Python is awesome',
 "Why don't you try using it",
 "I've used it for C++ for 5 years, Matlab for 6 years, and Python for 15 years",
 '']
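
The trailing '' appears because the text ends with a delimiter followed only by spaces. One possible way to drop such empty pieces is to filter while stripping, as in the sketch below (the sample text is made up). Note that this simple pattern also splits on any other period, such as a decimal point.

```python
import re

text = "  Python is awesome!  Why don't you try it? I have.  "

# Split on sentence-ending punctuation, trim each piece, drop empty pieces
sentences = [s.strip() for s in re.split(r'[?.!]', text) if s.strip()]
print(sentences)
# ['Python is awesome', "Why don't you try it", 'I have']
```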

Obtain digits only

In [10]:
digits = r'\d+'
print(test_string2)
digits = re.findall(digits,test_string2)
digits
   We all like using Python.  Python is awesome!  Why don't you try using it? I've used it for C++ for 5 years, Matlab for 6 years, and Python for 15 years!   
Out[10]:
['5', '6', '15']

Obtain capitalized words (a capital letter followed by one or more word characters)

In [11]:
caps = r'[A-Z]\w+'
print(test_string2)
firstword = re.findall(caps,test_string2)
print(firstword)
   We all like using Python.  Python is awesome!  Why don't you try using it? I've used it for C++ for 5 years, Matlab for 6 years, and Python for 15 years!   
['We', 'Python', 'Python', 'Why', 'Matlab', 'Python']

In the pattern below, \[ and \] are escaped so that they match literal square brackets. The .* indicates that anything can be between the square brackets.

In [12]:
pattern1 = r"\[.*\]"
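
Because .* is greedy, this pattern runs to the last closing bracket on a line. The sketch below, on a made-up line of text, compares it with the lazy variant .*? (an alternative, not something the original notebook uses).

```python
import re

line = "[Scene 1] Arthur: Hello [waves] there"   # hypothetical sample line

# Greedy: .* extends to the LAST closing bracket on the line
greedy = re.findall(r"\[.*\]", line)
print(greedy)   # ['[Scene 1] Arthur: Hello [waves]']

# Lazy: .*? stops at the FIRST closing bracket it can
lazy = re.findall(r"\[.*?\]", line)
print(lazy)     # ['[Scene 1]', '[waves]']
```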

The pattern below allows word characters and spaces, "[\w\s]"; the + means one or more of them, up to and including a :

In [13]:
pattern2 = r"[\w\s]+:"
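
A typical use for such a pattern is pulling a speaker label off the start of a line. This is a minimal sketch; the sample lines are made up for illustration.

```python
import re

pattern2 = r"[\w\s]+:"

# re.match anchors at the start of the string
speaker = re.match(pattern2, "Sir Robin: I am not afraid!")
print(speaker.group())   # Sir Robin:

# No colon anywhere, so there is nothing to match
no_speaker = re.match(pattern2, "No speaker here")
print(no_speaker)        # None
```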

Tokenization

Tokenization is breaking up a text into separate chunks (i.e., tokens). The most common module for natural language processing in Python is nltk.

from nltk.tokenize import ***

Token commands     Description                         Explanation
sent_tokenize      tokenize a document to sentences    breaks up an article into sentences
word_tokenize      tokenize a document to words        breaks up an article into words
regexp_tokenize    tokenize based on regex             lets you decide how you want to break up the text
TweetTokenizer     tokenizer for Tweets                tokenizes but accounts for #, @, and smileys. If you did this with word_tokenize, it would split on #, @, punctuation, symbols, etc.

Importing modules

In [78]:
import nltk
import itertools
import os
nltk.download('punkt')
from nltk.tokenize import sent_tokenize # sentence tokenization
from nltk.tokenize import word_tokenize # word tokenization
from nltk.tokenize import regexp_tokenize # regex tokenization

Reading a file

The article below was downloaded from aeon.co, which is a great website that I read whenever I can! The article, entitled How materialism became an ethos of hope for Jewish reformers, is used as the example in this exercise.

In [16]:
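currDir = os.getcwd()
fileName = 'aeon.txt'

# Build the path with os.path.join so it is not tied to Windows separators
readFile = os.path.join(currDir, 'inputs', fileName)

# The with block closes the file even if reading fails
with open(readFile, 'r') as f:
    article = f.read()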
print(article)
Be ‘a man in the street and a Jew in the home’: a common piece of advice that liberal Jews often gave their co-religionists in the 19th century. If Jewishness was kept invisible and private, they wagered, then Jews could become citizens and professionals, and be granted equal access to the material resources made available to any other member of society. There was plenty of Christian bias to combat, encapsulated by images of Jewish avarice and materialism such as Shylock’s greedy hands and Rothschild’s beard in the form of snake-like tentacles. If only Jews could fit into the spiritual boxes established by the European Protestant elite, they would be accepted, or at least tolerated in the public sphere as Frenchmen, Germans or Englishmen. Though compelling in theory, the deal became more fraught as rampant anti-Semitic violence in eastern Europe continued to remind Jews that, no matter how much they tried to look like ‘everyone else’, their bodies were marked as Jewish. 
In the 1870s European Judaism underwent an intellectual revolution. Around then, a group of young Russian Jewish radicals began to identify Judaism with materialism, and to theorise about what they called – whether in Russian, German, Yiddish or Hebrew – the ‘material’ (material’nii, materiell, gashmi, ?omri) aspects of the Universe. For many Jews living in this period, ‘materialism’ was a worldview that brought into focus latent Jewish ideas and beliefs about the physical world. The materialists claimed that a theory of Judaism, defined by the way people related to land, labour and bodies, had been lying dormant within Jewish literature – in Hasidic texts, the Bible, Spinoza’s philosophy – and could now be clearly recognised and fully articulated. Jewish particularity was based on specific historical economic differences between Jews and others. What made Jews different was a certain socioeconomic dynamic that distinguished them from their neighbours.
The Jewish revolutionaries in 1870s Russia who embraced the idea of materialism shared a number of critical assumptions. They all rejected the notion that Judaism was based on abstract metaphysical theories (Scholasticism), rituals (Hasidism), study (Mitnagdim), and ethics and reason (Enlighteners). Judaism was not a religion, like Protestantism. Instead it was something attached to their bodies and expressed through one’s relationship to land, labour and resources. The materialists had also given up hope that the state could protect them and ensure their economic wellbeing. And finally, they no longer believed that history was headed in a positive direction. Over no amount of time would Jews living in Russia ever be granted greater rights and opportunities. Therefore, only a radical reclaiming of the physical world on the part of Jews could ensure that they would be protected and given a fair and equal share of resources.
Soon, the Jewish materialism of the Russians could be found among western European Jews residing in England and Germany. Only half-jokingly, the German anarchist Gustav Landauer claimed in 1921 that what distinguished ‘the modern “conscious” Jew from a German was that when the latter writes about … the conservation of energy, … he writes about the conservation of energy, but when the conscious Jew writes about the conservation of energy, he writes about the conservation of energy and Judaism’ (emphasis mine). Eventually, there would be those, such as the Englishman Israel Zangwill, who considered themselves adherents to ‘a religion of pots and pans’, and others who identified Judaism as a faith based on ‘bagels and lox’. Over the course of the 20th century, Jews would increasingly come to believe that ‘there is nothing purely spiritual that stands on its own … Everything spiritual requires a necessary material basis.’
Updates on everything new at Aeon.
Top of Form
Bottom of Form
JJjjjJJ  Jewish materialists were despised not only by staunch liberals but also by ‘defenders of the faith’. Moses Leib Lilienblum, who would go on to found the Zionist movement in Russia, wrote a novel in which he described his youthful yeshiva education as one long masturbatory experience – for this, he was denounced by rabbis and communal leaders who forced him to flee his hometown in fear for his life. The future Russian revolutionary Hasia Schur was pelted with stones and jeered at by the townspeople of Mohilev for going on a Sabbath walk hand-in-hand with her boyfriend, the socialist Eliezer Tsukerman: the rabbis were up in arms that two young people had dared to touch one another in public. Jewish materialists were cast as upstarts, deviants, social provocateurs and, of course, with providing Jew-haters with excuses to promote anti-Semitism.
But the Jewish materialists’ deviancies reflected a radically new kind of Jewish identity, one focused on their bodies and the physical world. The Jewish body they imagined would offer a contrast to both the hunchbacked, traditional Jewish Torah scholar incapable of supporting his family, and the muscular gentile male whose energies were directed at conquering and dominating the physical world. The new Jewish body would be shaped in the image of a healthy traditional Jewish woman who laboured to provide for her family’s material wellbeing while her husband spent his day in the house of study: by tending to the material aspects of existence, Jews’ needs and desires would now be seen as the primary feature of Judaism. The material Jewish identity set the stage for Jews’ involvement in 20th-century politics: Zionism, Bundism (the Jewish labour movement), the Minority Rights movement, and Jewish forms of communism all assumed that the organising structure of Jewish identity was a Jewish body, and not a Judaism of the heavens or the heart. Jewish materialism made Jews political without them possessing their own state or even citizenship in a host country.
Though the idea of the Jewish body as the locus of collective identity would always be suspect in western Europe, it would, however, become the basis of a new kind of Jewish identity most commonly witnessed in Israel and the United States. Jewish immigrants to Palestine at the turn of the century saw in Zion the actualisation of materialism as first imagined in the 1870s. The Marxist Ber Borchov’s students, such as future leaders of Israel Yitzhak Ben-Zvi and David Ben-Gurion, identified Palestine as a response to the crisis of the fork and the knife (a pithy phrase meant to capture the economic challenges of Russian Jews in the 1870s) originally theorised by the Jewish materialist Aaron Shemuel Lieberman in the 1870s. They envisioned a new kind of Jew – the ?aluts (pioneer) – who was attached to the physical world. As described by the 20th-century Zionist poet Avraham Shlonsky, a former Hasidic Jew, the ?aluts would be the embodiment of the idea that ‘a human being is meat, and he toils here in the sacred/and the land/bread’. The people of the book had now become a people of labour, land and the body.
In the US, eastern European Jews established large-scale defence organisations directed at protecting Jewish bodies and providing a platform for Jews to speak as a distinct ethnic minority in the American public sphere. From the poet Emma Lazarus to the American rabbi Mordecai Kaplan to the philosopher Horace Kallen, American Jews in the early 20th century developed political programmes and established organisations rooted around the physical aspects of Jewish life.
Jewish materialism remains the defining element of most American Jews’ identity. Following the Second World War, the influx of another wave of Jewish immigrants from Russian lands gave rise to a new brand of US literature that placed the Jewish body front and centre. The late US novelist Phillip Roth might have been familiar only in passing with the name Moses Lilienblum. But it was Lilienblum who put into circulation the Jewish genre of overbearing parents, unrealisable social expectations, failed sexual encounters, silly rabbis, bankrupt synagogues and God-fearing charlatans encased in a narrative about masturbation. Whether he knew it or not, when Roth wrote his novel Portnoy’s Complaint(1969), he was channelling the same tradition first articulated by Lilienblum a century earlier.
Roth took those commitments to his grave when he died on 22 May 2018. While the grandmaster of late-20th-century American letters asked to be interred next to Jews, he strictly prohibited the performance of any Jewish rituals at his funeral. His final requests, allegedly, were inspired by a desire ‘to have someone to talk to’. His corpse did not need a rabbi to eulogise it, or a perfunctory kaddish (or hymn) to kasher it; it was simply Jewish – nothing more and nothing less. Indeed, it was a fitting conclusion to the life of a Jewish materialist.




Break up article into sentences

In [17]:
sentences = sent_tokenize(article)
In [18]:
print(sentences[0])
print(sentences[1])
print(sentences[3])
Be ‘a man in the street and a Jew in the home’: a common piece of advice that liberal Jews often gave their co-religionists in the 19th century.
If Jewishness was kept invisible and private, they wagered, then Jews could become citizens and professionals, and be granted equal access to the material resources made available to any other member of society.
If only Jews could fit into the spiritual boxes established by the European Protestant elite, they would be accepted, or at least tolerated in the public sphere as Frenchmen, Germans or Englishmen.

regexp_tokenize

regexp_tokenize is powerful as it allows you to set how tokens should be created from your string. You can tokenize based on specific patterns, or emojis.

In [19]:
my_string = "Morpheus: This is your last chance. After this, there is no turning back. You take the blue pill - the story ends, you wake up in your bed and believe whatever you want to believe. You take the red pill - you stay in Wonderland and I show you how deep the rabbit-hole goes.  Do you want to go ?!55"
pattern1 = r"\w+(\?!)" # a word immediately followed by ?! (no match here: a space separates "go" and "?!")
print(regexp_tokenize(my_string,pattern1))
[]
In [20]:
pattern2 = r'(\w+|#\d|\?|!)' # words OR #digit OR ? OR !
regexp_tokenize(my_string,pattern2)
Out[20]:
['Morpheus',
 'This',
 'is',
 'your',
 'last',
 'chance',
 'After',
 'this',
 'there',
 'is',
 'no',
 'turning',
 'back',
 'You',
 'take',
 'the',
 'blue',
 'pill',
 'the',
 'story',
 'ends',
 'you',
 'wake',
 'up',
 'in',
 'your',
 'bed',
 'and',
 'believe',
 'whatever',
 'you',
 'want',
 'to',
 'believe',
 'You',
 'take',
 'the',
 'red',
 'pill',
 'you',
 'stay',
 'in',
 'Wonderland',
 'and',
 'I',
 'show',
 'you',
 'how',
 'deep',
 'the',
 'rabbit',
 'hole',
 'goes',
 'Do',
 'you',
 'want',
 'to',
 'go',
 '?',
 '!',
 '55']
In [21]:
pattern3 = r'(#\d\w+\?!)' # hash digit with words with ?!
regexp_tokenize(my_string,pattern3)
Out[21]:
[]
In [22]:
pattern4 = r'\s+' # one or more whitespace characters
regexp_tokenize(my_string,pattern4)
Out[22]:
[' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '  ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ']
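
For experimenting with patterns without nltk, re.findall can serve as a stand-in. As far as I understand, regexp_tokenize with its default settings returns the same matches for patterns like these, but treat that equivalence as an assumption. The sketch below also shows one assumed tweak (optional whitespace) that makes the "word followed by ?!" idea match on a shortened version of the sample string.

```python
import re

my_string = "Do you want to go ?!55"   # tail of the sample string above

# Same alternation as pattern2, written without a capturing group
tokens = re.findall(r'\w+|#\d|\?|!', my_string)
print(tokens)
# ['Do', 'you', 'want', 'to', 'go', '?', '!', '55']

# Allowing optional whitespace between the word and ?! (an assumed tweak)
word_with_marks = re.findall(r'\w+\s*\?!', my_string)
print(word_with_marks)
# ['go ?!']
```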

Tokenize sentences[3] (the fourth sentence) into words

In [23]:
word_sent3 = word_tokenize(sentences[3])
print(word_sent3)
['If', 'only', 'Jews', 'could', 'fit', 'into', 'the', 'spiritual', 'boxes', 'established', 'by', 'the', 'European', 'Protestant', 'elite', ',', 'they', 'would', 'be', 'accepted', ',', 'or', 'at', 'least', 'tolerated', 'in', 'the', 'public', 'sphere', 'as', 'Frenchmen', ',', 'Germans', 'or', 'Englishmen', '.']

Extract the set of unique tokens from the entire article

In [25]:
uniq_words = set(word_tokenize(article))
print(uniq_words)
for i, val in enumerate(itertools.islice(uniq_words, 10)):
    print(i,val)
{'healthy', 'Frenchmen', 'without', 'defining', 'Bottom', 'one', 'access', 'hand-in-hand', 'history', 'all', 'life', 'suspect', 'Russia', 'writes', 'up', 'Rothschild', 'others', 'political', 'late', 'historical', 'Form', 'bias', 'always', 'Instead', 'rabbis', 'only', ':', 'articulated', 'theory', 'expectations', 'were', 'everyone', 'Rights', 'knife', 'he', 'hunchbacked', 'in', 'materiell', 'anti-Semitic', 'denounced', 'wagered', 'excuses', 'half-jokingly', 'ethnic', 'greater', 'conquering', 'image', 'such', 'Hebrew', '1870s', 'idea', 'Israel', 'spiritual', 'Schur', 'ideas', 'Minority', 'seen', 'Indeed', 'gentile', 'physical', 'German', 'bodies', 'rampant', 'Ben-Gurion', 'toils', 'host', 'boyfriend', 'young', 'whether', 'street', 'theories', 'encased', 'Ber', 'silly', 'state', 'distinguished', 'revolutionary', 'form', 'Mohilev', 'available', 'citizenship', 'found', 'combat', 'been', 'pots', 'sacred/and', 'shaped', 'identify', 'through', 'strictly', 'around', 'lying', 'wave', 'Scholasticism', 'private', '19th', 'Aaron', 'long', 'Top', 'radically', 'provocateurs', 'David', 'elite', 'materialism', 'commitments', 'And', 'like', 'at', 'residing', 'interred', 'stands', 'hope', 'Phillip', 'plenty', 'claimed', 'structure', 'Jewish', 'while', 'certain', 'family', 'lands', 'politics', 'matter', 'focus', 'Horace', 'reflected', 'co-religionists', 'number', 'Shemuel', 'established', 'Yitzhak', 'turn', '–', 'traditional', 'among', 'fully', 'hymn', 'commonly', 'was', 'ensure', 'society', 'witnessed', 'Europe', 'spent', 'stage', 'bagels', 'Protestantism', 'identity', 'asked', 'violence', 'its', 'when', 'May', 'longer', 'distinct', 'shared', 'If', 'fitting', 'course', 'Sabbath', 'anarchist', 'an', 'for', '(', 'Updates', 'even', 'clearly', 'who', 'cast', 'deviants', 'programmes', 'inspired', 'War', 'much', 'meant', 'materialists', 'based', 'circulation', 'with', 'gave', 'positive', 'period', 'way', 'might', 'deal', 'the', 'study', 'hands', 'actualisation', 'bankrupt', 'stones', 
'ethics', 'into', 'his', 'movement', 'nothing', 'placed', 'heart', 'how', 'embraced', 'genre', 'on', 'Kallen', 'Shlonsky', 'desire', 'fear', 'differences', 'most', 'modern', 'tolerated', 'masturbation', 'considered', 'Lazarus', 'had', 'dormant', 'centre', 'For', 'identified', 'mine', 'passing', 'States', 'omri', 'beard', 'pithy', 'Lieberman', 'about', 'directed', 'first', 'rooted', 'texts', 'time', 'embodiment', 'literature', 'revolutionaries', 'lox', 'underwent', 'but', 'capture', 'fit', 'protected', 'assumed', 'day', 'world', 'Universe', 'kaddish', 'meat', 'Eliezer', 'radicals', 'originally', 'human', 'imagined', 'theorised', 'given', 'Englishman', 'Zangwill', 'someone', 'could', 'more', '”', 'offer', 'often', 'kind', 'now', 'notion', 'Whether', 'heavens', 'His', 'knew', 'feature', 'wellbeing', 'But', ')', '.', 'late-20th-century', 'two', 'become', 'philosophy', 'touch', 'familiar', 'influx', 'Mordecai', 'tentacles', 'Mitnagdim', 'finally', 'come', '1969', 'arms', 'defence', 'need', 'tending', 'common', 'piece', 'talk', 'possessing', 'husband', 'would', 'Over', 'professionals', 'Protestant', 'share', 'pioneer', 'conservation', '1921', 'pans', 'American', 'direction', 'requests', ',', 'While', 'Though', 'reason', 'same', 'something', 'They', 'desires', 'headed', 'that', 'assumptions', 'fair', 'envisioned', 'Jewishness', 'parents', 'different', 'sexual', 'member', 'Soon', 'minority', 'economic', 'opportunities', 'element', 'neighbours', 'purely', 'Borchov', 'Eventually', 'Lilienblum', 'set', 'country', 'latter', 'synagogues', 'another', 'energies', 'performance', 'defenders', 'Kaplan', 'him', 'final', 'to', 'Christian', '20th-century', 'Everything', 'crisis', 'believe', 'accepted', 'any', 'philosopher', 'kept', 'allegedly', 'masturbatory', 'US', 'defined', 'what', 'land', 'group', 'there', 'From', 'by', 'communism', 'adherents', 'recognised', 'invisible', 'worldview', 'townspeople', 'European', 'jeered', 'described', 'it', 'World', 'man', 'however', 'beliefs', 
'began', 'male', ';', 'existence', '2018', 'dominating', 'a', 'dynamic', 'tried', 'staunch', 'they', 'land/bread', 'citizens', 'unrealisable', 'Zionist', 'hometown', 'protecting', '“', 'metaphysical', 'overbearing', 'Zion', 'Avraham', 'no', 'Hasidic', 'necessary', 'Shylock', 'phrase', 'century', 'Emma', 'contrast', 'particularity', 'Zionism', 'labour', 'forced', 'Germany', 'failed', 'Portnoy', 'upstarts', 'encapsulated', 'rise', 'whose', 'conscious', 'then', 'There', 'be', 'within', 'immigrants', 'energy', 'book', 'challenges', 'organisations', '?', 'being', 'resources', 'Hasia', 'promote', 'Landauer', 'flee', 'simply', 'made', 'conclusion', 'brought', 'saw', 'materialist', 'channelling', 'Germans', 'corpse', 'material', 'Jews', 'intellectual', 'Moses', 'basis', 'here', 'believed', 'which', 'primary', 'wrote', 'look', 'Around', 'reclaiming', 'Bundism', 'rabbi', 'continued', 'basis.', 'communal', 'encounters', 's', 'brand', 'greedy', 'Englishmen', 'Tsukerman', 'abstract', 'Russians', 'eastern', 'avarice', 'critical', 'home', 'died', 'Bible', 'prohibited', 'Second', 'fraught', 'amount', 'England', 'forms', 'charlatans', 'their', 'Following', 'Ben-Zvi', 'Hasidism', 'youthful', 'United', 'Torah', 'everything', '‘', 'radical', 'walk', 'granted', 'pelted', 'put', 'emphasis', 'faith', 'rejected', 'advice', 'What', 'dared', 'tradition', 'next', 'In', 'Jew-haters', '22', 'liberals', 'social', 'involvement', 'supporting', 'grave', 'experience', 'from', 'yeshiva', 'earlier', 'funeral', 'education', 'Spinoza', 'did', 'new', 'ever', 'sphere', 'Enlighteners', 'or', 'focused', 'Only', 'locus', 'former', 'perfunctory', 'expressed', 'organising', 'Palestine', 'platform', 'took', 'many', 'people', 'else', 'is', 'speak', 'Gustav', 'Leib', 'developed', 'fork', 'anti-Semitism', '20th', 'attached', 'equal', 'living', 'this', 'requires', 'also', 'letters', 'laboured', 'protect', 'boxes', 'theorise', 'increasingly', 'body', 'became', 'poet', 'aluts', 'future', 'God-fearing', 'themselves', 
'specific', 'front', 'not', 'aspects', 'provide', 'eulogise', 'despised', 'other', 'JJjjjJJ', 'rituals', 'name', 'large-scale', 'deviancies', 'remains', 'grandmaster', 'Yiddish', 'early', 'Therefore', 'called', 'relationship', 'own', 'both', 'compelling', 'The', 'religion', 'have', '…', 'kasher', 'scholar', 'As', 'needs', 'going', 'Judaism', 'least', 'nii', 'as', 'revolution', 'Complaint', 'of', 'between', 'response', 'novelist', 'gashmi', 'leaders', 'marked', 'less', 'liberal', 'them', 'woman', 'images', '’', 'house', 'public', 'Russian', 'her', 'snake-like', 'incapable', 'socioeconomic', 'Roth', 'and', 'Aeon', 'part', 'rights', 'latent', 'narrative', 'related', 'Jew', 'providing', 'collective', 'Be', 'those', 'novel', 'muscular', 'socialist', 'go', 'Marxist', 'remind', 'students', 'western'}
0 healthy
1 Frenchmen
2 without
3 defining
4 Bottom
5 one
6 access
7 hand-in-hand
8 history
9 all
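
A set has no defined order, so the ten words listed above can differ from run to run. The sketch below shows one way to get a reproducible listing and token counts. It uses re.findall with \w+ as a rough stand-in for word_tokenize (it drops punctuation) and a short made-up sentence instead of the full article.

```python
import re
from collections import Counter

text = "Jews and Judaism: the Jews, the land and the body"   # stand-in for the article

tokens = re.findall(r'\w+', text)

# Sorting the set makes the listing reproducible (capitals sort before lowercase)
unique_sorted = sorted(set(tokens))
print(unique_sorted)
# ['Jews', 'Judaism', 'and', 'body', 'land', 'the']

# Counter keeps how often each token occurs
top = Counter(tokens).most_common(1)
print(top)
# [('the', 3)]
```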

Searching and matching

match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string.

In [70]:
pattern = r'J\w+'
re.search(pattern,article)
Out[70]:
<_sre.SRE_Match object; span=(30, 33), match='Jew'>
In [71]:
pattern = r"\d+s"
re.search(pattern,article)
Out[71]:
<_sre.SRE_Match object; span=(993, 998), match='1870s'>
In [72]:
pattern = r'\w+'
re.match(pattern,article)
Out[72]:
<_sre.SRE_Match object; span=(0, 2), match='Be'>
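
The returned match objects can be queried rather than just displayed. The sketch below uses one sentence from the article and shows group(), span(), the None result when nothing matches, and re.fullmatch (a related re function not used above). On recent Python versions the printed representation reads re.Match instead of _sre.SRE_Match.

```python
import re

text = "In the 1870s European Judaism underwent an intellectual revolution."

m = re.search(r'\d+s', text)
if m:                                 # search returns None when nothing matches
    print(m.group())                  # 1870s
    print(m.span())                   # (7, 12)
    print(text[m.start():m.end()])    # 1870s

# match only tries position 0, so the same pattern fails here
at_start = re.match(r'\d+s', text)
print(at_start)                       # None

# fullmatch requires the whole string to match the pattern
whole = re.fullmatch(r'\d+s', "1870s")
print(whole is not None)              # True
```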

Tokenizing Tweets

In [104]:
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

tweet_sample = ['iPython jupyter for #NLP is awesome #18 all the best to @rand','#nlp by @rand is good :)','We should :-( use #python for ^^ analytics :(','#python is good for qualitative analysis <3','I always use #PYthon']

Read all #hashtags from a tweet

In [105]:
pattern_hashtag = r"#\w+"
regexp_tokenize(tweet_sample[0],pattern_hashtag)
Out[105]:
['#NLP', '#18']

Read all #hashtags and @mentions

In [106]:
pattern_hash_mention = r"[@#]\w+"
regexp_tokenize(tweet_sample[0],pattern_hash_mention)
Out[106]:
['#NLP', '#18', '@rand']
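
The same hashtag pattern can be applied across several tweets at once. The sketch below uses re.findall in place of regexp_tokenize, takes three of the sample tweets, and lower-cases the tags so that #NLP and #nlp are counted together. The counting step is an illustration and not part of the original notebook.

```python
import re
from collections import Counter

tweets = ['iPython jupyter for #NLP is awesome #18 all the best to @rand',
          '#nlp by @rand is good :)',
          'I always use #PYthon']

# Collect hashtags from every tweet, lower-cased for case-insensitive counting
tags = [tag.lower() for tweet in tweets for tag in re.findall(r'#\w+', tweet)]
tag_counts = Counter(tags)
print(tag_counts)
# Counter({'#nlp': 2, '#18': 1, '#python': 1})
```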

Using TweetTokenizer means that the following are kept as single tokens:

  1. #(hashtags)
  2. @(mentions)
  3. :) :-( <3 (smileys)
In [107]:
tweetTknzr = TweetTokenizer()
tokenList = [tweetTknzr.tokenize(tweet) for tweet in tweet_sample]
print(tokenList)
[['iPython', 'jupyter', 'for', '#NLP', 'is', 'awesome', '#18', 'all', 'the', 'best', 'to', '@rand'], ['#nlp', 'by', '@rand', 'is', 'good', ':)'], ['We', 'should', ':-(', 'use', '#python', 'for', '^', '^', 'analytics', ':('], ['#python', 'is', 'good', 'for', 'qualitative', 'analysis', '<3'], ['I', 'always', 'use', '#PYthon']]

If we just use word_tokenize, we find that it does not recognize hashtags, mentions, smileys, etc.

In [103]:
tokenList = [word_tokenize(tweet) for tweet in tweet_sample]
print(tokenList)
[['iPython', 'jupyter', 'for', '#', 'NLP', 'is', 'awesome', '#', '18', 'all', 'the', 'best', 'to', '@', 'rand'], ['#', 'nlp', 'by', '@', 'rand', 'is', 'good', ':', ')'], ['We', 'should', ':', '-', '(', 'use', '#', 'python', 'for', '^^', 'analytics', ':', '('], ['#', 'python', 'is', 'good', 'for', 'qualitative', 'analysis', '<', '3'], ['I', 'always', 'use', '#', 'PYthon']]

tf-idf

Term frequency - inverse document frequency
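
The post gives no formula at this point, so the sketch below uses one common textbook variant: tf is a term's share of a document's tokens and idf is log(N / df). The mini-corpus and the tf_idf helper are made up for illustration, and libraries may weight terms differently.

```python
import math
import re
from collections import Counter

docs = ["python is good for analytics",
        "python is good for qualitative analysis",
        "regex is awesome"]   # hypothetical mini-corpus

tokenized = [re.findall(r'\w+', d.lower()) for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """tf = share of the document's tokens; idf = log(N / df).
    Assumes the term occurs in at least one document."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / df[term])
    return tf * idf

print(tf_idf('is', tokenized[2]))   # 0.0, since 'is' appears in every document
print(tf_idf('regex', tokenized[2]) > tf_idf('python', tokenized[0]))   # True
```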

In [35]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

from nltk.corpus import stopwords
nltk.download('stopwords')
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[35]:
True
