Regular Expressions (regex)¶
This post explores regular expressions (also known as 'regex') in Python using the re module. A regular expression is a sequence of characters that defines a search pattern. Applications of regular expressions include (i) searching for certain file names in a directory, (ii) string processing such as search and replace, (iii) syntax highlighting, (iv) data/web scraping, and (v) natural language processing.
The task list for you to understand is as follows:
- [x] Understanding common regex patterns.
- [ ] Understanding common regex commands.
- [ ] Exploring commonly used re commands.
Important notes on using the re module (a short sketch follows these notes):
- re commands have two main inputs: (i) a pattern and (ii) a string.
- Make sure your patterns start with an r, as this forces Python to recognize them as raw strings.
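As a quick illustration of why the r prefix matters, the sketch below (using an invented sample string) matches runs of whitespace; the raw string keeps the backslash intact so re sees \s rather than a Python escape sequence.
import re
sample = "one two  three"            # invented sample string
print(re.findall(r'\s+', sample))    # raw string pattern: [' ', '  ']
print(re.findall('\\s+', sample))    # without the r prefix, the backslash must be escaped by hand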
We can't possibly learn all the commands and patterns in re; however, we should at least understand the most commonly used ones. These are shown in the tables below, each followed by a short example sketch:
1. Common regex patterns¶
Pattern | Matches | Example |
---|---|---|
\w+ | word | 'Ditto' |
\d | digit | 8 |
\s | space | ' ' |
.* | wildcard | 'any456' |
+ or * | greedy match | 'zzzzz' |
\S | not space | 'non-space' |
[a-z] | lowercase group | 'abcxyz' |
[A-Z] | uppercase group | 'ABCXYZ' |
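As a quick sketch of a few of these patterns in action (the sample string below is invented for illustration):
import re
sample = "Ditto scored 8 points in round 3"   # invented sample string
print(re.findall(r'\w+', sample))     # words: ['Ditto', 'scored', '8', 'points', 'in', 'round', '3']
print(re.findall(r'\d', sample))      # single digits: ['8', '3']
print(re.findall(r'[a-z]+', sample))  # runs of lowercase letters only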
2. Common regex commands¶
Commands | Description | Example |
---|---|---|
| | OR | Combines two different groups/ranges |
( ) | Group | Groups an explicit set of characters |
[ ] | Character range | Matches any character in the range |
(\d+|\w+) | Matches digits or words | '12' 'abc' |
[A-Za-z]+ | Matches uppercase and lowercase characters | 'AASDASDlasdasd' |
[0-9] | Digits from 0 to 9 | 9 |
[A-Za-z-.]+ | Upper- and lowercase characters plus . and - | 'My-website.com' |
(a-z) | The literal sequence a, -, z | 'a-z' |
(\s+|,) | Spaces or a comma | ',' |
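A short sketch applying a few of these commands; the test strings are either taken from the Example column above or invented ('a, b'):
import re
print(re.findall(r'(\d+|\w+)', '12 abc'))           # digits OR words: ['12', 'abc']
print(re.findall(r'[A-Za-z]+', 'AASDASDlasdasd'))   # upper- and lowercase runs
print(re.findall(r'[A-Za-z-.]+', 'My-website.com')) # letters plus '.' and '-': ['My-website.com']
print(re.findall(r'(\s+|,)', 'a, b'))               # spaces or a comma: [',', ' ']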
3. Commonly used re commands¶
command | description |
---|---|
split | split a string on a pattern |
findall | find all matches of a pattern in a string |
search | search for a pattern anywhere in the string |
match | match a pattern against the string, looking only at the very start of the string |
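The sketch below contrasts these four commands on an invented string:
import re
text = "Python 3 was released in 2008"   # invented sample string
print(re.split(r'\s+', text))    # split the string on whitespace
print(re.findall(r'\d+', text))  # every digit run: ['3', '2008']
print(re.search(r'\d+', text))   # first match anywhere in the string ('3')
print(re.match(r'\d+', text))    # None -- the string does not start with a digit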
Importing the re module¶
import re
test_string1 = "All students' need to learn how to use RegEx!"
test_string2 = " We all like using Python. Python is awesome! Why don't you try using it? I've used it for C++ for 5 years, Matlab for 6 years, and Python for 15 years! "
test_string3 = "We all like using Python. Python is awesome!"
The pattern below only looks for single spaces, so if you have a series of spaces, each one is listed separately.
pattern = r'\s'
re.findall(pattern,test_string1)
The pattern below looks for 'greedy' spaces, so a long run of spaces is treated as one match.
pattern = r'\s+'
re.findall(pattern,test_string1)
The pattern below greedily matches whole words.
pattern = r'\w+'
re.findall(pattern,test_string2)
The pattern below matches each word character individually.
pattern = r'\w'
re.findall(pattern,test_string2)
The pattern below only matches lowercase characters from f to z.
pattern = r'[f-z]'
re.findall(pattern,test_string2)
Split the paragraph into sentences. The strip() method removes white space before and after the text in each string.
pattern = r'[?.!]'
sentences = re.split(pattern,test_string2)
[sentence.strip() for sentence in sentences] # remove leading/trailing white space from each sentence
Obtain digits only
digits_pattern = r'\d+'
print(test_string2)
digits = re.findall(digits_pattern, test_string2)
digits
Obtain capitalized words only
caps = r'[A-Z]\w+'  # words starting with an uppercase letter
print(test_string2)
capitalized_words = re.findall(caps, test_string2)
print(capitalized_words)
The pattern below has \[ and \] as escape characters for the square brackets. The .* indicates that anything can be in the square brackets.
pattern1 = r"\[.*\]"
The pattern below allows for word characters and spaces ([\w\s]), and the + means it matches as many of them as possible until it reaches a colon (:).
pattern2 = r"[\w\s]+:"
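These two patterns are not applied to a string above, so here is a quick sketch of how they might be used on an invented, script-style line:
script_line = "[Scene 1] Morpheus: This is your last chance."   # invented sample string
print(re.findall(pattern1, script_line))   # ['[Scene 1]']
print(re.findall(pattern2, script_line))   # [' Morpheus:'] -- words/spaces up to the colon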
Tokenization¶
Tokenization is breaking up a text into separate chunks (i.e., tokens). The most common module for natural language processing in Python is nltk:
from nltk.tokenize import *
token commands | description | explanation |
---|---|---|
sent_tokenize | tokenize a document into sentences | breaks up an article into sentences |
word_tokenize | tokenize a document into words | breaks up an article into words |
regexp_tokenize | tokenize based on a regex | lets you decide how you want to break up the text |
TweetTokenizer | tokenizer for Tweets | tokenizes but accounts for #, @, and smileys. If you did this with word_tokenize, it would split on #, @, punctuation, symbols, etc. |
Importing modules¶
import nltk
import itertools
import os
nltk.download('punkt')
from nltk.tokenize import sent_tokenize # sentence tokenization
from nltk.tokenize import word_tokenize # word tokenization
from nltk.tokenize import regexp_tokenize # regex tokenization
Reading a file¶
The article below was downloaded from aeon.co, a great website that I read whenever I can! The article, entitled How materialism became an ethos of hope for Jewish reformers, is used as the example in this exercise.
currDir = os.getcwd()
fileName = 'aeon.txt'
readFile = os.path.join(currDir, 'inputs', fileName)  # platform-independent path
with open(readFile, 'r') as f:
    article = f.read()
print(article)
Break up article into sentences¶
sentences = sent_tokenize(article)
Print out different sentences¶
print(sentences[0])
print(sentences[1])
print(sentences[3])
regexp_tokenize¶
regexp_tokenize is powerful as it allows you to set how tokens should be created from your string. You can tokenize based on specific patterns, or emojis.
my_string = "Morpheus: This is your last chance. After this, there is no turning back. You take the blue pill - the story ends, you wake up in your bed and believe whatever you want to believe. You take the red pill - you stay in Wonderland and I show you how deep the rabbit-hole goes. Do you want to go ?!55"
pattern1 = r"\w+(\?!)" # words with both ?!
print(regexp_tokenize(my_string,pattern1))
pattern2 = r'(\w+|#\d|\?|!)' # words OR #digit OR ? OR !
regexp_tokenize(my_string,pattern2)
pattern3 = r'(#\d\w+\?!)' # hash digit with words with ?!
regexp_tokenize(my_string,pattern3)
pattern4 = r'\s+' # any amount of spaces
regexp_tokenize(my_string,pattern4)
Tokenize sentence 3 into words¶
word_sent3 = word_tokenize(sentences[3])
print(word_sent3)
Extract the set of unique tokens from the entire article¶
uniq_words = set(word_tokenize(article))
print(uniq_words)
for i, val in enumerate(itertools.islice(uniq_words, 10)):
print(i,val)
Searching and matching¶
match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string.
pattern = r'J\w+'
re.search(pattern,article)
pattern = r"\d+s"
re.search(pattern,article)
pattern = r'\w+'
re.match(pattern,article)
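Both search and match return a match object (or None), so the matched text itself can be pulled out with .group(); a minimal sketch on an invented string:
m = re.search(r'\d+s', "in the 1990s and beyond")   # invented sample string
if m:
    print(m.group())                                # '1990s'
print(re.match(r'\w+', "in the 1990s").group())     # 'in' -- match only looks at the start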
Tokenizing Tweets¶
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer
tweet_sample = ['iPython jupyter for #NLP is awesome #18 all the best to @rand','#nlp by @rand is good :)','We should :-( use #python for ^^ analytics :(','#python is good for qualitative analysis <3','I always use #PYthon']
Read all #hashtags from a tweet
pattern_hashtag = r"#\w+"
regexp_tokenize(tweet_sample[0],pattern_hashtag)
Read all #hashtags and @mentions
pattern_hash_mention = r"[@#]\w+"
regexp_tokenize(tweet_sample[0],pattern_hash_mention)
Using TweetTokenizer means that the code will tokenize:
- # (hashtags)
- @ (mentions)
- :) :-( <3 (smileys)
tweetTknzr = TweetTokenizer()
tokenList = [tweetTknzr.tokenize(tweet) for tweet in tweet_sample]
print(tokenList)
If we just use word_tokenize, we find that it will not recognize hashtags, mentions, smileys, etc.
tokenList = [word_tokenize(tweet) for tweet in tweet_sample]
print(tokenList)
tf-idf¶
Term frequency - inverse document frequency (tf-idf) weights a term by how often it appears in a document, discounted by how common that term is across all documents, so that frequent but uninformative words carry less weight.
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
from nltk.corpus import stopwords
nltk.download('stopwords')
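The section ends with these imports, so the following is only a sketch of how WordNetLemmatizer and the stopword list might be used to clean tokens before computing tf-idf; the sample sentence is invented:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
sample = "The cats were sitting on the mats"   # invented sample sentence
tokens = [t.lower() for t in word_tokenize(sample) if t.isalpha()]          # lowercase, keep alphabetic tokens only
cleaned = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]  # drop stopwords, lemmatize the rest
print(cleaned)   # ['cat', 'sitting', 'mat']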