NLP Capstone

NLP final product (single document)

This code is a capstone of all the processes we have learnt so far. It lets the user input the text of any single document and immediately extracts keywords that show what the document is about.

10 minute read…

Regular expressions (regex)

Regular Expressions (regex)

This post explores working with regular expressions (also known as 'regex') using Python's re module. A regular expression is a sequence of characters that defines a search pattern. Applications of regular expressions include (i) searching for certain file names in a directory, (ii) string processing such as search and replace, (iii) syntax highlighting, (iv) data/web scraping, and (v) natural language processing.
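
As a rough illustration of the search and search-and-replace applications, here is a minimal sketch using only the standard re module. The sample sentence, the simplified email pattern, and the <DATE> placeholder are assumptions made up for this example; a real email matcher would need a stricter pattern.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Contact alice@example.com or bob@example.org by 2024-05-01."

# Search: find every substring that looks like an email address.
# This pattern is deliberately simplified and is not a full email validator.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
emails = email_pattern.findall(text)

# Search and replace: swap any ISO-style date (YYYY-MM-DD) for a placeholder.
redacted = re.sub(r"\d{4}-\d{2}-\d{2}", "<DATE>", text)

print(emails)
print(redacted)
```

Compiling the pattern once with re.compile is convenient when the same pattern is reused; for a one-off substitution, calling re.sub directly works just as well.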

19 minute read…

Term Frequency - Inverse Document Frequency (tf-idf) with gensim

Term Frequency - Inverse Document Frequency (tf-idf) using gensim

tf-idf allows the analysis of the most important words in the corpus. A corpus (that is, a collection of documents) can have words that are shared across its documents. For example, a corpus on finance might mention money throughout, and we would like to down-weight this keyword. The idea is to make sure that frequent article-specific words are weighted heavily and these article-shared words are weighted low.
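
To make the down-weighting idea concrete, here is a hand-rolled sketch in plain Python. It does not use gensim, and gensim's own weighting and normalization defaults may differ from the textbook formula used here (term frequency times the natural log of the number of documents divided by the document frequency). The three-document finance corpus and the function name tfidf_weights are assumptions invented for the example.

```python
import math
from collections import Counter

# Hypothetical toy corpus on finance; every document mentions "money".
corpus = [
    "money market funds hold money".split(),
    "central bank raises rates to protect money".split(),
    "bond investors watch money and rates".split(),
]

def tfidf_weights(docs):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

weights = tfidf_weights(corpus)
# Terms of the first document, highest weight first.
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

Because "money" appears in every document, its inverse document frequency is ln(3/3) = 0, so it drops to a weight of zero even though it is the most frequent word in the first document. Words that appear in only one document, such as "funds", keep a positive weight.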

8 minute read…



A token is the technical name for a sequence of characters — such as car, his, or :) — that we want to treat as a group. Tokenization is breaking up a text into these tokens.
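
As a small sketch of what breaking text into tokens can look like, the snippet below uses a regular expression from the standard library rather than nltk, so it is an illustration of the idea and not the nltk tokenizers themselves. The pattern, the sample sentence, and the function name simple_tokenize are assumptions invented for this example.

```python
import re

# Illustrative pattern, tried left to right:
#   1. a simple emoticon such as :) or ;-(
#   2. a run of word characters (a word or a number)
#   3. any single remaining non-space symbol (punctuation)
TOKEN_PATTERN = re.compile(r"[:;]-?[)(DP]|\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word, emoticon, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("His car broke down :) again.")
print(tokens)
```

Putting the emoticon alternative first matters: otherwise the colon and the parenthesis in :) would be split into two separate punctuation tokens.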

The most common module for natural language processing in Python is nltk, short for the Natural Language Toolkit. There are several tokenization commands, which can be imported with from nltk.tokenize import <cmd>, where <cmd> is replaced by one of the token commands below:

16 minute read…