Use the Stanford CoreNLP with nltk

Named entity recognition (NER) extracts the most important keywords in an article and categorizes them according to the table below.

NLP Capstone

Rand Low

2019-Jan-05

NLP final product (single document)

This code is a capstone of all the processes we have learnt so far. The user inputs the text of any single document, and the code immediately extracts keywords that show what the document is about.

10 minute read…

NLP: Detecting the occurrence of fake news

Rand Low

2019-Jan-05

NLP fake news classifier

We download the fake news dataset from Kaggle and train a supervised classification model on it.

11 minute read…

Polyglot

Rand Low

2019-Jan-05

polyglot

The polyglot package is used because it allows NLP to be applied to ~200 languages.

1 minute read…

Portfolio optimization & backtesting

Rand Low

2019-Jan-05

We evaluate, compare, and demonstrate different packages for performing portfolio optimization. There are several options available:

Optimization using scipy.optimize
Optimization with cvxopt
Optimization with cvxpy

33 minute read…

Regular expressions (regex)

Rand Low

2019-Jan-05

Regular Expressions (regex)

This post explores regular expressions (also known as 'regex') in Python using the re module. A regular expression is a sequence of characters that defines a search pattern. Applications of regular expressions include (i) searching for certain file names in a directory, (ii) string processing such as search and replacement, (iii) syntax highlighting, (iv) data/web scraping, and (v) natural language processing.
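As a small illustration of the search and the search-and-replace use cases, the sketch below uses only the standard re module. The sample sentence and both patterns are made up for this example, and the email pattern is deliberately loose rather than a complete validator.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Contact alice@example.com or bob@example.org before 2019-01-05."

# Search: find every date written as YYYY-MM-DD.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# Search and replace: mask anything that loosely resembles an email address.
masked = re.sub(r"[\w.]+@[\w.]+\.\w+", "<email>", text)

print(dates)   # ['2019-01-05']
print(masked)  # Contact <email> or <email> before 2019-01-05.
```

re.findall returns all non-overlapping matches as a list, while re.sub returns a new string with each match replaced.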

19 minute read…

Term Frequency - Inverse Document Frequency (tf-idf) with gensim

Rand Low

2019-Jan-05

Term Frequency - Inverse Document Frequency (tf-idf) using gensim

tf-idf identifies the most important words in each document of a corpus. A corpus (that is, a collection of documents) can contain words that are shared across its documents. For example, every article in a corpus on finance might mention money, and we would like to down-weight this keyword. The idea is to give a high weight to words that are frequent in one specific article and a low weight to words that are shared across articles.
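The weighting idea can be sketched in plain Python. This is not gensim: it assumes one common variant of the formula (raw term count multiplied by log2 of the number of documents divided by the number of documents containing the word), and the three-document finance corpus is invented for the example. The weights and normalization that gensim produces may differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: every document mentions "money".
corpus = [
    "money market funds hold money".split(),
    "bond yields move money".split(),
    "equity funds chase money".split(),
]

n_docs = len(corpus)

# Document frequency: the number of documents that contain each word.
doc_freq = Counter(word for doc in corpus for word in set(doc))

def tfidf(doc):
    """Weight each word by its count times log2(n_docs / document frequency)."""
    counts = Counter(doc)
    return {w: c * math.log2(n_docs / doc_freq[w]) for w, c in counts.items()}

weights = tfidf(corpus[0])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word}: {weight:.3f}")
```

Because money occurs in all three documents, its weight in the first document is zero even though it is the most frequent word there, while market and hold, which occur in that document only, receive the highest weights.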

8 minute read…

Tokenization

Rand Low

2019-Jan-05

Tokenization

A token is the technical name for a sequence of characters — such as car, his, or :) — that we want to treat as a group. Tokenization is breaking up a text into these tokens.
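To make the definition concrete, here is a minimal tokenizer sketch built on the standard re module rather than on a dedicated NLP library. The pattern and the sample sentence are assumptions made for this illustration: the pattern recognizes a few simple emoticons, then words, then any other single non-space character.

```python
import re

# Hypothetical sample sentence, invented for this illustration.
text = "His car broke down :) so he walked."

# Alternatives are tried left to right: simple emoticons first, then words,
# then any remaining single non-space character (such as punctuation).
TOKEN_PATTERN = re.compile(r"[:;]-?[()DP]|\w+|\S")

tokens = TOKEN_PATTERN.findall(text)
print(tokens)
# ['His', 'car', 'broke', 'down', ':)', 'so', 'he', 'walked', '.']
```

Placing the emoticon alternative first keeps :) together as one token; otherwise the colon and the parenthesis would be split into two punctuation tokens.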

The most common module for natural language processing in Python is nltk, which is short for the Natural Language Toolkit. There are several tokenization commands, which can be imported using from nltk.tokenize import <cmd>, where <cmd> is replaced by one of the token commands below:

16 minute read…