Using spaCy

spaCy is an open-source library for NLP, similar to gensim. It includes useful modules such as displaCy. spaCy is useful for NER because it has a different set of entity types from nltk and can therefore label data differently. It also has informal-language corpora, which is useful for chat and Tweets. spaCy is built for speed and is designed to perform real work rather than research.

4 minute read…

Dealing with imbalanced datasets

An imbalanced dataset is one in which one class is substantially smaller than another. For example, we may have a dataset of banking transactions where 1% are fraudulent (Target = 1) and 99% are not (Target = 0). Many machine learning problems are imbalanced (e.g., fraud detection, probability of default), so we need a few strategies to manage this issue.
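
One common strategy is to weight each class inversely to its frequency so that errors on the rare class count for more during training. The sketch below is only an illustration of that idea and is not necessarily the approach taken in the post; the helper name `balanced_class_weights` and the toy labels are assumptions made for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class, inversely proportional to class frequency.

    Hypothetical helper for illustration. Uses the common heuristic
    n_samples / (n_classes * class_count).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy data matching the 1% / 99% example: 99 legitimate, 1 fraudulent.
targets = [0] * 99 + [1]
weights = balanced_class_weights(targets)
print(weights)
```

With these weights, the single fraudulent example contributes as much to a weighted loss as all 99 legitimate examples combined.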

2 minute read…

What are decision trees and CARTs?

In machine learning, the terms decision tree and Classification and Regression Tree (CART) are often used interchangeably. Decision tree can serve as an over-arching term for CARTs: a Classification Tree is used when the target variable takes a discrete set of values, and a Regression Tree when the target variable takes continuous values.
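
The difference between the two kinds of tree shows up in how a candidate split is scored. As a rough sketch, assuming Gini impurity for classification and variance for regression (two common choices, not necessarily the ones the post uses), the scoring functions might look like this; both function names are illustrative.

```python
from collections import Counter


def gini_impurity(labels):
    """Impurity of a node holding discrete class labels (classification tree)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def variance(values):
    """Spread of a node holding continuous targets (regression tree)."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


# A perfectly mixed two-class node versus a pure node.
mixed = gini_impurity(["a", "a", "b", "b"])
pure = gini_impurity(["a", "a", "a", "a"])

# Continuous targets for a regression node.
spread = variance([1.0, 3.0])
```

A tree-building algorithm would pick the split that most reduces the chosen score across the resulting child nodes.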

6 minute read…

Cost functions, gradient descent, and gradient boost

A crucial concept in machine learning is understanding the cost function and gradient descent. Intuitively, in machine learning we are trying to train a model to match a set of outcomes in a training dataset. The cost function measures the difference between the outputs produced by the model and the actual data, and it is what we are trying to minimize. Gradient descent is a method for minimizing the cost function. Another important concept is gradient boost, as it underpins some of the most effective machine learning classifiers, such as Gradient Boosted Trees.
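
To make the idea concrete, here is a minimal sketch of gradient descent on a mean squared error cost for a one-variable linear model. The model form, the toy data, the learning rate, and the step count are all assumptions chosen for illustration rather than anything taken from the post.

```python
def mse_cost(w, b, xs, ys):
    """Mean squared error between predictions w*x + b and the targets."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Minimize the MSE cost by repeatedly stepping against its gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE cost with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
final_cost = mse_cost(w, b, xs, ys)
```

Each step moves the parameters a small distance in the direction that lowers the cost, so on this toy data `w` and `b` should approach the slope and intercept used to generate it.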

4 minute read…

Fitting a volatility model on stocks

Today a quant posed me a question:

If I had a sorted timeseries, how would I know if it was ordered correctly? What if it's in reverse?

After an interesting conversation about how I would solve the issue, he informed me that a straightforward way was to fit a GARCH model, and that the model fit would be much better if the timeseries was sorted in the right direction.

7 minute read…

Creating a Nikola coding blog

This is a comprehensive walk-through of how to set up a static blog site with Nikola, specifically for creating a Jupyter Notebook code blog. Although there are other websites that show you how to do this, I found they weren't particularly comprehensive and it took me quite some time to wrap my head around Nikola and how to use it effectively.

Most of the other websites cover older versions of Nikola. The walk-through below is for Nikola version 8.01, and I am deploying my website to GitHub Pages, although you can use other hosts like GitLab or Bitbucket.

23 minute read…