Term Frequency - Inverse Document Frequency (tf-idf) with gensim
Term Frequency - Inverse Document Frequency (tf-idf) using gensim¶
tf-idf identifies the most important words in each document of a corpus. A corpus (that is, a collection of documents) typically contains words that are shared across documents. For example, a corpus on finance might mention money in nearly every article, and we would like to down-weight this keyword. The idea is to weight article-specific frequent words heavily and words shared across articles low.
$ w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_{i}}\right) $
$N$: total number of articles in the corpus.
$w_{i,j}$: tf-idf weight for token $i$ in article $j$.
$tf_{i,j}$: number of occurrences of token $i$ in article $j$.
$df_{i}$: number of articles that contain token $i$.
The tf-idf weight for a token $i$ in article $j$ is low if:
- Term frequency: the token does not occur frequently in article $j$ (low $tf_{i,j}$).
- Document frequency: tokens that occur across many or all articles are down-weighted. When a token appears in nearly every article, $ \frac{N}{df_{i}} $ is close to 1, so the log term is close to 0. Conversely, the fewer articles that contain a token, the larger the idf term, and the more heavily that token is weighted.
Thus tf-idf is a better tool than a bag of words when analyzing a corpus: a plain bag of words only counts word occurrences and does not down-weight words that are common across the corpus.
import numpy as np
# log(1) = 0, so a token that appears in every article gets zero idf weight
print('log(1): {0}'.format(np.log(1)))
print('log(10): {0}'.format(np.log(10)))
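Carrying the log demonstration one step further, here is a minimal sketch of the full weight formula with made-up counts (the token names and numbers below are hypothetical):

import numpy as np

N = 100  # hypothetical: 100 articles in the corpus

# 'money' occurs 5 times in this article, and in 90 of the 100 articles
w_money = 5 * np.log(100 / 90)        # ~0.53: heavily down-weighted

# 'derivatives' also occurs 5 times here, but in only 4 articles
w_derivatives = 5 * np.log(100 / 4)   # ~16.09: weighted heavily

print(w_money, w_derivatives)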
dictionary: a gensim dictionary mapping token IDs to tokens.
corpus: a list of lists of tuples, one inner list per document in the corpus (i.e., the set of documents). Each tuple gives (i) the token ID and (ii) the frequency with which that token occurs in that document.
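For reference, these two structures could be built from pre-tokenized documents with gensim; a minimal sketch (the documents below are hypothetical):

from gensim.corpora.dictionary import Dictionary

# Hypothetical pre-tokenized documents
docs = [['stocks', 'money', 'rose'],
        ['money', 'fell', 'bonds', 'money']]

example_dictionary = Dictionary(docs)  # maps token IDs <-> tokens
# doc2bow converts each document into a list of (token ID, frequency) tuples
example_corpus = [example_dictionary.doc2bow(doc) for doc in docs]
print(example_corpus)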
Reading in files (dictionary, corpus)¶
import pickle
import os

cwd = os.getcwd()
dirName = 'inputs'
fileName = 'nlp_data.pkl'
file = os.path.join(cwd, dirName, fileName)

def load_pkl(fileName):
    # Load a pickled object from disk
    with open(fileName, 'rb') as f:
        obj = pickle.load(f)
    return obj

[dictionary, corpus] = load_pkl(file)
print(dictionary.token2id)
print(corpus[:5])
Create tf-idf model¶
Creating the tf-idf model from the corpus:
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus)
print(tfidf)
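Note that gensim's TfidfModel does not reproduce the formula above exactly with its default settings; to the best of my understanding, it uses a base-2 logarithm for the idf term and L2-normalizes each document's weights, so the printed weights fall between 0 and 1. A quick sketch of the default idf term (assuming the helper gensim.models.tfidfmodel.df2idf is available in your gensim version):

from gensim.models.tfidfmodel import df2idf

# Hypothetical counts: a token appearing in 4 of 100 documents
print(df2idf(4, 100))  # log2(100 / 4) ~ 4.64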
Applying the tf-idf model to a document in the corpus:
tfidf_wtg = tfidf[corpus[1]]
print('tfidf_wtg:', tfidf_wtg)

# Sort tokens by tf-idf weight, highest first
sorted_tfidf_wtg = sorted(tfidf_wtg, key=lambda wtg: wtg[1], reverse=True)
print('sorted_tfidf_wtg:', sorted_tfidf_wtg)

# Print the five highest-weighted tokens in this document
for word_id, word_wtg in sorted_tfidf_wtg[:5]:
    print('{0}: {1}'.format(dictionary.get(word_id), word_wtg))
In the entire corpus, find the words with the highest tf-idf weights:
# Scan every article and print tokens whose tf-idf weight exceeds 0.5
for article in corpus:
    tfidf_wtg = tfidf[article]
    sorted_tfidf_wtg = sorted(tfidf_wtg, key=lambda wtg: wtg[1], reverse=True)
    for word_id, word_wtg in sorted_tfidf_wtg:
        if word_wtg > 0.5:
            print('{0}: {1}'.format(dictionary.get(word_id), word_wtg))
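A small variation on the loop above, if you would rather collect the single highest-weighted token per article than print every token over a threshold (a sketch reusing the same dictionary, corpus, and tfidf objects):

# (article index, top token, tf-idf weight) for each non-empty article
top_tokens = []
for i, article in enumerate(corpus):
    tfidf_wtg = tfidf[article]
    if tfidf_wtg:  # skip documents with no weighted tokens
        word_id, word_wtg = max(tfidf_wtg, key=lambda wtg: wtg[1])
        top_tokens.append((i, dictionary.get(word_id), word_wtg))
print(top_tokens[:5])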