Term Frequency - Inverse Document Frequency (tf-idf) with gensim
Term Frequency - Inverse Document Frequency (tf-idf) using gensim¶
tf-idf identifies the most important words in each document of a corpus. A corpus (that is, a collection of documents) typically contains words that are shared across documents. For example, a corpus on finance might mention money in nearly every article, and we would like to down-weight this keyword. The idea is to weight article-specific frequent words heavily and words shared across articles low.
$ w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_{i}}\right) $
$N$: total number of articles in the corpus.
$w_{i,j}$: tf-idf weight for token $i$ in article $j$.
$tf_{i,j}$: number of occurrences of token $i$ in article $j$.
$df_{i}$: number of articles that contain token $i$.
The tf-idf weight for a token $i$ in article $j$ is low if:
- Term frequency: the token does not occur frequently in article $j$ (low $tf_{i,j}$).
- Document frequency: tokens that occur across many or all articles are down-weighted. When a token appears in nearly every article, $ \frac{N}{df_{i}} $ is close to 1, so the log term is close to 0. Conversely, the fewer articles that contain a token, the larger the idf term, and the more heavily that token is weighted.
Thus tf-idf is a better tool than a bag of words when analyzing a corpus: a plain bag of words only counts word occurrences and does not down-weight words that are common across the corpus.
import numpy as np
# log(1) = 0, so a token that appears in every article gets zero idf weight
print('log(1): {0}'.format(np.log(1)))
print('log(10): {0}'.format(np.log(10)))
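Carrying the log demonstration one step further, here is a minimal sketch of the full weight formula with made-up counts (the token names and numbers below are hypothetical):

import numpy as np

N = 100  # hypothetical: 100 articles in the corpus

# 'money' occurs 5 times in this article, and in 90 of the 100 articles
w_money = 5 * np.log(100 / 90)        # ~0.53: heavily down-weighted

# 'derivatives' also occurs 5 times here, but in only 4 articles
w_derivatives = 5 * np.log(100 / 4)   # ~16.09: weighted heavily

print(w_money, w_derivatives)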
dictionary: a gensim dictionary mapping token IDs to tokens.
corpus: a list of lists of tuples, one inner list per document in the corpus (i.e., the set of documents). Each tuple gives (i) the token ID and (ii) the frequency with which that token occurs in that document.
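For reference, these two structures could be built from pre-tokenized documents with gensim; a minimal sketch (the documents below are hypothetical):

from gensim.corpora.dictionary import Dictionary

# Hypothetical pre-tokenized documents
docs = [['stocks', 'money', 'rose'],
        ['money', 'fell', 'bonds', 'money']]

example_dictionary = Dictionary(docs)  # maps token IDs <-> tokens
# doc2bow converts each document into a list of (token ID, frequency) tuples
example_corpus = [example_dictionary.doc2bow(doc) for doc in docs]
print(example_corpus)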
Reading in files (dictionary, corpus)¶
import pickle
import os

cwd = os.getcwd()
dirName = 'inputs'
fileName = 'nlp_data.pkl'
file = os.path.join(cwd, dirName, fileName)

def load_pkl(fileName):
    # Load a pickled object from disk
    with open(fileName, 'rb') as f:
        obj = pickle.load(f)
    return obj

[dictionary, corpus] = load_pkl(file)
print(dictionary.token2id)
print(corpus[:5])
Create tf-idf model¶
Creating the tf-idf model from the corpus:
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus)
print(tfidf)
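Note that gensim's TfidfModel does not reproduce the formula above exactly with its default settings; to the best of my understanding, it uses a base-2 logarithm for the idf term and L2-normalizes each document's weights, so the printed weights fall between 0 and 1. A quick sketch of the default idf term (assuming the helper gensim.models.tfidfmodel.df2idf is available in your gensim version):

from gensim.models.tfidfmodel import df2idf

# Hypothetical counts: a token appearing in 4 of 100 documents
print(df2idf(4, 100))  # log2(100 / 4) ~ 4.64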
Applying the tf-idf model to a document in the corpus:
tfidf_wtg = tfidf[corpus[1]]
print('tfidf_wtg:', tfidf_wtg)

# Sort tokens by tf-idf weight, highest first
sorted_tfidf_wtg = sorted(tfidf_wtg, key=lambda wtg: wtg[1], reverse=True)
print('sorted_tfidf_wtg:', sorted_tfidf_wtg)

# Print the five highest-weighted tokens in this document
for word_id, word_wtg in sorted_tfidf_wtg[:5]:
    print('{0}: {1}'.format(dictionary.get(word_id), word_wtg))
In the entire corpus, find the words with the highest tf-idf weights:
# Scan every article and print tokens whose tf-idf weight exceeds 0.5
for article in corpus:
    tfidf_wtg = tfidf[article]
    sorted_tfidf_wtg = sorted(tfidf_wtg, key=lambda wtg: wtg[1], reverse=True)
    for word_id, word_wtg in sorted_tfidf_wtg:
        if word_wtg > 0.5:
            print('{0}: {1}'.format(dictionary.get(word_id), word_wtg))
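A small variation on the loop above, if you would rather collect the single highest-weighted token per article than print every token over a threshold (a sketch reusing the same dictionary, corpus, and tfidf objects):

# (article index, top token, tf-idf weight) for each non-empty article
top_tokens = []
for i, article in enumerate(corpus):
    tfidf_wtg = tfidf[article]
    if tfidf_wtg:  # skip documents with no weighted tokens
        word_id, word_wtg = max(tfidf_wtg, key=lambda wtg: wtg[1])
        top_tokens.append((i, dictionary.get(word_id), word_wtg))
print(top_tokens[:5])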