Bag of words with gensim

gensim is a popular package that lets us create word vectors for NLP tasks on text. Unlike NLTK, which is the better option for a single article, gensim is ideal for working with a collection of articles.

Corpus: A list or large collection of texts.
Dictionary: Maps each unique token to an integer ID. The IDs are used to build the document/word vectors needed for topic identification and document comparison.
Word vector: A multi-dimensional mathematical representation of a word that captures its relationships to other words in a corpus.

In [69]:
import gensim, os, nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.corpora.dictionary import Dictionary
from pprint import pprint
nltk.download('stopwords')
nltk.download('wordnet')
wnl = WordNetLemmatizer()
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Using gensim
  1. Create a dictionary: Use Dictionary from gensim.corpora.dictionary on a list of tokenized articles (or sentences, in this case). This creates a mapping from each token to an ID. Normally, you would read in a whole set of articles so that the dictionary covers the vocabulary of the entire collection.
  2. Create a corpus: Using the dictionary, call doc2bow on each article to build the corpus.
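
The two steps can be illustrated without gensim. The following is a minimal plain-Python sketch, not gensim's implementation: the helper names build_token2id and to_bow are hypothetical, the toy documents are made up, and IDs are assigned in order of first appearance, which may differ from the order gensim uses.

```python
from collections import Counter

def build_token2id(documents):
    """Step 1: give each distinct token an integer ID (order of first appearance)."""
    token2id = {}
    for doc in documents:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Step 2: turn one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

# Two tiny, made-up "documents", already tokenized
docs = [["jew", "home", "jew"], ["home", "street"]]

token2id = build_token2id(docs)
bows = [to_bow(doc, token2id) for doc in docs]

print(token2id)  # {'jew': 0, 'home': 1, 'street': 2}
print(bows)      # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

The first document contains 'jew' twice, so its bag of words holds the pair (0, 2).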

Reading a file

The article below was downloaded from aeon.co, a great website that I read whenever I can. It is entitled "How materialism became an ethos of hope for Jewish reformers", and I use it as the example in this exercise.

In [70]:
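currDir = os.getcwd()
fileName = 'aeon.txt'

# Build the path with os.path.join so it works on any operating system
readFile = os.path.join(currDir, 'inputs', fileName)

# The context manager closes the file even if reading fails
with open(readFile, 'r') as f:
    article = f.read()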
print(article)
Be ‘a man in the street and a Jew in the home’: a common piece of advice that liberal Jews often gave their co-religionists in the 19th century. If Jewishness was kept invisible and private, they wagered, then Jews could become citizens and professionals, and be granted equal access to the material resources made available to any other member of society. There was plenty of Christian bias to combat, encapsulated by images of Jewish avarice and materialism such as Shylock’s greedy hands and Rothschild’s beard in the form of snake-like tentacles. If only Jews could fit into the spiritual boxes established by the European Protestant elite, they would be accepted, or at least tolerated in the public sphere as Frenchmen, Germans or Englishmen. Though compelling in theory, the deal became more fraught as rampant anti-Semitic violence in eastern Europe continued to remind Jews that, no matter how much they tried to look like ‘everyone else’, their bodies were marked as Jewish. 
In the 1870s European Judaism underwent an intellectual revolution. Around then, a group of young Russian Jewish radicals began to identify Judaism with materialism, and to theorise about what they called – whether in Russian, German, Yiddish or Hebrew – the ‘material’ (material’nii, materiell, gashmi, ?omri) aspects of the Universe. For many Jews living in this period, ‘materialism’ was a worldview that brought into focus latent Jewish ideas and beliefs about the physical world. The materialists claimed that a theory of Judaism, defined by the way people related to land, labour and bodies, had been lying dormant within Jewish literature – in Hasidic texts, the Bible, Spinoza’s philosophy – and could now be clearly recognised and fully articulated. Jewish particularity was based on specific historical economic differences between Jews and others. What made Jews different was a certain socioeconomic dynamic that distinguished them from their neighbours.
The Jewish revolutionaries in 1870s Russia who embraced the idea of materialism shared a number of critical assumptions. They all rejected the notion that Judaism was based on abstract metaphysical theories (Scholasticism), rituals (Hasidism), study (Mitnagdim), and ethics and reason (Enlighteners). Judaism was not a religion, like Protestantism. Instead it was something attached to their bodies and expressed through one’s relationship to land, labour and resources. The materialists had also given up hope that the state could protect them and ensure their economic wellbeing. And finally, they no longer believed that history was headed in a positive direction. Over no amount of time would Jews living in Russia ever be granted greater rights and opportunities. Therefore, only a radical reclaiming of the physical world on the part of Jews could ensure that they would be protected and given a fair and equal share of resources.
Soon, the Jewish materialism of the Russians could be found among western European Jews residing in England and Germany. Only half-jokingly, the German anarchist Gustav Landauer claimed in 1921 that what distinguished ‘the modern “conscious” Jew from a German was that when the latter writes about … the conservation of energy, … he writes about the conservation of energy, but when the conscious Jew writes about the conservation of energy, he writes about the conservation of energy and Judaism’ (emphasis mine). Eventually, there would be those, such as the Englishman Israel Zangwill, who considered themselves adherents to ‘a religion of pots and pans’, and others who identified Judaism as a faith based on ‘bagels and lox’. Over the course of the 20th century, Jews would increasingly come to believe that ‘there is nothing purely spiritual that stands on its own … Everything spiritual requires a necessary material basis.’
Updates on everything new at Aeon.
Top of Form
Bottom of Form
JJjjjJJ  Jewish materialists were despised not only by staunch liberals but also by ‘defenders of the faith’. Moses Leib Lilienblum, who would go on to found the Zionist movement in Russia, wrote a novel in which he described his youthful yeshiva education as one long masturbatory experience – for this, he was denounced by rabbis and communal leaders who forced him to flee his hometown in fear for his life. The future Russian revolutionary Hasia Schur was pelted with stones and jeered at by the townspeople of Mohilev for going on a Sabbath walk hand-in-hand with her boyfriend, the socialist Eliezer Tsukerman: the rabbis were up in arms that two young people had dared to touch one another in public. Jewish materialists were cast as upstarts, deviants, social provocateurs and, of course, with providing Jew-haters with excuses to promote anti-Semitism.
But the Jewish materialists’ deviancies reflected a radically new kind of Jewish identity, one focused on their bodies and the physical world. The Jewish body they imagined would offer a contrast to both the hunchbacked, traditional Jewish Torah scholar incapable of supporting his family, and the muscular gentile male whose energies were directed at conquering and dominating the physical world. The new Jewish body would be shaped in the image of a healthy traditional Jewish woman who laboured to provide for her family’s material wellbeing while her husband spent his day in the house of study: by tending to the material aspects of existence, Jews’ needs and desires would now be seen as the primary feature of Judaism. The material Jewish identity set the stage for Jews’ involvement in 20th-century politics: Zionism, Bundism (the Jewish labour movement), the Minority Rights movement, and Jewish forms of communism all assumed that the organising structure of Jewish identity was a Jewish body, and not a Judaism of the heavens or the heart. Jewish materialism made Jews political without them possessing their own state or even citizenship in a host country.
Though the idea of the Jewish body as the locus of collective identity would always be suspect in western Europe, it would, however, become the basis of a new kind of Jewish identity most commonly witnessed in Israel and the United States. Jewish immigrants to Palestine at the turn of the century saw in Zion the actualisation of materialism as first imagined in the 1870s. The Marxist Ber Borchov’s students, such as future leaders of Israel Yitzhak Ben-Zvi and David Ben-Gurion, identified Palestine as a response to the crisis of the fork and the knife (a pithy phrase meant to capture the economic challenges of Russian Jews in the 1870s) originally theorised by the Jewish materialist Aaron Shemuel Lieberman in the 1870s. They envisioned a new kind of Jew – the ?aluts (pioneer) – who was attached to the physical world. As described by the 20th-century Zionist poet Avraham Shlonsky, a former Hasidic Jew, the ?aluts would be the embodiment of the idea that ‘a human being is meat, and he toils here in the sacred/and the land/bread’. The people of the book had now become a people of labour, land and the body.
In the US, eastern European Jews established large-scale defence organisations directed at protecting Jewish bodies and providing a platform for Jews to speak as a distinct ethnic minority in the American public sphere. From the poet Emma Lazarus to the American rabbi Mordecai Kaplan to the philosopher Horace Kallen, American Jews in the early 20th century developed political programmes and established organisations rooted around the physical aspects of Jewish life.
Jewish materialism remains the defining element of most American Jews’ identity. Following the Second World War, the influx of another wave of Jewish immigrants from Russian lands gave rise to a new brand of US literature that placed the Jewish body front and centre. The late US novelist Phillip Roth might have been familiar only in passing with the name Moses Lilienblum. But it was Lilienblum who put into circulation the Jewish genre of overbearing parents, unrealisable social expectations, failed sexual encounters, silly rabbis, bankrupt synagogues and God-fearing charlatans encased in a narrative about masturbation. Whether he knew it or not, when Roth wrote his novel Portnoy’s Complaint(1969), he was channelling the same tradition first articulated by Lilienblum a century earlier.
Roth took those commitments to his grave when he died on 22 May 2018. While the grandmaster of late-20th-century American letters asked to be interred next to Jews, he strictly prohibited the performance of any Jewish rituals at his funeral. His final requests, allegedly, were inspired by a desire ‘to have someone to talk to’. His corpse did not need a rabbi to eulogise it, or a perfunctory kaddish (or hymn) to kasher it; it was simply Jewish – nothing more and nothing less. Indeed, it was a fitting conclusion to the life of a Jewish materialist.




In [71]:
sentences = sent_tokenize(article) # break up the article into sentences
sList = [word_tokenize(s.lower()) for s in sentences] # lowercase each sentence and split it into a list of words
print(sList)
[['be', '‘', 'a', 'man', 'in', 'the', 'street', 'and', 'a', 'jew', 'in', 'the', 'home', '’', ':', 'a', 'common', 'piece', 'of', 'advice', 'that', 'liberal', 'jews', 'often', 'gave', 'their', 'co-religionists', 'in', 'the', '19th', 'century', '.'], ['if', 'jewishness', 'was', 'kept', 'invisible', 'and', 'private', ',', 'they', 'wagered', ',', 'then', 'jews', 'could', 'become', 'citizens', 'and', 'professionals', ',', 'and', 'be', 'granted', 'equal', 'access', 'to', 'the', 'material', 'resources', 'made', 'available', 'to', 'any', 'other', 'member', 'of', 'society', '.'], ['there', 'was', 'plenty', 'of', 'christian', 'bias', 'to', 'combat', ',', 'encapsulated', 'by', 'images', 'of', 'jewish', 'avarice', 'and', 'materialism', 'such', 'as', 'shylock', '’', 's', 'greedy', 'hands', 'and', 'rothschild', '’', 's', 'beard', 'in', 'the', 'form', 'of', 'snake-like', 'tentacles', '.'], ['if', 'only', 'jews', 'could', 'fit', 'into', 'the', 'spiritual', 'boxes', 'established', 'by', 'the', 'european', 'protestant', 'elite', ',', 'they', 'would', 'be', 'accepted', ',', 'or', 'at', 'least', 'tolerated', 'in', 'the', 'public', 'sphere', 'as', 'frenchmen', ',', 'germans', 'or', 'englishmen', '.'], ['though', 'compelling', 'in', 'theory', ',', 'the', 'deal', 'became', 'more', 'fraught', 'as', 'rampant', 'anti-semitic', 'violence', 'in', 'eastern', 'europe', 'continued', 'to', 'remind', 'jews', 'that', ',', 'no', 'matter', 'how', 'much', 'they', 'tried', 'to', 'look', 'like', '‘', 'everyone', 'else', '’', ',', 'their', 'bodies', 'were', 'marked', 'as', 'jewish', '.'], ['in', 'the', '1870s', 'european', 'judaism', 'underwent', 'an', 'intellectual', 'revolution', '.'], ['around', 'then', ',', 'a', 'group', 'of', 'young', 'russian', 'jewish', 'radicals', 'began', 'to', 'identify', 'judaism', 'with', 'materialism', ',', 'and', 'to', 'theorise', 'about', 'what', 'they', 'called', '–', 'whether', 'in', 'russian', ',', 'german', ',', 'yiddish', 'or', 'hebrew', '–', 'the', '‘', 'material', 
'’', '(', 'material', '’', 'nii', ',', 'materiell', ',', 'gashmi', ',', '?', 'omri', ')', 'aspects', 'of', 'the', 'universe', '.'], ['for', 'many', 'jews', 'living', 'in', 'this', 'period', ',', '‘', 'materialism', '’', 'was', 'a', 'worldview', 'that', 'brought', 'into', 'focus', 'latent', 'jewish', 'ideas', 'and', 'beliefs', 'about', 'the', 'physical', 'world', '.'], ['the', 'materialists', 'claimed', 'that', 'a', 'theory', 'of', 'judaism', ',', 'defined', 'by', 'the', 'way', 'people', 'related', 'to', 'land', ',', 'labour', 'and', 'bodies', ',', 'had', 'been', 'lying', 'dormant', 'within', 'jewish', 'literature', '–', 'in', 'hasidic', 'texts', ',', 'the', 'bible', ',', 'spinoza', '’', 's', 'philosophy', '–', 'and', 'could', 'now', 'be', 'clearly', 'recognised', 'and', 'fully', 'articulated', '.'], ['jewish', 'particularity', 'was', 'based', 'on', 'specific', 'historical', 'economic', 'differences', 'between', 'jews', 'and', 'others', '.'], ['what', 'made', 'jews', 'different', 'was', 'a', 'certain', 'socioeconomic', 'dynamic', 'that', 'distinguished', 'them', 'from', 'their', 'neighbours', '.'], ['the', 'jewish', 'revolutionaries', 'in', '1870s', 'russia', 'who', 'embraced', 'the', 'idea', 'of', 'materialism', 'shared', 'a', 'number', 'of', 'critical', 'assumptions', '.'], ['they', 'all', 'rejected', 'the', 'notion', 'that', 'judaism', 'was', 'based', 'on', 'abstract', 'metaphysical', 'theories', '(', 'scholasticism', ')', ',', 'rituals', '(', 'hasidism', ')', ',', 'study', '(', 'mitnagdim', ')', ',', 'and', 'ethics', 'and', 'reason', '(', 'enlighteners', ')', '.'], ['judaism', 'was', 'not', 'a', 'religion', ',', 'like', 'protestantism', '.'], ['instead', 'it', 'was', 'something', 'attached', 'to', 'their', 'bodies', 'and', 'expressed', 'through', 'one', '’', 's', 'relationship', 'to', 'land', ',', 'labour', 'and', 'resources', '.'], ['the', 'materialists', 'had', 'also', 'given', 'up', 'hope', 'that', 'the', 'state', 'could', 'protect', 'them', 'and', 'ensure', 
'their', 'economic', 'wellbeing', '.'], ['and', 'finally', ',', 'they', 'no', 'longer', 'believed', 'that', 'history', 'was', 'headed', 'in', 'a', 'positive', 'direction', '.'], ['over', 'no', 'amount', 'of', 'time', 'would', 'jews', 'living', 'in', 'russia', 'ever', 'be', 'granted', 'greater', 'rights', 'and', 'opportunities', '.'], ['therefore', ',', 'only', 'a', 'radical', 'reclaiming', 'of', 'the', 'physical', 'world', 'on', 'the', 'part', 'of', 'jews', 'could', 'ensure', 'that', 'they', 'would', 'be', 'protected', 'and', 'given', 'a', 'fair', 'and', 'equal', 'share', 'of', 'resources', '.'], ['soon', ',', 'the', 'jewish', 'materialism', 'of', 'the', 'russians', 'could', 'be', 'found', 'among', 'western', 'european', 'jews', 'residing', 'in', 'england', 'and', 'germany', '.'], ['only', 'half-jokingly', ',', 'the', 'german', 'anarchist', 'gustav', 'landauer', 'claimed', 'in', '1921', 'that', 'what', 'distinguished', '‘', 'the', 'modern', '“', 'conscious', '”', 'jew', 'from', 'a', 'german', 'was', 'that', 'when', 'the', 'latter', 'writes', 'about', '…', 'the', 'conservation', 'of', 'energy', ',', '…', 'he', 'writes', 'about', 'the', 'conservation', 'of', 'energy', ',', 'but', 'when', 'the', 'conscious', 'jew', 'writes', 'about', 'the', 'conservation', 'of', 'energy', ',', 'he', 'writes', 'about', 'the', 'conservation', 'of', 'energy', 'and', 'judaism', '’', '(', 'emphasis', 'mine', ')', '.'], ['eventually', ',', 'there', 'would', 'be', 'those', ',', 'such', 'as', 'the', 'englishman', 'israel', 'zangwill', ',', 'who', 'considered', 'themselves', 'adherents', 'to', '‘', 'a', 'religion', 'of', 'pots', 'and', 'pans', '’', ',', 'and', 'others', 'who', 'identified', 'judaism', 'as', 'a', 'faith', 'based', 'on', '‘', 'bagels', 'and', 'lox', '’', '.'], ['over', 'the', 'course', 'of', 'the', '20th', 'century', ',', 'jews', 'would', 'increasingly', 'come', 'to', 'believe', 'that', '‘', 'there', 'is', 'nothing', 'purely', 'spiritual', 'that', 'stands', 'on', 'its', 'own', 
'…', 'everything', 'spiritual', 'requires', 'a', 'necessary', 'material', 'basis.', '’', 'updates', 'on', 'everything', 'new', 'at', 'aeon', '.'], ['top', 'of', 'form', 'bottom', 'of', 'form', 'jjjjjjj', 'jewish', 'materialists', 'were', 'despised', 'not', 'only', 'by', 'staunch', 'liberals', 'but', 'also', 'by', '‘', 'defenders', 'of', 'the', 'faith', '’', '.'], ['moses', 'leib', 'lilienblum', ',', 'who', 'would', 'go', 'on', 'to', 'found', 'the', 'zionist', 'movement', 'in', 'russia', ',', 'wrote', 'a', 'novel', 'in', 'which', 'he', 'described', 'his', 'youthful', 'yeshiva', 'education', 'as', 'one', 'long', 'masturbatory', 'experience', '–', 'for', 'this', ',', 'he', 'was', 'denounced', 'by', 'rabbis', 'and', 'communal', 'leaders', 'who', 'forced', 'him', 'to', 'flee', 'his', 'hometown', 'in', 'fear', 'for', 'his', 'life', '.'], ['the', 'future', 'russian', 'revolutionary', 'hasia', 'schur', 'was', 'pelted', 'with', 'stones', 'and', 'jeered', 'at', 'by', 'the', 'townspeople', 'of', 'mohilev', 'for', 'going', 'on', 'a', 'sabbath', 'walk', 'hand-in-hand', 'with', 'her', 'boyfriend', ',', 'the', 'socialist', 'eliezer', 'tsukerman', ':', 'the', 'rabbis', 'were', 'up', 'in', 'arms', 'that', 'two', 'young', 'people', 'had', 'dared', 'to', 'touch', 'one', 'another', 'in', 'public', '.'], ['jewish', 'materialists', 'were', 'cast', 'as', 'upstarts', ',', 'deviants', ',', 'social', 'provocateurs', 'and', ',', 'of', 'course', ',', 'with', 'providing', 'jew-haters', 'with', 'excuses', 'to', 'promote', 'anti-semitism', '.'], ['but', 'the', 'jewish', 'materialists', '’', 'deviancies', 'reflected', 'a', 'radically', 'new', 'kind', 'of', 'jewish', 'identity', ',', 'one', 'focused', 'on', 'their', 'bodies', 'and', 'the', 'physical', 'world', '.'], ['the', 'jewish', 'body', 'they', 'imagined', 'would', 'offer', 'a', 'contrast', 'to', 'both', 'the', 'hunchbacked', ',', 'traditional', 'jewish', 'torah', 'scholar', 'incapable', 'of', 'supporting', 'his', 'family', ',', 'and', 'the', 
'muscular', 'gentile', 'male', 'whose', 'energies', 'were', 'directed', 'at', 'conquering', 'and', 'dominating', 'the', 'physical', 'world', '.'], ['the', 'new', 'jewish', 'body', 'would', 'be', 'shaped', 'in', 'the', 'image', 'of', 'a', 'healthy', 'traditional', 'jewish', 'woman', 'who', 'laboured', 'to', 'provide', 'for', 'her', 'family', '’', 's', 'material', 'wellbeing', 'while', 'her', 'husband', 'spent', 'his', 'day', 'in', 'the', 'house', 'of', 'study', ':', 'by', 'tending', 'to', 'the', 'material', 'aspects', 'of', 'existence', ',', 'jews', '’', 'needs', 'and', 'desires', 'would', 'now', 'be', 'seen', 'as', 'the', 'primary', 'feature', 'of', 'judaism', '.'], ['the', 'material', 'jewish', 'identity', 'set', 'the', 'stage', 'for', 'jews', '’', 'involvement', 'in', '20th-century', 'politics', ':', 'zionism', ',', 'bundism', '(', 'the', 'jewish', 'labour', 'movement', ')', ',', 'the', 'minority', 'rights', 'movement', ',', 'and', 'jewish', 'forms', 'of', 'communism', 'all', 'assumed', 'that', 'the', 'organising', 'structure', 'of', 'jewish', 'identity', 'was', 'a', 'jewish', 'body', ',', 'and', 'not', 'a', 'judaism', 'of', 'the', 'heavens', 'or', 'the', 'heart', '.'], ['jewish', 'materialism', 'made', 'jews', 'political', 'without', 'them', 'possessing', 'their', 'own', 'state', 'or', 'even', 'citizenship', 'in', 'a', 'host', 'country', '.'], ['though', 'the', 'idea', 'of', 'the', 'jewish', 'body', 'as', 'the', 'locus', 'of', 'collective', 'identity', 'would', 'always', 'be', 'suspect', 'in', 'western', 'europe', ',', 'it', 'would', ',', 'however', ',', 'become', 'the', 'basis', 'of', 'a', 'new', 'kind', 'of', 'jewish', 'identity', 'most', 'commonly', 'witnessed', 'in', 'israel', 'and', 'the', 'united', 'states', '.'], ['jewish', 'immigrants', 'to', 'palestine', 'at', 'the', 'turn', 'of', 'the', 'century', 'saw', 'in', 'zion', 'the', 'actualisation', 'of', 'materialism', 'as', 'first', 'imagined', 'in', 'the', '1870s', '.'], ['the', 'marxist', 'ber', 'borchov', 
'’', 's', 'students', ',', 'such', 'as', 'future', 'leaders', 'of', 'israel', 'yitzhak', 'ben-zvi', 'and', 'david', 'ben-gurion', ',', 'identified', 'palestine', 'as', 'a', 'response', 'to', 'the', 'crisis', 'of', 'the', 'fork', 'and', 'the', 'knife', '(', 'a', 'pithy', 'phrase', 'meant', 'to', 'capture', 'the', 'economic', 'challenges', 'of', 'russian', 'jews', 'in', 'the', '1870s', ')', 'originally', 'theorised', 'by', 'the', 'jewish', 'materialist', 'aaron', 'shemuel', 'lieberman', 'in', 'the', '1870s', '.'], ['they', 'envisioned', 'a', 'new', 'kind', 'of', 'jew', '–', 'the', '?', 'aluts', '(', 'pioneer', ')', '–', 'who', 'was', 'attached', 'to', 'the', 'physical', 'world', '.'], ['as', 'described', 'by', 'the', '20th-century', 'zionist', 'poet', 'avraham', 'shlonsky', ',', 'a', 'former', 'hasidic', 'jew', ',', 'the', '?', 'aluts', 'would', 'be', 'the', 'embodiment', 'of', 'the', 'idea', 'that', '‘', 'a', 'human', 'being', 'is', 'meat', ',', 'and', 'he', 'toils', 'here', 'in', 'the', 'sacred/and', 'the', 'land/bread', '’', '.'], ['the', 'people', 'of', 'the', 'book', 'had', 'now', 'become', 'a', 'people', 'of', 'labour', ',', 'land', 'and', 'the', 'body', '.'], ['in', 'the', 'us', ',', 'eastern', 'european', 'jews', 'established', 'large-scale', 'defence', 'organisations', 'directed', 'at', 'protecting', 'jewish', 'bodies', 'and', 'providing', 'a', 'platform', 'for', 'jews', 'to', 'speak', 'as', 'a', 'distinct', 'ethnic', 'minority', 'in', 'the', 'american', 'public', 'sphere', '.'], ['from', 'the', 'poet', 'emma', 'lazarus', 'to', 'the', 'american', 'rabbi', 'mordecai', 'kaplan', 'to', 'the', 'philosopher', 'horace', 'kallen', ',', 'american', 'jews', 'in', 'the', 'early', '20th', 'century', 'developed', 'political', 'programmes', 'and', 'established', 'organisations', 'rooted', 'around', 'the', 'physical', 'aspects', 'of', 'jewish', 'life', '.'], ['jewish', 'materialism', 'remains', 'the', 'defining', 'element', 'of', 'most', 'american', 'jews', '’', 
'identity', '.'], ['following', 'the', 'second', 'world', 'war', ',', 'the', 'influx', 'of', 'another', 'wave', 'of', 'jewish', 'immigrants', 'from', 'russian', 'lands', 'gave', 'rise', 'to', 'a', 'new', 'brand', 'of', 'us', 'literature', 'that', 'placed', 'the', 'jewish', 'body', 'front', 'and', 'centre', '.'], ['the', 'late', 'us', 'novelist', 'phillip', 'roth', 'might', 'have', 'been', 'familiar', 'only', 'in', 'passing', 'with', 'the', 'name', 'moses', 'lilienblum', '.'], ['but', 'it', 'was', 'lilienblum', 'who', 'put', 'into', 'circulation', 'the', 'jewish', 'genre', 'of', 'overbearing', 'parents', ',', 'unrealisable', 'social', 'expectations', ',', 'failed', 'sexual', 'encounters', ',', 'silly', 'rabbis', ',', 'bankrupt', 'synagogues', 'and', 'god-fearing', 'charlatans', 'encased', 'in', 'a', 'narrative', 'about', 'masturbation', '.'], ['whether', 'he', 'knew', 'it', 'or', 'not', ',', 'when', 'roth', 'wrote', 'his', 'novel', 'portnoy', '’', 's', 'complaint', '(', '1969', ')', ',', 'he', 'was', 'channelling', 'the', 'same', 'tradition', 'first', 'articulated', 'by', 'lilienblum', 'a', 'century', 'earlier', '.'], ['roth', 'took', 'those', 'commitments', 'to', 'his', 'grave', 'when', 'he', 'died', 'on', '22', 'may', '2018', '.'], ['while', 'the', 'grandmaster', 'of', 'late-20th-century', 'american', 'letters', 'asked', 'to', 'be', 'interred', 'next', 'to', 'jews', ',', 'he', 'strictly', 'prohibited', 'the', 'performance', 'of', 'any', 'jewish', 'rituals', 'at', 'his', 'funeral', '.'], ['his', 'final', 'requests', ',', 'allegedly', ',', 'were', 'inspired', 'by', 'a', 'desire', '‘', 'to', 'have', 'someone', 'to', 'talk', 'to', '’', '.'], ['his', 'corpse', 'did', 'not', 'need', 'a', 'rabbi', 'to', 'eulogise', 'it', ',', 'or', 'a', 'perfunctory', 'kaddish', '(', 'or', 'hymn', ')', 'to', 'kasher', 'it', ';', 'it', 'was', 'simply', 'jewish', '–', 'nothing', 'more', 'and', 'nothing', 'less', '.'], ['indeed', ',', 'it', 'was', 'a', 'fitting', 'conclusion', 'to', 'the', 
'life', 'of', 'a', 'jewish', 'materialist', '.']]

Clean up articles

For each sentence in the article, perform the following pre-processing steps:

  1. Lowercase all words.
  2. Keep only alphabetic tokens.
  3. Remove stop words.
  4. Lemmatize all remaining words.
In [72]:
lower_alpha_list = []
no_stops_list = []
lemmatized_list = []

# Build the stop-word set once; set membership tests are fast
stop_words = set(stopwords.words('english'))

for s in sentences:
    lower_alpha = [w for w in word_tokenize(s.lower()) if w.isalpha()]  # steps 1 and 2
    no_stops = [t for t in lower_alpha if t not in stop_words]          # step 3
    lemmatized = [wnl.lemmatize(t) for t in no_stops]                   # step 4 (default POS is noun)

    lower_alpha_list.append(lower_alpha)
    no_stops_list.append(no_stops)
    lemmatized_list.append(lemmatized)

The printout for a single sentence below shows the difference. lemmatized_list is much shorter than lower_alpha_list because the stop words (e.g., a, was, what) are removed, and some words (e.g., jews, neighbours) are lemmatized.

In [73]:
sentID=10
print('lower_alpha_list:{0}'.format(lower_alpha_list[sentID]))
print('no_stops_list:{0}'.format(no_stops_list[sentID]))
print('lemmatized_list:{0}'.format(lemmatized_list[sentID]))
lower_alpha_list:['what', 'made', 'jews', 'different', 'was', 'a', 'certain', 'socioeconomic', 'dynamic', 'that', 'distinguished', 'them', 'from', 'their', 'neighbours']
no_stops_list:['made', 'jews', 'different', 'certain', 'socioeconomic', 'dynamic', 'distinguished', 'neighbours']
lemmatized_list:['made', 'jew', 'different', 'certain', 'socioeconomic', 'dynamic', 'distinguished', 'neighbour']

We can see that lemmatized_list holds the keywords from each sentence in the article.

In [97]:
pprint(lemmatized_list[0:3])
[['man',
  'street',
  'jew',
  'home',
  'common',
  'piece',
  'advice',
  'liberal',
  'jew',
  'often',
  'gave',
  'century'],
 ['jewishness',
  'kept',
  'invisible',
  'private',
  'wagered',
  'jew',
  'could',
  'become',
  'citizen',
  'professional',
  'granted',
  'equal',
  'access',
  'material',
  'resource',
  'made',
  'available',
  'member',
  'society'],
 ['plenty',
  'christian',
  'bias',
  'combat',
  'encapsulated',
  'image',
  'jewish',
  'avarice',
  'materialism',
  'shylock',
  'greedy',
  'hand',
  'rothschild',
  'beard',
  'form',
  'tentacle']]
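
With per-sentence keyword lists like the ones printed above, a natural next step is to count how often each keyword appears across sentences. This is a small sketch using only the standard library; sample_lemmatized is an abbreviated, hand-copied subset of the first two sentences, not the full lemmatized_list.

```python
from collections import Counter
from itertools import chain

# Abbreviated sample: a few tokens from the first two sentences printed above
sample_lemmatized = [
    ["man", "street", "jew", "home", "liberal", "jew", "century"],
    ["jewishness", "private", "jew", "citizen", "material", "society"],
]

# Flatten the per-sentence lists and count each keyword across all sentences
keyword_counts = Counter(chain.from_iterable(sample_lemmatized))
top_keywords = keyword_counts.most_common(1)

print(top_keywords)  # [('jew', 3)]
```

Running the same two lines on the full lemmatized_list would give the most frequent keywords of the whole article.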

Create gensim (i) corpus (ii) dictionary

We then create a Dictionary from the keywords in all sentences.

In [75]:
dictionary =  Dictionary(lemmatized_list) # Creates a unique ID for each word in the corpus of sentences.  Expects a list of lists.

word = 'economic'
word_id = dictionary.token2id.get(word)
print('Word: {0}; ID: {1}'.format(word,word_id))
Word: economic; ID: 138
In [76]:
print(dictionary) 
Dictionary(504 unique tokens: ['advice', 'century', 'common', 'gave', 'home']...)
In [77]:
#pprint(dictionary.token2id) # prints out the ids for each token

from itertools import islice

d = dictionary.token2id

def take(n, iterable):
    """Return the first n items of the iterable as a list."""
    return list(islice(iterable, n))

t1 = take(5, d.items())
print(t1)
[('advice', 0), ('century', 1), ('common', 2), ('gave', 3), ('home', 4)]

Once a dictionary is created, we can apply it to:

  1. the document that we used to create the dictionary
  2. any new set of articles.

The doc2bow command creates a bag of words, which is a list of tuples (e.g., [(0, 1), (1, 1)]). Each tuple consists of two values: the first is the token ID (from the dictionary) and the second is the number of occurrences of that token in the document.

In [78]:
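# Note: sList holds the raw tokens, not the lemmatized ones. doc2bow ignores any
# token that is not in the dictionary, so stop words, punctuation, and
# unlemmatized forms such as 'jews' are not counted.
corpus = [dictionary.doc2bow(s) for s in sList]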

id_tkn = 239
print('ID:{0}, Token:{1}'.format(id_tkn,dictionary.get(id_tkn)))
ID:239, Token:nothing
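
The cell above passes sList, the raw token lists, to doc2bow, while the dictionary was built from lemmatized_list. The plain-Python sketch below illustrates what that means for the counts. It is a stand-in rather than gensim's own code: the to_bow helper, the two-word vocabulary, and the token lists are all hypothetical, and the helper simply skips tokens that have no ID.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Count tokens by ID, skipping any token that has no ID."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary built from lemmatized tokens
token2id = {"home": 0, "jew": 1}

# The same made-up sentence before and after clean-up
raw_tokens = ["jews", "in", "the", "home", "of", "a", "jew", "."]
clean_tokens = ["jew", "home", "jew"]

raw_bow = to_bow(raw_tokens, token2id)
clean_bow = to_bow(clean_tokens, token2id)

print(raw_bow)    # [(0, 1), (1, 1)]  'jews' has no ID, so it is skipped
print(clean_bow)  # [(0, 1), (1, 2)]  both occurrences are counted as 'jew'
```

If the counts should reflect the cleaned text, passing the lemmatized token lists to doc2bow instead of the raw ones would be the change to consider.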

Corpus

The corpus is a list of lists of tuples.

  1. Each item in the outer list represents a sentence.
  2. Within each sentence, we have a list of tuples.
  3. Each tuple is (tokenID, number of occurrences of that token in the sentence).

Dictionary: A mapping between tokens and IDs, viewable as (tokenID, token) pairs.
Corpus: A list of lists of tuples (tokenID, number of occurrences of that token in the sentence).
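
Because a corpus holds only IDs and counts, it is often easier to read after the IDs are mapped back to tokens. The sketch below shows one way to do that in plain Python. The token IDs, the corpus_sample data, and the decode helper are hypothetical and chosen for illustration; in the notebook the real mapping comes from the dictionary object.

```python
from collections import Counter

# Hypothetical token IDs; the real ones come from dictionary.token2id
token2id = {"conservation": 0, "energy": 1, "judaism": 2}
id2token = {token_id: token for token, token_id in token2id.items()}

# One bag of words per sentence, as (tokenID, count) pairs
corpus_sample = [
    [(0, 4), (1, 4), (2, 1)],
    [(2, 1)],
]

def decode(bow, id2token):
    """Replace each token ID with its token, keeping the count."""
    return [(id2token[token_id], count) for token_id, count in bow]

decoded = [decode(bow, id2token) for bow in corpus_sample]
print(decoded)

# Sum the counts of each token over all sentences
totals = Counter()
for bow in corpus_sample:
    for token_id, count in bow:
        totals[id2token[token_id]] += count
print(dict(totals))
```

The per-token totals give a quick view of which words dominate the text, which is a common first check before any topic modelling.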

In [95]:
print(len(corpus)) # number of sentences in the corpus
pprint(corpus)
50
[[(0, 1),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 1),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1)],
 [(11, 1),
  (12, 1),
  (13, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (27, 1),
  (28, 1)],
 [(29, 1),
  (30, 1),
  (31, 1),
  (32, 1),
  (33, 1),
  (34, 1),
  (35, 1),
  (36, 1),
  (39, 1),
  (40, 1),
  (41, 1),
  (42, 1),
  (43, 1)],
 [(15, 1),
  (45, 1),
  (47, 1),
  (49, 1),
  (50, 1),
  (51, 1),
  (54, 1),
  (55, 1),
  (56, 1),
  (57, 1),
  (58, 1),
  (59, 1),
  (60, 1)],
 [(39, 1),
  (61, 1),
  (63, 1),
  (64, 1),
  (65, 1),
  (66, 1),
  (67, 1),
  (68, 1),
  (69, 1),
  (70, 1),
  (71, 1),
  (72, 1),
  (73, 1),
  (74, 1),
  (75, 1),
  (76, 1),
  (77, 1),
  (78, 1),
  (79, 1),
  (80, 1),
  (81, 1)],
 [(50, 1), (82, 1), (83, 1), (84, 1), (85, 1)],
 [(22, 2),
  (39, 1),
  (40, 1),
  (53, 1),
  (83, 1),
  (86, 1),
  (88, 1),
  (89, 1),
  (90, 1),
  (91, 1),
  (92, 1),
  (93, 1),
  (94, 1),
  (95, 1),
  (96, 1),
  (98, 2),
  (99, 1),
  (100, 1),
  (101, 1),
  (102, 1),
  (103, 1)],
 [(39, 1),
  (40, 1),
  (105, 1),
  (106, 1),
  (108, 1),
  (109, 1),
  (110, 1),
  (111, 1),
  (112, 1),
  (113, 1),
  (114, 1)],
 [(15, 1),
  (39, 1),
  (78, 1),
  (83, 1),
  (115, 1),
  (116, 1),
  (117, 1),
  (118, 1),
  (119, 1),
  (120, 1),
  (121, 1),
  (122, 1),
  (123, 1),
  (124, 1),
  (125, 1),
  (126, 1),
  (128, 1),
  (129, 1),
  (130, 1),
  (131, 1),
  (132, 1),
  (134, 1),
  (135, 1)],
 [(39, 1), (136, 1), (138, 1), (139, 1), (140, 1), (141, 1), (142, 1)],
 [(21, 1), (143, 1), (144, 1), (145, 1), (146, 1), (148, 1)],
 [(39, 1), (40, 1), (107, 1), (150, 1), (151, 1), (152, 1), (154, 1), (155, 1)],
 [(83, 1),
  (136, 1),
  (156, 1),
  (157, 1),
  (159, 1),
  (160, 1),
  (161, 1),
  (162, 1),
  (163, 1),
  (164, 1),
  (166, 1),
  (167, 1)],
 [(71, 1), (83, 1), (168, 1), (169, 1)],
 [(123, 1),
  (124, 1),
  (170, 1),
  (171, 1),
  (172, 1),
  (173, 1),
  (174, 1),
  (175, 1)],
 [(15, 1),
  (138, 1),
  (176, 1),
  (177, 1),
  (178, 1),
  (179, 1),
  (180, 1),
  (181, 1),
  (182, 1)],
 [(183, 1), (184, 1), (185, 1), (186, 1), (187, 1), (188, 1), (189, 1)],
 [(17, 1), (60, 1), (109, 1), (154, 1), (190, 1), (191, 1), (192, 1), (195, 1)],
 [(15, 1),
  (16, 1),
  (60, 1),
  (97, 1),
  (112, 1),
  (113, 1),
  (177, 1),
  (178, 1),
  (196, 1),
  (197, 1),
  (198, 1),
  (199, 1),
  (200, 1),
  (201, 1)],
 [(15, 1),
  (39, 1),
  (40, 1),
  (50, 1),
  (202, 1),
  (203, 1),
  (204, 1),
  (205, 1),
  (206, 1),
  (207, 1),
  (208, 1)],
 [(5, 2),
  (53, 2),
  (83, 1),
  (117, 1),
  (145, 1),
  (209, 1),
  (210, 2),
  (211, 4),
  (212, 1),
  (213, 4),
  (214, 1),
  (215, 1),
  (216, 1),
  (217, 1),
  (218, 1),
  (219, 4)],
 [(48, 1),
  (60, 1),
  (83, 1),
  (136, 1),
  (140, 1),
  (169, 1),
  (222, 1),
  (223, 1),
  (224, 1),
  (225, 1),
  (226, 1),
  (227, 1),
  (230, 1)],
 [(1, 1),
  (22, 1),
  (58, 2),
  (60, 1),
  (231, 1),
  (232, 1),
  (233, 1),
  (234, 1),
  (235, 2),
  (236, 1),
  (237, 1),
  (238, 1),
  (239, 1),
  (240, 1),
  (241, 1)],
 [(35, 2),
  (39, 1),
  (176, 1),
  (224, 1),
  (244, 1),
  (246, 1),
  (247, 1),
  (248, 1),
  (249, 1)],
 [(60, 1),
  (154, 1),
  (173, 1),
  (204, 1),
  (250, 1),
  (251, 1),
  (252, 1),
  (253, 1),
  (254, 1),
  (255, 1),
  (256, 1),
  (257, 1),
  (258, 1),
  (259, 1),
  (261, 1),
  (262, 1),
  (263, 1),
  (264, 1),
  (265, 1),
  (266, 1),
  (267, 1),
  (268, 1),
  (270, 1),
  (271, 1),
  (272, 1),
  (273, 1)],
 [(56, 1),
  (98, 1),
  (103, 1),
  (128, 1),
  (153, 1),
  (173, 1),
  (274, 1),
  (276, 1),
  (277, 1),
  (278, 1),
  (279, 1),
  (280, 1),
  (281, 1),
  (282, 1),
  (283, 1),
  (284, 1),
  (285, 1),
  (286, 1),
  (287, 1),
  (289, 1),
  (290, 1),
  (291, 1),
  (292, 1),
  (293, 1)],
 [(39, 1), (234, 1), (294, 1), (297, 1), (298, 1), (300, 1)],
 [(39, 2),
  (112, 1),
  (113, 1),
  (173, 1),
  (238, 1),
  (302, 1),
  (303, 1),
  (304, 1),
  (305, 1),
  (306, 1),
  (307, 1)],
 [(39, 2),
  (60, 1),
  (62, 1),
  (112, 1),
  (113, 1),
  (308, 1),
  (309, 1),
  (310, 1),
  (311, 1),
  (312, 1),
  (313, 1),
  (314, 1),
  (315, 1),
  (316, 1),
  (317, 1),
  (318, 1),
  (319, 1),
  (320, 1),
  (321, 1),
  (322, 1),
  (323, 1),
  (324, 1)],
 [(22, 2),
  (38, 1),
  (39, 2),
  (60, 2),
  (62, 1),
  (83, 1),
  (167, 1),
  (182, 1),
  (238, 1),
  (312, 1),
  (323, 1),
  (325, 1),
  (327, 1),
  (328, 1),
  (329, 1),
  (330, 1),
  (331, 1),
  (332, 1),
  (334, 1),
  (335, 1),
  (336, 1),
  (337, 1),
  (338, 1),
  (339, 1),
  (340, 1)],
 [(22, 1),
  (39, 5),
  (62, 1),
  (83, 1),
  (123, 1),
  (267, 2),
  (304, 2),
  (341, 1),
  (342, 1),
  (343, 1),
  (344, 1),
  (346, 1),
  (347, 1),
  (348, 1),
  (349, 1),
  (350, 1),
  (351, 1),
  (352, 1),
  (353, 1)],
 [(21, 1),
  (39, 1),
  (40, 1),
  (181, 1),
  (354, 1),
  (355, 1),
  (356, 1),
  (357, 1),
  (358, 1),
  (359, 1),
  (360, 1)],
 [(13, 1),
  (39, 2),
  (60, 2),
  (62, 1),
  (68, 1),
  (79, 1),
  (107, 1),
  (208, 1),
  (226, 1),
  (238, 1),
  (304, 2),
  (305, 1),
  (361, 1),
  (362, 1),
  (363, 1),
  (364, 1),
  (365, 1),
  (366, 1),
  (367, 1),
  (368, 1),
  (369, 1)],
 [(1, 1),
  (39, 1),
  (40, 1),
  (315, 1),
  (370, 1),
  (371, 1),
  (373, 1),
  (374, 1),
  (375, 1),
  (376, 1)],
 [(39, 1),
  (98, 1),
  (127, 1),
  (138, 1),
  (225, 1),
  (226, 1),
  (279, 1),
  (373, 1),
  (377, 1),
  (378, 1),
  (379, 1),
  (380, 1),
  (382, 1),
  (383, 1),
  (384, 1),
  (385, 1),
  (386, 1),
  (387, 1),
  (388, 1),
  (389, 1),
  (390, 1),
  (391, 1),
  (392, 1),
  (393, 1),
  (395, 1),
  (396, 1)],
 [(5, 1),
  (112, 1),
  (113, 1),
  (170, 1),
  (238, 1),
  (305, 1),
  (397, 1),
  (398, 1),
  (399, 1)],
 [(5, 1),
  (60, 1),
  (107, 1),
  (122, 1),
  (252, 1),
  (273, 1),
  (397, 1),
  (400, 1),
  (401, 1),
  (402, 1),
  (403, 1),
  (404, 1),
  (405, 1),
  (406, 1)],
 [(13, 1), (62, 1), (123, 1), (124, 1), (128, 2), (408, 1)],
 [(39, 1),
  (49, 1),
  (50, 1),
  (56, 1),
  (57, 1),
  (66, 1),
  (298, 1),
  (310, 1),
  (347, 1),
  (409, 1),
  (410, 1),
  (411, 1),
  (412, 1),
  (414, 1),
  (415, 1),
  (416, 1)],
 [(1, 1),
  (39, 1),
  (49, 1),
  (86, 1),
  (112, 1),
  (262, 1),
  (269, 1),
  (358, 1),
  (405, 1),
  (409, 2),
  (418, 1),
  (419, 1),
  (420, 1),
  (421, 1),
  (422, 1),
  (423, 1),
  (424, 1),
  (425, 1),
  (426, 1),
  (428, 1)],
 [(39, 1), (40, 1), (304, 1), (409, 1), (429, 1), (430, 1), (431, 1)],
 [(3, 1),
  (39, 2),
  (62, 1),
  (98, 1),
  (113, 1),
  (125, 1),
  (238, 1),
  (274, 1),
  (432, 1),
  (433, 1),
  (434, 1),
  (435, 1),
  (436, 1),
  (437, 1),
  (438, 1),
  (439, 1),
  (440, 1),
  (441, 1)],
 [(263, 1),
  (266, 1),
  (442, 1),
  (443, 1),
  (444, 1),
  (445, 1),
  (446, 1),
  (447, 1),
  (448, 1),
  (449, 1)],
 [(39, 1),
  (263, 1),
  (300, 1),
  (450, 1),
  (452, 1),
  (453, 1),
  (456, 1),
  (457, 1),
  (458, 1),
  (459, 1),
  (460, 1),
  (462, 1),
  (463, 1),
  (464, 1),
  (466, 1)],
 [(1, 1),
  (101, 1),
  (115, 1),
  (263, 1),
  (268, 1),
  (270, 1),
  (371, 1),
  (449, 1),
  (467, 1),
  (468, 1),
  (469, 1),
  (470, 1),
  (471, 1),
  (472, 1)],
 [(449, 1), (474, 1), (475, 1), (476, 1), (477, 1)],
 [(39, 1),
  (409, 1),
  (478, 1),
  (479, 1),
  (480, 1),
  (481, 1),
  (483, 1),
  (484, 1),
  (485, 1),
  (486, 1)],
 [(326, 1), (487, 1), (488, 1), (489, 1), (491, 1), (492, 1)],
 [(39, 1),
  (239, 2),
  (269, 1),
  (333, 1),
  (493, 1),
  (494, 1),
  (495, 1),
  (496, 1),
  (497, 1),
  (499, 1),
  (500, 1)],
 [(39, 1), (127, 1), (262, 1), (501, 1), (502, 1), (503, 1)]]
In [80]:
print(corpus[10][:5]) # Prints the first five (word_id, count) pairs of the 11th sentence
[(21, 1), (143, 1), (144, 1), (145, 1), (146, 1)]
In [81]:
print(corpus[15]) # Prints the bag-of-words for the 16th sentence
[(15, 1), (138, 1), (176, 1), (177, 1), (178, 1), (179, 1), (180, 1), (181, 1), (182, 1)]
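
To make the shape of these lists concrete, here is a minimal standard-library sketch of how a list of (word_id, count) pairs can be built from tokens. It only imitates the output of doc2bow and is not gensim's implementation; toy_doc2bow, token2id and tokens are made-up names and data for illustration. As with doc2bow's default behaviour, tokens that are not in the mapping are skipped.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Return sorted (word_id, count) pairs, skipping tokens with no ID."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical token-to-ID mapping and tokenized sentence
token2id = {'energy': 0, 'conservation': 1, 'writes': 2}
tokens = ['energy', 'conservation', 'energy', 'unknown', 'writes']

bow = toy_doc2bow(tokens, token2id)
print(bow)  # [(0, 2), (1, 1), (2, 1)]
```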

Find sentences in which a word occurs more than three times

In [82]:
for idx, s in enumerate(corpus):
    # max() compares whole tuples by default (word ID first), so the key selects the count instead
    if max(s, key=lambda item: item[1])[1] > 3:
        pprint("Sentence ID: {0}; Sentence {1}".format(idx,s))       
('Sentence ID: 20; Sentence [(5, 2), (53, 2), (83, 1), (117, 1), (145, 1), '
 '(209, 1), (210, 2), (211, 4), (212, 1), (213, 4), (214, 1), (215, 1), (216, '
 '1), (217, 1), (218, 1), (219, 4)]')
('Sentence ID: 30; Sentence [(22, 1), (39, 5), (62, 1), (83, 1), (123, 1), '
 '(267, 2), (304, 2), (341, 1), (342, 1), (343, 1), (344, 1), (346, 1), (347, '
 '1), (348, 1), (349, 1), (350, 1), (351, 1), (352, 1), (353, 1)]')
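
One caveat: max() raises a ValueError on an empty list, which could happen if a sentence ends up with no tokens after cleaning. A hedged variant of the same check, using a made-up toy_corpus and a hypothetical helper sentences_with_repeats, passes default=0 so that empty sentences are simply skipped:

```python
# Toy corpus in the same shape as above: one list of (word_id, count) per sentence
toy_corpus = [
    [(0, 1), (1, 4)],
    [],                 # an empty sentence would break a bare max()
    [(2, 2), (3, 1)],
    [(1, 5), (4, 1)],
]

def sentences_with_repeats(corpus, threshold=3):
    """Return indices of sentences whose most frequent word exceeds threshold."""
    hits = []
    for idx, sentence in enumerate(corpus):
        # default=0 covers empty sentences instead of raising ValueError
        top_count = max((cnt for _, cnt in sentence), default=0)
        if top_count > threshold:
            hits.append(idx)
    return hits

repeated = sentences_with_repeats(toy_corpus)
print(repeated)  # [0, 3]
```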

Take one sentence from the corpus and sort its words by number of occurrences, highest first

In [83]:
doc = corpus[20]
print('origin_doc: {0}'.format(doc))
# sorted() orders by the whole tuple (word ID first) in ascending order by default,
# so sort on the count instead and reverse the order
bow_doc = sorted(doc, key=lambda item: item[1], reverse=True)
print('sorted_doc: {0}'.format(bow_doc))
origin_doc: [(5, 2), (53, 2), (83, 1), (117, 1), (145, 1), (209, 1), (210, 2), (211, 4), (212, 1), (213, 4), (214, 1), (215, 1), (216, 1), (217, 1), (218, 1), (219, 4)]
sorted_doc: [(211, 4), (213, 4), (219, 4), (5, 2), (53, 2), (210, 2), (83, 1), (117, 1), (145, 1), (209, 1), (212, 1), (214, 1), (215, 1), (216, 1), (217, 1), (218, 1)]

Use the dictionary to find out which tokens occur most often in that sentence

In [84]:
for word_id, word_cnt in bow_doc:
    print('{0}: {1}'.format(dictionary.get(word_id), word_cnt))
conservation: 4
energy: 4
writes: 4
jew: 2
german: 2
conscious: 2
judaism: 1
claimed: 1
distinguished: 1
anarchist: 1
emphasis: 1
gustav: 1
landauer: 1
latter: 1
mine: 1
modern: 1
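
The lookup and the sort can also be combined into one small helper. This is only a sketch: bow_to_tokens is a hypothetical name, and toy_id2token is a plain dict holding a few of the ID-to-token pairs seen above, standing in for the gensim dictionary. Ties are broken alphabetically so the order is deterministic.

```python
def bow_to_tokens(bow, id2token):
    """Map (word_id, count) pairs to (token, count), highest count first."""
    pairs = [(id2token[word_id], cnt) for word_id, cnt in bow]
    # Negate the count for descending order; the token breaks ties alphabetically
    return sorted(pairs, key=lambda p: (-p[1], p[0]))

# A plain dict standing in for the gensim dictionary (subset of the IDs above)
toy_id2token = {5: 'jew', 83: 'judaism', 211: 'conservation', 213: 'energy'}
toy_bow = [(5, 2), (83, 1), (211, 4), (213, 4)]

ranked = bow_to_tokens(toy_bow, toy_id2token)
for token, cnt in ranked:
    print('{0}: {1}'.format(token, cnt))
```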

Extracting words from entire list of articles

This cell uses itertools.chain.from_iterable to flatten the corpus (i.e., a list of lists of tuples) into a single stream of (word_id, word_cnt) pairs, and adds up the counts for each word_id.

In [85]:
from collections import defaultdict
import itertools

total_word_cnt = defaultdict(int)  # missing keys default to 0, which is convenient for counting
for word_id, word_cnt in itertools.chain.from_iterable(corpus):
    total_word_cnt[word_id] += word_cnt

sorted_word_cnt = sorted(total_word_cnt.items(), key=lambda w: w[1], reverse=True)
pprint(sorted_word_cnt[:5])
[(39, 35), (60, 12), (83, 9), (40, 8), (22, 7)]
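
As an alternative sketch, collections.Counter can do the same accumulation and has a most_common method that replaces the manual sort. The toy_corpus below is made-up data in the same shape as the real corpus.

```python
from collections import Counter
import itertools

# Made-up corpus: one list of (word_id, count) per sentence
toy_corpus = [[(0, 2), (1, 1)], [(1, 3), (2, 1)], [(0, 1), (2, 1)]]

total = Counter()
for word_id, word_cnt in itertools.chain.from_iterable(toy_corpus):
    total[word_id] += word_cnt  # Counter also defaults missing keys to 0

top_two = total.most_common(2)  # highest counts first
print(top_two)  # [(1, 4), (0, 3)]
```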

Print out the five most frequently occurring words

In [86]:
for word_id, word_cnt in sorted_word_cnt[:5]:
    print('Word: {0}, Count:{1}'.format(dictionary.get(word_id),word_cnt))
Word: jewish, Count:35
Word: would, Count:12
Word: judaism, Count:9
Word: materialism, Count:8
Word: material, Count:7

Saving the dictionary and corpus to a pickle file

In [93]:
import pickle

def save_pkl(fileName,list_var):
    with open(fileName,'wb') as f:
        pickle.dump(list_var,f)


fileName = 'nlp_data.pkl'
# Build the path with os.path.join so it is not tied to Windows path separators
file = os.path.join(os.getcwd(), 'inputs', fileName)



save_pkl(file,[dictionary,corpus])
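
Reading the data back is the mirror image of saving it. The sketch below is self-contained, so it uses a temporary directory and toy stand-ins (toy_dictionary, toy_corpus) rather than the real objects; load_pkl is a hypothetical helper name. Only unpickle files you trust, since pickle.load can execute arbitrary code.

```python
import os
import pickle
import tempfile

def save_pkl(file_name, obj):
    with open(file_name, 'wb') as f:
        pickle.dump(obj, f)

def load_pkl(file_name):
    # Only load pickle files from a trusted source
    with open(file_name, 'rb') as f:
        return pickle.load(f)

# Toy stand-ins for the gensim dictionary and corpus
toy_dictionary = {0: 'jewish', 1: 'would'}
toy_corpus = [[(0, 2), (1, 1)], [(1, 1)]]

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'nlp_data.pkl')
    save_pkl(pkl_path, [toy_dictionary, toy_corpus])
    loaded_dictionary, loaded_corpus = load_pkl(pkl_path)

print(loaded_dictionary == toy_dictionary, loaded_corpus == toy_corpus)
```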