# Bag of words with nltk

## Bag of words (NLTK)

1. Set all words to lower case.
2. Remove all punctuation.

#### Import modules

In [70]:
import os
import collections
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from pprint import pprint


In [71]:
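currDir = os.getcwd()
fileName = 'aeon.txt'

# Build the path to inputs/aeon.txt under the current working directory
readFile = os.path.join(currDir, 'inputs', fileName)

# Read the whole article into one string; the file is closed when the block exits
# (UTF-8 is an assumption about how aeon.txt was saved)
with open(readFile, encoding='utf-8') as f:
    article = f.read()
print(article)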

Be ‘a man in the street and a Jew in the home’: a common piece of advice that liberal Jews often gave their co-religionists in the 19th century. If Jewishness was kept invisible and private, they wagered, then Jews could become citizens and professionals, and be granted equal access to the material resources made available to any other member of society. There was plenty of Christian bias to combat, encapsulated by images of Jewish avarice and materialism such as Shylock’s greedy hands and Rothschild’s beard in the form of snake-like tentacles. If only Jews could fit into the spiritual boxes established by the European Protestant elite, they would be accepted, or at least tolerated in the public sphere as Frenchmen, Germans or Englishmen. Though compelling in theory, the deal became more fraught as rampant anti-Semitic violence in eastern Europe continued to remind Jews that, no matter how much they tried to look like ‘everyone else’, their bodies were marked as Jewish.
In the 1870s European Judaism underwent an intellectual revolution. Around then, a group of young Russian Jewish radicals began to identify Judaism with materialism, and to theorise about what they called – whether in Russian, German, Yiddish or Hebrew – the ‘material’ (material’nii, materiell, gashmi, ḥomri) aspects of the Universe. For many Jews living in this period, ‘materialism’ was a worldview that brought into focus latent Jewish ideas and beliefs about the physical world. The materialists claimed that a theory of Judaism, defined by the way people related to land, labour and bodies, had been lying dormant within Jewish literature – in Hasidic texts, the Bible, Spinoza’s philosophy – and could now be clearly recognised and fully articulated. Jewish particularity was based on specific historical economic differences between Jews and others. What made Jews different was a certain socioeconomic dynamic that distinguished them from their neighbours.
The Jewish revolutionaries in 1870s Russia who embraced the idea of materialism shared a number of critical assumptions. They all rejected the notion that Judaism was based on abstract metaphysical theories (Scholasticism), rituals (Hasidism), study (Mitnagdim), and ethics and reason (Enlighteners). Judaism was not a religion, like Protestantism. Instead it was something attached to their bodies and expressed through one’s relationship to land, labour and resources. The materialists had also given up hope that the state could protect them and ensure their economic wellbeing. And finally, they no longer believed that history was headed in a positive direction. Over no amount of time would Jews living in Russia ever be granted greater rights and opportunities. Therefore, only a radical reclaiming of the physical world on the part of Jews could ensure that they would be protected and given a fair and equal share of resources.
Soon, the Jewish materialism of the Russians could be found among western European Jews residing in England and Germany. Only half-jokingly, the German anarchist Gustav Landauer claimed in 1921 that what distinguished ‘the modern “conscious” Jew from a German was that when the latter writes about … the conservation of energy, … he writes about the conservation of energy, but when the conscious Jew writes about the conservation of energy, he writes about the conservation of energy and Judaism’ (emphasis mine). Eventually, there would be those, such as the Englishman Israel Zangwill, who considered themselves adherents to ‘a religion of pots and pans’, and others who identified Judaism as a faith based on ‘bagels and lox’. Over the course of the 20th century, Jews would increasingly come to believe that ‘there is nothing purely spiritual that stands on its own … Everything spiritual requires a necessary material basis.’
Jewish materialists were despised not only by staunch liberals but also by ‘defenders of the faith’. Moses Leib Lilienblum, who would go on to found the Zionist movement in Russia, wrote a novel in which he described his youthful yeshiva education as one long masturbatory experience – for this, he was denounced by rabbis and communal leaders who forced him to flee his hometown in fear for his life. The future Russian revolutionary Hasia Schur was pelted with stones and jeered at by the townspeople of Mohilev for going on a Sabbath walk hand-in-hand with her boyfriend, the socialist Eliezer Tsukerman: the rabbis were up in arms that two young people had dared to touch one another in public. Jewish materialists were cast as upstarts, deviants, social provocateurs and, of course, with providing Jew-haters with excuses to promote anti-Semitism.
But the Jewish materialists’ deviancies reflected a radically new kind of Jewish identity, one focused on their bodies and the physical world. The Jewish body they imagined would offer a contrast to both the hunchbacked, traditional Jewish Torah scholar incapable of supporting his family, and the muscular gentile male whose energies were directed at conquering and dominating the physical world. The new Jewish body would be shaped in the image of a healthy traditional Jewish woman who laboured to provide for her family’s material wellbeing while her husband spent his day in the house of study: by tending to the material aspects of existence, Jews’ needs and desires would now be seen as the primary feature of Judaism. The material Jewish identity set the stage for Jews’ involvement in 20th-century politics: Zionism, Bundism (the Jewish labour movement), the Minority Rights movement, and Jewish forms of communism all assumed that the organising structure of Jewish identity was a Jewish body, and not a Judaism of the heavens or the heart. Jewish materialism made Jews political without them possessing their own state or even citizenship in a host country.
Though the idea of the Jewish body as the locus of collective identity would always be suspect in western Europe, it would, however, become the basis of a new kind of Jewish identity most commonly witnessed in Israel and the United States. Jewish immigrants to Palestine at the turn of the century saw in Zion the actualisation of materialism as first imagined in the 1870s. The Marxist Ber Borchov’s students, such as future leaders of Israel Yitzhak Ben-Zvi and David Ben-Gurion, identified Palestine as a response to the crisis of the fork and the knife (a pithy phrase meant to capture the economic challenges of Russian Jews in the 1870s) originally theorised by the Jewish materialist Aaron Shemuel Lieberman in the 1870s. They envisioned a new kind of Jew – the ḥaluts (pioneer) – who was attached to the physical world. As described by the 20th-century Zionist poet Avraham Shlonsky, a former Hasidic Jew, the ḥaluts would be the embodiment of the idea that ‘a human being is meat, and he toils here in the sacred/and the land/bread’. The people of the book had now become a people of labour, land and the body.
In the US, eastern European Jews established large-scale defence organisations directed at protecting Jewish bodies and providing a platform for Jews to speak as a distinct ethnic minority in the American public sphere. From the poet Emma Lazarus to the American rabbi Mordecai Kaplan to the philosopher Horace Kallen, American Jews in the early 20th century developed political programmes and established organisations rooted around the physical aspects of Jewish life.
Jewish materialism remains the defining element of most American Jews’ identity. Following the Second World War, the influx of another wave of Jewish immigrants from Russian lands gave rise to a new brand of US literature that placed the Jewish body front and centre. The late US novelist Phillip Roth might have been familiar only in passing with the name Moses Lilienblum. But it was Lilienblum who put into circulation the Jewish genre of overbearing parents, unrealisable social expectations, failed sexual encounters, silly rabbis, bankrupt synagogues and God-fearing charlatans encased in a narrative about masturbation. Whether he knew it or not, when Roth wrote his novel Portnoy’s Complaint(1969), he was channelling the same tradition first articulated by Lilienblum a century earlier.
Roth took those commitments to his grave when he died on 22 May 2018. While the grandmaster of late-20th-century American letters asked to be interred next to Jews, he strictly prohibited the performance of any Jewish rituals at his funeral. His final requests, allegedly, were inspired by a desire ‘to have someone to talk to’. His corpse did not need a rabbi to eulogise it, or a perfunctory kaddish (or hymn) to kasher it; it was simply Jewish – nothing more and nothing less. Indeed, it was a fitting conclusion to the life of a Jewish materialist.


In [72]:
tokens = word_tokenize(article)
lower_tokens = [token.lower() for token in tokens]
counter_var = collections.Counter(lower_tokens)


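Counter is a dict subclass that maps each token to its frequency count. Calling .most_common(n) on it returns a list of (word, count) tuples for the n most frequent tokens, so .most_common(5) gives the top 5. Because the tokens have not been filtered yet, punctuation and stop words dominate the list below.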
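To see how Counter and most_common behave on a small input, the sketch below counts a hand-written token list. The list and the variable names are made up for this illustration and are not taken from the article.

```python
import collections

# Hypothetical tokens, invented for illustration only
sample_tokens = ['the', 'jewish', 'body', 'the', 'body', 'the']

sample_counter = collections.Counter(sample_tokens)

# Counter behaves like a dict: look up a count by word
the_count = sample_counter['the']

# A word that never appeared has a count of 0 instead of raising KeyError
missing_count = sample_counter['missing']

# most_common(n) returns a list of (word, count) tuples, highest count first
top_two = sample_counter.most_common(2)
print(top_two)
```

Calling most_common() with no argument returns every token, ordered from most to least frequent.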

In [73]:
counter_var.most_common(30)

Out[73]:
[('the', 106),
(',', 79),
('of', 57),
('.', 50),
('and', 45),
('a', 41),
('in', 37),
('to', 36),
('jewish', 35),
('’', 23),
('jews', 20),
('was', 18),
('that', 17),
('as', 15),
('be', 13),
('by', 12),
('would', 12),
('‘', 11),
('(', 11),
(')', 11),
('on', 10),
('his', 10),
('they', 9),
('judaism', 9),
('he', 9),
('materialism', 8),
('or', 8),
('–', 8),
('who', 8),
('it', 8)]

#### Pre-processing for NLP

1. Tokenizing the article: word_tokenize()
2. Lowercasing words (i.e., making sure all words are in a consistent case): .lower()
3. Keeping only alphabetic tokens (i.e., removing punctuation and numbers): .isalpha()
4. Removing stop words (i.e., very common words such as "and", "or", "the"): stopwords.words('english')
5. Lemmatization/stemming (i.e., reducing each word to a base form, for example plurals to singular)
6. Using Counter to create a bag of words
7. Using most_common to see which words are most frequent and so guess the topic of the article.
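The steps above can be sketched end to end with the standard library alone. This is a rough illustration, not the notebook's code: the regular expression is a stand-in for word_tokenize, TOY_STOP_WORDS is a tiny hand-picked list standing in for stopwords.words('english'), the lemmatization step is skipped, and the function name and sample sentence are invented.

```python
import collections
import re

# Assumption: a tiny hand-picked stop list; the notebook uses stopwords.words('english')
TOY_STOP_WORDS = {'the', 'a', 'of', 'and', 'in', 'to', 'was'}

def toy_bag_of_words(text, top_n=3):
    # Steps 1-2: lowercase the text, then split it into word and punctuation tokens
    # (a regex stand-in for word_tokenize)
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Step 3: keep purely alphabetic tokens, dropping punctuation and numbers
    alpha_tokens = [t for t in tokens if t.isalpha()]
    # Step 4: drop stop words
    kept = [t for t in alpha_tokens if t not in TOY_STOP_WORDS]
    # Step 5 (lemmatization) is skipped in this sketch
    # Steps 6-7: count the tokens and return the most frequent ones
    return collections.Counter(kept).most_common(top_n)

sample = "The body of the text was short, and the body was plain."
result = toy_bag_of_words(sample, top_n=2)
print(result)
```

Lowercasing before tokenizing mirrors the cell further down, which calls word_tokenize(article.lower()).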

#### Lowercasing and counting only alphabetic words

In [74]:
lower_alpha_tokens = [w for w in word_tokenize(article.lower()) if w.isalpha()] # Tokenize the lower-cased article and keep only alphabetic tokens (no punctuation, no numbers)
print('Number of tokens: {0}'.format(len(lower_alpha_tokens)))

Number of tokens: 1390
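Note that .isalpha() is stricter than "alphanumeric": it is True only when every character is a letter. The sketch below checks a few strings chosen to resemble tokens from the article (that the tokenizer emits exactly these strings is an assumption), and contrasts .isalpha() with .isalnum().

```python
# Hypothetical tokens resembling words from the article
examples = ['jews', '1870s', 'co-religionists', '’', 'materialism']

# Print each token with its isalpha() and isalnum() result
for token in examples:
    print(token, token.isalpha(), token.isalnum())

# Only tokens made up entirely of letters survive the isalpha() filter
alpha_only = [t for t in examples if t.isalpha()]
print(alpha_only)
```

So the filter drops punctuation, but it also drops tokens containing digits or hyphens, which is worth keeping in mind when reading the counts.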


#### Removing stop words

In [75]:
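from nltk.corpus import stopwords

nltk.download('stopwords') # Fetch the stop word list if it is not already installed
stop_words = set(stopwords.words('english')) # Build the set once; membership tests on a set are fast

no_stops = [t for t in lower_alpha_tokens if t not in stop_words] # Keep tokens from lower_alpha_tokens that are not in the stop word list
print('Number of tokens (remove stop words): {0}'.format(len(no_stops)))
bow = collections.Counter(no_stops)
bow.most_common(10)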

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Number of tokens (remove stop words): 790

Out[75]:
[('jewish', 35),
('jews', 20),
('would', 12),
('judaism', 9),
('materialism', 8),
('material', 7),
('could', 6),
('physical', 6),
('world', 6),
('new', 6)]

#### Lemmatizing the text

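Lemmatization reduces a word to its dictionary base form (its lemma), for example 'jews' to 'jew' and 'bodies' to 'body'. Stemming is a simpler, rule-based alternative that strips suffixes and can produce strings that are not real words, whereas lemmatization looks words up in a vocabulary (here, WordNet).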
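The difference between the two approaches can be illustrated with a deliberately crude sketch. Neither function below is NLTK's PorterStemmer or WordNetLemmatizer: toy_stem applies three made-up suffix rules, and toy_lemmatize looks words up in a small hand-written dictionary (TOY_LEMMAS), both invented for this example.

```python
# Toy stemmer: strip a few suffixes by rule, with no vocabulary knowledge
def toy_stem(word):
    for suffix in ('ies', 'es', 's'):
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Toy lemmatizer: look the word up in a hypothetical, hand-written dictionary
TOY_LEMMAS = {'bodies': 'body', 'jews': 'jew', 'resources': 'resource'}

def toy_lemmatize(word):
    # Unknown words are returned unchanged
    return TOY_LEMMAS.get(word, word)

for w in ['bodies', 'jews', 'resources']:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rule-based version turns 'bodies' into 'bod' and 'resources' into 'resourc', while the lookup version returns real words, at the cost of needing a vocabulary.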

In [76]:
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet') # Fetch the WordNet data if it is not already installed

wnl = WordNetLemmatizer()
lemmatized = [wnl.lemmatize(t) for t in no_stops] # Lemmatize each token (treated as a noun by default)
bow = collections.Counter(lemmatized) # Counter turns the token list into a bag of words
bow.most_common(10)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\xxklow\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

Out[76]:
[('jewish', 35),
('jew', 25),
('would', 12),
('body', 11),
('judaism', 9),
('materialism', 8),
('material', 7),
('materialist', 7),
('could', 6),
('russian', 6)]