NLP: Detecting the occurrence of fake news
import os
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
We see below that the dataframe has the title, author, and text of each news article. Articles with a label of 1 are fake and articles with a label of 0 are not fake.
cwd = os.getcwd()
readDir = cwd + '/inputs/'
readFile = readDir+'fakenews_train.csv'
big_df = pd.read_csv(readFile,index_col='id')
big_df.info()
big_df.head()
df = big_df.sample(1000,random_state=50)
df.info()
df.head()
We need to check whether the dataset is well balanced (i.e., the number of fake and non-fake articles is roughly similar). We see below that there are 10413 fake vs 10387 non-fake articles in the full dataset, so our dataset is well balanced.
df['label'] = df['label'].map({0:'NOT FAKE',1:'FAKE'})
df.head()
df['label'] = df['label'].astype('category') # astype does not work in place, so we assign the result back to the column
df['label'].value_counts()
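The counts quoted above refer to the full dataset rather than the 1,000-article sample. As a quick sanity check (a minimal sketch, assuming big_df is still in memory with its original 0/1 labels), we can count the labels there too:
print(big_df['label'].value_counts())                 # raw counts per label over the full dataset
print(big_df['label'].value_counts(normalize=True))   # the same counts expressed as proportions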
We prepare the dataset for supervised classification. y is our target variable and X holds our predictor variables. I convert the X variable to unicode so it works with the count_vectorizer function later on.
y = df['label']
X = df['text'].values.astype('U') # Conversion to unicode for count_vectorizer function
Using train_test_split, I create a training and a testing dataset. The test size is 30% and a random_state is set so that the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=53)
Creating count vectorizer & tfidf vectorizers
We create both the CountVectorizer and the TfidfVectorizer, as these are two ways of measuring the frequency of words in our corpus. Our corpus consists of all the fake and non-fake news articles. The CountVectorizer is a simple bag-of-words (i.e., BoW) model: it counts how often each word occurs in a text. The more often a word occurs in an article, the more likely it reflects what that article is about. However, when we have a large number of different articles, we want to use the TfidfVectorizer, as tf-idf penalizes words that occur frequently across different articles.
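As a toy illustration of the difference (a small made-up three-sentence corpus, not part of the news data), a word such as 'said' that shows up in several documents gets a full raw count from CountVectorizer but a reduced tf-idf weight from TfidfVectorizer:
toy_corpus = ['the president said taxes will rise',
              'the senator said the vote failed',
              'aliens landed in the desert']
print(CountVectorizer(stop_words='english').fit_transform(toy_corpus).A)          # raw counts: 'said' counts fully in both documents it appears in
print(TfidfVectorizer(stop_words='english').fit_transform(toy_corpus).A.round(2)) # tf-idf: 'said' is down-weighted relative to words unique to one document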
The CountVectorizer and the TfidfVectorizer can preprocess the dataset by removing all English stop words. Once these models are applied to the dataset, they extract all the tokens and convert them into features. Thus we get BoW vectors like the table below.
Article | Label | Feature 1 (Word 1) | Feature 2 (Word 2) | ... | Feature N (Word N) |
---|---|---|---|---|---|
Article 1 | FAKE | Count/tfidf val for Word 1 | Count/tfidf val for Word 2 | ... | Count/tfidf val for Word N |
Article 2 | NOT FAKE | Count/tfidf val for Word 1 | Count/tfidf val for Word 2 | ... | Count/tfidf val for Word N |
Article N | FAKE | Count/tfidf val for Word 1 | Count/tfidf val for Word 2 | ... | Count/tfidf val for Word N |
The training and test vectors need a consistent set of words so the model can understand the test input.
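This is why the vectorizers are fitted on the training data only and the test data is merely transformed: words that appear only in the test set are dropped because they are not in the learned vocabulary. A minimal sketch of that behaviour, using two made-up sentences rather than the news data:
demo_vectorizer = CountVectorizer()
demo_vectorizer.fit(['apples and oranges'])                 # vocabulary is learned from the "training" text only
print(demo_vectorizer.get_feature_names())                  # ['and', 'apples', 'oranges']
print(demo_vectorizer.transform(['oranges and bananas']).A) # 'bananas' is not in the vocabulary, so it is silently ignored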
Count vectorizer
count_vectorizer = CountVectorizer(stop_words='english') # Initialize the model and remove all the English stop words.
count_train = count_vectorizer.fit_transform(X_train) # Create the BoW vectors: builds a mapping of words to IDs and vectors representing how many times each word occurs in each article
count_test = count_vectorizer.transform(X_test) # Only transform here, as the test data must not be used for fitting.
We can see that count_vectorizer returns the list of features of the dataset, which are the different tokens. Some of these tokens do not make much sense, and they will not occur in every document in our corpus.
print(count_vectorizer.get_feature_names()[0:10])
The training set has 3500 articles and 74491 tokens that could occur in each article. Because there are so many tokens (i.e., words) and not every word occurs in each document, most of this matrix is 0, so it is stored as a sparse matrix to save memory.
print('Data Type: {0}'.format(type(count_train)))
print('Data Type Shape: {0}'.format(count_train.shape))
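To make the sparsity concrete, we can count how many entries of count_train are actually non-zero using the sparse matrix's nnz attribute (a small check, not part of the original analysis):
n_rows, n_cols = count_train.shape
density = count_train.nnz / (n_rows * n_cols)  # fraction of cells that hold a non-zero count
print('Non-zero entries: {0} of {1} ({2:.4%})'.format(count_train.nnz, n_rows * n_cols, density))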
We attempt to convert the sparse matrix to a normal (dense) matrix. However, the full sparse matrix is too big ([14560x147995]), which causes a memory error, so instead we only convert a [100x100] slice of count_train.
print(count_train[:100,:100].A)
Creating tf-idf vectorizer
Below we follow the same process as for count_vectorizer.
tfidf_vectorizer = TfidfVectorizer(stop_words='english',max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)
print(tfidf_vectorizer.get_feature_names()[:10])
print(tfidf_train.A)
We convert the sparse matrices produced by count_vectorizer and tfidf_vectorizer into DataFrames.
count_df = pd.DataFrame(count_train.A,columns=count_vectorizer.get_feature_names())
tfidf_df = pd.DataFrame(tfidf_train.toarray(),columns=tfidf_vectorizer.get_feature_names())
count_df.head()
tfidf_df.head()
We check whether the count_df and tfidf_df dataframes have the same set of columns, and also whether the two dataframes are identical. They should have the same set of columns, and this is confirmed because difference results in an empty set (remember that set extracts all unique items in a list). The dataframes themselves should differ, since one holds the raw word counts while the other holds the tf-idf value for each word.
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference)
print(count_df.equals(tfidf_df))
Now that we have vectorized our datasets using count_vectorizer and tfidf_vectorizer, we can build a supervised classifier. We get about 87% accuracy. The labels parameter in confusion_matrix rearranges the matrix according to the label order we supply. By default the labels are simply sorted, but since the fake class is the one we are interested in, we pass labels=['FAKE','NOT FAKE'] to force the matrix to show the fake classification first.
import sklearn
from sklearn.naive_bayes import MultinomialNB, ComplementNB
mnnb_classifier = MultinomialNB()
mnnb_classifier.fit(count_train,y_train) # provide it with the predictor and target variables
y_pred = mnnb_classifier.predict(count_test)
print(sklearn.metrics.accuracy_score(y_test,y_pred))
print(sklearn.metrics.confusion_matrix(y_test,y_pred))
print(sklearn.metrics.confusion_matrix(y_test,y_pred,labels=['FAKE','NOT FAKE']))
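Because labels=['FAKE','NOT FAKE'] puts the fake class in the first row and column, the matrix can be unpacked directly into the four counts (a small convenience sketch, not part of the original analysis):
cm = sklearn.metrics.confusion_matrix(y_test, y_pred, labels=['FAKE','NOT FAKE'])
tp, fn, fp, tn = cm.ravel()  # rows are true labels with 'FAKE' first, so the top-left cell counts correctly flagged fake articles
print('Fake correctly flagged: {0}, fake missed: {1}, real wrongly flagged: {2}'.format(tp, fn, fp))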
We try using the tf-idf vectors to see if the model works better.
mnnb_classifier.fit(tfidf_train,y_train) # provide it with the predictor and target variables
y_pred = mnnb_classifier.predict(tfidf_test)
print(sklearn.metrics.accuracy_score(y_test,y_pred))
print(sklearn.metrics.confusion_matrix(y_test,y_pred))
print(sklearn.metrics.confusion_matrix(y_test,y_pred,labels=['FAKE','NOT FAKE']))
We try a Complement Naive Bayes classifier, which is usually good for unbalanced datasets. Since our dataset is balanced, it does not perform much better.
cnb_classifier = ComplementNB()
cnb_classifier.fit(count_train,y_train)
pred = cnb_classifier.predict(count_test)
sklearn.metrics.accuracy_score(y_test,pred)
Now we try altering the additive smoothing parameter (i.e., alpha) to see which value gives us the best accuracy score. We do this for both the count_train and tfidf_train vectors.
alphas = np.arange(0.1,1.5,0.1)
def fit_and_predict(x_train,x_test,y_train,y_test,alpha):
    mnnb_classifier = MultinomialNB(alpha=alpha) # alpha is the additive smoothing parameter
    mnnb_classifier.fit(x_train,y_train)
    y_pred = mnnb_classifier.predict(x_test)
    return sklearn.metrics.accuracy_score(y_test,y_pred)
print('count vector')
for alpha in alphas:
    acc_score = fit_and_predict(count_train,count_test,y_train,y_test,alpha)
    print('Alpha: {0:.1f}, Acc Score: {1:.3f}'.format(alpha,acc_score))
print('tfidf vector')
for alpha in alphas:
    acc_score = fit_and_predict(tfidf_train,tfidf_test,y_train,y_test,alpha)
    print('Alpha: {0:.1f}, Acc Score: {1:.3f}'.format(alpha,acc_score))
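Rather than reading the best value off the printout, we can keep the scores and take the argmax (a small extension of the loop above, shown here for the tf-idf vectors):
tfidf_scores = [fit_and_predict(tfidf_train, tfidf_test, y_train, y_test, alpha) for alpha in alphas]
best_alpha = alphas[np.argmax(tfidf_scores)]  # alpha giving the highest accuracy on the tf-idf vectors
print('Best alpha for tfidf: {0:.1f} (accuracy {1:.3f})'.format(best_alpha, max(tfidf_scores)))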
Now we analyze the features of the dataset. We extract (i) the class labels, (ii) the feature names, and (iii) the feature weights. Once we have the features and their weights, we sort them so that the first (last) 20 features are the ones most associated with the first (second) class label. We can see that fake news is characterized by many strange-looking tokens, while real news has tokens that make more sense, such as trump, clinton, percent, just, white.
class_label = mnnb_classifier.classes_
feature_names = tfidf_vectorizer.get_feature_names()
feat_wtg = sorted(zip(mnnb_classifier.coef_[0],feature_names))
print(class_label[0], feat_wtg[:20])
print(class_label[1], feat_wtg[-20:])
print(class_label)
print(feature_names[:20])
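An alternative way of viewing the same information (a sketch using the classifier's feature_log_prob_ attribute, which holds the per-class log probability of each token) is to rank tokens by how much more likely they are under one class than the other:
log_odds = mnnb_classifier.feature_log_prob_[1] - mnnb_classifier.feature_log_prob_[0]  # positive favours class_label[1], negative favours class_label[0]
ranked = sorted(zip(log_odds, feature_names))
print(class_label[0], [token for _, token in ranked[:20]])   # tokens most indicative of the first class
print(class_label[1], [token for _, token in ranked[-20:]])  # tokens most indicative of the second class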