NLP: Detecting the occurrence of fake news

NLP fake news classifier


We download the fake news dataset from Kaggle and train a supervised classification model on it.

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

We see that we have a dataframe that holds the title, author, and text of each news article. Articles with a label of 1 are fake and articles with a label of 0 are not fake.

In [2]:
cwd = os.getcwd()
readDir = cwd + '/inputs/'
readFile = readDir+'fakenews_train.csv'

big_df = pd.read_csv(readFile,index_col='id')
big_df.info()
big_df.head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20800 entries, 0 to 20799
Data columns (total 4 columns):
title     20242 non-null object
author    18843 non-null object
text      20761 non-null object
label     20800 non-null int64
dtypes: int64(1), object(3)
memory usage: 812.5+ KB
Out[2]:
title author text label
id
0 House Dem Aide: We Didn’t Even See Comey’s Let... Darrell Lucus House Dem Aide: We Didn’t Even See Comey’s Let... 1
1 FLYNN: Hillary Clinton, Big Woman on Campus - ... Daniel J. Flynn Ever get the feeling your life circles the rou... 0
2 Why the Truth Might Get You Fired Consortiumnews.com Why the Truth Might Get You Fired October 29, ... 1
3 15 Civilians Killed In Single US Airstrike Hav... Jessica Purkiss Videos 15 Civilians Killed In Single US Airstr... 1
4 Iranian woman jailed for fictional unpublished... Howard Portnoy Print \nAn Iranian woman has been sentenced to... 1
In [4]:
df = big_df.sample(1000,random_state=50)
df.info()
df.head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 3835 to 11328
Data columns (total 4 columns):
title     977 non-null object
author    915 non-null object
text      995 non-null object
label     1000 non-null int64
dtypes: int64(1), object(3)
memory usage: 39.1+ KB
Out[4]:
title author text label
id
3835 Trump Shifts Course on Egypt, Praising Its Aut... Peter Baker and Declan Walsh WASHINGTON — Ever since he seized power in ... 0
11835 Kraft Heinz Offers to Buy Unilever in $143 Bil... Michael J. de la Merced and Chad Bray The world’s grocery carts could soon be filled... 0
6961 Comment on Death of the 2-party system: GOP bi... aroamingcatholicny Posted on August 4, 2016 by Dr. Eowyn | 51 Com... 1
16112 HIDDEN CAMERA: NYC Democratic Election Commiss... NaN \nPoor honest guy will be on the streets now. ... 1
7185 Ann Coulter: Swamp People: 47 Trump: 0 Ann Coulter If this is the budget deal we get when Republi... 0

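We need to check whether the dataset is well balanced (i.e., the number of fake and non-fake news articles is roughly similar). The full dataset has 10413 articles with label 1 versus 10387 with label 0, and the 1000-article sample used here splits into 485 FAKE and 515 NOT FAKE (see the value counts below), so our data is well balanced.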

In [5]:
df['label'] = df['label'].map({0:'NOT FAKE',1:'FAKE'})
df.head()
Out[5]:
title author text label
id
3835 Trump Shifts Course on Egypt, Praising Its Aut... Peter Baker and Declan Walsh WASHINGTON — Ever since he seized power in ... NOT FAKE
11835 Kraft Heinz Offers to Buy Unilever in $143 Bil... Michael J. de la Merced and Chad Bray The world’s grocery carts could soon be filled... NOT FAKE
6961 Comment on Death of the 2-party system: GOP bi... aroamingcatholicny Posted on August 4, 2016 by Dr. Eowyn | 51 Com... FAKE
16112 HIDDEN CAMERA: NYC Democratic Election Commiss... NaN \nPoor honest guy will be on the streets now. ... FAKE
7185 Ann Coulter: Swamp People: 47 Trump: 0 Ann Coulter If this is the budget deal we get when Republi... NOT FAKE
In [6]:
df['label'] = df['label'].astype('category') # store the labels as a categorical column (astype has no inplace argument)
df['label'].value_counts() 
Out[6]:
NOT FAKE    515
FAKE        485
Name: label, dtype: int64
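
As a rough cross-check, the same balance test can be written in terms of class proportions. The sketch below rebuilds a label list from the counts printed above; the helper name class_shares and the 40% threshold are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def class_shares(labels):
    """Return each class's share of the total number of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Rebuild a label list with the counts shown above (515 NOT FAKE, 485 FAKE).
labels = ['NOT FAKE'] * 515 + ['FAKE'] * 485
shares = class_shares(labels)
print(shares)

# Assumed rule of thumb: call the data balanced if no class falls below 40%.
is_balanced = min(shares.values()) >= 0.4
print(is_balanced)
```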

We prepare the dataset for supervised classification. y is our target variable and X holds our predictor variable. I convert X to unicode so that it works with the count_vectorizer later on.

In [7]:
y = df['label']
X = df['text'].values.astype('U') # Conversion to unicode for count_vectorizer function

Using train_test_split, I create the training and testing datasets. The test size is 30% and a random_state is set so that the process is repeatable.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=53)
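
To illustrate why fixing the seed makes the split repeatable, here is a minimal pure-Python sketch of a seeded shuffle-and-cut. The helper simple_split is hypothetical and is not how train_test_split is implemented internally; it only shows the idea that the same seed reproduces the same partition.

```python
import random

def simple_split(items, test_size, seed):
    """Shuffle indices with a seeded generator, then cut off a test portion."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

docs = ['doc{}'.format(i) for i in range(10)]
train_a, test_a = simple_split(docs, test_size=0.3, seed=53)
train_b, test_b = simple_split(docs, test_size=0.3, seed=53)

print(len(train_a), len(test_a))   # 7 3
print(test_a == test_b)            # True: same seed, same split
```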

Creating count vectorizer & tfidf vectorizers

We create both the CountVectorizer and the TfidfVectorizer, as these are two ways of measuring the frequency of words in our corpus. Our corpus consists of all the fake and non-fake news articles. The CountVectorizer is a simple bag-of-words (i.e., BoW) model that counts how often each word occurs in a text. The more often a word occurs in a text, the more likely it is to reflect what the article is about. However, when we have a large number of different articles, we want to use the TfidfVectorizer, as tf-idf penalizes words that occur frequently across different articles.

The CountVectorizer and the TfidfVectorizer models can preprocess the datasets by removing all stop words. Once these models are applied to the datasets, they extract all the tokens and convert them to features. Thus we have BoW vectors.

Article   | Label | Feature 1 (Word 1)         | Feature 2 (Word 2)         | ... | Feature N (Word N)
Article 1 | True  | Count/tfidf val for Word 1 | Count/tfidf val for Word 2 | ... | Count/tfidf val for Word N
Article 2 | False | Count/tfidf val for Word 1 | Count/tfidf val for Word 2 | ... | Count/tfidf val for Word N
Article N | True  | Count/tfidf val for Word 1 | Count/tfidf val for Word 2 | ... | Count/tfidf val for Word N

The training and test vectors need a consistent set of words so the model can understand the test input.
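
The need for a consistent vocabulary can be sketched in plain Python. The example below is a simplification under stated assumptions: it tokenizes on whitespace and skips stop-word removal, so it is not the CountVectorizer tokenizer, and the helper names build_vocabulary and to_count_vector are made up for illustration. The vocabulary is built from the training text only, and test tokens outside it are dropped.

```python
from collections import Counter

def build_vocabulary(train_docs):
    """Map each distinct training token to a column index (sorted for stability)."""
    tokens = sorted({tok for doc in train_docs for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_count_vector(doc, vocabulary):
    """Count the tokens of one document, ignoring tokens outside the vocabulary."""
    counts = Counter(doc.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector

train_docs = ['trump praises egypt', 'kraft offers to buy unilever']
test_doc = 'trump offers to buy egypt pyramids'

vocab = build_vocabulary(train_docs)        # fitted on the training text only
print(vocab)
print(to_count_vector(test_doc, vocab))     # 'pyramids' is unseen, so it is dropped
```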

Count vectorizer

In [9]:
count_vectorizer = CountVectorizer(stop_words='english') # Initializes the model and removes all the English stop words.
count_train = count_vectorizer.fit_transform(X_train) # Creates the BoW vectors: maps each word to an ID and counts how many times each word occurs in each news article.
count_test = count_vectorizer.transform(X_test) # Only transform here, as the test data is not used for fitting.

We can see that count_vectorizer returns the list of features of the dataset, which are the distinct tokens. Some of these tokens, such as the numeric strings below, do not make much sense and are unlikely to occur in every document in our corpus.

In [10]:
print(count_vectorizer.get_feature_names()[0:10])
['00', '000', '002', '006', '01', '010', '018', '02', '026', '02863']

The training matrix holds 700 articles and 31762 tokens that could occur in each article. Because there are so many tokens (i.e., words), most of them do not occur in any given document, so most of this matrix is 0. We therefore use a sparse matrix, as this saves memory.

In [11]:
print('Data Type: {0}'.format(type(count_train)))
print('Data Type Shape: {0}'.format(count_train.shape))
Data Type: <class 'scipy.sparse.csr.csr_matrix'>
Data Type Shape: (700, 31762)
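
The memory argument can be made concrete with a small sketch. The dictionary-of-keys layout below is an illustrative assumption chosen for readability; the csr_matrix shown above stores its non-zero entries in a different (compressed row) layout, but the idea of skipping the zeros is the same.

```python
def to_sparse(dense_rows):
    """Keep only the non-zero entries as {(row, col): value}."""
    return {(r, c): v
            for r, row in enumerate(dense_rows)
            for c, v in enumerate(row)
            if v != 0}

dense = [
    [0, 0, 0, 2, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
sparse = to_sparse(dense)
total_cells = len(dense) * len(dense[0])
print(sparse)
print('stored {} of {} cells'.format(len(sparse), total_cells))

# The same arithmetic for the shape printed above: a dense 700 x 31762 matrix
# would need this many cells, most of them holding 0.
print(700 * 31762)
```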

We convert the sparse matrix to a normal (dense) matrix. With the full dataset the sparse matrix was too big ([14560x147995]) and the conversion caused a memory error. The [700x31762] matrix from the 1000-article sample is small enough to convert and print.

In [12]:
print(count_train.A)
[[0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [2 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]]

Creating tf-idf vectorizer

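The process below is the same as the one followed for count_vectorizer, except that max_df=0.7 also drops tokens that appear in more than 70% of the documents.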

In [13]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english',max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)

print(tfidf_vectorizer.get_feature_names()[:10])
print(tfidf_train.A)
['00', '000', '002', '006', '01', '010', '018', '02', '026', '02863']
[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.01923386 0.         ... 0.         0.         0.        ]
 [0.17211452 0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.02405801 0.         ... 0.         0.         0.        ]]
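
To show where such weights come from, here is a hand-rolled tf-idf sketch. It follows the formula scikit-learn documents for its defaults (smoothed idf = ln((1 + n) / (1 + df)) + 1, then L2 normalisation of each row). The whitespace tokenization, the lack of stop-word removal and max_df, and the helper name tfidf_rows are simplifying assumptions, so the numbers will not match the notebook output.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) of L2-normalised tf-idf weights."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: the number of documents containing each token.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows

# 'news' occurs in every document, so it is down-weighted relative to 'election'.
docs = ['news election', 'news market', 'news weather']
vocab, rows = tfidf_rows(docs)
for tok, weight in zip(vocab, rows[0]):
    print('{:10s} {:.3f}'.format(tok, weight))
```

In the first document both words occur once, yet the word that appears in every article ends up with the smaller weight, which is the penalty described earlier.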

We convert the sparse vectors of count_vectorizer and tfidf_vectorizer into DataFrames.

In [38]:
count_df = pd.DataFrame(count_train.A,columns=count_vectorizer.get_feature_names())
tfidf_df = pd.DataFrame(tfidf_train.toarray(),columns=tfidf_vectorizer.get_feature_names())
In [39]:
count_df.head()
Out[39]:
00 000 002 006 01 010 018 02 026 02863 ... هوية وإنها وعثر وقال وكالات وكالة وكانت ويقول ᏸecca ḥaram
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 2 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 31762 columns

In [40]:
tfidf_df.head()
Out[40]:
00 000 002 006 01 010 018 02 026 02863 ... هوية وإنها وعثر وقال وكالات وكالة وكانت ويقول ᏸecca ḥaram
0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.000000 0.019234 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.172115 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 31762 columns

We check that count_df and tfidf_df have the same set of columns, and whether the two dataframes are identical. The column sets match, since their set difference is empty (remember that set keeps only the unique items of a list). The dataframes themselves should differ, as one holds the frequency of each word and the other holds the tf-idf value of each word, and equals indeed returns False.

In [41]:
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference)
print(count_df.equals(tfidf_df))
set()
False

Now that we've vectorized our datasets using the count_vectorizer and tfidf_vectorizer, we can build a supervised classifier. We have about 87% accuracy. The labels parameter in confusion_matrix sets the order of the rows and columns of the matrix. By default the labels appear in sorted order (0, then 1, for the original integer labels); passing labels=[1,0], or labels=['FAKE','NOT FAKE'] for the string labels used here, changes the display of the confusion matrix. As fake news is what we are interested in, we force the matrix to show the fake classification first.

In [42]:
import sklearn
import sklearn.metrics # ensures sklearn.metrics.* is loaded for the calls below
from sklearn.naive_bayes import MultinomialNB, ComplementNB # ComplementNB requires scikit-learn 0.20 or later

mnnb_classifier = MultinomialNB()

mnnb_classifier.fit(count_train,y_train) # provide it with the predictor and target variables
y_pred = mnnb_classifier.predict(count_test)

print(sklearn.metrics.accuracy_score(y_test,y_pred))
print(sklearn.metrics.confusion_matrix(y_test,y_pred))
print(sklearn.metrics.confusion_matrix(y_test,y_pred,labels=['FAKE','NOT FAKE']))
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-42-bad29d04456c> in <module>()
      1 import sklearn
----> 2 from sklearn.naive_bayes import MultinomialNB, ComplementNB
      3 
      4 mnnb_classifier = MultinomialNB()
      5 

ImportError: cannot import name 'ComplementNB' from 'sklearn.naive_bayes' (/home/randlow/anaconda3/lib/python3.7/site-packages/sklearn/naive_bayes.py)
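
Since the cell above did not produce a matrix, the effect of the labels argument can be sketched by hand. The function confusion below is a hypothetical stand-in, not the scikit-learn implementation, and the predictions are made up; it follows the same convention of true classes on the rows and predicted classes on the columns.

```python
def confusion(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes, in `labels` order."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Made-up predictions, for illustration only.
y_true = ['FAKE', 'FAKE', 'FAKE', 'NOT FAKE', 'NOT FAKE']
y_pred = ['FAKE', 'FAKE', 'NOT FAKE', 'NOT FAKE', 'FAKE']

fake_first = confusion(y_true, y_pred, labels=['FAKE', 'NOT FAKE'])
flipped = confusion(y_true, y_pred, labels=['NOT FAKE', 'FAKE'])
print(fake_first)   # [[2, 1], [1, 1]]
print(flipped)      # [[1, 1], [1, 2]]

# Accuracy is the share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)     # 0.6
```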

We try using the tfidf vector to see if the model works better

In [ ]:
mnnb_classifier.fit(tfidf_train,y_train) # provide it with the predictor and target variables
y_pred = mnnb_classifier.predict(tfidf_test)

print(sklearn.metrics.accuracy_score(y_test,y_pred))
print(sklearn.metrics.confusion_matrix(y_test,y_pred))
print(sklearn.metrics.confusion_matrix(y_test,y_pred,labels=['FAKE','NOT FAKE']))

We try a complement Naive Bayes classifier, which is usually good for unbalanced datasets. Since our dataset is balanced, it does not perform much better.

In [ ]:
cnb_classifier = ComplementNB()

cnb_classifier.fit(count_train,y_train) 
pred = cnb_classifier.predict(count_test)

sklearn.metrics.accuracy_score(y_test,pred)

Now we try to alter the additive smoothing parameter (i.e., alpha) to see which parameter gives us the best accuracy score. We do this for both the count_train and tfidf_train models.

In [ ]:
alphas = np.arange(0.1,1.5,0.1)

def fit_and_predict(x_train,x_test,y_train,y_test,alpha):
    # Fit a MultinomialNB with the given smoothing parameter and return its test accuracy
    mnnb_classifier = MultinomialNB(alpha=alpha)
    mnnb_classifier.fit(x_train,y_train)
    y_pred = mnnb_classifier.predict(x_test)
    return sklearn.metrics.accuracy_score(y_test,y_pred)

print('count vector')
for alpha in alphas:
    acc_score = fit_and_predict(count_train,count_test,y_train,y_test,alpha)
    print('Alpha: {0:.1f}, Acc Score:{1:.3f}'.format(alpha,acc_score))


print('tfidf vector')
for alpha in alphas:
    acc_score = fit_and_predict(tfidf_train,tfidf_test,y_train,y_test,alpha)
    print('Alpha: {0:.1f}, Acc Score:{1:.3f}'.format(alpha,acc_score))
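
For intuition about what alpha does, additive smoothing estimates a token's probability within a class as (count + alpha) / (total + alpha * vocabulary size). The sketch below uses made-up counts and a hypothetical helper smoothed_prob; it shows that a larger alpha gives unseen tokens more probability mass and pulls frequent tokens down.

```python
def smoothed_prob(count, total, vocab_size, alpha):
    """Additive (Laplace/Lidstone) smoothing of a token probability in one class."""
    return (count + alpha) / (total + alpha * vocab_size)

# Made-up counts: a class with 1000 tokens in total and a vocabulary of 5000.
total, vocab_size = 1000, 5000
for alpha in (0.1, 0.5, 1.0):
    unseen = smoothed_prob(0, total, vocab_size, alpha)
    seen = smoothed_prob(20, total, vocab_size, alpha)
    print('alpha={:.1f}  unseen={:.6f}  seen={:.6f}'.format(alpha, unseen, seen))
```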

Now we analyze the features of the dataset. We extract (i) the class labels, (ii) the feature names, and (iii) the feature weights. We then sort the features by weight, so that the 20 lowest-weighted features are associated with the first class label and the 20 highest-weighted features with the second. Fake news is characterized by many strange-looking tokens, whereas real news has tokens that make more sense, such as trump, clinton, percent, just, and white.

In [ ]:
class_label = mnnb_classifier.classes_
feature_names = tfidf_vectorizer.get_feature_names()

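# feature_log_prob_[1] holds the token log probabilities for class_label[1]; for a two-class model it equals coef_[0], which recent scikit-learn releases no longer provide
feat_wtg = sorted(zip(mnnb_classifier.feature_log_prob_[1],feature_names))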

print(class_label[0], feat_wtg[:20])
print(class_label[1], feat_wtg[-20:])
In [ ]:
print(class_label)
print(feature_names[:20])
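
The sort-and-slice step can be seen in isolation with a toy example. The tokens and log probabilities below are invented for illustration and are not taken from the fitted model.

```python
# Made-up log probabilities of five tokens under one class, for illustration.
feature_names = ['clinton', 'percent', 'trump', 'xyzzy', 'zzz']
log_probs = [-4.1, -5.0, -3.2, -9.7, -9.9]

# zip pairs each weight with its token; sorted orders the pairs by weight,
# lowest (least likely under this class) first.
ranked = sorted(zip(log_probs, feature_names))

print(ranked[:2])    # the two lowest-weighted tokens
print(ranked[-2:])   # the two highest-weighted tokens
```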