Introduction

Fake news includes false news stories, disinformation, and misinformation written with the intent of misleading people. The proliferation of fake news in recent years poses a serious danger to the safety and security of people around the world; some have even acted on fake news to commit serious crimes. The growth of the internet and social media has accelerated the spread and consumption of fake news and conspiracy theories. Determining whether a news item is fake or real is not obvious these days, especially when the perpetrators create websites with names that mimic well-known, authentic news outlets. In this post we will do an in-depth exploratory analysis and comparison of real and fake news. We will look at the negativity, positivity, and neutrality of the sentiments expressed in both fake and real news, and we will end by demonstrating how to predict whether a news item is fake or real with a classification model.

The dataset for this analysis is available on Kaggle. Find the link here.

#!pip install gensim # Gensim is an open-source library for unsupervised topic modeling and natural language processing
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from  nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
import re
from nltk.corpus import stopwords
import seaborn as sns 
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
import matplotlib.pyplot as plt
import plotly
import seaborn as sns
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from tqdm import tqdm
import itertools
import collections
from xgboost import XGBClassifier
import xgboost as xgb
%matplotlib inline
%autosave 5
import warnings
warnings.filterwarnings("ignore")
from tqdm.notebook import tqdm
tqdm.pandas(desc="Completed") # add progressbar to pandas, use progress_apply instead apply
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.express as px
from ipywidgets import interact #interactive plots
from IPython.display import clear_output
from google.colab import drive
drive.mount('/content/drive')
!unzip /content/drive/MyDrive/Data/Fake_True_News.zip
True_data = pd.read_csv('True.csv')
True_data['label']= 1
Fake_data = pd.read_csv('Fake.csv')
Fake_data['label']=0
True_data.head()
title text subject date label
0 As U.S. budget fight looms, Republicans flip t... WASHINGTON (Reuters) - The head of a conservat... politicsNews December 31, 2017 1
1 U.S. military to accept transgender recruits o... WASHINGTON (Reuters) - Transgender people will... politicsNews December 29, 2017 1
2 Senior U.S. Republican senator: 'Let Mr. Muell... WASHINGTON (Reuters) - The special counsel inv... politicsNews December 31, 2017 1
3 FBI Russia probe helped by Australian diplomat... WASHINGTON (Reuters) - Trump campaign adviser ... politicsNews December 30, 2017 1
4 Trump wants Postal Service to charge 'much mor... SEATTLE/WASHINGTON (Reuters) - President Donal... politicsNews December 29, 2017 1
df = pd.concat([True_data,Fake_data],axis=0)
print(df.shape)
print(type(df.label))
df.info()
(44898, 5)
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
Int64Index: 44898 entries, 0 to 23480
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   label    44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 2.1+ MB
#df= df.sample(frac=0.1)
df.shape
(44898, 5)
# Create and register a new `tqdm` instance with `pandas`
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()
#df["title"] = df["title"].progress_apply(preprocess)
#df["text"] = df["text"].progress_apply(preprocess)
from sklearn.utils import shuffle
df = shuffle(df)
df.head()
title text subject date label
13936 FORMER MEXICAN PREZ Sends “Middle Finger” To T... Wow these sound exactly like the type of peopl... politics May 10, 2016 0
10132 OOPS! NY GOV CUOMO Announces Statues of Confed... Really Andrew? Does all of New York really sta... politics Aug 17, 2017 0
7518 Janet Reno, first U.S. woman attorney general,... (Reuters) - Blunt-spoken Janet Reno, who serve... politicsNews November 7, 2016 1
17792 Murdered North Korean Kim Jong Nam had $100,00... KUALA LUMPUR (Reuters) - The half-brother of N... worldnews October 11, 2017 1
22509 PROPAGANDA: Star Trek Beyond – Social Justice ... Jay Dyer 21st Century WireThe last Star Trek r... US_News August 2, 2016 0

The gensim.utils.simple_preprocess utility handles the basic text cleaning, such as lowercasing and tokenization; stop words are then removed against both the gensim and NLTK stop-word lists, and the remaining tokens are lemmatized.
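As a quick illustration of what simple_preprocess does on its own, here is a small sketch (the sample sentence is made up, and the exact tokens depend on gensim's defaults):

from gensim.utils import simple_preprocess

# simple_preprocess lowercases, strips punctuation and tokenizes; tokens shorter
# than 2 or longer than 15 characters are dropped by default.
sample = "BREAKING: Senator's office issued 3 statements on Tuesday!"
print(simple_preprocess(sample))
# Roughly: ['breaking', 'senator', 'office', 'issued', 'statements', 'on', 'tuesday']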

stop_words = stopwords.words('english')
stop_words.extend(['said','say','from', 'subject', 're', 'edu', 'use'])
lemmatizer = WordNetLemmatizer() 
stemmer = SnowballStemmer("english")


def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 2 and token not in stop_words:
           #token = [stemmer.stem(token)  for token in text.split() ]
           result.append(lemmatizer.lemmatize(token))
    return result
df["title"] = df["title"].apply(preprocess)
df["text"] = df["text"].apply(preprocess)

Combine the title and text columns; this combined column will later be used to test whether classification performance improves over using either the title or the text alone to predict whether the news is fake or real.

df["title_text"] = df["text"]+df["title"]
df.subject.value_counts()
politicsNews       11272
worldnews          10145
News                9050
politics            6841
left-news           4459
Government News     1570
US_News              783
Middle-east          778
Name: subject, dtype: int64

Looking at the subjects, we can combine related subjects such as politicsNews and politics into politicsNews.

df.subject.replace({'politics':'politicsNews'},inplace=True)
#df['label'] = df['label'].astype(str)
df["label"].replace({0:"Fake",  1:"Real"},inplace=True)
df["label"].value_counts()
Fake    23481
Real    21417
Name: label, dtype: int64

What is the distribution of Real and Fake news articles?

temp=df.groupby('label').apply(lambda x:x['title'].count()).reset_index(name='Counts')
#temp.label.replace({0:'False',1:'True'},inplace=True)
temp.style.background_gradient(cmap='Purples')
label Counts
0 Fake 23481
1 Real 21417
#Let's draw a Funnel-Chart for better visualization
fig = go.Figure(go.Funnelarea(
    text =temp.label,
    values = temp.Counts,
    title = {"position": "top center", "text": "Funnel-Chart of News Distribution"}
    ))
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)

fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file1.html")
temp=df.groupby('label').apply(lambda x:x['title'].count()).reset_index(name='Counts')
#sub_tf_df.label.replace({0:'False',1:'True'},inplace=True)
fig = px.bar(temp, x="label", y="Counts",
             color='Counts', barmode='group',
             title =  "Frequency of Real and Fake News Distribution",
             height=400)

fig.update_layout(
    font_family="Courier New",
    font_color="white",
    title_font_family="Times New Roman",
    title_font_color="white"
   # legend_title_font_color="green"
)
fig.update_xaxes(title_font_family="Arial")
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout( template="plotly_dark")
#fig =px.scatter(x=range(10), y=range(10))
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file2.html")
fig.show()

Common Words in Fake News Titles

top = Counter([item for sublist in df[df.label == "Fake"]["title"] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Greens')
Common_words count
0 trump 9350
1 video 8558
2 obama 2582
3 hillary 2322
4 watch 1941
5 clinton 1175
6 president 1165
7 black 975
8 tweet 936
9 white 905
10 new 905
11 breaking 896
12 news 883
13 republican 867
14 donald 848
15 muslim 842
16 gop 807
17 american 772
18 democrat 772
19 america 703
fig = px.bar(temp, x="count", y="Common_words", title='Commmon Words in Fake News Titles', orientation='h', 
             width=700, height=700,color='Common_words')
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout( template="plotly_white")
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file3.html")

Common Words in Real News Titles

top = Counter([item for sublist in df[df.label == "Real"]["title"] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Purples')

Common_words count
0 trump 5567
1 say 2981
2 house 1452
3 russia 977
4 republican 976
5 north 926
6 korea 898
7 new 875
8 state 825
9 white 818
10 china 782
11 senate 759
12 court 753
13 tax 666
14 obama 665
15 clinton 659
16 election 656
17 vote 640
18 talk 597
19 leader 597
fig = px.bar(temp, x="count", y="Common_words", title='Commmon Words in Real news Titles', orientation='h', 
             width=700, height=700,color='Common_words')
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout( template="plotly_dark")
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file4.html")

Which Subjects have received the most News Coverage?

temp=df.groupby('subject').apply(lambda x:x['title'].count()).reset_index(name='Counts')
fig=px.bar(temp,x='subject',y='Counts',color='Counts',title='Count of News Articles by Subject')
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file2.html")
fig.update_layout( template="plotly_dark")
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file5.html")

Exploring Co-occurring Words (Bigrams)

Let’s now explore words which occur together in the news articles. Such word pairs are called bigrams. A bigram (or digram) is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words.
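A minimal illustration of extracting bigrams from a token list with NLTK (the tokens here are made up):

from nltk import bigrams

tokens = ['white', 'house', 'press', 'briefing']
print(list(bigrams(tokens)))
# [('white', 'house'), ('house', 'press'), ('press', 'briefing')]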

from nltk import bigrams, trigrams, ngrams
# Create a list of lists containing the bigrams in each real news article
terms_bigram = [list(bigrams(text)) for text in df[df.label == "Real"]["text"]]

# Flatten the list of bigrams
bigrams_all = list(itertools.chain(*terms_bigram))

# Count the bigrams
bigram_counts = collections.Counter(bigrams_all)

bigram_df = pd.DataFrame(bigram_counts.most_common(20),
                             columns=['bigram', 'count'])
bigram_df.style.background_gradient(cmap='Purples')
bigram count
0 ('united', 'state') 12215
1 ('donald', 'trump') 10169
2 ('white', 'house') 8419
3 ('washington', 'reuters') 6674
4 ('president', 'donald') 5930
5 ('north', 'korea') 5659
6 ('new', 'york') 4740
7 ('prime', 'minister') 4206
8 ('told', 'reuters') 3496
9 ('islamic', 'state') 3477
10 ('barack', 'obama') 3344
11 ('told', 'reporter') 3189
12 ('president', 'barack') 2960
13 ('hillary', 'clinton') 2499
14 ('supreme', 'court') 2481
15 ('trump', 'administration') 2477
16 ('reuters', 'president') 2394
17 ('year', 'old') 2334
18 ('united', 'nation') 2322
19 ('secretary', 'state') 2317
fig = px.pie(bigram_df, values=bigram_df['count'].values, names=bigram_df.bigram,
             color_discrete_sequence=px.colors.sequential.YlGnBu,
             title='High Frequency Bigrams in Real News')
fig.update_traces(textposition='inside', textinfo='percent')
fig.update_layout(
    width=800,
    height=800,
)
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file6.html")

Bigrams of Fake News

# Create a list of lists containing the bigrams in each fake news article
terms_bigram = [list(bigrams(text)) for text in df[df.label == "Fake"]["text"]]

# Flatten the list of bigrams
bigrams_all = list(itertools.chain(*terms_bigram))

# Count the bigrams
bigram_counts = collections.Counter(bigrams_all)

bigram_df = pd.DataFrame(bigram_counts.most_common(20),
                             columns=['bigram', 'count'])
bigram_df.style.background_gradient(cmap='Purples')
bigram count
0 ('donald', 'trump') 16402
1 ('featured', 'image') 8069
2 ('hillary', 'clinton') 7312
3 ('white', 'house') 6749
4 ('united', 'state') 6674
5 ('twitter', 'com') 6567
6 ('pic', 'twitter') 6232
7 ('new', 'york') 4361
8 ('president', 'obama') 4104
9 ('president', 'trump') 4065
10 ('getty', 'image') 4029
11 ('fox', 'news') 3524
12 ('year', 'old') 3248
13 ('barack', 'obama') 2352
14 ('trump', 'supporter') 2086
15 ('century', 'wire') 1930
16 ('trump', 'campaign') 1903
17 ('supreme', 'court') 1827
18 ('fake', 'news') 1818
19 ('secretary', 'state') 1764
colors = ['mediumturquoise','gold' ]
colors1 = ['#F4D03F','#82E0AA', "#F1948A",]


# Use `hole` to create a donut-like pie chart
fig = go.Figure(data=[go.Pie(labels= bigram_df.bigram, values=bigram_df['count'], hole=.4)])

fig.update_traces(hoverinfo='label+value', textinfo='percent', textfont_size=15,
                  marker=dict(colors=px.colors.sequential.YlGnBu, line=dict(color='#000000', width=1)))

fig.update_layout(
   # autosize=False,
    width=800,
    height=800,
   
)
fig.update_layout(
    title_text="High Frequency Bigrams in Fake News ",
    # Add annotations in the center of the donut pies.
    annotations=[dict(text=' ', x=0.5, y=0.5, font_size=14, showarrow=False)])
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file7.html")

Trigrams of Fake News

# Create a list of lists containing the trigrams in each fake news article
trigram = [list(ngrams(text, n=3)) for text in df[df.label == "Fake"]["text"]]

# Flatten the list of trigrams
trigram_all = list(itertools.chain(*trigram))

# Count the trigrams
trigram_counts = collections.Counter(trigram_all)
trigram_df = pd.DataFrame(trigram_counts.most_common(20),
                             columns=['trigram', 'count'])
trigram_df.style.background_gradient(cmap='Blues')
trigram count
0 ('pic', 'twitter', 'com') 6185
1 ('donald', 'trump', 'realdonaldtrump') 1692
2 ('new', 'york', 'time') 1581
3 ('black', 'life', 'matter') 1319
4 ('president', 'donald', 'trump') 1049
5 ('president', 'united', 'state') 1015
6 ('president', 'barack', 'obama') 953
7 ('news', 'century', 'wire') 939
8 ('featured', 'image', 'video') 887
9 ('century', 'wire', 'say') 733
10 ('video', 'screen', 'capture') 731
11 ('image', 'video', 'screen') 697
12 ('new', 'york', 'city') 651
13 ('featured', 'image', 'screenshot') 607
14 ('www', 'youtube', 'com') 577
15 ('youtube', 'com', 'watch') 575
16 ('http', 'www', 'youtube') 571
17 ('image', 'screen', 'capture') 529
18 ('director', 'james', 'comey') 518
19 ('fbi', 'director', 'james') 500
# plot
sns.set_style('ticks')
fig, ax = plt.subplots()
# figure size in inches
fig.set_size_inches(10, 7)
ax = sns.barplot(x="count", y="trigram", data=trigram_df,
                 palette="Blues_d")
plt.title('Trigrams of Fake News')
plt.show()

[Figure: bar chart of the top 20 trigrams in fake news]

Trigrams of Real News

# Create a list of lists containing the trigrams in each real news article
trigram = [list(ngrams(text, n=3)) for text in df[df.label == "Real"]["text"]]

# Flatten the list of trigrams
trigram_all = list(itertools.chain(*trigram))

# Count the trigrams
trigram_counts = collections.Counter(trigram_all)
trigram_df = pd.DataFrame(trigram_counts.most_common(20),
                             columns=['trigram', 'count'])
trigram_df.style.background_gradient(cmap='Blues')
trigram count
0 ('president', 'donald', 'trump') 5869
1 ('president', 'barack', 'obama') 2960
2 ('reuters', 'president', 'donald') 1562
3 ('washington', 'reuters', 'president') 1533
4 ('new', 'york', 'reuters') 845
5 ('elect', 'donald', 'trump') 834
6 ('president', 'elect', 'donald') 832
7 ('respond', 'request', 'comment') 806
8 ('president', 'vladimir', 'putin') 791
9 ('state', 'rex', 'tillerson') 673
10 ('secretary', 'state', 'rex') 672
11 ('national', 'security', 'adviser') 639
12 ('russian', 'president', 'vladimir') 613
13 ('speaker', 'paul', 'ryan') 608
14 ('washington', 'reuters', 'republican') 603
15 ('prime', 'minister', 'theresa') 584
16 ('chancellor', 'angela', 'merkel') 570
17 ('leader', 'mitch', 'mcconnell') 568
18 ('democratic', 'president', 'barack') 550
19 ('republican', 'presidential', 'candidate') 541
sns.set_style('ticks')
fig, ax = plt.subplots()
# figure size in inches
fig.set_size_inches(10, 7)
ax = sns.barplot(x="count", y="trigram", data=trigram_df,
                 palette="Blues_d")
plt.title('Trigrams of Real News')
plt.show()

[Figure: bar chart of the top 20 trigrams in real news]

Word Cloud using the Real News

top = Counter([item for sublist in df[df.label == "Real"]["text"] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Blues')
Common_words count
0 trump 54734
1 state 37678
2 reuters 28976
3 president 28728
4 republican 23007
5 year 22622
6 government 19992
7 house 17030
8 new 16917
9 united 15590
10 people 15356
11 party 15294
12 election 14759
13 official 14620
14 told 14245
15 country 14161
16 washington 12988
17 campaign 11155
18 group 11129
19 week 10658
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
%matplotlib inline

# Background colors for the word clouds
color = ['black', 'white']
fig, (ax1, ax2,) = plt.subplots(1, 2, figsize=[26, 10])
sns.set_color_codes("pastel")   


wordcloud = WordCloud(max_words = 2000,  
                       width=1000,
                       height=800,
                       colormap='viridis',
                      max_font_size=80, min_font_size=2,  # Font size range
                      background_color=color[1],
                    margin=0,
                     
               stopwords = stop_words).generate("".join(str(df[df.label == "Real"].text.values)))
ax1.imshow(wordcloud, interpolation = 'bilinear')

sns.color_palette("viridis", as_cmap=True)
ax2= sns.barplot(y="Common_words", x="count", data=temp,
            label="Total",palette="viridis")
ax2.set_ylabel('')  
ax2.set_title('Most Common Words in Real News');
#ax1.imshow(wordcloud)
ax1.axis('off')
ax1.set_title('High Frequency Words');

[Figure: word cloud and bar chart of high-frequency words in real news]

Word Cloud using the Fake News

Let's look at the most common words in the fake news text.

top = Counter([item for sublist in df[df.label == "Fake"]["text"] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Greens')
Common_words count
0 trump 80049
1 president 28406
2 people 26678
3 state 23663
4 clinton 19866
5 time 19199
6 year 19074
7 obama 18838
8 like 18667
9 american 18106
10 donald 17780
11 republican 16756
12 right 14857
13 news 14679
14 new 14416
15 image 14312
16 hillary 14192
17 white 13573
18 know 12062
19 medium 11847
fig, (ax1, ax2,) = plt.subplots(1, 2, figsize=[26, 10])
sns.set_color_codes("pastel")   
color = ['black','white'];

wordcloud = WordCloud(max_words = 2000,  
                       width=1000,
                       height=800,
                       colormap='viridis',
                      max_font_size=80, min_font_size=2,  # Font size range
                      background_color=color[1],
                    margin=0,
                     
               stopwords = stop_words).generate("".join(str(df[df.label == "Fake"].text.values)))
ax1.imshow(wordcloud, interpolation = 'bilinear')

sns.color_palette("viridis", as_cmap=True)
ax2= sns.barplot(y="Common_words", x="count", data=temp,
            label="Total",palette="viridis")
ax2.set_ylabel('')  
ax2.set_title('Most Common Words in Fake News');
#ax1.imshow(wordcloud)
ax1.axis('off')
ax1.set_title('High Frequency Words');

[Figure: word cloud and bar chart of high-frequency words in fake news]

Donald Trump dominates the news, whether fake or real.

Analysis Over Time

import datetime
from IPython.display import display, HTML

df["date"] = pd.to_datetime(df["date"], errors='coerce')

df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
h = df.head(2)

# render dataframe as html with a styled header row
html = h.to_html(render_links=True, index=False).replace('<th>','<th style = "background-color: #48c980">')
# write html to file
text_file = open("/content/drive/MyDrive/Colab Notebooks/NLP/h.html", "w")
text_file.write(html)
text_file.close()
# display the styled table inline
HTML(html)


import seaborn as sns
temp=df.groupby('year').apply(lambda x:x['text'].count()).reset_index(name='Counts')
temp.style.set_properties(**{'background-color': 'pink',
                            'color': 'black',
                            'border-color': 'white'})

temp.style.background_gradient(cmap='Greens')
year Counts
0 2015.000000 2479
1 2016.000000 16470
2 2017.000000 25904
3 2018.000000 35
temp=df.groupby('year').apply(lambda x:x['text'].count()).reset_index(name='Counts')
#temp['year'] = temp['year'].astype(str)

fig = px.bar(temp, x="year", y="Counts",color="Counts",text='Counts')
#fig = px.histogram(temp, x="year", y="Counts")
fig.update_layout(
    title=" Frequency of Words Used In News per  Year",
    xaxis_title="Year",
    yaxis_title="Frequency",
    #legend_title="",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="RebeccaPurple"
    )
)
#Forcing an axis to be categorical
fig.update_xaxes(type='category')
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout( template="plotly_white")
#fig.update_yaxes(title='')
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file8.html")
temp=df.groupby('month').apply(lambda x:x['text'].count()).reset_index(name='Counts')
temp.style.clear()
cm = sns.light_palette("green", as_cmap=True)
temp.style.background_gradient(cmap=cm)

month Counts
0 1.000000 3106
1 2.000000 2957
2 3.000000 3336
3 4.000000 3034
4 5.000000 3076
5 6.000000 2896
6 7.000000 2829
7 8.000000 2829
8 9.000000 5199
9 10.000000 5476
10 11.000000 5536
11 12.000000 4614
temp=df.groupby('month').apply(lambda x:x['text'].count()).reset_index(name='Counts')

temp.style.background_gradient(cmap=cm)
fig = px.bar(temp, x="month", y="Counts", color="Counts")
fig.update_layout(
    title=" Frequency of Words Used In News per Month of the Year",
    xaxis_title="Month",
    yaxis_title="Frequency",
    #legend_title="",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="white"
    )
)

fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout( template="plotly_dark")
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file9.html")
temp=df.groupby('day').apply(lambda x:x['text'].count()).reset_index(name='Counts')

#fig = px.scatter(temp, x="day", y="Counts",mode='lines+markers')
fig = go.Figure()
fig.add_trace(go.Scatter(x=temp.day, y=temp.Counts,
                         marker=dict(
        color=np.random.randn(temp.shape[0]),
        colorscale='Viridis',
        line_width=1
    ),
                    mode='lines+markers'))
#fig = px.line(temp, x="day", y="Counts")

fig.update_layout(
    title=" Frequency of Words Used In News per Day of the Month",
    xaxis_title="Month",
    yaxis_title="Frequency",
    #legend_title="",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="RebeccaPurple"
    )
)

fig.update_xaxes(tickvals=temp.day,tickangle=45)
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout( template="seaborn")
#fig.update_xaxes(showticklabels=False)
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file10.html")

Distribution of Token Lengths

df["token_length"]  = df["text"].apply(lambda x : len(x))
df.head()
title text subject date label title_text year month day token_length
3056 [chicago, cub, snub, trump, visit, white, hous... [unprecedented, clear, aimed, donald, trump, n... News 2017-01-11 Fake [unprecedented, clear, aimed, donald, trump, n... 2017.0 1.0 11.0 152
21285 [outrageous, example, radical, allowed, colleg... [war, expect, child, fight, war, leftist, acad... left-news 2015-11-28 Fake [war, expect, child, fight, war, leftist, acad... 2015.0 11.0 28.0 570
11575 [wow, barbara, bush, keynote, speaker, baby, k... [father, staunch, supporter, life, mother, voc... politicsNews 2017-02-25 Fake [father, staunch, supporter, life, mother, voc... 2017.0 2.0 25.0 149
267 [trump, puerto, rico, disgustingly, tell, puer... [thought, possibly, worse, donald, trump, lowe... News 2017-10-03 Fake [thought, possibly, worse, donald, trump, lowe... 2017.0 10.0 3.0 115
18532 [islamic, state, torch, oil, well, northern, i... [baghdad, reuters, islamic, state, militant, s... worldnews 2017-10-02 Real [baghdad, reuters, islamic, state, militant, s... 2017.0 10.0 2.0 120
import plotly.figure_factory as ff
import numpy as np


# Group data together
hist_data = [df[df.label=="Fake"]["token_length"], df[df.label=="Real"]["token_length"]]

group_labels = ['Fake', 'Real']
colors = ['slategray', 'magenta']
                         
# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, bin_size=.2,
                          curve_type='normal', # override default 'kde'
                         colors=colors)
#fig = ff.create_distplot(hist_data, group_labels, show_hist=False, colors=colors)
# Add title
#fig.update_layout(title_text='Distribution of Text Token Counts')
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout(xaxis_range=[0,1500])
fig.update_layout( template="plotly")
fig.update_layout(
    autosize=False,
    width=1000,
    height=600,)
fig.update_layout(
    title=" Distribution of Text Token Counts",
    xaxis_title="Token Length",
    yaxis_title="Frequency")
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file11.html")

Most news items contain between 0 and 500 tokens.
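A quick numeric check of that observation, assuming the token_length column computed above:

# Summary statistics of token lengths by label
print(df.groupby('label')['token_length'].describe(percentiles=[0.5, 0.9]))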

from matplotlib import pyplot as plt
import seaborn as sns

from matplotlib import rcParams

# figure size in inches
#rcParams['figure.figsize'] = 11.7,8.27
#plt.figure(figsize=(12,10))
#multiple="stack",, kind="kde", fill=True
sns.set_theme();
sns.set(rc={'figure.figsize':(13,10)})
#fig, ax = plt.subplots()
#sns.set_size_inches(11.7, 8.27)
sns.displot(df, x="token_length",hue="label")
plt.xlim([0, 2500]);
plt.xlabel('Token Length');
plt.ylabel('Frequency');
plt.show()

[Figure: distribution of token lengths for fake and real news]

Visualizing Networks of Co-occurring Words

The top occurring bigrams can be visualized as a network using the Python package NetworkX.
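The dictionary conversion used below (set_index followed by transpose and to_dict) turns the bigram dataframe into a mapping from bigram tuples to counts; here is a tiny sketch with made-up counts:

import pandas as pd

toy = pd.DataFrame({'bigram': [('white', 'house'), ('donald', 'trump')],
                    'count': [3, 5]})
d = toy.set_index('bigram').T.to_dict('records')
print(d[0])  # {('white', 'house'): 3, ('donald', 'trump'): 5}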

import networkx as nx

# Create dictionary of bigrams and their counts
d = bigram_df.set_index('bigram').T.to_dict('records')

# Create network plot 
G = nx.Graph()

# Create connections between nodes
for k, v in d[0].items():
    G.add_edge(k[0], k[1], weight=(v * 5))


fig, ax = plt.subplots(figsize=(18, 10))

pos = nx.spring_layout(G, k=2)

# Plot networks
nx.draw_networkx(G, pos,
                 font_size=16,
                 width=3,
                 edge_color='grey',
                 node_color='green',
                 with_labels = False,
                 ax=ax)


# Create offset labels
for key, value in pos.items():
    x, y = value[0]+.135, value[1]+.045
    ax.text(x, y,
            s=key,
            bbox=dict(facecolor='yellow', alpha=0.25),
            horizontalalignment='center', fontsize=13)
    
plt.show()

[Figure: network graph of co-occurring words built from the top bigrams]

Sentiment Analysis with Vader

  • For each news article, determine the polarity of its text with TextBlob (despite the column name sentiment_vader below, the score comes from TextBlob's sentiment analyzer rather than NLTK's VADER). The polarity is then used to classify each article as negative, positive, or neutral.
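To get a feel for TextBlob's polarity scores before applying them to the whole dataframe, here is a small sketch (the sentences are made up, and exact scores depend on TextBlob's lexicon):

from textblob import TextBlob

# Polarity ranges from -1 (most negative) to +1 (most positive).
print(TextBlob("This is a wonderful, inspiring achievement").sentiment.polarity)  # typically > 0  -> POSITIVE
print(TextBlob("This is a terrible, dishonest disaster").sentiment.polarity)      # typically < 0  -> NEGATIVE
print(TextBlob("The meeting is scheduled for Tuesday").sentiment.polarity)        # typically == 0 -> NEUTRAL
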
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
from textblob import TextBlob
from nltk import tokenize

#df.info()
df.drop_duplicates(subset = "text", keep = "first", inplace = True)


def get_polarity(text):
    return TextBlob(text).sentiment.polarity

df['Polarity'] = df['text'].progress_apply(get_polarity)
df['sentiment_vader']=''
df.loc[df.Polarity>0,'sentiment_vader']='POSITIVE'
df.loc[df.Polarity==0,'sentiment_vader']='NEUTRAL'
df.loc[df.Polarity<0,'sentiment_vader']='NEGATIVE'
#df.head()
df["label"].replace({0:"Fake",  1:"Real"},inplace=True)
temp=df.groupby(['label','sentiment_vader']).apply(lambda x:x['sentiment_vader'].count()).reset_index(name='Counts')
temp.style.background_gradient(cmap='Greens')
label sentiment_vader Counts
0 Fake NEGATIVE 4185
1 Fake NEUTRAL 322
2 Fake POSITIVE 12947
3 Real NEGATIVE 4654
4 Real NEUTRAL 551
5 Real POSITIVE 15987
#temp=df.groupby(['label','sentiment_vader']).agg(['count']).reset_index()[['label','sentiment_vader','title']]
#temp.rename(columns={'title': 'Counts'}, inplace=True)
#temp
#
fig = px.bar(temp, x="label", y="Counts", facet_col="sentiment_vader",
             color=["crimson","crimson","crimson","green","green","green"],text='Counts')
#Forcing an axis to be categorical
fig.update_xaxes(type='category')
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout(showlegend=False)
fig.update_layout( template="plotly_white")
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file12.html")
fig = px.bar(temp, x="label", y="Counts",color="sentiment_vader", barmode="group",text='Counts')
#Forcing an axis to be categorical
fig.update_xaxes(type='category')
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout(title_text='Sentiment Distribution over Real and Fake News')
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(barmode='group', xaxis_tickangle=0)
fig.update_layout(barmode='group', xaxis={'categoryorder':'category ascending'})
fig.update_layout(
    legend=dict(
        x=1,
        y=1,
        traceorder="reversed",
        #title_font_family="Times New Roman",
        font=dict(
            family="Courier",
            size=12,
            color="black"
        ),
        bgcolor="LightSteelBlue",
        bordercolor="Black",
        borderwidth=2
    )
)

fig.update_layout( template="simple_white")
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file13.html")

Has the News Grown More Negative Between 2015 to 2018?

Broadly, yes. The share of negative sentiment dipped in 2016 but rose again through 2017 and reached about 31% in 2018, although only a handful of articles are dated 2018.

df.head()
title text subject date label Polarity sentiment_vader year month day
0 As U.S. budget fight looms, Republicans flip t... WASHINGTON (Reuters) - The head of a conservat... politicsNews 2017-12-31 Real 0.037083 POSITIVE 2017.0 12.0 31.0
1 U.S. military to accept transgender recruits o... WASHINGTON (Reuters) - Transgender people will... politicsNews 2017-12-29 Real 0.055880 POSITIVE 2017.0 12.0 29.0
2 Senior U.S. Republican senator: 'Let Mr. Muell... WASHINGTON (Reuters) - The special counsel inv... politicsNews 2017-12-31 Real 0.115930 POSITIVE 2017.0 12.0 31.0
3 FBI Russia probe helped by Australian diplomat... WASHINGTON (Reuters) - Trump campaign adviser ... politicsNews 2017-12-30 Real 0.035968 POSITIVE 2017.0 12.0 30.0
4 Trump wants Postal Service to charge 'much mor... SEATTLE/WASHINGTON (Reuters) - President Donal... politicsNews 2017-12-29 Real 0.030093 POSITIVE 2017.0 12.0 29.0

df["date"] = pd.to_datetime(df["date"], errors='coerce')

df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['year'] = df['year'].fillna(0)
df['year'] = df['year'].astype(int)
temp=df.groupby(['year','sentiment_vader']).agg({'title':'count'}).query('year>0')
temp = temp.groupby(level=0).apply(lambda x:1 * x / float(x.sum())).reset_index()
temp.rename(columns={'title': 'Ratio'}, inplace=True)
temp.style.background_gradient(cmap='Blues')
year sentiment_vader Ratio
0 2015 NEGATIVE 0.250470
1 2015 NEUTRAL 0.028804
2 2015 POSITIVE 0.720726
3 2016 NEGATIVE 0.199669
4 2016 NEUTRAL 0.021059
5 2016 POSITIVE 0.779271
6 2017 NEGATIVE 0.244642
7 2017 NEUTRAL 0.022905
8 2017 POSITIVE 0.732453
9 2018 NEGATIVE 0.314286
10 2018 POSITIVE 0.685714
fig = px.bar(temp, x="year", y="Ratio",color="sentiment_vader", barmode="group")
#Forcing an axis to be categorical
fig.update_xaxes(type='category')
#fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_xaxes(type='category')
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout(title_text='Sentiment Distribution by Year')
#fig.update_traces( texttemplate='{}',textposition='outside')
fig.update_layout(barmode='group', xaxis_tickangle=-45)
fig.update_layout(barmode='group', xaxis={'categoryorder':'category ascending'})
fig.update_layout(yaxis=dict(tickformat=".0%"))
fig.update_yaxes(title="Percent",title_font_family="Arial")
fig.update_layout( template="ggplot2")
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file14.html")
temp=df.groupby(['day','sentiment_vader']).agg({'title':'count'})
temp = temp.groupby(level=0).apply(lambda x:1 * x / float(x.sum())).reset_index()
temp.rename(columns={'title': 'Ratio'}, inplace=True)
#fig = px.line(temp, x="day", y="Ratio",color="sentiment_vader",line_group="sentiment_vader")
fig = go.Figure()
fig.add_trace(go.Scatter(x=temp[temp.sentiment_vader=="POSITIVE"]["day"], 
                         y=temp[temp.sentiment_vader=="POSITIVE"]["Ratio"],
                    mode='lines+markers',
                    name='POSITIVE'))
fig.add_trace(go.Scatter(x=temp[temp.sentiment_vader=="NEGATIVE"]["day"], 
                         y=temp[temp.sentiment_vader=="NEGATIVE"]["Ratio"],
                    mode='lines+markers',
                    name='NEGATIVE'))
fig.add_trace(go.Scatter(x=temp[temp.sentiment_vader=="NEUTRAL"]["day"], 
                         y=temp[temp.sentiment_vader=="NEUTRAL"]["Ratio"],
                    mode='lines+markers', name='NEUTRAL'))
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout(title_text='Sentiment Distribution over Real and Fake News Every Day of The Month')
fig.update_layout( template="plotly_dark")
fig.update_layout(xaxis_tickangle=-45)
#["plotly", "plotly_white", "plotly_dark", "ggplot2", "seaborn", "simple_white", "none"]
fig.update_layout(yaxis=dict(tickformat=".0%"))
fig.update_yaxes(title="Percent",title_font_family="Arial")
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file15.html")

Classification of News as Real or Fake using Titles as Features

%%time
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

y=  le.fit_transform(df.label)

X_train, X_test, y_train, y_test = train_test_split(df.title, y, test_size = 0.2,random_state=2)

#vectorizer = CountVectorizer().fit(X_train)
vectorizer = TfidfVectorizer(use_idf=True).fit(X_train)
train_x = vectorizer.transform(X_train)
test_x = vectorizer.transform(X_test)


model = LogisticRegression(C=2.5)
model.fit(train_x, y_train)
y_pred = model.predict(test_x)
auc_value = roc_auc_score(y_test, y_pred)
print("ROC AUC score {}".format(auc_value))
ROC AUC score 0.9568209165268219
CPU times: user 1.77 s, sys: 1.14 s, total: 2.91 s
Wall time: 2.31 s

Using the title alone, the model separates fake from real news with a ROC AUC of nearly 0.96.
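For completeness, plain accuracy can be reported alongside the ROC AUC; since the predictions here are hard 0/1 labels, the ROC AUC above equals the balanced accuracy, so the two numbers will be close. A small sketch using the variables from the previous cell:

from sklearn.metrics import accuracy_score

# Compare plain accuracy with the ROC AUC computed on hard predictions
print("accuracy:", accuracy_score(y_test, y_pred))
print("ROC AUC :", roc_auc_score(y_test, y_pred))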

Classification of News as Real or Fake with Text as Features

%%time

X_train, X_test, y_train, y_test = train_test_split(df.text, y, test_size = 0.2,random_state=2)

vectorizer = CountVectorizer().fit(X_train)
#vectorizer = TfidfVectorizer(use_idf=True).fit(X_train)
train_x = vectorizer.transform(X_train)
test_x = vectorizer.transform(X_test)
model = xgb.XGBClassifier(
              n_jobs = -1,
              max_depth = 6,
              #learning_rate=  0.1,
              min_child_weight= 2, 
              #min_samples_split= 0.9,
              n_estimators= 100,
              eta = 0.1, 
              verbose = 1, 
              gamma=0.05,
              #nrounds = 100
              objective = "binary:logistic", 
              eval_metric = "auc",       #"aucpr",    # "aucpr",  #aucpr, auc
              subsample = 0.7,
              colsample_bytree =0.8,
              max_delta_step=1,
              verbosity=1,
              tree_method='approx')

model.fit(train_x, y_train)
y_pred = model.predict(test_x)
auc_value = roc_auc_score(y_test, y_pred)
print("ROC AUC score {}".format(auc_value))
ROC AUC score 0.9970340274765952
CPU times: user 3min 45s, sys: 359 ms, total: 3min 45s
Wall time: 2min 8s

Classification of News as Real or Fake with Title and Text Combined as Features

%%time
X_train, X_test, y_train, y_test = train_test_split(df.title_text, y, test_size = 0.2,random_state=2)

vectorizer = CountVectorizer().fit(X_train)
#vectorizer = TfidfVectorizer(use_idf=True).fit(X_train)
train_x = vectorizer.transform(X_train)
test_x = vectorizer.transform(X_test)

model = xgb.XGBClassifier(
              n_jobs = -1,
              max_depth = 6,
              #learning_rate=  0.1,
              min_child_weight= 2, 
              #min_samples_split= 0.9,
              n_estimators= 100,
              eta = 0.1, 
              verbose = 1, 
              gamma=0.05,
              #nrounds = 100
              objective = "binary:logistic", 
              eval_metric = "auc",       #"aucpr",    # "aucpr",  #aucpr, auc
              subsample = 0.7,
              colsample_bytree =0.8,
              max_delta_step=1,
              verbosity=1,

              tree_method='approx')
#model = LogisticRegression(C=2.5)
model.fit(train_x, y_train)
y_pred = model.predict(test_x)
auc_value = roc_auc_score(y_test, y_pred)
print("ROC AUC score {}".format(auc_value))
ROC AUC score 0.9975657161197259
CPU times: user 3min 51s, sys: 693 ms, total: 3min 52s
Wall time: 2min 12s
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import seaborn as sns
p=0.5

def PlotConfusionMatrix(y_test, pred, y_test_class0, y_test_class1, label):

    cfn_matrix = confusion_matrix(y_test, pred)
    # Normalize each row of the confusion matrix by the number of true samples in that class
    cfn_norm_matrix = np.array([[1.0 / y_test_class0, 1.0 / y_test_class0], [1.0 / y_test_class1, 1.0 / y_test_class1]])
    norm_cfn_matrix = cfn_matrix * cfn_norm_matrix

    #colsum=cfn_matrix.sum(axis=0)
    #norm_cfn_matrix = cfn_matrix / np.vstack((colsum, colsum)).T

    fig = plt.figure(figsize=(15,5))
    ax = fig.add_subplot(1,2,1)
    #sns.heatmap(cfn_matrix,cmap='magma',linewidths=0.5,annot=True,ax=ax,annot=True)
    sns.heatmap(cfn_matrix, annot = True,fmt='g',cmap='rocket')
    #tick_marks = np.arange(len(y_test))
    #plt.xticks(tick_marks, np.unique(y_test), rotation=45)
    plt.title('Confusion Matrix',color='b')
    plt.ylabel('Real Classes')
    plt.xlabel('Predicted Classes')
    plt.savefig('/content/drive/My Drive/Colab Notebooks/NLP/cm_' +label +  '.png')
        
    ax = fig.add_subplot(1,2,2)
    sns.heatmap(norm_cfn_matrix,cmap=plt.cm.Blues,linewidths=0.5,ax=ax,annot=True)

    plt.title('Normalized Confusion Matrix',color='b')
    plt.ylabel('Real Classes')
    plt.xlabel('Predicted Classes')
    plt.savefig('/content/drive/My Drive/Colab Notebooks/NLP/cm_norm' +label +  '.png')
    plt.show()
    
    print('---Classification Report---')
    print(classification_report(y_test,pred))
    

y_test_class0, y_test_class1 = np.bincount(y_test)  # class 0 = Fake, class 1 = Real (LabelEncoder order)
y_pred = np.where(y_pred < p, 0, 1)


PlotConfusionMatrix(y_test, y_pred, y_test_class0, y_test_class1, label='classification Report')

[Figure: confusion matrix and row-normalized confusion matrix heatmaps]

---Classification Report---
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4702
           1       1.00      1.00      1.00      4278

    accuracy                           1.00      8980
   macro avg       1.00      1.00      1.00      8980
weighted avg       1.00      1.00      1.00      8980