Introduction

Fake news includes false news stories, disinformation, and misinformation written with the intent of misleading people. The proliferation of fake news in recent years poses a serious danger to the safety and security of people around the world; some have even acted on fake news to commit serious crimes. The growth of the internet and social media has accelerated the spread and consumption of fake news and conspiracy theories. Determining whether a news item is fake or real is not obvious these days, especially when the perpetrators create websites with names that mimic well-known, authentic news outlets. In this post we will do an in-depth exploratory analysis and comparison of real and fake news. We will look at the negativity, positivity, and neutrality of the sentiments expressed in both fake and real news, and we will end by demonstrating how to predict whether a news item is fake or real with a classification model.

The dataset for this analysis is available on Kaggle. Find the link here.

#!pip install gensim # Gensim is an open-source library for unsupervised topic modeling and natural language processing
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from  nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
import re
from nltk.corpus import stopwords
import seaborn as sns 
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
import matplotlib.pyplot as plt
import plotly
import seaborn as sns
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from tqdm import tqdm
import itertools
import collections
from xgboost import XGBClassifier
import xgboost as xgb
%matplotlib inline
%autosave 5
import warnings
warnings.filterwarnings("ignore")
from tqdm.notebook import tqdm
tqdm.pandas(desc="Completed") # add progressbar to pandas, use progress_apply instead apply
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.express as px
from ipywidgets import interact #interactive plots
from IPython.display import clear_output
from google.colab import drive
drive.mount('/content/drive')
!unzip /content/drive/MyDrive/Data/Fake_True_News.zip
True_data = pd.read_csv('True.csv')
True_data['label']= 1
Fake_data = pd.read_csv('Fake.csv')
Fake_data['label']=0
True_data.head()
title text subject date label
0 As U.S. budget fight looms, Republicans flip t... WASHINGTON (Reuters) - The head of a conservat... politicsNews December 31, 2017 1
1 U.S. military to accept transgender recruits o... WASHINGTON (Reuters) - Transgender people will... politicsNews December 29, 2017 1
2 Senior U.S. Republican senator: 'Let Mr. Muell... WASHINGTON (Reuters) - The special counsel inv... politicsNews December 31, 2017 1
3 FBI Russia probe helped by Australian diplomat... WASHINGTON (Reuters) - Trump campaign adviser ... politicsNews December 30, 2017 1
4 Trump wants Postal Service to charge 'much mor... SEATTLE/WASHINGTON (Reuters) - President Donal... politicsNews December 29, 2017 1
df = pd.concat([True_data,Fake_data],axis=0)
print(df.shape)
print(type(df.label))
df.info()
(44898, 5)
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
Int64Index: 44898 entries, 0 to 23480
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   label    44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 2.1+ MB
#df= df.sample(frac=0.1)
df.shape
(44898, 5)
# Create and register a new `tqdm` instance with `pandas`
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()
#df["title"] = df["title"].progress_apply(preprocess)
#df["text"] = df["text"].progress_apply(preprocess)
from sklearn.utils import shuffle
df = shuffle(df)
df.head()
title text subject date label
13936 FORMER MEXICAN PREZ Sends “Middle Finger” To T... Wow these sound exactly like the type of peopl... politics May 10, 2016 0
10132 OOPS! NY GOV CUOMO Announces Statues of Confed... Really Andrew? Does all of New York really sta... politics Aug 17, 2017 0
7518 Janet Reno, first U.S. woman attorney general,... (Reuters) - Blunt-spoken Janet Reno, who serve... politicsNews November 7, 2016 1
17792 Murdered North Korean Kim Jong Nam had $100,00... KUALA LUMPUR (Reuters) - The half-brother of N... worldnews October 11, 2017 1
22509 PROPAGANDA: Star Trek Beyond – Social Justice ... Jay Dyer 21st Century WireThe last Star Trek r... US_News August 2, 2016 0

The gensim.utils.simple_preprocess utility handles the basic text cleaning, such as lowercasing and tokenization; stop words are then removed against both the gensim and NLTK stop-word lists, and the remaining tokens are lemmatized.
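As a quick illustration of what simple_preprocess does on its own, here is a small sketch (the sample sentence is made up, and the exact tokens depend on gensim's defaults):

from gensim.utils import simple_preprocess

# simple_preprocess lowercases, strips punctuation and tokenizes; tokens shorter
# than 2 or longer than 15 characters are dropped by default.
sample = "BREAKING: Senator's office issued 3 statements on Tuesday!"
print(simple_preprocess(sample))
# Roughly: ['breaking', 'senator', 'office', 'issued', 'statements', 'on', 'tuesday']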

stop_words = stopwords.words('english')
stop_words.extend(['said','say','from', 'subject', 're', 'edu', 'use'])
lemmatizer = WordNetLemmatizer() 
stemmer = SnowballStemmer("english")


def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 2 and token not in stop_words:
           #token = [stemmer.stem(token)  for token in text.split() ]
           result.append(lemmatizer.lemmatize(token))
    return result
df["title"] = df["title"].apply(preprocess)
df["text"] = df["text"].apply(preprocess)

Combine the title and text columns; this combined column will later be used to test whether classification performance improves over using either the title or the text alone to predict whether the news is fake or real.

df["title_text"] = df["text"]+df["title"]
df.subject.value_counts()
politicsNews       11272
worldnews          10145
News                9050
politics            6841
left-news           4459
Government News     1570
US_News              783
Middle-east          778
Name: subject, dtype: int64

Looking at the subjects, we can combine related subjects such as politicsNews and politics into politicsNews.

df.subject.replace({'politics':'politicsNews'},inplace=True)
#df['label'] = df['label'].astype(str)
df["label"].replace({0:"Fake",  1:"Real"},inplace=True)
df["label"].value_counts()
Fake    23481
Real    21417
Name: label, dtype: int64

What is the distribution of Real and Fake news articles?

temp=df.groupby('label').apply(lambda x:x['title'].count()).reset_index(name='Counts')
#temp.label.replace({0:'False',1:'True'},inplace=True)
temp.style.background_gradient(cmap='Purples')
label Counts
0 Fake 23481
1 Real 21417
#Let's draw a Funnel-Chart for better visualization
fig = go.Figure(go.Funnelarea(
    text =temp.label,
    values = temp.Counts,
    title = {"position": "top center", "text": "Funnel-Chart of News Distribution"}
    ))
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)

fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file1.html")
temp=df.groupby('label').apply(lambda x:x['title'].count()).reset_index(name='Counts')
#sub_tf_df.label.replace({0:'False',1:'True'},inplace=True)
fig = px.bar(temp, x="label", y="Counts",
             color='Counts', barmode='group',
             title =  "Frequency of Real and Fake News Distribution",
             height=400)

fig.update_layout(
    font_family="Courier New",
    font_color="white",
    title_font_family="Times New Roman",
    title_font_color="white"
   # legend_title_font_color="green"
)
fig.update_xaxes(title_font_family="Arial")
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout( template="plotly_dark")
#fig =px.scatter(x=range(10), y=range(10))
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file2.html")
fig.show()

Common Words in Fake News Titles

top = Counter([item for sublist in df[df.label == "Fake"]["title"] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Greens')
Common_words count
0 trump 9350
1 video 8558
2 obama 2582
3 hillary 2322
4 watch 1941
5 clinton 1175
6 president 1165
7 black 975
8 tweet 936
9 white 905
10 new 905
11 breaking 896
12 news 883
13 republican 867
14 donald 848
15 muslim 842
16 gop 807
17 american 772
18 democrat 772
19 america 703
fig = px.bar(temp, x="count", y="Common_words", title='Commmon Words in Fake News Titles', orientation='h', 
             width=700, height=700,color='Common_words')
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout( template="plotly_white")
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file3.html")

Common Words in Real News Titles

top = Counter([item for sublist in df[df.label == "Real"]["title"] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Purples')

Common_words count
0 trump 5567
1 say 2981
2 house 1452
3 russia 977
4 republican 976
5 north 926
6 korea 898
7 new 875
8 state 825
9 white 818
10 china 782
11 senate 759
12 court 753
13 tax 666
14 obama 665
15 clinton 659
16 election 656
17 vote 640
18 talk 597
19 leader 597
fig = px.bar(temp, x="count", y="Common_words", title='Commmon Words in Real news Titles', orientation='h', 
             width=700, height=700,color='Common_words')
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout( template="plotly_dark")
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file4.html")

Which Subjects have received the most News Coverage?

temp=df.groupby('subject').apply(lambda x:x['title'].count()).reset_index(name='Counts')
fig=px.bar(temp,x='subject',y='Counts',color='Counts',title='Count of News Articles by Subject')
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file2.html")
fig.update_layout( template="plotly_dark")
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file5.html")

Exploring Co-occurring Words (Bigrams)

Let’s now explore words which occur together in the news articles. Such word pairs are called bigrams. A bigram (or digram) is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words.
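A minimal illustration of extracting bigrams from a token list with NLTK (the tokens here are made up):

from nltk import bigrams

tokens = ['white', 'house', 'press', 'briefing']
print(list(bigrams(tokens)))
# [('white', 'house'), ('house', 'press'), ('press', 'briefing')]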

from nltk import bigrams, trigrams, ngrams
# Create a list of lists containing the bigrams in each real news article
terms_bigram = [list(bigrams(text)) for text in df[df.label == "Real"]["text"]]

# Flatten the list of bigrams
bigrams_all = list(itertools.chain(*terms_bigram))

# Count the bigrams
bigram_counts = collections.Counter(bigrams_all)

bigram_df = pd.DataFrame(bigram_counts.most_common(20),
                             columns=['bigram', 'count'])
bigram_df.style.background_gradient(cmap='Purples')
bigram count
0 ('united', 'state') 12215
1 ('donald', 'trump') 10169
2 ('white', 'house') 8419
3 ('washington', 'reuters') 6674
4 ('president', 'donald') 5930
5 ('north', 'korea') 5659
6 ('new', 'york') 4740
7 ('prime', 'minister') 4206
8 ('told', 'reuters') 3496
9 ('islamic', 'state') 3477
10 ('barack', 'obama') 3344
11 ('told', 'reporter') 3189
12 ('president', 'barack') 2960
13 ('hillary', 'clinton') 2499
14 ('supreme', 'court') 2481
15 ('trump', 'administration') 2477
16 ('reuters', 'president') 2394
17 ('year', 'old') 2334
18 ('united', 'nation') 2322
19 ('secretary', 'state') 2317
fig = px.pie(bigram_df, values=bigram_df['count'].values, names=bigram_df.bigram,
             color_discrete_sequence=px.colors.sequential.YlGnBu,
             title='High Frequency Bigrams in Real News')
fig.update_traces(textposition='inside', textinfo='percent')
fig.update_layout(
    width=800,
    height=800,
)
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file6.html")

Bigrams of Fake News

# Create a list of lists containing the bigrams in each fake news article
terms_bigram = [list(bigrams(text)) for text in df[df.label == "Fake"]["text"]]

# Flatten the list of bigrams
bigrams_all = list(itertools.chain(*terms_bigram))

# Count the bigrams
bigram_counts = collections.Counter(bigrams_all)

bigram_df = pd.DataFrame(bigram_counts.most_common(20),
                             columns=['bigram', 'count'])
bigram_df.style.background_gradient(cmap='Purples')
bigram count
0 ('donald', 'trump') 16402
1 ('featured', 'image') 8069
2 ('hillary', 'clinton') 7312
3 ('white', 'house') 6749
4 ('united', 'state') 6674
5 ('twitter', 'com') 6567
6 ('pic', 'twitter') 6232
7 ('new', 'york') 4361
8 ('president', 'obama') 4104
9 ('president', 'trump') 4065
10 ('getty', 'image') 4029
11 ('fox', 'news') 3524
12 ('year', 'old') 3248
13 ('barack', 'obama') 2352
14 ('trump', 'supporter') 2086
15 ('century', 'wire') 1930
16 ('trump', 'campaign') 1903
17 ('supreme', 'court') 1827
18 ('fake', 'news') 1818
19 ('secretary', 'state') 1764
colors = ['mediumturquoise','gold' ]
colors1 = ['#F4D03F','#82E0AA', "#F1948A",]


# Use `hole` to create a donut-like pie chart
fig = go.Figure(data=[go.Pie(labels= bigram_df.bigram, values=bigram_df['count'], hole=.4)])

fig.update_traces(hoverinfo='label+value', textinfo='percent', textfont_size=15,
                  marker=dict(colors=px.colors.sequential.YlGnBu, line=dict(color='#000000', width=1)))

fig.update_layout(
   # autosize=False,
    width=800,
    height=800,
   
)
fig.update_layout(
    title_text="High Frequency Bigrams in Fake News ",
    # Add annotations in the center of the donut pies.
    annotations=[dict(text=' ', x=0.5, y=0.5, font_size=14, showarrow=False)])
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file7.html")

Trigrams of Fake News

# Create a list of lists containing the trigrams in each fake news article
trigram = [list(ngrams(text, n=3)) for text in df[df.label == "Fake"]["text"]]

# Flatten the list of trigrams
trigram_all = list(itertools.chain(*trigram))

# Count the trigrams
trigram_counts = collections.Counter(trigram_all)
trigram_df = pd.DataFrame(trigram_counts.most_common(20),
                             columns=['trigram', 'count'])
trigram_df.style.background_gradient(cmap='Blues')
trigram count
0 ('pic', 'twitter', 'com') 6185
1 ('donald', 'trump', 'realdonaldtrump') 1692
2 ('new', 'york', 'time') 1581
3 ('black', 'life', 'matter') 1319
4 ('president', 'donald', 'trump') 1049
5 ('president', 'united', 'state') 1015
6 ('president', 'barack', 'obama') 953
7 ('news', 'century', 'wire') 939
8 ('featured', 'image', 'video') 887
9 ('century', 'wire', 'say') 733
10 ('video', 'screen', 'capture') 731
11 ('image', 'video', 'screen') 697
12 ('new', 'york', 'city') 651
13 ('featured', 'image', 'screenshot') 607
14 ('www', 'youtube', 'com') 577
15 ('youtube', 'com', 'watch') 575
16 ('http', 'www', 'youtube') 571
17 ('image', 'screen', 'capture') 529
18 ('director', 'james', 'comey') 518
19 ('fbi', 'director', 'james') 500
# plot
sns.set_style('ticks')
fig, ax = plt.subplots()
# figure size in inches
fig.set_size_inches(10, 7)
ax = sns.barplot(x="count", y="trigram", data=trigram_df,
                 palette="Blues_d")
plt.title('Trigrams of Fake News')
plt.show()

[Figure: bar chart of the top 20 trigrams in fake news]

Trigrams of Real News

# Create a list of lists containing the trigrams in each real news article
trigram = [list(ngrams(text, n=3)) for text in df[df.label == "Real"]["text"]]

# Flatten the list of trigrams
trigram_all = list(itertools.chain(*trigram))

# Count the trigrams
trigram_counts = collections.Counter(trigram_all)
trigram_df = pd.DataFrame(trigram_counts.most_common(20),
                             columns=['trigram', 'count'])
trigram_df.style.background_gradient(cmap='Blues')
trigram count
0 ('president', 'donald', 'trump') 5869
1 ('president', 'barack', 'obama') 2960
2 ('reuters', 'president', 'donald') 1562
3 ('washington', 'reuters', 'president') 1533
4 ('new', 'york', 'reuters') 845
5 ('elect', 'donald', 'trump') 834
6 ('president', 'elect', 'donald') 832
7 ('respond', 'request', 'comment') 806
8 ('president', 'vladimir', 'putin') 791
9 ('state', 'rex', 'tillerson') 673
10 ('secretary', 'state', 'rex') 672
11 ('national', 'security', 'adviser') 639
12 ('russian', 'president', 'vladimir') 613
13 ('speaker', 'paul', 'ryan') 608
14 ('washington', 'reuters', 'republican') 603
15 ('prime', 'minister', 'theresa') 584
16 ('chancellor', 'angela', 'merkel') 570
17 ('leader', 'mitch', 'mcconnell') 568
18 ('democratic', 'president', 'barack') 550
19 ('republican', 'presidential', 'candidate') 541
sns.set_style('ticks')
fig, ax = plt.subplots()
# figure size in inches
fig.set_size_inches(10, 7)
ax = sns.barplot(x="count", y="trigram", data=trigram_df,
                 palette="Blues_d")
plt.title('Trigrams of Real News')
plt.show()

[Figure: bar chart of the top 20 trigrams in real news]

Word Cloud using the Real News

top = Counter([item for sublist in df[df.label == "Real"]["text"] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Blues')
Common_words count
0 trump 54734
1 state 37678
2 reuters 28976
3 president 28728
4 republican 23007
5 year 22622
6 government 19992
7 house 17030
8 new 16917
9 united 15590
10 people 15356
11 party 15294
12 election 14759
13 official 14620
14 told 14245
15 country 14161
16 washington 12988
17 campaign 11155
18 group 11129
19 week 10658
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
%matplotlib inline

# Background colors for the word clouds
color = ['black', 'white']
fig, (ax1, ax2,) = plt.subplots(1, 2, figsize=[26, 10])
sns.set_color_codes("pastel")   


wordcloud = WordCloud(max_words = 2000,  
                       width=1000,
                       height=800,
                       colormap='viridis',
                      max_font_size=80, min_font_size=2,  # Font size range
                      background_color=color[1],
                    margin=0,
                     
               stopwords = stop_words).generate("".join(str(df[df.label == "Real"].text.values)))
ax1.imshow(wordcloud, interpolation = 'bilinear')

sns.color_palette("viridis", as_cmap=True)
ax2= sns.barplot(y="Common_words", x="count", data=temp,
            label="Total",palette="viridis")
ax2.set_ylabel('')  
ax2.set_title('Most Common Words in Real News');
#ax1.imshow(wordcloud)
ax1.axis('off')
ax1.set_title('High Frequency Words');

[Figure: word cloud and bar chart of high-frequency words in real news]

Word Cloud using the Fake News

Let's look at the most common words in the fake news text.

top = Counter([item for sublist in df[df.label == "Fake"]["text"] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Greens')
Common_words count
0 trump 80049
1 president 28406
2 people 26678
3 state 23663
4 clinton 19866
5 time 19199
6 year 19074
7 obama 18838
8 like 18667
9 american 18106
10 donald 17780
11 republican 16756
12 right 14857
13 news 14679
14 new 14416
15 image 14312
16 hillary 14192
17 white 13573
18 know 12062
19 medium 11847
fig, (ax1, ax2,) = plt.subplots(1, 2, figsize=[26, 10])
sns.set_color_codes("pastel")   
color = ['black','white'];

wordcloud = WordCloud(max_words = 2000,  
                       width=1000,
                       height=800,
                       colormap='viridis',
                      max_font_size=80, min_font_size=2,  # Font size range
                      background_color=color[1],
                    margin=0,
                     
               stopwords = stop_words).generate("".join(str(df[df.label == "Fake"].text.values)))
ax1.imshow(wordcloud, interpolation = 'bilinear')

sns.color_palette("viridis", as_cmap=True)
ax2= sns.barplot(y="Common_words", x="count", data=temp,
            label="Total",palette="viridis")
ax2.set_ylabel('')  
ax2.set_title('Most Common Words in Fake News');
#ax1.imshow(wordcloud)
ax1.axis('off')
ax1.set_title('High Frequency Words');

[Figure: word cloud and bar chart of high-frequency words in fake news]

Donald Trump dominates the news, whether fake or real.

Analysis Over Time

import datetime
from IPython.display import display, HTML

df["date"] = pd.to_datetime(df["date"], errors='coerce')

df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
h = df.head(2)

# render dataframe as html with a styled header row
html = h.to_html(render_links=True, index=False).replace('<th>','<th style = "background-color: #48c980">')
# write html to file
text_file = open("/content/drive/MyDrive/Colab Notebooks/NLP/h.html", "w")
text_file.write(html)
text_file.close()
# display the styled table inline
HTML(html)


import seaborn as sns
temp=df.groupby('year').apply(lambda x:x['text'].count()).reset_index(name='Counts')
temp.style.set_properties(**{'background-color': 'pink',
                            'color': 'black',
                            'border-color': 'white'})

temp.style.background_gradient(cmap='Greens')
year Counts
0 2015.000000 2479
1 2016.000000 16470
2 2017.000000 25904
3 2018.000000 35
temp=df.groupby('year').apply(lambda x:x['text'].count()).reset_index(name='Counts')
#temp['year'] = temp['year'].astype(str)

fig = px.bar(temp, x="year", y="Counts",color="Counts",text='Counts')
#fig = px.histogram(temp, x="year", y="Counts")
fig.update_layout(
    title=" Frequency of Words Used In News per  Year",
    xaxis_title="Year",
    yaxis_title="Frequency",
    #legend_title="",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="RebeccaPurple"
    )
)
#Forcing an axis to be categorical
fig.update_xaxes(type='category')
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout( template="plotly_white")
#fig.update_yaxes(title='')
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file8.html")
temp=df.groupby('month').apply(lambda x:x['text'].count()).reset_index(name='Counts')
temp.style.clear()
cm = sns.light_palette("green", as_cmap=True)
temp.style.background_gradient(cmap=cm)

month Counts
0 1.000000 3106
1 2.000000 2957
2 3.000000 3336
3 4.000000 3034
4 5.000000 3076
5 6.000000 2896
6 7.000000 2829
7 8.000000 2829
8 9.000000 5199
9 10.000000 5476
10 11.000000 5536
11 12.000000 4614
temp=df.groupby('month').apply(lambda x:x['text'].count()).reset_index(name='Counts')

temp.style.background_gradient(cmap=cm)
fig = px.bar(temp, x="month", y="Counts", color="Counts")
fig.update_layout(
    title=" Frequency of Words Used In News per Month of the Year",
    xaxis_title="Month",
    yaxis_title="Frequency",
    #legend_title="",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="white"
    )
)

fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout( template="plotly_dark")
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file9.html")
temp=df.groupby('day').apply(lambda x:x['text'].count()).reset_index(name='Counts')

#fig = px.scatter(temp, x="day", y="Counts",mode='lines+markers')
fig = go.Figure()
fig.add_trace(go.Scatter(x=temp.day, y=temp.Counts,
                         marker=dict(
        color=np.random.randn(temp.shape[0]),
        colorscale='Viridis',
        line_width=1
    ),
                    mode='lines+markers'))
#fig = px.line(temp, x="day", y="Counts")

fig.update_layout(
    title=" Frequency of Words Used In News per Day of the Month",
    xaxis_title="Month",
    yaxis_title="Frequency",
    #legend_title="",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="RebeccaPurple"
    )
)

fig.update_xaxes(tickvals=temp.day,tickangle=45)
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout( template="seaborn")
#fig.update_xaxes(showticklabels=False)
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file10.html")

Distribution of Token Lengths

df["token_length"]  = df["text"].apply(lambda x : len(x))
df.head()
title text subject date label title_text year month day token_length
3056 [chicago, cub, snub, trump, visit, white, hous... [unprecedented, clear, aimed, donald, trump, n... News 2017-01-11 Fake [unprecedented, clear, aimed, donald, trump, n... 2017.0 1.0 11.0 152
21285 [outrageous, example, radical, allowed, colleg... [war, expect, child, fight, war, leftist, acad... left-news 2015-11-28 Fake [war, expect, child, fight, war, leftist, acad... 2015.0 11.0 28.0 570
11575 [wow, barbara, bush, keynote, speaker, baby, k... [father, staunch, supporter, life, mother, voc... politicsNews 2017-02-25 Fake [father, staunch, supporter, life, mother, voc... 2017.0 2.0 25.0 149
267 [trump, puerto, rico, disgustingly, tell, puer... [thought, possibly, worse, donald, trump, lowe... News 2017-10-03 Fake [thought, possibly, worse, donald, trump, lowe... 2017.0 10.0 3.0 115
18532 [islamic, state, torch, oil, well, northern, i... [baghdad, reuters, islamic, state, militant, s... worldnews 2017-10-02 Real [baghdad, reuters, islamic, state, militant, s... 2017.0 10.0 2.0 120
import plotly.figure_factory as ff
import numpy as np


# Group data together
hist_data = [df[df.label=="Fake"]["token_length"], df[df.label=="Real"]["token_length"]]

group_labels = ['Fake', 'Real']
colors = ['slategray', 'magenta']
                         
# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, bin_size=.2,
                          curve_type='normal', # override default 'kde'
                         colors=colors)
#fig = ff.create_distplot(hist_data, group_labels, show_hist=False, colors=colors)
# Add title
#fig.update_layout(title_text='Distribution of Text Token Counts')
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout(xaxis_range=[0,1500])
fig.update_layout( template="plotly")
fig.update_layout(
    autosize=False,
    width=1000,
    height=600,)
fig.update_layout(
    title=" Distribution of Text Token Counts",
    xaxis_title="Token Length",
    yaxis_title="Frequency")
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file11.html")

Most news items contain between 0 and 500 tokens.
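A quick numeric check of that observation, assuming the token_length column computed above:

# Summary statistics of token lengths by label
print(df.groupby('label')['token_length'].describe(percentiles=[0.5, 0.9]))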

from matplotlib import pyplot as plt
import seaborn as sns

from matplotlib import rcParams

# figure size in inches
#rcParams['figure.figsize'] = 11.7,8.27
#plt.figure(figsize=(12,10))
#multiple="stack",, kind="kde", fill=True
sns.set_theme();
sns.set(rc={'figure.figsize':(13,10)})
#fig, ax = plt.subplots()
#sns.set_size_inches(11.7, 8.27)
sns.displot(df, x="token_length",hue="label")
plt.xlim([0, 2500]);
plt.xlabel('Token Length');
plt.ylabel('Frequency');
plt.show()

[Figure: distribution of token lengths for fake and real news]

Visualizing Networks of Co-occurring Words

The top occurring bigrams can be visualized as a network using the Python package NetworkX.
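The dictionary conversion used below (set_index followed by transpose and to_dict) turns the bigram dataframe into a mapping from bigram tuples to counts; here is a tiny sketch with made-up counts:

import pandas as pd

toy = pd.DataFrame({'bigram': [('white', 'house'), ('donald', 'trump')],
                    'count': [3, 5]})
d = toy.set_index('bigram').T.to_dict('records')
print(d[0])  # {('white', 'house'): 3, ('donald', 'trump'): 5}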

import networkx as nx

# Create dictionary of bigrams and their counts
d = bigram_df.set_index('bigram').T.to_dict('records')

# Create network plot 
G = nx.Graph()

# Create connections between nodes
for k, v in d[0].items():
    G.add_edge(k[0], k[1], weight=(v * 5))


fig, ax = plt.subplots(figsize=(18, 10))

pos = nx.spring_layout(G, k=2)

# Plot networks
nx.draw_networkx(G, pos,
                 font_size=16,
                 width=3,
                 edge_color='grey',
                 node_color='green',
                 with_labels = False,
                 ax=ax)


# Create offset labels
for key, value in pos.items():
    x, y = value[0]+.135, value[1]+.045
    ax.text(x, y,
            s=key,
            bbox=dict(facecolor='yellow', alpha=0.25),
            horizontalalignment='center', fontsize=13)
    
plt.show()

[Figure: network graph of co-occurring words built from the top bigrams]

Sentiment Analysis with Vader

  • For each news article, determine the polarity of its text with TextBlob (despite the column name sentiment_vader below, the score comes from TextBlob's sentiment analyzer rather than NLTK's VADER). The polarity is then used to classify each article as negative, positive, or neutral.
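To get a feel for TextBlob's polarity scores before applying them to the whole dataframe, here is a small sketch (the sentences are made up, and exact scores depend on TextBlob's lexicon):

from textblob import TextBlob

# Polarity ranges from -1 (most negative) to +1 (most positive).
print(TextBlob("This is a wonderful, inspiring achievement").sentiment.polarity)  # typically > 0  -> POSITIVE
print(TextBlob("This is a terrible, dishonest disaster").sentiment.polarity)      # typically < 0  -> NEGATIVE
print(TextBlob("The meeting is scheduled for Tuesday").sentiment.polarity)        # typically == 0 -> NEUTRAL
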
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
from textblob import TextBlob
from nltk import tokenize

#df.info()
df.drop_duplicates(subset = "text", keep = "first", inplace = True)


def get_polarity(text):
    return TextBlob(text).sentiment.polarity

df['Polarity'] = df['text'].progress_apply(get_polarity)
df['sentiment_vader']=''
df.loc[df.Polarity>0,'sentiment_vader']='POSITIVE'
df.loc[df.Polarity==0,'sentiment_vader']='NEUTRAL'
df.loc[df.Polarity<0,'sentiment_vader']='NEGATIVE'
#df.head()
df["label"].replace({0:"Fake",  1:"Real"},inplace=True)
temp=df.groupby(['label','sentiment_vader']).apply(lambda x:x['sentiment_vader'].count()).reset_index(name='Counts')
temp.style.background_gradient(cmap='Greens')
label sentiment_vader Counts
0 Fake NEGATIVE 4185
1 Fake NEUTRAL 322
2 Fake POSITIVE 12947
3 Real NEGATIVE 4654
4 Real NEUTRAL 551
5 Real POSITIVE 15987
#temp=df.groupby(['label','sentiment_vader']).agg(['count']).reset_index()[['label','sentiment_vader','title']]
#temp.rename(columns={'title': 'Counts'}, inplace=True)
#temp
#
fig = px.bar(temp, x="label", y="Counts", facet_col="sentiment_vader",
             color=["crimson","crimson","crimson","green","green","green"],text='Counts')
#Forcing an axis to be categorical
fig.update_xaxes(type='category')
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout(showlegend=False)
fig.update_layout( template="plotly_white")
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file12.html")
fig = px.bar(temp, x="label", y="Counts",color="sentiment_vader", barmode="group",text='Counts')
#Forcing an axis to be categorical
fig.update_xaxes(type='category')
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout(title_text='Sentiment Distribution over Real and Fake News')
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(barmode='group', xaxis_tickangle=0)
fig.update_layout(barmode='group', xaxis={'categoryorder':'category ascending'})
fig.update_layout(
    legend=dict(
        x=1,
        y=1,
        traceorder="reversed",
        #title_font_family="Times New Roman",
        font=dict(
            family="Courier",
            size=12,
            color="black"
        ),
        bgcolor="LightSteelBlue",
        bordercolor="Black",
        borderwidth=2
    )
)

fig.update_layout( template="simple_white")
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file13.html")

Has the News Grown More Negative Between 2015 to 2018?

Broadly, yes. The share of negative sentiment dipped in 2016 but rose again through 2017 and reached about 31% in 2018, although only a handful of articles are dated 2018.

df.head()
title text subject date label Polarity sentiment_vader year month day
0 As U.S. budget fight looms, Republicans flip t... WASHINGTON (Reuters) - The head of a conservat... politicsNews 2017-12-31 Real 0.037083 POSITIVE 2017.0 12.0 31.0
1 U.S. military to accept transgender recruits o... WASHINGTON (Reuters) - Transgender people will... politicsNews 2017-12-29 Real 0.055880 POSITIVE 2017.0 12.0 29.0
2 Senior U.S. Republican senator: 'Let Mr. Muell... WASHINGTON (Reuters) - The special counsel inv... politicsNews 2017-12-31 Real 0.115930 POSITIVE 2017.0 12.0 31.0
3 FBI Russia probe helped by Australian diplomat... WASHINGTON (Reuters) - Trump campaign adviser ... politicsNews 2017-12-30 Real 0.035968 POSITIVE 2017.0 12.0 30.0
4 Trump wants Postal Service to charge 'much mor... SEATTLE/WASHINGTON (Reuters) - President Donal... politicsNews 2017-12-29 Real 0.030093 POSITIVE 2017.0 12.0 29.0

df["date"] = pd.to_datetime(df["date"], errors='coerce')

df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['year'] = df['year'].fillna(0)
df['year'] = df['year'].astype(int)
temp=df.groupby(['year','sentiment_vader']).agg({'title':'count'}).query('year>0')
temp = temp.groupby(level=0).apply(lambda x:1 * x / float(x.sum())).reset_index()
temp.rename(columns={'title': 'Ratio'}, inplace=True)
temp.style.background_gradient(cmap='Blues')
year sentiment_vader Ratio
0 2015 NEGATIVE 0.250470
1 2015 NEUTRAL 0.028804
2 2015 POSITIVE 0.720726
3 2016 NEGATIVE 0.199669
4 2016 NEUTRAL 0.021059
5 2016 POSITIVE 0.779271
6 2017 NEGATIVE 0.244642
7 2017 NEUTRAL 0.022905
8 2017 POSITIVE 0.732453
9 2018 NEGATIVE 0.314286
10 2018 POSITIVE 0.685714
fig = px.bar(temp, x="year", y="Ratio",color="sentiment_vader", barmode="group")
#Forcing an axis to be categorical
fig.update_xaxes(type='category')
#fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_xaxes(type='category')
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout(title_text='Sentiment Distribution by Year')
#fig.update_traces( texttemplate='{}',textposition='outside')
fig.update_layout(barmode='group', xaxis_tickangle=-45)
fig.update_layout(barmode='group', xaxis={'categoryorder':'category ascending'})
fig.update_layout(yaxis=dict(tickformat=".0%"))
fig.update_yaxes(title="Percent",title_font_family="Arial")
fig.update_layout( template="ggplot2")
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file14.html")
temp=df.groupby(['day','sentiment_vader']).agg({'title':'count'})
temp = temp.groupby(level=0).apply(lambda x:1 * x / float(x.sum())).reset_index()
temp.rename(columns={'title': 'Ratio'}, inplace=True)
#fig = px.line(temp, x="day", y="Ratio",color="sentiment_vader",line_group="sentiment_vader")
fig = go.Figure()
fig.add_trace(go.Scatter(x=temp[temp.sentiment_vader=="POSITIVE"]["day"], 
                         y=temp[temp.sentiment_vader=="POSITIVE"]["Ratio"],
                    mode='lines+markers',
                    name='POSITIVE'))
fig.add_trace(go.Scatter(x=temp[temp.sentiment_vader=="NEGATIVE"]["day"], 
                         y=temp[temp.sentiment_vader=="NEGATIVE"]["Ratio"],
                    mode='lines+markers',
                    name='NEGATIVE'))
fig.add_trace(go.Scatter(x=temp[temp.sentiment_vader=="NEUTRAL"]["day"], 
                         y=temp[temp.sentiment_vader=="NEUTRAL"]["Ratio"],
                    mode='lines+markers', name='NEUTRAL'))
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout(title_text='Sentiment Distribution over Real and Fake News Every Day of The Month')
fig.update_layout( template="plotly_dark")
fig.update_layout(xaxis_tickangle=-45)
#["plotly", "plotly_white", "plotly_dark", "ggplot2", "seaborn", "simple_white", "none"]
fig.update_layout(yaxis=dict(tickformat=".0%"))
fig.update_yaxes(title="Percent",title_font_family="Arial")
fig.show()
fig.write_html("/content/drive/MyDrive/Colab Notebooks/NLP/htmlfies/file15.html")

Classification of News as Real or Fake using Titles as Features

%%time
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

y=  le.fit_transform(df.label)

X_train, X_test, y_train, y_test = train_test_split(df.title, y, test_size = 0.2,random_state=2)

#vectorizer = CountVectorizer().fit(X_train)
vectorizer = TfidfVectorizer(use_idf=True).fit(X_train)
train_x = vectorizer.transform(X_train)
test_x = vectorizer.transform(X_test)


model = LogisticRegression(C=2.5)
model.fit(train_x, y_train)
y_pred = model.predict(test_x)
auc_value = roc_auc_score(y_test, y_pred)
print("ROC AUC score {}".format(auc_value))
ROC AUC score 0.9568209165268219
CPU times: user 1.77 s, sys: 1.14 s, total: 2.91 s
Wall time: 2.31 s

Using the title alone, the model separates fake from real news with a ROC AUC of nearly 0.96.
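For completeness, plain accuracy can be reported alongside the ROC AUC; since the predictions here are hard 0/1 labels, the ROC AUC above equals the balanced accuracy, so the two numbers will be close. A small sketch using the variables from the previous cell:

from sklearn.metrics import accuracy_score

# Compare plain accuracy with the ROC AUC computed on hard predictions
print("accuracy:", accuracy_score(y_test, y_pred))
print("ROC AUC :", roc_auc_score(y_test, y_pred))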

Classification of News as Real or Fake with Text as Features

%%time

X_train, X_test, y_train, y_test = train_test_split(df.text, y, test_size = 0.2,random_state=2)

vectorizer = CountVectorizer().fit(X_train)
#vectorizer = TfidfVectorizer(use_idf=True).fit(X_train)
train_x = vectorizer.transform(X_train)
test_x = vectorizer.transform(X_test)
model = xgb.XGBClassifier(
              n_jobs = -1,
              max_depth = 6,
              #learning_rate=  0.1,
              min_child_weight= 2, 
              #min_samples_split= 0.9,
              n_estimators= 100,
              eta = 0.1, 
              verbose = 1, 
              gamma=0.05,
              #nrounds = 100
              objective = "binary:logistic", 
              eval_metric = "auc",       #"aucpr",    # "aucpr",  #aucpr, auc
              subsample = 0.7,
              colsample_bytree =0.8,
              max_delta_step=1,
              verbosity=1,
              tree_method='approx')

model.fit(train_x, y_train)
y_pred = model.predict(test_x)
auc_value = roc_auc_score(y_test, y_pred)
print("ROC AUC score {}".format(auc_value))
ROC AUC score 0.9970340274765952
CPU times: user 3min 45s, sys: 359 ms, total: 3min 45s
Wall time: 2min 8s

Classification of News as Real or Fake with Title and Text Combined as Features

%%time
X_train, X_test, y_train, y_test = train_test_split(df.title_text, y, test_size = 0.2,random_state=2)

vectorizer = CountVectorizer().fit(X_train)
#vectorizer = TfidfVectorizer(use_idf=True).fit(X_train)
train_x = vectorizer.transform(X_train)
test_x = vectorizer.transform(X_test)

model = xgb.XGBClassifier(
              n_jobs = -1,
              max_depth = 6,
              #learning_rate=  0.1,
              min_child_weight= 2, 
              #min_samples_split= 0.9,
              n_estimators= 100,
              eta = 0.1, 
              verbose = 1, 
              gamma=0.05,
              #nrounds = 100
              objective = "binary:logistic", 
              eval_metric = "auc",       #"aucpr",    # "aucpr",  #aucpr, auc
              subsample = 0.7,
              colsample_bytree =0.8,
              max_delta_step=1,
              verbosity=1,

              tree_method='approx')
#model = LogisticRegression(C=2.5)
model.fit(train_x, y_train)
y_pred = model.predict(test_x)
auc_value = roc_auc_score(y_test, y_pred)
print("ROC AUC score {}".format(auc_value))
ROC AUC score 0.9975657161197259
CPU times: user 3min 51s, sys: 693 ms, total: 3min 52s
Wall time: 2min 12s
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import seaborn as sns
p=0.5

def PlotConfusionMatrix(y_test, pred, y_test_class0, y_test_class1, label):

    cfn_matrix = confusion_matrix(y_test, pred)
    # Normalize each row of the confusion matrix by the number of true samples in that class
    cfn_norm_matrix = np.array([[1.0 / y_test_class0, 1.0 / y_test_class0], [1.0 / y_test_class1, 1.0 / y_test_class1]])
    norm_cfn_matrix = cfn_matrix * cfn_norm_matrix

    #colsum=cfn_matrix.sum(axis=0)
    #norm_cfn_matrix = cfn_matrix / np.vstack((colsum, colsum)).T

    fig = plt.figure(figsize=(15,5))
    ax = fig.add_subplot(1,2,1)
    #sns.heatmap(cfn_matrix,cmap='magma',linewidths=0.5,annot=True,ax=ax,annot=True)
    sns.heatmap(cfn_matrix, annot = True,fmt='g',cmap='rocket')
    #tick_marks = np.arange(len(y_test))
    #plt.xticks(tick_marks, np.unique(y_test), rotation=45)
    plt.title('Confusion Matrix',color='b')
    plt.ylabel('Real Classes')
    plt.xlabel('Predicted Classes')
    plt.savefig('/content/drive/My Drive/Colab Notebooks/NLP/cm_' +label +  '.png')
        
    ax = fig.add_subplot(1,2,2)
    sns.heatmap(norm_cfn_matrix,cmap=plt.cm.Blues,linewidths=0.5,ax=ax,annot=True)

    plt.title('Normalized Confusion Matrix',color='b')
    plt.ylabel('Real Classes')
    plt.xlabel('Predicted Classes')
    plt.savefig('/content/drive/My Drive/Colab Notebooks/NLP/cm_norm' +label +  '.png')
    plt.show()
    
    print('---Classification Report---')
    print(classification_report(y_test,pred))
    

y_test_class0, y_test_class1 = np.bincount(y_test)  # class 0 = Fake, class 1 = Real (LabelEncoder order)
y_pred = np.where(y_pred < p, 0, 1)


PlotConfusionMatrix(y_test, y_pred, y_test_class0, y_test_class1, label='classification Report')

[Figure: confusion matrix and row-normalized confusion matrix heatmaps]

---Classification Report---
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4702
           1       1.00      1.00      1.00      4278

    accuracy                           1.00      8980
   macro avg       1.00      1.00      1.00      8980
weighted avg       1.00      1.00      1.00      8980