Working with Python in R | Machine Learning, Data Science, Statistics

Introduction

Python and R are the most popular machine learning languages. Jupiter notebooks are the most popular platform for python users to write up and run code. Rmarkdown can also be used for this same purpose. The choice of which of these two largely depends on your goals, interest and background.Those with a background in statistics overwhemly prefer to work in R wheareas those with computer science background prefer python. Which is better? R or Python is often asked. Neither one is beter than the other. A third alternative I believe is using both together. The availability of libraries that allows you to work interchangeably between the two languages has made data science officialy agnostic. Examples of these packages include rPython, PythonInR, reticulate , rJython, SnakeCharmR and XRPython.

Load the required packages.

pacman::p_load(skimr,kableExtra,tidyverse,reticulate,scales,viridis)

The Python Engine

This is a new Python language engine for R Markdown. It supports bi-directional communication between R and Python (R chunks can access Python objects and vice-versa).

import csv
import datetime
import re
import codecs
import requests
import pandas as pd
import random as r
# Importing Dataset
data = pd.read_csv("https://raw.githubusercontent.com/NanaAkwasiAbayieBoateng/Datasets/master/EmployeeAttrition.csv")
# Print Text
print( "working with python" +"in R is much fun")
#generate a random number

## working with pythonin R is much fun

import random as r
print(r.random())

## 0.45130671156685687

print(data[['Age','Attrition','BusinessTravel','DailyRate','Department']].head(3))

##    Age Attrition     BusinessTravel  DailyRate              Department
## 0   41       Yes      Travel_Rarely       1102                   Sales
## 1   49        No  Travel_Frequently        279  Research & Development
## 2   37       Yes      Travel_Rarely       1373  Research & Development

# summary statistics from the data
print(data.describe())
#get the current working directory

##                Age    DailyRate  DistanceFromHome    Education  EmployeeCount  \
## count  1470.000000  1470.000000       1470.000000  1470.000000         1470.0   
## mean     36.923810   802.485714          9.192517     2.912925            1.0   
## std       9.135373   403.509100          8.106864     1.024165            0.0   
## min      18.000000   102.000000          1.000000     1.000000            1.0   
## 25%      30.000000   465.000000          2.000000     2.000000            1.0   
## 50%      36.000000   802.000000          7.000000     3.000000            1.0   
## 75%      43.000000  1157.000000         14.000000     4.000000            1.0   
## max      60.000000  1499.000000         29.000000     5.000000            1.0   
## 
##        EmployeeNumber  EnvironmentSatisfaction   HourlyRate  JobInvolvement  \
## count     1470.000000              1470.000000  1470.000000     1470.000000   
## mean      1024.865306                 2.721769    65.891156        2.729932   
## std        602.024335                 1.093082    20.329428        0.711561   
## min          1.000000                 1.000000    30.000000        1.000000   
## 25%        491.250000                 2.000000    48.000000        2.000000   
## 50%       1020.500000                 3.000000    66.000000        3.000000   
## 75%       1555.750000                 4.000000    83.750000        3.000000   
## max       2068.000000                 4.000000   100.000000        4.000000   
## 
##           JobLevel          ...           RelationshipSatisfaction  \
## count  1470.000000          ...                        1470.000000   
## mean      2.063946          ...                           2.712245   
## std       1.106940          ...                           1.081209   
## min       1.000000          ...                           1.000000   
## 25%       1.000000          ...                           2.000000   
## 50%       2.000000          ...                           3.000000   
## 75%       3.000000          ...                           4.000000   
## max       5.000000          ...                           4.000000   
## 
##        StandardHours  StockOptionLevel  TotalWorkingYears  \
## count         1470.0       1470.000000        1470.000000   
## mean            80.0          0.793878          11.279592   
## std              0.0          0.852077           7.780782   
## min             80.0          0.000000           0.000000   
## 25%             80.0          0.000000           6.000000   
## 50%             80.0          1.000000          10.000000   
## 75%             80.0          1.000000          15.000000   
## max             80.0          3.000000          40.000000   
## 
##        TrainingTimesLastYear  WorkLifeBalance  YearsAtCompany  \
## count            1470.000000      1470.000000     1470.000000   
## mean                2.799320         2.761224        7.008163   
## std                 1.289271         0.706476        6.126525   
## min                 0.000000         1.000000        0.000000   
## 25%                 2.000000         2.000000        3.000000   
## 50%                 3.000000         3.000000        5.000000   
## 75%                 3.000000         3.000000        9.000000   
## max                 6.000000         4.000000       40.000000   
## 
##        YearsInCurrentRole  YearsSinceLastPromotion  YearsWithCurrManager  
## count         1470.000000              1470.000000           1470.000000  
## mean             4.229252                 2.187755              4.123129  
## std              3.623137                 3.222430              3.568136  
## min              0.000000                 0.000000              0.000000  
## 25%              2.000000                 0.000000              2.000000  
## 50%              3.000000                 1.000000              3.000000  
## 75%              7.000000                 3.000000              7.000000  
## max             18.000000                15.000000             17.000000  
## 
## [8 rows x 26 columns]

import os
cwd = os.getcwd()
print(cwd)

## DataMiningscience/RPYTHON_files

print("Path at terminal when executing this file")

## Path at terminal when executing this file

print(os.getcwd() + "\n")

## /DataMiningscience/RPYTHON_files

#matplotlib inline
#Import datavis libraries
import matplotlib.pyplot as plt
# Hide warnings if there are any
import warnings
warnings.filterwarnings('ignore')
import pandas_datareader.data as web
import numpy as np
import seaborn as sns; sns.set() 
from datetime import datetime
from dateutil.parser import parse
import pandas as pd
# 
start = datetime(2015, 1, 1)
end = datetime(2018, 3, 1)
#download Apple stock data
df = web.DataReader(['AAPL'], 'morningstar', start, end)
df=pd.DataFrame(df)
print(df.dtypes)
## select the Date and Symbol index and add as a main column

## Close     float64
## High      float64
## Low       float64
## Open      float64
## Volume      int64
## dtype: object

df["Date"]=df.index.get_level_values('Date')
df["Symbol"]=df.index.get_level_values('Symbol')
df.tail(3)
#drop columns
df2=df.drop(columns=['Date','Symbol'])
#Convert  index  to  column
df2=df2.reset_index()
#drop index
#df.drop(index='Date')
#We can use the drop parameter to avoid the old index being added as a column:
df3=df
df3.reset_index(drop=True)
print(df3.head(4))

##                     Close    High     Low    Open    Volume       Date Symbol
## Symbol Date                                                                  
## AAPL   2015-01-01  110.38  110.38  110.38  110.38         0 2015-01-01   AAPL
##        2015-01-02  109.33  111.44  107.35  111.39  53204626 2015-01-02   AAPL
##        2015-01-05  106.25  108.65  105.41  108.29  64285490 2015-01-05   AAPL
##        2015-01-06  106.26  107.43  104.63  106.54  65797116 2015-01-06   AAPL

df4 = pd.DataFrame([('bird',    389.0),
                    ('bird',     24.0),
                   ('mammal',   80.5),
                    ('mammal', np.nan)],
                  index=['falcon', 'parrot', 'lion', 'monkey'],
                  columns=('class', 'max_speed'))
df4.head()
#drop an index column
print(df4.reset_index(drop=True))
#df3.reset_index(level=['Date'])
#df3.MultiIndex.drop(labels, level=None, errors='raise')

##     class  max_speed
## 0    bird      389.0
## 1    bird       24.0
## 2  mammal       80.5
## 3  mammal        NaN

df5=df2.drop(columns=['Symbol','Volume'])
#Set the index to become the 
df5=df5.set_index(['Date'])
fig, ax = plt.subplots(1, 1,figsize=(20,10))
plt.subplots(1, 0)
params = {'legend.fontsize': 'x-large',
          'figure.figsize': (15, 8),
         'axes.labelsize': 'x-large',
         'axes.titlesize':'x-large',
         'xtick.labelsize':'x-large',
         'ytick.labelsize':'x-large'}
plt.rcParams.update(params)
plt.title('Stock Prices ', fontsize=20)
plt.ylabel('Price', fontsize=20)
plt.xlabel('Date', fontsize=20)
df5.plot(lw=2,grid = True)
fig.savefig('Apple.jpg')
plt.grid(True)
plt.legend(loc='lower right')
plt.show()

We can also plot transfer the data from the Python environment and plot with ggplot as follows:

#install.packages("viridis")
library(viridis)


df6=py$df3%>%mutate(Date=as.Date(Date))
ggplot(df6)+geom_line(aes(x=Date,y=Close,color="Close"))+geom_line(aes(x=Date,y=High,color="High"))+geom_line(aes(x=Date,y=Low,color="Low"))+geom_line(aes(x=Date,y=Open,color="Open"))+
scale_color_viridis(discrete=TRUE,option = "A")+
  #scale_fill_viridis(discrete = T,option = "C")+
      theme(
    legend.position="top",
    legend.direction="horizontal",
    legend.title = element_blank(),
    text=element_text(size=8,  family="Comic Sans MS"),
    axis.text.x=element_text(angle=45,hjust=1,size = 9),
    axis.text.y=element_text(size = 8),
    legend.text = element_text(size=8)
)+

  
  labs(y="Value",x="Date",title="")+
  
  # scale_x_date(date_labels = "%b %d")
 #scale_x_date(labels = date_format("%m-%y"))

 scale_x_date(breaks = date_breaks("4 month"),labels=date_format("%b %Y") ) 

df6%>% select(Date,High,Close,Low,Open,Symbol)%>%
gather(Name,Value,-Date,-Symbol)%>%mutate(Name=as.factor(Name))%>%
  
  ggplot(aes(x=Date,y=Value,color=Name))+geom_line()+
  
scale_color_viridis(discrete=TRUE,option = "D")+
  #scale_fill_viridis(discrete = T,option = "C")+
   theme(
   
    
    legend.title = element_blank(),
    text=element_text(size=8,  family="Comic Sans MS"),
    axis.text.x=element_text(angle=45,hjust=1,size = 9),
    axis.text.y=element_text(size = 8),
    legend.text = element_text(size=8)
)+
   scale_x_date(breaks = date_breaks("4 month"),labels=date_format("%b %Y") ) 

  # scale_fill_manual(values=c("#B9DE28FF" , "#D1E11CFF" , "#E8E419FF"))

The reticulate Package

Another alternative to working with in R is via the reticulate package. The reticulate package has made it possible for several popular python libraries such as tensorflow and keras available in R. It provides a set of tools for interoperability between R and Python. It’s functionality include sourcing python scripts,using Python interactively within an R session and importing modules.

library(reticulate)
os <- import("os")

#get current working directory
os$getcwd()

## [1] "/DataMiningscience/RPYTHON_files"

#list files current directory
os$listdir(".")

## [1] ".DS_Store"                                    
## [2] "2018-07-22-Working-with-Python-n-Rstudio.html"
## [3] "Apple.jpg"                                    
## [4] "figure-html"                                  
## [5] "RPYTHON.Rmd"                                  
## [6] "RPYTHON_files"                                
## [7] "trig.jpg"

Python in R Markdown

from IPython import get_ipython
#get_ipython().magic('reset -sf')
import matplotlib.pyplot as plt
#%matplotlib inline 
import numpy as np
fig, ax = plt.subplots(1, 1,figsize=(20,10))
plt.subplots(1, 0)
t=np.arange(-5,5,0.01)
s1=1+np.sin(0.5*np.pi*t)
s2=1+np.cos(0.5*np.pi*t)
plt.plot(t,s1, label="sine", linestyle='--')
plt.plot(t,s2, label="cosine",    lw=2)
plt.grid(True)
plt.title('Sine and Cosine Functions ', fontsize=20)
plt.ylabel('y-axis', fontsize=20)
plt.xlabel('x-axis', fontsize=20)
plt.grid(True)
# Place a legend above this subplot, expanding itself to
# fully use the given bounding box.
#plt.legend(bbox_to_anchor=(0, 1.02, 1., .102), loc=3,
#          ncol=1, mode="expand", borderaxespad=0.)
# Place a legend to the right of this smaller subplot.
plt.legend(bbox_to_anchor=(-0.1, 1.11), loc=2, borderaxespad=0.5)
#plt.legend(loc='lower right')
plt.savefig('trig.jpg')
plt.show()
#print(s)

import pandas
employeedata = pandas.read_csv("https://raw.githubusercontent.com/NanaAkwasiAbayieBoateng/Datasets/master/EmployeeAttrition.csv")
employeedata = employeedata[employeedata['BusinessTravel'] == "Travel_Frequently"]
employeedata = employeedata[['Age', 'Attrition', 'BusinessTravel','DailyRate','Department']]
employeedata = employeedata.dropna()

library(ggplot2)
ggplot(py$employeedata, aes(Age,fill=Attrition)) + geom_histogram(bins = 30)

Sourcing Python scripts

A Python script can be sourced using source_python() function. For example, if you had the following Python script employeeAttrition.py:

source_python("/DataMiningscience/employeeAttrition.py")

employeedata <- read_employee("https://raw.githubusercontent.com/NanaAkwasiAbayieBoateng/Datasets/master/EmployeeAttrition.csv")

library(ggplot2)
ggplot(py$employeedata, aes(Age,fill=Attrition)) + geom_histogram(bins = 30)

Python REPL

The repl_python() function allows switching between Python and R a breeze.First a call is made to repl_python() function , which provides a Python REPL embedded within your R session.Objects created within the Python REPL can be accessed from R using the py object exported from reticulate. The Python and R code can together be run in the R console like the example below.

# repl_python()
# 
# import pandas
# 
# 
# employeedata = pandas.read_csv("https://raw.githubusercontent.com/NanaAkwasiAbayieBoateng/Datasets/master/EmployeeAttrition.csv")
# employeedata = employeedata[employeedata['BusinessTravel'] == "Travel_Frequently"]
# employeedata = employeedata[['Age', 'Attrition', 'BusinessTravel','DailyRate','Department']]
# employeedata = employeedata.dropna()
# exit
# 
# library(ggplot2)
# ggplot(py$employeedata, aes(Age,fill=Attrition)) + geom_histogram(bins = 30)

Tags: python,reticulate,rstudio