(syntax ONLY, no winners here)
A repo comparing syntax in R and Python for various tasks. Not comprehensive, but a starter set of one-liners to get you going.
This is essentially a fork of a slide deck from Decision Stats.
It is geared more toward R users trying out Python than the other way around. We use RStudio and R Markdown to create the reference.
RStudio users may want to check out Anaconda and Spyder.
# Let us use conda to get all the packages we need
conda install pandas
## [1] "/Users/sahilseth/anaconda/bin:/Users/sahilseth/anaconda/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/opt/X11/bin:/Library/TeX/texbin:/usr/texbin"
install.packages(c("e1071", "kknn", "randomForest", "rpart"))
# extra libs to compile this document
devtools::install_github("yihui/runr")
Resources:
- A cheatsheet comparing R/MATLAB and Python: http://mathesaurus.sourceforge.net/matlab-python-xref.pdf
- A book with various examples: Machine Learning: An Algorithmic Perspective
- A quick how-to by DataRobot
- An awesome slide deck describing Python for R users
Functions | R | Python |
---|---|---|
Downloading and installing a package | install.packages('name') | pip install name |
Load a package | library('name') | import name as other_name |
Checking working directory | getwd() | import os; os.getcwd() |
Setting working directory | setwd() | os.chdir() |
List files in a directory | dir() | os.listdir() |
List all objects | ls() | globals() |
Remove an object | rm(name) | del object |
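A minimal Python session covering the right-hand column (the path and the variable x are just placeholders):

```python
import os

os.getcwd()        # current working directory, like getwd()
os.chdir("/tmp")   # change it, like setwd("/tmp")
os.listdir(".")    # files in a directory, like dir()

x = 42
globals()          # objects in the global namespace, like ls()
del x              # remove an object, like rm(x)
```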
R
Creating a data frame df of dimension 6x4 (6 rows and 4 columns) containing random numbers
A <- matrix(runif(24,0,1), nrow=6, ncol=4)
df <- data.frame(A)
print(df)
## X1 X2 X3 X4
## 1 0.008414911 0.311282917 0.61186960 0.8987637
## 2 0.021930034 0.072141218 0.96503840 0.9508710
## 3 0.227929049 0.840288003 0.61093433 0.3269284
## 4 0.162879566 0.325983315 0.82753045 0.4227151
## 5 0.593558020 0.009578978 0.84678802 0.2197988
## 6 0.219805626 0.054050339 0.04714518 0.1655515
Here,

- the runif function generates 24 random numbers between 0 and 1
- the matrix function arranges those numbers into a matrix; nrow and ncol set the number of rows and columns
- data.frame converts the matrix to a data frame
Python
import numpy as np
import pandas as pd
A = np.random.randn(6, 4)
df = pd.DataFrame(A)
print(df)
## 0 1 2 3
## 0 -0.217405 -0.163276 0.936169 -0.089373
## 1 2.276137 0.891530 1.257429 -0.686684
## 2 0.295248 -0.528968 0.364880 0.274526
## 3 0.854174 -2.911316 0.768290 0.972371
## 4 -1.377254 2.524532 -0.718311 1.294197
## 5 0.252250 -0.408106 -0.598757 1.542085
Here,

- np.random.randn, part of the numpy library, generates a matrix of 6 rows and 4 columns; note it draws from the standard normal distribution, unlike R's runif above, which draws uniformly from [0, 1]
- pd.DataFrame converts the matrix into a data frame
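To mirror the R example exactly (uniform draws on [0, 1]), np.random.rand is the closer equivalent:

```python
import numpy as np
import pandas as pd

A = np.random.rand(6, 4)   # uniform on [0, 1), like runif(24, 0, 1)
df = pd.DataFrame(A)
print(df)
```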
function | R | Python |
---|---|---|
row names | rownames(df) | df.index |
column names | colnames(df) | df.columns |
first few rows | head(df) | df.head() |
last few rows | tail(df) | df.tail() |
get dimensions | dim(df) | df.shape |
length of df (same as number of columns) | length(df) | len(df.columns) |
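A quick sketch of the pandas side, reusing a random frame like the one built above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(6, 4))

df.index           # row labels, like rownames(df)
df.columns         # column labels, like colnames(df)
df.head()          # first few rows, like head(df)
df.tail()          # last few rows, like tail(df)
df.shape           # (rows, columns), like dim(df)
len(df.columns)    # number of columns, like length(df); len(df) counts rows instead
```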
function | R | Python |
---|---|---|
quick summary, including mean, std. dev., etc. | summary(df) | df.describe() |
setting row and column names | rownames(df) = c("a", "b"); colnames(df) = c("x", "y") | df.index = ["a", "b"]; df.columns = ["x", "y"] |
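For example, with a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])
df.index = ["a", "b"]      # row names, like rownames(df) = c("a", "b")
df.columns = ["x", "y"]    # column names, like colnames(df) = c("x", "y")
print(df.describe())       # count, mean, std, min, quartiles, max
```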
function | R | Python |
---|---|---|
sorting the data | df[order(df$x), ] | df.sort_values(['x']) |
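df.sort was deprecated and later removed from pandas; df.sort_values is the current method. For example:

```python
import pandas as pd

df = pd.DataFrame({"x": [3, 1, 2]})
print(df.sort_values(["x"]))                   # ascending, like df[order(df$x), ]
print(df.sort_values(["x"], ascending=False))  # descending, like df[order(-df$x), ]
```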
function | R | Python |
---|---|---|
slicing a set of rows, from row number x to y | df[x:y, ] | df.iloc[x-1:y] (Python starts counting from 0) |
selecting columns by name | df[, "a"], df$a, df["a"] | df.loc[:, ['a']] |
slicing rows and columns | df[x:y, a:b] | df.iloc[x-1:y, a-1:b] |
extract specific element | df[x, y] | df.iloc[x-1, y-1] |
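The off-by-one shift is the main trap here: R's df[2:3, ] and pandas' df.iloc[1:3] select the same rows. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30, 40], "b": [1, 2, 3, 4]})

print(df.iloc[1:3])       # rows 2-3 in R terms: df[2:3, ]
print(df.loc[:, ["a"]])   # column "a" by name: df[, "a"]
print(df.iloc[1:3, 0:1])  # rows 2-3, first column: df[2:3, 1]
print(df.iloc[0, 1])      # single element: df[1, 2]
```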
function | R | Python |
---|---|---|
subset rows where x > 5 | subset(df, x > 5) | df[df.x > 5] |
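Boolean masks generalize: conditions in pandas combine with & and | and need parentheses around each comparison:

```python
import pandas as pd

df = pd.DataFrame({"x": [3, 6, 9], "y": [1, 2, 3]})
print(df[df.x > 5])                 # subset(df, x > 5)
print(df[(df.x > 5) & (df.y < 3)])  # subset(df, x > 5 & y < 3)
```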
function | R | Python |
---|---|---|
sum | sum(x) | math.fsum(x) |
square root | sqrt(x) | math.sqrt(x) |
standard deviation | sd(x) | numpy.std(x, ddof=1) |
log | log(x) | math.log(x) |
mean | mean(x) | numpy.mean(x) |
median | median(x) | numpy.median(x) |

Note: numpy.std defaults to the population formula (ddof=0); pass ddof=1 to match R's sd.
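A quick sanity check, including the ddof caveat noted above:

```python
import math
import numpy as np

x = [1.0, 2.0, 3.0, 4.0]
math.fsum(x)        # 10.0, like sum(x)
math.sqrt(4)        # 2.0, like sqrt(4)
np.std(x, ddof=1)   # sample std. dev., matches R's sd(x)
math.log(math.e)    # 1.0, like log(exp(1))
np.mean(x)          # 2.5, like mean(x)
np.median(x)        # 2.5, like median(x)
```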
function | R | Python |
---|---|---|
convert character to numeric | as.numeric(x) | single values: int(x), float(x) (long(x) in Python 2 only); lists/vectors: map(int, x), map(float, x) |
convert numeric to character | as.character(x), paste(x) | single values: str(x); lists/vectors: map(str, x) |
check missing value | is.na(x), is.nan(x) | math.isnan(x) |
remove missing values | na.omit(x) | [v for v in x if not math.isnan(v)] |
number of characters in a value | nchar(x) | len(x) |
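In Python 3, map returns a lazy iterator, so wrap it in list() to materialize the converted values:

```python
import math

nums = list(map(float, ["1", "2", "3"]))     # like as.numeric(x)
strs = list(map(str, nums))                  # like as.character(x)

y = [1.0, float("nan"), 3.0]
math.isnan(y[1])                             # True, like is.nan(x)
clean = [v for v in y if not math.isnan(v)]  # like na.omit(x)
len("hello")                                 # 5, like nchar("hello")
```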
function | R (lubridate) | Python (datetime) |
---|---|---|
getting time and date | Sys.time() | d = datetime.datetime.now() |
parsing date and time as YYYY MM DD HH:MM:SS | lubridate::ymd_hms(Sys.time()) | d.strftime("%Y %m %d %H:%M:%S") |
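For example:

```python
import datetime

d = datetime.datetime.now()             # like Sys.time()
print(d.strftime("%Y %m %d %H:%M:%S"))  # e.g. 2016 01 15 13:45:07
```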
function | R | Python |
---|---|---|
scatter plot | plot(variable1, variable2) | plt.scatter(variable1, variable2); plt.show() |
boxplot | boxplot(Var) | plt.boxplot(Var); plt.show() |
histogram | hist(Var) | plt.hist(Var); plt.show() |
pie chart | pie(Var) | plt.pie(Var); plt.show() |

The Python column assumes:

import matplotlib.pyplot as plt
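A minimal sketch with made-up data:

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 25, 30]

plt.scatter(x, y)  # like plot(x, y)
plt.show()

plt.hist(y)        # like hist(y)
plt.show()
```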
R
To know more about svm function in R visit: http://cran.r-project.org/web/packages/e1071/
library(e1071)
data(iris)
trainset = iris[1:149,]
testset = iris[150,]
svm.model = svm(Species~., data = trainset, cost = 100, gamma = 1)
svm.pred = predict(svm.model, testset)
svm.pred
## 150
## virginica
## Levels: setosa versicolor virginica
Python
To install the sklearn library visit scikit-learn.org
To know more about sklearn svm, see the sklearn.svm.SVC documentation
from sklearn import svm
from sklearn import datasets
# Calling SVM
clf = svm.SVC()
iris = datasets.load_iris()
# Constructing training data X and labels y: all rows except the last
X, y = iris.data[:-1], iris.target[:-1]
# Fitting SVM
clf.fit(X, y)
# Testing the model on the held-out last row
print(clf.predict(iris.data[-1:]))
# Output: 2, corresponds to virginica
R
To know more about lm function in R visit: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html
library(broom)
data(iris)
iris$y <- sapply(as.character(iris$Species), function(x){
switch (x,
setosa = 0,
versicolor = 1,
2
)
})
train_set <- iris[1:149,]
test_set <- iris[150,]
fit <- lm(y ~ 0+Sepal.Length+ Sepal.Width + Petal.Length+ Petal.Width , data=train_set)
tidy(fit)
## term estimate std.error statistic p.value
## 1 Sepal.Length -0.07454598 0.04926761 -1.5130828 1.324352e-01
## 2 Sepal.Width -0.03465755 0.05695934 -0.6084611 5.438337e-01
## 3 Petal.Length 0.21590110 0.05664803 3.8112730 2.037526e-04
## 4 Petal.Width 0.60581643 0.09340629 6.4858203 1.301553e-09
coefficients(fit)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## -0.07454598 -0.03465755 0.21590110 0.60581643
predict.lm(fit, test_set)
## 150
## 1.647771
Python
To know more about sklearn linear regression visit: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
from sklearn import linear_model
from sklearn import datasets
iris = datasets.load_iris()
regr = linear_model.LinearRegression()
X, y = iris.data[:-1], iris.target[:-1]
regr.fit(X, y)
print(regr.coef_)
print(regr.predict(iris.data[-1:]))
## [-0.09726197 -0.05347337 0.21782359 0.61500051]
## [ 1.65708429]
R
To know more about randomForest package in R visit: http://cran.r-project.org/web/packages/randomForest/
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
iris.rf <- randomForest(y ~ ., data=train_set,ntree=100,importance=TRUE, proximity=TRUE)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
print(iris.rf)
##
## Call:
## randomForest(formula = y ~ ., data = train_set, ntree = 100, importance = TRUE, proximity = TRUE)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 0.01151123
## % Var explained: 98.27
predict(iris.rf, test_set, predict.all=TRUE)
## $aggregate
## 150
## 1.963167
##
## $individual
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## 150 2 2 2 2 2 2 2 2 2 2 2 2 2
## [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## 150 2 1.8 2 2 2 2 2 2 2 2 2
## [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35]
## 150 2 2 2 2 2 2 2 2 2 2 2
## [,36] [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46]
## 150 2 2 2 2 2 1.8 2 2 2 2 2
## [,47] [,48] [,49] [,50] [,51] [,52] [,53] [,54] [,55] [,56] [,57]
## 150 2 2 2 2 2 2 1 2 2 2 2
## [,58] [,59] [,60] [,61] [,62] [,63] [,64] [,65] [,66] [,67] [,68]
## 150 2 2 2 2 2 2 2 2 2 2 2
## [,69] [,70] [,71] [,72] [,73] [,74] [,75] [,76] [,77] [,78] [,79]
## 150 2 2 1.8 2 2 2 2 2 2 2 2
## [,80] [,81] [,82] [,83] [,84] [,85] [,86] [,87] [,88] [,89] [,90]
## 150 2 2 2 2 2 1 2 2 2 2 1.666667
## [,91] [,92] [,93] [,94] [,95] [,96] [,97] [,98] [,99] [,100]
## 150 2 2 2 2 1.25 2 2 2 2 2
Python
To know more about sklearn random forest visit: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
from sklearn import ensemble
from sklearn import datasets
clf = ensemble.RandomForestClassifier(n_estimators=100, max_depth=10)
iris = datasets.load_iris()
X, y = iris.data[:-1], iris.target[:-1]
clf.fit(X, y)
print(clf.predict(iris.data[-1:]))
# Output: 2, corresponds to virginica (note: the R example above fit a regression forest, hence its fractional prediction)
## [2]
R
To know more about rpart package in R visit: http://cran.r-project.org/web/packages/rpart/
library(rpart)
data(iris)
sub = c(1:149)
fit = rpart(Species ~., data = iris, subset = sub)
pred = predict(fit, iris[sub, ], type = "class")
Python
To know more about sklearn decision tree visit: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
X, y = iris.data[:-1], iris.target[:-1]
clf.fit(X, y)
print(clf.predict(iris.data[-1:]))
# Output: 2, corresponds to virginica
## [2]
R
To know more about e1071 package in R visit: http://cran.r-project.org/web/packages/e1071/
library(e1071)
data(iris)
trainset = iris[1:149,]
testset = iris[150,]
classifier = naiveBayes(trainset[,1:4], trainset[, 5])
predict(classifier, testset[,1:4])
## [1] virginica
## Levels: setosa versicolor virginica
Python
To know more about sklearn Naive Bayes visit: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
iris = load_iris()
X, y = iris.data[:-1], iris.target[:-1]
clf.fit(X, y)
print(clf.predict(iris.data[-1:]))
# Output: 2, corresponds to virginica
## [2]
R
To know more about kknn package in R visit: http://cran.r-project.org/web/packages/kknn/
library(kknn)
data(iris)
trainset <- iris[1:149,]
testset = iris[150,]
iris.kknn = kknn(Species~., trainset,testset, distance = 1, kernel = "triangular")
summary(iris.kknn)
##
## Call:
## kknn(formula = Species ~ ., train = trainset, test = testset, distance = 1, kernel = "triangular")
##
## Response: "nominal"
## fit prob.setosa prob.versicolor prob.virginica
## 1 virginica 0 0.232759 0.767241
fit <- fitted(iris.kknn)
fit
## [1] virginica
## Levels: setosa versicolor virginica
Python
To know more about sklearn k-nearest neighbors visit: scikit-learn.org
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
iris = load_iris()
X, y = iris.data[:-1], iris.target[:-1]
knn.fit(X, y)
print(knn.predict(iris.data[-1:]))
# Output: 2, corresponds to virginica