Quick Intro to R

About the R language, briefly

If you are used to traditional computing languages, you will find R different in many ways. The basic ideas behind R date back four decades (to the S language of Chambers, Becker and Wilks) and have a strong flavor of exploration: one can grapple with data, understand its structure, visualize it, summarize it, and so on. A common way to use R is therefore to type a command and immediately see the results. (Scripts can of course also be written and fed to R for batch execution.)

The core of R itself is reasonably small, but over time, it has also become a vehicle for researchers to disseminate new tools and methodologies via packages. That is one reason for R’s popularity: there are thousands of packages (12000+ as of this writing) that extend R in many useful ways.

The CRAN website is a crucial resource, hosting all the software, documentation and manuals.

Some detail on the language

A very brief run-through.

R Style and Nomenclature

  • R variable names frequently contain periods and underscores. Example: male.cholesterol or male_cholesterol

  • R users tend to use the word objects to refer to R variables, functions, datasets, etc. The term is broader than "object" in the object-oriented programming sense.

  • John Chambers: Everything that exists in R is an object. Everything that happens is a function call.

So, in R, all action occurs via functions. Even something as simple as

1 + 2
## [1] 3

is computed via a function call.
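
As a small illustration of Chambers' remark, the sketch below writes a few operators in their explicit function-call form. The backtick-quoted operator names are standard R; the vector v is made up for the example.

```r
# Infix notation and the explicit function-call form are equivalent
1 + 2
## [1] 3
`+`(1, 2)
## [1] 3

# Indexing is a function call as well
v <- c(10, 20, 30)
`[`(v, 2)
## [1] 20

# Since operators are functions, they can be passed to other functions
Reduce(`+`, 1:4)
## [1] 10
```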

  • You can use = or <- for assignment, but <- is the conventional choice: it always means assignment, whereas = inside a function call names an argument.
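
The difference between the two operators shows up inside a function call. The following is a minimal sketch, assuming a fresh session; double_it and value are hypothetical names introduced only for this illustration.

```r
# Hypothetical helper used only for illustration
double_it <- function(value) 2 * value

# Inside a call, = names the argument; no variable called 'value' is created
double_it(value = 5)
## [1] 10
exists("value", inherits = FALSE)
## [1] FALSE

# Inside a call, <- assigns in the calling environment and passes the result
double_it(value <- 5)
## [1] 10
exists("value", inherits = FALSE)
## [1] TRUE
```

Writing <- for assignment and = only for naming arguments keeps the two meanings visually distinct.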

Vectors and Indexing

  • R uses 1-based indexing!

  • R has no concept of a scalar. A scalar is simply a vector of length 1.

  • R can handle both numeric and non-numeric (e.g. character) data. Beware: R shows its statistical origins in that some non-numeric data may be automatically converted to factors (for example, data.frame() did this to character columns by default before R 4.0.0).

  • Combining numeric and non-numeric values in one vector silently coerces every element to the single type that can represent them all!

c(1, 2, "foo", "bar")
## [1] "1"   "2"   "foo" "bar"
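
The points above about scalars and coercion can be checked directly. This is a short sketch using only base R functions; the values are arbitrary.

```r
# A "scalar" is a vector of length 1
length(5)
## [1] 1
is.vector(5)
## [1] TRUE

# Coercion picks the most general type present:
# logical < integer < double < character
class(c(TRUE, 2L))
## [1] "integer"
class(c(2L, 3.5))
## [1] "numeric"
class(c(3.5, "foo"))
## [1] "character"

# TRUE becomes 1 when combined with a number
c(TRUE, 2L)
## [1] 1 2
```
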
  • Logicals in R are TRUE and FALSE.

  • Indexing can be done with integers, names, logical values etc.

x <- 1:10
x[x %% 2 == 0]
## [1]  2  4  6  8 10
x[-(1:5)]
## [1]  6  7  8  9 10
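
The example above uses a logical condition and negative integers. As a complementary sketch, the named vector below (names and values made up for illustration) shows indexing by name and by position.

```r
cholesterol <- c(alice = 180, bob = 210, carol = 195)

# By name
cholesterol[["bob"]]
## [1] 210

# By integer position (names dropped here only to keep the printout short)
unname(cholesterol[c(1, 3)])
## [1] 180 195

# By logical condition, returning the names of the matching elements
names(cholesterol)[cholesterol > 190]
## [1] "bob"   "carol"
```
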
  • Lists are versatile data structures that can grow or shrink and contain heterogeneous data. They are constructed using the list function:
(aList <- list(a = 1, b = 2, c = list(1, 2, "abc")))
## $a
## [1] 1
## 
## $b
## [1] 2
## 
## $c
## $c[[1]]
## [1] 1
## 
## $c[[2]]
## [1] 2
## 
## $c[[3]]
## [1] "abc"

Note how a list prints differently.

  • With lists, the individual elements can also be accessed using the dollar ($) notation.
aList$c
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] "abc"
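
To make the "grow or shrink" remark concrete, here is a brief sketch that rebuilds the same aList, contrasts the two bracket forms, and then adds and removes elements. The new element name d is arbitrary.

```r
aList <- list(a = 1, b = 2, c = list(1, 2, "abc"))

# [[ ]] extracts the element itself; [ ] returns a (shorter) list
aList[["a"]]
## [1] 1
class(aList["a"])
## [1] "list"

# Grow the list by assigning to a new name
aList$d <- "new"
length(aList)
## [1] 4

# Shrink it by assigning NULL to an element
aList$b <- NULL
names(aList)
## [1] "a" "c" "d"
```
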
  • R has a notion of a missing value, NA. This is not the same as NULL, which indicates nothing is present.
miss1 <- c(1.0, NA, 2.0)
2 * miss1
## [1]  2 NA  4

For some numerical computations, you have to provide a hint on how to handle missing values. For example, the mean function computes the average of a set of numbers.

## No hint to process missing values
mean(miss1)
## [1] NA
## Remove missing values before processing
mean(miss1, na.rm = TRUE)
## [1] 1.5
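
The distinction between NA and NULL, and the na.rm hint, can be sketched a little further with base R only:

```r
# NA is a placeholder that occupies a slot; NULL is the absence of a value
length(NA)
## [1] 1
length(NULL)
## [1] 0
c(1, NA, 2)
## [1]  1 NA  2
c(1, NULL, 2)
## [1] 1 2

# Other summaries accept the same na.rm hint as mean
miss1 <- c(1.0, NA, 2.0)
sum(miss1)
## [1] NA
sum(miss1, na.rm = TRUE)
## [1] 3
max(miss1, na.rm = TRUE)
## [1] 2
```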

One can check for missingness or nullity using the is.* family of functions.

is.null(c())
## [1] TRUE
is.null(NA)
## [1] FALSE
## This may produce a warning, depending on the R version
is.na(c())
## Warning in is.na(c()): is.na() applied to non-(list or vector) of type
## 'NULL'
## logical(0)
is.na(NA)
## [1] TRUE

There are many others: is.numeric, is.list, is.vector, etc.
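
In practice, is.na is most often applied to a whole vector to locate, count, or drop missing values. A minimal sketch, reusing the miss1 vector from above:

```r
miss1 <- c(1.0, NA, 2.0)

# Element-wise test
is.na(miss1)
## [1] FALSE  TRUE FALSE

# Count the missing values (TRUE counts as 1)
sum(is.na(miss1))
## [1] 1

# Keep only the non-missing values
miss1[!is.na(miss1)]
## [1] 1 2

# Other members of the family test the type of an object
is.numeric(miss1)
## [1] TRUE
is.list(miss1)
## [1] FALSE
```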

Numerics

  • Standard operations. When you perform arithmetic on vectors, the operations happen on all elements.
## Add two vectors
1:3 + 2:4
## [1] 3 5 7
## Multiply a vector by 2
2 * 1:3
## [1] 2 4 6
## Parentheses make the precedence explicit (: binds tighter than *)
2 * (1:3)
## [1] 2 4 6
## Divide
c(2, 4, 6) / c(2, 4, 6)
## [1] 1 1 1
## Halve
c(2, 4, 6) / 2
## [1] 1 2 3
## R recycles the shorter vector to match the length of the longer one
c(2, 4, 6, 8) / c(1, 2)
## [1] 2 2 6 4

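The last operation shows how R recycles the shorter vector so that the two conform in length. It gives no warning here because the longer length is a multiple of the shorter one; when it is not, R still recycles but issues a warning. Such constructs are easy to avoid.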
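
The sketch below contrasts the two recycling cases and shows one way to guard against accidental recycling. safe_divide is a hypothetical helper written for this illustration, not a base R function.

```r
# Lengths 4 and 2: recycled silently
c(2, 4, 6, 8) / c(1, 2)
## [1] 2 2 6 4

# Lengths 3 and 2: still recycled, but R issues a warning
# ("longer object length is not a multiple of shorter object length");
# the warning is suppressed here only to keep the example quiet
res <- suppressWarnings(c(2, 4, 6) / c(1, 2))
res
## [1] 2 2 6

# Hypothetical helper that refuses to recycle unless b has length 1
safe_divide <- function(a, b) {
  stopifnot(length(a) == length(b) || length(b) == 1)
  a / b
}
safe_divide(c(2, 4, 6), 2)
## [1] 1 2 3
```

Calling safe_divide(1:3, 1:2) stops with an error instead of quietly recycling.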

  • The usual comparison operators are available: == for equality, != for not equal to, >= for greater than or equal to, etc.
xx <- 1:3
xx == xx
## [1] TRUE TRUE TRUE
## 1 is expanded to match length of xx
xx > 1
## [1] FALSE  TRUE  TRUE
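
A few more comparisons, together with base R functions that collapse or locate the resulting logical values, are sketched here:

```r
xx <- 1:3

xx != 2
## [1]  TRUE FALSE  TRUE
xx >= 2
## [1] FALSE  TRUE  TRUE

# Collapse a logical vector to a single answer
any(xx > 2)
## [1] TRUE
all(xx > 2)
## [1] FALSE

# Positions where the condition holds
which(xx >= 2)
## [1] 2 3
```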

Matrices

The function matrix can be used for creating matrices, which are two-dimensional arrays.

## Create a 3 by 2 matrix.
m <- matrix(1:6, nrow = 3)
m
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
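
Note that matrix fills column by column by default. As a short sketch, the byrow argument changes the fill order; the name mr is arbitrary.

```r
# byrow = TRUE fills the matrix row by row instead
mr <- matrix(1:6, nrow = 3, byrow = TRUE)
mr
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6

# dim reports the number of rows and columns
dim(mr)
## [1] 3 2
```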

Another way is to bind existing vectors into a matrix.

xx <- 1:3
yy <- 4:6
## Bind by columns
(m2 <- cbind(xx, yy))
##      xx yy
## [1,]  1  4
## [2,]  2  5
## [3,]  3  6
## Bind by rows
(m3 <- rbind(xx, yy))
##    [,1] [,2] [,3]
## xx    1    2    3
## yy    4    5    6

The matrix m2 has the same content as m above, but the columns have names xx and yy which can be used in subsetting/indexing.

## Access element in row 1, column 2
m[1, 2]
## [1] 4
## Access second column
m[ , 2]
## [1] 4 5 6
## Do the same with matrix m2
m2[, "yy"]
## [1] 4 5 6
## Access the third row of m
m[3, ]
## [1] 3 6
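
Extracting a single row or column returns a plain vector, as the output above shows. When the matrix structure should be preserved, the drop argument helps; a minimal sketch using the same m:

```r
m <- matrix(1:6, nrow = 3)

# Default: the result is simplified to a vector, so it has no dimensions
dim(m[3, ])
## NULL

# drop = FALSE keeps the result as a 1 by 2 matrix
m[3, , drop = FALSE]
##      [,1] [,2]
## [1,]    3    6
dim(m[3, , drop = FALSE])
## [1] 1 2
```
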
  • Sparse matrices are provided by specialized packages such as Matrix or slam.

  • R provides a large suite of functions for numerical analysis, optimization etc.
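
As a taste of that suite, the sketch below uses three functions that ship with R: solve for a linear system, optimize for one-dimensional minimization, and uniroot for root finding. The matrix, the functions being minimized and solved, and the intervals are all made up for illustration.

```r
# Solve the linear system A %*% sol = b
A <- matrix(c(2, 1, 1, 3), nrow = 2)
b <- c(3, 5)
sol <- solve(A, b)
# Check the solution up to floating-point tolerance
all.equal(as.vector(A %*% sol), b)
## [1] TRUE

# Minimize (x - 2)^2 on the interval [0, 5]; the minimizer is close to 2
opt <- optimize(function(x) (x - 2)^2, interval = c(0, 5))
opt$minimum

# Find a root of x^2 - 2 on [0, 2]; the result is close to sqrt(2)
root <- uniroot(function(x) x^2 - 2, interval = c(0, 2))$root
root
```

The numerical results of optimize and uniroot are approximate and depend on their tolerance settings, so no exact output is shown for them.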

R Packages

CRAN hosts over 12K packages.

For instance, several packages on CRAN relate to differential privacy.

Designed by and for the community of differential privacy algorithm developers. It can be used to empirically evaluate and visualize Cumulative Distribution Functions incorporating noise that satisfies differential privacy, with numerous options made to streamline collection of utility measurements across variations of key parameters, such as epsilon, domain size, sample size, data shape, etc. Developed by researchers at Harvard PSI.

PrivateLR implements two differentially private algorithms for estimating L2-regularized logistic regression coefficients.

An implementation of major general-purpose mechanisms for privatizing statistics, models, and machine learners, within the framework of differential privacy.
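
To try a package, it must be installed once and then loaded in each session. The following is a minimal sketch, assuming the package named here is still available on CRAN; the variables pkg and loaded are introduced only for this example, and the installation step is left as a comment because it needs network access.

```r
pkg <- "PrivateLR"  # named in this section; substitute any CRAN package name

if (requireNamespace(pkg, quietly = TRUE)) {
  # Attach the package so its exported functions can be called directly
  library(pkg, character.only = TRUE)
  loaded <- TRUE
} else {
  # One-time installation from CRAN would be: install.packages(pkg)
  message("Package '", pkg, "' is not installed.")
  loaded <- FALSE
}
```

Checking with requireNamespace first avoids an error when the package is missing, which is convenient in scripts run in batch mode.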