Acknowledgment: the materials below are initilated by Yichen Qin and modified by Tianhai Zu for teaching purpose.

Introduction to R

Many software choices for linear regression analysis: Excel, Minitab, SPSS, SAS, and R. However, no software is a substitute for creative thinking. For this course, we exclusively use R.

R is both a computing language and an environment for statistical computing and statistical graphics. R is available, at no cost, for Windows, Linux, Unix and MacOS. R was developed at AT&T Bell Laboratories. The initial version of R was developed by Ross Ihaka and Robert Gentleman, both from the University of Auckland. The R website (http://www.r-project.org/) contains a significant amount of information about the R software.

To download and install R, please follow:

Rstudio is a great integrated development environment (IDE) for R.

Advantages of R

R as a calculator

Expressions are typed following the prompt (>) on the screen. The result appears on subsequent lines

2+2  # Addition
## [1] 4
sqrt(10)  # Square root of 
## [1] 3.162278

Additionl operators includes the following:

Operator Description
+ addition
- subtraction
* multiplication
/ division
^ or ** exponentiation

R objects

All R entities, almost everything, including functions and data structures, exist as objects. R objects can all be operated on as data.

Type in ls() to see the names of all objects in your workspace. In the following, we will learn four kinds of objects in R.

Vectors

Three types of vectos:

x <- c(2,3,5,2,7,1) # Numeric
x
## [1] 2 3 5 2 7 1
y <- c("Canberra","Sydney","Newcastle","Darwin") # Character
y
## [1] "Canberra"  "Sydney"    "Newcastle" "Darwin"
z <- c(TRUE,FALSE,FALSE,FALSE) # Logical 
z
## [1]  TRUE FALSE FALSE FALSE

Tips: The c(...) an acronym for “concatenate”. The symbol <- means “assigning to”.

What if I want to learn more about the R function c()? To get on-line, we can use ?, for example:

?class
??class

How to concatenate existing vectors?

x1 <- c(2,3,5,2,7,1)
x2 <- c(0,0,0)
x_join <- c(x1, x2)
x_join
## [1] 2 3 5 2 7 1 0 0 0

Tips: Use _, such as x_join, in variable names instead of ., such as x.join, because . sometimes means different things in some object-oriented programming languages.

Subsetting

To extract elements of vectors, we can:

Specify the indices of the elements that are to be extracted, e.g.

x <- c(3,11,8,15,12) # Assign to x the values 3, 11, 8, 15, 12
x[c(2,4)] # Extract elements (rows) 2 and 4
## [1] 11 15

Or specify a vector of logical values.

x[x>10]
## [1] 11 15 12

Other Operations on Vectors

An advantage of R is that it allows flexible operations on vectors. Try the following code and see how it works:

x_join+1
## [1] 3 4 6 3 8 2 1 1 1
x_join*2
## [1]  4  6 10  4 14  2  0  0  0
log(x_join)
## [1] 0.6931472 1.0986123 1.6094379 0.6931472 1.9459101 0.0000000      -Inf
## [8]      -Inf      -Inf
sum(x_join)
## [1] 20
mean(x_join)
## [1] 2.222222

Summarize Vectors

Suppose you are given the following vector:

vv <- rnorm(1000)

where rnorm(1000) means generate 1000 copies of random variables that follow standard normal distribution with mean zero and variance one, i.e., \(N(0,1)\).

How to summarize the information of vector?

vv[1:20]    # Get a sense of the data
##  [1]  0.31305736  0.25876387 -0.90024451  0.08902754 -0.18750180 -0.38349161
##  [7]  2.45937581 -1.29006418  0.05432386  0.27610338  0.84463287 -0.18307603
## [13] -0.62665444 -0.20292712 -0.69857687  1.88796299 -0.84485625 -0.17561151
## [19] -1.20074211  0.24606097
class(vv)   # Which class the vector belongs to
## [1] "numeric"
length(vv)  # What is the size of the vector
## [1] 1000
summary(vv) # Numerical summary
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.13345 -0.74195 -0.08150 -0.04956  0.62027  3.60127
hist(vv)    # Graphical summary

Other Ways to Create Vectors in R

Create a sequence with a fixed increment, seq(from=..., to=..., by=..., length.out=...).

seq(1,10,by=1)
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(1,10,length.out=10)
##  [1]  1  2  3  4  5  6  7  8  9 10
1:10  # With an increment of 1
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(1,10,length.out=9)
## [1]  1.000  2.125  3.250  4.375  5.500  6.625  7.750  8.875 10.000

Create a vector containing elements with the same value rep(x, times=..., each=...).

rep(2,10)
##  [1] 2 2 2 2 2 2 2 2 2 2
rep(1:4, times=2)
## [1] 1 2 3 4 1 2 3 4
rep(1:4, each=2) 
## [1] 1 1 2 2 3 3 4 4

Missing Values in R

In R, missing values are represented by “NA” (meaning Not Available). Any arithmetic operation that involves NA generates an NA.

y <- c(1, NA, 3, 0, NA)
mean(y)
## [1] NA
mean(y, na.rm=T)
## [1] 1.333333
is.na(y)
## [1] FALSE  TRUE FALSE FALSE  TRUE

Comments in R

Anything that follows a # on the command line is taken as comment and ignored by R.

a=1
b=2
#a=3
d=a+b
d
## [1] 3

Matrices

To create a matrix, we use matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)

M <- matrix(1:12,nrow=3,ncol=4,byrow=T)
M
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
M <- matrix(1:12,nrow=3,ncol=4)

Subsetting

Retrieve the elements of M.

M[2,3]
## [1] 8
M[2,]
## [1]  2  5  8 11
M[,3]
## [1] 7 8 9

Column and Row Operations

To obtain the means for each column of the matrix M, we use apply(X, MARGIN, FUN, ...)

X: an array, including a matrix.

MARGIN: a vector giving the subscripts which the function will be applied over. E.g., for a matrix, 1 indicates rows, 2 indicates columns.

FUN: the function to be applied

apply(M,2,mean)
## [1]  2  5  8 11
apply(M,1,mean)
## [1] 5.5 6.5 7.5
apply(M,2,sum)
## [1]  6 15 24 33
apply(M,1,sum)
## [1] 22 26 30

Other operations on Matrices

Try the following commands and figure out their functionality.

dim(M) # dimension
## [1] 3 4
t(M) # transpose
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
## [4,]   10   11   12
sum(M) # summation
## [1] 78
as.vector(M) # vectorize
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12

Data Frames

Data frames are fundamental to the use of the R modeling and graphics functions. A data frame is a generalization of a matrix, in which different columns may have different modes. All elements of any column must however have the same mode, i.e. all numeric or all factor, or all character.

The data set clinic consists of two variables, type and score. type refers to what patients take. score is a kind of health score of patients.

type score
drug 8
drug 10
placebo 5
drug 9

Step 1. Input the data set. Label type and score as drug or placebo and health score respectively.

Step 2. Calculate the means of health score for patients taking drug and for patients taking placebo respectively.

type <- c(rep("drug",2),"placebo",rep("drug",2),rep("placebo",3))
score <- c(8,10,5,9,9,7,6,6)
clinic <- data.frame(type,score)
clinic
##      type score
## 1    drug     8
## 2    drug    10
## 3 placebo     5
## 4    drug     9
## 5    drug     9
## 6 placebo     7
## 7 placebo     6
## 8 placebo     6

Step 2. Calculate the means of health score for patients taking drug and for patients taking placebo respectively.

mean(score[type=="drug"])
## [1] 9
mean(score[type=="placebo"])
## [1] 6

Select columns or rows

clinic[,2]
## [1]  8 10  5  9  9  7  6  6
clinic[,"score"]
## [1]  8 10  5  9  9  7  6  6
clinic[type=="drug",]
##   type score
## 1 drug     8
## 2 drug    10
## 4 drug     9
## 5 drug     9

logical operators

Operator Description
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== exactly equal to
!= not equal to
!x not x
x|y x or y
x&y x and y

Lists

An R list is an object consisting of an ordered collection of objects known as its components. There is no particular need for the components to be of the same mode or type. For example, a list could consist of a numeric vector, a logical value, a matrix, a complex vector, a character array, a function, and so on.

Here is a simple example of how to make a list:

lst <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9))
lst
## $name
## [1] "Fred"
## 
## $wife
## [1] "Mary"
## 
## $no.children
## [1] 3
## 
## $child.ages
## [1] 4 7 9
length(lst)
## [1] 4
lst$child.ages
## [1] 4 7 9
lst$child.ages[2]
## [1] 7
class(lst$child.ages)
## [1] "numeric"
class(lst$child.ages[2])
## [1] "numeric"

Functions

Syntax: ff<-function(x) {xxx; return(y)}

A function is created using an assignment. On the right hand side, the parameters appear within round brackets. You can, if you wish, give a default. Following the closing “)” the function body appears. Except where the function body consists of just one statement, this is enclosed between curly braces { }. The return value usually appears on the final line of the function body. It is recommended to explicitly write a “return” function.

Define a function that calculate the mean and standard deviation of a sample

mean.and.sd <- function(x){
 av <- mean(x)
 sd <- sqrt(var(x))
 return(c(mean=av, SD=sd))
 }
x1<-rnorm(100)
mean.and.sd(x1)
##       mean         SD 
## 0.07342937 1.13515776
x2<-rnorm(10000)
mean.and.sd(x2)
##        mean          SD 
## 0.004624308 1.017048717

Controls Statements

The R has available a conditional construction of the form if (expr_1) expr_2 else expr_3 where expr_1 must evaluate to a single logical value.

Example:

x <- rnorm(10)
if(sum(x)>0) 
{
    print("Positive")
}  else  {
    print("negative")
}
## [1] "negative"

Another frequently used control statement is for loop:

for(i in 1:10)
{
  print(i)
} 
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
x <- rnorm(1000)
x.pos <- rep(NA, length(x))
for(j in 1:length(x))
{
  if(x[j]<0) 
  {
    x.pos[j]<-0
  } else {
    x.pos[j]<-x[j]
  }
}
hist(x)

hist(x.pos)

Histograms

To plot a histogram, we use

delivery <- read.csv("data_delivery.csv",h=T)
hist(delivery$DeliveryTime, xlim=c(0,200), breaks=30, xlab="The range of X", main="The histogram of X")

In the code above, breaks= specifices the number of bins, xlim=c(a,b) or ylim=c(a,b) specify the range of x or y axis to [a,b], xlab="ZZZ" or ylab="ZZZ" specify the label attached to the x or y axis as ZZZ, main="ZZZ" specify the title attached to the plot as ZZZ.

Scatter Plots

We first create a dataset by simulating 200 observations from the following linear model \(y = \beta_0 + \beta_1 * x + \epsilon\) where \(\beta_0=1\), \(\beta_1=2\), \(x ~ N(1, 4)\), \(\epsilon ~ N(0,1)\). Then we visualize the data in a scatter plot.

x <- rnorm(200,mean=1,sd=2)
noise <- rnorm(200)
y <- 1+2*x+noise
plot(x,y,pch=2,col="blue")

In the code above, pch= specifies the type of symbol for each data point, col="red" specifies the color of the dots.

We can also plot multiple figures at one once to save space.

par(mfrow=c(1,3))
plot(x,y,pch=20)
plot(x,y,pch=2)
plot(x,y,col="red")

Other features in plottings.

plot(x,y)
points(-x+1, y, pch=2,col="red")
abline(h=0,lty=2)
lines(x=c(-4,6),y=c(-10,5))
legend(-4,5,legend=c("y~x","y~x+1"),pch=c(1,2),col=c("black","red")) 

Tips and tricks: The function points(...) adds points to an existing plot. The functions abline(...) and lines() adds lines to an existing plot The function legend(...) attach a legend to an existing plot.

Linear Regression in R

We use lm(response ~ var1+var2+...+vark, data=...) to perform linear regression analysis.

Note: lm() stands for linear model.

model = lm(y~x) 
summary(model)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.50590 -0.67319 -0.06679  0.62364  2.40528 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.09091    0.07580   14.39   <2e-16 ***
## x            1.98932    0.03367   59.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9347 on 198 degrees of freedom
## Multiple R-squared:  0.9463, Adjusted R-squared:  0.9461 
## F-statistic:  3492 on 1 and 198 DF,  p-value: < 2.2e-16
model$coef
## (Intercept)           x 
##    1.090906    1.989316
plot(x,y,pch=20) 
abline(a=model$coef[1], b=model$coef[2], lwd=4, col="blue")

Exercises

Excercie 1

For the object x_join

  • Delete the 2nd and 4th elements.

  • Count how many 0 in x_join.

x_join
x_join[-c(2,4)]
sum(x_join==0)
length(which(x_join==0))

Excercie 2

Use two R functions sum() and length() to calculate the variance of samples in the R object x_join.

Check if your result is the same as var(x_join).

sum((x_join-sum(x_join)/length(x_join))^2)/(length(x_join)-1)
var(x_join)

Excercie 3

Divide the vector vv into two vectors:

  • vv1 contains those with positive values

  • vv2 contains those with non-positive values

Summary the vector vv1.

vv1 <- vv[vv>0]
length(vv1)
summary(vv1)
hist(vv1)

Excercie 4

Use the least lines of commands to generate the following matrix

##      celsius fahrenheit
## [1,]      25       77.0
## [2,]      26       78.8
## [3,]      27       80.6
## [4,]      28       82.4
## [5,]      29       84.2
## [6,]      30       86.0
uu<-matrix(NA,nrow=6,ncol=2)
uu[,1]<-25:30
uu[,2]<-9/5*uu[,1]+32
colnames(uu)<-c("celsius","fahrenheit")

celsius <- 25:30
fahrenheit <- 9/5*celsius+32
cbind(celsius, fahrenheit)

Excercie 5

Suppose M is a matrix. You are asked to write your own function to realize the same functionality of the following command: apply(M,2,mean). Note: You are only allowed to use one existing R function dim(). Do not call other functions.

col.means<-function(M){ # Input: a matrix; Output: Column-wise means
    if(class(M)=="matrix"){
       n.row<-dim(M)[1]; n.col<-dim(M)[2]
       means<-rep(NA,n.col)
       for(i in 1:n.col){
          summation<-0
          for(j in 1:n.row) summation<-summation+M[j,i]
          means[i]<-summation/n.row
       }
       return(means)
    } else print("The input is not a matrix!")
}
M<-matrix(1:12,nrow=3)
col.means(M)
## [1]  2  5  8 11
apply(M,2,mean)
## [1]  2  5  8 11
col.means(2)
## [1] "The input is not a matrix!"

Excercie 6

Create a data set student in R as follows

##   ID  name sex math music
## 1 02  Mark   M   78    98
## 2 12  Bill   M   89    NA
## 3 23 Cathy   F   93    79

Create a data set stu.bio without any score, using the data set you created in Step 1.

Create a data set student.m by selecting observations with male students in the data set student.

Create a new variable GOOD in such a way that if the math score of a student is greater or equal to 90, put YES and otherwise put NO. Append the new variable to the data set student.

ID<-c("02","12","23")
name<-c("Mark","Bill","Cathy")
sex<-c("M","M","F")
math<-c(78,89,93)
music<-c(98,NA,79)
student<-data.frame(ID,name,sex,math,music)
stu.bio<-student[,1:3]
student.m<-student[sex=="M",]
GOOD<-rep("YES",length(math))
GOOD[math<90]<-"NO"
student<-data.frame(student,GOOD)