Acknowledgment: the materials below are initilated by Yichen Qin and modified by Tianhai Zu for teaching purpose.
Many software choices for linear regression analysis: Excel, Minitab, SPSS, SAS, and R. However, no software is a substitute for creative thinking. For this course, we exclusively use R.
R is both a computing language and an environment for statistical computing and statistical graphics. R is available, at no cost, for Windows, Linux, Unix and MacOS. R was developed at AT&T Bell Laboratories. The initial version of R was developed by Ross Ihaka and Robert Gentleman, both from the University of Auckland. The R website (http://www.r-project.org/) contains a significant amount of information about the R software.
To download and install R, please follow:
From the R web page (http://www.r-project.org/), click the CRAN link on the left side of the page.
Select an appropriate mirror for your location (any U.S.A. location is fine)
Click the appropriate operating system.
Click base (for the base R software).
Click the link Download R x.x.x, excute the file and follow the instructions.
Rstudio is a great integrated development environment (IDE) for R.
It can be downloaded at https://www.rstudio.com/.
We highly recommend it.
Advantages of R
R has extensive and powerful graphics abilities
The R system is developing rapidly. New features and abilities appear every few months. New R packages are submitted to CRAN everyday: https://cran.r-project.org/web/packages/available_packages_by_date.html
Simple calculations and analyses can be handled straightforwardly. There is a language core that uses standard forms of algebraic notation, allowing the calculations such as 2+3, or 3^11.
R is an “open source” system. Source-code is available for inspection, or for adaptation to other systems.
R is free!
R is a functional language. The structure of an R program is similar to C, JAVA or MatLab.
Expressions are typed following the prompt (>) on the screen. The result appears on subsequent lines
2+2 # Addition
## [1] 4
sqrt(10) # Square root of
## [1] 3.162278
Additionl operators includes the following:
Operator | Description |
---|---|
+ |
addition |
- |
subtraction |
* |
multiplication |
/ |
division |
^ or ** |
exponentiation |
All R entities, almost everything, including functions and data structures, exist as objects. R objects can all be operated on as data.
Type in ls()
to see the names of all objects in your workspace. In the following, we will learn four kinds of objects in R.
Vectors
Matrices
Data Frames
Lists
Three types of vectos:
x <- c(2,3,5,2,7,1) # Numeric
x
## [1] 2 3 5 2 7 1
y <- c("Canberra","Sydney","Newcastle","Darwin") # Character
y
## [1] "Canberra" "Sydney" "Newcastle" "Darwin"
z <- c(TRUE,FALSE,FALSE,FALSE) # Logical
z
## [1] TRUE FALSE FALSE FALSE
Tips: The c(...)
an acronym for “concatenate”. The symbol <-
means “assigning to”.
What if I want to learn more about the R function c()
? To get on-line, we can use ?
, for example:
?class
??class
How to concatenate existing vectors?
x1 <- c(2,3,5,2,7,1)
x2 <- c(0,0,0)
x_join <- c(x1, x2)
x_join
## [1] 2 3 5 2 7 1 0 0 0
Tips: Use _
, such as x_join
, in variable names instead of .
, such as x.join
, because .
sometimes means different things in some object-oriented programming languages.
To extract elements of vectors, we can:
Specify the indices of the elements that are to be extracted, e.g.
x <- c(3,11,8,15,12) # Assign to x the values 3, 11, 8, 15, 12
x[c(2,4)] # Extract elements (rows) 2 and 4
## [1] 11 15
Or specify a vector of logical values.
x[x>10]
## [1] 11 15 12
An advantage of R is that it allows flexible operations on vectors. Try the following code and see how it works:
x_join+1
## [1] 3 4 6 3 8 2 1 1 1
x_join*2
## [1] 4 6 10 4 14 2 0 0 0
log(x_join)
## [1] 0.6931472 1.0986123 1.6094379 0.6931472 1.9459101 0.0000000 -Inf
## [8] -Inf -Inf
sum(x_join)
## [1] 20
mean(x_join)
## [1] 2.222222
Suppose you are given the following vector:
vv <- rnorm(1000)
where rnorm(1000)
means generate 1000 copies of random variables that follow standard normal distribution with mean zero and variance one, i.e., \(N(0,1)\).
How to summarize the information of vector?
vv[1:20] # Get a sense of the data
## [1] 0.31305736 0.25876387 -0.90024451 0.08902754 -0.18750180 -0.38349161
## [7] 2.45937581 -1.29006418 0.05432386 0.27610338 0.84463287 -0.18307603
## [13] -0.62665444 -0.20292712 -0.69857687 1.88796299 -0.84485625 -0.17561151
## [19] -1.20074211 0.24606097
class(vv) # Which class the vector belongs to
## [1] "numeric"
length(vv) # What is the size of the vector
## [1] 1000
summary(vv) # Numerical summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.13345 -0.74195 -0.08150 -0.04956 0.62027 3.60127
hist(vv) # Graphical summary
Create a sequence with a fixed increment, seq(from=..., to=..., by=..., length.out=...)
.
seq(1,10,by=1)
## [1] 1 2 3 4 5 6 7 8 9 10
seq(1,10,length.out=10)
## [1] 1 2 3 4 5 6 7 8 9 10
1:10 # With an increment of 1
## [1] 1 2 3 4 5 6 7 8 9 10
seq(1,10,length.out=9)
## [1] 1.000 2.125 3.250 4.375 5.500 6.625 7.750 8.875 10.000
Create a vector containing elements with the same value rep(x, times=..., each=...)
.
rep(2,10)
## [1] 2 2 2 2 2 2 2 2 2 2
rep(1:4, times=2)
## [1] 1 2 3 4 1 2 3 4
rep(1:4, each=2)
## [1] 1 1 2 2 3 3 4 4
In R, missing values are represented by “NA” (meaning Not Available). Any arithmetic operation that involves NA generates an NA.
y <- c(1, NA, 3, 0, NA)
mean(y)
## [1] NA
mean(y, na.rm=T)
## [1] 1.333333
is.na(y)
## [1] FALSE TRUE FALSE FALSE TRUE
To create a matrix, we use matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
M <- matrix(1:12,nrow=3,ncol=4,byrow=T)
M
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
M <- matrix(1:12,nrow=3,ncol=4)
Retrieve the elements of M
.
M[2,3]
## [1] 8
M[2,]
## [1] 2 5 8 11
M[,3]
## [1] 7 8 9
To obtain the means for each column of the matrix M
, we use apply(X, MARGIN, FUN, ...)
X
: an array, including a matrix.
MARGIN
: a vector giving the subscripts which the function will be applied over. E.g., for a matrix, 1 indicates rows, 2 indicates columns.
FUN
: the function to be applied
apply(M,2,mean)
## [1] 2 5 8 11
apply(M,1,mean)
## [1] 5.5 6.5 7.5
apply(M,2,sum)
## [1] 6 15 24 33
apply(M,1,sum)
## [1] 22 26 30
Try the following commands and figure out their functionality.
dim(M) # dimension
## [1] 3 4
t(M) # transpose
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
## [4,] 10 11 12
sum(M) # summation
## [1] 78
as.vector(M) # vectorize
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
Data frames are fundamental to the use of the R modeling and graphics functions. A data frame is a generalization of a matrix, in which different columns may have different modes. All elements of any column must however have the same mode, i.e. all numeric or all factor, or all character.
The data set clinic
consists of two variables, type
and score
. type
refers to what patients take. score
is a kind of health score of patients.
type | score |
---|---|
drug | 8 |
drug | 10 |
placebo | 5 |
drug | 9 |
Step 1. Input the data set. Label type
and score
as drug or placebo and health score respectively.
Step 2. Calculate the means of health score for patients taking drug and for patients taking placebo respectively.
type <- c(rep("drug",2),"placebo",rep("drug",2),rep("placebo",3))
score <- c(8,10,5,9,9,7,6,6)
clinic <- data.frame(type,score)
clinic
## type score
## 1 drug 8
## 2 drug 10
## 3 placebo 5
## 4 drug 9
## 5 drug 9
## 6 placebo 7
## 7 placebo 6
## 8 placebo 6
Step 2. Calculate the means of health score for patients taking drug and for patients taking placebo respectively.
mean(score[type=="drug"])
## [1] 9
mean(score[type=="placebo"])
## [1] 6
Select columns or rows
clinic[,2]
## [1] 8 10 5 9 9 7 6 6
clinic[,"score"]
## [1] 8 10 5 9 9 7 6 6
clinic[type=="drug",]
## type score
## 1 drug 8
## 2 drug 10
## 4 drug 9
## 5 drug 9
logical operators
Operator | Description |
---|---|
< |
less than |
<= |
less than or equal to |
> |
greater than |
>= |
greater than or equal to |
== |
exactly equal to |
!= |
not equal to |
!x |
not x |
x|y |
x or y |
x&y |
x and y |
An R list is an object consisting of an ordered collection of objects known as its components. There is no particular need for the components to be of the same mode or type. For example, a list could consist of a numeric vector, a logical value, a matrix, a complex vector, a character array, a function, and so on.
Here is a simple example of how to make a list:
lst <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9))
lst
## $name
## [1] "Fred"
##
## $wife
## [1] "Mary"
##
## $no.children
## [1] 3
##
## $child.ages
## [1] 4 7 9
length(lst)
## [1] 4
lst$child.ages
## [1] 4 7 9
lst$child.ages[2]
## [1] 7
class(lst$child.ages)
## [1] "numeric"
class(lst$child.ages[2])
## [1] "numeric"
Syntax: ff<-function(x) {xxx; return(y)}
A function is created using an assignment. On the right hand side, the parameters appear within round brackets. You can, if you wish, give a default. Following the closing “)” the function body appears. Except where the function body consists of just one statement, this is enclosed between curly braces { }
. The return value usually appears on the final line of the function body. It is recommended to explicitly write a “return” function.
Define a function that calculate the mean and standard deviation of a sample
mean.and.sd <- function(x){
av <- mean(x)
sd <- sqrt(var(x))
return(c(mean=av, SD=sd))
}
x1<-rnorm(100)
mean.and.sd(x1)
## mean SD
## 0.07342937 1.13515776
x2<-rnorm(10000)
mean.and.sd(x2)
## mean SD
## 0.004624308 1.017048717
The R has available a conditional construction of the form if (expr_1) expr_2 else expr_3
where expr_1
must evaluate to a single logical value.
Example:
x <- rnorm(10)
if(sum(x)>0)
{
print("Positive")
} else {
print("negative")
}
## [1] "negative"
Another frequently used control statement is for
loop:
for(i in 1:10)
{
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
x <- rnorm(1000)
x.pos <- rep(NA, length(x))
for(j in 1:length(x))
{
if(x[j]<0)
{
x.pos[j]<-0
} else {
x.pos[j]<-x[j]
}
}
hist(x)
hist(x.pos)
To plot a histogram, we use
delivery <- read.csv("data_delivery.csv",h=T)
hist(delivery$DeliveryTime, xlim=c(0,200), breaks=30, xlab="The range of X", main="The histogram of X")
In the code above, breaks=
specifices the number of bins, xlim=c(a,b)
or ylim=c(a,b)
specify the range of x or y axis to [a,b], xlab="ZZZ"
or ylab="ZZZ"
specify the label attached to the x or y axis as ZZZ, main="ZZZ"
specify the title attached to the plot as ZZZ.
We first create a dataset by simulating 200 observations from the following linear model \(y = \beta_0 + \beta_1 * x + \epsilon\) where \(\beta_0=1\), \(\beta_1=2\), \(x ~ N(1, 4)\), \(\epsilon ~ N(0,1)\). Then we visualize the data in a scatter plot.
x <- rnorm(200,mean=1,sd=2)
noise <- rnorm(200)
y <- 1+2*x+noise
plot(x,y,pch=2,col="blue")
In the code above, pch=
specifies the type of symbol for each data point, col="red"
specifies the color of the dots.
We can also plot multiple figures at one once to save space.
par(mfrow=c(1,3))
plot(x,y,pch=20)
plot(x,y,pch=2)
plot(x,y,col="red")
Other features in plottings.
plot(x,y)
points(-x+1, y, pch=2,col="red")
abline(h=0,lty=2)
lines(x=c(-4,6),y=c(-10,5))
legend(-4,5,legend=c("y~x","y~x+1"),pch=c(1,2),col=c("black","red"))
Tips and tricks: The function points(...)
adds points to an existing plot. The functions abline(...)
and lines()
adds lines to an existing plot The function legend(...)
attach a legend to an existing plot.
We use lm(response ~ var1+var2+...+vark, data=...)
to perform linear regression analysis.
Note: lm()
stands for linear model.
model = lm(y~x)
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.50590 -0.67319 -0.06679 0.62364 2.40528
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.09091 0.07580 14.39 <2e-16 ***
## x 1.98932 0.03367 59.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9347 on 198 degrees of freedom
## Multiple R-squared: 0.9463, Adjusted R-squared: 0.9461
## F-statistic: 3492 on 1 and 198 DF, p-value: < 2.2e-16
model$coef
## (Intercept) x
## 1.090906 1.989316
plot(x,y,pch=20)
abline(a=model$coef[1], b=model$coef[2], lwd=4, col="blue")
For the object x_join
Delete the 2nd and 4th elements.
Count how many 0 in x_join
.
x_join
x_join[-c(2,4)]
sum(x_join==0)
length(which(x_join==0))
Use two R functions sum()
and length()
to calculate the variance of samples in the R object x_join
.
Check if your result is the same as var(x_join)
.
sum((x_join-sum(x_join)/length(x_join))^2)/(length(x_join)-1)
var(x_join)
Divide the vector vv
into two vectors:
vv1
contains those with positive values
vv2
contains those with non-positive values
Summary the vector vv1.
vv1 <- vv[vv>0]
length(vv1)
summary(vv1)
hist(vv1)
Use the least lines of commands to generate the following matrix
## celsius fahrenheit
## [1,] 25 77.0
## [2,] 26 78.8
## [3,] 27 80.6
## [4,] 28 82.4
## [5,] 29 84.2
## [6,] 30 86.0
uu<-matrix(NA,nrow=6,ncol=2)
uu[,1]<-25:30
uu[,2]<-9/5*uu[,1]+32
colnames(uu)<-c("celsius","fahrenheit")
celsius <- 25:30
fahrenheit <- 9/5*celsius+32
cbind(celsius, fahrenheit)
Suppose M
is a matrix. You are asked to write your own function to realize the same functionality of the following command: apply(M,2,mean)
. Note: You are only allowed to use one existing R function dim()
. Do not call other functions.
col.means<-function(M){ # Input: a matrix; Output: Column-wise means
if(class(M)=="matrix"){
n.row<-dim(M)[1]; n.col<-dim(M)[2]
means<-rep(NA,n.col)
for(i in 1:n.col){
summation<-0
for(j in 1:n.row) summation<-summation+M[j,i]
means[i]<-summation/n.row
}
return(means)
} else print("The input is not a matrix!")
}
M<-matrix(1:12,nrow=3)
col.means(M)
## [1] 2 5 8 11
apply(M,2,mean)
## [1] 2 5 8 11
col.means(2)
## [1] "The input is not a matrix!"
Create a data set student
in R as follows
## ID name sex math music
## 1 02 Mark M 78 98
## 2 12 Bill M 89 NA
## 3 23 Cathy F 93 79
Create a data set stu.bio
without any score, using the data set you created in Step 1.
Create a data set student.m
by selecting observations with male students in the data set student
.
Create a new variable GOOD in such a way that if the math score of a student is greater or equal to 90, put YES and otherwise put NO. Append the new variable to the data set student
.
ID<-c("02","12","23")
name<-c("Mark","Bill","Cathy")
sex<-c("M","M","F")
math<-c(78,89,93)
music<-c(98,NA,79)
student<-data.frame(ID,name,sex,math,music)
stu.bio<-student[,1:3]
student.m<-student[sex=="M",]
GOOD<-rep("YES",length(math))
GOOD[math<90]<-"NO"
student<-data.frame(student,GOOD)
Comments in R
Anything that follows a
#
on the command line is taken as comment and ignored by R.