Computation Boot Camp

Patrick Cahan

Objectives

  1. Nudge you towards proficiency and comfort with big data analysis
  2. Develop culture of 'soups-to-nuts' reproducibility in our analytics

Course website: http://cahanlab.org/intra/training/bootcampJune2016/

Course schedule

  1. Monday: Intro to R and making simple, pretty presentations
  2. Tuesday: Data wrangling
  3. Wednesday: Unsupervised analysis
  4. Thursday: Supervised analsyis
  5. Friday: wrap up

Daily schedule

  • Morning: Background to topic, tutorials
  • Break: lunch, do what you need to do
  • Afternoon: Work on assignment (individually) that you will present the following morning

Inspiration and resources for this course

What is R?

  1. R is a language and environment for statistical computing and graphics
  2. There are hundreds of user-contributed packages
  3. Interpreted, not compiled

Cran: https://cran.r-project.org/

Launch R from your Terminal

Meet your new friend, Terminal

Basic interactions with the shell

List the contexts of your current working directory

ls -laht

Make a new directory, or folder, and 'go there':

mkdir ~/bootcamp2016/
cd ~/bootcamp2016
mkdir day1
cd day1

Now launch R

R

How I interact with R

  1. Evernote for writing code, then cut and paste into terminal.
  2. Text editor (Sublime 2) for writing functions

How this all works

In slides, a command (we'll also call them code or a code chunk) will look like this

print("I'm code")
## [1] "I'm code"

And then directly after it, will be the output of the code. So print("I'm code") is the code chunk and [1] "I'm code" is the output.

R is a calculator

2 + (2 * 3)^2
## [1] 38
(1 + 3) / 2 + 45
## [1] 47

Sequences and functions

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10

1:10 is equivalent to the seq function

seq(from=1,to=10,by=1)
##  [1]  1  2  3  4  5  6  7  8  9 10

Help

To learn more about a function, use help

?seq

Variables

Most of the time you want to capture the results of a computation. Variables

  1. Create variables from within the R environment and from files on your computer
  2. Use "=" or "<-" to assign values to a variable name
  3. Case-sensitive
x <- rnorm(1e4, mean=0, sd=2) # result does not get sent to output

R's listing function, ls:

ls()
## [1] "x"

Variables and a bit of plotting

Things can be done with variables

mean(x);
## [1] 0.03374283
hist(x);

plot of chunk unnamed-chunk-12

Variable classes

  • Classes define the structure of a variable, and the set of operations or functions that can performed on them
  • vector and matrix are classes
  • data.frame is one of the most frequently used classes of R variables
  • data.frame is like a spreadsheet. Rows are entities and columns are variables characterizing that entity.
  • data.frames are somewhat advanced objects in R

  • Here we introduce "1 dimensional" classes; these are often referred to as 'vectors'

  • Vectors can have multiple sets of observations, but each observation has to be the same class

class(x)
## [1] "numeric"
y = "mistakenly, embryomics, smoogy boogy"
print(y)
## [1] "mistakenly, embryomics, smoogy boogy"
class(y)
## [1] "character"

combine (C)

The function c() collects/combines/joins single R objects into a vector of R objects. It is mostly used for creating vectors of numbers, character strings, and other data types.

x <- c(1, 4, 6, 8)
x
## [1] 1 4 6 8
class(x)
## [1] "numeric"

Length ...

length(): Get or set the length of vectors (including lists) and factors, and of any other R object for which a method has been defined.

length(x)
## [1] 4
y
## [1] "mistakenly, embryomics, smoogy boogy"
length(y)
## [1] 1

Operating on vectors

x <- rnorm(1e4, mean=0, sd=2)
hist(x)

plot of chunk unnamed-chunk-20

x <- x + 10 
hist(x)

plot of chunk unnamed-chunk-21

Let's try this out...

Download data from http://www.cahanlab.org/intra/training/bootcampJune2016/misc/character-deaths.csv



Load the data, modify path accordingly...

survdata<-read.csv("../../resources/character-deaths.csv", as.is=TRUE)
dim(survdata)
## [1] 917  13
colnames(survdata)
##  [1] "Name"               "Allegiances"        "Death.Year"        
##  [4] "Book.of.Death"      "Death.Chapter"      "Book.Intro.Chapter"
##  [7] "Gender"             "Nobility"           "GoT"               
## [10] "CoK"                "SoS"                "FfC"               
## [13] "DwD"

ggplots2: http://ggplot2.org/

Use ggplot2 to make nice graphics

Install the package if you don't already have it...

install.packages("ggplot2")



Load the library

library(ggplot2)

take a look at the data

How many characters per house?

ggplot(survdata, aes(x="Allegiances")) + geom_bar(stat="bin") + theme_bw()

plot of chunk unnamed-chunk-25

take a look at the data -- Take II

How many characters per house?

ggplot(survdata, aes(x=Allegiances)) + geom_bar(stat="bin") + theme_bw()

plot of chunk unnamed-chunk-26

take a look at the data -- Take III

How many characters per house?

ggplot(survdata, aes(x=Allegiances)) + geom_bar(stat="bin") + theme_bw() + theme(text = element_text(size=8), axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) +
    theme(axis.title.x = element_blank())+ ylab("") + xlab("") + coord_flip()

plot of chunk unnamed-chunk-27

Dead/Alive

length(which(is.na(survdata$Death.Year)))
## [1] 612
length(which(!is.na(survdata$Death.Year)))
## [1] 305

Dead/Alive

isDead<-rep("Dead", nrow(survdata))
isDead[ which( is.na(survdata$Death.Year)) ] <-"Alive"
survdata<-cbind(survdata, isDead=isDead)
ggplot(survdata, aes(x=Allegiances, fill=as.factor(isDead))) + geom_bar(stat="bin") + theme_bw() + theme(text = element_text(size=8), axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) +
    theme(axis.title.x = element_blank())+ ylab("") + xlab("") + coord_flip()

plot of chunk unnamed-chunk-29

When do most characters die?

ggplot(survdata, aes(x=Death.Chapter)) + geom_histogram(colour="black", fill="white") + facet_grid(. ~ Book.of.Death) + theme_bw()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk unnamed-chunk-30

World university rankinks from: https://www.kaggle.com/mylesoneill/world-university-rankings/version/2

Your assignment, ussing ggplot2...

  1. Use facets to plot distributions of teaching scores, one plot per year
  2. Scatter plot to explore relationship between the teaching score and the income (facet by year)
  3. Add a regression line to the above plot
  4. Same as 2 and 3 but exploring the relationship between total_score and income
  5. Same as 2 and 3 but exploring relationship between world_rank and research
  6. Does there appear to be an effect of female_male_ratio and total_score? Use boxplot to visualize