Computation Boot Camp

Day 4: more unsupervised analysis and custom functions

Getting started

Create a new directory named day4, download today's data (http://www.cahanlab.org/intra/training/bootcampJune2016/misc/day4_toydata.R) into it, and launch R. Adjust the paths below to match your setup ...

source("../../misc/utils.R")
library(ggplot2)
toyData<-utils_loadObject("../../misc/day4_toydata.R")

Look at the data

dim(toyData)

## [1] 40  2

toyData[1:3,]

##                 PC1      PC2
## GSM300438 -24.41612 17.36002
## GSM756454 -24.87946 15.11424
## GSM603070 -17.96385 10.31652

Yes, this is a subset of the PCA results from yesterday

ggplot(toyData, aes(x=PC1, y=PC2)) +
geom_point(pch=19, alpha=3/4, size=1) +
theme_bw()

[Figure: scatter plot of toyData, PC1 versus PC2]

K-means

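K-means assigns each point to one of k groups so that the total within-group sum of squared distances to the group centers is minimized. The starting centers are chosen at random, so cluster labels (and sometimes the clusters themselves) can differ between runs.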

kmeansRes<-kmeans(toyData, 4)
kvals<-cbind(toyData, classification=kmeansRes$cluster)

ggplot(kvals, aes(x=PC1, y=PC2, colour=as.factor(classification)) ) + geom_point(pch=19, alpha=3/4, size=1) + theme_bw() +
guides(colour=guide_legend(ncol=1))

[Figure: toyData colored by k-means cluster assignment, k = 4]
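To make "minimized" concrete, the sketch below recomputes the k-means objective by hand and compares it with the value that kmeans() reports. It uses simulated stand-in data (simData is hypothetical, not the course's toyData) so that it runs without any downloads, and it sets a seed and nstart so the result is repeatable.

```r
# Hypothetical stand-in for toyData: 40 points in two well-separated groups
set.seed(1)
simData <- data.frame(PC1 = c(rnorm(20, -20, 3), rnorm(20, 20, 3)),
                      PC2 = c(rnorm(20, 15, 3), rnorm(20, -15, 3)))

# nstart = 10 runs ten random initializations and keeps the best one
km <- kmeans(simData, centers = 2, nstart = 10)

# Recompute the objective: squared distance from each point to its own center
assignedCenters <- km$centers[km$cluster, ]
manualWss <- sum((as.matrix(simData) - assignedCenters)^2)

manualWss        # objective computed by hand
km$tot.withinss  # objective reported by kmeans()
```

The two numbers should agree, which shows that tot.withinss is the quantity k-means is trying to make small.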

How to compute intra-group distances?

Euclidean distance is a commonly used metric

eucDistV1<-function(vect1, vect2){
    xsum <- (vect1[1]-vect2[1])**2 + (vect1[2] - vect2[2])**2;
    xsum ** (1/2);
}

ptA<-c(0,0);
ptB<-c(3,3)
eucDistV1(ptA, ptB);

## [1] 4.242641
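As a sanity check, base R's dist() function computes the same quantity when each point is supplied as a row of a matrix. This is only a cross-check of the hand-written function, not a replacement for the exercises that follow.

```r
ptA <- c(0, 0)
ptB <- c(3, 3)

# dist() treats each row as one point; Euclidean is its default method
builtin <- as.numeric(dist(rbind(ptA, ptB), method = "euclidean"))
builtin
## [1] 4.242641
```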

In-class exercise

Write a new function that generalizes Euclidean distance to an arbitrary number of dimensions. Don't peek at the next slide...

eucDistV2<-function(vect1, vect2){
    ## 
    ## 
    cat("Compute distance between 2 points in ", length(vect1)," dimension space\n", sep='');
}

ptA<-sample(1:1000, 25);
ptB<-sample(1:1000, 25)
eucDistV2(ptA, ptB);

## Compute distance between 2 points in 25 dimension space

In-class exercise solution using for

eucDistV2<-function(vect1, vect2){
    xsum<-0;
    for(i in seq_along(vect1)){
        xsum <- xsum + (vect1[i] - vect2[i])**2;
    }
    xsum**(1/2);
}

ptA<-c(0,0,0);
ptB<-c(3,3,3);
eucDistV2(ptA, ptB)

## [1] 5.196152

In-class exercise solution using apply

eucDistV3<-function(vect1, vect2){

    tmpFunc<-function(aVect){
      (aVect[1]-aVect[2])**2;
    }
    tmpDat<-cbind(vect1, vect2);
    squares<-apply(tmpDat, 1, tmpFunc);
    sum(squares)**(1/2)
}

ptA<-c(0,0,0);
ptB<-c(3,3,3);
eucDistV3(ptA, ptB)

## [1] 5.196152

In-class exercise solution using R's smartness

eucDistV4<-function(vect1, vect2){
    sum((vect1-vect2)**2)**(1/2)
}

ptA<-c(0,0,0);
ptB<-c(3,3,3);
eucDistV4(ptA, ptB)

## [1] 5.196152
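The "smartness" here is vectorized arithmetic: vect1-vect2 subtracts element by element. One caveat is that R silently recycles the shorter vector when its length divides the longer one, so mismatched inputs can give a wrong answer without an error. The sketch below is a hypothetical guarded variant (eucDistChecked is not part of the course code) that expects plain numeric vectors and refuses mismatched lengths.

```r
# Hypothetical guarded variant of eucDistV4 for plain numeric vectors
eucDistChecked <- function(vect1, vect2){
    stopifnot(is.numeric(vect1), is.numeric(vect2),
              length(vect1) == length(vect2))
    sqrt(sum((vect1 - vect2)^2))
}

eucDistChecked(c(0, 0, 0), c(3, 3, 3))

# Without the guard, c(1,2,3,4) - c(1,2) would recycle to c(0,0,2,2);
# with it, the call stops with an error instead
res <- try(eucDistChecked(c(1, 2, 3, 4), c(1, 2)), silent = TRUE)
inherits(res, "try-error")
```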

In-class exercise #2

Write a new function that computes the distances between all pairs of points in toyData. Don't peek at the next slide...

eucDistAll<-function(aMat){
    ansMatrix<-matrix(NA, nrow=nrow(aMat), ncol=nrow(aMat));
    ## 
    ## Add your code here. Use your eucDistV4 function 
    ##
    ansMatrix
}

In-class exercise solution to #2

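eucDistAll<-function(aMat){
    myDim<-nrow(aMat);
    ansMatrix<-matrix(NA, nrow=myDim, ncol=myDim);
    ## Fill only the lower triangle (j < i): distances are symmetric,
    ## so the diagonal and upper triangle are left as NA
    for(i in seq_len(myDim)){
        for(j in seq_len(myDim)){
            if(j<i){
                ansMatrix[i,j] <- eucDistV4(aMat[i,], aMat[j,]);
            }
        }
    }
    ansMatrix;
}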
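A quick way to check a pairwise-distance function is to compare its lower triangle with base R's dist(). The sketch below does this on a small simulated matrix (simData is a hypothetical stand-in for toyData) and inlines the same double loop so that it runs on its own.

```r
# Five hypothetical points in two dimensions
set.seed(42)
simData <- matrix(rnorm(10), ncol = 2)
n <- nrow(simData)

# Same double loop as the exercise solution, lower triangle only
manual <- matrix(NA, nrow = n, ncol = n)
for(i in seq_len(n)){
    for(j in seq_len(n)){
        if(j < i){
            manual[i, j] <- sqrt(sum((simData[i, ] - simData[j, ])^2))
        }
    }
}

# dist() returns the lower triangle; as.matrix() expands it to n x n
builtin <- as.matrix(dist(simData))
lower <- lower.tri(manual)
all.equal(manual[lower], builtin[lower])
```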

Apply to toy data

toyDists<-eucDistAll(toyData);
hist(toyDists);

[Figure: histogram of the pairwise distances in toyDists]

We can use this distance matrix to perform hierarchical clustering

plot(hclust(as.dist(toyDists), "average"), hang=-1)

[Figure: dendrogram from average-linkage hierarchical clustering of toyData]
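A dendrogram shows the full merge history, but it does not by itself assign samples to groups. One common way to get group labels is cutree(), sketched below on simulated data (simData is hypothetical, with two deliberately well-separated groups) rather than on toyData.

```r
# Two hypothetical groups of 10 points each, far apart
set.seed(7)
simData <- rbind(matrix(rnorm(20, 0, 0.5), ncol = 2),
                 matrix(rnorm(20, 10, 0.5), ncol = 2))

# Average-linkage clustering on Euclidean distances
tree <- hclust(dist(simData), method = "average")

# Cut the tree into k = 2 groups; returns one label per row of simData
groups <- cutree(tree, k = 2)
table(groups)
```

Because the two groups are so far apart here, each label should cover exactly ten points; real data rarely separates this cleanly.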

Your final assignment:

Write a function that evaluates the groupings produced by k-means. It takes as input toyData (or a similar object) and a vector of candidate k values, and returns a list containing:

- a data frame with 2 columns: k and the mean intra-cluster distance
- a ggplot in which the points (samples) are colored by cluster for the optimal value of k
- a ggplot scatter plot in which the x-axis is k and the y-axis is the mean intra-cluster distance

Then apply this function to the original data set (only the first 2 principal components), which you can download here:

http://cahanlab.org/intra/training/bootcampJune2016/misc/day4_assignment_data.R

What is the optimal number of clusters according to your iterative k-means approach?
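One possible starting point, not a full solution: the sketch below scores several values of k on simulated data. It assumes that "mean intra-cluster distance" means the average pairwise Euclidean distance within each cluster, averaged over clusters; the assignment does not pin this down, so treat that definition, the helper name meanIntraDist, and the simulated simData as assumptions. The plots and the rule for choosing the optimal k are left to you.

```r
# Hypothetical helper: mean pairwise distance within each cluster,
# averaged over clusters. Singleton clusters contribute 0.
meanIntraDist <- function(aMat, clusters){
    aMat <- as.matrix(aMat)
    perCluster <- sapply(unique(clusters), function(cl){
        members <- aMat[clusters == cl, , drop = FALSE]
        if(nrow(members) < 2){
            return(0)
        }
        mean(as.numeric(dist(members)))
    })
    mean(perCluster)
}

# Simulated stand-in data: two tight groups of 10 points each
set.seed(3)
simData <- rbind(matrix(rnorm(20, 0, 0.5), ncol = 2),
                 matrix(rnorm(20, 10, 0.5), ncol = 2))

# Score each candidate k; nstart = 10 reduces sensitivity to random starts
kValues <- 2:4
scores <- data.frame(k = kValues,
                     meanDist = sapply(kValues, function(k){
                         km <- kmeans(simData, centers = k, nstart = 10)
                         meanIntraDist(simData, km$cluster)
                     }))
scores
```

Note that this score tends to keep shrinking as k grows, so picking the smallest value outright would favor the largest k; looking for the point where the curve flattens is one common alternative.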