Create a new directory named day4, download today's data (http://www.cahanlab.org/intra/training/bootcampJune2016/misc/day4_toydata.R) into it, and launch R. Alter the paths below for your setup ...
source("../../misc/utils.R")
library(ggplot2)
toyData<-utils_loadObject("../../misc/day4_toydata.R")
dim(toyData)
## [1] 40 2
toyData[1:3,]
## PC1 PC2
## GSM300438 -24.41612 17.36002
## GSM756454 -24.87946 15.11424
## GSM603070 -17.96385 10.31652
Yes, this is a subset of the PCA results from yesterday
ggplot(toyData, aes(x=PC1, y=PC2)) +
geom_point(pch=19, alpha=3/4, size=1) +
theme_bw()
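K-means assigns each point to one of k groups so that the total within-group sum of squared distances to the group centers is minimized. The starting centers are chosen at random, so the cluster labels can differ between runs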
kmeansRes<-kmeans(toyData, 4)
kvals<-cbind(toyData, classification=kmeansRes$cluster)
ggplot(kvals, aes(x=PC1, y=PC2, colour=as.factor(classification)) ) + geom_point(pch=19, alpha=3/4, size=1) + theme_bw() +
guides(colour=guide_legend(ncol=1))
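The quantity that k-means minimizes can be inspected directly on the result object. The sketch below does not load toyData; it builds a synthetic two-dimensional stand-in (`fakeData`, a hypothetical name) and relies on the `tot.withinss` field and the `nstart` argument of `kmeans` from base R's stats package. Fixing the seed and using several random restarts makes the result reproducible and less likely to get stuck in a poor local optimum.

```r
# Synthetic stand-in for toyData (an assumption): 4 well-separated groups of 10 points
set.seed(42)
centers <- rbind(c(-20, 15), c(20, 15), c(-20, -15), c(20, -15))
fakeData <- data.frame(
  PC1 = rep(centers[, 1], each = 10) + rnorm(40),
  PC2 = rep(centers[, 2], each = 10) + rnorm(40)
)

# nstart = 25 runs k-means from 25 random starts and keeps the best solution
kmFake <- kmeans(fakeData, centers = 4, nstart = 25)

# Recompute the objective by hand: squared distance of each point
# to the center of the cluster it was assigned to
assignedCenters <- kmFake$centers[kmFake$cluster, ]
manualWithinSS <- sum((as.matrix(fakeData) - assignedCenters)^2)

# Both values should agree up to floating-point error
c(reported = kmFake$tot.withinss, manual = manualWithinSS)
```

The same `tot.withinss` value is what you would compare across different choices of k.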
Euclidean distance is a commonly used metric
eucDistV1<-function(vect1, vect2){
xsum <- (vect1[1]-vect2[1])**2 + (vect1[2] - vect2[2])**2;
xsum ** (1/2);
}
ptA<-c(0,0);
ptB<-c(3,3)
eucDistV1(ptA, ptB);
## [1] 4.242641
Write a new function that generalizes Euclidean distance to an arbitrary number of dimensions. Don't peek at the next slide...
eucDistV2<-function(vect1, vect2){
##
##
cat("Compute distance between 2 points in ", length(vect1)," dimension space\n", sep='');
}
ptA<-sample(1:1000, 25);
ptB<-sample(1:1000, 25)
eucDistV2(ptA, ptB);
## Compute distance between 2 points in 25 dimension space
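eucDistV2<-function(vect1, vect2){
  # accumulate the squared differences one dimension at a time
  xsum<-0;
  # seq_along() is safe for zero-length input, where 1:length(vect1) would yield c(1, 0)
  for(i in seq_along(vect1)){
    xsum <- xsum + (vect1[i] - vect2[i])**2;
  }
  xsum**(1/2);
}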
ptA<-c(0,0,0);
ptB<-c(3,3,3);
eucDistV2(ptA, ptB)
## [1] 5.196152
eucDistV3<-function(vect1, vect2){
tmpFunc<-function(aVect){
(aVect[1]-aVect[2])**2;
}
tmpDat<-cbind(vect1, vect2);
squares<-apply(tmpDat, 1, tmpFunc);
sum(squares)**(1/2)
}
ptA<-c(0,0,0);
ptB<-c(3,3,3);
eucDistV3(ptA, ptB)
## [1] 5.196152
eucDistV4<-function(vect1, vect2){
sum((vect1-vect2)**2)**(1/2)
}
ptA<-c(0,0,0);
ptB<-c(3,3,3);
eucDistV4(ptA, ptB)
## [1] 5.196152
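As a sanity check, base R's `dist` function (stats package) computes Euclidean distances between the rows of a matrix by default, so it should agree with the hand-written function. The sketch below repeats the eucDistV4 computation so that it runs on its own; `ptMat` and `builtIn` are names introduced here for illustration.

```r
# Same computation as eucDistV4 above, repeated so this block is self-contained
eucDistV4 <- function(vect1, vect2) {
  sqrt(sum((vect1 - vect2)^2))
}

set.seed(1)
ptA <- sample(1:1000, 25)
ptB <- sample(1:1000, 25)

# dist() treats each row as a point; its method defaults to "euclidean"
ptMat <- rbind(ptA, ptB)
builtIn <- as.numeric(dist(ptMat))

# TRUE when the two results match up to floating-point tolerance
all.equal(eucDistV4(ptA, ptB), builtIn)
```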
Write a new function that computes the distances between all pairs of points in toyData. Don't peek at the next slide...
eucDistAll<-function(aMat){
ansMatrix<-matrix(NA, nrow=nrow(aMat), ncol=nrow(aMat));
##
## Add your code here. Use your eucDistV4 function
##
ansMatrix
}
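eucDistAll<-function(aMat){
  aMat<-as.matrix(aMat); # so that each row is a plain numeric vector
  myDim<-nrow(aMat);
  # keep the row names so that plots of the result are labeled by sample
  ansMatrix<-matrix(NA, nrow=myDim, ncol=myDim, dimnames=list(rownames(aMat), rownames(aMat)));
  # fill only the lower triangle (j < i): distances are symmetric,
  # and as.dist() reads just this half; the rest stays NA
  for(i in seq_len(myDim)){
    for(j in seq_len(myDim)){
      if(j<i){
        ansMatrix[i,j] <- eucDistV4(aMat[i,], aMat[j,]);
      }
    }
  }
  ansMatrix;
}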
toyDists<-eucDistAll(toyData);
hist(toyDists);
plot(hclust(as.dist(toyDists), "average"), hang=-1)
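The dendrogram can be turned into explicit group labels with `cutree`, which makes the hierarchical result directly comparable to the k-means classification. This is a minimal sketch on synthetic data rather than on toyData: `fakeMat`, `hc`, and `hcGroups` are hypothetical names, and base R's `dist` stands in for eucDistAll.

```r
# Synthetic stand-in (an assumption): two well-separated groups of 5 points in 2 dimensions
set.seed(7)
fakeMat <- rbind(
  matrix(rnorm(10, mean = 0), ncol = 2),
  matrix(rnorm(10, mean = 50), ncol = 2)
)
rownames(fakeMat) <- paste0("pt", 1:10)

# Average-linkage clustering on Euclidean distances, as in the plot above
hc <- hclust(dist(fakeMat), method = "average")

# Cut the tree into k = 2 groups; returns an integer vector named by row
hcGroups <- cutree(hc, k = 2)
table(hcGroups)
```

Choosing `k` here plays the same role as the number of centers passed to `kmeans`.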
Write a function that evaluates the groupings provided by k-means. It takes as input toyData (or a similar object), and a vector of possible k values. It returns a list of ...
Then apply this function to the original data set (first 2 principal components only), which you can download here:
http://cahanlab.org/intra/training/bootcampJune2016/misc/day4_assignment_data.R
What is the optimal number of clusters that your iterative kmeans approach yields?