Функция glm GWAA (Genome Wide Association Analysis) имеет необычную ошибку, которую я не могу найти объяснение.Но обратите внимание: мне нужно найти ошибку, приводящую к ДАННЫМ, а не к коду.
GWAA <- function(
genodata=genotypes,
phenodata=phenotypes,
family = gaussian,
filename=NULL,
append=FALSE,
workers=getOption("mc.cores",2L),
flip=TRUE,
select.snps=NULL,
hosts=NULL,
nSplits=10){
if (!require(doParallel)) { stop("Missing doParallel package") }
#Check that a filename was specified
if(is.null(filename)) stop("Must specify a filename for output.")
#Check that the genotype data is of class 'SnpMatrix'
if( class(genodata)!="SnpMatrix") stop("Genotype data must of class 'SnpMatrix'.")
#Check that there is a variable named 'phenotype' in phenodata table
if( !"phenotype" %in% colnames(phenodata)) stop("Phenotype data must have column named 'phenotype'")
#Check that there is a variable named 'id' in phenodata table
if( !"id" %in% colnames(phenodata)) stop("Phenotype data must have column named 'id'.")
#If a vector of SNPs is given, subset genotype data for these SNPs
if(!is.null(select.snps)) genodata<-genodata[,which(colnames(genodata)%in%select.snps)]
#Check that there are still SNPs in 'SnpMatrix' object
if(ncol(genodata)==0) stop("There are no SNPs in the 'SnpMatrix' object.")
#Print the number of SNPs to be checked
cat(paste(ncol(genodata), " SNPs included in analysis.\n"))
#If append=FALSE than we will overwrite file with column names
if(!isTRUE(append)) {
columns<-c("SNP", "Estimate", "Std.Error", "t-value", "p-value")
write.table(t(columns), filename, row.names=FALSE, col.names=FALSE, quote=FALSE) }
# Check sample counts
if (nrow(phenodata) != nrow(genodata)) {warning("Number of samples mismatch. Using subset found in phenodata.") }
# Order genodata rows to be the same as phenodata
genodata <- genodata[phenodata$id,]
cat(nrow(genodata), "samples included in analysis.\n")
# Change which allele is
flip.matrix<-function(x){
zero2 <- which(x==0)
two0 <- which(x==2)
x[zero2] <- 2
x[two0] <- 0
return(x)
}
nSNPs <- ncol(genodata)
genosplit <- ceiling(nSNPs/nSplits) # number of SNPs in each subset
snp.start <- seq(1, nSNPs, genosplit) # index of first SNP in group
snp.stop <- pmin(snp.start+genosplit-1, nSNPs) # index of last SNP in group
if (is.null(hosts)) {
# On Unix this will use fork and mclapply. On Windows it # will create multiple processes on localhost.
cl <- makeCluster(workers)
} else {
# The listed hosts must be accessible by the current user using # password-less ssh with R installed on all hosts, all
# packages installed, and "rscript" is in the default PATH.
# See docs for makeCluster() for more information.
cl <- makeCluster(hosts, "PSOCK")
}
show(cl) # report number of workers and type of parallel implementation
registerDoParallel(cl)
#I suspect here starts the problematic region.
foreach (part=1:nSplits) %do% {
# Returns a standar matrix of the
genoNum <- as(genodata[,snp.start[part]:snp.stop[part]], "numeric")
# Flip the numeric values of genotypes to count minor allele
if (isTRUE(flip)) genoNum <- flip.matrix(genoNum)
# For each SNP, concatenate the genotype column to the
# phenodata and fit a generalized linear model
rsVec <- colnames(genoNum)
res <- foreach(snp.name=rsVec, .combine='rbind') %dopar% {
a <- summary(glm(phenotype~ . - id, family=family, data=cbind(phenodata, snp=genoNum[,snp.name])))
a$coefficients['snp',]
}
# write results so far to a file
write.table(cbind(rsVec,res), filename, append=TRUE, quote=FALSE, col.names=FALSE, row.names=FALSE)
cat(sprintf("GWAS SNPs %s-%s (%s%% finished)\n", snp.start[part], snp.stop[part], 100*part/nSplits))
}
stopCluster(cl)
return(print("Done."))
}
Когда я запускаю эту команду:
GWAA(genodata = genData, phenodata = phenodata, family =gaussian, filename = paste(target1, ".txt", sep = ""))
Результат велик, пока одна задача во время последней10% вычислений:
168830 SNPs included in analysis.
748 samples included in analysis.
socket cluster with 2 nodes on host ‘localhost’
GWAS SNPs 1-16883 (10% finished)
GWAS SNPs 16884-33766 (20% finished)
GWAS SNPs 33767-50649 (30% finished)
GWAS SNPs 50650-67532 (40% finished)
GWAS SNPs 50650-67532 (40% finished)
GWAS SNPs 67533-84415 (50% finished)
GWAS SNPs 84416-101298 (60% finished)
GWAS SNPs 101299-118181 (70% finished)
GWAS SNPs 118182-135064 (80% finished)
GWAS SNPs 135065-151947 (90% finished)
Error in { :
task 10 failed - "task 16873 failed - "subscript out of bounds""
Нет проблем с выделенным SNP (задача 16873 на самом деле является SNP с индексом 168820, который выглядит следующим образом:
chr SNP location position A1 A2 index
GA026677 Y GA026677 Y:2722814 2722814 A G 168820
Это полностьюрегулярно и в этом SNP нет ошибок. Однако ошибка все еще возникает. Я думаю, что в данных есть что-то, что не позволяет части, начинающейся с 'foreach', выполняться правильно.
Вот необходимые данные дляВыполнение кода: 1.) genData 2.) феноданные