Ошибка функции GWAA: «Подписаться за пределы» - PullRequest
0 голосов
/ 22 января 2019

Функция glm GWAA (Genome Wide Association Analysis) имеет необычную ошибку, которую я не могу найти объяснение.Но обратите внимание: мне нужно найти ошибку, приводящую к ДАННЫМ, а не к коду.

GWAA <- function(
  genodata=genotypes, 
  phenodata=phenotypes, 
  family = gaussian, 
  filename=NULL, 
  append=FALSE, 
  workers=getOption("mc.cores",2L), 
  flip=TRUE,
  select.snps=NULL, 
  hosts=NULL, 
  nSplits=10){
      if (!require(doParallel)) { stop("Missing doParallel package") }

      #Check that a filename was specified
      if(is.null(filename)) stop("Must specify a filename for output.")

      #Check that the genotype data is of class 'SnpMatrix'
      if( class(genodata)!="SnpMatrix") stop("Genotype data must of class 'SnpMatrix'.")
      #Check that there is a variable named 'phenotype' in phenodata table
      if( !"phenotype" %in% colnames(phenodata)) stop("Phenotype data must have column named 'phenotype'")

      #Check that there is a variable named 'id' in phenodata table
      if( !"id" %in% colnames(phenodata)) stop("Phenotype data must have column named 'id'.")

      #If a vector of SNPs is given, subset genotype data for these SNPs
      if(!is.null(select.snps)) genodata<-genodata[,which(colnames(genodata)%in%select.snps)]

      #Check that there are still SNPs in 'SnpMatrix' object
      if(ncol(genodata)==0) stop("There are no SNPs in the 'SnpMatrix' object.")

      #Print the number of SNPs to be checked
      cat(paste(ncol(genodata), " SNPs included in analysis.\n"))

      #If append=FALSE than we will overwrite file with column names
      if(!isTRUE(append)) {
      columns<-c("SNP", "Estimate", "Std.Error", "t-value", "p-value")
      write.table(t(columns), filename, row.names=FALSE, col.names=FALSE, quote=FALSE) }

      # Check sample counts
      if (nrow(phenodata) != nrow(genodata)) {warning("Number of samples mismatch. Using subset found in phenodata.") }

      # Order genodata rows to be the same as phenodata
      genodata <- genodata[phenodata$id,]
      cat(nrow(genodata), "samples included in analysis.\n")

      # Change which allele is
      flip.matrix<-function(x){
            zero2 <- which(x==0)
            two0 <- which(x==2)
            x[zero2] <- 2
            x[two0] <- 0 
            return(x)
      }
      nSNPs <- ncol(genodata)
      genosplit <- ceiling(nSNPs/nSplits) # number of SNPs in each subset
      snp.start <- seq(1, nSNPs, genosplit) # index of first SNP in group
      snp.stop <- pmin(snp.start+genosplit-1, nSNPs) # index of last SNP in group
      if (is.null(hosts)) {
            # On Unix this will use fork and mclapply. On Windows it # will create multiple processes on localhost.
            cl <- makeCluster(workers)
      } else {
            # The listed hosts must be accessible by the current user using # password-less ssh with R installed on all hosts, all
            # packages installed, and "rscript" is in the default PATH.
            # See docs for makeCluster() for more information.
            cl <- makeCluster(hosts, "PSOCK")
      }
      show(cl) # report number of workers and type of parallel implementation
      registerDoParallel(cl)
#I suspect here starts the problematic region.
      foreach (part=1:nSplits) %do% {
            # Returns a standar matrix of the
            genoNum <- as(genodata[,snp.start[part]:snp.stop[part]], "numeric")
            # Flip the numeric values of genotypes to count minor allele
            if (isTRUE(flip)) genoNum <- flip.matrix(genoNum)
            # For each SNP, concatenate the genotype column to the
            # phenodata and fit a generalized linear model
            rsVec <- colnames(genoNum)
            res <- foreach(snp.name=rsVec, .combine='rbind') %dopar% {
                  a <- summary(glm(phenotype~ . - id, family=family, data=cbind(phenodata, snp=genoNum[,snp.name])))
                  a$coefficients['snp',]

            }
            # write results so far to a file
            write.table(cbind(rsVec,res), filename, append=TRUE, quote=FALSE, col.names=FALSE, row.names=FALSE)
            cat(sprintf("GWAS SNPs %s-%s (%s%% finished)\n", snp.start[part], snp.stop[part], 100*part/nSplits))
      }
      stopCluster(cl)
      return(print("Done."))
}

Когда я запускаю эту команду:

GWAA(genodata = genData, phenodata = phenodata, family =gaussian, filename = paste(target1, ".txt", sep = "")) 

Результат велик, пока одна задача во время последней10% вычислений:

168830  SNPs included in analysis.
748 samples included in analysis.
socket cluster with 2 nodes on host ‘localhost’
GWAS SNPs 1-16883 (10% finished)
GWAS SNPs 16884-33766 (20% finished)
GWAS SNPs 33767-50649 (30% finished)
GWAS SNPs 50650-67532 (40% finished)
GWAS SNPs 50650-67532 (40% finished)
GWAS SNPs 67533-84415 (50% finished)
GWAS SNPs 84416-101298 (60% finished)
GWAS SNPs 101299-118181 (70% finished)
GWAS SNPs 118182-135064 (80% finished)
GWAS SNPs 135065-151947 (90% finished)
Error in { :
  task 10 failed - "task 16873 failed - "subscript out of bounds""

Нет проблем с выделенным SNP (задача 16873 на самом деле является SNP с индексом 168820, который выглядит следующим образом:

        chr      SNP  location position A1 A2  index
GA026677   Y GA026677 Y:2722814  2722814  A  G 168820

Это полностьюрегулярно и в этом SNP нет ошибок. Однако ошибка все еще возникает. Я думаю, что в данных есть что-то, что не позволяет части, начинающейся с 'foreach', выполняться правильно.

Вот необходимые данные дляВыполнение кода: 1.) genData 2.) феноданные

...