  • htseq-count performance

    Hello,

    While using the Tophat --> htseq-count --> DESeq pipeline, I'm finding that htseq-count slams my machine (8 GB RAM). Since it is using all of the memory, and even most of the swap, I'm guessing I could improve performance by breaking the task down somehow. I'm using version 0.5.3p3 with these options:

    htseq-count -m intersection-nonempty -s no -t CDS -i gene_id -o htseq_{$fprefix}.sam sorted_{$fprefix}.sam hg19_EnsGene.gff

    Would it help if I separated the input files by chromosome?

    Thanks,
    Danielle

  • #2
    Something went wrong here. htseq-count never uses much memory, because it processes the data read by read. Only the content of the relevant lines in the GFF file is kept in memory, and this cannot come anywhere near several GB. Please double-check that it really was htseq-count that filled up your machine and that your files are sane. Maybe something is wrong with the GFF file, so that HTSeq chokes while trying to read it.


    • #3
      Hi Simon,
      Thanks so much for responding. I was hoping you would see this thread.
      I have rebooted my machine and run only my script. The process using all of the memory is python...and this is the only thing using python. Just to be sure, I've killed everything, rebooted, and started only the htseq-count call and as it is running, the memory used gradually climbs and climbs.
      The script is outputting warnings that look like this:
      Read HWI-ST623:0:2:1101:11808:178924:0:2:1 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)

      So...what may be happening is that htseq-count is storing more and more reads in memory as it tries to find the mate.
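To make the hypothesis in this post concrete, here is a minimal Python sketch. It is not HTSeq's actual code; `peak_pending_mates` and the toy read names are made up for illustration. It only shows how many reads a tool would have to hold if it kept each read until its mate turned up, for two different record orders.

```python
def peak_pending_mates(read_names):
    """Return the largest number of reads held while waiting for a mate.

    Assumes every name occurs exactly twice (once per mate).
    """
    pending = set()
    peak = 0
    for name in read_names:
        if name in pending:
            pending.remove(name)  # mate found: release the buffered read
        else:
            pending.add(name)     # first mate seen: hold it until its partner arrives
            peak = max(peak, len(pending))
    return peak


# Hypothetical toy data: four read pairs.
name_sorted = ["r1", "r1", "r2", "r2", "r3", "r3", "r4", "r4"]
# Mates far apart from each other in the file.
scattered = ["r1", "r2", "r3", "r4", "r1", "r2", "r3", "r4"]

print(peak_pending_mates(name_sorted))  # 1
print(peak_pending_mates(scattered))    # 4
```

With mates adjacent, at most one read is ever waiting; with mates far apart, the number of waiting reads grows with the distance between them, which is the kind of gradual memory climb described above.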

      I am sorting the sam file but I'm a newbie at this so it is possible I'm doing something wrong. The options I'm using are

      samtools sort -n bamfile.bam
      samtools view -h sorted_bamfile.bam > sorted_samfile.sam

      Here is the full script:
      foreach bamfile (./TopHat_output/*accepted_hits.bam)
        set fprefix = `echo $bamfile:t:r | sed 's/accepted_hits//'`
        samtools sort -n $bamfile sorted_{$fprefix}
        samtools view -h sorted_{$fprefix}.bam > ./HTSEQcount_output/sorted_{$fprefix}.sam
        /home/dglemay/work/tools/HTSeq-0.5.3p3/scripts/htseq-count -m intersection-nonempty -s no -t CDS -i gene_id -o ./HTSEQcount_output/htseq_{$fprefix}.sam ./HTSEQcount_output/sorted_{$fprefix}.sam ./hg19/hg19_EnsGene.gff >&! ./HTSEQcount_output/log_{$fprefix}.txt
        grep ENS ./HTSEQcount_output/log_{$fprefix}.txt > ./count_data/counts_{$fprefix}.txt
        # cleanup
        rm ./HTSEQcount_output/sorted_{$fprefix}.sam
        samtools view -bSl ./HTSEQcount_output/htseq_{$fprefix}.sam > ./HTSEQcount_output/htseq_{$fprefix}.bam
      end

      Thank you for reading,
      Danielle


      • #4
        Hi Simon,

        Is it possible to use htseq-count to count reads in UTRs? I tried it and got nothing. If it works on UTRs, is there anything in particular I need to pay attention to?

        Thanks
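One thing worth checking for a question like this: the `-t` option of htseq-count selects features by the type given in the third column of the GFF/GTF file, so a type that never appears there yields no counts. The sketch below tallies the feature types in an annotation. `feature_type_counts` and the example lines are hypothetical, and the exact name used for UTR features varies between annotation sources.

```python
from collections import Counter


def feature_type_counts(gff_lines):
    """Count how often each feature type (GFF column 3) occurs."""
    counts = Counter()
    for line in gff_lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


# Hypothetical annotation lines, for illustration only.
example = [
    "##gff-version 3\n",
    "chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id \"G1\";\n",
    "chr1\tsrc\tCDS\t120\t200\t.\t+\t0\tgene_id \"G1\";\n",
    "chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id \"G1\";\n",
]

counts = feature_type_counts(example)
print(sorted(counts.items()))  # [('CDS', 1), ('exon', 2)]

# For a real file, pass the open file handle instead of a list:
# with open("annotation.gtf") as handle:
#     print(feature_type_counts(handle))
```

If no UTR-like type shows up in the tally, the annotation simply does not contain such features and `-t` has nothing to match.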


        • #5
          Ah!

          Should be
          samtools sort
          not
          samtools sort -n

          @emilyjia2000: you probably need to start a new thread


          • #6
            Does the memory usage climb the entire time or just at the beginning? What is the file size of your GFF?

            If it is alignment related, you would see an initial increase of memory consumption as you start the tool, as the GFF is read. Then, depending on your SAM file, the memory consumption, if what you say is true, would start to increase again. Can you observe this behavior?
            I am guessing you are using a *nix OS. Just open another Console window and enter "top". You'll be shown a detailed list of processes and their consumption of resources.

            Please correct me if I am wrong.
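As an alternative to watching "top" by hand, peak memory of a finished command can be read back from the operating system. This is a rough sketch for Unix-like systems only, using Python's standard `resource` module; `run_and_report_peak_memory` is a made-up helper name, and the demo command is a stand-in for the real htseq-count call.

```python
import resource
import subprocess
import sys


def run_and_report_peak_memory(command):
    """Run a command, wait for it, and return (exit_code, peak_rss).

    peak_rss is the largest resident set size among child processes
    waited for so far. The unit depends on the platform: kilobytes on
    Linux, bytes on macOS.
    """
    completed = subprocess.run(command)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return completed.returncode, usage.ru_maxrss


# Stand-in command that allocates roughly 50 MB and exits.
demo_command = [sys.executable, "-c", "x = bytearray(50_000_000)"]
exit_code, peak = run_and_report_peak_memory(demo_command)
print("exit code:", exit_code, "peak RSS:", peak)
```

This only gives a single number after the process has ended, so it answers "how much" but not "when"; for the question of whether memory climbs the whole time, watching the process live is still needed.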


            • #7
              Try Picard sort; it works for me.


              • #8
                Originally posted by emilyjia2000:
                Try Picard sort; it works for me.
                Sorry, but as author of HTSeq, I would like to say, just for the record: HTSeq works as well for nearly everybody, and it is designed to work with little memory. I have no clue what is wrong here but I am very sure that there must be something very strange with dglemay's input files.


                • #9
                  Hi Simon,

                  I am puzzled by the paired-end sorting problem that comes up before running HTSeq.
                  I have read many posts about this issue, but no single thread explains it completely.
                  First, I sort my paired-end BAM file by read name with the command
                  samtools sort -n my.bam my.sort

                  Then I convert the BAM to SAM:
                  samtools view my.sort.bam > my.sort.sam

                  Finally, I run HTSeq to get the counts:
                  htseq-count --stranded=no --mode=intersection-nonempty -t exon -i gene_id my.sort.sam annotation.gtf > output.txt

                  But I still get a lot of error messages saying that HTSeq cannot find the other aligned mate ("Is the SAM file properly sorted?"). Someone said that the SAM file needs to be sorted again. If so, how should I sort it: by name again, or with some other sorting method?
                  Could you explain this in more detail?

                  Thanks a lot!
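Before re-sorting anything, it may help to check whether records with the same read name really sit next to each other in the SAM file. The following is a minimal sketch, not part of HTSeq or samtools; `is_grouped_by_read_name` and the toy records are hypothetical. It keeps every distinct read name in memory, so on a large file it is better run on a subset of the lines.

```python
def is_grouped_by_read_name(sam_lines):
    """Return True if all records sharing a read name are consecutive."""
    finished_or_current = set()
    previous = None
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        name = line.split("\t", 1)[0]  # QNAME is the first SAM column
        if name != previous:
            if name in finished_or_current:
                return False  # this name appeared earlier, then other reads intervened
            finished_or_current.add(name)
            previous = name
    return True


# Hypothetical minimal records: only the first column matters here.
grouped = [
    "@HD\tVN:1.0\n",
    "readA\t99\tchr1\t100\t50\t10M\t=\t300\t210\tACGTACGTAC\tIIIIIIIIII\n",
    "readA\t147\tchr1\t300\t50\t10M\t=\t100\t-210\tACGTACGTAC\tIIIIIIIIII\n",
    "readB\t99\tchr1\t150\t50\t10M\t=\t350\t210\tACGTACGTAC\tIIIIIIIIII\n",
    "readB\t147\tchr1\t350\t50\t10M\t=\t150\t-210\tACGTACGTAC\tIIIIIIIIII\n",
]
# Same records, reordered so that mates are separated.
separated = [grouped[1], grouped[3], grouped[2], grouped[4]]

print(is_grouped_by_read_name(grouped))    # True
print(is_grouped_by_read_name(separated))  # False
```

A `False` result means mates are not adjacent in the file. A `True` result only says the names are grouped; it does not check that each pair is complete or that the flags are consistent.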
