  • htseq-count eats 42G of memory

    Hello,

    I've been using DESeq for a while, and htseq-count (version 0.5.3p9) is now giving me problems with one BAM file (smaller than others that I've processed with the same options and the same GTF). It runs and keeps using more and more memory until RAM is full, and then it starts swapping. This is the relevant top output:
    Code:
    2958 data      20   0 46.5g  46g  676 D    6 97.5 841:10.30 htseq-count
    Any idea? Thanks!

  • #2
    How have you called htseq-count? Please post the precise command line, and any output generated.


    • #3
      Thank you for your answer:

      Code:
      samtools sort -n /rogue/bioinfotree/task/RNAseq/dataset/0.1/GSE27003/alignment_tophat/SRR097789.merged.bam SRR097789_sorted  
      samtools view SRR097789_sorted.bam | htseq-count -m intersection-nonempty --stranded=no - annotation.gtf > nonoverlap_nonempty_counts_SRR097789_htseq 2> nonoverlap_nonempty_counts_SRR097789_htseq.err
      No output reported, just the processing of the gtf infos.
      I've manually inspected the BAM file and it seems to have a lot of missing read IDs. (I got the data from SRA and just dumped it to FASTQ files, which seem OK to me, then ran the standard TopHat pipeline.) So now I'm trying to filter those records out to see if they are the cause of the problem.
      samtools flagstat does not report anything wrong with the BAM file itself, apart from a very low percentage of properly paired reads.


      • #4
        htseq-count starts by reading in the annotation file and reports when it has finished with that before looking at the reads. So, if it prints nothing, your problem is with the GTF file. What does it look like?


        • #5
          Sorry, maybe I did not point that out properly: "No output reported, just the processing of the gtf infos."
          It finished processing the GTF. If you want the numbers I have to rerun it, but I swear that it printed them and that the same GTF worked with other BAM files.
          Last edited by EGrassi; 10-10-2012, 05:31 AM.


          • #6
            OK, I tried removing the lines with an empty read name from the .sam files (I hope to understand why they are there), and it finished without any problem. The counts seem sensible to me (not from a biological point of view, OK, but they are numbers and sometimes they are different from 0!). If you want a sample of the .bam/.sam files with those strange lines, I can give it to you. I do not know whether they violate the BAM format standard and samtools is just being lenient with them, or whether they are valid and this is a small bug in htseq-count or in the library it uses to scan BAM files.
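
The filtering step described in this post could look something like the following minimal sketch. It assumes the SAM text has one record per line with tab-separated fields, and that a "missing read ID" means an empty first (QNAME) column. The function name `drop_empty_qname` and the sample records are hypothetical and are not part of samtools or HTSeq.

```python
from typing import Iterable, Iterator


def drop_empty_qname(lines: Iterable[str]) -> Iterator[str]:
    """Yield SAM lines, skipping alignment records whose QNAME is empty.

    Assumption: a record with a missing read ID has an empty (or
    whitespace-only) first tab-separated field. Header lines, which
    start with '@', are passed through unchanged.
    """
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        qname = line.split("\t", 1)[0]
        if qname.strip():
            yield line


# Tiny made-up sample: one header, one normal record, one record without QNAME.
sample = [
    "@HD\tVN:1.0\tSO:queryname\n",
    "SRR097789.29777200\t89\tchr19\t55899346\t255\t50M\t*\t0\t0\n",
    "\t329\tchr1\t10005\t0\t50M\t*\t0\t0\n",
]
kept = list(drop_empty_qname(sample))
print(f"kept {len(kept)} of {len(sample)} lines")
```

To use it in a pipe between `samtools view` and htseq-count, the same function could be applied to standard input, for example with `sys.stdout.writelines(drop_empty_qname(sys.stdin))` in a small script.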


            • #7
              ...because of tophat?

              OK, since the FASTQ files seem fine to me (they have an ID for each read, for example), I'm starting to think that the strange BAM/SAM records are TopHat's responsibility.

              I have "normal" lines like these ones:
              Code:
              SRR097789.29777200      89      chr19   55899346        255     50M     *       0       0       ATGCTCGCGCCNCGNTCAGCAGCATCAGACACATGATCCGCAAGAACAAG      AHFH@44455-!5-!:A:A:DHHHHHHHHHH=HEHDDDHDHHHHFFEGFG      AS:i:-2 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:11A2C35    YT:Z:UU XS:A:+  NH:i:1
              And others without the QNAME field:

              Code:
                     329     chr1    10005   0       50M     *       0       0       CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC      *       AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YT:Z:UU NH:i:20 CC:Z:chr5       CP:i:10131      HI:i:0
              Sequences taken from lines like this one are found in the FASTQ files, associated with IDs.
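
For anyone comparing the two records above, the FLAG values (89 and 329) can be decoded with a short sketch like this one. The bit meanings follow the SAM specification; the short labels and the function name `decode_flag` are my own shorthand, not official names.

```python
from typing import List

# SAM FLAG bits as defined in the SAM specification; labels are informal.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary_alignment",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary_alignment",
}


def decode_flag(flag: int) -> List[str]:
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


print(89, decode_flag(89))
# 89 ['paired', 'mate_unmapped', 'reverse_strand', 'first_in_pair']
print(329, decode_flag(329))
# 329 ['paired', 'mate_unmapped', 'first_in_pair', 'secondary_alignment']
```

Read this way, the record without a QNAME is marked as a secondary alignment of a read whose mate is unmapped, which is consistent with its NH:i:20 tag, although that alone does not explain why the name is missing.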

              Does anyone have an idea? I'm using tophat 2.0 with this command line:
              Code:
              tophat -p 7 --no-novel-juncs -G /rogue/bioinfotree/prj/ewing-rnaseq/local/share/data/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf --transcriptome-index=transcriptome -o SRR097789.th ../alignment/expanded_genome2 <(zcat /rogue/bioinfotree/task/RNAseq/dataset/0.1/GSE27003/reads//SRR097789_1.fastq.gz) <(zcat /rogue/bioinfotree/task/RNAseq/dataset/0.1/GSE27003/reads//SRR097789_2.fastq.gz)
              The --no-novel-juncs option had to be added; otherwise TopHat just freezes and stops after a while (see this other thread: http://seqanswers.com/forums/showthread.php?t=23887).


              • #8
                I read that Post and got it fine and informative. stick merge
