Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTSeq question: high number of missing mates/unmatched reads

    Hi,

    I'm trying to create read count tables from my sam files with HTSeq (v0.6.1) and run into a problem I don't get.
    When I run HTSeq-count (htseq-count -r name -s yes -t "Coding gene" -i "name" -f sam foo.sam bar.gff) I encounter two things: (i) a very low percentage of reads matching to features and (ii) half of the reads are without mate (Warning: 13666808 reads with missing mate encountered)

    I wrote a simple script checking how many of my read names occur twice in the *.sam file (i.e., checking for mate names) and less than 5% are single mate reads. Additionally, I checked how many read pairs fall inside a feature and found that >40% do.

    As many people used HTSeq-count without problems I'm pretty sure I'm doing something wrong and not HTSeq. But I'd like to understand what ...

    This is a line from my sam file:
    HWI-ST218:183:H88NHADXX:1:1101:1250:2193 83 chr1 3950365 60 99M1S chr1 3950190 -274 CAACTTCAACGGCCAGAGATAAATACCGCCCTTCCCTTAATATGATTGAAATCAACCAGACCGGAACCCACAACACCACCCCATAAAGCAATACAAACTN DDDDDDDDDDDDDDEDDDEEEEDDBDDDDDDEDEEFFFFFFFHHHHHHJJJJJJJJJJHHJJJJJJJJJIJIIHGHGCJJJJJJJJJHGHHHFFFDD=1# AS:i:99
    ]
    And this a line from my gff file (gff3 produced with bp_genbank2gff script):
    chr1 RefSeq Coding gene 286 609 . - . name=chr1_10001;product="transposase"

    I changed some of the settings for testing purposes,e.g., used -r pos instead of name and ran out of memory. This was kind of expected because HTSeq doesn't find mates and therefore keeps each read in memory. I also tried strand=no without any change at all.

    Does anyone have a clue what I'm doing wrong?

    Regards,
    TimK

  • #2
    hi TimK,
    Most probably your SAM file is not Name sorted. HTSeq requires input SAM to be read name sorted. Generally if a SAM/ BAM is sorted, its coordinate-based.

    Check the header of SAM to see what sorting is present.
    To name sort, use samtools sort command and use the -n option instead of the -o option.

    HTSeq expects that the read pairs would be consecutive in SAM file. For PE data, esp. in RNA-seq, coordinate sorting screws this expectation.

    Edit -
    -o in samtools sort is for output. Sorry about confusion. Coordinate sorting is default. For name sorting use -n
    Last edited by amitm; 08-17-2014, 04:21 PM. Reason: Correction

    Comment


    • #3
      HI amitm,

      thanks for the suggestion but the bam files I used first are sorted (tried name and position).
      I then used only sam files (for human readability) and checked them manually. They seem
      to be sorted by name (at least the few % I checked were). So I am quite sure the sorting is
      not the problem.

      Comment


      • #4
        hi TimK,
        That missing mate warning is characteristic when HTSeq is supplied with non- Name-sorted SAM.
        Name sorting is not default and its different from the (coordinate) sorted BAM which (say) TopHat returns.
        Manual checking is not reasonable as there are millions. After using samtools sort with -n parameter, do you still get the same error?

        Comment


        • #5
          Just to exclude possible errors I re-did the following:
          1. I sorted a *.bam using samtools sort -n
          2. I sorted a *.sam using sort -s -k 1,1

          The result for both is the same:
          Warning: 18631016 reads with missing mate encountered.
          29112998 SAM alignment pairs processed.


          Additionally, reagarding my own scripts:
          1. >95% of the read names appear twice in the *.sam file
          2. all reads have their mate in the next line


          And I agree that it sounds very much like the sam/bam files are not correctly sorted but I can't find the mistake. Is there any naming convention and some characters I use in the names shouldn't be used?
          Also I realized that there are several versions of the sam format. I'm using samtools v0.1.19-44428cd. Could it be related to that?

          TimK
          Last edited by TimK; 08-17-2014, 07:16 PM.

          Comment


          • #6
            hi TimK,
            Samtools version is fine and it shouldn't create such an error. Nonetheless even after Name sorting if you are getting error then probably you should do these -
            1) Use Picard to calculate alignmnet statistics of your data. Is proper alignment good?

            2) Try another count generation tool like featurecounts

            This tool doesn't require Name sorting. Typical coordinate sorted BAM file can be directly taken as input. And is faster too.

            See if that helps.

            Comment


            • #7
              Hi Amitm,

              thanks a lot for the help. I used smalt for the mapping to the reference but will have a look at Picard and definitely at featurecounts.

              Hope it will work with one of those.
              Cheers,
              Tim

              Comment


              • #8
                Follow-up

                Hello, I ran into a similar problem using BAM files and then sorted using:
                Code:
                samtools sort -n -O BAM -o <output> <input>
                While this indeed fix the error message about 'missing mates' and this other warning:
                Code:
                arning: Read ACB052:253:C76YKACXX:8:2311:13183:88600 claims to have an aligned mate which could not be found in an adjacent line.
                I went ahead and compared the read counts of the sorted and unsorted after running through HTSeq-count and found that the sorted had significantly less counts even though the __no_feature, __ambiguous, and __alignment_not_unique parameters were less in the sorted file than the unsorted.

                Any thoughts on this?

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Essential Discoveries and Tools in Epitranscriptomics
                  by seqadmin




                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                  04-22-2024, 07:01 AM
                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Yesterday, 11:49 AM
                0 responses
                15 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-24-2024, 08:47 AM
                0 responses
                16 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                61 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                60 views
                0 likes
                Last Post seqadmin  
                Working...
                X