  • HTSeq Counting Warning

    Hello

    I am getting lots of warning messages when I run HTSeq-count. I saw a few other posts on this forum, but could not figure out the solution.

    I am using paired-end Illumina sequences. I have three replicates of one sample, and each replicate has two pairs of fasta files from two lanes (i.e. _L001_ _R1, _L001_ _R2, _L002_ _R1 and _L002_ _R2). I merged the two _R1 files into a single _R1 file and did the same for _R2. I aligned with BWA, converted the SAM to BAM, sorted the BAM file, and converted the sorted BAM back to a SAM file, so the final SAM file is the sorted one.

    Then I used the following command to run HTSeq-count
    HTML Code:
    htseq-count -m intersection-strict --stranded=no Sorted_SAM_file GTF_file > OUTPUT_File
    But the error log shows this
    HTML Code:
    100000 GFF lines processed.
    200000 GFF lines processed.
    300000 GFF lines processed.
    400000 GFF lines processed.
    500000 GFF lines processed.
    600000 GFF lines processed.
    700000 GFF lines processed.
    800000 GFF lines processed.
    861936 GFF lines processed.
    Warning: Read HWI-ST1023:203:H7NK5ADXX:2:1106:2557:97816 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
    …
    …
    
    Warning: Read HWI-ST1023:203:H7NK5ADXX:1:1205:10269:59178 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
    Warning: Read HWI-ST1023:203:H7NK5ADXX:1:1116:3922:31598 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
    Warning: Read HWI-ST1023:203:H7NK5ADXX:2:2116:4516:45139 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
    72953003 sam line pairs processed.
    The last 10 lines of the output file are these:
    HTML Code:
    Traes_7DS_FF7C9C6FD	77
    Traes_7DS_FF911FA4A	26
    Traes_7DS_FF9F1CF23	0
    Traes_7DS_FFA36F6DA	0
    Traes_7DS_FFE9ACDAB	716
    no_feature	15387432
    ambiguous	249671
    too_low_aQual	0
    not_aligned	25066352
    alignment_not_unique	0
    So I am not sure if I can ignore those warning messages. Is there any other way to get rid of those warning messages?

  • #2
    Assuming you coordinate sorted when you say "sorted" (rather than sorting by read-name), then you need to use the "-r pos" option. BTW, you don't need to convert back to a SAM file.


    • #3
      Thanks for the reply.

      I used the following to convert SAM to BAM and to sort the BAM:

      samtools view -b -S file.sam > file.bam
      samtools sort file.bam file_sorted
      samtools view -h file_sorted.bam > file_sorted.sam

      Do I need to use "-r pos" with the second one? And can I use the sorted BAM file as well?


      • #4
        You can actually just use "file.sam" or "file.bam" directly (you just need pairs to be next to each other in the file and that'll be the case when it's spat out of the aligner). To use "file_sorted.bam" you'll need "-r pos"
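The "pairs next to each other" condition can be checked before running htseq-count. The following is a minimal sketch in plain Python, not part of HTSeq; the function name `mates_are_adjacent` and the toy records are made up for illustration, and it assumes that secondary and supplementary alignments can simply be skipped.

```python
# SAM FLAG bits used here (from the SAM specification).
PAIRED = 0x1
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def mates_are_adjacent(sam_lines):
    """Return True if every paired primary record is directly followed by its mate.

    Hypothetical helper: it only compares read names (column 1) of
    consecutive primary records and ignores header lines.
    """
    pending = None  # read name still waiting for its mate
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # header or blank line
        fields = line.rstrip("\n").split("\t")
        qname, flag = fields[0], int(fields[1])
        if flag & (SECONDARY | SUPPLEMENTARY) or not flag & PAIRED:
            continue  # assumption: only primary paired records matter
        if pending is None:
            pending = qname
        elif pending == qname:
            pending = None  # mate found right after its partner
        else:
            return False  # a different read sits between the mates
    return pending is None


# Toy records: as emitted by an aligner (mates together) ...
grouped = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "r1\t99\tchr1\t100\t50\n",
    "r1\t147\tchr1\t200\t50\n",
    "r2\t99\tchr1\t150\t50\n",
    "r2\t147\tchr1\t250\t50\n",
]
# ... and the same records after sorting by position (mates split up).
by_position = [grouped[1], grouped[3], grouped[2], grouped[4]]

print(mates_are_adjacent(grouped))
print(mates_are_adjacent(by_position))
```

For a real file, the lines could be read with `open("file.sam")`; a position-sorted file would be expected to fail this check, which is the situation where "-r pos" is needed.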


        • #5
          Thanks Ryan !! I got it now, I need to do the following

          If I am using
          HTML Code:
          samtools sort file.bam file_sorted
          then I should use "-r pos"
          and if I am using
          HTML Code:
          samtools sort -n  file.bam file_sorted
          then I should use "-r name"
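The rule above can also be read off the file itself when the SAM header records the sort order. This is a rough sketch, assuming the sorting tool kept the `SO:` tag on the `@HD` header line up to date (older tools may not); `suggest_order_flag` is a hypothetical helper, and "pos" and "name" are the two `-r` values mentioned in this thread.

```python
def suggest_order_flag(sam_header_lines):
    """Map the @HD SO: tag to an htseq-count -r value, or None if unknown.

    Hypothetical helper: returns "pos" for coordinate-sorted files,
    "name" for read-name-sorted files, and None otherwise.
    """
    mapping = {"coordinate": "pos", "queryname": "name"}
    for line in sam_header_lines:
        if not line.startswith("@HD"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SO:"):
                return mapping.get(field[3:])
    return None  # no @HD line or no SO: tag


print(suggest_order_flag(["@HD\tVN:1.4\tSO:coordinate\n"]))
print(suggest_order_flag(["@HD\tVN:1.4\tSO:queryname\n"]))
```

A result of `None` (for example with `SO:unsorted` or a missing header) means the header does not settle the question, and the file order has to be checked another way.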


          • #6
            Hi, sbdk82
            I had this problem before. In the end, I figured out that the cause was mismatched read names between the paired-end reads.
            Let me explain: HTSeq pairs up the reads according to their read names, so you should sort your bam/sam file by read name. After mapping to the reference, the two reads of a pair have exactly the same read name in the output bam/sam file, like this:
            HWI-ST1258:115:C28MCACXX:6:2201:6882:85446 99 chr1 3215877 50
            HWI-ST1258:115:C28MCACXX:6:2201:6882:85446 147 chr1 3215911 50
            As you can see, read1 and read2 share the same read name, and the bam/sam file uses the flag in the second column (the values 99 and 147) to distinguish read1 from read2. But in your bam/sam output file, judging from the information you provided, I think the names of read1 and read2 are not exactly the same. Do they look like this?
            HWI-ST1023:203:H7NK5ADXX:1:1106:2557:97816
            HWI-ST1023:203:H7NK5ADXX:2:1106:2557:97816
            The "1" and "2" do not match. If you only sort the bam/sam file by read name, without changing the read names so that read1 and read2 are identical, you will still see the warnings when running HTSeq.
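To make the flag values 99 and 147 concrete, here is a small sketch that decodes a SAM FLAG into its named bits. The bit meanings come from the SAM specification; the function name `decode_flag` and the short labels are chosen for illustration only.

```python
# Lowest eight SAM FLAG bits, in ascending order.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


print(decode_flag(99))   # ['paired', 'proper_pair', 'mate_reverse', 'read1']
print(decode_flag(147))  # ['paired', 'proper_pair', 'reverse', 'read2']
```

So 99 marks the first read of a properly paired fragment whose mate is on the reverse strand, and 147 marks the second read, itself on the reverse strand.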
            Last edited by wisense; 03-18-2015, 07:00 PM. Reason: spell correction


            • #7
              In read names from Illumina, the 1 and 2 that you highlighted denote the lane (unless someone made the unwise decision to play with the read names). It's more likely that the sample was just sequenced on two lanes and then merged.
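As an illustration of where the lane sits, the sketch below splits a read name of this style into its colon-separated fields, assuming the usual Illumina layout of instrument, run, flowcell, lane, tile, x and y. The names `IlluminaName` and `parse_read_name` are hypothetical.

```python
from typing import NamedTuple


class IlluminaName(NamedTuple):
    instrument: str
    run: str
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int


def parse_read_name(name):
    """Split an Illumina-style read name into its seven colon-separated fields."""
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    return IlluminaName(instrument, run, flowcell, int(lane), int(tile), int(x), int(y))


# The two names quoted in post #6 differ only in the lane field.
a = parse_read_name("HWI-ST1023:203:H7NK5ADXX:1:1106:2557:97816")
b = parse_read_name("HWI-ST1023:203:H7NK5ADXX:2:1106:2557:97816")
print(a.lane, b.lane)  # 1 2
```

Two names that differ in the lane field therefore belong to different fragments, not to the two mates of one pair.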


              • #8
                Hi dpryan,
                Yes, you are probably right. In my case, the unmatched names of the paired reads were the exact cause of the problem.
