  • Output Alignment of TopHat 2X the number of left and right reads

    Hi Guys,

    I don't get it... Here's the prep reads info from TopHat 2.04:

    left_min_read_len=100
    left_max_read_len=100
    left_reads_in =15696378
    left_reads_out=15695101
    right_min_read_len=100
    right_max_read_len=100
    right_reads_in =15696378
    right_reads_out=14783689

    Here's the flagstat...

    rna-ctrl-001-5-accepted_hits.statBam
    30790076 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    30790076 + 0 mapped (100.00%:nan%)
    30790076 + 0 paired in sequencing
    17077256 + 0 read1
    13712820 + 0 read2
    23679570 + 0 properly paired (76.91%:nan%)
    26248024 + 0 with itself and mate mapped
    4542052 + 0 singletons (14.75%:nan%)
    466020 + 0 with mate mapped to a different chr
    65746 + 0 with mate mapped to a different chr (mapQ>=5)

    tophat -G ./Genes/genes.gtf -r 100 --mate-std-dev 50 -p 16 -o ./5refAnnot --b2-sensitive --no-novel-juncs ./5_Index-8.001_VEH_CNTL_R1.fastq ./5_Index-8.001_VEH_CNTL_R2.fastq

    Why is the total 2X the number of reads in a single input file (left + right)? I would expect around 15M mapped reads.

    Thanks for your comments/explanations.

    R

  • #2
    -Update

    There are a lot of QName duplicates, with the only difference being the TLEN (one positive and one negative).

    Any reason why I would have so many duplicates?



    • #3
      Have you checked the total read counts from before alignment? Or the tophat.log file?

      I'm a bit confused as to what you're seeing as wrong here... I'm seeing about 600k reads unmapped by adding right reads in, left reads in, then subtracting mapped reads in the flagstat.

      The numbers look okay though. Left and right reads add up to more than the mapped reads.
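The arithmetic in this reply can be checked with a short sketch. The variable names are only illustrative; the values are copied from the prep reads info and the flagstat output in the first post. Note that flagstat counts alignment records rather than distinct reads, so when reads map to several places the difference is only a rough estimate of the unmapped reads.

```python
# Values copied from the first post (prep reads info and flagstat).
left_reads_in = 15_696_378
right_reads_in = 15_696_378
flagstat_mapped = 30_790_076  # "mapped" line of flagstat (alignment records)

# Each pair contributes two reads, so the input total is left + right.
total_input_reads = left_reads_in + right_reads_in
difference = total_input_reads - flagstat_mapped

print(total_input_reads)  # 31392756
print(difference)         # 602680
```

So a total near 30M, not 15M, is what one would expect from about 15.7M read pairs.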



      • #4
        Hi Ramma,

        So is it normal to see this kind of thing in the SAM file, i.e. the same Qname twice with the positions inverted? I have this for so many reads...

        HWI-ST915:13413WGACXX:5:2305:13760:142891 355 chr1 3127915 0 100M = 3127918 103 TATCGATGCAAAAATCCTCAATAAAATTCTCGCTAACCGAATCCAAGAACACATTAAAACAATCATCCATCCTGACCAAGTAGGTTTTATTCCAGGGATG CCCFFFFFGHHHHJJIIJJJJJJJJJJJIJJJJJJJJJJJJJIJJJJJIIJJJJIJIJIJJAC77?E3?;B@>;;;3@?A(;ACACD@@CCDCADD8892 AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:20 CC:Z:= CP:i:35213575 HI:i:0
        HWI-ST915:13413WGACXX:5:2305:13760:142891 403 chr1 3127918 0 100M = 3127915 -103 CGATTCAAAAATCCTCAATAAAATTCTCGCTAACCGAATCCAAGAACACATTAAAACAATCATCCATCCTGACCAAGTAGGTTTTATTCCAGGGATGCAG #######DCA;(;C@;AA>CCC@AFD=<A@;8HDEFBC=8F=DB>GB9AD>G@IIEFBIJJJIIHHHJJIHHJJJIJIJJJJJJIJJHHHHGFFFFFCCC AS:i:-2 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:4G95 YT:Z:UU NH:i:20 CC:Z:= CP:i:35213578 HI:i:0


        Cheers,

        R
        Last edited by RemitoAmigo; 10-23-2012, 11:57 AM.



        • #5
          What is giving you the SAM file? Tophat should output BAM files for the alignments, unless you are using the option --no-convert-bam or maybe if you're missing samtools.

          I'm not entirely sure what I'm looking at there, besides that it has some of the information that SAM and BAM files contain. Maybe take a look at the SAM format info here.



          • #6
            Hi Ramma,

            The SAM file is generated with samtools view, from the BAM that was generated by TopHat.

            You're looking at a sample of my SAM file where 2 entries have the same Qname but positions (start and end) are inverted. I was wondering if this is normal?

            In my mind, I think I should only see 1 of those 2 records, since they basically give the same info....

            Cheers,

            R



            • #7
              Alright, I think the formatting is just throwing me off. Where's the inversion you mention? I'm pretty sure the multiple entries are fine. The actual sequences are only slightly different, but that would cause them to map differently. You could open the accepted_hits.bam file in IGV and visually inspect the aligned reads to see.

              I also see multiple entries like this in my alignments. I just scrolled down a few entries in a bam file and here's an example.

              Code:
              DH1DQQN1:242:C143BACXX:1:2109:12744:63815       369     V2LHS_100066    1       3       22M     V2LHS_80283     1       0       CCCGTTGAATATCACACTGAAT  JJJJJJJJIIIHHCCGJJJJJJ  XA:i:0  MD:Z:22 NM:i:0  NH:i:2  CC:Z:=  CP:i:1  HI:i:0
              DH1DQQN1:242:C143BACXX:1:2109:12744:63815       113     V2LHS_100066    1       3       22M     V3LHS_406268    1       0       CCCGTTGAATATCACACTGAAT  JJJJJJJJIIIHHCCGJJJJJJ  XA:i:0  MD:Z:22 NM:i:0  NH:i:2  HI:i:1
              As for an explanation, I'm not entirely sure what that would be.
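One way to look at the two records above is to pull out the NH and HI optional tags, which the SAM specification defines as the number of reported alignments for the query and the index of the current hit. The sketch below is a minimal illustration, not part of any tool; `parse_tags` is a hypothetical helper, and the records are the two lines from this post with whitespace normalised.

```python
# The two records from the post above (whitespace-separated here).
records = [
    "DH1DQQN1:242:C143BACXX:1:2109:12744:63815 369 V2LHS_100066 1 3 22M "
    "V2LHS_80283 1 0 CCCGTTGAATATCACACTGAAT JJJJJJJJIIIHHCCGJJJJJJ "
    "XA:i:0 MD:Z:22 NM:i:0 NH:i:2 CC:Z:= CP:i:1 HI:i:0",
    "DH1DQQN1:242:C143BACXX:1:2109:12744:63815 113 V2LHS_100066 1 3 22M "
    "V3LHS_406268 1 0 CCCGTTGAATATCACACTGAAT JJJJJJJJIIIHHCCGJJJJJJ "
    "XA:i:0 MD:Z:22 NM:i:0 NH:i:2 HI:i:1",
]

def parse_tags(sam_line):
    """Hypothetical helper: return (qname, flag, tags) for one SAM line."""
    fields = sam_line.split()
    tags = {}
    for item in fields[11:]:  # optional fields follow the 11 mandatory ones
        name, typ, value = item.split(":", 2)
        tags[name] = int(value) if typ == "i" else value
    return fields[0], int(fields[1]), tags

parsed = [parse_tags(line) for line in records]
for qname, flag, tags in parsed:
    is_secondary = bool(flag & 0x100)  # 0x100 = secondary alignment
    print(qname, flag, "NH =", tags["NH"], "HI =", tags["HI"],
          "secondary =", is_secondary)
```

Both lines carry NH:i:2 with hit indices 0 and 1, and only the first has the secondary bit set, which would be consistent with one read being reported at two locations rather than with a duplicated read.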



              • #8
                Hi Ramma,

                Here's the inversion...
                1st read: chr1 3127915 0 100M = 3127918
                2nd read: chr1 3127918 0 100M = 3127915

                From the SAM Format Specification document: "Reads/segments having identical Qname are regarded to come from the same template"

                I know that it is common to have duplicates in RNA-seq, but from the same template? I wonder whether I should remove reads with the same template (Qname). Illumina must take the X and Y location on the flowcell into account, so these would not be real duplicates, i.e. reads that originated from 2 fragment copies of mRNA. Does that make sense?
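In the SAM format, PNEXT holds the mate's position and TLEN is signed (positive for the leftmost segment, negative for the rightmost), so the two mates of a pair normally look "inverted" relative to each other. A minimal sketch of that check, using only the first nine columns of the two records from post #4; `parse_core` and `looks_like_mates` are hypothetical names introduced for illustration:

```python
from collections import namedtuple

Core = namedtuple("Core", "qname flag rname pos mapq cigar rnext pnext tlen")

def parse_core(line):
    """Hypothetical helper: parse the first nine SAM columns of one line."""
    f = line.split()[:9]
    return Core(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                f[5], f[6], int(f[7]), int(f[8]))

def looks_like_mates(a, b):
    """True if a and b mirror each other the way two mates of a pair do."""
    return (a.qname == b.qname
            and a.pos == b.pnext and a.pnext == b.pos   # positions swapped
            and a.tlen == -b.tlen                       # opposite TLEN signs
            and bool(a.flag & 0x40) != bool(b.flag & 0x40))  # read1 vs read2

# First nine columns of the two records from post #4 (SEQ, QUAL, tags omitted).
rec1 = parse_core("HWI-ST915:13413WGACXX:5:2305:13760:142891 355 chr1 3127915 0 100M = 3127918 103")
rec2 = parse_core("HWI-ST915:13413WGACXX:5:2305:13760:142891 403 chr1 3127918 0 100M = 3127915 -103")

print(looks_like_mates(rec1, rec2))  # True
```

If this holds for the "duplicated" Qnames in general, they would be read 1 and read 2 of the same fragment, and removing one of them would discard half of each pair.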



                • #9
                  Originally posted by RemitoAmigo
                  -Update

                  There is a lot of QName duplicates with the only difference being the TLEN (one positive and one negative)

                  Any reasons why would I have so many duplicates?
                  The bitwise FLAGs (355 and 403, the second field of each alignment line) indicate that these are the two reads of a pair from a paired-end sequencing protocol. So they are not "duplicates".
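The two FLAG values can be decoded bit by bit. The sketch below uses the bit meanings from the SAM format specification; the short labels and the `decode_flag` helper are made up here for illustration.

```python
# Bit values as defined in the SAM specification; labels are illustrative.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "read1"),
    (0x80, "read2"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

def decode_flag(flag):
    """Hypothetical helper: list the labels of all bits set in a SAM FLAG."""
    return [name for bit, name in FLAG_BITS if flag & bit]

print(355, decode_flag(355))
# 355 ['paired', 'proper_pair', 'mate_reverse', 'read1', 'secondary']
print(403, decode_flag(403))
# 403 ['paired', 'proper_pair', 'reverse', 'read2', 'secondary']
```

One record is read 1 and the other is read 2 of the same pair, and neither has the duplicate bit (0x400) set. Both also carry the secondary bit, which fits the NH:i:20 tag on those lines: this pair appears to have been reported at several locations.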
