Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Questions about Top-Hat and Paired End Reads

    Hello.

    I am working with transcriptome data from a new HiSeq Machine, 100xbp Paired End Reads, and trying to map it to the newly released Brassica rapa genome using tophat.

    A)General question - I get the BAM output after using both mate-pairs and convert it to SAM; the Bit-wise flags are bizarre numbers; 99, 147,163, etc, rather than what I am used to seeing in runs where there are no mate pairs, i.e. just seeing 0 and 16, etc...
    If I just put one of the mate pairs into Tophat, I get the normal bit flags. What do these new bitflags tell me? Is there a table somewhere of these new bitflags? I realize they have to do with the fact that each 100bp read in this case is a new line of the SAM file, even if it is only one half of a mate-pair. Mate pairs also seem to have different flags, i.e. one half of the pair will have one flag, and the next member of the pair on the next line will have another (99,147, for example). What is all this about?

    I like to have the BAM and SAM files, and converting it this way leaves the SAM file free of a header, so its ready to use in a custom pipe.

    B)I took the first 10000 reads (first 40k lines of each .fastq file) and tried to play with mapping settings to get an idea of how rough the genome was, what settings would be best, etc, as it is newly published and potentially has gaps. I have few questions here:

    B1)So I tried doing both mate pairs separately at first,then together...
    Code:
    tophat -g 1 --segment-mismatches 3 --library-type fr-unstranded --butterfly-search -r 200 /home/quorteth/bowtie-0.12.7/indexes/Brassica_Rapa Rapa_1_1.fq
    tophat -g 1 --segment-mismatches 3 --library-type fr-unstranded --butterfly-search -r 200 /home/quorteth/bowtie-0.12.7/indexes/Brassica_Rapa Rapa_1_2.fq
    tophat -g 1 --segment-mismatches 3 --library-type fr-unstranded --butterfly-search -r 200 /home/quorteth/bowtie-0.12.7/indexes/Brassica_Rapa Rapa_1_1.fq Rapa_1_2.fq
    I get 5009/10000 mapping (by lines in the headerless SAM file) for the first half of the mate pair.
    I get 4880/10000 mapping for the second half of the mate pair.
    When I give it both pieces of information at once, I get 9987/20000 reads mapping, wheras the sum of the two is only 9889/20000 - So Tophat is using information from one mate pair to help uniquely assign(-g 1) the other it seems, as it did pick up another 98 reads, by my understanding. Is this correct reasoning? I had always been told that Tophat maps things independently.

    B2)Still confused about how Top-hat interprets the mate-pair, i tried not setting unique, i.e *not* setting the max multi-hits to 1;
    Code:
    tophat --segment-mismatches 3 --library-type fr-unstranded --butterfly-search -r 100 /home/quorteth/bowtie-0.12.7/indexes/Brassica_Rapa Rapa_1_1.fq Rapa_1_2.fq
    And then got my BAM->SAM, and that upped the number of lines to ~14000, out of a possible 20000 (i.e. 10000 from both parts of the mate pairs), ~75% mapping. That's acceptable.

    My question is basically is about how tophat interprets each member of the mate pair; when I am using -g 1, I get significantly less reads mapping, which is normally very understandable compared to not using it, especially given a duplicate rich genome such as this one. However, again, working on the premise that tophat (as I have been told) maps the mate pairs independently, does it apply the no multi-mapping requirement to the entire PAIR or to each 100bp member individually?

    For example, if I have a mate pair read for which P_1 maps to gene A or gene B, but P_2 maps exclusively to gene B, forcing the mate-pair to actually land on gene B, what happens?
    1)Tophat maps P_1, cant find a unique hit, throws it away, maps P_2, finds it exclusive to Gene B, and keeps P_2 Exclusively (1/2 of the reads map)
    2)Tophat maps P_1 and P_2, finds that the only rational place of origin for this mate pair is Gene B, and maps both properly to Gene B
    3)Or even more complex, if P_1 was in Gene A and B, P_2 was In Gene B and C, if P_1 and P_2 are mate paired, would tophat throw out each in turn, or would it correctly map the mate pair to Gene B?

    Basically, is -g 1 applying to the PAIR or each member individually? I wish to have the power to resolve expression differences between genes with similar sequences, and I only want to quantify reads as originating from a gene if that read can ultimately only be attributed to one gene.

    Thanks in advance for any input on this problem. =)

  • #2
    I have the same question, does anyone have any clue? Thanks

    Comment


    • #3
      See the SAM spec for flags, http://samtools.sourceforge.net/, link at the top right. You can also use this to explain them http://picard.sourceforge.net/explain-flags.html.

      You don't have to use "--library-type fr-unstranded" that is the default setting (I know it doesn't say that in the manual but one of the Tophat devs posted that on here somewhere).

      I'm not sure about the -g stuff.
      Last edited by biznatch; 12-02-2011, 09:32 AM.

      Comment


      • #4
        Basically, is -g 1 applying to the PAIR or each member individually?
        From what I can tell, -g is passed by TopHat directly to bowtie as the -k (maximum number of alignments reported) and -m (suppress all alignments if >n are found) parameters. As each mate is aligned separately, -g 1 is applied to each member individually. Information from one mate is not used to direct the other mate to the correct location.

        TopHat splits each read into a number of segments (set by --segment-length) that are aligned independently. Interestingly, it appears the values of -k and -m passed to bowtie in these alignments are twice that specified by -g. Anyone know why that is?

        I wish to have the power to resolve expression differences between genes with similar sequences, and I only want to quantify reads as originating from a gene if that read can ultimately only be attributed to one gene.
        My sense is that you're going to bias expression quantification with this approach. Just because one copy of the gene has more unique sequences than a few of its paralogs doesn't necessarily mean it is more highly expressed. Cufflinks is pretty good at estimating expression in situations like this--have your -g 20 (default) runs produced bad results? If there are only a handful of paralogs, you might try identifying unique regions from a multiple alignment and going from there.

        And then got my BAM->SAM, and that upped the number of lines to ~14000, out of a possible 20000 (i.e. 10000 from both parts of the mate pairs), ~75% mapping. That's acceptable.
        Each line can contain only one mapping, and reads aligning to multiple locations, or multiple ways a single read can align to the same position (i.e. they have different CIGAR strings) will appear on multiple lines. Can you confirm that the number of lines accurately represents the number of aligned reads?

        Perhaps compare the results to:

        Code:
        samtools view accepted_hits.bam | awk '{printf $1"\n"}' | sort -u | wc -l
        Hope this helps!


        Andrew

        Comment


        • #5
          Hello, I'm still not very clear about the -g stuff for the paired end reads. For example, if one mate has only one match to the reference genome, but the other mating read have several matches, will tophat report the paired-end read when setting -g=1?

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Current Approaches to Protein Sequencing
            by seqadmin


            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
            04-04-2024, 04:25 PM
          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 04-11-2024, 12:08 PM
          0 responses
          30 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 10:19 PM
          0 responses
          32 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 09:21 AM
          0 responses
          28 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-04-2024, 09:00 AM
          0 responses
          53 views
          0 likes
          Last Post seqadmin  
          Working...
          X