Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • tophat: mapping percentages

    Hi all.

    I have a question about tophat (v > 1.3.1). I am running tophat with 72 single end reads from RNA dataset (Illumina Genome Analyzer was used); I have both a gff3 file and a reference genome. Reading logs/bowtie.left_kept_reads.fixmap.log I have:

    # reads processed: 7333307
    # reads with at least one reported alignment: 4403978 (60.05%)
    # reads that failed to align: 2928767 (39.94%)
    # reads with alignments suppressed due to -m: 562 (0.01%)
    Reported 4826602 alignments to 1 output stream(s)

    So I expected to find no more than 4826602 alignments in accepted_hits.bam.
    That is not right: after converting accepted_hits.bam to sam format I have:

    wc -l accepted_hits.sam
    6121159 accepted_hits.sam

    Each line should correspond to an alignment but bowtie report talks about 4826602 alignments. Can anyone tell me, please, what am I missing?

    Thanks a lot.

  • #2
    What does samtools view -c accepted_hits.sam tell you ? Counting lines is the wrong way to do it because there are other lines in the SAM file as well.

    Comment


    • #3
      I think how it works is as follows:

      The bowtie.left_kept_reads.fixmap.log file indicates how many sequences were able to align when the entire sequence was used, in your case, 72bp. But since the reads are from RNA, a lot of them span introns and won't align. After this initial alignment, Tophat breaks the sequence down into smaller pieces (25bp by default) and tries to align these, which will let it align additional sequences.

      If you open up one of the bowtie.left_kept_reads_seg#.fixmap.log files, you should see that the first line, "reads processed", is 2928767, ie. the reads that failed to align in the first attempt.
      Last edited by biznatch; 12-12-2011, 08:27 PM.

      Comment


      • #4
        Run Samtools flagstat and Picard MappingStats. What you will notice is the difference in the number of mapped reads. Samtools will show more than Picard. This is because samtools tells you how many reads are aligned, while picard will tell you how many unique reads are aligned.

        What you are seeing is an irritating result of the tophat alignments. If a read can map to multiple locations it will report each possible alignment. Therefore, if you input 10 reads and 7 have "at least one reported alignment" you can end up with 12 alignment events (ie. 120% mapping, but only 70% uniquely mapped reads)

        Comment


        • #5
          I got this command: samtools view accepted_hits.bam | cut -f 1 | sort | uniq | wc -l

          from here: http://vallandingham.me/RNA_seq_diff...xpression.html

          It's supposed to tell you how many unique lines and therefore how many unique reads there are in the bam file.

          Comment


          • #6
            Originally posted by biznatch View Post
            I got this command: samtools view accepted_hits.bam | cut -f 1 | sort | uniq | wc -l

            from here: http://vallandingham.me/RNA_seq_diff...xpression.html

            It's supposed to tell you how many unique lines and therefore how many unique reads there are in the bam file.
            right, and for total reads counts the lines of your fastq and divide by 4.

            Comment


            • #7
              That command is wrong. What you're extracting is the first column from the BAM file which is a read ID, like GAPC:2:86:13315:7719#0

              Every read has it's own ID, so you're not getting unique reads but just all of the reads you had before.

              Comment


              • #8
                Originally posted by Dario1984 View Post
                That command is wrong. What you're extracting is the first column from the BAM file which is a read ID, like GAPC:2:86:13315:7719#0

                Every read has it's own ID, so you're not getting unique reads but just all of the reads you had before.
                you're right about each read having a unique name, but reads with the same id will sometimes map multiple times. thus, giving you multiple entries in the bam with the same read name.

                Comment


                • #9
                  Ah, good point. Thanks.

                  Comment


                  • #10
                    reads with the same id will sometimes map multiple times. thus, giving you multiple entries in the bam with the same read name.
                    Correct, but a uniq on the list will give you a set of read IDs that mapped at least 1 time. This list will not include reads that did not align. This is because the BAM file from tophat (v1.3.2) only reports aligned reads.

                    However, the command does not work for paired-end data. This is because the read1/read2 flag is stripped off in the read ID in the BAM. For example, if the read ID is 'GAPC:2:86:13315:7719#0\1' for read 1 and 'GAPC:2:86:13315:7719#0\2' for read 2 in the fastq files, the read ID in the BAM file will be 'GAPC:2:86:13315:7719#0. Thus, a uniq on the read IDs will erroneously disregard the read1/read2 information.

                    Comment


                    • #11
                      Originally posted by nrockweiler View Post
                      However, the command does not work for paired-end data. This is because the read1/read2 flag is stripped off in the read ID in the BAM. For example, if the read ID is 'GAPC:2:86:13315:7719#0\1' for read 1 and 'GAPC:2:86:13315:7719#0\2' for read 2 in the fastq files, the read ID in the BAM file will be 'GAPC:2:86:13315:7719#0. Thus, a uniq on the read IDs will erroneously disregard the read1/read2 information.
                      it's a good point, but this thread was to answer the original poster's question about single end reads.

                      Comment


                      • #12
                        Originally posted by linsson View Post
                        Hi all.

                        I have a question about tophat (v > 1.3.1). I am running tophat with 72 single end reads from RNA dataset (Illumina Genome Analyzer was used); I have both a gff3 file and a reference genome.
                        What was your command line for running single end reads?
                        I have a running paired end command line.

                        The only thing I can come up with is:

                        tophat -o path/to/file -p 6 --library-type fr-firststrand --b2-very-sensitive --no-coverage-search --GTF file.gtf genome_reference single_read.fastq

                        Thanks in advance!

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Current Approaches to Protein Sequencing
                          by seqadmin


                          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                          04-04-2024, 04:25 PM
                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 04-11-2024, 12:08 PM
                        0 responses
                        26 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 10:19 PM
                        0 responses
                        29 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 09:21 AM
                        0 responses
                        25 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-04-2024, 09:00 AM
                        0 responses
                        52 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X