Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • tophat: mapping percentages

    Hi all.

    I have a question about tophat (v > 1.3.1). I am running tophat with 72 single end reads from RNA dataset (Illumina Genome Analyzer was used); I have both a gff3 file and a reference genome. Reading logs/bowtie.left_kept_reads.fixmap.log I have:

    # reads processed: 7333307
    # reads with at least one reported alignment: 4403978 (60.05%)
    # reads that failed to align: 2928767 (39.94%)
    # reads with alignments suppressed due to -m: 562 (0.01%)
    Reported 4826602 alignments to 1 output stream(s)

    So I expected to find no more than 4826602 alignments in accepted_hits.bam.
    That is not right: after converting accepted_hits.bam to sam format I have:

    wc -l accepted_hits.sam
    6121159 accepted_hits.sam

    Each line should correspond to an alignment but bowtie report talks about 4826602 alignments. Can anyone tell me, please, what am I missing?

    Thanks a lot.

  • #2
    What does samtools view -c accepted_hits.sam tell you ? Counting lines is the wrong way to do it because there are other lines in the SAM file as well.

    Comment


    • #3
      I think how it works is as follows:

      The bowtie.left_kept_reads.fixmap.log file indicates how many sequences were able to align when the entire sequence was used, in your case, 72bp. But since the reads are from RNA, a lot of them span introns and won't align. After this initial alignment, Tophat breaks the sequence down into smaller pieces (25bp by default) and tries to align these, which will let it align additional sequences.

      If you open up one of the bowtie.left_kept_reads_seg#.fixmap.log files, you should see that the first line, "reads processed", is 2928767, ie. the reads that failed to align in the first attempt.
      Last edited by biznatch; 12-12-2011, 08:27 PM.

      Comment


      • #4
        Run Samtools flagstat and Picard MappingStats. What you will notice is the difference in the number of mapped reads. Samtools will show more than Picard. This is because samtools tells you how many reads are aligned, while picard will tell you how many unique reads are aligned.

        What you are seeing is an irritating result of the tophat alignments. If a read can map to multiple locations it will report each possible alignment. Therefore, if you input 10 reads and 7 have "at least one reported alignment" you can end up with 12 alignment events (ie. 120% mapping, but only 70% uniquely mapped reads)

        Comment


        • #5
          I got this command: samtools view accepted_hits.bam | cut -f 1 | sort | uniq | wc -l

          from here: http://vallandingham.me/RNA_seq_diff...xpression.html

          It's supposed to tell you how many unique lines and therefore how many unique reads there are in the bam file.

          Comment


          • #6
            Originally posted by biznatch View Post
            I got this command: samtools view accepted_hits.bam | cut -f 1 | sort | uniq | wc -l

            from here: http://vallandingham.me/RNA_seq_diff...xpression.html

            It's supposed to tell you how many unique lines and therefore how many unique reads there are in the bam file.
            right, and for total reads counts the lines of your fastq and divide by 4.

            Comment


            • #7
              That command is wrong. What you're extracting is the first column from the BAM file which is a read ID, like GAPC:2:86:13315:7719#0

              Every read has it's own ID, so you're not getting unique reads but just all of the reads you had before.

              Comment


              • #8
                Originally posted by Dario1984 View Post
                That command is wrong. What you're extracting is the first column from the BAM file which is a read ID, like GAPC:2:86:13315:7719#0

                Every read has it's own ID, so you're not getting unique reads but just all of the reads you had before.
                you're right about each read having a unique name, but reads with the same id will sometimes map multiple times. thus, giving you multiple entries in the bam with the same read name.

                Comment


                • #9
                  Ah, good point. Thanks.

                  Comment


                  • #10
                    reads with the same id will sometimes map multiple times. thus, giving you multiple entries in the bam with the same read name.
                    Correct, but a uniq on the list will give you a set of read IDs that mapped at least 1 time. This list will not include reads that did not align. This is because the BAM file from tophat (v1.3.2) only reports aligned reads.

                    However, the command does not work for paired-end data. This is because the read1/read2 flag is stripped off in the read ID in the BAM. For example, if the read ID is 'GAPC:2:86:13315:7719#0\1' for read 1 and 'GAPC:2:86:13315:7719#0\2' for read 2 in the fastq files, the read ID in the BAM file will be 'GAPC:2:86:13315:7719#0. Thus, a uniq on the read IDs will erroneously disregard the read1/read2 information.

                    Comment


                    • #11
                      Originally posted by nrockweiler View Post
                      However, the command does not work for paired-end data. This is because the read1/read2 flag is stripped off in the read ID in the BAM. For example, if the read ID is 'GAPC:2:86:13315:7719#0\1' for read 1 and 'GAPC:2:86:13315:7719#0\2' for read 2 in the fastq files, the read ID in the BAM file will be 'GAPC:2:86:13315:7719#0. Thus, a uniq on the read IDs will erroneously disregard the read1/read2 information.
                      it's a good point, but this thread was to answer the original poster's question about single end reads.

                      Comment


                      • #12
                        Originally posted by linsson View Post
                        Hi all.

                        I have a question about tophat (v > 1.3.1). I am running tophat with 72 single end reads from RNA dataset (Illumina Genome Analyzer was used); I have both a gff3 file and a reference genome.
                        What was your command line for running single end reads?
                        I have a running paired end command line.

                        The only thing I can come up with is:

                        tophat -o path/to/file -p 6 --library-type fr-firststrand --b2-very-sensitive --no-coverage-search --GTF file.gtf genome_reference single_read.fastq

                        Thanks in advance!

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Advancing Precision Medicine for Rare Diseases in Children
                          by seqadmin




                          Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                          12-16-2024, 07:57 AM
                        • seqadmin
                          Recent Advances in Sequencing Technologies
                          by seqadmin



                          Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                          Long-Read Sequencing
                          Long-read sequencing has seen remarkable advancements,...
                          12-02-2024, 01:49 PM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 12-17-2024, 10:28 AM
                        0 responses
                        26 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 12-13-2024, 08:24 AM
                        0 responses
                        43 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 12-12-2024, 07:41 AM
                        0 responses
                        29 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 12-11-2024, 07:45 AM
                        0 responses
                        42 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X