  • Tophat 2.0.0 number of reads

    Apart from the problems I reported when using more than one core (and the random failures there), I've run into some other issues:

    Code:
    $ wc -l ../reads/SRR306839.fastq
    75400120 ../reads/SRR306839.fastq
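    (Each FASTQ record spans four lines: header, sequence, '+', qualities; so reads = lines / 4. A minimal sketch of that arithmetic, with a toy file standing in for the real path:)

    ```shell
    # toy FASTQ with two records; 'reads.fastq' stands in for the real file
    printf '@r1\nACGT\n+\nIIII\n@r2\nTTGG\n+\nIIII\n' > reads.fastq

    # one FASTQ record = 4 lines, so the read count is the line count / 4
    awk 'END { print NR / 4 }' reads.fastq
    ```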
    Thus I have 18850030 reads. Tophat reports:
    Code:
    18645993 reads; of these:
      18645993 (100.00%) were unpaired; of these:
        7576936 (40.64%) aligned 0 times
        5480546 (29.39%) aligned exactly 1 time
        5588511 (29.97%) aligned >1 times
    59.36% overall alignment rate
    Ok, maybe some of them were rejected a priori for some quality issue.
    Then:
    Code:
    $ samtools flagstat accepted_hits.bam
    41956190 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    41956190 + 0 mapped (100.00%:-nan%)
    0 + 0 paired in sequencing
    0 + 0 read1
    0 + 0 read2
    0 + 0 properly paired (-nan%:-nan%)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (-nan%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    I have a lot of lines in the .sam file without read IDs (in another dataset this caused problems with htseq-count; here it doesn't, but since I never noticed anything like this in the past I would like to understand it). Is this ok? Are lines without IDs reported for reads mapping to more than one position on the transcriptome/genome?
    The SAM format guide says that missing read names should be marked with a '*'...
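    (For anyone wanting to quantify this: a sketch that counts alignment records whose read-name field is '*' or empty, assuming the BAM has first been converted to SAM text, e.g. with `samtools view`. The toy SAM below is made up for illustration:)

    ```shell
    # toy SAM: one header line, one normal record, one record with QNAME '*'
    printf '@HD\tVN:1.0\nr1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\n*\t0\tchr1\t200\t50\t4M\t*\t0\t0\tACGT\tIIII\n' > hits.sam

    # count alignment lines (headers start with '@') whose QNAME field (field 1)
    # is '*' or empty; the SAM spec uses '*' for an unknown read name
    awk -F'\t' '!/^@/ && ($1 == "*" || $1 == "") { n++ } END { print n + 0 }' hits.sam
    ```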

  • #2
    With tophat 2.0.4 the logs are similar but:
    Code:
    $ samtools flagstat accepted_hits.bam 
    10819957 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    10819957 + 0 mapped (100.00%:-nan%)
    0 + 0 paired in sequencing
    0 + 0 read1
    0 + 0 read2
    0 + 0 properly paired (-nan%:-nan%)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (-nan%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    I'm beginning to think that maybe I should try STAR. Or that I definitely need 3 months to perform "unit tests" even on these well-known, widely used tools presented in the literature.



    • #3
      Ok, maybe some of them were rejected a priori for some quality issue.
      How many reads is tophat telling you it is filtering in prep_reads.info?
      18645993 reads; of these:
      18645993 (100.00%) were unpaired; of these:
      7576936 (40.64%) aligned 0 times
      5480546 (29.39%) aligned exactly 1 time
      5588511 (29.97%) aligned >1 times
      59.36% overall alignment rate
      How is the overall quality of your reads, especially the ends? Is your sequencing machine calling bases irrespective of the quality of each base, or does it start calling an N at low quality?

      Tophat, unlike BWA, does not clip reads to remove low-quality ends. If your ends are <=q20 and your sequencer force-called the bases, you may have nucleotides at the end that make your reads unalignable, because those ends prevent Tophat from positioning the read where it belongs - you're getting too many mismatches.

      I'm beginning to think that maybe I should try STAR.
      If you think trying a new tool on the block that has not been used that much is a "safer option", especially one that you think will save you from needing to do
      Or that I definitely need 3 months to perform "unit tests" even on these well-known, widely used tools presented in the literature.
      I think you're wrong.

      Everyone (especially the biologists around me, and I used to be one) thinks that NGS data analysis is easy: a technique, a service that can be provided, rather than something that involves a boatload of time, benchmarking and intellectual effort no less complicated than designing some "pretty" wet-lab experiments. So yes, if you're going to do an analysis you need to know what your tools are doing. The sad state of the field is format incompatibility, weird mappings, and different softwares giving different outputs; you need to look at the data and the biology of your system to figure out what makes sense and what is probably an artefact.



      • #4
        Originally posted by dvanic:
        How many reads is tophat telling you it is filtering in prep_reads.info?
        I had to delete the tophat 2.0.0 results to avoid filling our HD with data while doing "trial and error". Tophat 2.0.4 reports:
        Code:
        prep_reads v2.0.4 (3480M)
        ---------------------------
        204037 out of 18850030 reads have been filtered out
        So in this case it is ok.

        Originally posted by dvanic:
        How is the overall quality of your reads, especially the ends? Is your sequencing machine calling bases irrespective of the quality of each base, or does it start calling an N at low quality?

        Tophat, unlike BWA, does not clip reads to remove low-quality ends. If your ends are <=q20 and your sequencer force-called the bases, you may have nucleotides at the end that make your reads unalignable, because those ends prevent Tophat from positioning the read where it belongs - you're getting too many mismatches.
        OK, I see, thanks. Those are public data and the original article aligned them without any trimming; I will check the quality of the ends anyway. But what is puzzling me is the resulting SAM with too many reads, many of them without an ID, not the low percentage of alignment (at least for now).


        Originally posted by dvanic:

        I think you're wrong
        In trying to understand whether the software I'm using is bug-free? About trying a newer, less-used software, I'm not happy, yes... I'd rather have the unit tests, that's it.
        Yup, I only just understood that your two sentences were connected... I will definitely test STAR if I choose to use it, maybe more thoroughly than tophat.
        Originally posted by dvanic:

        Everyone (especially the biologists around me, and I used to be one) thinks that NGS data analysis is easy: a technique, a service that can be provided, rather than something that involves a boatload of time, benchmarking and intellectual effort no less complicated than designing some "pretty" wet-lab experiments. So yes, if you're going to do an analysis you need to know what your tools are doing. The sad state of the field is format incompatibility, weird mappings, and different softwares giving different outputs; you need to look at the data and the biology of your system to figure out what makes sense and what is probably an artefact.
        This is true, in a way. But I don't agree completely: these are high-throughput techniques, so I will never be able to check whether all the results make sense... and the overall idea is to get sensible data. So if two different versions of a software give me different outputs (one of which seems to be outside the standard format it should follow), I'm worried about bugs, not biological soundness. That is an issue I hope to face later.

        Thank you, I will let you know about the quality of the ends... I can also add that the counts from running htseq-count on the accepted_hits.bam files from the two tophat versions had a high and significant Pearson correlation... but I would still like to understand what's the matter with the empty IDs.
        Last edited by EGrassi; 11-08-2012, 06:50 AM.



        • #5
          But what is puzzling me is the resulting SAM with too many reads, many of them without an ID, not the low percentage of alignment (at least for now).
          So, I have seen tophat report more reads in the accepted_hits.bam than in the original fastq, but that was because it would report more than one alignment for some reads. I parsed this out using an awk/uniq pipe on the sam file. I have not seen your situation with no read IDs at all.
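          (A sketch of that kind of check, assuming the BAM has first been dumped to SAM text, e.g. `samtools view accepted_hits.bam > hits.sam`. The toy SAM body below is made up: three alignment records from two distinct reads, so the record count exceeds the distinct-ID count:)

          ```shell
          # toy SAM body: read r1 reported at two positions, r2 at one
          printf 'r1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\nr1\t256\tchr2\t300\t0\t4M\t*\t0\t0\tACGT\tIIII\nr2\t0\tchr1\t500\t50\t4M\t*\t0\t0\tACGT\tIIII\n' > hits.sam

          # alignment records vs. distinct read IDs; a surplus of records means
          # some reads were reported at more than one position
          grep -vc '^@' hits.sam
          cut -f1 hits.sam | sort -u | wc -l
          ```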

          This is true, in a way. But I don't agree completely: these are high-throughput techniques, so I will never be able to check whether all the results make sense... and the overall idea is to get sensible data. So if two different versions of a software give me different outputs (one of which seems to be outside the standard format it should follow), I'm worried about bugs, not biological soundness. That is an issue I hope to face later.
          Before I started doing RNA-Seq data analysis I thought that if a software tool was so widely used it must be reliable, reproducible and sound. After trying/using most of the available software I have come to the conclusion that the field is a "Wild West" at the moment. There are rewards for publishing a tool that works (for the authors, to some extent; in many cases only with their annotations, made with "a custom perl script" - it's not funny how many times that is written in a methods section with no details on the script or on what algorithms it actually uses). The paper will get cited, mostly in reviews of available methods. There is no incentive for most groups to maintain the software, fix bugs and write decent documentation, since this does not get you more papers => more grants => promotions... Basically, I think it's shoddy science, but there is nothing I can do about it, except spend boatloads of time benchmarking, reading this forum and mailing lists about bugs people encounter, and looking at dummy datasets to get an idea of how a particular tool handles a particular scenario.

          And of course there's the brilliant scenario of cufflinks, which has been updated to include a plethora of new methods, but apart from the "How Cufflinks Works" page there is no peer-reviewed update of what has been incorporated and whether, together, it is statistically sound.

          /end rant/

          Coming back to your problem (and I know this is not exactly a solution) - have you tried using 2.0.5/6 on your reads? I've had much better results (I am happier with how the mapping "looks") with the .5 version, since it does seem to improve mapping accuracy by using both the genome and the transcriptome as a reference. Perhaps there was a bug in 2.0.0 that has been fixed in the later versions?...



          • #6
            the '*' fields you speak of sound like how bowtie reports unaligned reads. if you align your reads and output SAM format you get a row for every read in your original FASTQ file; for reads that were not aligned, bowtie puts a '*' in the column where you'd normally see the chromosome name (or whatever reference you're working with).

            i don't know what your overall pipeline looks like, but if tophat is giving you trouble and you need to identify alternative splicing (or isoform-level expression) you might give RSEM or eXpress a try. both use alignments to a transcriptome and use the EM algorithm to disambiguate alignments to multiple isoforms. it was discovered sometime last year that alignment to the transcriptome is much more sensitive than alignment to the genome (i think that's why tophat included it in newer releases) and results in more accurate expression estimates on simulated data. currently RSEM and eXpress appear to have the greatest accuracy for isoform-level expression estimates compared to other methods.

            i saw an evaluation of a few pipelines using the BEERS simulator (http://www.cbil.upenn.edu/BEERS/). tophat->cufflinks was the absolute worst for quantification and very poor for differential expression with cuffdiff. they used simulated data for which they knew the "true" expression and the "true" fold changes between samples, ran the data through several pipelines, and correlated the expression estimates back against the 'true' values. RSEM and eXpress (using bowtie and BWA for alignments) performed the best, with true-count correlations better than r=0.9, while tophat->cufflinks estimates correlated at about r=0.1. in other words, the expression estimates generated with cufflinks looked like random noise compared to the true values. so don't use cufflinks.

            in fact, a simple count method (counting hits per isoform and dividing the contribution of reads that align to multiple targets by the square of the number of features they align to) provides a better estimate of the counts than cufflinks, correlating at about r=0.7, though with a much larger confidence interval than eXpress and RSEM.
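            (A toy sketch of that weighting scheme as i read it: each read contributes 1/n^2 to each of the n features it aligns to. The two-column input format - read ID, feature ID, one line per alignment - is made up for illustration, not any published tool's exact method:)

            ```shell
            # toy alignment table: r1 hits features A and B, r2 hits only A
            printf 'r1 A\nr1 B\nr2 A\n' > alignments.txt

            # first remember every (read, feature) pair and how many features each
            # read hits (n); then credit each feature 1/n^2 per aligned read
            awk '{ n[$1]++; r[NR] = $1; f[NR] = $2 }
                 END { for (i = 1; i <= NR; i++) c[f[i]] += 1 / (n[r[i]] * n[r[i]])
                       for (k in c) print k, c[k] }' alignments.txt | sort
            ```

            (With this toy input, r1 hits two features so it contributes 1/4 to each of A and B, while r2 contributes a full 1 to A.)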

            if you need novel isoform discovery i'd recommend going ahead with the tophat alignments, running cufflinks and generating a GTF annotation for your samples using their recommended pipeline. then make a FASTA reference for your new transcriptome with cufflinks' gffread tool:

            Code:
            gffread -g <genome>.fa -w <transcriptome>.fa <annotation>.gtf
            then build a bowtie index for your new transcriptome, align to it with bowtie or BWA, and quantify expression with eXpress. finally you can use a DE tool in R like edgeR, DESeq or EBSeq to perform the DE testing.
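            (Those remaining steps might look roughly like the commands below; file names are placeholders and the exact flags should be checked against each tool's manual - in particular, eXpress's EM step needs the ambiguous alignments, so the aligner should be told to report multiple alignments per read, e.g. bowtie's -a mode:)

            ```shell
            # 1. build a bowtie index from the transcriptome FASTA produced by gffread
            bowtie-build transcriptome.fa transcriptome_idx

            # 2. align reads to the transcriptome, keeping all alignments per read
            #    and writing SAM output
            bowtie -a -S transcriptome_idx reads.fastq > hits.sam

            # 3. quantify with eXpress, given the target FASTA and the alignments
            express transcriptome.fa hits.sam
            ```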
            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
            Salk Institute for Biological Studies, La Jolla, CA, USA */



            • #7
              First of all, thank you for your help. I would like to add all possible support to your rant; I'm worried that this is shoddy science too.

              I looked at the read qualities with FastQC; for some samples they are terrible, but for the one shown here I don't notice a dramatic drop in quality at the ends of the reads.

              Originally posted by dvanic:

              Coming back to your problem (and I know this is not exactly a solution) - have you tried using 2.0.5/6 on your reads? I've had much better results (am more happy with how the mapping "looks") with the .5 version, since it does seem to improve mapping accuracy by using both genome and transcriptome as a reference. Perhaps there was a bug in 2.0.0 that has been fixed in the later versions?...
              The release notes did not mention anything similar, but I'm definitely willing to try. I'm worried because in the past I used tophat 2.0.0 and did not notice any read-count inconsistency, but bug bites bug...



              • #8
                Originally posted by sdriscoll:
                snip
                I saw BEERS too but did not read the results (bad bad Elena!).
                Thanks for the interesting post. Right now I'm not directly interested in cufflinks' alternate-transcript capabilities, but who knows what the future brings (actually I would really like to stick to my original research interests in transcription factors and leave the deep sequencing field until it becomes more mature... another option would be to develop some new software [not to add noise to the scene - just for internal use, at least for the first year] but here "pure" computer science projects are not the first option).
                If I perform some tests with these other tools I will definitely report the results.

                ps. I read about '*' in the read name field in the SAM format specification.



                • #9
                  I started learning to process RNA-Seq data in 2009. Early versions of Tophat would actually produce a file of totally naive expression values, and we would just compute fold changes on them and pick genes with more than a two-fold change as possibly interesting. One of the postdocs I was working with at the time would look through the list, sorted by fold change, and compare the fold changes and expression levels to what he could see in the bedgraph histogram we loaded into the genome browser. He would cross out genes from the list that looked suspicious (ones with strange coverage that didn't make sense with the expression data), and over the course of several months, along with whatever else he was doing, those gene lists proved to be very useful in his project. Some of that data went into this: http://www.ncbi.nlm.nih.gov/pubmed/21357675

                  I still think this basic approach is viable for gene-level analysis, even with all the work that has been done in the field. Things are better now that running more samples is cheaper, so we can have a few replicates to cut back on false positives. I think that's what will improve things: the cheaper it is per million reads per sample, the more replicates people can run; then there's less need for modeling and estimation in DE testing, and we will get better mean expression estimates per condition.
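                  (That kind of naive fold-change screen is essentially a one-liner. A sketch, assuming a made-up tab-separated table of gene, expression in condition A, expression in condition B, with a +1 pseudocount to avoid dividing by zero:)

                  ```shell
                  # toy expression table: gene, expression in A, expression in B
                  printf 'g1\t10\t1\ng2\t5\t5\ng3\t0\t10\n' > expressions.tsv

                  # keep genes changing more than two-fold in either direction;
                  # the +1 pseudocount guards against zero expression values
                  awk -F'\t' '{ fc = ($2 + 1) / ($3 + 1)
                                if (fc > 2 || fc < 0.5) print $1, fc }' expressions.tsv
                  ```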

                  I agree it is a real Wild West kind of thing, though I think the tools are looking better now than in the past. Tophat has been working for me for years. I like using bowtie to align things quickly, and bowtie2 is great for gapped and local alignments, plus it's fast for long paired reads. Transcriptome assembly is still very mysterious. I'd have thought someone would just release a tool that gives you all of the possible splice variants based on detected exons and junctions, but it seems like people can't help trying to make their software capable of making decisions for the researchers, bypassing that basic level of information. I guess that's the difference between publication-type tools and homemade tools.

                  I'll be digging into the eXpress data tomorrow. So far I like it but I need to spend a lot of time looking at the results and the raw data to see how much of it makes sense.
                  /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                  Salk Institute for Biological Studies, La Jolla, CA, USA */

