  • When to use tophat2's coverage search?

    Hi!
    I'm a bit confused: when should one use tophat2's coverage search? Is there any logic to leaving it on or off for 100bp PE reads, or is this dictated solely by the computational resources one has available?

    Overall, what is YOUR standard practice with using this option?

    I have seen the manual, which states:
    Enables or disables the coverage-based search for junctions. Use when coverage search is disabled by
    default (such as for reads ≥75 bp), for maximum sensitivity. Default: no
    However, given that I am working with a small number of libraries, I can afford the extra computational time and memory requirements, provided that this "maximum sensitivity" is really worth it. The question is: how do I make that call (other than by running my libraries with and without it and then comparing; I don't really want to reinvent the wheel here)?

    Also, for human, how much sense does it make to use the microexon search option???

    Thanks in advance!
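
    For concreteness, the flags being discussed look roughly like this on the command line (a sketch only, not a recommendation; the index name, GTF file, FASTQ names, output directories and thread count are placeholders):

      # coverage search and microexon search explicitly enabled
      tophat2 --coverage-search --microexon-search -p 8 -o tophat_cov \
          -G genes.gtf genome_index reads_1.fastq reads_2.fastq

      # coverage search explicitly disabled (the default for reads >= 75 bp, per the manual quote above)
      tophat2 --no-coverage-search -p 8 -o tophat_nocov \
          -G genes.gtf genome_index reads_1.fastq reads_2.fastq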

  • #2
    I am interested in this question as well; does anyone have a good answer?

    • #3
      I would also REALLY like to hear an answer on this. What am I giving up if I opt for --no-coverage-search?
      In science, "fact" can only mean "confirmed to such a degree that it would be perverse to withhold provisional assent." I suppose that apples might start to rise tomorrow, but the possibility does not merit equal time in physics classrooms.
      --Stephen Jay Gould

      • #4
        Sorry to simply provide a link here, but since it was biostars.org that provided the answer, not seqanswers, I felt it was appropriate to give that site the credit.

        Here is a thread that provides a discussion on this topic. I make no claims on its validity, but I found it useful to read.


        • #5
          Thanks for the useful link, though I disagree with the interpretation provided by the biostars poster!

          From the tophat manual:
          The first and strongest source of evidence for a splice junction is when two segments from the same read (for reads of at least 45bp) are mapped at a certain distance on the same genomic sequence or when an internal segment fails to map - again suggesting that such reads are spanning multiple exons. With this approach, "GT-AG", "GC-AG" and "AT-AC" introns will be found ab initio. The second source is pairings of "coverage islands", which are distinct regions of piled up reads in the initial mapping. Neighboring islands are often spliced together in the transcriptome, so TopHat looks for ways to join these with an intron. We only suggest users use this second option (--coverage-search) for short reads (< 45bp) and with a small number of reads (<= 10 million). This latter option will only report alignments across "GT-AG" introns
          I've responded to this on biostars, but to repost here:
          Hi! The identification of new splice sites in different genes/transcripts is still possible without coverage search!

          Coverage search is, according to the manual, only useful when you have very short reads, since in that case the probability that a read will span a splice junction directly may be very low for relatively lowly expressed transcripts. Hence you need another way of detecting splice sites, which is where coverage search comes in. To make things easier for the algorithm, this step allows only the most canonical GT-AG splice junctions (only in this latter step; you will still get the GC-AG and AT-AC junctions that are supported directly by reads).

          So the summary is: coverage search should be left off for "modern" Illumina data.
          Last edited by dvanic; 02-18-2013, 06:20 PM. Reason: correcting interpretation error

          • #6
            Wow. I am so thankful for your response. Finally, I think I have enough to make a decision on my runs... Unfortunately, I think I am going to have to re-run many of them with the coverage search off, but THANKFULLY they should take much less time!

            Gus

            • #7
              But I understood the manual to mean that it first looks for splice sites based on reads mapping across several places, using all the different ("GT-AG", "GC-AG" and "AT-AC") splice sites, and that the coverage search then adds _more_ junctions on top of this, not that coverage search restricts the junctions to GT-AG introns. Hence with longer reads the payoff of coverage search is diminished, but it still adds information.

              • #8
                Originally posted by pettervikman View Post
                But I understood the manual to mean that it first looks for splice sites based on reads mapping across several places, using all the different ("GT-AG", "GC-AG" and "AT-AC") splice sites, and that the coverage search then adds _more_ junctions on top of this, not that coverage search restricts the junctions to GT-AG introns. Hence with longer reads the payoff of coverage search is diminished, but it still adds information.
                Hi! Yes, you're right, thank you for catching that. However, I would still argue that coverage search should be left off for longer Illumina reads and mammalian (human, mouse) transcriptomes: the median exon length in humans is ~150 nucleotides, so if you have PE 100 reads you should have some reads cross the splice junctions... I'm not sure how much I would trust novel junctions that are only supported by coverage and not by reads directly, not to mention the additional computational time it takes.
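
                To put a rough number on that reasoning (a back-of-envelope sketch, not something from the manual): assume reads land uniformly along a transcript and that a read only counts as spanning a junction if it has at least TopHat's default 8 bp anchor on each side (--min-anchor-length). For a lowly expressed ~1.5 kb transcript with, say, 20 mapped 100 bp reads (both values made up purely for illustration):

                  awk 'BEGIN {
                      read = 100; anchor = 8; tx = 1500; n = 20;   # read length, min anchor, transcript length, read count (illustrative)
                      p = (read - 2 * anchor) / tx;                # chance one uniformly placed read spans a given junction
                      printf "P(junction spanned by >= 1 read) = %.2f\n", 1 - (1 - p)^n
                  }'

                With these made-up numbers that comes out around 0.68 per junction; drop the read length to 36 bp and the spanning window shrinks from 84 bp to 20 bp, so the same calculation falls to roughly 0.24, which is the short-read regime coverage search was designed for.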

                • #9
                  I see the point in leaving --coverage-search off, especially since the samples I'm running at the moment have been stuck at this step for >3 days (2*101 bp, ~40-50 million reads). I don't agree that long reads should be sufficient in themselves, though: even though the chance of covering an exon/exon boundary increases with read length, there is still a chance of missing it. For genes with low expression this might not be enough, hence you'll get more junctions with --coverage-search.

                  Also, weighing the cost per experiment against the extra (hopefully one-time) alignment time: the experiments are expensive and I want to get the most from my data. But we'll see how long it takes and whether I can occupy the server for that long.

                  • #10
                    Even though the chance of covering an exon/exon boundary increases with read length, there is still a chance of missing it. For genes with low expression this might not be enough, hence you'll get more junctions with --coverage-search.
                    How confident can you be, though, that these junctions are real? How well can you reconstruct these genes and their isoforms if you don't have enough reads that cover splice junctions?

                    • #11
                      I'm afraid I don't understand your point. All junctions/transcripts with a low number of reads are going to be hard to reconstruct. My thought is that by using coverage_search you'll get more reads mapping to junctions, which will then move some transcripts from the "too few" bin to the "just enough" bin in terms of the number of mapped reads, at least with regard to reads mapping to junctions, especially since (in my experience at least) you always have more reads mapped to the gene than to the junction.

                      So I'm currently comparing the output from ~70 samples +/- coverage_search to see if I'll benefit from the 4x mapping time that coverage_search takes.
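
                      For what it's worth, a minimal way to make that per-sample comparison (assuming the bed_to_juncs script that ships with TopHat is on the PATH; the with_cov/ and no_cov/ directory names are placeholders) is to convert each run's junctions.bed into raw intron coordinates and diff them:

                        # junctions reported only when coverage search was enabled
                        bed_to_juncs < with_cov/junctions.bed | sort > with.juncs
                        bed_to_juncs < no_cov/junctions.bed   | sort > without.juncs
                        comm -23 with.juncs without.juncs | wc -l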

                      • #12
                        My point is that median exon length in human is quite close to 100 nucleotides, and I work with 100bp PE reads.

                        So if I haven't managed to "hit" an exon junction with at least one read, how likely is it that I will have enough coverage across the entire gene to predict exons accurately? How do I prevent spurious reconstruction of transcripts and exon boundaries when the gene is expressed at such a low level? How many real single exons will be split into more than one exon because of low coverage, or because of regions with low mappability, for example due to repeats? And how do I filter these out?


                        My thought is that by using coverage_search you'll get more reads mapping to junctions, which will then move some transcripts from the "too few" bin to the "just enough" bin in terms of the number of mapped reads.
                        Coverage search does not increase the number of reads mapping to junctions. Coverage search is when you have "piles" of reads mapping to adjacent regions in the genome and there are NO junction reads, but you infer that there is a junction and these reads are part of one transcript based on them being in an adjacent locus and having the GT-AG sequence in the putative intron between them:
                        The second source is pairings of "coverage islands", which are distinct regions of piled up reads in the initial mapping.
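
                        And on the "how much do I trust these junctions" question: TopHat records the number of alignments spanning each junction in the score column of junctions.bed, so a crude post-hoc filter is possible (a sketch; the threshold of 3 is an arbitrary illustration, and the first line is assumed to be the BED track header):

                          # keep the track line plus junctions supported by at least 3 spanning alignments
                          awk 'NR == 1 || $5 >= 3' junctions.bed > junctions.min3.bed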

                        • #13
                          Firstly, I thought that coverage search defined new exons based on coverage piles and then tried to map reads to those exons and to the junctions between all such piles. Hence reads that would previously have been mapped somewhere else could be remapped to a junction between two defined exons.

                          Regarding all the other questions, well, that's something to look into. I know that I get more reads mapped from our initial investigation comparing coverage/non-coverage. Whether these are good mappings or spurious ones, I'll see later on.

                          • #14
                            Originally posted by pettervikman View Post
                            So I'm currently comparing the output from ~70 samples +/- coverage_search to see if I'll benefit from the 4x mapping time that coverage_search takes.
                            Hi, I am wondering what conclusion you reached from the comparison. Do you think coverage search is worth the time?

                            Thanks,
