SEQanswers

Old 08-09-2012, 05:13 PM   #1
dvanic
Member
 
Location: Sydney, Australia

Join Date: Jan 2012
Posts: 61
When to use tophat2's coverage search?

Hi!
I'm a bit confused: when should one use tophat2's coverage search? Is there a logic in leaving it off/on for 100bp PE reads, or is this dictated solely by the computational resources one has available?

Overall, what is YOUR standard practice with using this option?

I have seen the manual, which states:
Quote:
Enables or disables the coverage-based search for junctions. Use when coverage search is disabled by
default (such as for reads ≥75 bp), for maximum sensitivity. Default: no
However, since I am working with a small number of libraries, I can afford the extra computational time and memory, provided that this "maximum sensitivity" is really worth it. The question is: how do I make that call, other than by running my libraries with and without it and comparing the results? I don't really want to reinvent the wheel here.

Also, for human, how much sense does it make to use the microexon search option?

Thanks in advance!
Old 08-31-2012, 06:41 AM   #2
magbju
Junior Member
 
Location: Sweden

Join Date: Aug 2012
Posts: 4

I am interested in this question as well. Does anyone have a good answer?
Old 02-01-2013, 09:40 AM   #3
Gus
Member
 
Location: Irvine CA, USA

Join Date: Dec 2009
Posts: 29

I would also REALLY like to hear an answer on this. What am I giving up if I opt for --no-coverage-search?
__________________
In science, "fact" can only mean "confirmed to such a degree that it would be perverse to withhold provisional assent." I suppose that apples might start to rise tomorrow, but the possibility does not merit equal time in physics classrooms.
--Stephen Jay Gould
Old 02-01-2013, 09:46 AM   #4
Gus
Member
 
Location: Irvine CA, USA

Join Date: Dec 2009
Posts: 29

Sorry to simply provide a link here but since it was biostars.org that provided the answer, not seqanswers, I felt it was appropriate to give that site the credit.

Here is a thread that provides a discussion on this topic. I make no claims on its validity, but I found it useful to read.

http://www.biostars.org/p/49224/
Old 02-03-2013, 01:02 AM   #5
dvanic
Member
 
Location: Sydney, Australia

Join Date: Jan 2012
Posts: 61

Thanks for the useful link, though I disagree with the interpretation provided by the biostars poster!

From the tophat manual:
Quote:
The first and strongest source of evidence for a splice junction is when two segments from the same read (for reads of at least 45bp) are mapped at a certain distance on the same genomic sequence or when an internal segment fails to map - again suggesting that such reads are spanning multiple exons. With this approach, "GT-AG", "GC-AG" and "AT-AC" introns will be found ab initio. The second source is pairings of "coverage islands", which are distinct regions of piled up reads in the initial mapping. Neighboring islands are often spliced together in the transcriptome, so TopHat looks for ways to join these with an intron. We only suggest users use this second option (--coverage-search) for short reads (< 45bp) and with a small number of reads (<= 10 million). This latter option will only report alignments across "GT-AG" introns
I've responded to this on biostars, but to repost here:
Hi! The identification of new splice sites in different genes/transcripts is still possible without coverage search!

Coverage search is, according to the manual, only useful when you've got very short reads, since in that case the probability that a read will "hit" the splice junction exactly may be very low for lowly expressed transcripts. Hence, you need another way of detecting splice sites, which is where coverage search comes in. To make the search easier for the algorithm, the coverage-search step considers only the most canonical GT-AG splice junctions (this applies only to this latter step; you'll still get the GC-AG and AT-AC junctions that are supported by reads).

So, to sum up: coverage search should be left off for "modern" Illumina data.
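For anyone scripting their runs, here is a minimal sketch of how the two settings could be toggled. It only builds and prints the command line, it does not run anything. The index name, FASTQ file names, output directories and thread count are placeholders I made up; the flags themselves (-p, -o, --coverage-search, --no-coverage-search) are the ones described in the TopHat manual.

```python
import shlex


def tophat2_command(index, reads_1, reads_2, out_dir,
                    coverage_search=False, threads=4):
    """Build a tophat2 argument list; all paths here are placeholders."""
    # Pick the explicit flag so the setting does not depend on read-length defaults.
    flag = "--coverage-search" if coverage_search else "--no-coverage-search"
    return ["tophat2", "-p", str(threads), "-o", out_dir, flag,
            index, reads_1, reads_2]


# Hypothetical file names, for illustration only.
cmd = tophat2_command("genome_index", "reads_1.fastq", "reads_2.fastq",
                      "tophat_no_cov")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the placeholder paths are replaced with real ones.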

Last edited by dvanic; 02-18-2013 at 06:20 PM. Reason: correcting interpretation error
Old 02-03-2013, 09:14 AM   #6
Gus
Member
 
Location: Irvine CA, USA

Join Date: Dec 2009
Posts: 29

Wow. I am so thankful for your response. And finally, I think I have enough to make a decision on my runs... Unfortunately, I think I am going to have to re-run many of them with the coverage search off, but THANKFULLY they should take much less time!

Gus
Old 02-18-2013, 01:13 AM   #7
pettervikman
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 23

But I understood the manual to mean that it first looks for splice sites based on reads that map across several places, using all the different ("GT-AG", "GC-AG" and "AT-AC") splice sites, and that the coverage search then adds _more_ junctions to this, not that coverage search restricts the junctions to GT-AG introns only. Hence, with longer reads the payback of coverage search is diminished, but it still adds information.
Old 02-18-2013, 06:18 PM   #8
dvanic
Member
 
Location: Sydney, Australia

Join Date: Jan 2012
Posts: 61

Quote:
Originally Posted by pettervikman View Post
But I understood the manual to mean that it first looks for splice sites based on reads that map across several places, using all the different ("GT-AG", "GC-AG" and "AT-AC") splice sites, and that the coverage search then adds _more_ junctions to this, not that coverage search restricts the junctions to GT-AG introns only. Hence, with longer reads the payback of coverage search is diminished, but it still adds information.
Hi! Yes, you're right, thank you for catching that. However, I would still argue that coverage search should be left off for longer Illumina reads and mammalian (human, mouse) transcriptomes: the median exon length in humans is ~150 nucleotides, so if you have PE 100 reads you should have some reads crossing the splice junctions. I'm not sure how much I would trust novel junctions that are only supported by coverage and not by reads directly, not to mention the additional computational time it takes.
Old 02-21-2013, 03:57 AM   #9
pettervikman
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 23

I see the point in leaving --coverage-search off, especially since the samples I'm running at the moment have been stuck at this step for >3 days (2x101bp, ~40-50 million reads). I don't agree that long reads should be sufficient in themselves, though. Even if the chance of covering an exon/exon boundary increases with read length, it is still only a chance. For genes with low expression this might not be sufficient, hence you'll get more junctions with --coverage-search.

There is also the cost per experiment versus the extra (hopefully one-time) alignment time: the experiments are expensive and I want to get the most from my data. But we'll see how long it takes and whether I can use the server that much.
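To put rough numbers on the "it is still only a chance" argument, here is a back-of-the-envelope sketch. It is a deliberately simplified model and not anything TopHat computes: it assumes read start positions are uniform along the transcript, the junction sits far enough from the transcript ends, reads are independent, and a read only counts if it has at least min_anchor bases on each side of the junction (the value 8 is my assumption for a typical anchor length). The function names are made up for this example.

```python
def junction_start_fraction(read_len, transcript_len, min_anchor=8):
    """Fraction of read start positions that span one given junction
    with at least min_anchor bases on each side (uniform starts assumed)."""
    spanning_starts = max(read_len - 2 * min_anchor + 1, 0)
    possible_starts = transcript_len - read_len + 1
    if possible_starts <= 0:
        return 0.0
    return min(spanning_starts / possible_starts, 1.0)


def prob_junction_missed(n_reads, read_len, transcript_len, min_anchor=8):
    """Probability that none of n_reads independent reads spans the junction."""
    p = junction_start_fraction(read_len, transcript_len, min_anchor)
    return (1.0 - p) ** n_reads


# Illustrative values: 100 bp reads on a 2 kb transcript.
for n in (5, 10, 50, 100):
    print(n, round(prob_junction_missed(n, 100, 2000), 3))
```

Under these assumptions the chance of missing a given junction drops quickly as the number of reads on the transcript grows, so the junctions at stake are those in transcripts with only a handful of reads.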
Old 03-27-2013, 05:33 PM   #10
dvanic
Member
 
Location: Sydney, Australia

Join Date: Jan 2012
Posts: 61

Quote:
Even if the chance of covering an exon/exon boundary increases with read length, it is still only a chance. For genes with low expression this might not be sufficient, hence you'll get more junctions with --coverage-search.
How confident can you be, though, that these junctions are real? How well can you reconstruct these genes and their isoforms if you don't have enough reads that cover splice junctions?
Old 03-28-2013, 12:24 AM   #11
pettervikman
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 23

I'm afraid I don't understand your point. All junctions/transcripts with a low number of reads are going to be hard to reconstruct. My thought is that by using coverage_search you'll get more reads mapping to junctions, which will then move some transcripts from the "too few" bin to the "just enough" bin when it comes to the number of mapped reads. This matters for junction reads in particular, since you will always (in my experience at least) have more reads mapped to the gene as a whole than to its junctions.

So I'm currently comparing the output from ~70 samples +/- coverage_search to see if I'll benefit from the 4x mapping time that coverage_search takes.
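A comparison like this could be sketched along the following lines, using the junctions.bed file that TopHat writes to each output directory. Per the manual, each junction there is a BED record with two blocks and the score is the number of alignments spanning the junction, so the intron coordinates can be derived from the block sizes and starts. The function names and the two example records are invented for illustration.

```python
def read_junctions(lines):
    """Map (chrom, intron_start, intron_end, strand) -> score
    from BED12 junction lines (0-based, half-open intron coordinates)."""
    junctions = {}
    for line in lines:
        if not line.strip() or line.startswith("track"):
            continue  # skip blank lines and the track header
        f = line.rstrip("\n").split("\t")
        chrom, chrom_start, score, strand = f[0], int(f[1]), int(f[4]), f[5]
        block_sizes = [int(x) for x in f[10].rstrip(",").split(",")]
        block_starts = [int(x) for x in f[11].rstrip(",").split(",")]
        intron_start = chrom_start + block_sizes[0]   # end of left block
        intron_end = chrom_start + block_starts[1]    # start of right block
        junctions[(chrom, intron_start, intron_end, strand)] = score
    return junctions


def only_in_first(first, second):
    """Junctions present in `first` but absent from `second`."""
    return {k: v for k, v in first.items() if k not in second}


# Made-up records standing in for two junctions.bed files.
j1 = "chr1\t100\t400\tJUNC00000001\t12\t+\t100\t400\t255,0,0\t2\t50,60\t0,240"
j2 = "chr1\t1000\t1500\tJUNC00000002\t1\t-\t1000\t1500\t255,0,0\t2\t30,40\t0,460"
with_cov = read_junctions([j1, j2])
without_cov = read_junctions([j1])

extra = only_in_first(with_cov, without_cov)
print(len(extra), "junction(s) found only with coverage search:", extra)
```

With real data the lists would be replaced by open file handles for the two junctions.bed files. Looking at the scores of the junctions unique to the coverage-search run is one way to judge how well supported they are.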
Old 03-28-2013, 12:59 AM   #12
dvanic
Member
 
Location: Sydney, Australia

Join Date: Jan 2012
Posts: 61

My point is that median exon length in human is quite close to 100 nucleotides, and I work with 100bp PE reads.

So if I haven't managed to "hit" an exon junction with at least one read how likely is it that I will have enough coverage across the entire gene to be able to predict exons accurately? How do I prevent spurious reconstruction of transcripts and exon boundaries because of how lowly the gene is expressed? How many real single exons will be split into more than one exon because of low coverage or regions in them that have low mappability, for example due to repeats? And how do I filter these out?


Quote:
My thought is that by using coverage_search you'll get more reads mapping to junctions, which will then move some transcripts from the "too few" bin to the "just enough" bin when it comes to the number of mapped reads.
Coverage search does not increase the number of reads mapping to junctions. Coverage search is when you have "piles" of reads mapping to adjacent regions in the genome and there are NO junction reads, but you infer that there is a junction and these reads are part of one transcript based on them being in an adjacent locus and having the GT-AG sequence in the putative intron between them:
Quote:
The second source is pairings of "coverage islands", which are distinct regions of piled up reads in the initial mapping.
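As a toy illustration of the idea only (this is not TopHat's actual implementation, and the function names, depth threshold and example sequence are all invented), pairing neighbouring coverage islands and keeping the gaps that look like GT-AG introns could be sketched like this:

```python
def coverage_islands(coverage, min_depth=1):
    """Return half-open (start, end) intervals where depth >= min_depth."""
    islands, start = [], None
    for pos, depth in enumerate(coverage):
        if depth >= min_depth and start is None:
            start = pos                      # island opens
        elif depth < min_depth and start is not None:
            islands.append((start, pos))     # island closes
            start = None
    if start is not None:
        islands.append((start, len(coverage)))
    return islands


def candidate_gt_ag_introns(genome, islands):
    """Gaps between neighbouring islands that start with GT and end with AG."""
    candidates = []
    for (_, left_end), (right_start, _) in zip(islands, islands[1:]):
        gap = genome[left_end:right_start]
        if len(gap) >= 4 and gap.startswith("GT") and gap.endswith("AG"):
            candidates.append((left_end, right_start))
    return candidates


# Two covered "exons" separated by an uncovered gap with GT...AG ends.
genome = "ACGTAC" + "GTAAAAAG" + "CCGGTT"
coverage = [3] * 6 + [0] * 8 + [2] * 6

islands = coverage_islands(coverage)
print(islands)
print(candidate_gt_ag_introns(genome, islands))
```

Note that no read in this toy example crosses the gap: the candidate intron is inferred purely from the two piles and the dinucleotides at the gap boundaries.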
Old 03-28-2013, 01:11 AM   #13
pettervikman
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 23

Firstly, I thought that the coverage search defined new exons based on coverage piles and then tried to map reads to the exons and to junctions between all such piles. Hence, reads that previously would have been mapped somewhere else could be remapped to a junction between two defined exons.

Regarding all the other questions, well, that's something to look into. From our initial comparison of runs with and without coverage search, I know that I get more reads mapped. Whether these are good or spurious mappings, I'll see later on.
Old 07-07-2014, 02:40 AM   #14
byb121
Member
 
Location: Newcastle upon Tyne

Join Date: Aug 2009
Posts: 18

Quote:
Originally Posted by pettervikman View Post
So I'm currently comparing the output from ~70 samples +/- coverage_search to see if I'll benefit from the 4x mapping time that coverage_search takes.
Hi, I am wondering what conclusion you reached from the comparison. Do you think coverage search is worth the time?

Thanks,




All times are GMT -8.