SEQanswers

Old 11-05-2012, 07:11 PM   #1
dvanic
Member
 
Location: Sydney, Australia

Join Date: Jan 2012
Posts: 61
Default STAR vs Tophat (2.0.5/6)

Hi! I was wondering - what kind of comparisons have users of this forum run on Tophat vs STAR?
What have you found to be the main differences in terms of accuracy and "quality" of mapping (as opposed to speed/memory requirements, which have been covered in the paper quite well)?
And how does STAR compare to the newer versions of Tophat, which are purported to have solved the pseudogene problem?

I have seen this post, but didn't want to lead the discussion off topic there:
http://seqanswers.com/forums/showthr...highlight=STAR

To quote pbluescript from that thread:
Quote:
You should try STAR. http://gingeraslab.cshl.edu/STAR/
I've mentioned this elsewhere on the forums, but in the comparisons I've done, STAR yields more mapped reads, more uniquely mapped reads, and more reads with both pairs mapped than Tophat all in about 20% of the time it takes Tophat. The only downside is an increase in potential false positive splice junctions, but those can be filtered out easily enough.
It was just published too:
http://bioinformatics.oxfordjournals...ts635.abstract
More reads != better mapping, it may just mean more false assignments of reads to location + salvaging more reads that may not be mappable...
Old 11-05-2012, 09:21 PM   #2
pbluescript
Senior Member
 
Location: Boston

Join Date: Nov 2009
Posts: 224
Default

Quote:
Originally Posted by dvanic View Post
More reads != better mapping, it may just mean more false assignments of reads to location + salvaging more reads that may not be mappable...
Since you quoted me, I guess I should add my comments.

My testing hasn't been as extensive as that in the recent paper, so I recommend you all just go read that.

It's true that more mapped reads does not equal better mapping. However, I do get more uniquely mapped reads and more reads with both pairs mapped (for my PE runs). For much of my data, even if I throw out multi-mapped reads, just the unique, proper pair reads from STAR yields more aligned reads than Tophat (up to 2.0.3 when I just gave up on testing Tophat). These are predominantly good quality alignments too.
So far, the testing I have done to confirm the data produced by STAR has held up very well. Gene expression levels, novel splice junctions (when well supported), and alternative isoforms have confirmed well with PCR-based methods.

This comes with a couple caveats though. I don't use STAR for analysis of RNA editing. For that, I switch to BWA and custom built transcriptomes for alignment.
For most of my data, I am sequencing small amounts of fragmented RNA, so the read quality can be quite variable. Getting 50-60% of my reads mapped makes me happy. For my few good quality samples, the differences between Tophat and STAR aren't as pronounced.

As the paper mentions, Tophat is MUCH slower. STAR can be slow on some of my data sets with a small percentage of reads that map to the target genome, but it's still faster than Tophat.

On a side note, of the four or five emails I've sent to the Tophat team requesting advice or reporting a bug, only one was answered. Every email I've sent to the STAR developer has been answered and answered quickly.
Old 11-05-2012, 10:19 PM   #3
dvanic
Member
 
Location: Sydney, Australia

Join Date: Jan 2012
Posts: 61
Default

Quote:
My testing hasn't been as extensive as that in the recent paper, so I recommend you all just go read that.
Oh, I've read the paper, but see several "problems" with their tests (not really problems, more like it's hard to test every version against every version, and every version adds some new feature and maps differently).

With Tophat, we've found that
1) There is a significant difference (for versions 1.4.1 and 2.0.0-2.0.4) between mapping with a reference transcriptome and without one. I did some benchmarking and visualization with one of my datasets and found that without a reference more reads are mapped. Many of these reads are "weird" - low quality, not "adding up" to transcripts, located in regions with no annotated transcripts, in "clusters" that don't appear to form a continuous transcript, etc...

For reads that are mapped differently with/without the annotation,
- lots mapped to pseudogenes without a reference
- reads mapping over splice junctions that have a small overlap with one of the junctions (~5 nucleotides) will be differently aligned with and without the annotation (the annotation has information on the structure of the isoforms at the given locus, while without the annotation the "tail" is just mapped to the nearest sequence of those nucleotides, which may not be part of the transcript at all)
- some reads are just mapped weirdly (as above for unmapped) when you don't use an annotation.

Hence, with these versions of Tophat we have always tried to use an annotation.

2) With Tophat 2.0.5/6:
They have introduced a double mapping:
Quote:
Version 2.0.5 adds new options to better control the read alignment and to improve mapping accuracy, and the ability to resume partial TopHat runs:
along with -N/--read-mismatches, TopHat introduces new options for finer control of the read alignment process by limiting the number of mismatches, indels and indel length. Please check new options --read-gap-length and --read-edit-dist.
the new --read-realign-edit-dist option can be used to greatly improve spliced-mapping accuracy (especially in the absence of annotation data) by forcing the re-mapping of some or all reads regardless of them being already mapped in earlier steps of the pipeline.
This, in my experience, makes the mapped data "look better" (when visualized as a wiggle or bam=> bed), i.e. the read positions make sense and look to be part of reasonable transcripts. Also, theoretically, having the ability to accurately map to both genome and transcriptome independently and choose the best alignment seems like a very good idea to me, which is why I am using this version with these settings at the moment.
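For reference, a 2.0.5-style invocation combining the options quoted above might look like the following sketch; the index, annotation and read file names are placeholders, and TopHat expects --read-edit-dist to be at least as large as --read-mismatches and --read-gap-length:

```shell
# Hypothetical TopHat 2.0.5+ run; index/annotation/read names are placeholders.
# --read-realign-edit-dist 0 forces re-alignment of all reads in later stages.
tophat2 -p 8 -G annotation.gtf \
    --read-mismatches 2 --read-gap-length 2 --read-edit-dist 2 \
    --read-realign-edit-dist 0 \
    -o tophat_out genome_index reads_1.fastq reads_2.fastq
```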

Getting back on topic, the STAR paper used Tophat with the default options, but with 10 mismatches (default is 2) for a 100 bp PE read (and, crucially, no reference annotation, which, as I've outlined above, makes a big difference for alignment "quality" in Tophat). Hence, in an indirect way, Tophat was at a disadvantage here in terms of how it would perform in a real-world scenario... I'm just wondering how much, and what is observed in a real-world scenario with different datasets by different people. (As in, should I convert???)

[As a side note, the newer versions of Tophat 2.0.5/6 have a much higher memory requirement, in my experience, than the older ones (for example, on 100 million 100bp PE reads it uses ~23 GB and runs on 8 cores for four bloody days... - but this is a price I am willing to pay to be more "confident" in the higher accuracy of my mapping)]

Quote:
Gene expression levels, novel splice junctions (when well supported), and alternative isoforms have confirmed well with PCR-based methods.
We've found this as well with the latest Tophats when using a reference.

Quote:
I don't use STAR for analysis of RNA editing. For that, I switch to BWA and custom built transcriptomes for alignment.
I think editing is a separate issue altogether. I've seen some of the recent "better" papers (Kleinman, Bahn, Ju etc), but I am still not sure we can accurately estimate editing from a "normal" RNA-Seq dataset (and not one targeted at detecting editing) - the levels of editing for most transcripts are just so low, mapping accuracy needs to be better, SNPs specific to that particular individual need to be taken into account (i.e. in a perfect world you need a genome sequence) etc etc... After the Li paper fiasco I'm very much a skeptic.

Quote:
For my few good quality samples, the differences between Tophat and STAR aren't as pronounced.
I am spoiled by decent-quality data that others playing with it have gone green with envy over...

Quote:
As the paper mentions, Tophat is MUCH slower. STAR can be slow on some of my data sets with a small percentage of reads that map to the target genome, but it's still faster than Tophat.
Undoubtedly. But for me (and some of my colleagues here in Oz as well), I'd rather wait four days for my data to map and then play with it for several weeks/months, confident in that what I see, no matter how biologically strange, is probably real, rather than getting my results in a day, spending two weeks looking at some interesting feature, and then discovering it's a mapping artefact. And I know I'm not insured against this, no matter how fancy I am with my pipelining, but I'd like to minimize the chances of this where I can.

Quote:
On a side note, of the four or five emails I've sent to the Tophat team requesting advice or reporting a bug, only one was answered. Every email I've sent to the STAR developer has been answered and answered quickly.
That's actually an important selling point!

With Tophat we've been mostly lucky, but I have issues with another tool in the Tuxedo pipeline - Cufflinks. There is, at the moment, no actual paper that reports what Cufflinks is now doing and how, or whether using the smorgasbord of methods it's trying to combine is actually statistically valid - only the rather complex, confusing and, frankly, unintelligible "how Cufflinks works" web page as a reference, and the promise of a "manuscript in preparation". This annoys me, since this IS the most commonly used tool in the field, and the fact that most people who use it have no idea what it's doing reflects, IMHO, shoddy science... (sorry for the rant)
Old 11-06-2012, 12:57 PM   #4
pbluescript
Senior Member
 
Location: Boston

Join Date: Nov 2009
Posts: 224
Default

Quote:
Originally Posted by dvanic View Post
Hence, with these versions of Tophat we have always tried to use an annotation.
When Tophat introduced the option of mapping to a transcriptome first, I did notice an overall improvement in mapping quality. It found a good number of additional splice junctions. However, for my data, STAR was still the winner.

Quote:
Getting back on topic, the STAR paper used Tophat with the default options, but with 10 mismatches (default is 2) for a 100 bp PE read (and, crucially, no reference annotation, which, as I've outlined above, makes a big difference for alignment "quality" in Tophat).
Those are not the Tophat options used in the STAR paper. Perhaps you were looking at one of the different aligners? From the STAR paper:
tophat --solexa1.3-quals -p $1 -r172 --min-segment-intron 20 --max-segment-intron 500000 --min-intron-length 20 --max-intron-length 500000 <genome_name> Read1.fastq Read2.fastq

There isn't even a way to set 10 mismatches/read in Tophat.

Quote:
Undoubtedly. But for me (and some of my colleagues here in Oz as well), I'd rather wait four days for my data to map and then play with it for several weeks/months, confident in that what I see, no matter how biologically strange, is probably real, rather than getting my results in a day, spending two weeks looking at some interesting feature, and then discovering it's a mapping artefact. And I know I'm not insured against this, no matter how fancy I am with my pipelining, but I'd like to minimize the chances of this where I can.
I totally agree. I spent months working on mapping methods. I have access to a cluster, so it was fairly easy to test numerous mapping methods on a large number of samples. I'm just happy that the method that gave me the best results is also the fastest.
Old 11-06-2012, 01:23 PM   #5
dvanic
Member
 
Location: Sydney, Australia

Join Date: Jan 2012
Posts: 61
Default

Quote:
Those are not the Tophat options used in the STAR paper. Perhaps you were looking at one of the different aligners? From the STAR paper:
tophat --solexa1.3-quals -p $1 -r172 --min-segment-intron 20 --max-segment-intron 500000 --min-intron-length 20 --max-intron-length 500000 <genome_name> Read1.fastq Read2.fastq
Damn, should have checked the supplements. I assumed that this was what they were using based on the statement:
Quote:
All aligners were run in the de novo mode, i.e. without using gene/transcript annotations. The maximum number of mismatches was set at 10 per paired-end read, the minimum/maximum intron sizes were set at 20b/500kb
from the main paper.
Quote:
There isn't even a way to set 10 mismatches/read in Tophat.
I am assuming you would be able to by using the option:
Quote:
-N/--read-mismatches Final read alignments having more than these many mismatches are discarded. The default is 2.
although I have never tried to use 10 in the real world (have gone up to 5 successfully, though, but generally stick to the default 2).

Quote:
I totally agree. I spent months working on mapping methods.
It's just we ALL do this, and it would be so much nicer if there could be really relevant "real-world" comparison papers, as opposed to "this is my software - it is the best against every other software (if we run the tests in a particular way)".

But it's good to know STAR is working for someone! I'll give it a shot and see what I get out of it.

Have you used STAR-generated bams with cufflinks by any chance?
Old 11-06-2012, 03:19 PM   #6
pbluescript
Senior Member
 
Location: Boston

Join Date: Nov 2009
Posts: 224
Default

Oh cool. It looks like they added that -N option in the 2.0.5 release, after I stopped using it.

I have used STAR for Cufflinks. In fact, a fairly recent update added the appropriate tags for use with Cufflinks. Before that, I had to add them separately.
Whether or not Cufflinks works is another issue of hot debate on this board.
Old 11-06-2012, 04:57 PM   #7
dvanic
Member
 
Location: Sydney, Australia

Join Date: Jan 2012
Posts: 61
Default

Quote:
Whether or not Cufflinks works is another issue of hot debate on this board.
I agree wholeheartedly. But I use it anyway, because it is, with all of its flaws, the "best" thing out there. The only problem is that I am still not sure how it is now working... especially for differential expression of isoforms, with or without replicates. [And I have read the "How cufflinks works" page. It doesn't really make it that much clearer, or convince me that the stats are valid.]

And I have had it trip up and assemble weird transcripts, especially when run without a reference. And not assemble something in one library while assembling it in another, even though there are approximately equal numbers of reads in both libraries...

Basically, I use it and then look very intently in the browser at anything I'm basing a hypothesis on.
Old 11-28-2012, 05:48 AM   #8
EGrassi
Member
 
Location: Turin, Italy

Join Date: Oct 2010
Posts: 66
Default

Can I ask which options you used for your Tophat vs STAR tests? Some default values seem very different (for example, regarding multi-mapped reads) and I would like to avoid losing something... thank you.
Old 02-06-2013, 02:23 AM   #9
NicoBxl
not just another member
 
Location: Belgium

Join Date: Aug 2010
Posts: 264
Default

Does STAR handle strand-specific data like Tophat does?
Old 06-05-2013, 10:12 PM   #10
Torst
Senior Member
 
Location: The University of Melbourne, AUSTRALIA

Join Date: Apr 2008
Posts: 275
Exclamation

Yes, according to the manual (PDF) it assumes strand-specific reads. If you don't have them, you need to enable an option.
Old 06-05-2013, 11:36 PM   #11
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Are people still interested in this discussion? I've been benchmarking things like crazy, including Tophat and STAR as well as several other mapping approaches, and then quantifying things like alignment position and read counts at the gene level as well as at the isoform level. I'm seeing some interesting things that sort of validate what I've suspected in the past when using these tools on real data.

STAR with a reference vs Tophat with a reference perform VERY similarly in terms of alignment precision (or accuracy, or 1-FDR, depending on what you want to call it). Tophat with a reference is a significant improvement over without a reference, while STAR's improvement is smaller (so STAR without a reference seems to out-perform Tophat in alignment precision).

In terms of counting hits to genes they, again, perform very similarly; however, STAR beats Tophat in count value precision. What I mean to say is that if I compare the list of genes that received alignments to the list of genes that should have alignments, the counts from the two aligners are similar, but if I compare the count values to the control values then STAR has much higher precision compared to Tophat. Both of these pipelines are defeated pretty significantly by RSEM and eXpress, however, for gene-level counts.

Cufflinks is a different story. I figured out how to generate counts from Cufflinks (not cuffdiff), and I compared those counts to my own naive counter and got the same values, so I can see that it's counting hits at the gene level in a logical way.

The isoform-level counts aren't awesome, but the sensitivity (ratio of isoforms with counts to those that should have counts) is OK. The FDR is bad, though.

I've not finished yet, but I'm putting together a pipeline to benchmark cufflinks' de novo isoform assembly abilities. I've only run a single simulation, which resulted in 26% false positives (that's isoforms it assembled that cuffcompare evaluated to be matches to annotated isoforms that weren't expressed at all in my simulation), so I wasn't too stoked about that. I even provided it with isoforms with massive expression to be sure there were enough reads for it to do its thing. I have a lot more to look at, however, before making any noise. There's more going on here than just cufflinks' ability to assemble isoforms - it's completely dependent on the aligner. Also, its isoform expression assignment stage doesn't have the same flexibility as RSEM and eXpress, which take all mappings of each read and perform a detailed evaluation of which of those mappings is really "correct". That approach does yield an improvement over the aligner's chosen "primary" mapping. To me it seems that cufflinks is at a disadvantage, since it can be strangled by the aligner's ability to select the correct mapping of reads.
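The sensitivity and FDR figures above are simple set comparisons; as a minimal sketch (function and variable names are mine, not from any of these tools):

```python
def benchmark_counts(detected, truth):
    """Sensitivity and FDR for a set of detected features (e.g. isoforms
    that received counts) versus the set that should have received counts."""
    detected, truth = set(detected), set(truth)
    tp = len(detected & truth)
    sensitivity = tp / len(truth) if truth else 0.0
    fdr = (len(detected) - tp) / len(detected) if detected else 0.0
    return sensitivity, fdr
```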
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Old 06-06-2013, 03:13 AM   #12
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by sdriscoll View Post
Are people still interested in this discussion?
Yes, I'd be very interested in seeing more about that, it sounds quite useful.
Old 06-06-2013, 10:27 PM   #13
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

We have done a decent number of side-by-side comparisons of STAR 2.3.0 and TOPHAT 2.0.8b.

Short story is they are not very different. On most obvious counts STAR wins: faster and more unique alignments. We run both with the Ensembl 70 GTF. In fact, the biggest thing we noticed recently when moving from Ensembl 64 to 70 and upgrading from TOPHAT 2.0.4 was that the run time dropped to under 24 hrs, compared to the previous 2.5 days, on ~70 million read pairs. STAR did a much better job picking up large known indels detected in matching exomes.

My only negative comment on STAR is that it is very aggressive at trying to find junctions and puts out some clear garbage. HOWEVER, this can easily be removed using the filter-by-sbjOUT option.

Based on our testing I highly suspect we will be following pbluescript and moving to STAR from TOPHAT.
Old 06-06-2013, 10:53 PM   #14
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

You know, there is something that Heng posted once in the bio-bwa user group. He appears to be more of a DNA alignment dude than an RNA-seq alignment dude, but he was talking about adapting the BWA aligners to become RNA-seq aligners. He posted a quick simulation where he sampled 1M reads from the genome (so no spliced reads) and then aligned them to the genome with STAR, Tophat (with both bowtie1 and bowtie2), and also his bwa 'mem' aligner. Tools like STAR and Tophat should have reported no junctions. Tophat with bowtie2 managed to do this pretty well; I think he said it reported 1 junction. The bwa 'mem' aligner did pretty well too, reporting only a few chimeric alignments that could all be filtered out by removing alignments with MAPQ < 5. Tophat with bowtie1 reported all kinds of fusion alignments, and STAR reported several hundred spliced alignments.

His point was that with bwa mem he seems to have a good base aligner - one that isn't reporting junctions when it shouldn't report junctions. It seems that STAR IS probably over-eager to report junctions based on his test. It may be useful to try such simulations yourself and maybe convince yourself which aligner is controlling the false positives of reported junctions better.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */

Last edited by sdriscoll; 06-06-2013 at 10:59 PM.
Old 06-29-2013, 09:15 PM   #15
genomeHunter
Member
 
Location: Canada

Join Date: Apr 2013
Posts: 26
Default

STAR's MAPQ values should NOT be used for filtering reads and judging their qualities. I saw Heng's ROC plot in which STAR's ROC was just a single dot. I tried Bioplanet's GCAT test set on STAR and it was very good:

http://www.bioplanet.com/gcat/report.../compare-23-18

Note that (1) STAR is not a DNA mapper and (2) the MAPQ fields are not set the same way as, say, BWA's.

I have seen a lot of STAR spliced alignments where the overhang is just one or two bases. I look at the CIGAR and throw out anything with an overhang less than 8.
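A minimal sketch of that CIGAR-based overhang filter (my own helper, not a STAR feature; it counts M/=/X operations as aligned bases and treats N as an intron):

```python
import re

def min_splice_overhang(cigar):
    """Return the smallest number of aligned bases flanking any N (intron)
    operation in a CIGAR string, or None if the alignment is unspliced."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    segments, cur, spliced = [], 0, False
    for length, op in ops:
        if op == "N":          # intron: close the current aligned segment
            segments.append(cur)
            cur, spliced = 0, True
        elif op in "M=X":      # operations that consume aligned read bases
            cur += int(length)
    segments.append(cur)
    return min(segments) if spliced else None

# Usage: keep an alignment if it is unspliced or its overhang is >= 8, e.g.
#   ov = min_splice_overhang(cigar); keep = ov is None or ov >= 8
```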

Cheers
GH
Old 06-30-2013, 10:05 PM   #16
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

STAR's MAPQs are pretty easy to understand - they are pretty much what Tophat uses as well. The MAPQ is -10*log10(p), where p is the probability that the reported location is wrong; for n equally good locations that is (n-1)/n. So for something that can map in two places p = 1/2, and the equation works out to 3.01, which rounds to 3. If it can map to three places p = 2/3, which gives 1.76 and rounds to 2. So any MAPQ in STAR or Tophat above 3 basically just means the mapping is unique. Since the formula is unbounded as the error probability approaches zero, they just set a high value for unique mappings. There's probably some question still about whether or not those mappings are correctly called unique, but that's another story. BWA uses a much different approach, and I'm not one to explain it because I don't actually know what it is.
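As a sketch, the numbers quoted above (3.01 for two locations, 1.76 for three) come out if p is taken as the probability that the reported location is wrong, i.e. (n-1)/n for n equally good hits; the 255 sentinel for unique hits is my assumption, since aligners differ on the exact "unique" value they emit:

```python
import math

def multimap_mapq(n_hits):
    """MAPQ = -10*log10(p), with p the probability that the reported
    location is wrong: (n-1)/n for n equally good candidate locations."""
    if n_hits == 1:
        return 255  # unique hit: high sentinel value (tool-specific in practice)
    p = (n_hits - 1) / n_hits
    return int(round(-10 * math.log10(p)))
```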
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Old 07-01-2013, 09:23 AM   #17
genomeHunter
Member
 
Location: Canada

Join Date: Apr 2013
Posts: 26
Default

Thanks. I knew about that, and actually I use it to filter for unique reads, but it is just a flag and not an indicator of mapping quality. In contrast, as you mentioned, BWA assigns a MAPQ based on the CIGAR and the quality of the positions in the read.

Cheers
GH
Old 07-03-2013, 03:44 PM   #18
jblachly
Junior Member
 
Location: USA

Join Date: Jan 2013
Posts: 1
Default

Quote:
Originally Posted by sdriscoll View Post
Are people still interested in this discussion? I've been benchmarking things like crazy including Tophat and STAR as well as several other mapping approaches... [full post quoted in #11 above]
Thanks for all your work on this.

One concern I have before transitioning to STAR is that tophat2 (according to their recent paper) apparently preferentially maps to the transcriptome, THEN to the genome, whereas STAR, as far as I can tell, does not show a preference for a spliced alignment versus an alignment to a processed pseudogene which is not a part of the annotation.

Do you have any information on this (that is, mappings to processed pseudogenes between the two tools) ?
Old 07-03-2013, 06:00 PM   #19
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

This is my email to the bio-bwa-help list:

Quote:
I simulated 1 million 101bp reads from the human genome without splicing [PS: single-end reads]. This data set is like a negative control: we would expect to see no splicing/fusion from the data. Now I map these 1M single-end reads with bwa-mem, tophat+bowtie1/2 and star, and see what happens.

Under the default setting, bwa-mem reports 75 chimeric reads in ~4 minutes. None of these reads is mapped with both parts having a mapQ>3. Another way to say this is if we set a mapQ threshold 5, we will not get any chimeric alignments. In this sense, bwa-mem reports no false positives.

Tophat+bowtie2 performs quite well, too. It takes half an hour or so and identifies one false junction only. The tophat manual suggests not using bowtie2 to detect fusions [PS: thus no fusion detection in this case].

Tophat+bowtie1 under the fusion configuration is a little faster but less accurate. It reports 30 splicing alignments. It, however, gives 3193 candidate fusion events. Probably tophat-fusion-post can filter most of them as the majority are supported by a single read, but the bulk of candidates imply that tophat is not very good at initial chimeric mapping.

Star is impressively fast (at the cost of ~28GB RAM). My data set is too small to test its speed. According to its paper, star is ~10X as fast as bwa-mem, which I believe. Accuracy-wise, it reports 539 splicing and 274 chimera. [PS: unique hits only]
Command lines:

Quote:
STAR --genomeDir ../index/star --readFilesIn r1.fq --runThreadN 1 --outFilterMultimapNmax 1 --chimSegmentMin 30
tophat2 ../index/bowtie2/hs37m-bt2 r1.fq > r1-se.tophat.sam
tophat2 -o tophat_fusion_out --fusion-search --bowtie1 --no-coverage-search -r 0 --max-intron-length 100000 --fusion-min-dist 100000 --fusion-anchor-length 13 ../../index/bowtie/hs37m r1.fq
I copied the tophat2+bowtie1 command line from a webpage on the tophat website, but I cannot find the page now.

As a side note, if you really need an ultrafast mapper for genomic reads, you should try the latest SNAP. It is both fast and accurate.
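For anyone wanting to reproduce this kind of negative control, a minimal read sampler might look like the sketch below (function and variable names are mine; a real simulator such as wgsim would also introduce base errors and quality strings):

```python
import random

def sample_unspliced_reads(genome, n_reads=1000, read_len=101, seed=42):
    """Sample fixed-length substrings from a genome (dict of chrom -> sequence)
    as a spliced-alignment negative control: no read crosses an intron."""
    rng = random.Random(seed)
    chroms = [c for c, seq in genome.items() if len(seq) >= read_len]
    reads = []
    for i in range(n_reads):
        chrom = rng.choice(chroms)
        start = rng.randrange(len(genome[chrom]) - read_len + 1)
        reads.append((f"read{i}", genome[chrom][start:start + read_len]))
    return reads
```

Any spliced or chimeric alignment an aligner reports for these reads is, by construction, a false positive.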
Old 07-03-2013, 06:17 PM   #20
genomeHunter
Member
 
Location: Canada

Join Date: Apr 2013
Posts: 26
Default

Hi Heng,

Thanks for your post. It would be nice if you could upload the false spliced reads somewhere. You can see my own comparison with BWA and Bowtie2 in GCAT (link). Most spliced reads I saw had just a couple of bases mapped to some other exon, e.g., a CIGAR of 97M2567N3M. If you remove reads with overhangs less than 8, the majority of false spliced reads will be gone.

I think in general most of the mappers are mature enough nowadays and mapping RNA-Seq reads seems to be close to perfection. We need better mapping protocols, e.g., multi-step mapping, or developing methods for handling quality disparity in mates of a pair.

Cheers,
GH