Seqanswers Leaderboard Ad

**sdriscoll** · 06-30-2013, 10:05 PM

STAR's MAPQs are pretty easy to understand - they are pretty much what Tophat uses as well. The MAPQ is -10*log10(p) where p is the multi-mapping probability. So for something that can map in two places p = 1/2 and the equation works out to 3.01 which rounds to 3. If it can map to three places you get 1.76 which rounds to 2. So any MAPQ in STAR or Tophat above 3 basically just means the mapping is unique. Since -10log10(1) = 0 they just set a high value for unique mappings. There's probably some question still about whether or not those mappings are correctly called unique but that's another story. BWA uses a much different approach and I'm not one to explain it because I don't actually know what it is.

**genomeHunter** · 07-01-2013, 09:23 AM

Thanks. I knew about that and actually I use it to filter unique reads, but it is just a flag and not an indicator of mapping quality. In contrast, as you mentioned, BWA assignes a MAPQ based on CIAGR and quality of the positions in the read.

Cheers
GH

**jblachly** · 07-03-2013, 03:44 PM

Originally posted by sdriscoll View Post

Are people still interested in this discussion? I've been benchmarking things like crazy including Tophat and STAR as well as several other mapping approaches and then quantifying things like alignment position and read counts at the gene level as well as at the isoform level. I'm seeing some interesting things that sort of validate what I've suspected in the past when using these tools on real data. STAR, with a reference, vs Tophat, with a reference, perform VERY similarly in terms of alignment precision (or accuracy or 1-FDR depending on what you want to call it). Tophat with a reference is a significant improvement over without a reference while STAR's improvement is less (so STAR without a reference seems to out-perform Tophat in alignment precision). In terms of counting hits to genes they, again, perform very similarly however STAR beats Tophat out in count value precision. What I mean to say is that if I compare the list of genes that received alignments to the list of genes that should have alignments the counts from the two aligners are similar but if I compare the count values to the control values then STAR has much higher precision compared to Tophat. Both of these pipelines are defeated pretty significantly by RSEM and eXpress, however, for gene level counts.

Cufflinks is a different story. I figured out how to generate counts from cufflinks (not cuffdiff) and I compared those counts to my own naive counter and go the same values so I can see that it's counting hits at the gene level in a logical way.

The isoform level counts aren't awesome but the sensitivity (ratio of isoforms with counts to those that should have counts) is OK. the FDR is bad though.

I've not finished yet but I'm sticking together a pipeline to benchmark cufflinks' de-novo isoform assembly abilities. I've only run a single simulation which resulted in 26% false positives (that's isoforms it assembled that cuffcompare evaluated to be matches to annotated isoforms that weren't expressed at all in my simulation). so I wasn't too stoked about that. I even provided it with isoforms with massive expression to be sure there was enough reads for it to do its thing. I have a lot more to look at, however, before making any noise. There's more going on here than just cufflnks' ability to make isoforms - it's completely dependent upon the aligner. Also it's isoform expression assigning stage doesn't have the same flexibility of RSEM and eXpress who get all mappings of reads and can perform a detailed evaluation of which of those mappings is really "correct". This approach does yield an improvement over the aligner chosen "primary" mapping. To me it seems that cufflinks is at a disadvantage since it could be strangled by the aligner's ability to select the correct mapping of reads.

Thanks for all your work on this.

One concern I have before transitioning to STAR is that tophat2 (according to their recent paper) apparently preferentially maps to the transcriptome, THEN to the genome, whereas STAR, as far as I can tell, does not show a preference for a spliced alignment versus an alignment to a processed pseudogene which is not a part of the annotation.

Do you have any information on this (that is, mappings to processed pseudogenes between the two tools) ?

**lh3** · 07-03-2013, 06:00 PM

This is my email to the bio-bwa-help list:

I simulated 1 million 101bp reads from the human genome without splicing [PS: single-end reads]. This data set is like a negative control. We would expect see no splicing/fusion from the data. Now I map these 1M single-end reads with bwa-mem, tophat+bowtie1/2 and star, and see what happens.

Under the default setting, bwa-mem reports 75 chimeric reads in ~4 minutes. None of these reads is mapped with both parts having a mapQ>3. Another way to say this is if we set a mapQ threshold 5, we will not get any chimeric alignments. In this sense, bwa-mem reports no false positives.

Tophat+bowtie2 performs quite well, too. It takes half an hour or so and identifies one false junction only. The tophat manual suggests not using bowtie2 to detect fusions [PS: thus no fusion detection in this case].

Tophat+bowtie1 under the fusion configuration is a little faster but less accurate. It reports 30 splicing alignments. It, however, gives 3193 candidate fusion events. Probably tophat-fusion-post can filter most of them as the majority are supported by a single read, but the bulk of candidates imply that tophat is not very good at initial chimeric mapping.

Star is impressively fast (at the cost of ~28GB RAM). My data set is too small to test its speed. According to its paper, star is ~10X as fast as bwa-mem, which I believe. Accuracy-wise, it reports 539 splicing and 274 chimera. [PS: unique hits only]

Command lines:

STAR --genomeDir ../index/star --readFilesIn r1.fq --runThreadN 1 --outFilterMultimapNmax 1 --chimSegmentMin 30
tophat2 ../index/bowtie2/hs37m-bt2 r1.fq > r1-se.tophat.sam
tophat2 -o tophat_fusion_out --fusion-search --bowtie1 --no-coverage-search -r 0 --max-intron-length 100000 --fusion-min-dist 100000 --fusion-anchor-length 13 ../../index/bowtie/hs37m r1.fq

I copied the tophat2+bowtie1 command line from a webpage on the tophat website, but I cannot find the page now.

As a side note, if you really need an ultrafast mapper for genomic reads, you should try the latest SNAP. It is both fast and accurate.

**genomeHunter** · 07-03-2013, 06:17 PM

Hi Heng,

Thanks for your post. It would be nice if you could upload the false spliced-reads somewhere. You can see my own comparison with BWA and Bowtie2 in GCAT (link). Most spliced reads I saw had just a couple of bases mapped to some other exon, e.g., M97N2567M3. If you remove reads with overhangs less than 8, then the majority of false spliced reads will be gone.

I think in general most of the mappers are mature enough nowadays and mapping RNA-Seq reads seems to be close to perfection. We need better mapping protocols, e.g., multi-step mapping, or developing methods for handling quality disparity in mates of a pair.

Cheers,
GH

**lh3** · 07-04-2013, 06:45 AM

On my data set, setting a threshold 8 has little effect. I briefly looked through the spliced hits, the vast majority of false "exon" hits are longer than 10bp. Also, note that there are 274 chimeric alignments under "--chimSegmentMin 30".

I disagree that RNA-seq alignment is close to perfection. Tophat-like approach does not fully utilize the information in longer reads. It might also be potentially sensitive to and biased by the coverage of a gene (I am not very sure, though). Other RNA-seq mappers like STAR take the right approach, IMO, but in comparison to the top-tier DNA-seq mappers, such as novoalign and snap, they are not careful enough to disambiguate hits in semi-repetitive regions. RNA-seq mappers need to learn from DNA-seq mappers about how to improve accuracy.

On the other hand, there is a question about whether we need very high mapping accuracy for RNA-seq reads. The fluctuation in gene expression is by far higher than the mapping error rate; library prep also introduces errors (e.g. intronic reads) and variance. Maybe improving accuracy would not greatly affect typical studies.

On the detection of fusion genes, I think we will more often use old-style methods - run a generic local aligner (e.g. blast/blat/ssaha2 in the old days) and then analyze local hits. When reads reach 250bp in length, which is already happening, many will be changed again.

**alexdobin** · 07-05-2013, 07:35 PM

Hi Heng,

this is an interesting test and it would be great if you could post the simulated data so that I can play with the parameters.

STAR indeed prefers spliced alignments over alignments with mismatches to avoid the "pseudogene trap". By default, an alignment with a canonical junction without mismatches will always be chosen over an unspliced alignment with one or more mismatches. This choice of penalties will not work well with DNA mapping, of course.

Cheers
Alex

**sdriscoll** · 07-05-2013, 08:49 PM

Alex in my simulated data tests this strategy seems to pay off quite well. With STAR I see a very low rate of misalignment. One way I've tracked this is by mapping simulated RNA seq reads that only contain sequencing errors and then running a SNP caller. While STAR couldn't beat out the "corrected" transcriptome alignments I get from eXpress it performed considerably better than BWA mem and the newer "subread" aligner. The expected result would be zero snps but bwa and subread alignments resulted in 1000's. STAR gave me 8 and the express pipeline gave me 1. So STAR is placing reads in the genome basically just as well as eXpress was able to disambiguate the transcriptome alignments. Both did very well.

**lh3** · 07-05-2013, 09:49 PM

I put the spliced alignments at: http://lh3lh3.users.sourceforge.net/tmp-aln.tgz. Among ~500 spliced alignments reported by star, >60% should be mapped with a small indel. This rate of gapped alignments is much higher than the average rate ~11% in the whole data set. Does star prefer ungapped alignments to avoid frameshifts? Is it possible to tune star to choose unspliced alignments for these reads, or ask it to report both spliced and unspliced alignments?

In addition, it should be noted that the spliced alignments are only 1/8 of all mismappings (<10% for paired-end data). Excluding spliced alignments would not affect the overall conclusion on genomic read alignment.

Thanks.

**lh3** · 07-05-2013, 10:13 PM

@sdriscoll. When you use bwa-mem for RNA-seq, I can think of two sources of errors: pseudogenes and hanging ends. In the former case, wrong SNPs are mostly clustered in pseudogenes. It is difficult for a genomic read mapper to resolve this problem. In the latter case, wrong SNPs are clustered around exon boundaries. This can be alleviated a little bit by changing the default parameter (-L0). Among the 1000 errors, which source is dominant? Are there other source of errors I have overlooked? Have you briefly looked at that? (PS: I don't think a genomic read mapper can outperform an RNA-seq mapper in this case. Just want to understand the major source of errors). Thanks.

In addition, have you tried gsnap or tophat? After all, this thread is about star vs. tophat ;-)

**shi** · 07-06-2013, 12:02 AM

Originally posted by sdriscoll View Post

Alex in my simulated data tests this strategy seems to pay off quite well. With STAR I see a very low rate of misalignment. One way I've tracked this is by mapping simulated RNA seq reads that only contain sequencing errors and then running a SNP caller. While STAR couldn't beat out the "corrected" transcriptome alignments I get from eXpress it performed considerably better than BWA mem and the newer "subread" aligner. The expected result would be zero snps but bwa and subread alignments resulted in 1000's. STAR gave me 8 and the express pipeline gave me 1. So STAR is placing reads in the genome basically just as well as eXpress was able to disambiguate the transcriptome alignments. Both did very well.

I'm not sure if you used 'subread-align' (this is what I called 'subread') or 'subjunc' when you ran subread. 'subread-align' only reports the mapping locations of the largest mappable region in each read, therefore regions in the exon-spanning read which map to 2nd or 3rd exons will not be reported. You should absolutely run 'subjunc' for your RNA-seq data if you are going to call SNPs using the mapping results.

Best wishes,

Wei

**sdriscoll** · 07-06-2013, 10:15 AM

You're correct. I used subjunc but I do refer to your package as "subread" in general.

**sdriscoll** · 07-06-2013, 10:20 AM

@lh3 I haven't tried gsnap yet...I was very close to setting it up last week but got distracted. I believe Tophat, with a gene annotation, works almost as we'll as STAR. They are very similar in performance. I haven't compared the SNP results though. I suspect you are right about the mem results. That SNP count only includes hits in exons. I can take a closer look..or send you some example data if you're interested.

**lh3** · 07-06-2013, 11:45 AM

@sdriscoll. Thanks a lot. If your data are sitting at Sanger/EBI, I may have a look there. Want to understand the major error source such as I can avoid it in future. If you believe Tophat is similar, I guess gsnap should not be much different, either.

PS: just notice that you mentioned gene annotations. I guess you were running star without annotations? If that is the case, how would tophat/tophat2 without annotations be compared to star? Were you running subread with or without annotations? Sorry that I have little idea about how much gene annotations affect the result. Thanks.

**shi** · 07-06-2013, 03:47 PM

Originally posted by sdriscoll View Post

You're correct. I used subjunc but I do refer to your package as "subread" in general.

Thanks for your info. You may not be aware that Subjunc has a verification step in which all discovered exon-exon junctions are validated and all the reads are re-aligned (including exonic reads and exon-spanning reads). However, this step needs to have at least a few millions reads (eg. number of reads included in a typical sequencing library) so that the exon-exon junctions can have a good read coverage and those most highly confidently mapped reads can be used to correct mapping errors in those reads of low mapping confidence. But I guess there were much fewer reads in your simulation and thus you cannot really see the power of this approach.

In our evaluations using simulation and the SEQC data (~8M read pairs), we found subjunc was much more accurate than TopHat2 in the accuracy of junction detection and read mapping. So I'm quite surprised by your evaluation results.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News