**Brian Bushnell** · 03-26-2015, 06:47 PM

If you want maximal sensitivity, I encourage you to try out BBMap; it's faster and much more sensitive than TopHat2. I am the author and am biased, but those are objective facts. For human RNA-seq I suggest:

(index)
bbmap.sh ref=hg38.fa

(map)
bbmap.sh in=reads.fq out=mapped.sam maxindel=200000 intronlen=10 ambig=all xstag=unstranded xmtag=t rpkm=rpkm.txt

Although if the library was strand-specific, and you want to analyze it with Cufflinks (which I do not recommend), then change the xstag flag to whatever strand type you used. xstag and xmtag are just for Cufflinks compatibility; I don't think other programs need them. "rpkm=<file>" is optional but will write rpkm/fpkm counts to that file.
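As a rough sketch, the map command above could be wrapped so that the strand type is the only thing that changes between runs. The function name `map_rnaseq` and the `BBMAP` variable are hypothetical, and the values `firststrand` and `secondstrand` are assumptions about what xstag accepts; check them against the help text printed by bbmap.sh before relying on them.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper; map_rnaseq and BBMAP are illustrative names, not part of BBMap.
BBMAP="${BBMAP:-bbmap.sh}"

# Usage: map_rnaseq READS OUT [STRAND]
# STRAND defaults to "unstranded"; the other two values are assumed, not verified.
map_rnaseq() {
    local reads="$1"
    local out="$2"
    local strand="${3:-unstranded}"
    case "$strand" in
        unstranded|firststrand|secondstrand) ;;
        *) echo "unknown strand type: $strand" >&2; return 1 ;;
    esac
    # Same flags as the map command above; only xstag varies.
    "$BBMAP" in="$reads" out="$out" maxindel=200000 intronlen=10 \
        ambig=all xstag="$strand" xmtag=t rpkm=rpkm.txt
}
```

Calling `map_rnaseq reads.fq mapped.sam` reproduces the unstranded command, while a third argument switches the xstag value for a strand-specific library.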

For analysis, I suggest you pick a single human genome build (including whatever alternate contigs you want) and standardize on it. You will waste a huge amount of time and effort if you map to multiple builds, call variants with respect to each one, calculate different FPKMs for each, maintain variant and gene-locus databases for each, etc etc. Trust me - it is not worth it; there are an infinite number of possibilities out there and time is better spent on the 99% where they concur than the 1% where they differ, which is basically the least-reliable part. And multiple different contradictory answers typically make people confused; they will either ignore all of it for being too much information, or chase the white rabbit down the wrong hole - so, in my opinion, your job is not to give everyone all possible information, but to act as an intelligent filter and pick the best, even when you're not certain.

Gene-annotation is a bit different, since it does not cause so much replicated work. When I dealt with the human genome, for determining the effects of variations, I made a union of the major annotation databases and considered the effect of any variation on all possible transcripts. However, for RNA-seq quantification, you don't want two different transcripts that are identical except one ends 3bp later than the other competing for the same reads. So here again, for RNA-seq quantification, I think picking one thing and sticking with it is best. You could pick one and union it with all others, throwing away any transcript that overlaps with a transcript from your chosen set; that would prevent the aforementioned problem, and still allow you to deal with genes that are completely unique to one database without being confounded by barely-differing duplicates.
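The "pick one and union it with the others" idea can be sketched as a small overlap filter. This is only an illustration under stated assumptions: both annotation sets have already been reduced to tab-separated, BED-like transcript intervals (chromosome, 0-based start, end, name), strand is ignored, and the function name `merge_annotations` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep every primary transcript, and add secondary
# transcripts only when they overlap no primary transcript on the same chromosome.
# Usage: merge_annotations PRIMARY.bed SECONDARY.bed > merged.bed
merge_annotations() {
    awk -F'\t' '
        NR == FNR {                          # first file: the chosen (primary) set
            n[$1]++
            s[$1, n[$1]] = $2 + 0            # remember start and end per chromosome
            e[$1, n[$1]] = $3 + 0
            print                            # primary transcripts are always kept
            next
        }
        {                                    # second file: candidates from other sets
            for (i = 1; i <= n[$1]; i++)
                if (($2 + 0) < e[$1, i] && ($3 + 0) > s[$1, i])
                    next                     # overlaps a primary transcript: discard
            print                            # unique to the secondary set: keep
        }
    ' "$1" "$2"
}
```

The comparison is a simple all-against-all scan per chromosome, which is fine for a sketch but slow for full annotation sets; a sorted sweep or an interval tool would be the practical choice.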

In summary:

Make a system where at any given time there is exactly one reference (potentially picking the best and adding unique sequences from others), exactly one set of annotations (potentially picking the best for your chosen reference, and bringing in uniquely annotated genes from other references and annotations for additional sequences you brought in), and produce some standardized output that everyone can understand - updating on a regular schedule.

I have no comment on the best choice of a human annotation set.

As for the decision between transcriptome-mapping and genome-mapping - in my opinion, genome-mapping is best for any organism that is not completely understood, while transcriptome-mapping is best for tiny genomes in which you can confidently claim that all genes have been identified, as well as all transcripts, since microbes typically lack alternative splicing. But if you want to make new discoveries in eukaryotes, you should always do genome-mapping first and only try mapping the residual unmapped reads to the transcriptome afterward. Otherwise your results will be biased by existing annotations.
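A minimal sketch of that genome-first, transcriptome-second order, assuming a genome index has already been built in the working directory as in the index step above. The function name `two_pass_map`, the `BBMAP` variable, and the output file names are hypothetical; `outu=` (write unmapped reads) and `nodisk` (build the second index in memory rather than on disk) are flags I believe bbmap.sh supports, but treat them as assumptions and confirm against its help text.

```shell
#!/usr/bin/env bash
# Hypothetical helper; two_pass_map and BBMAP are illustrative names.
BBMAP="${BBMAP:-bbmap.sh}"

# Usage: two_pass_map READS TRANSCRIPTOME_FASTA
two_pass_map() {
    local reads="$1"
    local transcripts="$2"
    # Pass 1: map everything to the genome; reads that fail to map go to unmapped.fq.
    "$BBMAP" in="$reads" out=genome.sam outu=unmapped.fq \
        maxindel=200000 intronlen=10 ambig=all || return 1
    # Pass 2: map only the leftover reads to the transcriptome.
    # nodisk is assumed to keep this index in memory so the genome index is untouched.
    "$BBMAP" in=unmapped.fq ref="$transcripts" nodisk out=transcriptome.sam
}
```

Because the transcriptome only ever sees reads the genome could not place, existing annotations cannot pull reads away from novel loci.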

**luc** · 03-26-2015, 07:07 PM

Hi Brian,
Which Cufflinks alternative would you recommend?

Thanks

**Brian Bushnell** · 03-26-2015, 08:10 PM

Luc,

I apologize - I have had a lot of negative experiences with Cufflinks, but I don't directly deal with RNA-seq quantification anymore, so I have no positive experiences with anything else. Therefore, while I encourage you to read my exposition, it doesn't suggest a specific alternative.

One of my trustworthy co-workers showed me results indicating that Cufflinks was incapable of determining that there was differential expression between human males and females for genes exclusive to chromosome Y, while DESeq was capable of determining that, in fact, human males show higher expression of these genes than human females. That is enough for me to favor one over the other. When combined with my experience of Cufflinks running forever unless you manually filter out extremely highly-expressed genes, I can't ever recommend it to anyone.

I have no direct experience with DESeq, but that's what JGI switched to instead of Tuxedo-package alternatives; from the data I have seen, it was a very good choice - it appears to be much more accurate and consistent.

I only have direct knowledge of BBMap's rpkm/fpkm counts, or Seal's. They give similar results for the vast majority of transcripts, and contradictory results for some transcripts, since they analyze things differently (BBMap uses global alignment, and Seal counts long kmer matches). That non-consensus greatly irritates me, but underscores the point that algorithms and models only provide approximations, not truth.

In summary - as a direct alternative, it depends on what you are doing, but DESeq seems to be good for differential expression in known transcripts, though I have not tested it personally. For identification of novel transcripts, I have no suggestions.

**lexogen** · 03-27-2015, 06:44 AM

Hi Luc and others who are looking for an alternative to Cufflinks -

I can recommend trying our new software, Mix-Square.
You can get it here: https://secure.lexogen.com/solo/prod...26&Single=true

We have started offering it to researchers to test on various datasets and are interested in their feedback.

It is a statistical model which addresses positional coverage bias by mixtures of probability distributions. The parameters of the Mix2 model can be efficiently trained with the EM algorithm yielding simultaneous estimates for the relative abundance of gene isoforms and the positional coverage bias.

In summary, Mix-Square yields highly accurate concentration estimates for gene isoforms by adapting to the positional coverage bias in RNA-Seq data. This leads to higher accuracy in the detection of differential expression of genes and gene isoforms. Mix-Square enables repeatable concentration estimates across multiple library preparations and sequencing facilities and can be used as an exploratory tool to investigate the positional biases present in RNA-Seq data. Mix-Square is highly efficient and runs significantly faster than Cufflinks and PennSeq.
We have uploaded a manuscript draft to bioRxiv (http://biorxiv.org/content/biorxiv/e...11767.full.pdf), but for the latest information, including the data from more experiments, please contact us directly.

**luc** · 04-01-2015, 10:32 PM

Hi Brian, hi lexogen,

Thanks a lot for the explanations and the pointers to DESeq (we have been testing it previously) and Mix-Square.

**pkstarstorm05** · 09-10-2015, 05:29 AM

Hey Brian - amaze balls advice and insight as always.

I was wondering if you had any suggestions for mouse annotation sources (Ensembl vs. UCSC, etc.) - or is that the same story as human annotation files?

Also, can an annotation file be included with BBMap (in the way you can include one with TopHat2) to annotate your transcriptome, or is that unnecessary? I.e., can annotation simply be taken care of later during transcriptome assembly? I'm not totally clear on this.

Lastly - I absolutely love your software. It's by far the most useful tool I've come across, and it's helped me solve some incredibly difficult genome mutation problems. I don't know how much this software is being used, but it should be the first pick for everyone. But I know it's not - and one reason I'm quite confident about is that the manual is... let's say difficult to read. Your quick responses are amazing and always incredibly helpful, and I wonder if an overhaul of the manual (say, in the way that Bowtie's manual is set up) wouldn't make this software a little more accessible and/or attractive.

I implore you! Beautify the BBMap manual for the good of humanity! (Or am I being absurd and there already is one...?)

Cheers,
Paul G

**Brian Bushnell** · 09-10-2015, 09:14 AM

Hi Paul,

I don't really deal with annotation these days, but I tend to use NCBI as my go-to site for such things. For example -
ftp://ftp.ncbi.nlm.nih.gov/genomes/Mus_musculus/GFF/

But, I don't know if those are better or worse than others. BBMap does not accept annotation files and strictly finds splice sites de-novo. Though I'm not entirely clear on your question, perhaps. Can you describe exactly what you're doing? It sounds like you're trying to improve the mouse annotation by finding novel expressed genes or isoforms... is that correct?

I completely agree with you about the manual (which kind of does not exist); it has a lot of room for improvement. I'm looking into the possibility of hiring a student to help me write a comprehensive guide to all of the different tools, since I'm short on time. Hopefully something will come out of that!

-Brian

**pkstarstorm05** · 09-10-2015, 10:59 PM

Hi Brian -

sorry I've got one more question about bbmap's run information printing. Does bbmap produce an output log? There are stats printed out on the screen during a mapping run, but it looks like there is no log. I know you can print any console output directly to a log file... (2> outputlog.txt I think) but I've set up a very large mapping run that I would prefer not to restart!
>.<

That would be great - for us and the student! I may find myself looking for a little side project, so if I get some time, I'll throw together an HTML template based on the Bowtie2 manual and send it to you. That might work at least for a while until you have an opportunity to fill something out more completely.

Cheers,
Paul

**gringer** · 09-12-2015, 05:23 AM

My prediction is that the ultra-fast mappers that do gene-counting at the same time as mapping are going to be the way to do RNASeq in the future; programs like STAR, Kallisto, RapMap, and possibly even HISAT2 (although that doesn't yet seem to be at the map-and-count stage).

**Brian Bushnell** · 09-12-2015, 07:21 AM

> Originally posted by pkstarstorm05
>
> Hi Brian -
>
> sorry I've got one more question about bbmap's run information printing. Does bbmap produce an output log? There are stats printed out on the screen during a mapping run, but it looks like there is no log. I know you can print any console output directly to a log file... (2> outputlog.txt I think) but I've set up a very large mapping run that I would prefer not to restart!
> \>.<

There is a flag, "statsfile=stderr" (which is the default), and you can change it to, say, "statsfile=stats.txt" to write to a specific file without redirection. However, you can't change that once BBMap has started running.
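The two ways of capturing the run statistics can be sketched side by side. The function names and the `BBMAP` variable are hypothetical; the only BBMap-specific piece is the statsfile flag described above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; BBMAP and both function names are illustrative.
BBMAP="${BBMAP:-bbmap.sh}"

# Option 1: have BBMap write its summary statistics to a named file.
# Usage: map_with_statsfile READS OUT STATSFILE
map_with_statsfile() {
    "$BBMAP" in="$1" out="$2" statsfile="$3"
}

# Option 2: leave the default (statsfile=stderr) and let the shell capture stderr.
# Usage: map_with_redirect READS OUT LOGFILE
map_with_redirect() {
    "$BBMAP" in="$1" out="$2" 2> "$3"
}
```

Either choice has to be made when the job is launched, since the shell sets up the redirection and BBMap reads its flags before mapping begins.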

> That would be great - for us and the student! I may find myself looking for a little side project, so if I get some time, I'll throw together an HTML template based on the Bowtie2 manual and send it to you. That might work at least for a while until you have an opportunity to fill something out more completely.
>
> Cheers,
> Paul

There is also Genomax's thread, which gathers a lot of common BBMap commands. It's not a manual, but it's very helpful. Still, if you have the time and inclination to put together something in HTML, I'd be grateful!

RNASeq Best Practices