Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • RNASeq Best Practices

    After trying out various command line options with tophat2 (followed by htseq-count), as well as different sources for annotation files and genome sequences, I am somewhat lost as to what the best practices should be.

    My samples are human and I would like to use the hg38 genome release. I think I understand the various differences between the UCSC, NCBI, and Ensembl annotations sufficiently now, and have convinced myself that Ensembl might be the best way to go (feedback on this choice would still be very appreciated).

    However, I am still stuck on how to deal with genes that have been mapped to multiple "regular" as well as alternative assemblies from the hg38 releases. If I exclude alternative assemblies, I would, for example not have CCL3L1 included. If I included them, I would definitely need to use Ensembl's unique IDs, which complicates my downstream analysis somewhat (our biologists would prefer to use gene ids like 'CCL3L1').

    As for tophat2 options, I have been comparing output from the default options plus --b2--very-sensitive with that using a transcriptome index. For the latter, I thought it would make sense to include hits to the transcriptome only and pre-filter them by excluding any reads with more than one hit. This (probably misguided) approach was thought to speed up counting of reads with htseq-count, which also would have discarded multiple hits of the same read. Alas, I end up with very different results compared to the default options, which I think I can explain but would love someone else's opinion on as well... basically, with the second approach, I am seeing reads INCLUDED that would have mapped better to the whole genome but also mapped to a transcript; vice versa, I am seeing reads EXCLUDED that only overlap splice junctions such that part of the read is exonic, the other part is intronic. htseq-count happily includes the latter in the total count.

    Any input and advice would be greatly appreciated!
    Last edited by sunkid; 03-26-2015, 05:22 PM. Reason: clarification

  • #2
    If you want maximal sensitivity, I encourage you to try out BBMap; it's faster and much more sensitive than TopHat2. I am the author and am biased, but those are objective facts. For human RNA-seq I suggest:

    (index)
    bbmap.sh ref=hg38.fa

    (map)
    bbmap.sh in=reads.fq out=mapped.sam maxindel=200000 intronlen=10 ambig=all xstag=unstranded xmtag=t rpkm=rpkm.txt

    Although if the library was strand-specific, and you want to analyze it with Cufflinks (which I do not recommend), then change the xstag flag to whatever strand type you used. xstag and xmtag are just for Cufflinks compatibility; I don't think other programs need them. "rpkm=<file>" is optional but will write rpkm/fpkm counts to that file.

    For analysis, I suggest you pick a single human genome build (including whatever alternate contigs you want) and standardize on it. You will waste a huge amount of time and effort if you map to multiple builds, call variants with respect to each one, calculate different FPKMs for each, maintain variant and gene-locus databases for each, etc etc. Trust me - it is not worth it; there are an infinite number of possibilities out there and time is better spent on the 99% where they concur than the 1% where they differ, which is basically the least-reliable part. And multiple different contradictory answers typically make people confused; they will either ignore all of it for being too much information, or chase the white rabbit down the wrong hole - so, in my opinion, your job is not to give everyone all possible information, but to act as an intelligent filter and pick the best, even when you're not certain.

    Gene-annotation is a bit different, since it does not cause so much replicated work. When I dealt with the human genome, for determining the effects of variations, I made a union of the major annotation databases and considered the effect of any variation on all possible transcripts. However, for RNA-seq quantification, you don't want two different transcripts that are identical except one ends 3bp later than the other competing for the same reads. So here again, for RNA-seq quantification, I think picking one thing and sticking with it is best. You could pick one and union it with all others, throwing away any transcript that overlaps with a transcript from your chosen set; that would prevent the aforementioned problem, and still allow you to deal with genes that are completely unique to one database without being confounded by barely-differing duplicates.

    In summary:

    Make a system where at any given time there is exactly one reference (potentially picking the best and adding unique sequences from others), exactly one set of annotations (potentially picking the best for your chosen reference, and bringing in uniquely annotated genes from other references and annotations for additional sequences you brought in), and produce some standardized output that everyone can understand - updating on a regular schedule.

    I have no comment on the best choice of a human annotation set.

    As for the decision between transcriptome-mapping and genome-mapping - in my opinion, genome-mapping is best for any organism that is not completely understood, while transcriptome-mapping is best for tiny genomes in which you can confidently claim that all genes have been identified, as well as all transcripts, since microbes typically lack alternative splicing. But if you want to make new discoveries in eukaryotes, you should always do genome-mapping first and only try mapping the residual unmapped reads to the transcriptome afterward. Otherwise your results will be biased by existing annotations.

    Comment


    • #3
      Hi Brian,
      Which Cufflinks alternative would you recommend?

      Thanks

      Comment


      • #4
        Luc,

        I apologize - I have had a lot of negative experiences with Cufflinks, but I don't directly deal with RNA-seq quantification anymore, so I have no positive experiences with anything else. Therefore, while I encourage you to read my exposition, it doesn't suggest a specific alternative.

        One of my trustworthy co-workers showed me results indicating that Cufflinks was incapable of incapable of determining that there was differential expression between human males and females for genes exclusive to chromosome Y, while DEseq was capable of determining that, in fact, human males show higher expression of these genes than human females. That is enough for me to favor one over the other. When combined with my experience of Cufflinks running forever unless you manually filter out extremely highly-expressed genes, I can't ever recommend it to anyone.

        I have no direct experience with DEseq, but that's what JGI switched to instead of Tuxedo-package alternatives; from the data I have seen, it was a very good choice - it appears to be much more accurate and consistent.

        I only have direct knowledge of BBMap's rpkm/fpkm counts, or Seal's. They give similar results for the vast majority of transcripts, and contradictory results for some transcripts, since they analyze things differently (BBMap uses global alignment, and Seal counts long kmer matches). That non-consensus greatly irritates me, but underscores the point that algorithms and models only provide approximations, not truth.

        In summary - as a direct alternative, it depends on what you are doing, but DEseq seems to be good for differential expression in known transcripts, though I have not tested it personally. For identification of novel transcripts, I have no suggestions.
        Last edited by Brian Bushnell; 03-26-2015, 08:12 PM.

        Comment


        • #5
          Hi Luc and others who are looking for an alternative to Cufflinks -

          I can recommend to try our new software called Mix-Square
          You can get it here https://secure.lexogen.com/solo/prod...26&Single=true

          We have started offering it to researchers to test on various datasets and are interested in their feedback.

          It is a statistical model which addresses positional coverage bias by mixtures of probability distributions. The parameters of the Mix2 model can be efficiently trained with the EM algorithm yielding simultaneous estimates for the relative abundance of gene isoforms and the positional coverage bias.

          In summary, Mix-Square yields highly accurate concentration estimates for gene isoforms by adapting to the positional coverage bias in RNA-Seq data. This leads to higher accuracy in the detection of differential expression of genes and gene isoforms. Mix-Square enables repeatable concentration estimates across multiple library preparations and sequencing facilities and can be used as an explorative tool to investigate the positional biases present in RNA-Seq data. Mix-Square is highly efficient and runs significantly faster than Cufflinks and Pennseq.
          We have uploaded a manuscript draft on biorxiv http://biorxiv.org/content/biorxiv/e...11767.full.pdf , but for the latest information, including the data from more experiments, please contact us directly.

          Comment


          • #6
            Hi Brian, hi lexogen,

            thanks a lot for the explanations and the pointers to DEseq (we have been testing it previoulsy) and Mix-Square.

            Comment


            • #7
              Hey Brian - amaze balls advice and insight as always.

              I was wondering if you had any suggestions for mouse annotation sources (ensemble, vs ucsc, etc) - or is that the same story as human annotation files?

              Also, can an annotation file be included with bbmap (in the way you can include one with tophat2) to annotate your transcriptome or is that unnecessary? I.e. Can annotation simply be taken care of later during transcriptome assembly? I'm not totally clear on this.

              Lastly - I absolutely love your software. Its by far the most useful tool I've come across, and its helped me solve some incredibly difficult genome mutation problems. I don't know how much this software is being used, but its should be the first pick for everyone. But I know its not - and one reason I'm quite confident about is that the manual is... lets say difficult to read. You're quick responses are amazing and always incredibly helpful, and I wonder if an overhaul to the manual (say in the way that bowtie's manual is set up) wouldn't make this software a little more accessible and/or attractive.

              I implore you! Beautify the bbmap manual for the good of humanity! (or am I being absurd and there is already one...?)

              Cheers,
              Paul G

              Comment


              • #8
                Hi Paul,

                I don't really deal with annotation these days, but I tend to use NCBI as my go-to site for such things. For example -
                ftp://ftp.ncbi.nlm.nih.gov/genomes/Mus_musculus/GFF/

                But, I don't know if those are better or worse than others. BBMap does not accept annotation files and strictly finds splice sites de-novo. Though I'm not entirely clear on your question, perhaps. Can you describe exactly what you're doing? It sounds like you're trying to improve the mouse annotation by finding novel expressed genes or isoforms... is that correct?

                I completely agree with you about the manual (which kind of does not exist); it has a lot of room for improvement. I'm looking into the possibility of hiring a student to help me write a comprehensive guide to all of the different tools, since I'm short on time. Hopefully something will come out of that!

                -Brian
                Last edited by Brian Bushnell; 09-10-2015, 09:18 AM.

                Comment


                • #9
                  Hi Brian -

                  sorry I've got one more question about bbmap's run information printing. Does bbmap produce an output log? There are stats printed out on the screen during a mapping run, but it looks like there is no log. I know you can print any console output directly to a log file... (2> outputlog.txt I think) but I've set up a very large mapping run that I would prefer not to restart!
                  >.<

                  That would be great - for us and the student! I may find myself looking for a little side project, so if I get some time, I'll throw together an HTML template based on the Bowtie2 manual and send it to you. That might work at least for a while until you have an opportunity to fill something out more completely.

                  Cheers,
                  Paul

                  Comment


                  • #10
                    My prediction is that the ultra-fast mappers that do gene-counting at the same time as mapping are going to be the way to do RNASeq in the future; programs like STAR, Kallisto, RapMap, and possibly even HISAT2 (although that doesn't yet seem to be at the map-and-count stage).

                    Comment


                    • #11
                      Originally posted by pkstarstorm05 View Post
                      Hi Brian -

                      sorry I've got one more question about bbmap's run information printing. Does bbmap produce an output log? There are stats printed out on the screen during a mapping run, but it looks like there is no log. I know you can print any console output directly to a log file... (2> outputlog.txt I think) but I've set up a very large mapping run that I would prefer not to restart!
                      >.<
                      There is a flag, "statsfile=stderr" (which is the default), and you can change it to, say, "statsfile=stats.txt" to write to a specific file without redirection. However, you can't do that after BBMap starts running

                      That would be great - for us and the student! I may find myself looking for a little side project, so if I get some time, I'll throw together an HTML template based on the Bowtie2 manual and send it to you. That might work at least for a while until you have an opportunity to fill something out more completely.

                      Cheers,
                      Paul
                      There is, also, Genomax's thread which gathered a lot of common BBMap commands. Not a manual, but very helpful. Still, if you have the time and inclination to put together something in HTML, I'd be grateful!

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Essential Discoveries and Tools in Epitranscriptomics
                        by seqadmin




                        The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                        04-22-2024, 07:01 AM
                      • seqadmin
                        Current Approaches to Protein Sequencing
                        by seqadmin


                        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                        04-04-2024, 04:25 PM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, Yesterday, 11:49 AM
                      0 responses
                      15 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-24-2024, 08:47 AM
                      0 responses
                      16 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-11-2024, 12:08 PM
                      0 responses
                      61 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 10:19 PM
                      0 responses
                      60 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X