  • RNA-seq read alignment best practices for low-input Smart-seq2 samples

    Hi All,

    I think my data may fall in a nether region between single-cell and bulk...

    There are many ways to align RNA-seq reads to transcripts and genomes, and I am confused about which is best for my experiment.

    My goal is to do differential expression analysis on Plasmodium vivax data. The input is FASTQ files of paired-end 25 nt reads from Smart-seq2.

    The P. vivax genome is 29 Mb, 40% G/C, with 6,642 genes. Its introns are much smaller and fewer in number than human introns. P. vivax RNA is hard to obtain, so we used Smart-seq2 to generate sequence. Using bowtie and RSEM, our richer samples give 4.2 million reads aligned to transcripts, with 86% of transcripts having at least one read aligned. Our less rich samples give 1.2 million reads aligned to transcripts, with 78% of transcripts having at least one read aligned. rRNA accounts for < 10% of the aligned reads.
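
    For concreteness, here is a minimal sketch of how summary numbers like these can be tallied from RSEM output. It assumes RSEM's standard tab-delimited *.isoforms.results columns (in particular "expected_count"); the file name is just a placeholder.

import csv

def rsem_summary(path="sample.isoforms.results"):
    # Tally reads assigned to transcripts and the fraction of transcripts
    # with at least one read, from RSEM's per-isoform results table.
    total_reads = 0.0
    n_transcripts = 0
    n_detected = 0
    with open(path) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            count = float(row["expected_count"])
            total_reads += count
            n_transcripts += 1
            if count >= 1:
                n_detected += 1
    print(f"reads assigned to transcripts: {total_reads:,.0f}")
    print(f"transcripts with >=1 read: {n_detected}/{n_transcripts} "
          f"({100.0 * n_detected / n_transcripts:.1f}%)")

if __name__ == "__main__":
    rsem_summary()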

    Questions:
    • Is SEQanswers the right place to ask for help on this set of questions?
    • Is my understanding correct that if I am primarily interested in differential expression of known genes, as opposed to finding new genes and fixing gene models, I may align reads to transcripts rather than to the whole genome?
    • There are so many different aligners: bwa, bowtie, bowtie2, STAR, HISAT2. How much does it really matter which one I use? Does using a different alignment program have major effects on the outcome, or is it more just playing around the edges? How do I tell which is best for my data?
    • What metrics ought I use to assess my data?


    I have read articles reviewing best practices for evaluating RNA-seq data, for example PMID 26813401. But I’m not sure how this translates to my data, which I think falls in a nether region between single-cell and bulk.

    It is wonderfully convenient to run RSEM and have RSEM do the alignment itself. Is it better practice to do the alignment step outside of RSEM?

    Thank you! Jon

  • #2
    Hi Jon,

    Not sure if you're still after an answer, but I can contribute a little experience that may help guide you.

    Q1: sure it is!

    Q2:

    Not necessarily. Direct-to-transcriptome approaches such as eXpress, RSEM, Salmon/Sailfish, or Kallisto rank highest in gene-level normalized expression accuracy; that is, they generate the most accurate FPKM/TPM-type quantification. This is because they probabilistically determine isoform-level expression. Without accurate isoform levels you cannot compute accurate gene levels, since gene levels are the sum of the isoform levels, and because these expression values are transcript-length normalized there is no way to properly assemble the gene-level total without accurate isoform levels. Quantification of normalized expression from these programs is far superior to that from programs like Cufflinks or StringTie, and they also do a fantastic job of not mis-assigning expression to pseudogenes.

    That's all great, and when we make figures for a paper that include gene expression levels I always use TPM values from these kinds of programs. RSEM seems to come out on top, but if you want results faster then Kallisto and Salmon do a good job. However, since we don't do differential expression on these normalized expression values, it's not strictly necessary to run them; we get good differential expression results from genome alignment plus read counting over annotated features with something like htseq-count or featureCounts.
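
    To make the isoform-to-gene point concrete, here's a toy sketch (the numbers are made up, not from any real program) of how TPM is length-normalized per isoform from expected counts and effective lengths, and then summed to the gene level:

def tpm(expected_counts, effective_lengths):
    """Convert per-isoform expected counts to TPM (length-normalized)."""
    rates = [c / l for c, l in zip(expected_counts, effective_lengths)]
    scale = sum(rates)
    return [1e6 * r / scale for r in rates]

# Hypothetical gene with two isoforms, plus one unrelated single-isoform gene.
counts = [120.0, 30.0, 850.0]          # expected_count per isoform
eff_len = [1500.0, 600.0, 2000.0]      # effective_length per isoform
gene_of = ["geneA", "geneA", "geneB"]  # isoform -> gene mapping

iso_tpm = tpm(counts, eff_len)
gene_tpm = {}
for g, t in zip(gene_of, iso_tpm):
    gene_tpm[g] = gene_tpm.get(g, 0.0) + t  # gene TPM = sum of its isoform TPMs
print(gene_tpm)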

    There are a couple of issues I've run into with the direct-to-transcriptome approach, however. First, it takes the transcripts out of context. I've found that alignments to the genome will sometimes reveal continuous coverage across a genomic region that happens to house a transcript. That genome coverage may not be at all specific to the gene, yet if you run a direct-to-transcriptome quantification you'll see expression for it. We also sometimes have single-exon features that lie within an intron of another gene, and sometimes that intron is retained. Again, the direct-to-transcriptome approach will count expression for that single-exon feature when it isn't actually appropriate. When I see one or two cases of something like that I assume it's widespread, so I've expanded my transcriptome references to include introns and genomic regions that contain single-exon transcripts, as well as any transcript that is totally contained within the intron of another. This way the probabilistic assignment routine can "see" both possibilities and, based on coverage, make a good choice. This has worked great to cut down on those false positives.
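
    As a rough sketch of what I mean by an expanded reference (coordinates and file names here are made up; in practice you'd pull them from your GTF), you append the extra genomic spans as additional FASTA entries alongside the transcripts:

def load_fasta(path):
    # Read a FASTA file into a dict of {sequence_name: sequence}.
    seqs, name, chunks = {}, None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if name:
                    seqs[name] = "".join(chunks)
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line)
    if name:
        seqs[name] = "".join(chunks)
    return seqs

genome = load_fasta("genome.fa")

# (name, chrom, start, end), 1-based inclusive; e.g. the intron of geneX
# that contains a single-exon transcript. Purely illustrative coordinates.
extra_regions = [("geneX_intron2", "chr1", 105000, 112000)]

with open("extra_reference_entries.fa", "w") as out:
    for name, chrom, start, end in extra_regions:
        seq = genome[chrom][start - 1:end]
        out.write(f">{name}\n{seq}\n")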

    Another issue I've come across is reads that map to a gene, maybe even exclusively to that one gene in the transcriptome, but that align to many sites when mapped to the genome. By genome-alignment standards the read's actual origin is ambiguous, and one may not want to trust its assignment to any spot, certainly not to a particular gene. This problem is hard to overcome because, for these programs to work well, they technically need to "see" all possible alignments of a read to the transcriptome. So a read that maps to multiple gene loci is something you want to keep, while a read that maps to some gene loci and then to many additional genome sites is something you'd want to exclude.
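
    One way to act on that, sketched below with made-up file names and an arbitrary threshold, is to flag reads whose genome alignment reports many hits (the NH tag) and drop them from the transcriptome alignments before quantification. Requires pysam.

import pysam

MAX_GENOME_HITS = 10  # arbitrary cutoff for "maps to too many genome sites"
bad_reads = set()

# Pass 1: collect names of reads that are promiscuous in the genome.
with pysam.AlignmentFile("genome_alignment.bam", "rb") as bam:
    for aln in bam:
        if aln.is_unmapped:
            continue
        if aln.has_tag("NH") and aln.get_tag("NH") > MAX_GENOME_HITS:
            bad_reads.add(aln.query_name)

# Pass 2: keep only transcriptome alignments from the remaining reads.
with pysam.AlignmentFile("txome_alignment.bam", "rb") as src, \
     pysam.AlignmentFile("txome_filtered.bam", "wb", template=src) as dst:
    for aln in src:
        if aln.query_name not in bad_reads:
            dst.write(aln)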

    Finally, I've had issues with large bundles of genes that share sequence, say 60% or more similar: not alternative isoforms within a locus, but many genes spread around the genome that are highly similar. These programs are designed to solve a gene locus; the goal is to report as few features within the locus as necessary to fully explain the coverage. That approach doesn't extend well to these gene families. What ends up happening is that, depending on the pattern of coverage at these features, the program will push all of the reads that are ambiguous among the lot of them into one feature, and any variation in the coverage pattern may cause it to pick a different feature to assign those reads to. As a result I'll end up with differential expression results showing these features going from 0 reads in one condition to hundreds of reads (or more) in another, and they end up polluting the top hits. If I instead treat these features as ambiguous and bundle them up, summing the read counts assigned to them all into a single feature, I find that overall they are not differentially expressed. It's been like that every time, in every experiment where I've run into it.
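
    The bundling itself is simple; here's a rough sketch with hypothetical gene IDs and family assignments (in practice the family membership would come from clustering the sequences by similarity):

import pandas as pd

# Gene-level count matrix: rows = genes, columns = samples. File name is a placeholder.
counts = pd.read_csv("gene_counts.tsv", sep="\t", index_col=0)

# Hypothetical assignment of highly similar genes to one meta-feature.
family_of = {
    "PVX_000010": "paralog_family_1",
    "PVX_000020": "paralog_family_1",
    "PVX_000030": "paralog_family_1",
}

# Genes not in a family keep their own ID; family members collapse together.
group = counts.index.to_series().map(lambda g: family_of.get(g, g))
bundled = counts.groupby(group).sum()
bundled.to_csv("gene_counts_bundled.tsv", sep="\t")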

    Q3:
    There are a lot of options, yes. I've found that no one aligner really gets everything that's align-able. For running RSEM, what I do is align with STAR first, because it's super fast, and then mop up as many of the unaligned reads as possible with bowtie2. I only chose to do that because overall it's faster than doing the entire alignment with bowtie or bowtie2, and back when I set this up RSEM only had a run option for bowtie. I think it can now call any of those three, so I'd say use STAR since it's likely the fastest, and I'd assume the settings the RSEM authors have worked out are good, since they know what their program needs in order to make a good quantification.
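
    A rough sketch of that two-pass setup (index paths, output prefixes, and thread counts are placeholders; the exact parameters should match what RSEM expects if you feed the alignments back into it):

import subprocess

# STAR pass: fast alignment, plus a transcriptome BAM for RSEM and the
# unmapped reads written out for the bowtie2 mop-up.
subprocess.run([
    "STAR", "--runThreadN", "8",
    "--genomeDir", "star_index",
    "--readFilesIn", "R1.fastq", "R2.fastq",
    "--outFileNamePrefix", "sample_",
    "--outSAMtype", "BAM", "Unsorted",
    "--quantMode", "TranscriptomeSAM",   # writes sample_Aligned.toTranscriptome.out.bam
    "--outReadsUnmapped", "Fastx",       # writes sample_Unmapped.out.mate1/2
], check=True)

# bowtie2 pass: try to rescue whatever STAR left unmapped.
subprocess.run([
    "bowtie2", "-p", "8",
    "-x", "bowtie2_index",
    "-1", "sample_Unmapped.out.mate1",
    "-2", "sample_Unmapped.out.mate2",
    "-S", "sample_mopup.sam",
], check=True)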

    That's all I have for now.
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */
