Go Back   SEQanswers > Applications Forums > RNA Sequencing

Similar Threads
Thread Thread Starter Forum Replies Last Post
UMI in Smart-seq2 protocol PrimerBuffalo Sample Prep / Library Generation 1 05-11-2017 03:10 PM
SMART-Seq v4 Ultra Low Input RNA Kit potentially compromised - how robust are the rea pheucticus Sample Prep / Library Generation 0 08-13-2016 01:53 PM
Clontech Smart Seq v4 Ultra Low RNA Input sajoshi Sample Prep / Library Generation 7 03-16-2016 05:15 AM
Smart-seq2 lowest input? kobeho24 RNA Sequencing 1 06-04-2015 07:19 AM

Thread Tools
Old 12-21-2017, 12:14 PM   #1
Jon Goldberg
Junior Member
Location: Newton, MA

Join Date: Dec 2017
Posts: 1
Default RNA-seq read alignment best practices for low-input Smart-seq2 samples

RNA-seq read alignment best practices for low-input Smart-seq2 samples

Hi All,

I think my data may fall in a nether region between single-cell and bulk...

There are many ways to align RNA-seq reads to transcripts and genomes, and I am confused about which is best for my experiment.

My goal is to do differential expression analysis on Plasmodium vivax data. The input is fastq files of paired 25nt reads from Smart-seq2.

The P.vivax genome is 29 Mb, 40% G/C, 6,642 genes. Introns are much smaller and fewer in number than human introns. P.vivax RNA is hard to get, so we used Smart-seq2 to generate sequence. Using bowtie and RSEM we find for our richer samples we get 4.2 million reads aligned to transcripts, and 86% transcripts with at least one read aligned. For our less rich samples we get 1.2 million reads aligned to transcripts, and 78% transcripts with at least one read aligned. rRNA accounts for < 10% of the aligned reads.

  • Is SEQanswers the right place to ask for help on this set of questions?
  • Is my understanding correct that if I am primarily interested in differential expression of known genes, as opposed to finding new genes and fixing gene models, that I may align reads to transcripts as opposed to the whole genome?
  • There are so many different aligners: bwa, bowtie, bowtie2, STAR, HISAT2. How much does it really matter which one I use? Does using a different alignment program have major effects on the outcome, or is it more just playing around the edges? How do I tell which is best for my data?
  • What metrics ought I use to assess my data?

I have read articles reviewing best practices for evaluation RNA-seq data, for example PMID26813401. But I’m not sure how this translates to my data, which I think falls in a nether region between single-cell and bulk.

It is wonderfully convenient to run RSEM and have RSEM do the alignment itself. Is it better practice to do the alignment step outside of RSEM?

Thank you! Jon
Jon Goldberg is offline   Reply With Quote
Old 01-10-2018, 12:43 PM   #2
I like code
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438

Hi Jon,

Not sure if you're still after any kind of answer but I can contribute a little bit of experience that may help guide you.

Q1: sure it is!


Not necessarily. Direct to transriptome approaches such as eXpress, RSEM, Salmon/sailfish, or Kallisto rank the highest in gene level normalized expression accuracy. That is to say they generate the most accurate FPKM/TPM type quantification. This is due to the fact that they probabilistically determine isoform level expression. Without accurate isoform levels you cannot compute accurate gene levels since those are a sum of the isoform levels. Since these kind of expression values are transcript length normalized there's no way to properly assemble the total gene level without accurate isoform levels. Quantification of normalized expression from these programs is far superior to those from programs like Cufflinks or Stringtie. They also do a fantastic job of not mis-assigning expression to pseudogenes. That's all great and when we make figures for a paper and we include gene expression levels I always use TPM values from these kind of programs. RSEM seems to come out on top but if you want results faster then Kallisto and Salmon do a good job. However since we don't do differential expression on these normalized expression values it's not totally necessary to run them and we can get good differential expression results from genome alignment + read counting to annotated features with something like htseq-count or featureCounts.

There are a couple issues I've run into with the direct to transcriptome approach, however. First off the direct to transcriptome approach takes the transcripts out of context. I've found that sometimes alignments to the genome will reveal continuous alignments across a gnomic region that happens to house a transcript. The genome alignment coverage may not at all be specific to that gene however if you run a direct to transcriptome quantification you'll see expression for that gene. Also sometimes we have single-exon features that lie within an intron of another gene and sometimes that intron may be retained. Again the direct to transcriptome approach will count expression of that single exon feature when it's not actually appropriate. When I see one or two cases of something like that I just assume it's widespread so I've expanded my transcriptome references to include introns and genomic regions that contain single-exon transcripts as well as any transcript that totally contained within the intron of another. This way the probabilistic assignment routine can "see" both possibilities and then based on coverage it will make a good choice. This has worked great to cut down on those false positives.

Another issue I've come across are reads that map to a gene, maybe even exclusively to one gene, but when mapped to the genome can align to many sites. By genome alignment standards that gene's actual origin is ambiguous and one may not want to trust the assignment of it to any spot and certainly not to a particular gene. This problem is hard to overcome because for these programs to work well they technically need to "see" all possible alignments of a read to the transcriptome. So a read that maps to multiple gene loci is something you want to keep while a read that maps to some gene loci and then many additional genome sites is something you'd want to exclude.

Finally I have had issues with large bundles of genes share sequence - maybe they are 60% or more similar. Not alternative isoforms within a locus but many genes spread around the genome that are highly similar. These programs are designed to solve a gene locus. The goal is to report as few features in the locus as expressed as necessary to fully explain the coverage. This approach doesn't extend well to these other types of genes and what ends up happening is depending on the pattern of the coverage at these features the program will push all of the reads that are ambiguous between the lot of 'em all into 1 feature. Any variation in the coverage pattern may cause the program to make a different choice of which feature to assign the reads to. As a result I'll end up with differential expression results between conditions showing these features going from 0 reads to 100's of reads (or more) and they end up polluting the top hits. If I instead consider these features to be ambiguous and find some way to bundle them up I can sum the read counts assigned to them all into a single feature. When I do that I find that overall they are not expressed differently. It's like that every time in every experiment I've run into it.

There are a lot of options, yes. I've found that no one aligner really gets everything that's align-able. For running RSEM what I do is align with STAR first, because it's super fast, and then mop up as many of the unaligned reads as possible with bowtie2. I only chose to do that because overall it's faster than doing the entire alignment with bowtie or bowtie2 and back when I set that up RSEM only had a run option for bowtie. I think now it can use any of those three so I'd say use STAR since it's likely the fastest and I'd assume the settings they have worked out are good since they know what their program needs in order to make a good quantification.

That's all I have for now.
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote

align, bowtie2, rsem, smart-seq2, star

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

All times are GMT -8. The time now is 12:39 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO