Old 11-08-2012, 04:38 PM   #6
sdriscoll

the '*' fields you speak of sound like how bowtie reports unaligned reads. if you align your reads with bowtie and output SAM format you get a row for every read in your original FASTQ file, and for reads that were not aligned it puts a '*' in the column where you'd normally see the chromosome name (or whatever reference you're working with).
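if you want to see how many reads that is, here is a minimal sketch, assuming an uncompressed, tab-separated SAM file. the function name `count_unaligned` is made up for illustration; it just counts records with '*' in the reference-name column (field 3).

```shell
# count_unaligned SAM_FILE  (hypothetical helper, not part of bowtie)
# prints the number of SAM records whose reference-name column (field 3)
# is '*'. header lines start with '@' and are skipped.
count_unaligned() {
    awk -F'\t' '!/^@/ && $3 == "*" { n++ } END { print n + 0 }' "$1"
}
```

usage would be something like `count_unaligned hits.sam`. strictly speaking the formal marker for an unmapped read is bit 0x4 of the FLAG column, so for paired-end data (where an unaligned mate can inherit its partner's reference name) checking the flag is the safer test.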

i don't know what your overall pipeline looks like, but if tophat is giving you trouble and you need the power of identifying alternative splicing (or isoform-level expression) you might give RSEM or eXpress a try. both of these use alignments to a transcriptome and use the EM algorithm to disambiguate alignments to multiple isoforms. it was discovered sometime last year that alignment to the transcriptome is much more sensitive than alignment to the genome (i think that's why tophat included it in newer releases) and results in more accurate expression estimates in simulated data. currently RSEM and eXpress appear to have the greatest accuracy, even for isoform-level expression estimates, when compared to other methods.

I saw an evaluation of a few pipelines that used BEERS (http://www.cbil.upenn.edu/BEERS/). tophat->cufflinks was the absolute worst for quantification and very poor for differential expression with cuffdiff. They used simulated data for which they knew the "true" expression and the "true" fold changes between samples. then they ran the data through several pipelines and correlated the expression estimates against the "true" values. RSEM and eXpress (using bowtie and BWA to make the alignments) performed the best, with true count correlations better than r=0.9, while tophat->cufflinks estimates correlated at about r=0.1. in other words, the expression estimates generated with cufflinks looked like random noise compared to the true values. so don't use cufflinks for quantification.

in fact, a simple count method, such as counting hits per isoform and dividing the contribution of each read that aligns to multiple targets by the square of the number of features it aligns to, gives a better estimate of the counts than cufflinks, correlating at about r=0.7, though with a much larger confidence interval than eXpress and RSEM.
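to make that simple count method concrete, here is a rough sketch and not the code used in that evaluation. it assumes you have already reduced your alignments to a two-column, tab-separated file of read id and target id, one line per alignment, with each read/target pair listed once. that input format, the name `weighted_counts` and the optional exponent argument are all my own assumptions.

```shell
# weighted_counts PAIRS_FILE [POWER]  (hypothetical helper)
# PAIRS_FILE: tab-separated read_id and target_id, one line per alignment.
# each alignment adds 1 / n^POWER to its target, where n is the number of
# targets the read aligns to. POWER defaults to 2 (the square described above).
weighted_counts() {
    awk -F'\t' -v p="${2:-2}" '
        NR == FNR { n[$1]++; next }        # pass 1: count alignments per read
        { count[$2] += 1 / (n[$1] ^ p) }   # pass 2: add the down-weighted hit
        END { for (t in count) printf "%s\t%.4f\n", t, count[t] }
    ' "$1" "$1" | sort
}
```

the file is passed to awk twice so the first pass can count targets per read before the second pass hands out the weights. running it with a second argument of 1 gives plain equal splitting (each read contributes 1/n to each of its n targets), if you want to compare the two weightings.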

if you need novel isoform discovery, i'd recommend going through with the tophat alignments, running cufflinks, and generating a GTF annotation for your samples using their recommended pipeline. then make a FASTA reference for your new transcriptome with cufflinks' gffread tool:

Code:
gffread -g <genome>.fa -w <transcriptome>.fa <annotation>.gtf
then build a bowtie index for your new transcriptome, align to it with bowtie or BWA, and quantify expression with eXpress. finally, you can use a DE tool in R like edgeR, DESeq or EBSeq to perform DE testing.
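the index/align/quantify steps could look roughly like the sketch below. it assumes single-end reads, bowtie (version 1) and eXpress on your PATH, and made-up file names. the `run` and `quantify` helpers are hypothetical wrappers, and the options (`-a` to report all alignments, `-S` for SAM output, `-o` for the eXpress output directory) are only a starting point, so check them against the manuals for your versions.

```shell
# run: execute a command, or only print it when DRY_RUN is set
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# quantify TRANSCRIPTOME_FA READS_FQ OUT_DIR  (hypothetical wrapper)
quantify() {
    fa=$1; reads=$2; out=$3
    run mkdir -p "$out" &&
    # build a bowtie index from the transcriptome FASTA
    run bowtie-build "$fa" "$out/transcriptome" &&
    # align reads, reporting all alignments (-a) in SAM format (-S)
    run bowtie -a -S "$out/transcriptome" "$reads" "$out/hits.sam" &&
    # estimate transcript abundances from the multi-mapped alignments
    run express -o "$out/express" "$fa" "$out/hits.sam"
}
```

setting `DRY_RUN=1` before calling `quantify transcriptome.fa reads.fq work` prints the four commands without running them, which is handy for checking the paths first. reporting all alignments matters here because eXpress needs to see every transcript a read could have come from in order to split it between isoforms.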