#1 | EGrassi | 11-02-2012, 08:07 AM
Tophat 2.0.0 number of reads

Apart from the problems when using more than one core (and the random failures there), I've run into some other issues:

Code:
$ wc -l ../reads/SRR306839.fastq
75400120 ../reads/SRR306839.fastq
Thus I have 18850030 reads (75400120 lines / 4 lines per FASTQ record). Tophat reports:
Code:
18645993 reads; of these:
  18645993 (100.00%) were unpaired; of these:
    7576936 (40.64%) aligned 0 times
    5480546 (29.39%) aligned exactly 1 time
    5588511 (29.97%) aligned >1 times
59.36% overall alignment rate
OK, maybe some of them were rejected a priori for some quality issue.
Then:
Code:
$ samtools flagstat accepted_hits.bam
41956190 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
41956190 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
I have a lot of lines in the .sam file without read IDs (in another dataset this caused problems with htseq-count; here it doesn't, but since I never noticed anything like this in the past, I would like to understand it). Is this OK? Are the lines without IDs reported for reads that map to more than one position on the transcriptome/genome?
The SAM format specification says that a missing read name should be marked with a '*'...
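For reference, here is a quick way to count such records (a sketch only; it assumes samtools is on the path and checks the QNAME column for empty or '*' values):
Code:
# count alignment records whose read name (QNAME, field 1) is empty or '*'
samtools view accepted_hits.bam | awk -F'\t' '$1 == "" || $1 == "*"' | wc -l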
#2 | EGrassi | 11-07-2012, 08:39 AM

With tophat 2.0.4 the logs are similar but:
Code:
$ samtools flagstat accepted_hits.bam 
10819957 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
10819957 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
I'm beginning to think that maybe I should try STAR. Or that I definitely need three months to perform "unit tests" even on these well-known, widely used tools presented in the literature.
#3 | dvanic | 11-07-2012, 06:51 PM

Quote:
OK, maybe some of them were rejected a priori for some quality issue.
How many reads is tophat telling you it is filtering in prep_reads.info?
Quote:
18645993 reads; of these:
18645993 (100.00%) were unpaired; of these:
7576936 (40.64%) aligned 0 times
5480546 (29.39%) aligned exactly 1 time
5588511 (29.97%) aligned >1 times
59.36% overall alignment rate
How is the overall quality of your reads, especially at the ends? Is your sequencing machine calling bases irrespective of quality, or does it start calling an N at low quality?

Tophat, unlike BWA, does not clip reads to remove low-quality ends. If your ends are <= Q20 and your sequencer force-called the bases, you may have nucleotides at the end that are making your reads unalignable, because those ends prevent Tophat from positioning the read where it belongs: you're getting too many mismatches.
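If the ends do look bad, one way to trim them before re-running tophat (a sketch; seqtk is my assumption here, any quality trimmer would do):
Code:
# trim low-quality read ends (Phred-based, default error threshold), then realign
seqtk trimfq SRR306839.fastq > SRR306839.trimmed.fastq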

Quote:
I'm beginning to think that maybe I should try STAR.
If you think that trying the new tool on the block, one that has not been used much, is a "safer option", especially one that will save you from needing to do
Quote:
Or that I definitely need three months to perform "unit tests" even on these well-known, widely used tools presented in the literature.
then I think you're wrong.

Everyone (especially the biologists around me, and I used to be one) thinks that NGS data analysis is easy: a technique, a service that can be provided, rather than something that involves a boatload of time, benchmarking, and intellectual effort no less complicated than designing some "pretty" wet-lab experiments. So yes, if you're going to do an analysis you need to know what your tools are doing. The sad state of the field is format incompatibility, weird mappings, and different software giving different outputs; you need to look at the data and the biology of your system to figure out what makes sense and what is probably an artefact.
#4 | EGrassi | 11-07-2012, 11:49 PM

Quote:
Originally Posted by dvanic
How many reads is tophat telling you it is filtering in prep_reads.info?
I had to delete the tophat 2.0.0 results to avoid filling our HD while doing trial and error. Tophat 2.0.4 reports:
Code:
prep_reads v2.0.4 (3480M)
---------------------------
204037 out of 18850030 reads have been filtered out
So in this case the numbers add up: 18850030 - 204037 = 18645993, which matches the bowtie2 report.

Quote:
Originally Posted by dvanic
How is the overall quality of your reads, especially at the ends? Is your sequencing machine calling bases irrespective of quality, or does it start calling an N at low quality?

Tophat, unlike BWA, does not clip reads to remove low-quality ends. If your ends are <= Q20 and your sequencer force-called the bases, you may have nucleotides at the end that are making your reads unalignable, because those ends prevent Tophat from positioning the read where it belongs: you're getting too many mismatches.
OK, I see, thanks. These are public data and the original article aligned them without any trimming, but I will check the quality of the ends anyway. What is puzzling me, though, is the resulting SAM with too many reads, many of them without an ID, not the low percentage of alignment (at least for now).


Quote:
Originally Posted by dvanic

then I think you're wrong.
In trying to understand whether the software I'm using is bug-free? In being unhappy about trying a newer, less-used software, yes... I simply want the unit tests, that's it.
Yup, I only now understood that your two sentences were connected... I will definitely test STAR if I choose to use it, maybe even more than tophat.
Quote:
Originally Posted by dvanic

Everyone (especially the biologists around me, and I used to be one) thinks that NGS data analysis is easy: a technique, a service that can be provided, rather than something that involves a boatload of time, benchmarking, and intellectual effort no less complicated than designing some "pretty" wet-lab experiments. So yes, if you're going to do an analysis you need to know what your tools are doing. The sad state of the field is format incompatibility, weird mappings, and different software giving different outputs; you need to look at the data and the biology of your system to figure out what makes sense and what is probably an artefact.
This is true, in a way. But I don't agree completely: these are high-throughput techniques, so I will never be able to check whether all the results make sense... and the overall idea is to get sensible data. So when two different versions of a software give me different outputs (one of them apparently outside the standard format it should follow), I'm worried about bugs, not biological soundness. That is an issue I hope to face later.

Thank you, I will let you know about the quality of the ends... I can also add that the counts from running htseq-count on the accepted_hits.bam produced by the two tophat versions had a high and significant Pearson correlation... but I would still like to understand what's the matter with the empty IDs.

Last edited by EGrassi; 11-08-2012 at 05:50 AM.
#5 | dvanic | 11-08-2012, 01:32 PM


Quote:
What is puzzling me, though, is the resulting SAM with too many reads, many of them without an ID, not the low percentage of alignment (at least for now).
So, I have seen tophat report more reads in accepted_hits.bam than in the original fastq, but that was because it reported more than one alignment for some reads. I parsed this out with an awk/uniq pipe on the SAM file (something like the sketch below). I have not seen your situation with no read IDs at all.
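Roughly along these lines (a sketch, not my exact command; it assumes samtools is available and counts how many alignment records each read ID contributes):
Code:
# read IDs with a count > 1 are the multi-mapped reads inflating flagstat's total
samtools view accepted_hits.bam | cut -f1 | sort | uniq -c | sort -rn | head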

Quote:
This is true, in a way. But I don't agree completely: these are high-throughput techniques, so I will never be able to check whether all the results make sense... and the overall idea is to get sensible data. So when two different versions of a software give me different outputs (one of them apparently outside the standard format it should follow), I'm worried about bugs, not biological soundness. That is an issue I hope to face later.
Before I started doing RNA-Seq data analysis, I thought that if a software tool was so widely used, it must be reliable, reproducible and sound. After trying most of the available software, I have come to the conclusion that the field is a "Wild West" at the moment. There are rewards for publishing a tool that works (for the authors, to some extent; in many cases only with their annotations, made with "a custom perl script" (it's not funny how many times that is written in a methods section with no details on the script or on what algorithm it actually implements)). The paper will get cited, mostly in reviews of available methods. There is no incentive for most groups to maintain the software, fix bugs, and write decent documentation, since this does not get you more papers => more grants => promotions... Basically, I think it's shoddy science, but there is nothing I can do about it, except spend boatloads of time benchmarking, reading this forum and mailing lists about the bugs people encounter, and running dummy datasets to get an idea of how a particular tool handles a particular scenario.

And of course there's the brilliant case of cufflinks, which has been updated with a plethora of new methods, yet apart from the "How Cufflinks Works" page there is no peer-reviewed update describing what has been incorporated and whether, taken together, it is statistically sound.

/end rant/

Coming back to your problem (and I know this is not exactly a solution): have you tried using 2.0.5/6 on your reads? I've had much better results (I am happier with how the mapping "looks") with the .5 version, since it does seem to improve mapping accuracy by using both the genome and the transcriptome as a reference. Perhaps there was a bug in 2.0.0 that has been fixed in the later versions?...
#6 | sdriscoll | 11-08-2012, 04:38 PM


The '*' fields you speak of sound like how bowtie reports unaligned reads. If you align your reads and output SAM format, you get a row for every read in your original FASTQ file, and for reads that were not aligned bowtie puts a '*' in the column where you'd normally see the chromosome name (or whatever reference you're working with).

I don't know what your overall pipeline looks like, but if tophat is giving you trouble and you need the power of identifying alternative splicing (or isoform-level expression), you might give RSEM or eXpress a try. Both of them align reads to a transcriptome and use the EM algorithm to disambiguate alignments to multiple isoforms. It was discovered sometime last year that alignment to the transcriptome is much more sensitive than alignment to the genome (I think that's why tophat included it in newer releases) and results in more accurate expression estimates on simulated data. Currently RSEM and eXpress appear to have the greatest accuracy, even for isoform-level expression estimates, when compared to other methods. I saw an evaluation of a few pipelines that used the BEERS simulator (http://www.cbil.upenn.edu/BEERS/): tophat->cufflinks was the absolute worst for quantification and very poor for differential expression with cuffdiff. They used simulated data for which they knew the "true" expression values and the "true" fold changes between samples, ran the data through several pipelines, and correlated the expression estimates back against the true values. RSEM and eXpress (using bowtie and BWA for the alignments) performed best, with true-count correlations better than r=0.9, while tophat->cufflinks estimates correlated at about r=0.1. In other words, the expression estimates generated with cufflinks looked like random noise compared to the true values. So don't use cufflinks.

In fact, even a simple counting method, counting hits per isoform and dividing the contribution of reads that align to multiple targets by the square of the number of features they align to, provides a better estimate than cufflinks, correlating at about r=0.7, though with a much larger confidence interval than eXpress and RSEM (see the sketch below).
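That weighting is easy to prototype (a sketch under two assumptions: the aligner wrote the standard NH tag, i.e. the number of reported alignments per read, and reads were aligned to a transcriptome so field 3 is the isoform name):
Code:
samtools view hits.bam | awk '
  {
    nh = 1                                  # default: uniquely aligned
    for (i = 12; i <= NF; i++)              # optional tags start at field 12
      if ($i ~ /^NH:i:/) { split($i, a, ":"); nh = a[3] }
    counts[$3] += 1 / (nh * nh)             # each hit contributes 1/NH^2
  }
  END { for (t in counts) print t, counts[t] }'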

If you need novel isoform discovery, I'd recommend going through with the tophat alignments, running cufflinks, and generating a GTF annotation for your samples using their recommended pipeline. Then make a FASTA reference for your new transcriptome with cufflinks' gffread tool:

Code:
gffread -g <genome>.fa -w <transcriptome>.fa <annotation>.gtf
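(Here -g points gffread at the genome FASTA and -w writes the spliced transcript sequence for each record in the GTF.)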
Then build a bowtie index for your new transcriptome, align to it with bowtie or BWA, and quantify expression with eXpress (sketched below). Finally, you can use a DE tool in R such as edgeR, DESeq or EBSeq to perform the DE testing.
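Something like this (a sketch only; the file names are placeholders, and you should check the eXpress docs for their recommended bowtie parameters):
Code:
bowtie-build transcriptome.fa transcriptome      # index the new transcriptome
# -a reports all alignments per read (eXpress needs the multi-mappings), -S emits SAM
bowtie -aS transcriptome reads.fastq > hits.sam
express transcriptome.fa hits.sam                # writes results.xprs with estimated counts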
#7 | EGrassi | 11-08-2012, 10:39 PM


First of all, thank you for your help. I would like to add all possible support to your rant; I'm worried that this is shoddy science too.

I looked at those read qualities with fastqc: for some samples they are terrible, but for the one I showed here I don't see a dramatic drop in quality at the ends of the reads.

Quote:
Originally Posted by dvanic

Coming back to your problem (and I know this is not exactly a solution): have you tried using 2.0.5/6 on your reads? I've had much better results (I am happier with how the mapping "looks") with the .5 version, since it does seem to improve mapping accuracy by using both the genome and the transcriptome as a reference. Perhaps there was a bug in 2.0.0 that has been fixed in the later versions?...
The release notes did not mention anything like this, but I'm definitely willing to try. I'm worried because I used tophat 2.0.0 in the past and did not notice any inconsistency in read numbers, but bug bites bug...
#8 | EGrassi | 11-08-2012, 10:45 PM


Quote:
Originally Posted by sdriscoll
snip
I saw BEERS too but did not read the results (bad bad Elena!).
Thanks for the interesting post. Right now I'm not directly interested in cufflinks' alternative-transcript capabilities, but who knows what the future brings (actually I would really like to stick to my original research interests in transcription factors and leave the deep-sequencing field alone until it becomes more mature... another option would be to develop some new software [not to add noise to the scene: just for internal use, at least for the first year], but here "pure" computer-science projects are not the first option).
If I do perform some tests with these other tools, I will definitely report the results.

PS: I read about '*' in the read-name field in the SAM format specification.
#9 | sdriscoll | 11-08-2012, 11:34 PM


I started learning to process RNA-Seq data in 2009. Early versions of Tophat would actually produce a file of totally naive expression values, and we would just compute fold changes on them and pick genes with more than a two-fold change as possibly interesting. One of the postdocs I was working with at the time would look through the list, sorted by fold change, and compare the fold changes and expression levels to what he could see in the bedgraph histograms we loaded into the genome browser. He would cross out genes that looked suspicious (ones with strange coverage that didn't make sense given the expression data), and over the course of several months, along with whatever else he was doing, those gene lists proved very useful for his project. Some of that data went into this: http://www.ncbi.nlm.nih.gov/pubmed/21357675

I still think this basic approach is viable for gene-level analysis, even with all the work that has been done in the field. Things are better now that running more samples is cheaper, so we can have a few replicates to cut back on false positives. I think that's what will improve things: the cheaper it is per million reads per sample, the more replicates people can run; then there is less need for modeling and estimation in DE testing, and we get better mean expression estimates per condition.

I agree it is a real Wild West kind of thing, although I think the tools look better now than they did in the past. Tophat has been working for me for years. I like using bowtie to align things quickly, and bowtie2 is great for gapped and local alignments; plus it's fast for long paired reads. Transcriptome assembly is still very mysterious. You'd think someone would just release a tool that gives you all the possible splice variants based on the detected exons and junctions, but it seems people can't help trying to make their software make decisions for the researcher, bypassing that basic level of information. I guess that's the difference between publication-type tools and homemade tools.

I'll be digging into the eXpress data tomorrow. So far I like it, but I need to spend a lot of time looking at the results and the raw data to see how much of it makes sense.