SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
can small rnaseq data be analyzed like rnaseq data? PFS Bioinformatics 5 05-02-2017 08:16 AM
TopHat Error: Could not find Bowtie index files /bowtie-0.12.5/indexes/. rebrendi Bioinformatics 11 06-22-2016 09:55 AM
[NGS - analysis of gene expression data] Machine Learning + RNAseq data Chuckytah Bioinformatics 7 03-05-2012 03:16 AM
ABI-SOLID data with Bowtie-0.12.7 and TopHat-1.1.2 repinementer Bioinformatics 17 04-19-2011 08:10 AM
Perfect match disagreement between bowtie and BLAT on human genome danielsbrewer Bioinformatics 3 03-10-2009 09:10 PM

Reply
 
Thread Tools
Old 04-23-2010, 02:34 AM   #1
oliviera
Member
 
Location: germany

Join Date: Apr 2010
Posts: 31
Smile Blat vs bowtie/tophat on Rnaseq data

Dear all,
This is my first thread here, so hello everyone!

I am new in the field, so I hope my questions are not too trivial and could interest some of you.

We have RNAseq data 36bp single end reads. I am interested to get quantitative expression data comparable between different exeprimental conditions and also on the discovery of new transcripts/exons. I have so far tried the bowtie/tophat/cufflinks pipeline for the analysis, using the default settings of those software with the option -g 1 -G /path/to/gff/file as a starting point.
I am however wondering if I can improve the mapping by using blat instead/or in complement of bowtie. I have access to a computing grid and will not have to run the pipeline everyday (as long as we believe the mapping is as comprehensive as possible). So the processing time of blat is theoriticaly not too problematic for me.

Does anyone knows how blat perform compared to bowtie? I know for instance that Erange use blat for the unmappable reads to increase the mapping coverage. Tophat is efficient for abundant transcripts but its ability to map low abundance exons-exons reads drops significantly, if I am not mistaken. This should not be a problem for blat however.

Is it possible to use the output of blat to generate exon-exon model using tophat or else? The purpose here is also to be able to use cufflink for the computation of the rpkm.

Last questions more related to tophat itself: tophat cover 80% of Erange modeled exon. Is it solely due to the coverage limit with low rpkm transcript of tophat algorithm? Is it possible to increase the coverage of known exons with tophat by providing the gff file to tophat? Will this increase the coverage of exon-exon mapped compared to Erange?

Thank if you read this message entirely!

Have a nice day, all of you



Cheers

Olivier
oliviera is offline   Reply With Quote
Old 04-23-2010, 03:19 AM   #2
joa_ds
Member
 
Location: belgium

Join Date: Dec 2008
Posts: 52
Default

we did our first RNA seq run last week and the first results came back today. I have to admint that my collegeau did all the work, I just had some discussions with him and looked at some graphs he made, so I have no details

ANyway, he used bowtie/cufflinks/tophat and clearly showed differential expression and differential splicing patterns. So the pipeline clearly is working. Have to mind the the rRNA though. The lab people did not use any rRNA cleanup kit nor did they do Poly-A enrichment, so quite a waste of $ to find out that rRNA exists...

On your BLAT question. I am 100% confident that BLAT is NOT the way to go to map illumina data. First of all, your sequences are short, and BLAT does not handle that very well. Secondly, BLAT is waaaay (i guess we are speaking in terms of weeks to months on a 16 core server) to slow to map millions of reads genome wide.3rd, you will make your further downstream analysis complicated using tophat/cufflinks.

My guess is that you are thinking about BLAT to discover splice variants because BLAT is designed to do so. But with a 36bp sequence, there will be no advantage at all. Just too short to be usable in BLAT.

You will have to finetune some settings in the bowtie/cufflinks/tophat to improve your results. Or use something else and start from scratch. There are no other extensive RNAseq pipelines out there as far as i know.
joa_ds is offline   Reply With Quote
Old 04-23-2010, 03:41 AM   #3
oliviera
Member
 
Location: germany

Join Date: Apr 2010
Posts: 31
Default

Thanx for your fast reply joa,
I fear like you that using blat will make the whole downstream analysis difficult.
I have however tested already blat on a subset of my data and what I get is similar to eland and tophat results with the exception that some unmappable reads are mapped with blat. The converse situation (ie. seq reads mapped by bowtie or eland but not by blat) I didn t encounter yet.
The speed issue I am not sure now. I have make a small run using 10,000 reads and it was done in less than a minute my machine. As I have access to 100 nodes with similar power than my machine on our grid computing I could "theoritically" map 5 Millions reads in no time.
Do you use tophat using the reference exon annotations or are you modelling the exon-exon junctions only ab-initio? I thought someone already tested this, so if I can save some time....
oliviera is offline   Reply With Quote
Old 04-23-2010, 03:54 AM   #4
joa_ds
Member
 
Location: belgium

Join Date: Dec 2008
Posts: 52
Default

Can you elaborate the experiment design a bit?

Are you mapping genome-wide or just on coding regions? Human genome or a smaller species?

It would be strange, but correct me if I see it wrong, to see BLAT correctly call splice junctions using only a 36bp read. I never considered it.

About the mapping speed. We have a 16 core here and 100k sequences (454 though) genome-wide (human) takes around 1 day. But if speed is not an issue, you might consider it. Downstream will remain difficult though.

Our experiment was a Rice RNAseq, which is rather poorly annotated. So the first idea was to use cufflink/tophat (I always forget which one does what...) to do de-novo identification of splicing junctions. The rRNA bias made the coverage in non-rRNA regions lower than expected and apparently coverage in the different samples is not enough to discover new splicing junctions. Some were called in rRNA regions, but that is it.

The new strategy is to use an annotated set and compare coverages of every exon in a gene and see if certain exons are (partially )skipped in some samples. So no new splicing junctions for the moment, but when coverage is high enough, it should be possible.
joa_ds is offline   Reply With Quote
Old 04-23-2010, 04:30 AM   #5
oliviera
Member
 
Location: germany

Join Date: Apr 2010
Posts: 31
Default

I do the mapping against the zebrafish genome, and then use the annotated USCS genome for the localisation of the exon.
I agree with you that blat should not be able to map the exon-exon reads. The additional reads I fetch are perhaps reads with gaps compared to the reference genome sequence. For what I saw Bowtie does not align reads with gaps:
"Bowtie does not yet report gapped alignments; this is future work." From the bowtie manual.
As coverage is an issue (I totaly agree with you), perhaps blat could help. I simply do not know if this is worse the effort to use blat or not...
A last question for you Joa (sorry...). You are thus using tophat with a reference file of the known exon (right?). Did it really improved the results compared to the de-novo identification of junctions? What do you do with the reads-isalnds which are not mapped into the known exons?
oliviera is offline   Reply With Quote
Old 04-23-2010, 05:35 AM   #6
poisson200
Member
 
Location: united kingdom

Join Date: Feb 2010
Posts: 63
Default There are several considerations

Hello Oliviera/Joa,
I think BLAT is another good option in complement to TopHat. TopHat, if I understand what I read, is good at mapping reads that map continuously to the genome using Bowtie and is also able to map reads that span exon/exon boundaries. I think it is optimised for Illumina reads of 75 to 100bp.

1) A sequence read (unless matching repeat sequence) should map to 1 position in the genome
2) If the genome is not fully complete, one problem could be your piece of genome does exist yet (I don't know if Zebrafish is fully complete with no gaps)

So you could build a pipeline where you map all with TopHat, which will give you a list of genome positions for your reads.

Do the same with BLAT (as it sounds like you have the computing power). It will be interesting to compare the results of the above and look by eye, at a few, to see which ones look correct.

I would also, since you have good computing power, do a BLAST vs. the Zebrafish Refseq transcriptome to give another, rough, measure of gene expression. Just convert your FASTQ to FASTA and blast like:
blastall -a 8 -p blastn -i your_reads_fasta -d Refseq_Zebrafish_database -e 10-e4 -o output_blast

You can play with blastall -m options to get an output format you like, then parse it with a Perl prog to get rpkm values. This is a rough and ready QC test, such that you should be getting similar genes differentially expressed.

BFAST could be a quicker option, though I have in the last 3 hours BLAST-ed over 220,000 454 reads to the human transcriptome.

I would like to use BLAT myself but I am suspicious the installation I am using is flawed as you say you can do it quickly and I see the UCSC web one is also quick through a browser. If you installed the BLAT, I would be interested how you did it.

I hope this is useful.

Kind regards,

J.
poisson200 is offline   Reply With Quote
Old 04-23-2010, 06:21 AM   #7
oliviera
Member
 
Location: germany

Join Date: Apr 2010
Posts: 31
Default

Hi Poisson,

Yes you are right. I have only 36bp and single end reads and tophat is optimised for paired end. But still the authors claims that it could be used. To which extend I do not know...
As you used BFAST can you tell how it perfoms in comparison to tophat or blat with mRNA seq data. I checked the package description but did not found much information on the website. And I have not read the papers cited yet... For me it seems that it is more used for genome assembly than for transcriptomic data?
Concerning Blat I have installed the 64bytes unix version (ubuntu type) running on a 4x xeon 2,6GHz workstation. I didn t make anything particular for the installation. Perhaps it is simply the processor power?
I will test it as you say and try to analyse more in details the results. I would have expect this was done already by someone ...
oliviera is offline   Reply With Quote
Old 04-23-2010, 06:30 AM   #8
oliviera
Member
 
Location: germany

Join Date: Apr 2010
Posts: 31
Default

Sorry,
The following paper:
Nils Homer1,2, Barry Merriman2*, Stanley F. Nelson2, Plos one 2009
compare bfast and bowtie. Forget my last question...
I should have read the paper befor sending the post...

Last edited by nilshomer; 04-23-2010 at 09:27 AM.
oliviera is offline   Reply With Quote
Old 04-23-2010, 07:07 AM   #9
poisson200
Member
 
Location: united kingdom

Join Date: Feb 2010
Posts: 63
Default

Hi Oliviera.
Sorry, I did not explain well or you misunderstand me (or I don't fully understand TopHat). I meant it is able to map a read that spans an an intron and did not refer to paired end, though this information may help TopHat too. I have some single read 33bp Illumina data from the SRA and mapped it using TopHat. I have to check the results. I just meant to say TopHat is optimised for reads of 75bp or so it says.

Please don't be sorry, I had not seen it, even though I have looked at a few NGS papers recently. Thanks for finding it and posting. I have not used BFAST yet but plan to in the next week.

Kind regards,

J.
poisson200 is offline   Reply With Quote
Old 04-23-2010, 09:51 AM   #10
thinkRNA
Member
 
Location: Carlsbad,CA

Join Date: Jan 2010
Posts: 94
Default

Quote:
Originally Posted by oliviera View Post
Dear all,

We have RNAseq data 36bp single end reads. I am interested to get quantitative expression data comparable between different exeprimental conditions and also on the discovery of new transcripts/exons. I have so far tried the bowtie/tophat/cufflinks pipeline for the analysis, using the default settings of those software with the option -g 1 -G /path/to/gff/file as a starting point.
I am however wondering if I can improve the mapping by using blat instead/or in complement of bowtie. I have access to a computing grid and will not have to run the pipeline everyday (as long as we believe the mapping is as comprehensive as possible). So the processing time of blat is theoriticaly not too problematic for me.

Does anyone knows how blat perform compared to bowtie? I know for instance that Erange use blat for the unmappable reads to increase the mapping coverage. Tophat is efficient for abundant transcripts but its ability to map low abundance exons-exons reads drops significantly, if I am not mistaken. This should not be a problem for blat however.

Is it possible to use the output of blat to generate exon-exon model using tophat or else? The purpose here is also to be able to use cufflink for the computation of the rpkm.

Last questions more related to tophat itself: tophat cover 80% of Erange modeled exon. Is it solely due to the coverage limit with low rpkm transcript of tophat algorithm? Is it possible to increase the coverage of known exons with tophat by providing the gff file to tophat? Will this increase the coverage of exon-exon mapped compared to Erange?

Thank if you read this message entirely!

Have a nice day, all of you



Cheers

Olivier
Tophat will only align your reads to the junctions you provide. As you mention, you can definitely increase coverage by providing your own more comprehensive GFF file or if you can read this post http://seqanswers.com/forums/showthr...?t=3967&page=2 where Cole mentions how "You can get even fancier with this kind of thing, including splice junctions from other evidence (are there any ESTs for this organism? You could make .juncs files from their alignments to the genome...). TopHat will check the reads against all the junctions in the .juncs file you provide."

Also, can you please tell me how did you notice that tophat doesn't align low abundance exon-exon reads? its an interesting point. Perhaps try changing --min-anchor-length to 3 (default is 8). It may be that this parameter will take into account the low-abundance reads. Too bad they don't allow to make it 1 (and I wonder why?)

I wish I could use bowtie parameters in tophat and I have asked Cole for this. Also, remember that the multireads you are throwing out by setting -g=1 may have a lot of interesting expression of genes that are paralogs. May by this is where you can use blat-just a suggestion.
thinkRNA is offline   Reply With Quote
Old 04-23-2010, 10:07 AM   #11
thinkRNA
Member
 
Location: Carlsbad,CA

Join Date: Jan 2010
Posts: 94
Default

Now that I read the tophat parameters in detail, there is a lot of tailoring one can do to align low abundance reads. Especially, look at this one:

-F/--min-isoform-fraction <0.0-1.0> TopHat filters out junctions supported by too few alignments. Suppose a junction spanning two exons, is supported by S reads. Let the average depth of coverage of exon A be D, and assume that it is higher than B. If S / D is less than the minimum isoform fraction, the junction is not reported. A value of zero disables the filter. The default is 0.15.
thinkRNA is offline   Reply With Quote
Reply

Tags
blat, computing grid, erange, mrna-seq, tophat

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:06 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO