SEQanswers

Old 04-01-2012, 06:51 AM   #1
yuelics
Member
 
Location: Toronto, Canada

Join Date: Apr 2011
Posts: 13
minimal read length accepted by Bowtie

Hi all,

I wonder if anyone knows the minimum read length accepted by Bowtie. Basically, I have a set of short motif sequences (7-mers) and want to see where they map on the mouse reference genome. I tried Bowtie, but it does not seem to work because of the short read length (7 bp).

Any suggestions will be very much appreciated!

Thanks,
Yue
Old 04-02-2012, 02:26 AM   #2
NicoBxl
not just another member
 
Location: Belgium

Join Date: Aug 2010
Posts: 264

Same question for BWA.
Old 04-02-2012, 03:39 AM   #3
Rocketknight
Member
 
Location: Ireland

Join Date: Sep 2011
Posts: 86

Because a short sequence like 7 bases would map all over the place, it's very unlikely that any read aligner will handle it properly. The algorithms they use are mostly designed to handle sequences no shorter than the shortest reads that come from Illumina sequencers (32bp I think).

The good news is that since you're looking for a relatively small number of specific 7-base sequences without gaps or mismatches, a simple string search should be able to do it for you. A Python or Perl script could just loop over every line in the reference genome and print out any location where it finds one of the matching strings. If you have no idea how to code one, let me know and I'll write you one when I have a few spare minutes.
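A minimal sketch of the exact string search described above, in Python. It assumes the reference is a FASTA file and that each sequence fits in memory; the function names are made up for illustration. One caveat with looping "over every line": FASTA sequences are wrapped, so a motif that straddles a line break would be missed. The sketch joins the lines of each record before searching. It only looks at the forward strand; to cover the reverse strand you would also search for each motif's reverse complement.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs, joining wrapped sequence lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            # Upper-case so soft-masked (lower-case) bases still match.
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def find_exact(seq: str, motif: str) -> Iterator[int]:
    """Yield 0-based start positions of every exact match, overlaps included."""
    start = seq.find(motif)
    while start != -1:
        yield start
        start = seq.find(motif, start + 1)


def search_genome(
    lines: Iterable[str], motifs: Iterable[str]
) -> Iterator[Tuple[str, int, str]]:
    """Yield (sequence name, 0-based position, motif) for each forward-strand hit."""
    motif_list = [m.upper() for m in motifs]
    for name, seq in read_fasta(lines):
        for motif in motif_list:
            for pos in find_exact(seq, motif):
                yield name, pos, motif
```

To run it on a real file, pass an open file handle as `lines`, for example `search_genome(open(path), motifs)` where `path` points at your own reference FASTA.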
Old 04-04-2012, 04:30 AM   #4
yuelics
Member
 
Location: Toronto, Canada

Join Date: Apr 2011
Posts: 13

Quote:
Originally Posted by Rocketknight View Post
Because a short sequence like 7 bases would map all over the place, it's very unlikely that any read aligner will handle it properly. The algorithms they use are mostly designed to handle sequences no shorter than the shortest reads that come from Illumina sequencers (32bp I think).

The good news is that since you're looking for a relatively small number of specific 7-base sequences without gaps or mismatches, a simple string search should be able to do it for you. A Python or Perl script could just loop over every line in the reference genome and print out any location where it finds one of the matching strings. If you have no idea how to code one, let me know and I'll write you one when I have a few spare minutes.
Hi Rocketknight,

Thanks a lot for your reply. I actually managed to get Bowtie working on the short 7-mers with a few additional options. The tricky part of writing a script is that the match does not need to be exact (i.e. up to 2 mismatches anywhere in the 7-mer are allowed).
Old 04-04-2012, 05:57 AM   #5
Rocketknight
Member
 
Location: Ireland

Join Date: Sep 2011
Posts: 86

You're going to get a huge number of matches if you search a large genome with those parameters (by my back-of-the-envelope calculations, a 7 bp string with two allowed mismatches will hit by chance more than 0.1% of the time in a statistically average genome). In other words, for a 1 Gb genome, you should be seeing over one million matches for each 7-mer on average. Does Bowtie really report all of those matches?

Edit: If it doesn't, all is not lost. It's definitely possible to write a string searcher that tolerates mismatches in Python (though I give no guarantees about running time). I'm willing to help if you're stuck; it sounds like an interesting problem.
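A rough sketch of what such a mismatch-tolerant searcher could look like, assuming substitutions only (no gaps) and a sequence already loaded as a string. The function name is hypothetical, and this is a plain sliding-window Hamming comparison, not how Bowtie does it. In pure Python it will be slow on a whole mammalian genome, and any `N` in the sequence simply counts as a mismatch.

```python
from typing import Iterator, Tuple


def find_with_mismatches(
    seq: str, motif: str, max_mismatches: int = 2
) -> Iterator[Tuple[int, int]]:
    """Yield (0-based start, mismatch count) for every window of seq that
    differs from motif at no more than max_mismatches positions."""
    k = len(motif)
    for start in range(len(seq) - k + 1):
        mismatches = 0
        for a, b in zip(seq[start:start + k], motif):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many differences, give up on this window
        else:
            # The inner loop finished without break, so the window qualifies.
            yield start, mismatches
```

The early `break` keeps the inner loop short for most windows, since a random window usually exceeds two mismatches within the first few bases.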

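Extra edit: Whoops, mistake with my calculations. Assuming all four bases are equally likely, you should expect a random hit rate of about 1.3% per strand (211 of the 4^7 = 16,384 possible 7-mers lie within two mismatches of any given 7-mer). For the mouse genome (~3 Gb) you should expect to see roughly 39 million hits per 7-mer by chance.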
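The chance hit rate can be checked with a few lines of Python. This is only a sketch under a simplifying assumption: every base is independent and equally likely, which real genomes are not, so treat the result as an order-of-magnitude guide. The function name is made up for illustration. The count of matching 7-mers is the sum over m = 0, 1, 2 of C(7, m) * 3^m, that is 1 + 21 + 189 = 211 out of 4^7 = 16,384.

```python
from math import comb  # available in Python 3.8+


def chance_hit_rate(k: int = 7, max_mismatches: int = 2, alphabet: int = 4) -> float:
    """Probability that a random k-mer lies within max_mismatches
    substitutions of a fixed k-mer, assuming uniform independent bases."""
    matching = sum(
        comb(k, m) * (alphabet - 1) ** m for m in range(max_mismatches + 1)
    )
    return matching / alphabet ** k


rate = chance_hit_rate()          # 211 / 16384
expected_hits = rate * 3e9        # per strand, for a ~3 Gb genome
print(f"hit rate: {rate:.4%}, expected hits: {expected_hits:,.0f}")
```

Under this model the rate comes to about 1.29% per strand, or on the order of 39 million chance hits per 7-mer in 3 Gb. Searching both strands roughly doubles that for non-palindromic motifs.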

Last edited by Rocketknight; 04-05-2012 at 03:28 AM.
Old 03-22-2013, 02:15 AM   #6
hanshart
Member
 
Location: Germany

Join Date: Nov 2011
Posts: 27

Quote:
Originally Posted by Rocketknight View Post
... it's definitely possible to write a string-searcher with mismatching in Python (though I give no guarantees about running time). I'm willing to help if you're stuck, it sounds like an interesting problem.
It's possible to use fqgrep for approximate sequence search.
All times are GMT -8.