SEQanswers > Bioinformatics

Jane M 07-21-2016 03:55 AM

Allowing a high number of mismatches when mapping
Dear all,

I have 53 bp sequences, of which between 23 and 30 bases are of interest (the motifs). For simplicity, I took only the first 17 bases. Each sample has between 5 and 23 million reads.
The reference is composed of 7450 distinct sequences; again for simplicity, I took only their first 17 bases.
My goal is to map the motifs to the reference.

If there were no sequencing errors, I would find only 7450 distinct motifs in my samples. Most likely there was a problem during sequencing, and 25% of the reads have poor quality.
When mapping with bowtie

bowtie --best --strata -v 2 -k 1 -m 1 --norc
the mapping rate is ~70-82%.
I used -v 3 on two samples, and it increased the mapping rate by only ~1.5%.

Since my reference is small (7450 distinct sequences), I know that fewer than 17 bases (sometimes 6 are sufficient) are enough to uniquely identify which of the 7450 references a sequence comes from. Thus, for this specific case, I need to allow a higher number of mismatches (bowtie is limited to 3).
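A minimal sketch of the idea described above, not a replacement for a real aligner: with a small reference, each read prefix can be compared against every reference sequence by Hamming distance (substitutions only, no indels) and assigned only when a single reference is the best match within a chosen mismatch budget. The function names, the `max_mismatches` default, and the toy sequences are hypothetical and for illustration only.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_read(read, references, max_mismatches=5):
    """Return the name of the unique best-matching reference, or None.

    `references` maps a name to a sequence of the same length as `read`.
    The read stays unassigned if the best distance exceeds
    `max_mismatches` or if two or more references tie for the best.
    """
    best_name, best_dist, tied = None, None, False
    for name, seq in references.items():
        d = hamming(read, seq)
        if best_dist is None or d < best_dist:
            best_name, best_dist, tied = name, d, False
        elif d == best_dist:
            tied = True
    if best_dist is None or best_dist > max_mismatches or tied:
        return None
    return best_name


# Toy reference (hypothetical names and sequences).
toy_refs = {"sh1": "ACGTACGT", "sh2": "TTGGCCAA", "sh3": "ACGTTTTT"}
print(assign_read("ACGAACGT", toy_refs))  # one mismatch away from sh1
```

Comparing every read against every reference is slow in pure Python, so in practice it would probably help to collapse identical read prefixes and count them before assigning.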

I intend to try bowtie2 in local mode. I do not know it, but RMAP seems to correspond to my question.

Could you please give me some suggestions or ideas for dealing with this particular case?
Thank you very much for your help.

Brian Bushnell 07-21-2016 05:09 AM

I suggest you try BBMap, which is quite tolerant of low identity; it typically allows mapping down to around 60-70% identity. For very high sensitivity, try this command:

Code:
bbmap.sh in=reads.fq out=mapped.sam vslow minid=0.6 maxindel=5 k=11
Using only the first 17 bp of sequences will hurt the ability to map with BBMap, though; you need to use the full sequences.

Jane M 07-25-2016 12:56 AM

Thank you Brian for your suggestion.
I am doing some tests with Bowtie on shorter sequences and if it doesn't work, I will try BBMap. The maximum length I can use is 23 bp. Would it be sufficient?

Brian Bushnell 07-25-2016 11:10 AM

23 is fine, but more bases will always increase specificity. If your sequences are 53 bp, why are you cutting them down to 23?

Jane M 07-26-2016 02:31 AM

Thank you for your answer.

I am working on an shRNA screen. The first 22-30 bases are common to all sequences. Between 23 and 31 bases correspond to the shRNA in each sequence.
Since the quality is poor toward the end of the reads (from the middle onward, in fact), I use the minimum number of bases (from the left) needed to discriminate the shRNAs.
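One way to choose that minimum number of bases could look like the sketch below; it is an illustration under assumptions, not the procedure actually used in this thread. It finds the shortest prefix length at which all reference sequences are distinct, and reports the smallest pairwise Hamming distance at a given prefix length. Distinct prefixes only guarantee a distance of at least 1; assigning reads unambiguously with up to m mismatches needs a minimum pairwise distance of at least 2m + 1. The function names and toy sequences are hypothetical.

```python
from itertools import combinations


def min_unique_prefix_length(sequences):
    """Smallest k such that all k-base prefixes are distinct, else None."""
    if not sequences:
        return None
    longest = max(len(s) for s in sequences)
    for k in range(1, longest + 1):
        if len({s[:k] for s in sequences}) == len(sequences):
            return k
    return None  # duplicates present: no prefix length separates them


def min_pairwise_distance(sequences, k):
    """Smallest Hamming distance between any two k-base prefixes.

    Assumes every sequence has at least k bases.
    """
    return min(
        sum(x != y for x, y in zip(a[:k], b[:k]))
        for a, b in combinations(sequences, 2)
    )


# Toy sequences (hypothetical).
toy_seqs = ["ACGTA", "ACGGA", "ACTTA"]
k = min_unique_prefix_length(toy_seqs)
print(k, min_pairwise_distance(toy_seqs, k))
```

For a few thousand reference sequences the all-pairs comparison runs into tens of millions of pairs, which is slow but workable in pure Python for a one-off check.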
