Building bfast/btestindexes index for 15% divergence

feederbing

09-07-2011, 11:39 AM

I am trying to determine whether BFAST is appropriate for mapping 101 bp unpaired Illumina reads to a reference with an expected divergence of 15% (related background on my problem is at http://seqanswers.com/forums/showthread.php?t=13871). Primarily, I'm hoping Nils can advise me on whether BFAST is suited to this type of job. Assuming that it is, below I describe what I have tried and where I am stuck.

I am at the stage of building an index for BFAST using btestindexes, but I am stuck because I don't understand the output of btestindexes in "evaluate" mode. I understand the concept of spaced seeds; what I don't understand is how to interpret the output. I have looked at the other four threads here that mention btestindexes, and I have read the supplementary info from the BFAST paper.

Following the advice in section 6.1 of the bfast-book, I get k+2=21 for a genome of size 2.4G.

I then ran an index search:
btestindexes -A 0 -a 0 -S 10000 -s 10 -r 101 -M 20 -n 10 -l 21 -w 31
I used -M 20 because I think my data will contain some unique matches out to 20% divergence. I used -n 10 to get 10 masks, expecting that the evaluation run of btestindexes will indicate how many I need.

The resulting masks are
111111111111111111111
11111011001101101111111111
1001111111101110011101001010111
11001111111011101101101101011
110011111001111110111010101011
10011110101111101111101101101
11111101011111101101011111
11111010101101110111111111
1111111101111111111111
11110001011111011111011110011
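A quick sanity check on a mask list like the one above is to tally each mask's key size (number of 1s) and key width (total length) and compare them with the values requested on the command line. The sketch below assumes that -l is the key size and -w the maximum key width, which is how I read the options; the function name mask_stats is just an illustrative helper, not part of BFAST.

```python
# Masks copied from the btestindexes search output above.
MASKS = [
    "111111111111111111111",
    "11111011001101101111111111",
    "1001111111101110011101001010111",
    "11001111111011101101101101011",
    "110011111001111110111010101011",
    "10011110101111101111101101101",
    "11111101011111101101011111",
    "11111010101101110111111111",
    "1111111101111111111111",
    "11110001011111011111011110011",
]

KEY_SIZE = 21   # assumed meaning of -l
MAX_WIDTH = 31  # assumed meaning of -w


def mask_stats(mask):
    """Return (key size, key width) for a mask written as a string of 0s and 1s."""
    return mask.count("1"), len(mask)


for index, mask in enumerate(MASKS, start=1):
    size, width = mask_stats(mask)
    # Flag anything that does not match the requested key size or exceeds the width.
    note = "" if (size == KEY_SIZE and width <= MAX_WIDTH) else "  <-- check"
    print(f"mask {index:2d}: key size {size:2d}, width {width:2d}{note}")
```

Any mask flagged here would be worth re-checking for a copy/paste error before running the evaluation step.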

I then ran the evaluation:
btestindexes -A 1 -a 1 -S 10000 -r 101 -M 10 -f filename

Looking at the output of the evaluation is where I am stuck. Clearly it is a table with one row per mask and one column per mismatch count from 0 to 20 (plus a column for a deletion, but let's ignore that). There is also a column labeled "CE" which is always zero (perhaps "cumulative error"?). The values are undoubtedly probabilities, but probabilities of what? My initial assumption was that row m gave the probability, for the combination of masks 1 through m, that a homologous read would be discovered using that set of seeds. This assumption is apparently wrong, because if I shuffle the list of masks, I don't get the same results in the final row.
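One way to probe the "cumulative set" hypothesis independently of btestindexes is a small simulation. The sketch below is not a reimplementation of btestindexes and makes its own assumptions: mismatches fall uniformly at random along the read, indels are ignored, and a mask "finds" a read if at some offset every 1 position lands on a matching base. The names mask_hits and set_sensitivity are hypothetical helpers.

```python
import random


def mask_hits(mask, mismatches, read_len):
    """True if the mask fits at some offset with every '1' on a matching base."""
    ones = [i for i, c in enumerate(mask) if c == "1"]
    for offset in range(read_len - len(mask) + 1):
        if all((offset + i) not in mismatches for i in ones):
            return True
    return False


def set_sensitivity(masks, read_len, n_mismatches, trials=300, seed=0):
    """Estimate the fraction of simulated reads found by at least one mask."""
    rng = random.Random(seed)
    found = 0
    for _ in range(trials):
        # Assumption: mismatch positions are uniform and distinct within the read.
        mismatches = set(rng.sample(range(read_len), n_mismatches))
        if any(mask_hits(m, mismatches, read_len) for m in masks):
            found += 1
    return found / trials


if __name__ == "__main__":
    # First three masks from the search output, used as a small example set.
    masks = [
        "111111111111111111111",
        "11111011001101101111111111",
        "1001111111101110011101001010111",
    ]
    for k in range(1, len(masks) + 1):
        row = [set_sensitivity(masks[:k], 101, mm) for mm in (0, 5, 10, 15, 20)]
        print(f"masks 1..{k}:", " ".join(f"{p:.3f}" for p in row))
```

In this sketch the estimate for a complete set does not depend on the order of the masks when the seed is fixed, and only the intermediate rows change when the list is shuffled. Whether the rows printed by btestindexes mean the same thing is exactly the open question, so the comparison is only a cross-check.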

Looking back at section 6.1, it advises that I "select the minimum number of masks sufficient to tolerate" my desired accuracy. But it gives no advice on how to interpret the output in order to make this decision.

I have also hunted through the supplement and the distribution to see if there are any masks recommended for this type of divergence. The supplement states that the distribution includes mask sets for reads up to L=100. I have distribution 0.6.5a from sourceforge and I haven't been able to find them. (http://sourceforge.net/projects/bfas...-0.6.5a.tar.gz).

At this point, I'm just hoping to get some reassurance that BFAST will be useful for this problem.
Tags: bfast, btestindexes, homer, index
feederbing

09-07-2011, 03:02 PM

Originally posted by feederbing:

I then ran the evaluation:
btestindexes -A 1 -a 1 -S 10000 -r 101 -M 10 -f filename

Retracing my steps, I see that this should be -A 0 (nucleotide space instead of color space). I've now rerun the evaluation with the same masks. The output is in a different format, and I am trying to see whether it makes more sense.