SEQanswers

Go Back   SEQanswers > Sequencing Technologies/Companies > Complete Genomics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Indexing with bwa and samtools scami Bioinformatics 3 05-05-2014 12:26 AM
Segmentation fault (core dumped)-BWA indexing ref genome zgene Bioinformatics 1 12-16-2012 07:17 AM
Indexing of bacterial genome for bwa-sw/Bowtie2 pravee1216 Bioinformatics 0 07-15-2012 10:18 PM

Reply
 
Thread Tools
Old 04-22-2017, 01:00 PM   #1
Greeeeb
Junior Member
 
Location: USA

Join Date: Apr 2017
Posts: 4
Default BWA indexing mechanism

Hi all,

I am new to bioinformatics. I am parallelizing a code that use BWA for alignment.

Can someone please explain to me in general how the indexing is happening?

I ask because the previous coder index and run alignment by calling BWA. then, the code "reads-in the reference geneome"; the code extract subsequence from the read-in reference by using the starting position that appears in the alignment file. For instance;

assume we have the following read: ACCCA

the reference has two chromosomes:
>chr1
GGGGGTTTTT
>chr2
AAACCCCCCCC

and assume the read is aligned to ACCCC in the second chromosome, where the starting position is: 13--I think BWA will produce 13.

the code reads in the reference and stores every chromosomes in a separate data structure. The problem is when when the code gets the subsequence "ACCCA" from the reference, it uses the starting position (13) to extract the subsequence. The way the reference is stored leads to out of range (13 is more than the 11 bases stored in chr2) and the program crashes. and even if the 13 is in range, we will be reading the wrong subsequence.

I need to know the BWA indexing mechanism so I get the correct subsequence from the read-in reference. I tried to add the length prefix to get the correct subsequence, but it did not work, where there are some dependencies in the code that may led to the problem.

Thanks.
Greeeeb is offline   Reply With Quote
Old 04-22-2017, 01:41 PM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,584
Default

If you are a proficient programmer source code for bwa is available on SF. If you need technical help then you could post to bwa mailing list.
GenoMax is offline   Reply With Quote
Old 04-22-2017, 03:05 PM   #3
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,476
Default

I thought BWA (as ends up being the case with a lot of aligners) ends up concatenating the chromosomes together when indexing (with a bunch of N in between). Anyway, BWA at least used to occasionally produce alignments that extended beyond the chromosome bounds, so if you have an example that makes it crash then post that on github as an issue.
dpryan is offline   Reply With Quote
Old 04-22-2017, 06:57 PM   #4
Greeeeb
Junior Member
 
Location: USA

Join Date: Apr 2017
Posts: 4
Default

Thanks!

I will go ahead and check the code.

The problem happens when our code is running, not the BWA. Also, I tried to read the indexing step output files. The large ones are not simple text formate. If that was the case, I would try to input small, tailored reference genomes to see how what is the output from the indexing.

As I noted, our program can handle 1 chromosome without problems. So, I merged the chromosomes of a human genome into one; but BWA exits without indexing the whole genome. The largest single chromosome reference BWA indexed successfully was about 160M bytes.
Greeeeb is offline   Reply With Quote
Old 04-23-2017, 05:29 AM   #5
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,584
Default

Not a direct answer but if you are open to substituting the aligner then BBMap from BBMap suite is an option. It is written in pure Java and author of BBMap participates (Brian Bushnell) here regularly.
GenoMax is offline   Reply With Quote
Old 04-23-2017, 10:13 AM   #6
Greeeeb
Junior Member
 
Location: USA

Join Date: Apr 2017
Posts: 4
Default

Actually it may not be possible to use something else. Since I am not well familiar with BWA, I am afraid I would not be able to preserve the compatibility with the code when I replace BWA commands with BBmap. Figuring out how to read the reference the way BWA does is the safest way.

Thanks.
Greeeeb is offline   Reply With Quote
Old 04-25-2017, 08:57 AM   #7
Greeeeb
Junior Member
 
Location: USA

Join Date: Apr 2017
Posts: 4
Default

I found a way to reconstruct the reference subsequence for an aligned read from the alignment information in SAM file. This is the "MD" field in the output. Here is example:

"MD: A string which summarizes the mismatch positions between the aligned read and the reference genome.

Example:
MD:Z:8G61 indicates a single base pair mismatch. Specifically, the aligned read matches the first 8 bases of the reference, after which it fails to match a G in the reference sequence, followed by 61 exact matches to the reference.
"
Source: http://biobits.org/samtools_primer.html

Thanks for all.
Greeeeb is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:23 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO