SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Merge multiple *.sai files in bwa ? hosseinv Bioinformatics 4 09-15-2012 06:50 PM
Merge multiple fq read files hosseinv Bioinformatics 7 08-23-2012 07:36 PM
Does the read depth distribution(not read depth) matters yuhao Bioinformatics 0 08-09-2012 06:22 AM
BWA: specifying SAM/BAM file header fields before read alignment? nora Bioinformatics 3 12-04-2010 10:11 PM
PubMed: Metagenomics: Read length matters. Newsbot! Literature Watch 0 02-13-2008 12:48 PM

Reply
 
Thread Tools
Old 07-12-2013, 11:45 AM   #1
gene_x
Senior Member
 
Location: MO

Join Date: May 2010
Posts: 108
Default BWA alignment of multiple read files, the order of processing matters?

Hi,
I have multiple fastq files coming from different sequencing lanes but of the same tissue origin. So I want to align them using BWA. I have tried two ways of doing this:

a: BWA align individual fastq and get bam file for each and then use samtools to merge the individual bam file into one final bam file

b: Merge fastq first, then BWA align and end up with one bam file.

The results differs only a little bit (7 reads out of 100+million) when I counted the number of mapped reads. But I thought they should be the same.

Does anyone know why is the difference between method a and b?

Thanks!
gene_x is offline   Reply With Quote
Old 07-12-2013, 12:22 PM   #2
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 535
Default

I haven't used BWA but perhaps there is inherent randomness in regards to your settings with mapping reads that don't map uniquely. Regardless, if only 7 out of 100+ million reads are different, it probably won't be significant downstream. How did you determine only 7 reads were different?

Also, different topic, but I prefer using Picard to merge .bam files if you have proper @RG headers in each one as it will maintain those.
Heisman is offline   Reply With Quote
Old 07-12-2013, 12:32 PM   #3
gene_x
Senior Member
 
Location: MO

Join Date: May 2010
Posts: 108
Default

Quote:
Originally Posted by Heisman View Post
I haven't used BWA but perhaps there is inherent randomness in regards to your settings with mapping reads that don't map uniquely. Regardless, if only 7 out of 100+ million reads are different, it probably won't be significant downstream. How did you determine only 7 reads were different?

Also, different topic, but I prefer using Picard to merge .bam files if you have proper @RG headers in each one as it will maintain those.
I used a in-house tool to count. I think the idea is to look at the bitwise field in the sam file and determine if a read is mapped or not.

Do you mean samtools merge don't maintain those header lines?

Last edited by gene_x; 07-12-2013 at 12:35 PM.
gene_x is offline   Reply With Quote
Old 07-12-2013, 12:39 PM   #4
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 535
Default

In the header file I don't believe it does, although it may in a field for each aligned read. This also only matters if you have the aligner include header information when you align. If you do and you have two separate .bam files now (with different header information), you can go ahead and try samtools merge and also Picard MergeSamFiles separately and see exactly what it keeps with samtools view -h <merged.bam>. I don't remember exactly what they do differently so I don't want to guess incorrectly.
Heisman is offline   Reply With Quote
Old 07-12-2013, 12:43 PM   #5
gene_x
Senior Member
 
Location: MO

Join Date: May 2010
Posts: 108
Default

I didn't realize I can look at the bam file...

Just checked it since I have done it and the header is there and are the same between the merged bam file using samtools merge and the bam file produced from aligning merged fastq files.
gene_x is offline   Reply With Quote
Old 07-12-2013, 12:55 PM   #6
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 535
Default

Right, I think the difference would be if you have different "@RG\tID:" tags for each file that you merge together. I just checked and it doesn't affect the aligned reads themselves, just the header section. For example, one .bam file has "RG:A1_ACAGT_L003" and another has "RG:A2_AGTGAC_L005", and those show up with the RG flag in each aligned read, that will be maintained in the merged file with either program. With samtools merge, however, the header lines will only have one of the RG lines, while with Picard it will have both of them.

If you just type "samtools merge" there's a note at the bottom that actually says this.
Heisman is offline   Reply With Quote
Old 07-12-2013, 01:05 PM   #7
gene_x
Senior Member
 
Location: MO

Join Date: May 2010
Posts: 108
Default

I asked you in another post what RG LB IDs are..

Do you mean with Picard, it will have both of the header files (no matter they are the same or different) in the beginning of the alignment file?

When I run samtools merge with a bunch of bam files. I didn't specify any options. All the bam file are alignment of reads coming from the same sequencer (with reads from multiple lanes). Then the merged bam file has the same header as the individual bam file. I guess it just retain the header when the to-be-merged bam files have the same header.
gene_x is offline   Reply With Quote
Old 07-12-2013, 01:12 PM   #8
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 535
Default

Yeah, that sounds right. If these are all from the same library just sequenced many times it doesn't really matter. If these are all from the same sample but different libraries and each library was sequenced one time it still probably doesn't matter too much. If there are different libraries and some were sequenced more than once, then this will become more important for properly removing duplicate reads.

Try to read through what I linked in the other thread and if there are still things unclear feel free to provide a more thorough description/flowchart of the overall experiment (detailing the number of samples, how many libraries were made of each sample, and how many times each library was sequenced) and we can try to figure out the best way to analyze the data.
Heisman is offline   Reply With Quote
Old 07-12-2013, 01:20 PM   #9
gene_x
Senior Member
 
Location: MO

Join Date: May 2010
Posts: 108
Default

Quote:
Originally Posted by Heisman View Post
Yeah, that sounds right. If these are all from the same library just sequenced many times it doesn't really matter. If these are all from the same sample but different libraries and each library was sequenced one time it still probably doesn't matter too much. If there are different libraries and some were sequenced more than once, then this will become more important for properly removing duplicate reads.

Try to read through what I linked in the other thread and if there are still things unclear feel free to provide a more thorough description/flowchart of the overall experiment (detailing the number of samples, how many libraries were made of each sample, and how many times each library was sequenced) and we can try to figure out the best way to analyze the data.
They are actually different libraries but sequenced on the same machine. I'll read the sam format a bit more. My feeling is reads from the same sequencer in one run have the same header.
gene_x is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:06 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO