![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
How to merge multiple sequencing runs | vinay052003 | Bioinformatics | 4 | 01-31-2012 04:34 AM |
Merging reads mapped on genome and CDS (SOLID data) | jmandel | Bioinformatics | 1 | 01-04-2012 07:34 AM |
CNV analysis using sequencing data? | foxyg | General | 13 | 12-21-2011 06:37 AM |
Sequencing data analysis of pair-end sequencing | Smriti | Illumina/Solexa | 8 | 11-04-2011 08:51 AM |
Coordinating sequencing data and microarrays | bio | Bioinformatics | 2 | 02-18-2011 10:41 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: Connecticut Join Date: Jun 2009
Posts: 74
|
![]()
Hi everybody,
What are the consideration in merging sequencing data from from different sequencing runs? if data are paired-end, should it matter how close the insert sizes between each dataset that will be merged are? Also, SAMTOOLS have a merge tool for merging BAM files, what is the difference between merging BAM files and merging fastq sequence files? Are the two methods equivalent? Or which one has advantage over the other? Thanks. CSoong |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: 41°17'49"N / 2°4'42"E Join Date: Oct 2008
Posts: 323
|
![]()
The insert size is irrelevant for what you are trying to do.
The BAMs contain the alignments computed by your aligner of choice. The fastq contain the raw reads and associated qualities generated by your sequencer. Merging one or the other is not equivalent. Besides the alignments, your BAM will contain (if saved) useful metadata about your libraries, reference genome used, alignment tool, etc ... (check the SAM spec). The specification also supports keeping track of groups of reads that belong to a specific library.
__________________
-drd |
![]() |
![]() |
![]() |
#3 |
Member
Location: Connecticut Join Date: Jun 2009
Posts: 74
|
![]()
Hi Drio,
Thanks for the helpful explanation. Besides meta info about the libraries, would merging fastQ files then do alignments be equivalent to do alignments first on individual fastq files then merging them as BAM files? CSoong |
![]() |
![]() |
![]() |
#4 |
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
|
![]()
It depends on the aligner. For straight forward alignments (Bowtie, BWA etc) then the two operations would be the same since each sequence is aligned independently. However, spliced aligners (for example TopHat) use the combined evidence from the whole of an aligned file to detect potential splice junctions, so in some cases you wouldn't get the same result from aligning independently, or together.
|
![]() |
![]() |
![]() |
#5 |
Member
Location: Connecticut Join Date: Jun 2009
Posts: 74
|
![]()
good to know. thanks.
|
![]() |
![]() |
![]() |
#6 |
Member
Location: Connecticut Join Date: Jun 2009
Posts: 74
|
![]()
Hi,
I did a little test and found out that the alignment results is slightly different between (A) merging independently produced BAM files and (B) merging FASTQ before producing BAM. (I use bwa 0.5.8c aligner & samtools 0.1.12a) The difference is very slight so that downstream analysis may not be affected. However, as simonandrews pointed out, the result is unexpected since BWA aligns read independently. Any thoughts on why the slight difference? Below is the output of samtools idxstats between group (A) and (B). group (A): ~/Downloads/samtools-0.1.12a/samtools idxstats merge-bam-files.bam chr1 249250621 2163811 47135 chr2 243199373 2269463 49964 chr3 198022430 1765679 29872 chr4 191154276 1652124 41892 chr5 180915260 1491753 25751 chr6 171115067 1537865 25355 chr7 159138663 1492743 28693 chr8 146364022 1341856 23936 chr9 141213431 1172519 31277 chr10 135534747 1494271 51130 chr11 135006516 1290433 25822 chr12 133851895 1244998 21283 chr13 115169878 820855 12780 chr14 107349540 854953 14756 chr15 102531392 827560 16404 chr16 90354753 946573 21926 chr17 81195210 894572 20000 chr18 78077248 714390 15820 chr19 59128983 698790 15445 chr20 63025520 650961 9911 chr21 48129895 380978 8569 chr22 51304566 442003 8882 chrX 155270560 697698 17739 chrY 59373566 232017 22257 chrMT 16571 85978 1492 * 0 0 1401138 group (B): !~/Downloads/samtools-0.1.12a/samtools idxstats merge-fastq-first.bam chr1 249250621 2163772 47094 chr2 243199373 2269455 49992 chr3 198022430 1765768 29921 chr4 191154276 1652083 41864 chr5 180915260 1491797 25754 chr6 171115067 1537813 25290 chr7 159138663 1492795 28695 chr8 146364022 1341854 23994 chr9 141213431 1172460 31211 chr10 135534747 1494343 51211 chr11 135006516 1290482 25840 chr12 133851895 1245085 21310 chr13 115169878 820821 12782 chr14 107349540 854905 14724 chr15 102531392 827450 16386 chr16 90354753 946565 21872 chr17 81195210 894580 20008 chr18 78077248 714386 15822 chr19 59128983 698839 15451 chr20 63025520 650978 9913 chr21 48129895 380980 8598 chr22 51304566 441970 8872 chrX 155270560 697602 17725 chrY 59373566 232109 22286 chrMT 16571 85953 1474 * 0 0 1401138 |
![]() |
![]() |
![]() |
#7 |
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
|
![]()
I'm not too familiar with BWA, but I know that in Bowtie there are some circumstances where it will select a random hit from an equally good set of potential matches, which can lead to getting slightly different results from repeating the same run. Have you tried rerunning the same file through BWA to see if you get exactly the same result?
|
![]() |
![]() |
![]() |
#8 |
Senior Member
Location: 41°17'49"N / 2°4'42"E Join Date: Oct 2008
Posts: 323
|
![]()
BWA also picks a random alignment when there are multiple equally good matches. But, I am not sure how that is going to change those numbers from idxstats?
I am not sure what is the meaning of the last column (unmapped reads). Why are they assigned to a specific chromosome.
__________________
-drd |
![]() |
![]() |
![]() |
#9 |
Member
Location: Connecticut Join Date: Jun 2009
Posts: 74
|
![]()
Simon, the results are from the same files - file A and file B. I either merge A and B as fastQ or merge A and B as BAM.
Drio, I am not 100% sure as well, but I think the last column where it's associated with a chromosome are reads that have a paired-read that maps confidently to the specified chromosome. As oppose to the last row, which are reads that neither pair mapped. |
![]() |
![]() |
![]() |
#10 | |
Senior Member
Location: 41°17'49"N / 2°4'42"E Join Date: Oct 2008
Posts: 323
|
![]() Quote:
To confirm you can use samtools: Code:
$ samtools view -f3 merge-bam-files.bam | grep -v chr1 | wc -l # should be: 2163811 $ samtools view -f9 merge-bam-files.bam | grep -v chr1 | wc -l # should be: 47135
__________________
-drd |
|
![]() |
![]() |
![]() |
#11 |
Member
Location: Connecticut Join Date: Jun 2009
Posts: 74
|
![]()
Hi,
I did -f1 -f3 -f9. The -f3 options does not match the idxstats, see the output below. !~/Downloads/samtools-0.1.12a/samtools view -f3 merge.bam | awk '$3=="chr1"'| wc -l 2061072 !~/Downloads/samtools-0.1.12a/samtools view -f1 merge.bam | awk '$3=="chr1"'| wc -l 2210946 !~/Downloads/samtools-0.1.12a/samtools view -f9 merge.bam | awk '$3=="chr1"'| wc -l 47135 |
![]() |
![]() |
![]() |
#12 |
Senior Member
Location: 41°17'49"N / 2°4'42"E Join Date: Oct 2008
Posts: 323
|
![]()
Try:
Code:
$ samtools view -F5 merge.bam | awk '$3=="chr1"'| wc -l
__________________
-drd |
![]() |
![]() |
![]() |
#13 |
Member
Location: Connecticut Join Date: Jun 2009
Posts: 74
|
![]()
odd:
!~/Downloads/samtools-0.1.12a/samtools view -F5 merge.bam | awk '$3=="chr1"'| wc -l 0 seems like the middle column in idxstats is a little mysterious... |
![]() |
![]() |
![]() |
#14 | |
Member
Location: USA Join Date: Jan 2012
Posts: 38
|
![]() Quote:
|
|
![]() |
![]() |
![]() |
#15 | |
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
|
![]() Quote:
|
|
![]() |
![]() |
![]() |
#16 |
Junior Member
Location: boston Join Date: May 2013
Posts: 3
|
![]()
Here is a reply from the GATK team:
Well, the thing to keep in mind is that if you merge all your BAMs together into a big one, the processing of that big BAM is going to be very computationally demanding. Also, note that the recalibrator will process read groups individually, so you will not get the "whole lane data" advantage that you think from recalibrating multiple samples together. Actually, let me take a step back and give you a quick run-down of what we do in-house. Our setup is a bit complex because we have samples spread over multiple lanes, with multiple samples per lane. So when we get the FastQs, we separate out the read data by read group into individual files, so that after alignment we have one bam file per read group. We run dedup-realign-recal on each bam file, then merge the bams of read groups that belong to the same sample to produce one bam file per sample. Then we do another round of realign-recal on the sample bams as a form of cross-lane cleanup. But if you don't have samples spread across different lanes, you don't need to do all of this. The simplest is probably to separate out the samples using their read group tags into individual bam files and process them separately through dedup-realign-recal. I realize this contradicts what I said earlier; technically that (earlier suggestion to just process all together) is still an option (with some advantages, e.g. all samples aligned the same way), but if you plan to split up your samples before variant calling anyway, you might as well split everything earlier on and save yourself the compute resources of processing everything together. |
![]() |
![]() |
![]() |
Thread Tools | |
|
|