SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
How to merge multiple sequencing runs vinay052003 Bioinformatics 4 01-31-2012 03:34 AM
Merging reads mapped on genome and CDS (SOLID data) jmandel Bioinformatics 1 01-04-2012 06:34 AM
CNV analysis using sequencing data? foxyg General 13 12-21-2011 05:37 AM
Sequencing data analysis of pair-end sequencing Smriti Illumina/Solexa 8 11-04-2011 07:51 AM
Coordinating sequencing data and microarrays bio Bioinformatics 2 02-18-2011 09:41 AM

Reply
 
Thread Tools
Old 12-21-2010, 11:21 PM   #1
csoong
Member
 
Location: Connecticut

Join Date: Jun 2009
Posts: 74
Default merging sequencing data from different sequencing runs

Hi everybody,

What are the consideration in merging sequencing data from from different sequencing runs?
if data are paired-end, should it matter how close the insert sizes between each dataset that will be merged are?
Also, SAMTOOLS have a merge tool for merging BAM files, what is the difference between merging BAM files and merging fastq sequence files? Are the two methods equivalent? Or which one has advantage over the other?
Thanks.

CSoong
csoong is offline   Reply With Quote
Old 12-22-2010, 02:21 AM   #2
drio
Senior Member
 
Location: 4117'49"N / 24'42"E

Join Date: Oct 2008
Posts: 323
Default

The insert size is irrelevant for what you are trying to do.

The BAMs contain the alignments computed by your aligner of choice. The fastq contain the raw reads and associated qualities generated by your sequencer.

Merging one or the other is not equivalent. Besides the alignments, your BAM will contain (if saved) useful metadata about your libraries, reference genome used, alignment tool, etc ... (check the SAM spec).

The specification also supports keeping track of groups of reads that belong to a specific library.
__________________
-drd
drio is offline   Reply With Quote
Old 12-22-2010, 05:42 AM   #3
csoong
Member
 
Location: Connecticut

Join Date: Jun 2009
Posts: 74
Default

Hi Drio,

Thanks for the helpful explanation.

Besides meta info about the libraries, would merging fastQ files then do alignments be equivalent to do alignments first on individual fastq files then merging them as BAM files?

CSoong
csoong is offline   Reply With Quote
Old 12-22-2010, 11:40 PM   #4
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by csoong View Post
Besides meta info about the libraries, would merging fastQ files then do alignments be equivalent to do alignments first on individual fastq files then merging them as BAM files?
It depends on the aligner. For straight forward alignments (Bowtie, BWA etc) then the two operations would be the same since each sequence is aligned independently. However, spliced aligners (for example TopHat) use the combined evidence from the whole of an aligned file to detect potential splice junctions, so in some cases you wouldn't get the same result from aligning independently, or together.
simonandrews is offline   Reply With Quote
Old 12-23-2010, 02:45 AM   #5
csoong
Member
 
Location: Connecticut

Join Date: Jun 2009
Posts: 74
Default

good to know. thanks.
csoong is offline   Reply With Quote
Old 12-26-2010, 03:29 PM   #6
csoong
Member
 
Location: Connecticut

Join Date: Jun 2009
Posts: 74
Default

Hi,

I did a little test and found out that the alignment results is slightly different between (A) merging independently produced BAM files and (B) merging FASTQ before producing BAM. (I use bwa 0.5.8c aligner & samtools 0.1.12a)

The difference is very slight so that downstream analysis may not be affected. However, as simonandrews pointed out, the result is unexpected since BWA aligns read independently. Any thoughts on why the slight difference? Below is the output of samtools idxstats between group (A) and (B).

group (A): ~/Downloads/samtools-0.1.12a/samtools idxstats merge-bam-files.bam
chr1 249250621 2163811 47135
chr2 243199373 2269463 49964
chr3 198022430 1765679 29872
chr4 191154276 1652124 41892
chr5 180915260 1491753 25751
chr6 171115067 1537865 25355
chr7 159138663 1492743 28693
chr8 146364022 1341856 23936
chr9 141213431 1172519 31277
chr10 135534747 1494271 51130
chr11 135006516 1290433 25822
chr12 133851895 1244998 21283
chr13 115169878 820855 12780
chr14 107349540 854953 14756
chr15 102531392 827560 16404
chr16 90354753 946573 21926
chr17 81195210 894572 20000
chr18 78077248 714390 15820
chr19 59128983 698790 15445
chr20 63025520 650961 9911
chr21 48129895 380978 8569
chr22 51304566 442003 8882
chrX 155270560 697698 17739
chrY 59373566 232017 22257
chrMT 16571 85978 1492
* 0 0 1401138
group (B): !~/Downloads/samtools-0.1.12a/samtools idxstats merge-fastq-first.bam
chr1 249250621 2163772 47094
chr2 243199373 2269455 49992
chr3 198022430 1765768 29921
chr4 191154276 1652083 41864
chr5 180915260 1491797 25754
chr6 171115067 1537813 25290
chr7 159138663 1492795 28695
chr8 146364022 1341854 23994
chr9 141213431 1172460 31211
chr10 135534747 1494343 51211
chr11 135006516 1290482 25840
chr12 133851895 1245085 21310
chr13 115169878 820821 12782
chr14 107349540 854905 14724
chr15 102531392 827450 16386
chr16 90354753 946565 21872
chr17 81195210 894580 20008
chr18 78077248 714386 15822
chr19 59128983 698839 15451
chr20 63025520 650978 9913
chr21 48129895 380980 8598
chr22 51304566 441970 8872
chrX 155270560 697602 17725
chrY 59373566 232109 22286
chrMT 16571 85953 1474
* 0 0 1401138
csoong is offline   Reply With Quote
Old 12-27-2010, 12:31 AM   #7
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

I'm not too familiar with BWA, but I know that in Bowtie there are some circumstances where it will select a random hit from an equally good set of potential matches, which can lead to getting slightly different results from repeating the same run. Have you tried rerunning the same file through BWA to see if you get exactly the same result?
simonandrews is offline   Reply With Quote
Old 12-27-2010, 12:57 AM   #8
drio
Senior Member
 
Location: 4117'49"N / 24'42"E

Join Date: Oct 2008
Posts: 323
Default

BWA also picks a random alignment when there are multiple equally good matches. But, I am not sure how that is going to change those numbers from idxstats?

I am not sure what is the meaning of the last column (unmapped reads). Why are they assigned to a specific chromosome.
__________________
-drd
drio is offline   Reply With Quote
Old 12-27-2010, 05:18 AM   #9
csoong
Member
 
Location: Connecticut

Join Date: Jun 2009
Posts: 74
Default

Simon, the results are from the same files - file A and file B. I either merge A and B as fastQ or merge A and B as BAM.

Drio, I am not 100% sure as well, but I think the last column where it's associated with a chromosome are reads that have a paired-read that maps confidently to the specified chromosome. As oppose to the last row, which are reads that neither pair mapped.
csoong is offline   Reply With Quote
Old 12-27-2010, 05:44 AM   #10
drio
Senior Member
 
Location: 4117'49"N / 24'42"E

Join Date: Oct 2008
Posts: 323
Default

Quote:
Originally Posted by csoong View Post
I am not 100% sure as well, but I think the last column where it's associated with a chromosome are reads that have a paired-read that maps confidently to the specified chromosome. As oppose to the last row, which are reads that neither pair mapped.
You mean the third column shows reads where both ends map and the forth column shows reads where one of the reads maps? Then, if working with single end data, both columns should display the same values.

To confirm you can use samtools:

Code:
$ samtools view -f3 merge-bam-files.bam | grep -v chr1 | wc -l 
# should be: 2163811
$ samtools view -f9 merge-bam-files.bam | grep -v chr1 | wc -l 
# should be: 47135
__________________
-drd
drio is offline   Reply With Quote
Old 12-27-2010, 06:26 AM   #11
csoong
Member
 
Location: Connecticut

Join Date: Jun 2009
Posts: 74
Default

Hi,
I did -f1 -f3 -f9. The -f3 options does not match the idxstats, see the output below.

!~/Downloads/samtools-0.1.12a/samtools view -f3 merge.bam | awk '$3=="chr1"'| wc -l
2061072

!~/Downloads/samtools-0.1.12a/samtools view -f1 merge.bam | awk '$3=="chr1"'| wc -l
2210946

!~/Downloads/samtools-0.1.12a/samtools view -f9 merge.bam | awk '$3=="chr1"'| wc -l
47135
csoong is offline   Reply With Quote
Old 12-27-2010, 07:04 AM   #12
drio
Senior Member
 
Location: 4117'49"N / 24'42"E

Join Date: Oct 2008
Posts: 323
Default

Try:

Code:
$ samtools view -F5  merge.bam | awk '$3=="chr1"'| wc -l
That plus 2061072 should equal 2163811
__________________
-drd
drio is offline   Reply With Quote
Old 12-27-2010, 07:19 AM   #13
csoong
Member
 
Location: Connecticut

Join Date: Jun 2009
Posts: 74
Default

odd:
!~/Downloads/samtools-0.1.12a/samtools view -F5 merge.bam | awk '$3=="chr1"'| wc -l
0

seems like the middle column in idxstats is a little mysterious...
csoong is offline   Reply With Quote
Old 04-30-2012, 06:30 AM   #14
epi
Member
 
Location: USA

Join Date: Jan 2012
Posts: 38
Default

Quote:
Originally Posted by simonandrews View Post
It depends on the aligner. For straight forward alignments (Bowtie, BWA etc) then the two operations would be the same since each sequence is aligned independently. However, spliced aligners (for example TopHat) use the combined evidence from the whole of an aligned file to detect potential splice junctions, so in some cases you wouldn't get the same result from aligning independently, or together.
I am facing the exact situation mentioned in this thread, in fact i started a new since i was unaware. Simon your reply is useful, especially splice aligners. Whats your opinion on aligning and then merging in the case when one is looking for just the unique matches (like in chip-seq). Wouldn't even bowties/BWA give different results?
epi is offline   Reply With Quote
Old 04-30-2012, 07:41 AM   #15
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by epi View Post
I am facing the exact situation mentioned in this thread, in fact i started a new since i was unaware. Simon your reply is useful, especially splice aligners. Whats your opinion on aligning and then merging in the case when one is looking for just the unique matches (like in chip-seq). Wouldn't even bowties/BWA give different results?
Yes - I wouldn't generally trust mixing different aligners in the same analysis as although they operate on similar metrics in many cases they'll all have their own biases. Even if you're looking at two datasets from the same aligner I'd still want to know that they were run with the same options as this too can have an effect. To some extent the same problem exists when running different length reads in the same project. I'd always prefer to work on data from the same platform with the same run type analysed with the same aligner. That isn't to say that you can't do useful analysis if the aligners don't match, but this is definitely going to increase the noise in the results.
simonandrews is offline   Reply With Quote
Old 05-15-2013, 11:02 AM   #16
blueskypy
Junior Member
 
Location: boston

Join Date: May 2013
Posts: 3
Default

Here is a reply from the GATK team:
Well, the thing to keep in mind is that if you merge all your BAMs together into a big one, the processing of that big BAM is going to be very computationally demanding. Also, note that the recalibrator will process read groups individually, so you will not get the "whole lane data" advantage that you think from recalibrating multiple samples together.
Actually, let me take a step back and give you a quick run-down of what we do in-house.
Our setup is a bit complex because we have samples spread over multiple lanes, with multiple samples per lane. So when we get the FastQs, we separate out the read data by read group into individual files, so that after alignment we have one bam file per read group. We run dedup-realign-recal on each bam file, then merge the bams of read groups that belong to the same sample to produce one bam file per sample. Then we do another round of realign-recal on the sample bams as a form of cross-lane cleanup.
But if you don't have samples spread across different lanes, you don't need to do all of this. The simplest is probably to separate out the samples using their read group tags into individual bam files and process them separately through dedup-realign-recal. I realize this contradicts what I said earlier; technically that (earlier suggestion to just process all together) is still an option (with some advantages, e.g. all samples aligned the same way), but if you plan to split up your samples before variant calling anyway, you might as well split everything earlier on and save yourself the compute resources of processing everything together.
blueskypy is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:49 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO