**drio** · 12-22-2010, 03:21 AM

The insert size is irrelevant for what you are trying to do.

The BAMs contain the alignments computed by your aligner of choice. The FASTQ files contain the raw reads and associated qualities generated by your sequencer.

Merging one or the other is not equivalent. Besides the alignments, your BAM will contain (if saved) useful metadata about your libraries, the reference genome used, the alignment tool, etc. (check the SAM spec).

The specification also supports keeping track of groups of reads that belong to a specific library.

**csoong** · 12-22-2010, 06:42 AM

Hi Drio,

Thanks for the helpful explanation.

Besides meta info about the libraries, would merging FASTQ files and then aligning be equivalent to aligning the individual FASTQ files first and then merging the resulting BAM files?

CSoong

**simonandrews** · 12-23-2010, 12:40 AM

Originally posted by csoong:

Besides meta info about the libraries, would merging FASTQ files and then aligning be equivalent to aligning the individual FASTQ files first and then merging the resulting BAM files?

It depends on the aligner. For straightforward alignments (Bowtie, BWA etc.) the two operations would be the same, since each sequence is aligned independently. However, spliced aligners (for example TopHat) use the combined evidence from the whole of an aligned file to detect potential splice junctions, so in some cases you wouldn't get the same result from aligning the files independently as from aligning them together.

**csoong** · 12-23-2010, 03:45 AM

Good to know. Thanks.

**csoong** · 12-26-2010, 04:29 PM

Hi,

I did a little test and found that the alignment results are slightly different between (A) merging independently produced BAM files and (B) merging the FASTQ files before producing a BAM. (I used the bwa 0.5.8c aligner and samtools 0.1.12a.)

The difference is small enough that downstream analysis may not be affected. However, given simonandrews' point that BWA aligns reads independently, the result is unexpected. Any thoughts on why there is a slight difference? Below is the output of samtools idxstats for groups (A) and (B).

group (A): ~/Downloads/samtools-0.1.12a/samtools idxstats merge-bam-files.bam
chr1 249250621 2163811 47135
chr2 243199373 2269463 49964
chr3 198022430 1765679 29872
chr4 191154276 1652124 41892
chr5 180915260 1491753 25751
chr6 171115067 1537865 25355
chr7 159138663 1492743 28693
chr8 146364022 1341856 23936
chr9 141213431 1172519 31277
chr10 135534747 1494271 51130
chr11 135006516 1290433 25822
chr12 133851895 1244998 21283
chr13 115169878 820855 12780
chr14 107349540 854953 14756
chr15 102531392 827560 16404
chr16 90354753 946573 21926
chr17 81195210 894572 20000
chr18 78077248 714390 15820
chr19 59128983 698790 15445
chr20 63025520 650961 9911
chr21 48129895 380978 8569
chr22 51304566 442003 8882
chrX 155270560 697698 17739
chrY 59373566 232017 22257
chrMT 16571 85978 1492
* 0 0 1401138
group (B): !~/Downloads/samtools-0.1.12a/samtools idxstats merge-fastq-first.bam
chr1 249250621 2163772 47094
chr2 243199373 2269455 49992
chr3 198022430 1765768 29921
chr4 191154276 1652083 41864
chr5 180915260 1491797 25754
chr6 171115067 1537813 25290
chr7 159138663 1492795 28695
chr8 146364022 1341854 23994
chr9 141213431 1172460 31211
chr10 135534747 1494343 51211
chr11 135006516 1290482 25840
chr12 133851895 1245085 21310
chr13 115169878 820821 12782
chr14 107349540 854905 14724
chr15 102531392 827450 16386
chr16 90354753 946565 21872
chr17 81195210 894580 20008
chr18 78077248 714386 15822
chr19 59128983 698839 15451
chr20 63025520 650978 9913
chr21 48129895 380980 8598
chr22 51304566 441970 8872
chrX 155270560 697602 17725
chrY 59373566 232109 22286
chrMT 16571 85953 1474
* 0 0 1401138
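
The two listings are easier to compare as per-chromosome differences. The following is a minimal sketch, assuming the four idxstats columns are reference name, sequence length, mapped read count and unmapped read count (as samtools documents them); the function names `parse_idxstats` and `diff_idxstats` are made up for illustration, and only the first two chromosomes plus the `*` row from the listings above are included.

```python
def parse_idxstats(text):
    """Parse `samtools idxstats` output into {ref: (length, mapped, unmapped)}."""
    table = {}
    for line in text.strip().splitlines():
        name, length, mapped, unmapped = line.split()
        table[name] = (int(length), int(mapped), int(unmapped))
    return table


def diff_idxstats(a, b):
    """Return {ref: (mapped_b - mapped_a, unmapped_b - unmapped_a)} for rows that differ."""
    out = {}
    for name in a:
        d_mapped = b[name][1] - a[name][1]
        d_unmapped = b[name][2] - a[name][2]
        if d_mapped or d_unmapped:
            out[name] = (d_mapped, d_unmapped)
    return out


# Excerpts copied from the group (A) and group (B) listings above.
group_a = """chr1 249250621 2163811 47135
chr2 243199373 2269463 49964
* 0 0 1401138"""
group_b = """chr1 249250621 2163772 47094
chr2 243199373 2269455 49992
* 0 0 1401138"""

diffs = diff_idxstats(parse_idxstats(group_a), parse_idxstats(group_b))
for ref, (d_mapped, d_unmapped) in diffs.items():
    print(ref, d_mapped, d_unmapped)
# prints:
# chr1 -39 -41
# chr2 -8 28
```

In practice the two strings would be replaced by the full idxstats output of each BAM file.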

**simonandrews** · 12-27-2010, 01:31 AM

I'm not too familiar with BWA, but I know that in Bowtie there are some circumstances where it will select a random hit from an equally good set of potential matches, which can lead to getting slightly different results from repeating the same run. Have you tried rerunning the same file through BWA to see if you get exactly the same result?

**drio** · 12-27-2010, 01:57 AM

BWA also picks a random alignment when there are multiple equally good matches. But I am not sure how that would change those idxstats numbers.

I am also not sure what the last column (unmapped reads) means. Why are unmapped reads assigned to a specific chromosome?
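
As a toy illustration of the random tie-breaking mentioned above, the sketch below shows how a choice among equally good hits can depend on where a read sits in the input. It is not a description of BWA's or Bowtie's internals: the single shared seeded generator, the `align` function and the read names are all assumptions made up for the example.

```python
import random


def align(reads, seed=0):
    """Toy aligner: each read gets one of its equally good candidate hits,
    drawn from a single seeded random generator shared by the whole run."""
    rng = random.Random(seed)
    return {name: rng.choice(candidates) for name, candidates in reads}


# Hypothetical multi-mapping reads with their equally good candidate hits.
file_a = [("read1", ["chr1", "chr5", "chr9"]), ("read2", ["chr2", "chr7"])]
file_b = [("read3", ["chr1", "chr5", "chr9"]), ("read4", ["chr3", "chrX"])]

separate = {**align(file_a), **align(file_b)}  # align each file, then merge
merged = align(file_a + file_b)                # merge the files, then align

# Reads whose chosen hit depends on the order of operations.
changed = [name for name in merged if merged[name] != separate[name]]
print(changed)
```

Under this assumption only reads from the second file can change, because they are the ones that see a different part of the random stream after merging. Unique hits are never affected.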

**csoong** · 12-27-2010, 06:18 AM

Simon, the results are from the same files - file A and file B. I either merge A and B as fastQ or merge A and B as BAM.

Drio, I am not 100% sure either, but I think the last column, where it's associated with a chromosome, counts unmapped reads whose mate maps confidently to that chromosome. That is as opposed to the last row, which counts reads from pairs where neither end mapped.

**drio** · 12-27-2010, 06:44 AM

Originally posted by csoong:

I am not 100% sure either, but I think the last column, where it's associated with a chromosome, counts unmapped reads whose mate maps confidently to that chromosome. That is as opposed to the last row, which counts reads from pairs where neither end mapped.

You mean the third column shows reads where both ends map and the fourth column shows reads where only one end of the pair maps? Then, if working with single-end data, both columns should display the same values.

To confirm you can use samtools:

Code:

$ samtools view -f3 merge-bam-files.bam | awk '$3=="chr1"' | wc -l
# should be: 2163811
$ samtools view -f9 merge-bam-files.bam | awk '$3=="chr1"' | wc -l
# should be: 47135
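
For reference, the `-f` values used here are bit masks over the SAM FLAG field, and `-f` keeps only reads that have every listed bit set. A minimal sketch that decodes the two masks, using the bit meanings from the SAM specification (the `decode` helper is just illustrative):

```python
# Low bits of the FLAG field, as defined in the SAM specification.
FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
}


def decode(mask):
    """List the names of the flag bits that are set in `mask`."""
    return [name for bit, name in FLAG_BITS.items() if mask & bit]


print(decode(3))  # ['read paired', 'read mapped in proper pair']
print(decode(9))  # ['read paired', 'mate unmapped']
```

So `-f3` selects reads in proper pairs, which is a narrower set than "both ends map", and `-f9` selects paired reads whose mate is unmapped.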

**csoong** · 12-27-2010, 07:26 AM

Hi,
I tried -f1, -f3 and -f9. The -f3 result does not match idxstats; see the output below.

!~/Downloads/samtools-0.1.12a/samtools view -f3 merge.bam | awk '$3=="chr1"'| wc -l
2061072

!~/Downloads/samtools-0.1.12a/samtools view -f1 merge.bam | awk '$3=="chr1"'| wc -l
2210946

!~/Downloads/samtools-0.1.12a/samtools view -f9 merge.bam | awk '$3=="chr1"'| wc -l
47135
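
One reading that is consistent with these counts, assuming merge.bam is the group (A) file and every read in it carries the paired flag: the `-f1` count for chr1 equals the sum of the third and fourth idxstats columns for chr1. A quick arithmetic check using only numbers quoted in this thread:

```python
mapped_chr1 = 2163811     # idxstats third column for chr1, group (A)
unmapped_chr1 = 47135     # idxstats fourth column for chr1, group (A)
paired_on_chr1 = 2210946  # samtools view -f1 ... | awk '$3=="chr1"' | wc -l

# All paired reads placed on chr1 = third column + fourth column.
assert mapped_chr1 + unmapped_chr1 == paired_on_chr1
print(paired_on_chr1 - unmapped_chr1)  # 2163811
```

That would suggest the third column counts every mapped read placed on chr1, whether or not it is in a proper pair, which is why the `-f3` count (2061072) comes out lower.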

**drio** · 12-27-2010, 08:04 AM

Try:

Code:

$ samtools view -F5  merge.bam | awk '$3=="chr1"'| wc -l

That plus 2061072 should equal 2163811

**csoong** · 12-27-2010, 08:19 AM

odd:
!~/Downloads/samtools-0.1.12a/samtools view -F5 merge.bam | awk '$3=="chr1"'| wc -l
0

seems like the middle column in idxstats is a little mysterious...
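
The zero from `-F5` follows from how the filters work: `-F` drops any read that has at least one of the listed bits set, and 5 is 0x1 (read paired) plus 0x4 (read unmapped). In a paired-end BAM every read has 0x1 set, so nothing survives. A minimal sketch of the filter logic (the `passes` helper and the example flag values are illustrative, not taken from the files in this thread):

```python
def passes(flag, require=0, exclude=0):
    """Mimic `samtools view -f require -F exclude`: keep a read only if all
    `require` bits are set and none of the `exclude` bits are set."""
    return (flag & require) == require and (flag & exclude) == 0


# Example flags: 99 = paired, proper pair, mate reverse strand, first in pair;
# 73 = paired, mate unmapped, first in pair; 0 = mapped single-end read.
for flag in (99, 73, 0):
    print(flag, passes(flag, exclude=5), passes(flag, exclude=4))
# prints:
# 99 False True
# 73 False True
# 0 True True
```

Excluding only the unmapped bit (`-F4`) keeps mapped reads regardless of pairing, which is the count the middle idxstats column would be expected to match if it reports mapped reads per reference.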

**epi** · 04-30-2012, 06:30 AM

Originally posted by simonandrews:

It depends on the aligner. For straightforward alignments (Bowtie, BWA etc.) the two operations would be the same, since each sequence is aligned independently. However, spliced aligners (for example TopHat) use the combined evidence from the whole of an aligned file to detect potential splice junctions, so in some cases you wouldn't get the same result from aligning the files independently as from aligning them together.

I am facing the exact situation discussed in this thread; in fact I started a new thread because I was unaware of this one. Simon, your reply is useful, especially the point about spliced aligners. What's your opinion on aligning and then merging when one is looking only for unique matches (as in ChIP-seq)? Wouldn't even Bowtie/BWA give different results?

**simonandrews** · 04-30-2012, 07:41 AM

Originally posted by epi:

I am facing the exact situation discussed in this thread; in fact I started a new thread because I was unaware of this one. Simon, your reply is useful, especially the point about spliced aligners. What's your opinion on aligning and then merging when one is looking only for unique matches (as in ChIP-seq)? Wouldn't even Bowtie/BWA give different results?

Yes - I wouldn't generally trust mixing different aligners in the same analysis: although they operate on similar metrics in many cases, they'll all have their own biases. Even if you're looking at two datasets from the same aligner, I'd still want to know that they were run with the same options, as this too can have an effect. To some extent the same problem exists when running different length reads in the same project. I'd always prefer to work on data from the same platform, with the same run type, analysed with the same aligner. That isn't to say that you can't do useful analysis if the aligners don't match, but this is definitely going to increase the noise in the results.

merging sequencing data from different sequencing runs