PDA

View Full Version : Problem merging BAM files


hlwright
10-10-2011, 02:05 AM
Hi everyone. We are having difficulty merging BAM files. Our pipeline so far is this:-

50bp single-end rna-seq reads - map to hg19 with Bowtie, writing unaligned reads to a new file and writing mapped reads to a SAM file. Then map unaligned Bowtie reads with TopHat and write to a BAM file.

We now want to merge the two output files, but there is a problem with the way the reads are sorted. We have tried (1) samtools sort on both files, then samtools reheader and samtools merge, (2) sorting files as per command on cufflinks manual page (sort -k 3,3 -k 4,4n hits.sam > hits.sam.sorted) and then merging, (3) picard sort and merge commands, and yet none of these will actually merge the files.

An example of the picard input / output is here:

java -Xmx2g -Dsnappy.disable=true -jar /path/to/picard-tools-1.52/SortSam.jar INPUT=bowtie.sam OUTPUT=bowtie_picardsort.bam SORT_ORDER=coordinate

[Fri Oct 07 15:48:34 BST 2011] net.sf.picard.sam.SortSam INPUT=bowtie.sam OUTPUT=bowtie_picardsort.bam SORT_ORDER=coordinate VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Fri Oct 07 15:48:34 BST 2011] Executing as swebioinf@SWEBIOINFs-Mac-Pro.local on Mac OS X 10.6.8 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.6.0_26-b03-384-10M3425
Snappy is disabled via system property.
INFO 2011-10-07 15:49:56 SortSam Read 10000000 records.
INFO 2011-10-07 15:51:13 SortSam Read 20000000 records.
INFO 2011-10-07 15:52:29 SortSam Read 30000000 records.
INFO 2011-10-07 15:53:44 SortSam Read 40000000 records.
INFO 2011-10-07 15:54:59 SortSam Read 50000000 records.
INFO 2011-10-07 15:55:38 SortSam Finished reading inputs, merging and writing to output now.
[Fri Oct 07 16:04:53 BST 2011] net.sf.picard.sam.SortSam done. Elapsed time: 16.31 minutes.
Runtime.totalMemory()=830885888

java -Xmx2g -Dsnappy.disable=true -jar /path/to/picard-tools-1.52/SortSam.jar INPUT=tophat.bam OUTPUT=tophat_picardsort.bam SORT_ORDER=coordinate

[Fri Oct 07 16:07:29 BST 2011] net.sf.picard.sam.SortSam INPUT=tophat.bam OUTPUT=tophat_picardsort.bam SORT_ORDER=coordinate VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Fri Oct 07 16:07:29 BST 2011] Executing as swebioinf@SWEBIOINFs-Mac-Pro.local on Mac OS X 10.6.8 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.6.0_26-b03-384-10M3425
Snappy is disabled via system property.
INFO 2011-10-07 16:07:40 SortSam Finished reading inputs, merging and writing to output now.
[Fri Oct 07 16:07:54 BST 2011] net.sf.picard.sam.SortSam done. Elapsed time: 0.40 minutes.
Runtime.totalMemory()=527294464

java -Xmx2g -Dsnappy.disable=true -jar /path/to/picard-tools-1.52/MergeSamFiles.jar INPUT=bowtie_picardsort.bam INPUT=tophat_picardsort.bam OUTPUT=picardall.bam SORT_ORDER=coordinate

[Fri Oct 07 16:10:02 BST 2011] net.sf.picard.sam.MergeSamFiles INPUT=[bowtie_picardsort.bam, tophat_picardsort.bam] OUTPUT=picardall.bam SORT_ORDER=coordinate ASSUME_SORTED=false MERGE_SEQUENCE_DICTIONARIES=false USE_THREADING=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Fri Oct 07 16:10:02 BST 2011] Executing as swebioinf@SWEBIOINFs-Mac-Pro.local on Mac OS X 10.6.8 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.6.0_26-b03-384-10M3425
INFO 2011-10-07 16:10:02 MergeSamFiles Input files are in same order as output so sorting to temp directory is not needed.
[Fri Oct 07 16:10:02 BST 2011] net.sf.picard.sam.MergeSamFiles done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=85000192
Exception in thread "main" net.sf.samtools.util.SequenceUtil$SequenceListsDifferException: Sequences at index 1 don't match: 1/243199373/chr2 1/135534747/chr10
at net.sf.samtools.util.SequenceUtil.assertSequenceListsEqual(SequenceUtil.java:109)
at net.sf.samtools.util.SequenceUtil.assertSequenceDictionariesEqual(SequenceUtil.java:138)
at net.sf.picard.sam.SamFileHeaderMerger.getSequenceDictionary(SamFileHeaderMerger.java:482)
at net.sf.picard.sam.SamFileHeaderMerger.<init>(SamFileHeaderMerger.java:133)
at net.sf.picard.sam.MergeSamFiles.doWork(MergeSamFiles.java:137)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:175)
at net.sf.picard.sam.MergeSamFiles.main(MergeSamFiles.java:84)


The error seems to be this:

Exception in thread "main" net.sf.samtools.util.SequenceUtil$SequenceListsDifferException: Sequences at index 1 dont match: 1/243199373/chr2 1/135534747/chr10

I think the problem lies somewhere with the way bowtie and tophat sort their output by default. Bowtie sorts numerically, chr1, chr2, chr3... etc whereas TopHat sorts chr1, chr10, chr11......chr19, chr2, chr20, chr21 etc. But we have sorted both files again using picard SortSam so we cannot understand why this discrepancy is not corrected. (Same happens with samtools sort).

Can anyone help us please?

Many thanks

Helen

maubp
10-10-2011, 02:48 AM
Can you double check the embedded SAM text header (@SQ lines) and the BAM binary header (sequence names and lengths) are consistent? I suspect "samtools reheader" doesn't check this.

[Just a guess - there may well be other reasons why the merge fails]

hlwright
10-10-2011, 03:15 AM
Here are the headers:

Tophat output before Picard sorting:
@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr2 LN:243199373
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chrM LN:16571
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566


Tophat output after picard sorting:-
@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr2 LN:243199373
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chrM LN:16571
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566


Bowtie output before picard sorting:
@HD VN:1.0 SO:unsorted
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566
@SQ SN:chrM LN:16571


Bowtie output after picard sorting:
@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566
@SQ SN:chrM LN:16571


You can see that they are sorted differently and that picard sorting has not corrected this. Why? How can I fix this?

Chipper
10-10-2011, 12:54 PM
Use Picard ReorderSam with the same ref.fa file for both bam files. SortSam / samtools sort just sorts the alignments in the same order as the SQ lines in the header.

Jon_Keats
10-10-2011, 01:34 PM
Why not just align all reads with tophat? I'd guess the tophat alignment is not working well as all the exon restricted reads are lost in the bowtie alignment. Basically tophat is doing your pipeline as it does what you are doing and leverages the exon reads you are removing by doing bowtie first.

hlwright
10-10-2011, 11:51 PM
Chipper - thank you. I will try your suggestion today.

Jon - We run Bowtie using best and strata settings, and trim the 3' end of the read, which as far as I am aware you cannot do using TopHat. We have spent considerable time optimising our mapping and find TopHat alone is not enough for our particular project.

hlwright
10-11-2011, 12:36 AM
Thanks Chipper, your advice has solved the problem.

For anyone else who may be having the same problem, here is the solution:-

Create a sequence dictionary for your reference.fa in the same folder:-
java -Xmx2g -jar CreateSequenceDictionary.jar REFERENCE=hg19.fa OUTPUT=hg19.dict


Then reorder each of your bam files:-
java -Xmx2g -jar ReorderSam.jar INPUT=bowtie.bam OUTPUT=bowtie_reorder.bam REFERENCE=/path/to/hg19.fa

java -Xmx2g -jar ReorderSam.jar INPUT=tophat.bam OUTPUT=tophat_reorder.bam REFERENCE=/path/to/hg19.fa


Then merge your bam files:
java -Xmx2g -jar MergeSamFiles.jar INPUT=bowtie_reorder.bam INPUT=tophat_reorder.bam OUTPUT=reorder_merge.bam SORT_ORDER=coordinate



Once again thank you to everyone who took the time to read and reply to this post. Your help is very much appreciated :D