Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • hlwright
    Member
    • Feb 2011
    • 30

    Problem merging BAM files

    Hi everyone. We are having difficulty merging BAM files. Our pipeline so far is this:-

    50bp single-end rna-seq reads - map to hg19 with Bowtie, writing unaligned reads to a new file and writing mapped reads to a SAM file. Then map unaligned Bowtie reads with TopHat and write to a BAM file.

    We now want to merge the two output files, but there is a problem with the way the reads are sorted. We have tried (1) samtools sort on both files, then samtools reheader and samtools merge, (2) sorting files as per command on cufflinks manual page (sort -k 3,3 -k 4,4n hits.sam > hits.sam.sorted) and then merging, (3) picard sort and merge commands, and yet none of these will actually merge the files.

    An example of the picard input / output is here:

    PHP Code:
    java -Xmx2g -Dsnappy.disable=true -jar /path/to/picard-tools-1.52/SortSam.jar INPUT=bowtie.sam OUTPUT=bowtie_picardsort.bam SORT_ORDER=coordinate

    [Fri Oct 07 15:48:34 BST 2011net.sf.picard.sam.SortSam INPUT=bowtie.sam OUTPUT=bowtie_picardsort.bam SORT_ORDER=coordinate    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
    [Fri Oct 07 15:48:34 BST 2011Executing as swebioinf@SWEBIOINFs-Mac-Pro.local on Mac OS X 10.6.8 x86_64Java HotSpot(TM64-Bit Server VM 1.6.0_26-b03-384-10M3425
    Snappy is disabled via system property
    .
    INFO    2011-10-07 15:49:56    SortSam    Read 10000000 records.
    INFO    2011-10-07 15:51:13    SortSam    Read 20000000 records.
    INFO    2011-10-07 15:52:29    SortSam    Read 30000000 records.
    INFO    2011-10-07 15:53:44    SortSam    Read 40000000 records.
    INFO    2011-10-07 15:54:59    SortSam    Read 50000000 records.
    INFO    2011-10-07 15:55:38    SortSam    Finished reading inputsmerging and writing to output now.
    [
    Fri Oct 07 16:04:53 BST 2011net.sf.picard.sam.SortSam doneElapsed time16.31 minutes.
    Runtime.totalMemory()=830885888

    java 
    -Xmx2g -Dsnappy.disable=true -jar /path/to/picard-tools-1.52/SortSam.jar INPUT=tophat.bam OUTPUT=tophat_picardsort.bam SORT_ORDER=coordinate

    [Fri Oct 07 16:07:29 BST 2011net.sf.picard.sam.SortSam INPUT=tophat.bam OUTPUT=tophat_picardsort.bam SORT_ORDER=coordinate    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
    [Fri Oct 07 16:07:29 BST 2011Executing as swebioinf@SWEBIOINFs-Mac-Pro.local on Mac OS X 10.6.8 x86_64Java HotSpot(TM64-Bit Server VM 1.6.0_26-b03-384-10M3425
    Snappy is disabled via system property
    .
    INFO    2011-10-07 16:07:40    SortSam    Finished reading inputsmerging and writing to output now.
    [
    Fri Oct 07 16:07:54 BST 2011net.sf.picard.sam.SortSam doneElapsed time0.40 minutes.
    Runtime.totalMemory()=527294464

    java 
    -Xmx2g -Dsnappy.disable=true -jar /path/to/picard-tools-1.52/MergeSamFiles.jar INPUT=bowtie_picardsort.bam INPUT=tophat_picardsort.bam OUTPUT=picardall.bam SORT_ORDER=coordinate

    [Fri Oct 07 16:10:02 BST 2011net.sf.picard.sam.MergeSamFiles INPUT=[bowtie_picardsort.bamtophat_picardsort.bamOUTPUT=picardall.bam SORT_ORDER=coordinate    ASSUME_SORTED=false MERGE_SEQUENCE_DICTIONARIES=false USE_THREADING=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
    [Fri Oct 07 16:10:02 BST 2011Executing as swebioinf@SWEBIOINFs-Mac-Pro.local on Mac OS X 10.6.8 x86_64Java HotSpot(TM64-Bit Server VM 1.6.0_26-b03-384-10M3425
    INFO    2011
    -10-07 16:10:02    MergeSamFiles    Input files are in same order as output so sorting to temp directory is not needed.
    [
    Fri Oct 07 16:10:02 BST 2011net.sf.picard.sam.MergeSamFiles doneElapsed time0.00 minutes.
    Runtime.totalMemory()=85000192
    Exception in thread 
    "main" net.sf.samtools.util.SequenceUtil$SequenceListsDifferExceptionSequences at index 1 don't match: 1/243199373/chr2 1/135534747/chr10
        at net.sf.samtools.util.SequenceUtil.assertSequenceListsEqual(SequenceUtil.java:109)
        at net.sf.samtools.util.SequenceUtil.assertSequenceDictionariesEqual(SequenceUtil.java:138)
        at net.sf.picard.sam.SamFileHeaderMerger.getSequenceDictionary(SamFileHeaderMerger.java:482)
        at net.sf.picard.sam.SamFileHeaderMerger.<init>(SamFileHeaderMerger.java:133)
        at net.sf.picard.sam.MergeSamFiles.doWork(MergeSamFiles.java:137)
        at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:175)
        at net.sf.picard.sam.MergeSamFiles.main(MergeSamFiles.java:84) 
    The error seems to be this:

    Exception in thread "main" net.sf.samtools.util.SequenceUtil$SequenceListsDifferException: Sequences at index 1 dont match: 1/243199373/chr2 1/135534747/chr10

    I think the problem lies somewhere with the way bowtie and tophat sort their output by default. Bowtie sorts numerically, chr1, chr2, chr3... etc whereas TopHat sorts chr1, chr10, chr11......chr19, chr2, chr20, chr21 etc. But we have sorted both files again using picard SortSam so we cannot understand why this discrepancy is not corrected. (Same happens with samtools sort).

    Can anyone help us please?

    Many thanks

    Helen
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    Can you double check the embedded SAM text header (@SQ lines) and the BAM binary header (sequence names and lengths) are consistent? I suspect "samtools reheader" doesn't check this.

    [Just a guess - there may well be other reasons why the merge fails]
    Last edited by maubp; 10-10-2011, 01:51 AM.

    Comment

    • hlwright
      Member
      • Feb 2011
      • 30

      #3
      Here are the headers:

      Tophat output before Picard sorting:
      PHP Code:
      @HD    VN:1.0    SO:coordinate
      @SQ    SN:chr1    LN:249250621
      @SQ    SN:chr10    LN:135534747
      @SQ    SN:chr11    LN:135006516
      @SQ    SN:chr12    LN:133851895
      @SQ    SN:chr13    LN:115169878
      @SQ    SN:chr14    LN:107349540
      @SQ    SN:chr15    LN:102531392
      @SQ    SN:chr16    LN:90354753
      @SQ    SN:chr17    LN:81195210
      @SQ    SN:chr18    LN:78077248
      @SQ    SN:chr19    LN:59128983
      @SQ    SN:chr2    LN:243199373
      @SQ    SN:chr20    LN:63025520
      @SQ    SN:chr21    LN:48129895
      @SQ    SN:chr22    LN:51304566
      @SQ    SN:chr3    LN:198022430
      @SQ    SN:chr4    LN:191154276
      @SQ    SN:chr5    LN:180915260
      @SQ    SN:chr6    LN:171115067
      @SQ    SN:chr7    LN:159138663
      @SQ    SN:chr8    LN:146364022
      @SQ    SN:chr9    LN:141213431
      @SQ    SN:chrM    LN:16571
      @SQ    SN:chrX    LN:155270560
      @SQ    SN:chrY    LN:59373566 
      Tophat output after picard sorting:-
      PHP Code:
      @HD    VN:1.0    SO:coordinate
      @SQ    SN:chr1    LN:249250621
      @SQ    SN:chr10    LN:135534747
      @SQ    SN:chr11    LN:135006516
      @SQ    SN:chr12    LN:133851895
      @SQ    SN:chr13    LN:115169878
      @SQ    SN:chr14    LN:107349540
      @SQ    SN:chr15    LN:102531392
      @SQ    SN:chr16    LN:90354753
      @SQ    SN:chr17    LN:81195210
      @SQ    SN:chr18    LN:78077248
      @SQ    SN:chr19    LN:59128983
      @SQ    SN:chr2    LN:243199373
      @SQ    SN:chr20    LN:63025520
      @SQ    SN:chr21    LN:48129895
      @SQ    SN:chr22    LN:51304566
      @SQ    SN:chr3    LN:198022430
      @SQ    SN:chr4    LN:191154276
      @SQ    SN:chr5    LN:180915260
      @SQ    SN:chr6    LN:171115067
      @SQ    SN:chr7    LN:159138663
      @SQ    SN:chr8    LN:146364022
      @SQ    SN:chr9    LN:141213431
      @SQ    SN:chrM    LN:16571
      @SQ    SN:chrX    LN:155270560
      @SQ    SN:chrY    LN:59373566 
      Bowtie output before picard sorting:
      PHP Code:
      @HD    VN:1.0    SO:unsorted
      @SQ    SN:chr1    LN:249250621
      @SQ    SN:chr2    LN:243199373
      @SQ    SN:chr3    LN:198022430
      @SQ    SN:chr4    LN:191154276
      @SQ    SN:chr5    LN:180915260
      @SQ    SN:chr6    LN:171115067
      @SQ    SN:chr7    LN:159138663
      @SQ    SN:chr8    LN:146364022
      @SQ    SN:chr9    LN:141213431
      @SQ    SN:chr10    LN:135534747
      @SQ    SN:chr11    LN:135006516
      @SQ    SN:chr12    LN:133851895
      @SQ    SN:chr13    LN:115169878
      @SQ    SN:chr14    LN:107349540
      @SQ    SN:chr15    LN:102531392
      @SQ    SN:chr16    LN:90354753
      @SQ    SN:chr17    LN:81195210
      @SQ    SN:chr18    LN:78077248
      @SQ    SN:chr19    LN:59128983
      @SQ    SN:chr20    LN:63025520
      @SQ    SN:chr21    LN:48129895
      @SQ    SN:chr22    LN:51304566
      @SQ    SN:chrX    LN:155270560
      @SQ    SN:chrY    LN:59373566
      @SQ    SN:chrM    LN:16571 
      Bowtie output after picard sorting:
      PHP Code:
      @HD    VN:1.0    SO:coordinate
      @SQ    SN:chr1    LN:249250621
      @SQ    SN:chr2    LN:243199373
      @SQ    SN:chr3    LN:198022430
      @SQ    SN:chr4    LN:191154276
      @SQ    SN:chr5    LN:180915260
      @SQ    SN:chr6    LN:171115067
      @SQ    SN:chr7    LN:159138663
      @SQ    SN:chr8    LN:146364022
      @SQ    SN:chr9    LN:141213431
      @SQ    SN:chr10    LN:135534747
      @SQ    SN:chr11    LN:135006516
      @SQ    SN:chr12    LN:133851895
      @SQ    SN:chr13    LN:115169878
      @SQ    SN:chr14    LN:107349540
      @SQ    SN:chr15    LN:102531392
      @SQ    SN:chr16    LN:90354753
      @SQ    SN:chr17    LN:81195210
      @SQ    SN:chr18    LN:78077248
      @SQ    SN:chr19    LN:59128983
      @SQ    SN:chr20    LN:63025520
      @SQ    SN:chr21    LN:48129895
      @SQ    SN:chr22    LN:51304566
      @SQ    SN:chrX    LN:155270560
      @SQ    SN:chrY    LN:59373566
      @SQ    SN:chrM    LN:16571 
      You can see that they are sorted differently and that picard sorting has not corrected this. Why? How can I fix this?

      Comment

      • Chipper
        Senior Member
        • Mar 2008
        • 323

        #4
        Use Picard ReorderSam with the same ref.fa file for both bam files. SortSam / samtools sort just sorts the alignments in the same order as the SQ lines in the header.
        Last edited by Chipper; 10-10-2011, 11:57 AM.

        Comment

        • Jon_Keats
          Senior Member
          • Mar 2010
          • 279

          #5
          Why not just align all reads with tophat? I'd guess the tophat alignment is not working well as all the exon restricted reads are lost in the bowtie alignment. Basically tophat is doing your pipeline as it does what you are doing and leverages the exon reads you are removing by doing bowtie first.

          Comment

          • hlwright
            Member
            • Feb 2011
            • 30

            #6
            Chipper - thank you. I will try your suggestion today.

            Jon - We run Bowtie using best and strata settings, and trim the 3' end of the read, which as far as I am aware you cannot do using TopHat. We have spent considerable time optimising our mapping and find TopHat alone is not enough for our particular project.
            Last edited by hlwright; 10-11-2011, 01:16 AM.

            Comment

            • hlwright
              Member
              • Feb 2011
              • 30

              #7
              Thanks Chipper, your advice has solved the problem.

              For anyone else who may be having the same problem, here is the solution:-

              Create a sequence dictionary for your reference.fa in the same folder:-
              PHP Code:
              java -Xmx2g -jar CreateSequenceDictionary.jar REFERENCE=hg19.fa OUTPUT=hg19.dict 
              Then reorder each of your bam files:-
              PHP Code:
              java -Xmx2g -jar ReorderSam.jar INPUT=bowtie.bam OUTPUT=bowtie_reorder.bam REFERENCE=/path/to/hg19.fa

              java 
              -Xmx2g -jar ReorderSam.jar INPUT=tophat.bam OUTPUT=tophat_reorder.bam REFERENCE=/path/to/hg19.fa 
              Then merge your bam files:
              PHP Code:
              java -Xmx2g -jar MergeSamFiles.jar INPUT=bowtie_reorder.bam INPUT=tophat_reorder.bam OUTPUT=reorder_merge.bam SORT_ORDER=coordinate 

              Once again thank you to everyone who took the time to read and reply to this post. Your help is very much appreciated

              Comment

              Latest Articles

              Collapse

              • SEQadmin2
                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                by SEQadmin2


                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                ...
                06-02-2026, 10:05 AM
              • SEQadmin2
                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                by SEQadmin2


                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                Introduction

                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                05-22-2026, 06:42 AM
              • SEQadmin2
                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                by SEQadmin2

                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                05-06-2026, 09:04 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by SEQadmin2, 06-02-2026, 12:03 PM
              0 responses
              21 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-02-2026, 11:40 AM
              0 responses
              14 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 05-28-2026, 11:40 AM
              0 responses
              29 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 05-26-2026, 10:12 AM
              0 responses
              31 views
              0 reactions
              Last Post SEQadmin2  
              Working...