Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Tophat: merging accepted_hits.bam and unmapped.bam

    Hi everyone,

    I would like to merge the Tophat output files accepted_hits.bam and unmapped.bam into one file for subsequent analysis. The files contain paired-end Illumina reads.

    I used a naive approach of just calling samtools merge:

    Code:
    samtools merge merged.bam accepted_hits.bam unmapped.bam
    Using the resulting merged.bam file with any Picard tool results in an error however:

    Code:
    INFO	2013-03-07 10:29:53	AddOrReplaceReadGroups	Processed    39,000,000
    records.  Elapsed time: 00:11:07s.  Time for last 1,000,000:   16s.
    Last read position: chrX:153,627,857
    [Thu Mar 07 10:29:55 CET 2013] net.sf.picard.sam.AddOrReplaceReadGroups
    done. Elapsed time: 11.16 minutes.
    Runtime.totalMemory()=1110376448
    FAQ:  http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page
    Exception in thread "main" net.sf.samtools.SAMFormatException: SAM
    validation error: ERROR: Record 39136746, Read name
    HWI-ST587_0094:4:1101:7158:1939.ATCACGA/1, Mapped mate should have mate
    reference name
    	at net.sf.samtools.SAMUtils.processValidationErrors(SAMUtils.java:448)
    	at
    net.sf.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:541)
    	at
    net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:522)
    	at
    net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:481)
    	at
    net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:672)
    	at
    net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:650)
    	at
    net.sf.picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:98)
    	at
    net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
    	at
    net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:119)
    	at
    net.sf.picard.sam.AddOrReplaceReadGroups.main(AddOrReplaceReadGroups.java:66)
    The files were created with a Tophat version prior to 2.0.7, so the unmapped reads still have /1 and /2 suffixes. However, removing the suffixes before calling samtools merge results in the same error down the line.

    The reads picard complains about look like this:

    Code:
    samtools view merged.bam | grep "HWI-ST587_0094:4:1101:7158:1939.ATCACGA"
    HWI-ST587_0094:4:1101:7158:1939.ATCACGA	1177	chr3	129324900
    50	38M	*	0	0	CTAGCCCCACGGTGGACGCGTTCGGGTGGTTGGCCGCC
    FFFFHJJJHJJJJJJJJJJJJJJJJHHHHHFFFFFCCC	AS:i:0	XN:i:0	XM:i:0	XO:i:0
    XG:i:0	NM:i:0	MD:Z:38	YT:Z:UU	XS:A:-	NH:i:1
    HWI-ST587_0094:4:1101:7158:1939.ATCACGA/1	69	*	0	255	*	*	0	0
    TGNGTGTTCTCGAAGCGGTGGTCCTCCAGGCTGCGGTTGCGCGGGAAGAAGGNGCTGCCGTAACCGGTGTACGTGNCGCCCACGAGCAGGCGGCTGCCCC
    CC!4ADDFHHHHHJJIJJGHJHIJJJJJJJJJJJJJFHIJJIJGDBDDDDDD!,8?BDDDDDDDDDD>B>CDDCC!+8?BBBDDDDDDDDDDDB)&(28<
    Software versions used:
    Tophat: 2.0.6
    Picard: 1.85
    samtools: 0.1.18

    Has anyone done this successfully?

    Thanks a lot,

    Chris
    Last edited by offspring; 03-07-2013, 04:55 AM.

  • #2
    I have modified the unmapped reads as follows now:

    - Removed the /1 and /2 suffixes
    - Changes MAPQ from 255 to 0
    - For unmapped reads with a mapped mate, set RNAME and POS to the same value as the mapped mate

    However, even with these changes I get the "Mapped mate should have mate reference name" error from Picard.

    Here is read pair it stumbles upon now:

    Code:
    HWI-ST587_0094:4:1101:7158:1939.ATCACGA	69	chr3	129324900	0	*	*	0	0	TGNGTGTTCTCGAAGCGGTGGTCCTCCAGGCTGCGGTTGCGCGGGAAGAAGGNGCTGCCGTAACCGGTGTACGTGNCGCCCACGAGCAGGCGGCTGCCCCCC!4ADDFHHHHHJJIJJGHJHIJJJJJJJJJJJJJFHIJJIJGDBDDDDDD!,8?BDDDDDDDDDD>B>CDDCC!+8?BBBDDDDDDDDDDDB)&(28<
    HWI-ST587_0094:4:1101:7158:1939.ATCACGA	1177	chr3	129324900	50	38M	*	0	0	CTAGCCCCACGGTGGACGCGTTCGGGTGGTTGGCCGCC	FFFFHJJJHJJJJJJJJJJJJJJJJHHHHHFFFFFCCC	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:38	YT:Z:UU	XS:A:-	NH:i:1
    Any clue would be appreciated!

    Chris
    Last edited by offspring; 03-08-2013, 01:31 AM.

    Comment


    • #3
      Originally posted by offspring View Post
      I have modified the unmapped reads as follows now:

      - Removed the /1 and /2 suffixes
      - Changes MAPQ from 255 to 0
      - For unmapped reads with a mapped mate, set RNAME and POS to the same value as the mapped mate

      However, even with these changes I get the "Mapped mate should have mate reference name" error from Picard.

      Here is read pair it stumbles upon now:

      Code:
      HWI-ST587_0094:4:1101:7158:1939.ATCACGA	69	chr3	129324900	0	*	*	0	0	TGNGTGTTCTCGAAGCGGTGGTCCTCCAGGCTGCGGTTGCGCGGGAAGAAGGNGCTGCCGTAACCGGTGTACGTGNCGCCCACGAGCAGGCGGCTGCCCCCC!4ADDFHHHHHJJIJJGHJHIJJJJJJJJJJJJJFHIJJIJGDBDDDDDD!,8?BDDDDDDDDDD>B>CDDCC!+8?BBBDDDDDDDDDDDB)&(28<
      HWI-ST587_0094:4:1101:7158:1939.ATCACGA	1177	chr3	129324900	50	38M	*	0	0	CTAGCCCCACGGTGGACGCGTTCGGGTGGTTGGCCGCC	FFFFHJJJHJJJJJJJJJJJJJJJJHHHHHFFFFFCCC	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:38	YT:Z:UU	XS:A:-	NH:i:1
      Flag 69 means read unmapped and mate mapped, so Picard expects a chromosome name assigned to the mate of the unmapped read. Since the read flagged 69 has * for the chromosome of the mate, Picard complains.

      Probably this inconsistency has nothing to do with the quality of your data (unless it happens quite a lot) and I think it is just a small bug in the way Tophat sets flags. So the easy way around it is to run picard with option VALIDATION_STRINGENCY=LENIENT so that it will report the errors but it will ignore them.

      Hope this helps
      Dario

      Comment


      • #4
        Originally posted by dariober View Post
        Flag 69 means read unmapped and mate mapped, so Picard expects a chromosome name assigned to the mate of the unmapped read. Since the read flagged 69 has * for the chromosome of the mate, Picard complains.

        Probably this inconsistency has nothing to do with the quality of your data (unless it happens quite a lot) and I think it is just a small bug in the way Tophat sets flags. So the easy way around it is to run picard with option VALIDATION_STRINGENCY=LENIENT so that it will report the errors but it will ignore them.

        Hope this helps
        Dario
        Hi Dario,

        thanks for the input. VALIDATION_STRINGENCY=LENIENT was not really an option for me, since that would defeat the purpose of merging the two files in the first place (it would throw out all unmapped reads with a mapped mate in this case).

        However I finally got it to work, part of the problem is a bug in Tophat (already reported to the Tophat team). Here is how I had to modify unmapped.bam to make it work:

        1. Remove /1 and /2 suffixes from all reads (not necessary with Tophat version 2.0.7 and newer)
        2. Change MAPQ from 255 to 0 for all reads
        3. For unmapped reads with a mapped mate:
        - set RNAME and RNEXT of unmapped read to RNAME of mapped mate
        - set POS to POS of mapped mate
        - set PNEXT to 0
        4. For paired-end reads where both ends are unmapped, Tophat does not set the "mate is unmapped" (0x8) SAM flag on either read. This has to be fixed manually.

        After making these changes, the merged file of accepted_hits.bam and unmapped.bam could be processed by Picard tools.

        Chris

        Comment


        • #5
          In case it's useful for anyone else, here's the script I use to fix the unmapped reads:

          https://github.com/cbrueffer/misc_bi...apped_reads.py

          Chris

          Comment


          • #6
            Dear Chris,

            I was attempting to merge unmapped + mapped tophat output as well.

            Very excited to see your script! but somehow it returns an error when I tried running it?

            Code:
            [B]$ ~/bin/bin/python2.7 fix_unmapped.py outdir/[/B]
            Traceback (most recent call last):
              File "fix_unmapped.py", line 104, in <module>
                main(sys.argv[1])
              File "fix_unmapped.py", line 53, in main
                if read.qname.index("/") != -1:
            ValueError: substring not found
            Not sure if this could be the reason - I am using ssh to log into a linux server, and if i type import os, or any of the other import, the error below will show up.

            Code:
            import: unable to open X server `'.

            Comment


            • #7
              Hi,

              good to see it's useful to others. The problem is an error in the script, either replace the "index" with "find" or use the updated version from github.

              The second issue seems to be that you're typing "import os" in the shell instead of in Python. The import you're executing right now is part of the ImageMagick package.

              Chris

              Originally posted by zaki View Post
              Dear Chris,

              I was attempting to merge unmapped + mapped tophat output as well.

              Very excited to see your script! but somehow it returns an error when I tried running it?

              Code:
              [B]$ ~/bin/bin/python2.7 fix_unmapped.py outdir/[/B]
              Traceback (most recent call last):
                File "fix_unmapped.py", line 104, in <module>
                  main(sys.argv[1])
                File "fix_unmapped.py", line 53, in main
                  if read.qname.index("/") != -1:
              ValueError: substring not found
              Not sure if this could be the reason - I am using ssh to log into a linux server, and if i type import os, or any of the other import, the error below will show up.

              Code:
              import: unable to open X server `'.

              Comment


              • #8
                Thanks Chris!

                Your script worked wonderfully!!

                I am assuming you then proceed to merge the unmapped_fixed.bam & mapped.bam using samtools merge?

                Wondering if you have encountered picard MergeBAMAliginment tool?

                Thanks to your lovely script I managed to merge using samtools and proceed using picard tools. However I think MergeBAMAlignment returns an error if I attempt to merge unmapped.bam + accepted_hits.bam... and also returns the same error if using unmapped_fixed.bam + accepted_hits.bam

                Exception in thread "main" net.sf.picard.PicardException: Program Record ID already in use in unmapped BAM file.
                Would you have any idea why this might be?

                Comment


                • #9
                  Good to hear! I use "samtools merge" afterwards, indeed. I've never used MergeBamAlignment, so I'm not sure what the error means. It would be interesting to find out though, I'll test it with one of our datafiles in the next couple of days.

                  If you figure out what the problem is in the meantime, please let me know.

                  Chris

                  Comment


                  • #10
                    Originally posted by offspring View Post
                    Good to hear! I use "samtools merge" afterwards, indeed. I've never used MergeBamAlignment, so I'm not sure what the error means. It would be interesting to find out though, I'll test it with one of our datafiles in the next couple of days.

                    If you figure out what the problem is in the meantime, please let me know.

                    Chris
                    Hi Chris,
                    I processed the tophat output with your instructions. I merged the accepted_hits.bam and unmapped_fixup.bam with "samtools merge", then I gave the merged bam file to "AddOrReplaceReadGroups.jar" to replace the header. But I received the error like follow:
                    Code:
                    INFO	2013-05-06 12:31:53	AddOrReplaceReadGroups	Processed    11,000,000 records.  Elapsed time: 00:03:33s.  Time for last 1,000,000:   19s.  Last read position: chr11:56,940,830
                    INFO	2013-05-06 12:32:13	AddOrReplaceReadGroups	Processed    12,000,000 records.  Elapsed time: 00:03:53s.  Time for last 1,000,000:   19s.  Last read position: chr11:64,878,216
                    [Mon May 06 12:32:14 CST 2013] net.sf.picard.sam.AddOrReplaceReadGroups done. Elapsed time: 3.93 minutes.
                    Runtime.totalMemory()=813891584
                    To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
                    Exception in thread "main" java.lang.NullPointerException
                    	at net.sf.samtools.SAMRecord.isValid(SAMRecord.java:1646)
                    	at net.sf.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:540)
                    	at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:522)
                    	at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:481)
                    	at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:672)
                    	at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:650)
                    	at net.sf.picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:98)
                    	at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
                    	at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:119)
                    	at net.sf.picard.sam.AddOrReplaceReadGroups.main(AddOrReplaceReadGroups.java:66)
                    Is this pipeline right? what should I do with this error?
                    Thank you for your help.
                    Vivienne
                    Last edited by vivienne_lovely; 05-06-2013, 08:30 PM.

                    Comment


                    • #11
                      Hi Vivienne,

                      try running ReorderSam on your merged file (same as with accepted_hits.bam before).

                      Chris

                      Comment


                      • #12
                        Hi Chris,
                        I tried the ReorderSam just now, but the mistake is the same as before.
                        Vivienne
                        Originally posted by offspring View Post
                        Hi Vivienne,

                        try running ReorderSam on your merged file (same as with accepted_hits.bam before).

                        Chris

                        Comment


                        • #13
                          Originally posted by vivienne_lovely View Post
                          Hi Chris,
                          I tried the ReorderSam just now, but the mistake is the same as before.
                          Vivienne
                          I can't think of anything else that might cause this at the moment... I have used this workflow on more than 200 tophat runs without issues.

                          To be clear, I use the following:

                          1. fix up unmapped.bam
                          2. merge accepted_hits.bam and unmapped_fixup.bam (samtools merge)
                          3. reorder merged file (Picard ReorderSam)
                          4. add RG header to merged file (Picard AddOrReplaceReadGroups)
                          5. create SAM index for merged file (samtools index)
                          6. run RNA-SeQC

                          Steps 1-3 can be varied, depending on the original sorting of the files. I don't think the order you have done things in should make a difference.

                          Chris

                          Comment


                          • #14
                            Originally posted by offspring View Post
                            4. For paired-end reads where both ends are unmapped, Tophat does not set the "mate is unmapped" (0x8) SAM flag on either read. This has to be fixed manually.
                            What is a good way to fix this manually?

                            How come this step was not included in the linked fix_tophat_unmapped_reads.py script? Is there some caveat?

                            Comment


                            • #15
                              Originally posted by id0 View Post
                              What is a good way to fix this manually?

                              How come this step was not included in the linked fix_tophat_unmapped_reads.py script? Is there some caveat?
                              It's included, lines 85-89 in the current version. However, the bug has been fixed in TopHat 2.0.9.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              51 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X