Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • input BAM files for GATK

    Hello,

    I try to use GATK and I think that my BAM files are not well formated. I read that:

    The file must be binary (.bam).
    The file must be indexed.
    The file must be sorted in coordinate order with respect to the reference (i.e. the contig ordering in your bam must exactly match that of the reference you are using).
    The file must have a proper bam header with read groups. Each read group must contain the platform (PL) and sample (SM) tags. For the platform value, we currently support 454, LS454, Illumina, Solid, ABI_Solid, and CG (all case-insensitive).
    Each read in the file must be associated with exactly one read group.
    I did the three first steps with samtools, but I'm not sure if my file is correctly sorted.
    My main problem is for the latest steps. I don't know how to do and I don't understand the notion of read group.

    Could someone clarify this please?
    Thanks,
    Jane

  • #2
    Hello,

    Which mapper/assembler do you use? BWA for example has the parameter "-r" in the sampe/samse step, which specifies the read group (in the header as well as in the reads).

    Best regards
    Robby

    Comment


    • #3
      If you look at the SAM header in your BAM file, do you have an @RG lines (read groups) as well as one @SQ line per reference sequence?

      Comment


      • #4
        It has been done by a sequencing plateform (allowance), they used CASAVA, I don't know if it was the version 1.7 or 1.8...

        Comment


        • #5
          Originally posted by Robby View Post
          Hello,

          Which mapper/assembler do you use? BWA for example has the parameter "-r" in the sampe/samse step, which specifies the read group (in the header as well as in the reads).

          Best regards
          Robby
          Here is my sam file:
          more s_garma-fibros_converted.sam
          @PG ID:illumina_export2sam.pl VN:2.0.0 CL:/usr/local/bin/illumina_export2sam.pl --read1=s_garma-fibros_1_export.txt
          --read2=s_garma-fibros_2_export.txt
          HWI-ST584_81:4:1101:1198:2065 89 chr7 156837088 28 76M * 0 0 GGCTTGAACAACGGAAATGTGTCAAATGT
          GTCAGCTCCCAGCTCAGAGACTGGGAGACCAGGCCGAGGCGCCGGCN ############################################################################ BC:Z:
          AGTCAA XD:Z:75A SM:i:28 AS:i:0
          HWI-ST584_81:4:1101:1198:2065 165 * 0 0 * chr7 156837088 0 GGAGCCCTGCTGCGTAGTNNNNNNNNNNA
          CACGGTGTATTATTACTTTCCCAGGACCACCGTAACAAAGTAGCACA CCCFFFFFHHHHGJHGIH########################################################## BC:Z:
          AGTCAA
          HWI-ST584_81:4:1101:1174:2078 73 chr5 41048452 254 76M * 0 0 AAACGTGTTTTCCATAGGTCTACCAATTT
          TGGGTGAATTATCTCAGGCAGTATCTTCAAAAGCCCTATTGCACCAG CCCFFFFDHHHHHIIIJJIJJIJJJJJJJJJJJGHGJJJIJIJJJJIJJJIIJJJJJJJIIJGIJJJIIIJJHHHA BC:Z:
          AGTCAA XD:Z:76 SM:i:359 AS:i:0
          HWI-ST584_81:4:1101:1174:2078 133 * 0 0 * chr5 41048452 0 GCCCCTGAAATTGATGAGNNNNNNNNNNT
          CTATCCTTCAGGTAATATCTATGCCTGCCAGTTTAGGGGAGTTACAT @@CFFFFFHGHFHIJJFG########################################################## BC:Z:
          AGTCAA
          HWI-ST584_81:4:1101:1242:2096 73 chr9 135156764 254 76M * 0 0 CCAATCCAAGAAAGATGTCTCTCCCTCCT
          GAAAACAAAAATTTTAAAAAGCCCCTTCCATTTTAAAGCAATCTGAA ;<<:BDDD<D;FFFABFFCFEHFFFIFIIIIIIII=BGIIIFEFC>G@GDECD;BFF@FFEFCFDAEFEFCEEFA? BC:Z:
          AGTCAA XD:Z:76 SM:i:359 AS:i:0
          HWI-ST584_81:4:1101:1242:2096 133 * 0 0 * chr9 135156764 0 ACTGCCTCTCTCTTCTGTNTCNNNNNNNG
          GCTGGACAGTCTTGTGAAATTGAGACTCTTACTCCACTCATCCATCC ;?@DHHDHDEBE<G########################################################## BC:Z:
          AGTCAA
          And my bam file is not readable:

          more s_garma-fibros_converted_sorted.bam
           7�,7�cr"�1e��9���1X3�8�
          ��5�PCY�N(ǒ���,6��^L�%D�<CN8ψ�{�/vϘ!,�.�g*–��284��yf^L�Xa<s�S���x^L���<K��6m�P��C�ƃp�!Cz�=&ψ��3�
          ^L
          la��d����
          --Plus--(0%)
          Thanks for your answers !

          Comment


          • #6
            you can read BAM files using samtools view <file.bam>. Pipe that through more or less to reduce information overload.

            Comment


            • #7
              Originally posted by gringer View Post
              you can read BAM files using samtools view <file.bam>. Pipe that through more or less to reduce information overload.
              Thank you.
              So my bam file looks like this:

              HWI-ST584_81:4:2107:8344:147434 147 chr1 13474 26 76M = 13404 -146 TCCTGACAGGCAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGGAGACGGTGTTTGTCAT #A@CA@@38(CCACC@?>AA?B?BA=A<BA;ED@=GACGCCEFB;???1?)0081@>HF;GFDFHHDFBDFFF@@@ BC:Z:AGTCAA XD:Z:76 SM:i:3 AS:i:26
              HWI-ST584_81:4:1204:11080:95072 147 chr1 13480 26 76M = 13418 -138 CAGGCAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGGAGACGGTGTTTGTCATGGGCCT >CACDACCAC?>CCABBDDB?;8>DFED=;:7D===8DGCIGCJIJGIIGGIHIIDHE=BGFGGHHHHFFFFFC@@ BC:Z:AGTCAA XD:Z:76 SM:i:3 AS:i:26
              There is no @RG lines and @SQ line...

              Comment


              • #8
                There is no @RG lines and @SQ line...
                *cough*

                Sorry, I should have mentioned that earlier. If you want to show the header information, you need to put a -h in there as well.

                Code:
                samtools view [b]-h[/b] <file.bam>
                I recommend looking at the samtools documentation and the SAM specification:

                Comment


                • #9
                  You definitely need @RG records in your file.

                  To check this do
                  Code:
                  samtools view -H file.bam | grep ^@RG
                  This will probably yield no results.

                  Use Picard to add a readgroup to your file e.g.
                  Code:
                  java -jar $PICARD/AddOrReplaceReadGroups I=file.bam O=newfile.bam \
                  SORT_ORDER=coordinate CREATE_INDEX=true  \
                  RGPL=illumina RGID=11 RGSM=mysample
                  After this is done then run GATK on newfile.bam

                  Replace the RGID/RGSM with the values you desire

                  Comment


                  • #10
                    Originally posted by zee View Post
                    You definitely need @RG records in your file.

                    To check this do
                    Code:
                    samtools view -H file.bam | grep ^@RG
                    This will probably yield no results.



                    Use Picard to add a readgroup to your file e.g.
                    Code:
                    java -jar $PICARD/AddOrReplaceReadGroups I=file.bam O=newfile.bam \
                    SORT_ORDER=coordinate CREATE_INDEX=true  \
                    RGPL=illumina RGID=11 RGSM=mysample
                    After this is done then run GATK on newfile.bam

                    Replace the RGID/RGSM with the values you desire
                    Thanks for your help!

                    Exactly, I've got no result:
                    [merlevede@U1009-PCJane patient1]$ samtools view -H s_garma-fibros_converted_sorted.bam | grep ^@RG
                    [merlevede@U1009-PCJane patient1]$
                    Is it easier to use Picard to add the header or to rerun the alignment, with BWA this time and with the right options?

                    Comment


                    • #11
                      Using Picard is easy enough, the command that zee posted should solve your problems.

                      Comment


                      • #12
                        Thanks a lot for your help.
                        I have installed picard and I tried to run it:

                        java -jar ./AddOrReplaceReadGroups.jar I=~/../../../data/patient1/s_garma-fibros_converted_sorted.bam O=~/../../../data/patient1/picard_s_garma-fibros_converted_sorted.bam \
                        > SORT_ORDER=coordinate CREATE_INDEX=true \
                        > RGPL=illumina RGID=11 RGSM=garma-fibros
                        ERROR: Option 'RGLB' is required.
                        but I need one more option: RGLB which is defined as:
                        RGLB=String
                        LB=String Read Group Library Required.
                        I found this discussion http://seqanswers.com/forums/showthread.php?t=11887 and RGLB seems to require a fastq file... So, will Picard more or less realigned my alignment?
                        I will try to add my fastq file

                        Comment


                        • #13
                          You're just telling Picard to creating a missing information section at the start of the file and add a small label to the associated reads, it won't do anything else. The UnifiedGenotyper error message you posted at the start of the thread indicated that it doesn't even use the RGLB field, so you can probably enter anything you want in there. You could probably enter gibberish and not have it matter for this purpose.

                          Comment


                          • #14
                            Originally posted by dpryan View Post
                            You're just telling Picard to creating a missing information section at the start of the file and add a small label to the associated reads, it won't do anything else. The UnifiedGenotyper error message you posted at the start of the thread indicated that it doesn't even use the RGLB field, so you can probably enter anything you want in there. You could probably enter gibberish and not have it matter for this purpose.
                            OK, then I tried to fill the option with "gibberish", it ran for a while and gave:
                            java -jar ./AddOrReplaceReadGroups.jar I=~/../../../data/patient1/s_garma-fibros_converted_sorted.bam O=~/../../../data/patient1/picard_s_garma-fibros_converted_sorted.bam \
                            > SORT_ORDER=coordinate CREATE_INDEX=true \
                            > RGPL=illumina RGID=garma-fibros RGSM=garma RGLB=toto RGPU=tata
                            [Wed Jan 18 13:53:51 CET 2012] net.sf.picard.sam.AddOrReplaceReadGroups INPUT=/home/merlevede/../../../data/patient1/s_garma-fibros_converted_sorted.bam OUTPUT=/home/merlevede/../../../data/patient1/picard_s_garma-fibros_converted_sorted.bam SORT_ORDER=coordinate RGID=garma-fibros RGLB=toto RGPL=illumina RGPU=tata RGSM=garma CREATE_INDEX=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false
                            [Wed Jan 18 13:53:51 CET 2012] Executing as merlevede@U1009-PCJane on Linux 3.1.6-1.fc16.x86_64 amd64; OpenJDK 64-Bit Server VM 1.6.0_22-b22; Picard version: 1.60(1086)
                            INFO 2012-01-18 13:53:51 AddOrReplaceReadGroups Created read group ID=garma-fibros PL=illumina LB=toto SM=garma


                            [Wed Jan 18 14:08:08 CET 2012] net.sf.picard.sam.AddOrReplaceReadGroups done. Elapsed time: 14,29 minutes.
                            Runtime.totalMemory()=2009399296
                            Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Read name HWI-ST584_81:4:1106:7158:91967, CIGAR M operator maps off end of reference
                            at net.sf.samtools.SAMUtils.processValidationErrors(SAMUtils.java:448)
                            at net.sf.samtools.BAMRecord.getCigar(BAMRecord.java:247)
                            at net.sf.samtools.BAMRecordCodec.encode(BAMRecordCodec.java:136)
                            at net.sf.samtools.BAMRecordCodec.encode(BAMRecordCodec.java:37)
                            at net.sf.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:210)
                            at net.sf.samtools.util.SortingCollection.add(SortingCollection.java:150)
                            at net.sf.samtools.SAMFileWriterImpl.addAlignment(SAMFileWriterImpl.java:170)
                            at net.sf.picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:93)
                            at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
                            at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:119)
                            at net.sf.picard.sam.AddOrReplaceReadGroups.main(AddOrReplaceReadGroups.java:61)
                            [merlevede@U1009-PCJane picard-tools-1.60]$
                            I needed to add also the option RGPU.
                            The bam file that I got is of course empty

                            Comment


                            • #15
                              From the Picard FAQ:
                              Q: A Picard program complains that CIGAR M operator maps off the end of reference. I want this record to be treated as valid despite the fact that the alignment end is greater than the length of the reference sequence.

                              A: Picard validation errors may be turned into warnings by passing the command line argument VALIDATION_STRINGENCY=LENIENT. Picard validation messages may be suppressed completely with VALIDATION_STRINGENCY=SILENT. Another option is to use CleanSam to soft-clip these reads so they don't map off the end of the reference.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X