Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Add header to SAM file

    I have the following problem. I am trying to analyze bam files using GATK. The bam files I have do not have a header so this causes an error with GATK. I can add the header manually but I do not know what the read groups in the alignment section of the SAM file are to include in the header.

    Is there any way to extract fields from the alignment section of the SAM file? I could then extract the RGIDs from the alignments and get the unique values so I could include them in the header.

    Are there any other solutions to this problem that I am not thinking of? I can always go brute force and add RG lines to the header until I get no errors.

  • #2
    If your RG tags are all in the same column, just do:

    Code:
    cut -f# file.sam | sort | uniq
    Replace # with the column number containing the RG tag.

    If they aren't in the same column, you could do this:

    Code:
    tr '\t' '\n' < file.sam | grep RG: | sort | uniq

    Comment


    • #3
      I solved this problem on my own. I had a small number of RG tags in each BAM file. I used the ValidateSamFile tool from Picard to get the different RG tags in the alignment section. Then I manually wrote the header files and added the new header to the existing bam file using the samtools reheader tool. To make sure the header is correctly written, I validated the file again. Then I indexed the new bam file.

      Comment


      • #4
        I think you will want to look at picard tools (http://picard.sourceforge.net/comman...laceReadGroups) to fix this problem.

        Below I've listed an example of how I've used picard to fix headers in my bam files. You'll want to look at the picard website for additional details.

        java -jar -Xmx10g /path/to/picard-tools-1.54/AddOrReplaceReadGroups.jar INPUT=sample.bam OUTPUT=sample_with_header.bam SORT_ORDER=coordinate RGID=sample RGLB=sample RGPL=illumina RGSM=sample RGPU=name CREATE_INDEX=true

        Comment


        • #5
          Also, if you want to loop through many bam files which are in the same directory you could use this command:

          for i in *bam
          do
          java -jar -Xmx10g /path/to/picard-tools-1.54/AddOrReplaceReadGroups.jar INPUT=$i OUTPUT=/path/to/a/different/directory/$i SORT_ORDER=coordinate RGID=$i RGLB=$i RGPL=illumina RGSM=$i RGPU=name CREATE_INDEX=true
          done

          Comment


          • #6
            Interesting. I tried using the AddOrReplaceReadGroups function in picard and it gave me an error saying it could not find the RGID in the header. Perhaps I did not include all the arguments I needed.

            The call I used was

            java -jar AddOrReplaceReadGroups.jar INPUT=input OUTPUT=output SORT_ORDER=coordinate RGID=X10010 RGLB=LIB-X10010 RGPL=solid RGPU=slide-X10010 RGSM=X10010

            Maybe the create_index argument is needed?

            Comment


            • #7
              Can you share your header and the results of the command I suggested?

              Comment


              • #8
                Originally posted by gormleymp View Post
                I solved this problem on my own. I had a small number of RG tags in each BAM file. I used the ValidateSamFile tool from Picard to get the different RG tags in the alignment section. Then I manually wrote the header files and added the new header to the existing bam file using the samtools reheader tool. To make sure the header is correctly written, I validated the file again. Then I indexed the new bam file.
                Please how did you write the header and can you share your example of how you solved this problem step by step because i have the same problem.

                this is the report of i got when i try to validate so what are the RG tags here and how can i write it manually?
                bilbo@ubuntu:/media/ java -jar ValidateSamFile.jar I=outone.bam O=Validate_report.txt

                ERROR: Read groups is empty
                ERROR: Record 105, Read name HWI-ST365_0157:7:1206:13997:190087#GCGGGC, Mate negative strand flag does not match read negative strand flag of mate
                ERROR: Record 212, Read name HWI-ST365_0157:7:2203:9846:109491#GCGGTC, Mate negative strand flag does not match read negative strand flag of mate
                ERROR: Record 349, Read name HWI-ST365_0157:7:1103:15462:80840#GCGGTC, Mate negative strand flag does not match read negative strand flag of mate
                ERROR: Record 748, Read name HWI-ST365_0157:7:2104:13313:62994#GCGGTC, Mate negative strand flag does not match read negative strand flag of mate
                ERROR: Record 807, Read name HWI-ST365_0157:7:2106:17824:99461#GCGGCC, Mate negative strand flag does not match read negative strand flag of mate
                ERROR: Record 996, Read name HWI-ST365_0157:7:1102:4610:63216#GCGGTC, Mate negative strand flag does not match read negative strand flag of mate
                ERROR: Record 998, Read name HWI-ST365_0157:7:2101:14161:147407#GCGGTC, Mate negative strand flag does not match read negative strand flag of mate
                ERROR: Record 1000, Read name HWI-ST365_0157:7:2104:11372:153164#GCGGTC, Mate negative strand flag does not match read negative strand flag of

                Comment


                • #9
                  HI
                  I hope I can get an answer to one question I have. I am sure it's a real newb question but I'm a real newb
                  When you use Picard AddOrReplaceReadGroups I don't really understand where you get the inputs for RGID, RGLB, RGPL, RGPU & RGSM
                  Can anyone help em with this?
                  Thanks

                  Comment


                  • #10
                    Originally posted by plumb_r View Post
                    HI
                    I hope I can get an answer to one question I have. I am sure it's a real newb question but I'm a real newb
                    When you use Picard AddOrReplaceReadGroups I don't really understand where you get the inputs for RGID, RGLB, RGPL, RGPU & RGSM
                    Can anyone help em with this?
                    Thanks
                    You define them yourself based on your experiment. They really are just a way to associate some information about the sample in your sam/bam file.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM
                    • seqadmin
                      Techniques and Challenges in Conservation Genomics
                      by seqadmin



                      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                      Avian Conservation
                      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                      03-08-2024, 10:41 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 06:37 PM
                    0 responses
                    11 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, Yesterday, 06:07 PM
                    0 responses
                    10 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-22-2024, 10:03 AM
                    0 responses
                    51 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-21-2024, 07:32 AM
                    0 responses
                    68 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X