
  • Samtools....SAM to BAM...warning or error!! Can I ignore?

    Hello All

    I am trying to convert a TopHat SAM file to BAM and I get the warning message "The tag [ID] required for [PG] not present." I have included the output below. After the warning, [sam_header_read2] reports that the sequences are loaded and the program runs to completion.

    Code:
    ./../../samtools-0.1.7a/samtools import ../../../ZmB73_AGPv1_genome.fasta.fai accepted_hits.sam accepted_hits_complete.bam
    The tag [ID] required for [PG] not present.
    [sam_header_read2] 11 sequences loaded
    .

    fyi...the first ten lines of the sam file:
    Code:
    @HD	VN:1.0	SO:sorted
    @PG	TopHat	VN:1.0.13	CL:/share/apps/tophat-1.0.13/bin/tophat -o ./tophat_out_s8 -p 4 --solexa1.3-quals ./maize_genome_bwtind/ZmB73_AGPv1_genome_ind s_8_sequence.txt
    HWI-EAS313:8:102:1328:1633#0	16	chr1	421	3	42M	*	GATTTCCAGTACAGTCCTCGCTATTGCTGTGAAAAGTTGGCC	@A??=A?B@@?BB@ABA>BA@BBB@:BABBBBBBB@BCBABB	NM:i:0
    HWI-EAS313:8:91:606:9#0	0	chr1	449	255	42M	*	0	GTGAAAAGTTGGCCTCATATTCTTGGCTCCTCTTCAAAAAGA	B@B@BB@B?@A@ABB@BBAB?@>=>??;<<=:?:07:4?8+<	NM:i:0
    HWI-EAS313:8:65:992:914#0	16	chr1	520	3	42M	*	TCTGGGCATCAGTAAAAAAATGGTGGTTCCAGTCATTACATC	A?@5;@=?@>@>ABABBBBBBA>A:>BA@?AABBBBBACC@;	NM:i:0
    HWI-EAS313:8:10:620:686#0	0	chr1	559	3	42M	*	ATCAAGTCCACAGTTATTACTGAGAAAACCTGATCAGTTTAT	BB?BBAAB6A@BC;AABBB@B@BA@@@AB@AB>A>A@;AAAB	NM:i:0
    HWI-EAS313:8:27:1782:1073#0	16	chr1	578	255	27M411N15M	CTGAGAAAACCTGATCAGTTTATGCAGAATGTTTTGTTTTTC	AA=@8@BB<A=A<?ABB@B@BBB=AA6BBB?BBBBBBBBBBB	NM:i:0	XS:A:+	NS:i:0
    HWI-EAS313:8:43:783:738#0	0	chr1	1052	255	42M	*	AAAACAACAGGAAAAATTCTGTGTCGTTCGCCTGAAATATTT	:ABCBACBBCCBBBB?CB=A<>>ABB@B@@<BBB?BBBBBBB	NM:i:0
    HWI-EAS313:8:78:753:514#0	0	chr1	1052	255	42M	*	AAAACAACAGGAAAAATTCTGTGTCGTTCGCCTGAAATATTT	AAA=CCAABACBC?BCA9<@@4B=<BB??C=?ABBBCBBB>7	NM:i:0
    HWI-EAS313:8:17:949:19#0	0	chr1	1057	255	42M	*	AACAGGAAAAATTCTGTGTCGTTCGCCTGAAATATTTGCTTC	?CBCBCB=BBBCCBBB3?@@AB?AC8ABBBBBB@B@@@/B@9	NM:i:0
    I get a BAM output and would like to know whether it is a faithful binary representation of the original SAM file.

    Any thoughts or suggestions?
    Siva

  • #2
    Yes, you can ignore that warning.

    I'm not sure why the ID tag is missing... it should be

    ID:Tophat
    SpliceMap: De novo detection of splice junctions from RNA-seq
    Download SpliceMap Comment here



    • #3
      As a followup: I work with maize which has 10 chromosomes. So when it says "11 sequences loaded" does it refer to chromosomes 1 through 10 and chr Unknown? There is a chunk of sequences in the maize genome that has not been assigned to any chromosome.

      Siva
      Last edited by Siva; 05-24-2010, 12:09 AM.



      • #4
        I am only familiar with the SAM format... not too familiar with the SAM to BAM conversion at this stage. So, I can't help you there, in case I give the wrong information.

        But your reasoning sounds reasonable, since I believe the unmapped reads are denoted as a "*".


        • #5
          Originally posted by john_mu
          Yes, you can ignore that warning.

          I'm not sure why the ID tag is missing... it should be

          ID:Tophat
          Thanks John, I too don't know why the tag ID is missing.



          • #6
            The missing ID: tag is a small bug in tophat. Mostly harmless, but it will prevent processing the SAM file with picard, even with VALIDATION_STRINGENCY=SILENT, or other tools that expect headers to be syntactically correct. It's trivial to fix in tophat.py (line 1006 in 1.0.13).

            Code:
            ./../../samtools-0.1.7a/samtools import ../../../ZmB73_AGPv1_genome.fasta.fai accepted_hits.sam accepted_hits_complete.bam
            The tag [ID] required for [PG] not present.
            [sam_header_read2] 11 sequences loaded
            The 11 here comes from the number of lines in your ZmB73_AGPv1_genome.fasta.fai file. This won't include the "*" denoting unmapped reads; rather, as you surmised, your reference presumably contains an extra "chromosome" labelled Unknown or so, containing those unassigned sequences.

            -- John
            Last edited by jmarshall; 05-24-2010, 12:42 AM. Reason: clarified
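To check this against your own index, here is a minimal shell sketch. It assumes the usual faidx layout of one tab-separated line per reference sequence, with the sequence name in the first column. The function names are made up for illustration.

```shell
# Count the reference sequences described by a .fai index (one line each).
count_fai_sequences() {
  awk 'END { print NR }' "$1"
}

# Print only the sequence names (first tab-separated column).
list_fai_sequences() {
  cut -f1 "$1"
}

# Example, using the index path from the command above:
# count_fai_sequences ../../../ZmB73_AGPv1_genome.fasta.fai
# list_fai_sequences ../../../ZmB73_AGPv1_genome.fasta.fai
```

If the count is 11 and the name list shows ten chromosomes plus one extra entry, that extra entry is most likely the container for the unassigned sequence.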



            • #7
              Hi, I am having a similar problem using Picard to parse a BAM file from TopHat. Is there any way around it?

              I get an error similar to this one when I try to validate the BAM file:
              Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing SAM header. Problem parsing @PG key:value pair. Line:
              @PG TopHat VN:1.0.13
              Last edited by xue.vin; 06-04-2010, 08:44 AM.
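One way to spot this before handing a file to Picard is to scan the header for @PG lines that lack an ID: field. The sketch below is a rough illustration that reads SAM text on standard input and assumes tab-separated header fields. The function name is hypothetical.

```shell
# Print every @PG header line that has no ID: field.
# Reads SAM text on stdin and stops at the first non-header line.
find_pg_without_id() {
  awk -F '\t' '
    !/^@/ { exit }    # header is over at the first alignment line
    /^@PG/ {
      has_id = 0
      for (i = 2; i <= NF; i++) if ($i ~ /^ID:/) has_id = 1
      if (!has_id) print
    }'
}

# Example:
# find_pg_without_id < accepted_hits.sam
```

Empty output means every @PG line carries an ID. For a BAM file, the header text would first have to be extracted, for instance with `samtools view -H`, assuming your samtools build supports that option.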



              • #8
                xue.vin,
                The way to solve the problem with parsing the BAM in Picard would be to edit the SAM file, changing the line
                @PG TopHat VN:1.0.13
                to
                @PG ID:TopHat VN:1.0.13

                This can be done with a text editor, with sed/awk/perl, or by editing the TopHat source code to include the ID: on the @PG line. The line is at the top of the file. The best way to do this will depend on what system you're using, how big the SAM file is, and how comfortable you are with programming. If it's not a SAM, but a BAM, then your best bet is editing the TopHat source, since the BAM files use fairly complicated compression.

                -Mitch
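As a rough sketch of the awk route mentioned above: the filter below prefixes "ID:" to the second field of an @PG line only when that field has no two-character tag of its own, which is the shape TopHat 1.0.13 writes. It assumes tab-separated header fields. The function name and the output file name are made up for illustration.

```shell
# Read SAM text on stdin and write it to stdout, turning
#   @PG<TAB>TopHat<TAB>VN:...   into   @PG<TAB>ID:TopHat<TAB>VN:...
# All other lines pass through unchanged.
fix_pg_id() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^@PG/ && NF >= 2 && $2 !~ /^[A-Za-z][A-Za-z0-9]:/ { $2 = "ID:" $2 }
       { print }'
}

# Example:
# fix_pg_id < accepted_hits.sam > accepted_hits.fixed.sam
```

Running the filter twice is harmless, because a field that already starts with a tag such as ID: is left alone. For a BAM file, the same filter could in principle sit between `samtools view -h` and `samtools view -bS -`, assuming a samtools build that supports those options. I have not tried this with 0.1.7a.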



                • #9
                  Thank You for the quick reply. Your suggestion was very helpful.

                  -Vincent



                  • #10
                    Originally posted by jmarshall
                    The missing ID: tag is a small bug in tophat. Mostly harmless, but it will prevent processing the SAM file with picard, even with VALIDATION_STRINGENCY=SILENT, or other tools that expect headers to be syntactically correct. It's trivial to fix in tophat.py (line 1006 in 1.0.13).

                    Code:
                    ./../../samtools-0.1.7a/samtools import ../../../ZmB73_AGPv1_genome.fasta.fai accepted_hits.sam accepted_hits_complete.bam
                    The tag [ID] required for [PG] not present.
                    [sam_header_read2] 11 sequences loaded


                    -- John
                    Does anyone know if this has been fixed in Tophat1.0.14?



                    • #11
                      Just checked the tophat.py script, looks like it has. Maybe Picard will now work with the TopHat SAM files.....



                      • #12
                        Does it also produce the @SQ entries now?
                        Some of the Picard tools need them. I tried to add them, but it seems that I made a mistake somewhere.
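For adding the @SQ entries by hand, one possible approach is to generate them from the reference's .fai index. The sketch below assumes the usual faidx layout, with the sequence name in column 1 and its length in column 2. The function name is hypothetical.

```shell
# Emit one @SQ header line per sequence listed in a .fai index.
# SN is the sequence name, LN its length in bases.
fai_to_sq_lines() {
  awk 'BEGIN { FS = OFS = "\t" } { print "@SQ", "SN:" $1, "LN:" $2 }' "$1"
}

# Example:
# fai_to_sq_lines ZmB73_AGPv1_genome.fasta.fai
```

The resulting lines belong in the header, after @HD and before the first alignment, and the fields must be separated by tabs rather than spaces. Spaces in place of tabs are an easy mistake to make when editing by hand.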



                        • #13
                          No it does not.

                          Since I run it as part of a workflow and got tired of editing the header manually, I've created a patch file for tophat v1.0.14 (attached). It's probably not optimal, but it does add the SQ fields to the header.

                          To apply that patch to a fresh new installation:
                          1. unpack tophat-1.0.14.tar.gz
                          2. copy the attached patch to the tophat-1.0.14 directory
                          3. unzip the patch
                          4. verify that the patch works
                            Code:
                            patch --dry-run -p1 -i tophat-1.0.14.py.patch
                          5. apply it
                            Code:
                            patch -p1 -i tophat-1.0.14.py.patch
                          6. install tophat
                            Code:
                            ./configure --prefix=`pwd`/../tophat
                            make
                            make install


                          To apply it to an existing installation, follow steps 2 to 5 above and then run
                          Code:
                          make clean
                          make
                          make install
                          I hope this helps. It did for me :-)

                          I'll send the patch to Cole Trapnell (TopHat author) too, so that it does not get lost (and hopefully he will find it sensible).
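The dry-run and apply steps can also be chained so that the real patch only runs when the dry run succeeds. This is a small sketch of that idea, assuming a patch program that supports --dry-run (as in the commands above). The function name is made up.

```shell
# Apply a patch with -p1 only if a dry run reports no problems.
# Run from the top of the unpacked source tree.
apply_patch_checked() {
  patch_file=$1
  if patch --dry-run -p1 -i "$patch_file" >/dev/null; then
    patch -p1 -i "$patch_file"
  else
    echo "dry run failed, source tree left untouched: $patch_file" >&2
    return 1
  fi
}

# Example, from inside the tophat-1.0.14 directory:
# apply_patch_checked tophat-1.0.14.py.patch
```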