  • Bowtie changes read names in SAM output

    I have fastq files with read names like this:

    "@HWUSI-EAS614:91:70KKMAAXX:5:43:12933:8491 1:N:0:"

    Notice there is a space in the name. This is the format of all reads in the input fastq file. However, the sam file produced by bowtie sometimes trims off the last part after the space, and sometimes does not. Has anyone seen this behavior?

    The SAM format does not allow spaces in the read name, so the general question is: how does Bowtie modify read names? And why is it not doing so consistently for all reads in the same run?

    Thank you for any help.

  • #2
    You can't have spaces in the name.

    With FASTQ by definition/convention as with FASTA, the first word is the name/identifier and anything after a white space is a comment or description.

    [I don't know why Bowtie might be inconsistent in this regard - are you sure these are really spaces in both cases?]
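The "first word is the identifier" convention described in this post can be sketched in a few lines of Python. This is only an illustration of the convention, not how Bowtie or any other tool implements it; the function name `split_fastq_header` is made up for this example.

```python
def split_fastq_header(header_line):
    """Split a FASTQ '@' line into (identifier, description).

    Hypothetical helper: the identifier is taken to be the first
    whitespace-delimited word, and everything after it is the description.
    """
    if not header_line.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    text = header_line[1:].rstrip("\r\n")
    parts = text.split(None, 1)  # split once, on any run of whitespace
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


name, desc = split_fastq_header(
    "@HWUSI-EAS614:91:70KKMAAXX:5:43:12933:8491 1:N:0:"
)
print(name)
print(desc)
```

Under this convention only `name` would be carried over as the SAM read name, and the `1:N:0:` part would be dropped.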



    • #3
      > You can't have spaces in the name.

      I don't think this is true. The wikipedia article on FASTQ gives examples where spaces are used. In fact, my fastq files are from CASAVA 1.8, which standardizes on using a space.

      However, the SAM format specification clearly states that spaces are disallowed. Thus, any tool transferring read names from FASTQ files to SAM files needs to specify a name conversion technique.



      • #4
        Originally posted by ashish View Post
        You can't have spaces in the name.
        I don't think this is true. The wikipedia article on FASTQ gives examples where spaces are used. In fact, my fastq files are from CASAVA 1.8, which standardizes on using a space.
        You CAN have spaces in the @ line (and + line) of FASTQ, just like you can in the > line of FASTA.

        My point is the space acts as a delimiter, the name/identifier is the first WORD of that string.

        Originally posted by ashish View Post
        However, the SAM format specification clearly states that spaces are disallowed. Thus, any tool transferring read names from FASTQ files to SAM files needs to specify a name conversion technique.
        If you regard the whole string after the @ as the name in FASTQ, then yes. All the tools I've worked with take the first word.



        • #5
          Originally posted by maubp View Post
          My point is the space acts as a delimiter, the name/identifier is the first WORD of that string.
          I was using "name" to refer to the entire @ line. I haven't seen a specification of FASTQ that defines more structure within the line.

          Originally posted by maubp View Post
          All the tools I've worked with take the first word.
          Bowtie seems not to sometimes. Here's an example showing that the entire line, with space, is retained:

          $ grep "HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061" SL6140.fastq
          @HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061 1:Y:0:

          $ grep "HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061" SL6140.sam
          HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061 1:Y:0: 4 * 0 0 * * 0 0 CNNNCAGTGAAAATTAAATTTGCCCCAAGGAACTCC <###<><6<<AAAAAAAAAAAAAAAAAAAAAAAAAA XM:i:0


          And here's an example from the same files, showing that only the first word is retained:

          $ grep "HWUSI-EAS1758R:20:70KK0AAXX:4:1:13505:1067" SL6140.fastq
          @HWUSI-EAS1758R:20:70KK0AAXX:4:1:13505:1067 1:Y:0:


          $ grep "HWUSI-EAS1758R:20:70KK0AAXX:4:1:13505:1067" SL6140.sam
          HWUSI-EAS1758R:20:70KK0AAXX:4:1:13505:1067 16 chr9 3034652 255 36M * 0 0 AGTGGACATTTCTAAATTTTCCACCTTTTTCAGNNT 9:83@:@@@@@@@@@@3::::99999)+(.+,-##> XA:i:2 MD:Z:33T0T1 NM:i:2



          • #6
            Originally posted by ashish View Post
            I was using "name" to refer to the entire @ line. I haven't seen a specification of FASTQ that defines more structure within the line.
            We tried to make this "first word is the identifier" point clear here:


            Likewise I thought the Wikipedia page was fairly clear:



            • #7
              Originally posted by ashish View Post
              Here's an example ...
              That is strange and looks like a bug in bowtie to me.

              Try piping those grep results into hexdump to double check it is a space (chr 32, x20), and not some other non-printing character.
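If hexdump is not to hand, the same check can be sketched in Python. This is an assumed helper (`unusual_bytes` is a made-up name) that treats a line as raw bytes and reports anything outside printable ASCII, so a plain space (0x20) passes while a tab or a UTF-8 non-breaking space is flagged.

```python
def unusual_bytes(line: bytes):
    """Return (offset, byte value) pairs for bytes outside printable ASCII.

    A plain space (0x20) counts as printable, so it is not reported;
    tabs, other control characters, and any non-ASCII bytes are.
    Trailing CR/LF characters are ignored.
    """
    stripped = line.rstrip(b"\r\n")
    return [(i, b) for i, b in enumerate(stripped) if b < 0x20 or b > 0x7E]


# A header with an ordinary space yields an empty list.
print(unusual_bytes(b"@HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061 1:Y:0:\n"))
```

To apply it to a file, open the file in binary mode (`open(path, "rb")`) so that the bytes are not decoded before they are inspected.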



              • #8
                Originally posted by ashish View Post
                I was using "name" to refer to the entire @ line. I haven't seen a specification of FASTQ that defines more structure within the line.



                Bowtie seems not to sometimes. Here's an example showing that the entire line, with space, is retained:

                $ grep "HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061" SL6140.fastq
                @HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061 1:Y:0:

                $ grep "HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061" SL6140.sam
                HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061 1:Y:0: 4 * 0 0 * * 0 0 CNNNCAGTGAAAATTAAATTTGCCCCAAGGAACTCC <###<><6<<AAAAAAAAAAAAAAAAAAAAAAAAAA XM:i:0


                And here's an example from the same files, showing that only the first word is retained:

                $ grep "HWUSI-EAS1758R:20:70KK0AAXX:4:1:13505:1067" SL6140.fastq
                @HWUSI-EAS1758R:20:70KK0AAXX:4:1:13505:1067 1:Y:0:


                $ grep "HWUSI-EAS1758R:20:70KK0AAXX:4:1:13505:1067" SL6140.sam
                HWUSI-EAS1758R:20:70KK0AAXX:4:1:13505:1067 16 chr9 3034652 255 36M * 0 0 AGTGGACATTTCTAAATTTTCCACCTTTTTCAGNNT 9:83@:@@@@@@@@@@3::::99999)+(.+,-##> XA:i:2 MD:Z:33T0T1 NM:i:2
                I don't know if that has anything to do with it, but in the two examples you have just given, it looks to me like the unmodified read name belongs to an unaligned read.

                Whereas the modified read name is clearly an aligned read. Is this true for all the examples you see?
                Last edited by chadn737; 07-22-2011, 12:22 PM.
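One way to test this hypothesis across the whole file is to tally SAM records by whether the read name contains a space and whether the unmapped bit (0x4) is set in FLAG. The sketch below assumes the SAM file is tab-delimited as usual (the forum rendering above shows the fields separated by spaces); `tally_qname_spaces` is a made-up name.

```python
from collections import Counter


def tally_qname_spaces(sam_lines):
    """Count SAM records by (name contains a space, read is unmapped).

    Assumes tab-separated fields with QNAME first and FLAG second.
    Header lines (starting with '@') and blank lines are skipped.
    """
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\r\n").split("\t")
        qname, flag = fields[0], int(fields[1])
        counts[(" " in qname, bool(flag & 0x4))] += 1
    return counts


# Two records modelled on the examples quoted above, with tabs restored.
example = [
    "HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061 1:Y:0:\t4\t*\t0\t0\t*\t*\t0\t0\tCNNN\t<###\tXM:i:0\n",
    "HWUSI-EAS1758R:20:70KK0AAXX:4:1:13505:1067\t16\tchr9\t3034652\t255\t36M\t*\t0\t0\tAGTG\t9:83\tXA:i:2\n",
]
for key, n in sorted(tally_qname_spaces(example).items()):
    print(key, n)
```

If the pattern holds, a real file would produce counts only for `(True, True)` and `(False, False)`: names with a space on unaligned reads, and trimmed names on aligned reads.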



                • #9
                  Originally posted by maubp View Post
                  Try piping those grep results into hexdump to double check it is a space (chr 32, x20), and not some other non-printing character.
                  Good idea. I did that and confirmed that they are always space characters in both the input fastq file and output sam file.



                  • #10
                    Originally posted by chadn737 View Post
                    I don't know if that has anything to do with it, but in the two examples you have just given, it looks to me like the unmodified read name belongs to an unaligned read.

                    Whereas the modified read name is clearly an aligned read. Is this true for all the examples you see?
                    If you're right, that should make it much easier to trace the bug inside Bowtie. Well spotted.
