  • please help with .fastq and .sam format

    Hi guys,
    The following records are from a .sam file (only the leftmost columns are shown). What I do not understand is the meaning of the second column in the last two lines:
    1:N:0:ACTTGA and 1:Y:0:ACTTGA.
    Can anyone help me with these? Should I ignore these sequences? Thank you very much for your help.

    HWI-ST273:295:C0C4PACXX:7:1101:1554:1975 0 chr 1986354 255 50M * 0 NTAATTGTCTCTGCAATGTTATTAACCATAATATCAATTTCACCGAAACG #1=ADDBBFFFHBAGIGEHIFIHIIICGIIGCHGIIIIHIIIIIFDGIIH XA:i:2
    HWI-ST273:295:C0C4PACXX:7:1101:1612:1993 0 chr 292259 255 50M * 0 ACTGCTAATGAAGTAACACAAATAGATGGCGTGGCGTCAGTTGATGAAAA @@CDFFFFFHHFHHHGIJJIHHIIICIGIIJFHGHHGGHIJEIJBHGIII XA:i:0
    HWI-ST273:295:C0C4PACXX:7:1101:2526:1955 16 chr 135945 255 50M * 0 TCAGTAACGACAGTAAGTTGGCAAGCGACATTAGCCGGTTTAGTAATTGN JJGHFGHIGJJHGGIGDIGIIDJJJJIHHCIJJJJJIHHGHHDFFFD=1# XA:i:1
    HWI-ST273:295:C0C4PACXX:7:1101:3762:1995 1:N:0:ACTTGA 4 * 0 0 * * TTGCTTGTTTACCAATGATTAAAAACCATACTTATTTTCAATTTACTGGA @@CFFFFACFHHHGIIIIJJHHJJJJIJIIJJIIJJJJJIIIJJHIIJJD XM:i:0
    HWI-ST273:295:C0C4PACXX:7:1101:4204:1964 1:Y:0:ACTTGA 4 * 0 0 * * NTGCTAGTCAATTGCTACACCATTTAATTGTGGAAGCAAAAGCTAAAGGT #1DBD+=2CBDDEEEDEIE@FDFEEEEEFEEEIIEDEEA@DDEEDIE? XM:i:0

  • #2
    "1:N:0:ACTTGA" and "1:Y:0:ACTTGA" are not a second column but part of the read identifiers: they are separated from the rest of the read name by a space, not by a tab as SAM fields are (I had to look at the HTML source to find that out, though). They look like output from the Illumina pipeline and include the barcode for that read and some other information that I'm not sure about.
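To illustrate the point about spaces versus tabs, here is a minimal shell sketch that splits each SAM line on tabs only and then cuts the first field at its first space. The function name `split_qname` is made up for this example, and the sample record is rebuilt with explicit tabs and trimmed to a few columns, so treat it as an assumption rather than the original data.

```shell
# Print read name, header comment and FLAG (tab-separated) for each SAM record.
# Reads SAM text on stdin; fields are split on tabs only, never on spaces.
split_qname() {
    awk -F'\t' 'BEGIN { OFS = "\t" }
        /^@[A-Z][A-Z]\t/ { next }        # skip SAM header lines such as @HD, @SQ
        {
            name = $1; comment = ""
            sp = index($1, " ")          # first space inside the first field, if any
            if (sp > 0) {
                name = substr($1, 1, sp - 1)
                comment = substr($1, sp + 1)
            }
            print name, comment, $2
        }'
}

# Demo record (hypothetical, shortened): the space belongs to field 1, the tabs separate fields.
printf 'HWI-ST273:295:C0C4PACXX:7:1101:3762:1995 1:N:0:ACTTGA\t4\t*\t0\t0\t*\n' | split_qname
```

Split this way, the FLAG column of that record is 4, and "1:N:0:ACTTGA" turns out to be the tail of the read name.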
    Last edited by arvid; 02-01-2012, 07:52 AM.

    • #3
      So two things:

      These reads are aligned, which is why they have genomic coordinates:
      Code:
      HWI-ST273:295:C0C4PACXX:7:1101:1554:1975 0 chr 1986354 255 50M * 0 NTAATTGTCTCTGCAATGTTATTAACCATAATATCAATTTCACCGAAACG #1=ADDBBFFFHBAGIGEHIFIHIIICGIIGCHGIIIIHIIIIIFDGIIH XA:i:2
      HWI-ST273:295:C0C4PACXX:7:1101:1612:1993 0 chr 292259 255 50M * 0 ACTGCTAATGAAGTAACACAAATAGATGGCGTGGCGTCAGTTGATGAAAA @@CDFFFFFHHFHHHGIJJIHHIIICIGIIJFHGHHGGHIJEIJBHGIII XA:i:0
      HWI-ST273:295:C0C4PACXX:7:1101:2526:1955 16 chr 135945 255 50M * 0 TCAGTAACGACAGTAAGTTGGCAAGCGACATTAGCCGGTTTAGTAATTGN JJGHFGHIGJJHGGIGDIGIIDJJJJIHHCIJJJJJIHHGHHDFFFD=1# XA:i:1
      Whereas these are not aligned:

      Code:
      HWI-ST273:295:C0C4PACXX:7:1101:3762:1995 1:N:0:ACTTGA 4 * 0 0 * * TTGCTTGTTTACCAATGATTAAAAACCATACTTATTTTCAATTTACTGGA @@CFFFFACFHHHGIIIIJJHHJJJJIJIIJJIIJJJJJIIIJJHIIJJD XM:i:0
      HWI-ST273:295:C0C4PACXX:7:1101:4204:1964 1:Y:0:ACTTGA 4 * 0 0 * * NTGCTAGTCAATTGCTACACCATTTAATTGTGGAAGCAAAAGCTAAAGGT #1DBD+=2CBDDEEEDEIE@FDFEEEEEFEEEIIEDEEA@DDEEDIE? XM:i:0
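The aligned/unaligned split can also be read off the FLAG column (the second tab-separated field). In the SAM format, bit 0x4 is set when a read is unmapped, so FLAG 4 marks the last two records as unaligned, while 0 and 16 are forward-strand and reverse-strand alignments. A minimal sketch, assuming tab-separated SAM on stdin; `count_mapped` is a hypothetical helper name.

```shell
# Count mapped and unmapped records in SAM text read from stdin.
# POSIX awk has no bitwise operators, so bit 0x4 is tested with integer arithmetic.
count_mapped() {
    awk -F'\t' '
        /^@[A-Z][A-Z]\t/ { next }                  # skip SAM header lines
        { if (int($2 / 4) % 2 == 1) unmapped++; else mapped++ }
        END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
    '
}

# Demo with three shortened, made-up records carrying the FLAG values 0, 16 and 4.
printf 'readA\t0\tchr\t100\nreadB\t16\tchr\t200\nreadC 1:N:0:ACTTGA\t4\t*\t0\n' | count_mapped
```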
      Secondly, these two fields are part of the original read data in the fastq file:
      1:N:0:ACTTGA and 1:Y:0:ACTTGA.
      After Illumina Casava v something or another (1.8?) they made some changes. One of the changes was that reads that did not pass filter were still included in the fastq files, but are flagged with either a 1:N:0:ACTTGA, indicating no, it did not pass filter or a 1:Y:0:ACTTGA, indicating that yes, it did pass filter.

      You should actually sort the fastq files and remove those flagged with a N before alignment. There is a recommended script in the Casava documentation to filter these:

      Code:
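      cd /path/to/project/sample
      mkdir -p filtered
      # Keep only the records whose header carries the N flag.
      # -A 3 prints the header plus the other 3 lines of each FASTQ record;
      # grep -v '^--$' drops the group separators that grep -A inserts.
      for fastq in *.fastq.gz; do
          zcat "$fastq" | grep -A 3 '^@.* [^:]*:N:[^:]*:' | grep -v '^--$' | gzip > "filtered/$fastq"
      done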
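The grep approach relies on context lines around each matching header. As an alternative sketch (not the script from the Casava documentation), awk can walk the file one four-line record at a time and keep a record when the second colon-separated field of its header comment is N. `keep_n_flagged` is a made-up name, and the code assumes unwrapped FASTQ with exactly four lines per record.

```shell
# Keep FASTQ records whose header comment has N in its second field.
# Reads FASTQ on stdin and writes the kept records to stdout.
keep_n_flagged() {
    awk '
        NR % 4 == 1 {                        # header line of each 4-line record
            n = split($0, part, " ")         # part[2] is the comment, e.g. 1:N:0:ACTTGA
            split(part[2], field, ":")
            keep = (n >= 2 && field[2] == "N")
        }
        keep { print }                       # applies to all four lines of the record
    '
}

# Possible use (file names are placeholders):
#   zcat sample.fastq.gz | keep_n_flagged | gzip > filtered/sample.fastq.gz
```

Because the decision is made once per record, a quality line that happens to start with @ cannot be mistaken for a header.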
      Last edited by chadn737; 02-01-2012, 08:02 AM.

      • #4
        Thank you "arvid" and "chadn737". You guys made my day. I very much appreciate your help.
        Best,

        • #5
          Originally posted by chadn737
          After Illumina Casava v something or another (1.8?) they made some changes. One of the changes was that reads that did not pass filter were still included in the fastq files, but are flagged with either a 1:N:0:ACTTGA, indicating no, it did not pass filter or a 1:Y:0:ACTTGA, indicating that yes, it did pass filter.

          You should actually sort the fastq files and remove those flagged with a N before alignment. There is a recommended script in the Casava documentation to filter these:

          Code:
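          cd /path/to/project/sample
          mkdir -p filtered
          # Keep only the records whose header carries the N flag.
          # -A 3 prints the header plus the other 3 lines of each FASTQ record;
          # grep -v '^--$' drops the group separators that grep -A inserts.
          for fastq in *.fastq.gz; do
              zcat "$fastq" | grep -A 3 '^@.* [^:]*:N:[^:]*:' | grep -v '^--$' | gzip > "filtered/$fastq"
          done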


          chadn,

          There's a slight error in your description. In CASAVA 1.8+ the Y/N field answers "Is the read filtered?", in other words, did the read FAIL filtering. Reads with Y failed the filter, whereas reads with N passed. Your code fragment keeps the passed (N) reads, which is the desired outcome.

          The confusion is understandable. In versions before 1.8 the meaning of Y/N was reversed and, to my way of thinking, easier to understand.
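To make the 1.8+ convention concrete, here is a small sketch that decodes a header comment of the form read:is-filtered:control-number:index. The function name `describe_comment` and the output wording are assumptions made for illustration.

```shell
# Decode a CASAVA 1.8-style header comment such as "1:Y:0:ACTTGA".
# Field 2 is the "is filtered" flag: Y = read failed the filter, N = read passed.
describe_comment() {
    printf '%s\n' "$1" | awk -F: '{
        status = ($2 == "N") ? "passed" : (($2 == "Y") ? "failed" : "unknown")
        printf "read=%s filter=%s control=%s index=%s\n", $1, status, $3, $4
    }'
}

describe_comment "1:N:0:ACTTGA"
describe_comment "1:Y:0:ACTTGA"
```

Under this reading, the two unaligned reads from the first post differ only in the filter flag: the first passed the filter and the second failed it.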

          • #6
            Good to know. Thanks.

            • #7
              Originally posted by kmcarr
              chadn,

              There's a slight error in your description. In CASAVA 1.8+ the Y/N field answers "Is the read filtered?", in other words, did the read FAIL filtering. Reads with Y failed the filter, whereas reads with N passed. Your code fragment keeps the passed (N) reads, which is the desired outcome.

              The confusion is understandable. In versions before 1.8 the meaning of Y/N was reversed and, to my way of thinking, easier to understand.
              Ah, thanks for the clarification.

              • #8
                Thanks for making it clear.
