  • Please help with .fastq and .sam format

    Hi guys,
    The following lines are from a .sam file (only the leftmost columns are shown). What I do not understand is the meaning of the second column in the last two lines: 1:N:0:ACTTGA and 1:Y:0:ACTTGA.
    Can anyone help me with those? Should I ignore these sequences? Thank you very much for your help.

    HWI-ST273:295:C0C4PACXX:7:1101:1554:1975 0 chr 1986354 255 50M * 0 NTAATTGTCTCTGCAATGTTATTAACCATAATATCAATTTCACCGAAACG #1=ADDBBFFFHBAGIGEHIFIHIIICGIIGCHGIIIIHIIIIIFDGIIH XA:i:2
    HWI-ST273:295:C0C4PACXX:7:1101:1612:1993 0 chr 292259 255 50M * 0 ACTGCTAATGAAGTAACACAAATAGATGGCGTGGCGTCAGTTGATGAAAA @@CDFFFFFHHFHHHGIJJIHHIIICIGIIJFHGHHGGHIJEIJBHGIII XA:i:0
    HWI-ST273:295:C0C4PACXX:7:1101:2526:1955 16 chr 135945 255 50M * 0 TCAGTAACGACAGTAAGTTGGCAAGCGACATTAGCCGGTTTAGTAATTGN JJGHFGHIGJJHGGIGDIGIIDJJJJIHHCIJJJJJIHHGHHDFFFD=1# XA:i:1
    HWI-ST273:295:C0C4PACXX:7:1101:3762:1995 1:N:0:ACTTGA 4 * 0 0 * * TTGCTTGTTTACCAATGATTAAAAACCATACTTATTTTCAATTTACTGGA @@CFFFFACFHHHGIIIIJJHHJJJJIJIIJJIIJJJJJIIIJJHIIJJD XM:i:0
    HWI-ST273:295:C0C4PACXX:7:1101:4204:1964 1:Y:0:ACTTGA 4 * 0 0 * * NTGCTAGTCAATTGCTACACCATTTAATTGTGGAAGCAAAAGCTAAAGGT #1DBD+=2CBDDEEEDEIE@FDFEEEEEFEEEIIEDEEA@DDEEDIE? XM:i:0

  • #2
    "1:N:0:ACTTGA" and "1:Y:0:ACTTGA" are not in a second column but part of the read identifiers, since they are separated by a space, and not by a tab as SAM fields are (I had to look at the HTML source to find that out, though). They look like the output from the Illumina pipelines for reads and include the barcode for that read and some other stuff that I'm not sure about.
    Last edited by arvid; 02-01-2012, 07:52 AM.
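
The space-versus-tab point can be checked directly by splitting a record on tabs only. This is a minimal sketch: the record is rebuilt by hand from the post (cut off after the first few fields, with tabs assumed between fields), and `sam_qname` and `sam_flag` are hypothetical helper names.

```shell
# Hand-built record: a space inside the read name, tabs between SAM fields.
# Only the first few fields are included; the tabs are an assumption,
# since the forum page shows every separator as a space.
record=$(printf 'HWI-ST273:295:C0C4PACXX:7:1101:3762:1995 1:N:0:ACTTGA\t4\t*\t0\t0\t*')

# Field 1 (QNAME) when splitting on tabs only.
sam_qname() { printf '%s\n' "$1" | awk -F'\t' '{ print $1 }'; }

# Field 2 (FLAG) when splitting on tabs only.
sam_flag() { printf '%s\n' "$1" | awk -F'\t' '{ print $2 }'; }
```

With this record, `sam_qname "$record"` prints the whole read name including `1:N:0:ACTTGA`, and `sam_flag "$record"` prints `4`, so the Illumina tag is not a column of its own.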


    • #3
      So two things:

      First, these reads are aligned, which is why they have genomic coordinates:
      Code:
      HWI-ST273:295:C0C4PACXX:7:1101:1554:1975 0 chr 1986354 255 50M * 0 NTAATTGTCTCTGCAATGTTATTAACCATAATATCAATTTCACCGAAACG #1=ADDBBFFFHBAGIGEHIFIHIIICGIIGCHGIIIIHIIIIIFDGIIH XA:i:2
      HWI-ST273:295:C0C4PACXX:7:1101:1612:1993 0 chr 292259 255 50M * 0 ACTGCTAATGAAGTAACACAAATAGATGGCGTGGCGTCAGTTGATGAAAA @@CDFFFFFHHFHHHGIJJIHHIIICIGIIJFHGHHGGHIJEIJBHGIII XA:i:0
      HWI-ST273:295:C0C4PACXX:7:1101:2526:1955 16 chr 135945 255 50M * 0 TCAGTAACGACAGTAAGTTGGCAAGCGACATTAGCCGGTTTAGTAATTGN JJGHFGHIGJJHGGIGDIGIIDJJJJIHHCIJJJJJIHHGHHDFFFD=1# XA:i:1
      Whereas these are not aligned (flag 4, no coordinates):

      Code:
      HWI-ST273:295:C0C4PACXX:7:1101:3762:1995 1:N:0:ACTTGA 4 * 0 0 * * TTGCTTGTTTACCAATGATTAAAAACCATACTTATTTTCAATTTACTGGA @@CFFFFACFHHHGIIIIJJHHJJJJIJIIJJIIJJJJJIIIJJHIIJJD XM:i:0
      HWI-ST273:295:C0C4PACXX:7:1101:4204:1964 1:Y:0:ACTTGA 4 * 0 0 * * NTGCTAGTCAATTGCTACACCATTTAATTGTGGAAGCAAAAGCTAAAGGT #1DBD+=2CBDDEEEDEIE@FDFEEEEEFEEEIIEDEEA@DDEEDIE? XM:i:0
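
The aligned/unaligned split shows up in the FLAG field (0, 16 and 4 in the lines above). A minimal sketch of decoding just the two bits that matter here, 0x4 (unmapped) and 0x10 (reverse strand); `describe_flag` is a hypothetical helper name and it ignores every other bit:

```shell
# Describe a SAM FLAG value using only bits 0x4 and 0x10.
describe_flag() {
    if [ $(( $1 & 4 )) -ne 0 ]; then
        echo "unmapped"
    elif [ $(( $1 & 16 )) -ne 0 ]; then
        echo "mapped, reverse strand"
    else
        echo "mapped, forward strand"
    fi
}
```

For the reads in this thread, `describe_flag 0` prints `mapped, forward strand`, `describe_flag 16` prints `mapped, reverse strand`, and `describe_flag 4` prints `unmapped`.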
      Secondly, these two fields are part of the original read header in the fastq file:
      1:N:0:ACTTGA and 1:Y:0:ACTTGA.
      As of some version of Illumina CASAVA (1.8?), reads that did not pass filter are still included in the fastq files, but every read is flagged with either 1:N:0:ACTTGA, indicating that no, it did not pass filter, or 1:Y:0:ACTTGA, indicating that yes, it did pass filter.

      You should actually filter the fastq files and remove the reads flagged with an N before alignment. There is a recommended script in the CASAVA documentation to filter these:

      Code:
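      cd /path/to/project/sample
      mkdir -p filtered
      for fastq in *.fastq.gz ; do
        # Keep each header whose filter flag is N, plus the 3 lines after it
        # (sequence, '+', qualities); -A 4 would also pull in the next header.
        # grep prints "--" between non-adjacent matches, so strip those,
        # then recompress so the output matches its .gz name.
        zcat "$fastq" | grep -A 3 '^@.* [^:]*:N:[^:]*:' | grep -v '^--$' | gzip > "filtered/$fastq"
      done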
      Last edited by chadn737; 02-01-2012, 08:02 AM.
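
Before and after filtering, it can help to count how many reads carry each flag. This is a minimal sketch, assuming standard 4-line FASTQ records and CASAVA 1.8-style headers; `count_flags` is a hypothetical helper name, not something from the CASAVA documentation.

```shell
# Read FASTQ on stdin and count headers by their Y/N filter flag.
# Only every 4th line starting at line 1 is treated as a header, so a
# quality line that happens to begin with '@' is never miscounted.
count_flags() {
    awk 'NR % 4 == 1 { split($2, f, ":"); n[f[2]]++ }
         END { printf "N=%d Y=%d\n", n["N"], n["Y"] }'
}
```

Usage would be something like `zcat sample.fastq.gz | count_flags`, where `sample.fastq.gz` stands in for one of your own files. Run on a filtered file, the Y count should be 0.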


      • #4
        Thank you "arvid" and "chadn737". You guys made my day. I very much appreciate your help.
        Best,


        • #5
          Originally posted by chadn737
          As of some version of Illumina CASAVA (1.8?), reads that did not pass filter are still included in the fastq files, but every read is flagged with either 1:N:0:ACTTGA, indicating that no, it did not pass filter, or 1:Y:0:ACTTGA, indicating that yes, it did pass filter.

          You should actually filter the fastq files and remove the reads flagged with an N before alignment. There is a recommended script in the CASAVA documentation to filter these:

          Code:
          cd /path/to/project/sample
          mkdir -p filtered
          for fastq in *.fastq.gz ; do
            # Keep each header whose filter flag is N, plus the 3 lines after it
            # (sequence, '+', qualities); -A 4 would also pull in the next header.
            # grep prints "--" between non-adjacent matches, so strip those,
            # then recompress so the output matches its .gz name.
            zcat "$fastq" | grep -A 3 '^@.* [^:]*:N:[^:]*:' | grep -v '^--$' | gzip > "filtered/$fastq"
          done


          chadn,

          There's a slight error in your description. In CASAVA 1.8+ the Y/N answers the question "Is the read filtered?", in other words, did the read FAIL filtering. So Y = failed and N = passed. Your code fragment keeps the N (passed) reads, which is the desired outcome.

          The confusion is understandable. In versions before 1.8 the meaning of the Y/N was reversed and, to my way of thinking, easier to understand.
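
The Y/N meaning described above can be spelled out with a little awk. This is a minimal sketch, assuming CASAVA 1.8+ headers of the form `@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> <read>:<is filtered>:<control number>:<index>`; `filter_flag` and `filter_status` are hypothetical helper names.

```shell
# Print the "is filtered" field (Y or N) from one FASTQ header line.
# The part after the space is <read>:<is filtered>:<control number>:<index>.
filter_flag() {
    printf '%s\n' "$1" | awk '{ split($2, f, ":"); print f[2] }'
}

# Translate the flag using the CASAVA 1.8+ meaning: Y = filtered out.
filter_status() {
    case "$(filter_flag "$1")" in
        N) echo "passed filter" ;;
        Y) echo "failed filter" ;;
        *) echo "unknown" ;;
    esac
}
```

For the two reads in the original question, `filter_status '@HWI-ST273:295:C0C4PACXX:7:1101:3762:1995 1:N:0:ACTTGA'` prints `passed filter` and the `1:Y:0:ACTTGA` read gives `failed filter`.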


          • #6
            Good to know. Thanks.


            • #7
              Originally posted by kmcarr
              chadn,

              There's a slight error in your description. In CASAVA 1.8+ the Y/N answers the question "Is the read filtered?", in other words, did the read FAIL filtering. So Y = failed and N = passed. Your code fragment keeps the N (passed) reads, which is the desired outcome.

              The confusion is understandable. In versions before 1.8 the meaning of the Y/N was reversed and, to my way of thinking, easier to understand.

              Ah, thanks for the clarification.


              • #8
                Thanks for making it clear.
