
**arvid** · 02-01-2012, 07:49 AM

"1:N:0:ACTTGA" and "1:Y:0:ACTTGA" are not in a second column but part of the read identifiers, since they are separated by a space, and not by a tab as SAM fields are (I had to look at the HTML source to find that out, though). They look like the output from the Illumina pipelines for reads and include the barcode for that read and some other stuff that I'm not sure about.

**chadn737** · 02-01-2012, 07:52 AM

So two things:

These reads are aligned, that's why there are genomic coordinates:

Code:

HWI-ST273:295:C0C4PACXX:7:1101:1554:1975 0 chr 1986354 255 50M * 0 NTAATTGTCTCTGCAATGTTATTAACCATAATATCAATTTCACCGAAACG #1=ADDBBFFFHBAGIGEHIFIHIIICGIIGCHGIIIIHIIIIIFDGIIH XA:i:2
HWI-ST273:295:C0C4PACXX:7:1101:1612:1993 0 chr 292259 255 50M * 0 ACTGCTAATGAAGTAACACAAATAGATGGCGTGGCGTCAGTTGATGAAAA @@CDFFFFFHHFHHHGIJJIHHIIICIGIIJFHGHHGGHIJEIJBHGIII XA:i:0
HWI-ST273:295:C0C4PACXX:7:1101:2526:1955 16 chr 135945 255 50M * 0 TCAGTAACGACAGTAAGTTGGCAAGCGACATTAGCCGGTTTAGTAATTGN JJGHFGHIGJJHGGIGDIGIIDJJJJIHHCIJJJJJIHHGHHDFFFD=1# XA:i:1

Whereas these are not aligned:

Code:

HWI-ST273:295:C0C4PACXX:7:1101:3762:1995 1:N:0:ACTTGA 4 * 0 0 * * TTGCTTGTTTACCAATGATTAAAAACCATACTTATTTTCAATTTACTGGA @@CFFFFACFHHHGIIIIJJHHJJJJIJIIJJIIJJJJJIIIJJHIIJJD XM:i:0
HWI-ST273:295:C0C4PACXX:7:1101:4204:1964 1:Y:0:ACTTGA 4 * 0 0 * * NTGCTAGTCAATTGCTACACCATTTAATTGTGGAAGCAAAAGCTAAAGGT #1DBD+=2CBDDEEEDEIE@FDFEEEEEFEEEIIEDEEA@DDEEDIE? XM:i:0
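
The aligned/unaligned distinction above is also recorded in the FLAG column (the second tab-separated SAM field), where bit 0x4 marks a read as unmapped. A minimal sketch of counting both kinds, assuming tab-delimited SAM on stdin; the function name `count_mapped` and the three sample records are made up for illustration:

```shell
# Count mapped vs. unmapped SAM records on stdin using FLAG bit 0x4.
# Fields are split on tabs only, so a space inside the read name
# does not shift the columns. Header lines (starting with "@") are skipped.
count_mapped() {
    awk -F'\t' '
        /^@/ { next }
        {
            # Integer test for bit 0x4 that works in any POSIX awk
            if (int($2 / 4) % 2 == 1) unmapped++
            else mapped++
        }
        END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
    '
}

# Illustrative records only (hypothetical names, shortened sequences)
printf 'readA\t0\tchr\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\nreadB 1:N:0:ACTTGA\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\nreadC\t16\tchr\t200\t255\t4M\t*\t0\t0\tACGT\tIIII\n' | count_mapped
```

For the three sample records this prints `mapped=2 unmapped=1`: FLAG 0 and FLAG 16 (reverse strand) are both mapped, FLAG 4 is not.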

Secondly, these two fields are part of the original read data in the fastq file:

1:N:0:ACTTGA and 1:Y:0:ACTTGA.
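
Since the comment comes from the FASTQ header and is separated from the read name by a space rather than a tab, it ends up inside the first SAM column. A rough sketch of pulling the two apart, assuming tab-delimited SAM on stdin; `split_qname` is a hypothetical helper name, and the sample record is illustrative (tabs reconstructed, sequence shortened):

```shell
# Split the first tab-separated SAM field (QNAME) at its first space.
# Prints: read name, Illumina comment (or "-" if there is none), FLAG.
split_qname() {
    awk -F'\t' '{
        name = $1
        comment = "-"
        sp = index($1, " ")
        if (sp > 0) {
            name = substr($1, 1, sp - 1)
            comment = substr($1, sp + 1)
        }
        printf "%s\t%s\t%s\n", name, comment, $2
    }'
}

# Illustrative record only
printf 'HWI-ST273:295:C0C4PACXX:7:1101:3762:1995 1:N:0:ACTTGA\t4\t*\t0\t0\t*\t*\t0\t0\tTTGC\t@@CF\tXM:i:0\n' | split_qname
```

Splitting on whitespace in general (for example awk's default field separator) would instead treat the comment as a second column, which is how it can look like a separate field.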

Starting with some version of Illumina CASAVA (1.8?), they made some changes. One of them was that reads that did not pass filter are still included in the fastq files, but are flagged with either 1:N:0:ACTTGA, indicating that no, the read did not pass filter, or 1:Y:0:ACTTGA, indicating that yes, it did pass filter.

You should actually filter the fastq files and remove the reads flagged with an N before alignment. There is a recommended script in the Casava documentation to filter these:

Code:

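cd /path/to/project/sample
mkdir filtered
# Keep each matching header plus the 3 lines after it (one 4-line FASTQ record),
# drop grep's "--" group separators, and recompress so the .gz name stays accurate.
for fastq in *.fastq.gz ; do
    zcat "$fastq" | grep -A 3 '^@.* [^:]*:N:[^:]*:' | grep -v '^--$' | gzip > "filtered/$fastq"
done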
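
The same selection can also be sketched with awk by tracking the position of each line within its record instead of relying on grep context lines. This is only a sketch, assuming strict 4-line FASTQ records on stdin; `keep_flag_n` is a made-up name, and it keeps the records whose header carries N in the second colon-separated field of the comment, which are the ones the grep pattern above matches:

```shell
# Keep FASTQ records whose header comment has "N" as its second
# colon-separated field (e.g. "1:N:0:ACTTGA").
# Assumes uncompressed, strictly 4-line records on stdin.
keep_flag_n() {
    awk '
        NR % 4 == 1 {
            # $2 is the comment after the space in the header line
            split($2, f, ":")
            keep = (f[2] == "N")
        }
        keep { print }
    '
}

# Tiny illustrative input: one record flagged N, one flagged Y
printf '@r1 1:N:0:ACTTGA\nACGT\n+\nIIII\n@r2 1:Y:0:ACTTGA\nTTTT\n+\n@@@@\n' | keep_flag_n
```

On real data this would sit in the same loop, for example `zcat "$fastq" | keep_flag_n | gzip > "filtered/$fastq"`. Because only every fourth line is treated as a header, a quality string that happens to start with `@` is never mistaken for one.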

**rnaeye** · 02-01-2012, 08:45 AM

Thank you "arvid" and "chadn737". You guys made my day. I very much appreciate your help.
Best,

**kmcarr** · 02-01-2012, 08:54 AM

Originally posted by chadn737

Starting with some version of Illumina CASAVA (1.8?), they made some changes. One of them was that reads that did not pass filter are still included in the fastq files, but are flagged with either 1:N:0:ACTTGA, indicating that no, the read did not pass filter, or 1:Y:0:ACTTGA, indicating that yes, it did pass filter.

You should actually filter the fastq files and remove the reads flagged with an N before alignment. There is a recommended script in the Casava documentation to filter these:

Code:

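cd /path/to/project/sample
mkdir filtered
# Keep each matching header plus the 3 lines after it (one 4-line FASTQ record),
# drop grep's "--" group separators, and recompress so the .gz name stays accurate.
for fastq in *.fastq.gz ; do
    zcat "$fastq" | grep -A 3 '^@.* [^:]*:N:[^:]*:' | grep -v '^--$' | gzip > "filtered/$fastq"
done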

chadn,

There's a slight error in your description. In CASAVA 1.8+ the Y/N means "Is the read filtered?", in other words, did the read FAIL filtering. So Y = failed, whereas N = passed. Your code fragment will keep the passed (N) reads, which is the desired outcome.

Confusion is understandable. In versions previous to 1.8 the meaning of the Y/N was reversed, and to my way of thinking more understandable.
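
To make the flag easier to read at a glance, here is a small sketch that decodes a comment such as 1:N:0:ACTTGA using the meaning described above (Y = filtered out, N = passed). It assumes the CASAVA 1.8+ comment layout is read number : is-filtered flag : control number : index sequence; `describe_comment` is a hypothetical helper, not something from the CASAVA tools:

```shell
# Decode a CASAVA 1.8+ style header comment such as "1:N:0:ACTTGA".
# Assumed layout: <read number>:<is filtered Y/N>:<control number>:<index sequence>
describe_comment() {
    printf '%s\n' "$1" | awk -F':' '{
        status = "unknown filter status"
        if ($2 == "N") status = "passed filter"          # N: read was NOT filtered out
        else if ($2 == "Y") status = "failed filter"     # Y: read WAS filtered out
        printf "read %s, %s, index %s\n", $1, status, $4
    }'
}

describe_comment "1:N:0:ACTTGA"
describe_comment "1:Y:0:ACTTGA"
```

The first call reports the read as passing and the second as failing, matching the two comments seen in this thread.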

**rnaeye** · 02-01-2012, 08:56 AM

Good to know. Thanks.

**chadn737** · 02-01-2012, 08:56 AM

Originally posted by kmcarr

chadn,

There's a slight error in your description. In CASAVA 1.8+ the Y/N means "Is the read filtered?", in other words, did the read FAIL filtering. So Y = failed, whereas N = passed. Your code fragment will keep the passed (N) reads, which is the desired outcome.

Confusion is understandable. In versions previous to 1.8 the meaning of the Y/N was reversed, and to my way of thinking more understandable.

Ah, thanks for the clarification.

**rnaeye** · 02-01-2012, 08:58 AM

Thanks for making it clear.


please help with .fastaq and .sam format
