SEQanswers

Old 02-01-2012, 06:26 AM   #1
rnaeye
Member
 
Location: East Cost

Join Date: May 2011
Posts: 79
Please help with .fastq and .sam format

Hi guys,
The following lines are from a .sam file (only the leftmost columns). What I do not understand is the meaning of the second column in the last two lines. Those are
1:N:0:ACTTGA and 1:Y:0:ACTTGA.
I was wondering if anyone can help me with those. Should I ignore these sequences? Thank you very much for your help.

HWI-ST273:295:C0C4PACXX:7:1101:1554:1975 0 chr 1986354 255 50M * 0 NTAATTGTCTCTGCAATGTTATTAACCATAATATCAATTTCACCGAAACG #1=ADDBBFFFHBAGIGEHIFIHIIICGIIGCHGIIIIHIIIIIFDGIIH XA:i:2
HWI-ST273:295:C0C4PACXX:7:1101:1612:1993 0 chr 292259 255 50M * 0 ACTGCTAATGAAGTAACACAAATAGATGGCGTGGCGTCAGTTGATGAAAA @@CDFFFFFHHFHHHGIJJIHHIIICIGIIJFHGHHGGHIJEIJBHGIII XA:i:0
HWI-ST273:295:C0C4PACXX:7:1101:2526:1955 16 chr 135945 255 50M * 0 TCAGTAACGACAGTAAGTTGGCAAGCGACATTAGCCGGTTTAGTAATTGN JJGHFGHIGJJHGGIGDIGIIDJJJJIHHCIJJJJJIHHGHHDFFFD=1# XA:i:1
HWI-ST273:295:C0C4PACXX:7:1101:3762:1995 1:N:0:ACTTGA 4 * 0 0 * * TTGCTTGTTTACCAATGATTAAAAACCATACTTATTTTCAATTTACTGGA @@CFFFFACFHHHGIIIIJJHHJJJJIJIIJJIIJJJJJIIIJJHIIJJD XM:i:0
HWI-ST273:295:C0C4PACXX:7:1101:4204:1964 1:Y:0:ACTTGA 4 * 0 0 * * NTGCTAGTCAATTGCTACACCATTTAATTGTGGAAGCAAAAGCTAAAGGT #1DBD+=2CBDDEEEDEIE@FDFEEEEEFEEEIIEDEEA@DDEEDIE? XM:i:0
Old 02-01-2012, 06:49 AM   #2
arvid
Senior Member
 
Location: Berlin

Join Date: Jul 2011
Posts: 156

"1:N:0:ACTTGA" and "1:Y:0:ACTTGA" are not in a second column but part of the read identifiers, since they are separated by a space, and not by a tab as SAM fields are (I had to look at the HTML source to find that out, though). They look like the output from the Illumina pipelines for reads and include the barcode for that read and some other stuff that I'm not sure about.

Last edited by arvid; 02-01-2012 at 06:52 AM.
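To make the tab-versus-space point concrete, here is a minimal sketch. The record is one of the unaligned lines from the first post, rebuilt by hand with explicit tabs between the SAM fields (the forum rendering collapsed them, so the exact tab positions are an assumption), and only the first few fields are included.

```shell
# One of the unaligned records, rebuilt with explicit tabs (\t) between
# SAM fields. Only the first few fields are shown; tab positions are assumed.
line=$(printf 'HWI-ST273:295:C0C4PACXX:7:1101:3762:1995 1:N:0:ACTTGA\t4\t*\t0\t0\t*')

# cut splits on tabs by default, so the space stays inside field 1.
qname=$(printf '%s\n' "$line" | cut -f1)
flag=$(printf '%s\n' "$line" | cut -f2)

echo "QNAME: $qname"
echo "FLAG:  $flag"
```

Split this way, "1:N:0:ACTTGA" ends up inside the read name and the real second field is the FLAG value 4.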
arvid is offline   Reply With Quote
Old 02-01-2012, 06:52 AM   #3
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392

So two things:

These reads are aligned, which is why they have genomic coordinates:
Code:
HWI-ST273:295:C0C4PACXX:7:1101:1554:1975 0 chr 1986354 255 50M * 0 NTAATTGTCTCTGCAATGTTATTAACCATAATATCAATTTCACCGAAACG #1=ADDBBFFFHBAGIGEHIFIHIIICGIIGCHGIIIIHIIIIIFDGIIH XA:i:2
HWI-ST273:295:C0C4PACXX:7:1101:1612:1993 0 chr 292259 255 50M * 0 ACTGCTAATGAAGTAACACAAATAGATGGCGTGGCGTCAGTTGATGAAAA @@CDFFFFFHHFHHHGIJJIHHIIICIGIIJFHGHHGGHIJEIJBHGIII XA:i:0
HWI-ST273:295:C0C4PACXX:7:1101:2526:1955 16 chr 135945 255 50M * 0 TCAGTAACGACAGTAAGTTGGCAAGCGACATTAGCCGGTTTAGTAATTGN JJGHFGHIGJJHGGIGDIGIIDJJJJIHHCIJJJJJIHHGHHDFFFD=1# XA:i:1
Whereas these are not aligned:

Code:
HWI-ST273:295:C0C4PACXX:7:1101:3762:1995 1:N:0:ACTTGA 4 * 0 0 * * TTGCTTGTTTACCAATGATTAAAAACCATACTTATTTTCAATTTACTGGA @@CFFFFACFHHHGIIIIJJHHJJJJIJIIJJIIJJJJJIIIJJHIIJJD XM:i:0
HWI-ST273:295:C0C4PACXX:7:1101:4204:1964 1:Y:0:ACTTGA 4 * 0 0 * * NTGCTAGTCAATTGCTACACCATTTAATTGTGGAAGCAAAAGCTAAAGGT #1DBD+=2CBDDEEEDEIE@FDFEEEEEFEEEIIEDEEA@DDEEDIE? XM:i:0
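What marks these two records as unaligned is the FLAG value 4: in SAM, bit 0x4 of the FLAG means "segment unmapped". A small sketch of that check follows; the function name `is_unmapped` is made up for illustration.

```shell
# Succeed (exit status 0) when the SAM FLAG value has the 0x4 "unmapped" bit set.
# is_unmapped is a hypothetical helper name, not part of any tool.
is_unmapped() {
    [ $(( $1 & 4 )) -ne 0 ]
}

# The three FLAG values that appear in the records above.
for flag in 0 16 4; do
    if is_unmapped "$flag"; then
        echo "FLAG $flag: unmapped"
    else
        echo "FLAG $flag: mapped"
    fi
done
```

FLAG 0 and FLAG 16 (reverse strand) both leave the 0x4 bit clear, so only the records with FLAG 4 are reported as unmapped.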
Secondly, these two fields, 1:N:0:ACTTGA and 1:Y:0:ACTTGA, are part of the original read data in the fastq file.
Starting with some version of Illumina CASAVA (1.8?), they made some changes. One of the changes was that reads that did not pass filter are still included in the fastq files, but every read is flagged with either 1:N:0:ACTTGA, indicating no, it did not pass filter, or 1:Y:0:ACTTGA, indicating yes, it did pass filter.

You should actually filter the fastq files and remove those flagged with an N before alignment. There is a recommended script in the Casava documentation to filter these:

Code:
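cd /path/to/project/sample
mkdir filtered
# Keep each header flagged N plus the 3 lines after it (one 4-line record),
# drop the "--" separators grep adds, and recompress to match the .gz name.
for fastq in *.fastq.gz ; do
    zcat "$fastq" \
      | grep -A 3 '^@.* [^:]*:N:[^:]*:' \
      | grep -v '^--$' \
      | gzip > "filtered/$fastq"
done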
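For anyone who would rather not depend on grep's context lines, here is a rough awk sketch of the same filter. It assumes strict 4-line FASTQ records (no wrapped sequence or quality lines), and the function name `keep_unfiltered` and the two demo records are made up for illustration.

```shell
# Keep FASTQ records whose header has N in the second comment field.
# Assumes strict 4-line records; keep_unfiltered is a hypothetical name.
keep_unfiltered() {
    awk 'NR % 4 == 1 { keep = ($0 ~ /^@[^ ]* [^:]*:N:[^:]*:/) } keep'
}

# Tiny demonstration with two made-up records: only @r1 is flagged N.
printf '%s\n' \
    '@r1 1:N:0:ACTTGA' 'ACGT' '+' 'IIII' \
    '@r2 1:Y:0:ACTTGA' 'TTTT' '+' '####' \
  | keep_unfiltered
```

Because the decision is made once per record on the header line, a quality string that happens to start with @ cannot be mistaken for a header. On real data it could take the place of the grep step, e.g. zcat "$fastq" | keep_unfiltered | gzip > "filtered/$fastq".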

Last edited by chadn737; 02-01-2012 at 07:02 AM.
Old 02-01-2012, 07:45 AM   #4
rnaeye
Member
 
Location: East Cost

Join Date: May 2011
Posts: 79

Thank you "arvid" and "chadn737". You guys made my day. I very much appreciate your help.
Best,
Old 02-01-2012, 07:54 AM   #5
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178

chadn,

There's a slight error in your description. In CASAVA 1.8+ the Y/N answers the question "Is the read filtered?", in other words, did the read FAIL filtering. This means Y = failed and N = passed. Your code fragment will keep the passed (N) reads, which is the desired outcome.

Confusion is understandable. In versions previous to 1.8 the meaning of the Y/N was reversed, and to my way of thinking more understandable.
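As a quick illustration of the 1.8+ convention, here is a sketch that reads the flag out of a header line. It assumes the comment after the first space has the layout read:is_filtered:control_number:index, and the function name `filter_status` is made up.

```shell
# Print "passed" or "failed" for a CASAVA 1.8+ style FASTQ header line.
# Assumes the text after the first space is <read>:<is filtered>:<control>:<index>.
# filter_status is a hypothetical helper name.
filter_status() {
    comment=${1#* }                                   # text after the first space
    is_filtered=$(printf '%s\n' "$comment" | cut -d: -f2)
    case "$is_filtered" in
        N) echo passed ;;    # not filtered out
        Y) echo failed ;;    # filtered out
        *) echo unknown ;;
    esac
}

# The two headers from the original question.
filter_status '@HWI-ST273:295:C0C4PACXX:7:1101:3762:1995 1:N:0:ACTTGA'
filter_status '@HWI-ST273:295:C0C4PACXX:7:1101:4204:1964 1:Y:0:ACTTGA'
```

So of the two unaligned reads in the original question, the first passed the filter and the second did not.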
Old 02-01-2012, 07:56 AM   #6
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392

Ah, thanks for the clarification.
Old 02-01-2012, 07:58 AM   #7
rnaeye
Member
 
Location: East Cost

Join Date: May 2011
Posts: 79

Thanks for making it clear.