SEQanswers

Old 12-23-2009, 07:36 AM   #1
v_kisand
Member
 
Location: Eesti

Join Date: Jan 2009
Posts: 37
454 / NCBI SRA & traceinfo

Are SFF files for 454 projects available somewhere in the SRA? For recent submissions I can find only FASTQ, but I am also looking for the traceinfo XML belonging to particular short reads. I seem to remember that XML files used to be available as well?!

v.
Old 12-23-2009, 09:07 AM   #2
v_kisand
Member
 
Location: Eesti

Join Date: Jan 2009
Posts: 37

OK, I have found TraceDB again (it has been some time since I last tried to retrieve such data):
ftp://ftp.ncbi.nlm.nih.gov/pub/TraceDB

BUT

I cannot find any organisms in TraceDB that correspond to the SRR numbers.

v.
Old 12-23-2009, 11:48 AM   #3
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178

V.

The NCBI Trace Archive (TA) and the Short Read Archive (now renamed the Sequence Read Archive, or SRA) are two separate databases with separate missions. The TA was designed to store traces, sequences and metadata generated by Sanger sequencing, primarily from WGS projects. When next-gen sequencing came on the scene, the NCBI recognized that the TA design was not a good fit for this new type of massively parallel sequencing, so they designed the SRA. The SRA does not use or have traceinfo.xml files. And while data from 454 experiments are uploaded to the SRA as SFF files, you cannot download those SFF files. The SRA only makes the sequences and q-scores available for download, in the form of FASTQ files.
Old 12-24-2009, 12:56 AM   #4
v_kisand
Member
 
Location: Eesti

Join Date: Jan 2009
Posts: 37

Right, now I remember that the TA was down for a while because of next-generation data (?) and it was not possible to get data, but I did not follow the developments there... Are these FASTQ reads cleaned of adaptor sequences (454 reads)? It should be a known issue that the Roche software does not clean properly ...

I think I have found some scripts for adaptor clipping; I'll try them soon. In any case, it seems it would be much easier to run clipping on SFF files, which is not a problem with your own data.

v.
Old 12-24-2009, 06:23 AM   #5
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178

The SFF file definition includes the full flowgram and base calls plus left (5') and right (3') clipping points. The 5' end of the read is clipped for the keytag sequence (TCAG). The 3' end of the read has a number of trimming filters applied, including one which identifies the 454-B adapter sequence. The downloaded FASTQ is the trimmed sequence only.
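
As a rough illustration of how the clipping points relate to the trimmed read: the SFF read header stores four 1-based clip values (clip_qual_left, clip_qual_right, clip_adapter_left, clip_adapter_right), where 0 means "not set". Below is a minimal Python sketch of combining them. The read and the adapter in it are made up for the example, not real 454 data, and the function names are my own.

```python
def trimmed_range(number_of_bases, clip_qual_left, clip_qual_right,
                  clip_adapter_left, clip_adapter_right):
    """Return the 1-based, inclusive (first, last) positions to keep.

    A clip value of 0 means the clip point is not set.
    """
    first = max(1, clip_qual_left, clip_adapter_left)
    last = min(clip_qual_right or number_of_bases,
               clip_adapter_right or number_of_bases)
    return first, last


def apply_clips(bases, clip_qual_left, clip_qual_right,
                clip_adapter_left, clip_adapter_right):
    """Return only the bases that lie inside the trimmed range."""
    first, last = trimmed_range(len(bases), clip_qual_left, clip_qual_right,
                                clip_adapter_left, clip_adapter_right)
    return bases[first - 1:last]


# Hypothetical read: 4-base key, 10-base insert, 8 bases of made-up adapter.
raw = "TCAG" + "ACGTACGTAC" + "CTGAGACT"

# Left clip at position 5 skips the key; adapter clip ends the read at base 14.
trimmed = apply_clips(raw, clip_qual_left=5, clip_qual_right=0,
                      clip_adapter_left=0, clip_adapter_right=14)
print(trimmed)
```

With the full SFF file you could recompute or override these clip points yourself; with the FASTQ alone only the already-trimmed bases remain.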

Quote:
It should be a known issue that the Roche software does not clean properly ...
I'm not sure what you mean by this. I've never seen the 454 filter failing to remove the 454 adapter sequence. I suppose this is possible if the quality of the read was so degraded that it could not recognize the sequence, but in that case the signal/quality based filters would trim off that portion of the read.
Old 12-26-2009, 05:28 AM   #6
v_kisand
Member
 
Location: Eesti

Join Date: Jan 2009
Posts: 37

Quote:
Originally Posted by kmcarr
The SFF file definition includes the full flowgram and base calls plus left (5') and right (3') clipping points. The 5' end of the read is clipped for the keytag sequence (TCAG). The 3' end of the read has a number of trimming filters applied, including one which identifies the 454-B adapter sequence. The downloaded FASTQ is the trimmed sequence only.



I'm not sure what you mean by this. I've never seen the 454 filter failing to remove the 454 adapter sequence. I suppose this is possible if the quality of the read was so degraded that it could not recognize the sequence, but in that case the signal/quality based filters would trim off that portion of the read.
Yes, that's why I am looking for SFF files.
It seems Roche's software is not the best at clipping, or at least it used not to be. Why, I do not know; see for example the discussion in:
http://www.freelists.org/post/mira_t...aptor-clipping
Old 12-26-2009, 07:29 AM   #7
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178

Quote:
Originally Posted by v_kisand
Yes, that's why I am looking for SFF files.
It seems Roche's software is not the best at clipping, or at least it used not to be. Why, I do not know; see for example the discussion in:
http://www.freelists.org/post/mira_t...aptor-clipping
The thread you linked to is discussing clipping of adapters introduced for cDNA synthesis, specifically the SMART cDNA construction adapters. The Roche signal processing pipeline, which outputs the SFF files, was never intended to remove cloning/adapter sequences introduced by the end user; it only removes the primer from the 454 library construction which it does just fine. The Roche assembly programs (gsAssembler, gsMapper) can trim other adapter sequences provided by the user as part of their assembly or mapping process. If you are using third party software (like MIRA) then of course you will have to trim any non-Roche adapters yourself.
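
For what it is worth, trimming a user-supplied 3' adapter yourself can be sketched in a few lines of Python. This is only an illustration: it uses exact matching and a made-up adapter sequence (not the SMART adapter), it is not how gsAssembler, gsMapper or MIRA do it, and the function name and the min_overlap parameter are assumptions of the sketch. Real trimming tools normally also tolerate mismatches and indels, which matters for 454 homopolymer errors.

```python
def trim_3prime_adapter(seq, adapter, min_overlap=4):
    """Remove a 3' adapter from seq using exact matching only.

    If the whole adapter occurs, everything from its first occurrence
    onward is removed. Otherwise, if the read ends with a prefix of the
    adapter at least min_overlap bases long, that partial adapter is removed.
    """
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos]
    # Try the longest partial adapter first, down to min_overlap bases.
    longest = min(len(adapter) - 1, len(seq))
    for k in range(longest, min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:-k]
    return seq


# Hypothetical adapter and reads, for illustration only.
ADAPTER = "ACGTTGCAAT"
full = trim_3prime_adapter("GGGTTTCCC" + ADAPTER + "AAA", ADAPTER)
partial = trim_3prime_adapter("GGGTTTCCC" + ADAPTER[:5], ADAPTER)
clean = trim_3prime_adapter("GGGTTTCCC", ADAPTER)
print(full, partial, clean)
```

The min_overlap cutoff is a trade-off: a short partial match at the end of a read can occur by chance, so lowering it removes more true adapter remnants at the cost of clipping some genuine sequence.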
Old 12-28-2009, 02:13 AM   #8
v_kisand
Member
 
Location: Eesti

Join Date: Jan 2009
Posts: 37

Quote:
Originally Posted by kmcarr
The thread you linked to is discussing clipping of adapters introduced for cDNA synthesis, specifically the SMART cDNA construction adapters. The Roche signal processing pipeline, which outputs the SFF files, was never intended to remove cloning/adapter sequences introduced by the end user; it only removes the primer from the 454 library construction which it does just fine. The Roche assembly programs (gsAssembler, gsMapper) can trim other adapter sequences provided by the user as part of their assembly or mapping process. If you are using third party software (like MIRA) then of course you will have to trim any non-Roche adapters yourself.
Thanks for clarifying but what about
http://chevreux.org/uploads/media/mi...tml#section_27 ?

Maybe this TCTCCGTC is a custom adapter.

Maybe I am wrong to expect the Roche processing pipeline to take care of it, but then it is the sequence provider's problem, and data in NCBI may contain adaptors, right?

I started this discussion because I downloaded the fairly recent SRR029264 to test various assemblers, as these data should be quite similar to the data I will get soon, and I see CCGGCCAC in it. Should the SFF file contain information about such adaptors? Getting rid of these 8 bp is not a big problem, but as I am not yet deep into the topic: can NCBI short reads contain more of this kind of thing? Do uploaded data need to be cleaned, or is it OK for the database to hold them without auxiliary information (i.e. traceinfo)?
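
To get a feel for how widespread such a sequence is in a downloaded run, something like the following Python sketch could be used. It assumes a plain 4-line-per-record FASTQ and, purely as an assumption, that the 8 bp sits at the start of the read; the function names are made up, and the three records in the demo are invented, not taken from SRR029264.

```python
import io


def read_fastq(handle):
    """Yield (name, seq, qual) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def survey(handle, adapter):
    """Count reads that start with the adapter or contain it elsewhere."""
    counts = {"total": 0, "leading": 0, "elsewhere": 0}
    for _name, seq, _qual in read_fastq(handle):
        counts["total"] += 1
        if seq.startswith(adapter):
            counts["leading"] += 1
        elif adapter in seq:
            counts["elsewhere"] += 1
    return counts


def strip_leading_adapter(seq, qual, adapter):
    """Drop a leading adapter and the matching quality characters."""
    if seq.startswith(adapter):
        return seq[len(adapter):], qual[len(adapter):]
    return seq, qual


# Invented records, for illustration only.
demo = (
    "@r1\nCCGGCCACACGT\n+\nIIIIIIIIIIII\n"
    "@r2\nACGTCCGGCCAC\n+\nIIIIIIIIIIII\n"
    "@r3\nACGTACGT\n+\nIIIIIIII\n"
)
counts = survey(io.StringIO(demo), "CCGGCCAC")
print(counts)
```

If nearly every read starts with the same 8 bp, that points to something added during library construction rather than genuine sequence.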

v.

Last edited by v_kisand; 12-28-2009 at 02:16 AM.
All times are GMT -8.

