SEQanswers

Old 08-05-2014, 11:54 PM   #1
tinguzman
Member
 
Location: Philippines

Join Date: Aug 2014
Posts: 13
SRA to fastq

Hi,

I'm trying to convert an sra file to fastq. i downloaded the sra toolkit and ran fastq-dump, here's what i got:


"Unrecognized character \xCF; marked by <-- HERE after <-- HERE near column 1 at fastq-dump line 1."

Can somebody help me? Is there any other script to convert sra files to fastq?

thanks,
christine
Old 08-06-2014, 12:21 AM   #2
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215

Hi christine,
Can you provide the command you used and say where you got the SRA file from? I don't think that fastq-dump is the problem, but rather your SRA file.
Old 08-06-2014, 01:04 AM   #3
tinguzman
Member
 
Location: Philippines

Join Date: Aug 2014
Posts: 13

Hi

Thanks for your reply. I ran this command from within the bin directory: perl fastq-dump.2.3.5.2 SRR504687.sra. I also copied the sra file into the bin directory.

I downloaded this sra file from here: http://www.ncbi.nlm.nih.gov/sra/SRX151862%5Baccn%5D

thanks,
Christine
Old 08-06-2014, 01:22 AM   #4
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215

Quote:
Originally Posted by tinguzman
perl fastq-dump.2.3.5.2 SRR504687.sra.
That's the problem. What you have is a pre-compiled binary, not a perl script. Simply running
Code:
./fastq-dump.2.3.5.2 ./SRR504687.sra
from within your bin directory should work.
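
The Perl error in post #1 is what one would expect when an interpreter is pointed at a compiled executable: Perl tries to parse the first byte of the binary as source code. (The \xCF it complains about is consistent with the first byte of a 64-bit Mach-O executable on macOS, though that is an inference, not something stated in the thread.) A rough way to check before choosing how to run a file is to look for a "#!" interpreter line at the top. The helper name is_script below is hypothetical and not part of the SRA toolkit.

```shell
# is_script FILE: succeed if FILE begins with "#!" (an interpreter line),
# i.e. it looks like a script rather than a compiled binary.
# Hypothetical helper for illustration; not part of the SRA toolkit.
is_script() {
    # Read the first two bytes as hex so binary content is handled safely.
    first_two=$(head -c 2 "$1" | od -An -tx1 | tr -d '[:space:]')
    [ "$first_two" = "2321" ]   # 0x23 0x21 is "#!"
}

# Example: choose how to launch a tool based on the check.
# if is_script ./fastq-dump.2.3.5.2; then
#     echo "script: run it with its interpreter"
# else
#     echo "binary: run it directly as ./fastq-dump.2.3.5.2"
# fi
```

If the check fails, the file is most likely a binary and should be executed directly, as in the command above.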
Old 08-06-2014, 01:25 AM   #5
tinguzman
Member
 
Location: Philippines

Join Date: Aug 2014
Posts: 13

Hi,

thanks a lot! it's working now.

best,
christine
Old 08-06-2014, 02:39 AM   #6
tinguzman
Member
 
Location: Philippines

Join Date: Aug 2014
Posts: 13

hi again!

i just finished running ./fastq-dump.2.3.5.2 -split-files ./SRR504687.sra. i have 2 output SRR504687_1 and SRR504687_2. one is 13gb while the other is only 3gb. i'm expecting that they should have the same size, for forward and reverse reads, right? correct me if i'm wrong coz i'm planning to assemble them using trinity. should I qc them first?

best,
christine
Old 08-06-2014, 02:43 AM   #7
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

You can download the compressed fastq files from ENA (here and here). That's often faster than dealing with SRA, since the SRA toolkit is painfully slow and often buggy. You'll note that their fastq files also have a different size, so this isn't something you're doing wrong.
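
The links in this post did not survive, so here is only a sketch of how such a download might be scripted. The helper ena_fastq_url is hypothetical, and the FTP directory layout it builds (first six characters of the accession, then the full accession) is an assumption based on the pattern commonly seen for nine-character run accessions. Check it against the file links on the ENA record page before relying on it.

```shell
# ena_fastq_url ACCESSION MATE: print the presumed ENA FTP URL for one mate
# file of a run.
# ASSUMPTION: the directory layout below is the commonly seen pattern for
# nine-character run accessions (e.g. SRR504687); it is not taken from the
# thread. Confirm it against the links shown on the ENA record page.
ena_fastq_url() {
    acc=$1
    mate=$2
    prefix=$(printf '%s' "$acc" | cut -c1-6)
    printf 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/%s/%s/%s_%s.fastq.gz\n' \
        "$prefix" "$acc" "$acc" "$mate"
}

# Usage (needs network access and curl):
# curl -O "$(ena_fastq_url SRR504687 1)"
# curl -O "$(ena_fastq_url SRR504687 2)"
```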
Old 08-06-2014, 02:54 AM   #8
tinguzman
Member
 
Location: Philippines

Join Date: Aug 2014
Posts: 13

Hi Ryan,

thank you very much. submitted fastq files or sra files are not yet trimmed/filtered right? they are raw sequences, for example from illumina sequencing.

best,
christine
Old 08-06-2014, 02:58 AM   #9
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

Yeah, they should be the raw files. This definitely raises the question of why the files have such different sizes. My guess is that the only person that can really answer that is the person that uploaded the files (Ana Riesgo).
Old 08-06-2014, 03:44 AM   #10
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215

Quote:
Originally Posted by tinguzman
hi again!

i just finished running ./fastq-dump.2.3.5.2 -split-files ./SRR504687.sra. i have 2 output SRR504687_1 and SRR504687_2. one is 13gb while the other is only 3gb. i'm expecting that they should have the same size, for forward and reverse reads, right? correct me if i'm wrong coz i'm planning to assemble them using trinity. should I qc them first?

best,
christine
I should have mentioned that the --split-3 option is always preferred over --split-files. With it, 1 file is generated for single-end data, 2 files (with suffixes "_1" and "_2") for paired-end data, and 3 files (suffixes "_1", "_2", and no suffix) if there are reads without mates. This way I get a 13.2Gb file ("_1") and a 7.0Gb file ("_2") from the sra.
Looking at the fastq files, you'll see that forward reads are 150bp while reverse are 48bp. I have actually never seen this for PE data before and would contact the uploader (as dpryan suggested). At least there are no missing mates... I would give it a try and use them in trinity while waiting for an answer.
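
The two checks described here (same number of reads in both mate files, and the read length in each) can be sketched with awk. The function names read_count and read_lengths are made up for this example, and both assume plain 4-line FASTQ records, which is the usual fastq-dump output but is not verified by the code.

```shell
# read_count FILE: number of records in a FASTQ file (assumes 4 lines per read).
read_count() {
    awk 'END { print NR / 4 }' "$1"
}

# read_lengths FILE: print "length count" for each distinct read length,
# sorted by length (assumes the sequence is line 2 of every 4-line record).
read_lengths() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) print len, n[len] }' "$1" | sort -n
}

# Usage, after e.g. ./fastq-dump.2.3.5.2 --split-3 ./SRR504687.sra
# [ "$(read_count SRR504687_1.fastq)" = "$(read_count SRR504687_2.fastq)" ] \
#     && echo "mate files have the same number of reads"
# read_lengths SRR504687_2.fastq
```

Equal read counts with different read lengths would match what is reported above: no missing mates, but a large difference in file size.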
Old 01-20-2015, 07:02 AM   #11
lac302
Member
 
Location: DE

Join Date: Dec 2012
Posts: 65

Has anyone had any issues downloading and splitting files from dnanexus? I'm unable to use aspera connect to download files from sra. I'm trying to replicate an experiment to test out an RNA-seq analysis pipeline but am having issues with 3 particular files. SRR639124, SRR639239, SRR639263.

The sra toolkit is working, as other files have split without issue. SRR639124 oddly splits into two files: _1 is 100bp and _2 is only 1bp? The other two will not split. The stat script shows only 1 read.

I'm wondering if it's an issue with dnanexus or the original SRA upload. Any help would be appreciated.
Old 01-20-2015, 07:19 AM   #12
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,989

Are you using the latest sratoolkit? You no longer need to download .sra files. fastq-dump from the new sratoolkit will download fastq files directly.
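
In practice this means passing a run accession instead of a path to a local .sra file. The wrapper below is a hypothetical sketch: dump_run and the DRY_RUN variable are made up for illustration, and the only fastq-dump option it uses is the --split-3 flag already discussed in this thread. It assumes fastq-dump is on the PATH and that the machine has network access.

```shell
# dump_run ACCESSION: convert a run to FASTQ by accession, letting fastq-dump
# fetch the data itself. Set DRY_RUN=1 to only print the command.
# Hypothetical wrapper for illustration; assumes fastq-dump is on the PATH.
dump_run() {
    case $1 in
        [SED]RR[0-9]*) ;;   # SRR/ERR/DRR run accessions
        *) echo "not a run accession: $1" >&2; return 2 ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "fastq-dump --split-3 $1"
    else
        fastq-dump --split-3 "$1"
    fi
}

# Usage:
# DRY_RUN=1 dump_run SRR639124   # show what would be run
# dump_run SRR639124             # actually download and convert
```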
Old 01-20-2015, 07:34 AM   #13
lac302
Member
 
Location: DE

Join Date: Dec 2012
Posts: 65

Quote:
Originally Posted by GenoMax
Are you using the latest sratoolkit? You no longer need to download .sra files. fastq-dump from the new sratoolkit will download fastq files directly.
I believe so. Using 2.4.3.

How exactly does that work? curl?
Old 01-20-2015, 07:36 AM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,989

Quote:
Originally Posted by lac302
I believe so. Using 2.4.3.

How exactly does that work? curl?
See this: http://seqanswers.com/forums/showpos...36&postcount=7

BTW: SRR639124 appears to be a single end dataset irrespective of what the SRA record says (this type of thing happens at times).
Old 01-20-2015, 07:39 AM   #15
lac302
Member
 
Location: DE

Join Date: Dec 2012
Posts: 65

Thanks Geno. I need to improve my search skills.
Old 01-20-2015, 08:01 AM   #16
lac302
Member
 
Location: DE

Join Date: Dec 2012
Posts: 65

Quote:
Originally Posted by GenoMax
See this: http://seqanswers.com/forums/showpos...36&postcount=7

BTW: SRR639124 appears to be a single end dataset irrespective of what the SRA record says (this type of thing happens at times).
There must have been an issue with the original submission. The paper this data is based on states that all the runs were paired-end.
Old 01-20-2015, 09:39 AM   #17
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,989

The --split-files option does not seem to generate the correct files for SRR639124 either.

Grab the fastq files from EBI: http://www.ebi.ac.uk/ena/data/view/SRR639124. Hopefully they are correct over there.

Note: Even the EBI version has only 1 base in the "paired" file.

Quote:
@SRR639124.1 LEOPARD:627RGAAXX:4:001:00932:10203:0:1/2
N
+
B
At least it is consistent with the SRA record (that says 100 x 1): http://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR639124

It is useful to let SRA tech support know when you find records that do not seem to have the right downloads associated with them.
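
A degenerate second mate like the 1-base "N" record quoted above can be flagged automatically before running a paired-end pipeline. This is a sketch only: max_read_length and looks_single_end are hypothetical names, the default threshold of 2 bases is an arbitrary choice, and the code assumes 4-line FASTQ records.

```shell
# max_read_length FILE: longest read in a FASTQ file (assumes 4 lines per read).
max_read_length() {
    awk 'NR % 4 == 2 && length($0) > max { max = length($0) }
         END { print max + 0 }' "$1"
}

# looks_single_end MATE2_FILE [MIN_LEN]: succeed if no read in the second mate
# file reaches MIN_LEN bases (default 2, an arbitrary illustrative threshold),
# as with the 1-base "N" records quoted above.
looks_single_end() {
    [ "$(max_read_length "$1")" -lt "${2:-2}" ]
}

# Usage:
# if looks_single_end SRR639124_2.fastq; then
#     echo "second mate file is effectively empty; treat the run as single-end"
# fi
```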

Last edited by GenoMax; 01-20-2015 at 09:49 AM.
Old 01-20-2015, 09:46 AM   #18
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,989

EBI may not have the right files for all runs either (here is an example: http://www.ebi.ac.uk/ena/data/view/SRR639239 says PAIRED but there is only one file).

Tags
fastq format, sra
