SEQanswers

Old 04-18-2011, 12:11 AM   #1
chip_seq
Member
 
Location: Japan

Join Date: Mar 2011
Posts: 11
Default SRA to .csfasta

Hi All,
Does anyone know how to convert .sra files into .csfasta?
Old 04-18-2011, 11:38 PM   #2
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

You'd need to use abi-dump from the sra-toolkit.
Old 04-19-2011, 03:23 AM   #3
chip_seq
Member
 
Location: Japan

Join Date: Mar 2011
Posts: 11
Default

Hi Simon,
Thank you very much.
When I tried abi-dump I got a .csfasta file with the following format:

>SRR089316.sra.1 1_18_263_F3
T2203022222121332.03122.0.30.1.03100330.2010101.000
>SRR089316.sra.2 1_18_325_F3
T1222000000310122.13222.2.23.0.22030010.1100120.000
>SRR089316.sra.3 1_18_483_F3
T3211330120000113.00231.0.20.2.30013200.1121300.100

As you can see, each header has the file name followed by a space after the >, which I later removed. You can also see dots in the sequences (for example, T1222000000310122.13222.2.23.0.22030010.1100120.000), which I also removed, but there is still a problem in mapping. Do you have any idea?

Thanks in advance.
Old 04-19-2011, 03:46 AM   #4
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

You don't want to remove the dots. Those are locations in your read where the color could not be determined (equivalent to an N in base space). Removing the dots will create deletions which won't help your efforts to map the data.
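To illustrate the point about the dots, here is a minimal Perl sketch using one of the reads posted above. The helper name `missing_call_positions` is made up for this example; it only reports where the undetermined calls sit and shows how much shorter the read becomes if the dots are deleted, so every color after a deleted dot ends up at a different position.

```perl
use strict;
use warnings;

# Return the 0-based positions, counted after the leading primer base,
# at which a colorspace read has an undetermined call ('.').
# (Hypothetical helper, written only for this illustration.)
sub missing_call_positions {
    my ($read) = @_;
    my $colors = substr($read, 1);    # drop the leading primer base (T)
    my @positions;
    while ($colors =~ /\./g) {
        push @positions, pos($colors) - 1;
    }
    return @positions;
}

# One of the reads posted earlier in this thread
my $read = 'T1222000000310122.13222.2.23.0.22030010.1100120.000';

# What deleting the dots produces: a shorter read, not the same read
(my $stripped = $read) =~ tr/.//d;

printf "colors in original read: %d\n", length($read) - 1;
printf "colors after stripping:  %d\n", length($stripped) - 1;
printf "undetermined calls at:   %s\n", join(',', missing_call_positions($read));
```

Leaving the dots in place keeps every remaining color at its original position in the read.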

You'll need to be a bit more specific about what problems you're having in mapping. What program are you using? What command are you running and what do you get?
Old 04-20-2011, 02:11 AM   #5
chip_seq
Member
 
Location: Japan

Join Date: Mar 2011
Posts: 11
Default

Hi Simon,

Thank you.
After using abi-dump I got a .csfasta file with the following format:
>SRR089316.sra.1 1_18_263_F3
T2203022222121332.03122.0.30.1.03100330.2010101.000
>SRR089316.sra.2 1_18_325_F3
T1222000000310122.13222.2.23.0.22030010.1100120.000

When I map using Corona Lite, I run this command:

matching_large_genomes_cmap_save_script.pl -csfasta data_F3.csfasta -dir out_dir_path -cmap cmap -t 35 -e 2 -z 10

Name "Template::Filters::BASEARGS" used only once: possible typo at path/Base.pm line 49.
Name "Template::Context::BASEARGS" used only once: possible typo at path/Base.pm line 49.
Name "Template::BASEARGS" used only once: possible typo at path/Base.pm line 49.
Name "Template::Service::BASEARGS" used only once: possible typo at path/Base.pm line 49.
Name "Template::Provider::BASEARGS" used only once: possible typo at path/pathBase.pm line 49.
Name "Template::Plugins::BASEARGS" used only once: possible typo at path/Base.pm line 49.

Read Length Specified: 35, Read Length Detected: 35
Note, tempdir /scratch not found. Make sure it exists on executing nodes.

You have 4 seconds to proofread and CTRL-C if appropriate...
1,2,3,4.
Making scripts for the following:
ALIGN_1_1 ALIGN_2_1 ALIGN_3_1 ALIGN_4_1 ALIGN_5_1 ALIGN_6_1 ALIGN_7_1 ALIGN_8_1 ALIGN_9_1 ALIGN_10_1 ALIGN_11_1 ALIGN_12_1 ALIGN_13_1 ALIGN_14_1 ALIGN_15_1 ALIGN_16_1 ALIGN_17_1 ALIGN_18_1 POST_MATCHING_BY_SETS_1 POST_MATCHING_BY_CHR_1 POST_MATCHING_BY_CHR_2 POST_MATCHING_BY_CHR_3 POST_MATCHING_BY_CHR_4 POST_MATCHING_BY_CHR_5 POST_MATCHING_BY_CHR_6 POST_MATCHING_BY_CHR_7 POST_MATCHING_BY_CHR_8 POST_MATCHING_BY_CHR_9 POST_MATCHING_BY_CHR_10 POST_MATCHING_BY_CHR_11 POST_MATCHING_BY_CHR_12 POST_MATCHING_BY_CHR_13 POST_MATCHING_BY_CHR_14 POST_MATCHING_BY_CHR_15 POST_MATCHING_BY_CHR_16 POST_MATCHING_BY_CHR_17 POST_MATCHING_BY_CHR_18 POST_MATCHING_CONCAT_MATCH_FILESstats_flag = 0
POST_MATCHING_FINAL POST_MATCHING_MAKING_INDEX

In out_dir
scripts have been made. Use submit_scripts_to_XXX.pl to submit to a cluster.

After running the scripts, I got:

S[START]: 2011-04-20 17:32:44.326588000
StartTime is Wed Apr 20 17:32:44 JST 2011
Directory is /out_dir
Running on host
Job - in Queue
Preparing out_dir/scripts/output_ALIGN_1_1.txt
CORONAROOT=/path
TS[JOB_START]: 2011-04-20 17:32:44.340211000

genome_file = /home/path/Validated/chrI.fa
reads_file = path/SRR089316.sra_F3.csfasta
output_directory = /out_dir/chrI
tag_length = 50
number_of_errors = 2
schema_file = /path/schemas/DBschema
start = 0
adj_errors = 0
maximum_hits = 10
reference option = 0
offset = 0

[WARNING]: Unable to find scratch directory (/scratch).
*** mapreads will run in current directory ('/out_dir/chrI').
*** It may run very slowly. matching reads to the genome ...
running mapreads /path/SRR089316.sra_F3.csfasta /path_of_cmap/Validated/chrI.fa M=2 S=0 u=2 L=50 T=/path/schemas/DBschema A=0 O=0 Z=10 R=0 I=0 q=1 r=1 > /outdir/chrI/SRR089316.sra_F3.csfasta.ma.50.2.tmp
if [ ! $? -eq 0 ]
then echo `date` FAILURE. Making SRR089316.sra_F3.csfasta.ma.50.2.tmp failed. >&2;rm /out_dir/chrI/SRR089316.sra_F3.csfasta.ma.50.2.tmp;exit 1
else mv out_dir/chrI/SRR089316.sra_F3.csfasta.ma.50.2.tmp /out_dir/testmap_16_wed/chrI/SRR089316.sra_F3.csfasta.ma.50.2; echo `date` Making of SRR089316.sra_F3.csfasta.ma.50.2 sucessful.>&2
fi;

map start run No. 1
reads file format is wrong, expecting > sign
fail to execute command:
/path/bin/map /out_dir/SRR089316.sra_F3.csfasta / path/Validated/chrI.fa T=20 L=49 C=1 E=.Tmpfile1303288364cKkjWT F=0 D=1 np=1 V=15.000000 u=1 r=0 n=1 Z=10 P="1111111111111100000000000000000000000000000000000" M=0 U=0.000000 H=0 B=1 m=0 | gzip -3 -c -f > .Tmpfile1303288364cKkjWT.out.1 ; exit ${PIPESTATUS[0]}
Wed Apr 20 17:32:46 JST 2011 FAILURE. Making SRR089316.sra_F3.csfasta.ma.50.2.tmp failed.

ERROR: mapreads failed


Thank you in advance.
Old 04-20-2011, 04:45 AM   #6
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

Quote:
Originally Posted by chip_seq View Post
reads file format is wrong, expecting > sign
This seems to be the relevant error. The program doesn't like the format of your csfasta file. This could be something as simple as there being a blank line somewhere in the file, it could be that you have odd line endings, or there could be some other formatting problem.

I'd start by creating a small file out of the first few hundred lines of your csfasta file and checking through it for any formatting problems. If that's OK then run that through your mapping pipeline - if it works then you know that there's a formatting problem elsewhere in your file which you can track down. If it still fails then there's something more fundamentally wrong.
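The check described above could be sketched in Perl along these lines. This is only a rough sketch: the sub names `first_lines` and `check_csfasta_lines` are invented here, and it assumes the two-line layout seen in this thread (a `>` header followed by one sequence line made of the primer base T, the colors 0-3 and dots). It flags blank lines, carriage returns, non-printable characters and lines that are out of place.

```perl
use strict;
use warnings;

# Read up to $n lines from the top of a file (hypothetical helper).
sub first_lines {
    my ($file, $n) = @_;
    open(my $fh, '<', $file) or die "Can't read $file: $!";
    my @lines;
    while (my $line = <$fh>) {
        push @lines, $line;
        last if @lines >= $n;
    }
    close($fh);
    return \@lines;
}

# Return a list of problem descriptions for the given lines (array ref).
sub check_csfasta_lines {
    my ($lines) = @_;
    my @problems;
    my $expect_header = 1;
    my $line_no = 0;
    for my $line (@$lines) {
        $line_no++;
        my $text = $line;
        push @problems, "line $line_no: carriage return" if $text =~ /\r/;
        $text =~ s/[\r\n]+$//;
        if ($text eq '') {
            push @problems, "line $line_no: blank line";
            next;
        }
        push @problems, "line $line_no: non-printable character"
            if $text =~ /[^\x20-\x7E]/;
        if ($expect_header) {
            if ($text =~ /^>/) { $expect_header = 0 }
            else { push @problems, "line $line_no: expected a header starting with >" }
        }
        else {
            push @problems, "line $line_no: unexpected sequence line"
                unless $text =~ /^T[0123.]+$/;
            $expect_header = 1;
        }
    }
    return @problems;
}

# Small in-memory demonstration; the second header carries a control character
my @sample = (
    ">SRR089306.sra.57 3_32_290_F3\n",
    "T03100031011311322322323133331003223002320022233232\n",
    ">SRR089306.sra.55 3_31_1136\x10_F3\n",
    "T20320322233120100222232221320320221322203222222223\n",
);
print "$_\n" for check_csfasta_lines(\@sample);
```

On a real file, something like `check_csfasta_lines(first_lines('data_F3.csfasta', 400))` would review the top of the file; the line count of 400 is arbitrary.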
Old 04-20-2011, 06:25 PM   #7
chip_seq
Member
 
Location: Japan

Join Date: Mar 2011
Posts: 11
Default

Thank you very much.
Waiting for your answer.
Old 04-21-2011, 12:02 AM   #8
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

Quote:
Originally Posted by chip_seq View Post
Waiting for your answer.
Did you see the note I posted yesterday? There's not much else anyone here can do - you need to figure out what the formatting problem in your csfasta file is. Try searching with a small section from the top of the file which you can manually review, and then move on from there depending on what you find.
Old 04-21-2011, 02:02 AM   #9
chip_seq
Member
 
Location: Japan

Join Date: Mar 2011
Posts: 11
Default

I see. Thank you very much.
Old 04-25-2011, 07:27 PM   #10
chip_seq
Member
 
Location: Japan

Join Date: Mar 2011
Posts: 11
Default

Hi Simon,
I found this formatting error:
>SRR089306.sra.55 3_31_1136^P_F3
T20320322233120100222232221320320221322203222222223
>SRR089306.sra.56 3_32_245D�^Y_F3
T30013201101131222330001113030201223332222222222323
>SRR089306.sra.57 3_32_290_F3
T03100031011311322322323133331003223002320022233232
>SRR089306.sra.58 3_32_337@oT^Y_F3
T03321131302130332121103032223221222312223122222222
>SRR089306.sra.59 3_32_1472_F3
T00101003220302223100012023300321020222220120220222
>SRR089306.sra.60 3_32_1533oT^Y_F3
T00010310223113300302102232302301222012223122222222

Do you know why I got this formatting error and how to fix it?
Thanks in advance.
Old 04-25-2011, 11:57 PM   #11
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

You could try the following script (only lightly tested) which should find any oddly formatted entries in your file and remove them. Hopefully it should leave you with a file which you can process.

Code:
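#!/usr/bin/perl
use warnings;
use strict;

# Copy well-formed csfasta records from the input file to the output file,
# stripping unexpected characters from headers and skipping bad records.
my ($infile,$outfile) = @ARGV;

die "Usage is fix_csfasta.pl [input file] [output file]\n" unless ($outfile);

# Three-argument open with lexical filehandles
open (my $in,'<',$infile) or die "Can't read $infile: $!";
open (my $out,'>',$outfile) or die "Can't write to $outfile: $!";

while (<$in>) {

  if (/^>/) {
    # Header line: drop line endings and anything other than
    # '>', word characters, dots and spaces
    my $header = $_;
    $header =~ s/[\r\n]//g;
    $header =~ s/[^>\w\. ]//g;

    # The line after a header should be the colorspace sequence
    my $seq = <$in>;
    unless (defined $seq) {
      warn "Skipping header with no sequence '$header'\n";
      last;
    }
    $seq =~ s/[\r\n]//g;
    unless ($seq =~ /^T[0123\.]+$/) {
      warn "Skipping odd looking sequence '$seq'\n";
      next;
    }

    print $out "$header\n$seq\n";
  }
  else {
    warn "Skipping unexpected line : $_";
  }

}

close ($out) or die "Can't finish writing $outfile: $!";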
Old 04-26-2011, 02:48 AM   #12
chip_seq
Member
 
Location: Japan

Join Date: Mar 2011
Posts: 11
Default

Thank you very much.
However, I got many skipped lines. Will those skipped lines affect the output?
Skipping odd looking sequence 'Q{???_F3'
Skipping unexpected line : T03101002001200001210100000100020001210222303123002
Skipping odd looking sequence 'fj?_F3'
Skipping unexpected line : T00012002231322013012032211220223110033322330033030
Skipping odd looking sequence 'fj?_F3'
Skipping unexpected line : T21330231213330011101102123131102012101033000313322
Skipping odd looking sequence '_F3'
Skipping unexpected line : T22013201203033023103231220203232200101112233003222
Skipping odd looking sequence '_F3'
Skipping unexpected line : T33022110112231122002232221332332220102223320303320

Do you know why I got those odd looking sequences?
Thank you very much for your help.
Old 04-26-2011, 03:07 AM   #13
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

It looks like you have a load of lines where there is an extra line break in the header line. This will cause the next line (which should be the sequence) to actually be the second part of the header, and the actual sequence will be skipped as the program searches for the next valid line.

Have a look and see how many of your sequences are affected. If it's only a small proportion then don't worry about it and just use the cleaned file. If it's a high proportion of your original file then you'd need to do a more sensitive extraction of the useful data (probably by looking for lines which look like valid sequence and using those, whilst discarding the existing headers altogether).
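A rough Perl sketch of that more sensitive extraction might look like the following. The sub names and the replacement header format (`>salvaged_N_F3`) are invented for illustration, so check what header format your mapper actually requires before using anything like this. It keeps every line that looks like a valid colorspace sequence, ignores the existing headers, and counts how many sequences were directly preceded by a header line, which gives a feel for the proportion affected.

```perl
use strict;
use warnings;

# Keep every line that looks like a valid colorspace sequence, ignoring the
# existing headers. Also count the sequences that directly follow a line
# starting with '>' (that header may still contain odd characters).
sub salvage_sequences {
    my ($lines) = @_;
    my @seqs;
    my $after_header = 0;
    my $prev_was_header = 0;
    for my $line (@$lines) {
        (my $text = $line) =~ s/[\r\n]+$//;
        if ($text =~ /^T[0123.]+$/) {
            push @seqs, $text;
            $after_header++ if $prev_was_header;
        }
        $prev_was_header = ($text =~ /^>/) ? 1 : 0;
    }
    return (\@seqs, $after_header);
}

# Build csfasta text with new, made-up headers (hypothetical naming scheme).
sub format_records {
    my ($seqs, $prefix) = @_;
    my $out = '';
    my $n = 0;
    for my $seq (@$seqs) {
        $n++;
        $out .= ">${prefix}_${n}_F3\n$seq\n";
    }
    return $out;
}

# Demonstration: the first header is an invented example of a header that
# was split by a stray line break; the sequences are taken from this thread.
my @sample = (
    ">example.1 1_1_1\n",
    "_F3\n",
    "T22013201203033023103231220203232200101112233003222\n",
    ">SRR089306.sra.57 3_32_290_F3\n",
    "T03100031011311322322323133331003223002320022233232\n",
);
my ($seqs, $after_header) = salvage_sequences(\@sample);
printf "%d sequences kept, %d directly after a header\n", scalar(@$seqs), $after_header;
print format_records($seqs, 'salvaged');
```

The original read names are lost with this approach, so it only makes sense if the headers are not needed downstream.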
Old 04-26-2011, 06:56 PM   #14
chip_seq
Member
 
Location: Japan

Join Date: Mar 2011
Posts: 11
Default

Thank you very much for your kind help.
Old 05-10-2011, 07:51 PM   #15
chip_seq
Member
 
Location: Japan

Join Date: Mar 2011
Posts: 11
Default

Hi Simon,

Thank you for your help previously.
After I removed the strange characters from the sequence files and mapped them to the genome, I got 0% coverage, which suggests a severe problem, although I'm using Corona Lite with almost the same parameters as before.
Any idea?

Thank you in advance.
All times are GMT -8.

