  • SRA to .csfasta

    Hi All,
    Does anyone know how to convert .sra files into .csfasta?

  • #2
    You'd need to use abi-dump from the sra-toolkit.


    • #3
      Hi Simon,
      Thank you very much
      When I tried abi-dump I got a .csfasta file with the following format:

      >SRR089316.sra.1 1_18_263_F3
      T2203022222121332.03122.0.30.1.03100330.2010101.000
      >SRR089316.sra.2 1_18_325_F3
      T1222000000310122.13222.2.23.0.22030010.1100120.000
      >SRR089316.sra.3 1_18_483_F3
      T3211330120000113.00231.0.20.2.30013200.1121300.100

      As you can see, after the > each header contains the file name followed by a space, which I removed later. The sequences also contain dots (for example T1222000000310122.13222.2.23.0.22030010.1100120.000), which I removed as well, but there is still a problem in mapping. Do you have any idea?

      Thanks in Advance


      • #4
        You don't want to remove the dots. Those are locations in your read where the color could not be determined (equivalent to an N in base space). Removing the dots will create deletions which won't help your efforts to map the data.
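To make the point about dots concrete, here is a minimal Python sketch (not part of any SOLiD tool; the helper names `count_colors` and `missing_positions` are made up for illustration). It uses the first read quoted above and shows that deleting the dots shortens the read and shifts every later color, which is why the dots should stay in place.

```python
def count_colors(read: str) -> int:
    """Number of color calls in a csfasta read, excluding the leading primer base."""
    return len(read) - 1


def missing_positions(read: str) -> list[int]:
    """1-based color positions where the call is '.' (color not determined)."""
    return [i for i, c in enumerate(read[1:], start=1) if c == "."]


# First read from the abi-dump output quoted in the thread
read = "T2203022222121332.03122.0.30.1.03100330.2010101.000"
stripped = read.replace(".", "")  # what happens if the dots are deleted

print("colors with dots kept:   ", count_colors(read))
print("colors with dots deleted:", count_colors(stripped))
print("undetermined positions:  ", missing_positions(read))
```

With the dots kept the read still has its full 50 colors, matching the tag length the mapper is told to expect. With them deleted it no longer does.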

        You'll need to be a bit more specific about what problems you're having in mapping. What program are you using? What command are you running and what do you get?


        • #5
          Hi Simon,

          Thank you.
          After using abi-dump I got a .csfasta file with the following format:
          >SRR089316.sra.1 1_18_263_F3
          T2203022222121332.03122.0.30.1.03100330.2010101.000
          >SRR089316.sra.2 1_18_325_F3
          T1222000000310122.13222.2.23.0.22030010.1100120.000

          When I map using Corona Lite I run this command:

          matching_large_genomes_cmap_save_script.pl -csfasta data_F3.csfasta -dir out_dir_path -cmap cmap -t 35 -e 2 -z 10

          Name "Template::Filters::BASEARGS" used only once: possible typo at path/Base.pm line 49.
          Name "Template::Context::BASEARGS" used only once: possible typo at path/Base.pm line 49.
          Name "Template::BASEARGS" used only once: possible typo at path/Base.pm line 49.
          Name "Template::Service::BASEARGS" used only once: possible typo at path/Base.pm line 49.
          Name "Template::Provider::BASEARGS" used only once: possible typo at path/pathBase.pm line 49.
          Name "Template::Plugins::BASEARGS" used only once: possible typo at path/Base.pm line 49.

          Read Length Specified: 35, Read Length Detected: 35
          Note, tempdir /scratch not found. Make sure it exists on executing nodes.

          You have 4 seconds to proofread and CTRL-C if appropriate...
          1,2,3,4.
          Making scripts for the following:
          ALIGN_1_1 ALIGN_2_1 ALIGN_3_1 ALIGN_4_1 ALIGN_5_1 ALIGN_6_1 ALIGN_7_1 ALIGN_8_1 ALIGN_9_1 ALIGN_10_1 ALIGN_11_1 ALIGN_12_1 ALIGN_13_1 ALIGN_14_1 ALIGN_15_1 ALIGN_16_1 ALIGN_17_1 ALIGN_18_1 POST_MATCHING_BY_SETS_1 POST_MATCHING_BY_CHR_1 POST_MATCHING_BY_CHR_2 POST_MATCHING_BY_CHR_3 POST_MATCHING_BY_CHR_4 POST_MATCHING_BY_CHR_5 POST_MATCHING_BY_CHR_6 POST_MATCHING_BY_CHR_7 POST_MATCHING_BY_CHR_8 POST_MATCHING_BY_CHR_9 POST_MATCHING_BY_CHR_10 POST_MATCHING_BY_CHR_11 POST_MATCHING_BY_CHR_12 POST_MATCHING_BY_CHR_13 POST_MATCHING_BY_CHR_14 POST_MATCHING_BY_CHR_15 POST_MATCHING_BY_CHR_16 POST_MATCHING_BY_CHR_17 POST_MATCHING_BY_CHR_18 POST_MATCHING_CONCAT_MATCH_FILESstats_flag = 0
          POST_MATCHING_FINAL POST_MATCHING_MAKING_INDEX

          In out_dir
          scripts have been made. Use submit_scripts_to_XXX.pl to submit to a cluster.

          After running the scripts I got:

          S[START]: 2011-04-20 17:32:44.326588000
          StartTime is Wed Apr 20 17:32:44 JST 2011
          Directory is /out_dir
          Running on host
          Job - in Queue
          Preparing out_dir/scripts/output_ALIGN_1_1.txt
          CORONAROOT=/path
          TS[JOB_START]: 2011-04-20 17:32:44.340211000

          genome_file = /home/path/Validated/chrI.fa
          reads_file = path/SRR089316.sra_F3.csfasta
          output_directory = /out_dir/chrI
          tag_length = 50
          number_of_errors = 2
          schema_file = /path/schemas/DBschema
          start = 0
          adj_errors = 0
          maximum_hits = 10
          reference option = 0
          offset = 0

          [WARNING]: Unable to find scratch directory (/scratch).
          *** mapreads will run in current directory ('/out_dir/chrI').
          *** It may run very slowly. matching reads to the genome ...
          running mapreads /path/SRR089316.sra_F3.csfasta /path_of_cmap/Validated/chrI.fa M=2 S=0 u=2 L=50 T=/path/schemas/DBschema A=0 O=0 Z=10 R=0 I=0 q=1 r=1 > /outdir/chrI/SRR089316.sra_F3.csfasta.ma.50.2.tmp
          if [ ! $? -eq 0 ]
          then echo `date` FAILURE. Making SRR089316.sra_F3.csfasta.ma.50.2.tmp failed. >&2;rm /out_dir/chrI/SRR089316.sra_F3.csfasta.ma.50.2.tmp;exit 1
          else mv out_dir/chrI/SRR089316.sra_F3.csfasta.ma.50.2.tmp /out_dir/testmap_16_wed/chrI/SRR089316.sra_F3.csfasta.ma.50.2; echo `date` Making of SRR089316.sra_F3.csfasta.ma.50.2 sucessful.>&2
          fi;

          map start run No. 1
          reads file format is wrong, expecting > sign
          fail to execute command:
          /path/bin/map /out_dir/SRR089316.sra_F3.csfasta / path/Validated/chrI.fa T=20 L=49 C=1 E=.Tmpfile1303288364cKkjWT F=0 D=1 np=1 V=15.000000 u=1 r=0 n=1 Z=10 P="1111111111111100000000000000000000000000000000000" M=0 U=0.000000 H=0 B=1 m=0 | gzip -3 -c -f > .Tmpfile1303288364cKkjWT.out.1 ; exit ${PIPESTATUS[0]}
          Wed Apr 20 17:32:46 JST 2011 FAILURE. Making SRR089316.sra_F3.csfasta.ma.50.2.tmp failed.

          ERROR: mapreads failed


          Thank you in advance.


          • #6
            Originally posted by chip_seq
            reads file format is wrong, expecting > sign
            This seems to be the relevant error. The program doesn't like the format of your csfasta file. This could be something as simple as there being a blank line somewhere in the file, it could be that you have odd line endings, or there could be some other formatting problem.

            I'd start by creating a small file out of the first few hundred lines of your csfasta file and checking through it for any formatting problems. If that's OK then run that through your mapping pipeline - if it works then you know that there's a formatting problem elsewhere in your file which you can track down. If it still fails then there's something more fundamentally wrong.
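The check described above could be sketched in Python along these lines. This is only an illustration under stated assumptions: the function names are hypothetical, headers are assumed to contain only word characters, dots and spaces (as in the abi-dump output shown earlier), and each record is assumed to be exactly one header line followed by one sequence line.

```python
import re
from itertools import islice

# Assumption: valid headers look like ">SRR089316.sra.1 1_18_263_F3"
HEADER_RE = re.compile(r">[\w. ]+", re.ASCII)
# Primer base followed by colors 0-3 or '.' for an undetermined color
SEQ_RE = re.compile(r"[ACGT][0-3.]+")


def check_csfasta_lines(lines):
    """Return a list of (line_number, problem) tuples for the given raw lines."""
    problems = []
    expect_header = True
    for number, raw in enumerate(lines, start=1):
        if raw.endswith("\r\n") or raw.endswith("\r"):
            problems.append((number, "non-Unix line ending"))
        line = raw.rstrip("\r\n")
        if line == "":
            problems.append((number, "blank line"))
            continue
        if line.startswith(">"):
            if not HEADER_RE.fullmatch(line):
                problems.append((number, "header contains unexpected characters"))
            if not expect_header:
                problems.append((number, "header where a sequence was expected"))
            expect_header = False
        else:
            if expect_header:
                problems.append((number, "line without a preceding header"))
            elif not SEQ_RE.fullmatch(line):
                problems.append((number, "malformed sequence line"))
            expect_header = True
    return problems


def check_file_head(path, n_lines=400):
    """Check only the first n_lines of a file, keeping line endings untouched."""
    # latin-1 decodes any byte, so stray binary characters do not raise an error
    with open(path, newline="", encoding="latin-1") as handle:
        return check_csfasta_lines(islice(handle, n_lines))
```

Calling `check_file_head("data_F3.csfasta")` (the file name is taken from the command above) returns an empty list if the first few hundred lines look clean. Otherwise it gives the line numbers to inspect by hand.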


            • #7
              Thank you very much.
              Waiting for your answer.


              • #8
                Originally posted by chip_seq
                Waiting for your answer.
                Did you see the note I posted yesterday? There's not much else anyone here can do - you need to figure out what the formatting problem in your csfasta file is. Try searching with a small section from the top of the file which you can manually review, and then move on from there depending on what you find.


                • #9
                  I see. Thank you very much.


                  • #10
                    Hi Simon,
                    I found this formatting error:
                    >SRR089306.sra.55 3_31_1136^P_F3
                    T20320322233120100222232221320320221322203222222223
                    >SRR089306.sra.56 3_32_245D�^Y_F3
                    T30013201101131222330001113030201223332222222222323
                    >SRR089306.sra.57 3_32_290_F3
                    T03100031011311322322323133331003223002320022233232
                    >SRR089306.sra.58 3_32_337@oT^Y_F3
                    T03321131302130332121103032223221222312223122222222
                    >SRR089306.sra.59 3_32_1472_F3
                    T00101003220302223100012023300321020222220120220222
                    >SRR089306.sra.60 3_32_1533oT^Y_F3
                    T00010310223113300302102232302301222012223122222222

                    Do you know why I got this formatting error and how to fix it?
                    Thanks in Advance


                    • #11
                      You could try the following script (only lightly tested) which should find any oddly formatted entries in your file and remove them. Hopefully it should leave you with a file which you can process.

                      Code:
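                      #!/usr/bin/perl
                      use warnings;
                      use strict;

                      # Usage: fix_csfasta.pl [input file] [output file]
                      my ($infile,$outfile) = @ARGV;

                      die "Usage is fix_csfasta.pl [input file] [output file]\n" unless ($outfile);

                      open (my $in,'<',$infile) or die "Can't read $infile: $!";
                      open (my $out,'>',$outfile) or die "Can't write to $outfile: $!";

                      while (<$in>) {

                        if (/^>/) {
                          # Header line: strip line endings, then any character which
                          # isn't '>', a word character, a dot or a space
                          my $header = $_;
                          chomp $header;
                          $header =~ s/[\r\n]//g;
                          $header =~ s/[^>\w_\. ]//g;

                          # The line after a header should be the colour space sequence
                          my $seq = <$in>;
                          last unless (defined $seq); # header on the last line of the file
                          chomp $seq;
                          $seq =~ s/[\r\n]//g;
                          unless ($seq =~ /^T[0123\.]+$/) {
                            warn "Skipping odd looking sequence '$seq'\n";
                            next;
                          }

                          print $out "$header\n$seq\n";

                        }
                        else {
                          # Anything else which isn't directly after a header is dropped
                          warn "Skipping unexpected line : $_";
                        }

                      }

                      close $out or die "Can't write to $outfile: $!";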


                      • #12
                        Thank you very much.
                        However, I got many skipped lines. Will those skipped lines affect the output?
                        Skipping odd looking sequence 'Q{???_F3'
                        Skipping unexpected line : T03101002001200001210100000100020001210222303123002
                        Skipping odd looking sequence 'fj?_F3'
                        Skipping unexpected line : T00012002231322013012032211220223110033322330033030
                        Skipping odd looking sequence 'fj?_F3'
                        Skipping unexpected line : T21330231213330011101102123131102012101033000313322
                        Skipping odd looking sequence '_F3'
                        Skipping unexpected line : T22013201203033023103231220203232200101112233003222
                        Skipping odd looking sequence '_F3'
                        Skipping unexpected line : T33022110112231122002232221332332220102223320303320

                        Do you know why I got those odd looking sequences?
                        Thank you very much for your help.


                        • #13
                          It looks like you have a load of lines where there is an extra line break in the header line. This will cause the next line (which should be the sequence) to actually be the second part of the header, and the actual sequence will be skipped as the program searches for the next valid line.

                          Have a look and see how many of your sequences are affected. If it's only a small proportion then don't worry about it and just use the cleaned file. If it's a high proportion of your original file then you'd need to do a more sensitive extraction of the useful data (probably by looking for lines which look like valid sequence and using those, whilst discarding the existing headers altogether).
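A rough Python sketch of both steps described above (counting how many reads are affected, then the more sensitive extraction) might look like the following. The function names and the replacement header format `>read_N_F3` are invented for illustration. Whether the mapper accepts such synthetic headers is an assumption that needs checking, since the original headers carry the bead coordinates.

```python
import re

# A valid colour space read: primer base T followed by colors 0-3 or '.'
SEQ_RE = re.compile(r"T[0-3.]+")


def count_affected(lines):
    """Return (total, kept): all sequence-like lines, and those that directly
    follow a line starting with '>' (the ones a header-then-sequence filter keeps)."""
    total = kept = 0
    previous = ""
    for raw in lines:
        line = raw.strip()
        if SEQ_RE.fullmatch(line):
            total += 1
            if previous.startswith(">"):
                kept += 1
        previous = line
    return total, kept


def salvage_reads(lines, expected_colors=None, prefix="read"):
    """Yield (header, sequence) for every line that looks like a valid read.
    Existing headers are ignored; new hypothetical ones are numbered in order."""
    count = 0
    for raw in lines:
        line = raw.strip()
        if not SEQ_RE.fullmatch(line):
            continue
        if expected_colors is not None and len(line) - 1 != expected_colors:
            continue  # wrong read length, probably truncated
        count += 1
        yield f">{prefix}_{count}_F3", line


# Example shaped like the broken records reported in the thread
sample = [
    ">SRR089306.sra.57 3_32_290_F3\n",
    "T03100031011311322322323133331003223002320022233232\n",
    ">SRR089306.sra.58 3_32_337@oT\n",
    "fj?_F3\n",
    "T03321131302130332121103032223221222312223122222222\n",
]
total, kept = count_affected(sample)
print(f"{total - kept} of {total} reads would be lost by the simple filter")
for header, seq in salvage_reads(sample, expected_colors=50):
    print(header)
    print(seq)
```

If `total - kept` is a small fraction of `total`, the cleaned file is probably good enough. If not, `salvage_reads` recovers every sequence line at the cost of the original read names.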


                          • #14
                            Thank you very much for your kind help


                            • #15
                              Hi Simon,

                              Thank you for your help previously.
                              After I removed the strange characters from the sequence files and mapped them to the genome, I got 0% coverage, which suggests a severe problem, even though I'm using Corona Lite with almost the same parameters as before.
                              Any idea?

                              Thank you in advance
