Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • shell scripting searching directories and subdirectories

    Hi Everybody ,
    I have a directory which 8 subdirectories each subdirectory has single sra file . Now my problem i have take each sra file and input to my script which i created to converting sra to bam file . I want all my output under the same as parent directory name . How to do it ?.


    Thanks,
    Anusha.Ch

  • #2
    It sounds like you have

    upper_dir
    -dir1
    --sra_in_dir_1
    -..
    -dir8
    --sra_in_dir_8

    From NCBI (http://www.ncbi.nlm.nih.gov/Traces/s...lkit_doc&f=std)

    The following dumpers access sequence data
    sff-dump
    abi-dump
    illumina-dump
    Use sam-dump to convert SRA to SAM format. [...]
    You can convert the SRA file to FASTQ format using the fastq-dump program: fastq-dump SRR390728 will produce a file with the same name as the SRA file with the extension of .fastq[...]
    Conversion to BAM requires an additional step of converting produced SAM file to BAM using samtools from the SAMtools website http://samtools.sourceforge.net/.
    [...]
    If your SRA is aligned, then sam-dump, then convert with samtools to bam.
    If your SRA is unaligned, then you'd have to fastq-dump, map to reference, then convert sam output to bam with samtools.

    Most likely the ideal way to set up your shell script is to use a loop to go through each directory and execute on the sra found in each one.

    From a pseudo-code standpoint, it could be

    for DIR in dir1 dir2 dir3
    do
    sam-dump ./$DIR/*.sra | samtools > bam file output
    done

    Note: sam-dump output can be piped to samtools, which in turn can take streaming stdin and redirect to a bam of your choice.

    Though *.sra is rather sloppy way of capturing the name of the SRA file.

    Not sure if all SRA are mapped to a reference, this would affect what you do with the bam file. The other option is to fastq-dump, then map, then pipe the map output to samtools and convert to bam.

    Is this a homolog of (http://seqanswers.com/forums/showthread.php?t=34247)
    Last edited by winsettz; 10-04-2013, 04:22 PM.

    Comment


    • #3
      Hi ,
      I create alignment from sam to bam using fastq-dump ,bow tie then sam tools view to convert from sra to bam format( i did for one sra file in sub directory .
      I have a /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364 home directory and 8 other sub directories under it . I want iterate over 8 subdirectories and take sra files under each subdirectory repeat my shell . Don't scolde me if i am dumb question i am beginner

      Thanks,
      Anusha.Ch

      Comment


      • #4
        find . -name *.sra

        p.s. Your questions are not dumb. The problem is that you don't seem to put any effort from your own part into solving the problem. You're beginner? So maybe you should read some bash guide for beginners (or at the very least some relevant parts)?
        Last edited by rhinoceros; 10-05-2013, 03:49 AM.
        savetherhino.org

        Comment


        • #5
          Originally posted by rhinoceros View Post
          find . -name *.sra

          p.s. Your questions are not dumb. The problem is that you don't seem to put any effort from your own part into solving the problem.
          One vote for you

          Comment


          • #6
            Thank you so much guys ,
            I will read the link .


            Thanks,
            Anusha.ch

            Comment


            • #7
              Hi ,
              I understood how to use find command and list all the subdirectories in a file but now my problem is

              this is my shell script

              ./SRAtoBAM.copy.sh /raid/development/anusha/python_test/shelltest/SRR203400.sra /raid/references-and-indexes/hg19/bowtie-indexes/hg19 /raid/development/anusha/pypython_test/shelltest/SRR203400.sort.bam ---for a single sra file .

              now my find command gives me this

              find /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062 -type f -print
              /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203400/SRR203400.sra
              /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203401/SRR203401.sra
              /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203402/SRR203402.sra
              /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203403/SRR203403.sra
              /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203404/SRR203404.sra
              /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203405/SRR203405.sra
              /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203406/SRR203406.sra
              /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203407/SRR203407.sra
              /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203408/SRR203408.sra
              /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203409/SRR203409.sra

              which are list of sea files but when i want to execute my shell which takes sra files and store the output result with same as given sra ,i mean suppose SRR203408.sra is running i want to name my output SRR203408.bam

              Comment


              • #8
                sorry my question was 1/2 cut
                my find command gives me list of sra files
                anusha@cn1:~> find /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062 -type f -print
                /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203400/SRR203400.sra
                /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203401/SRR203401.sra
                /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203402/SRR203402.sra
                /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203403/SRR203403.sra
                /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203404/SRR203404.sra
                /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203405/SRR203405.sra
                /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203406/SRR203406.sra
                /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203407/SRR203407.sra
                /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203408/SRR203408.sra
                /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203409/SRR203409.sra
                but if i want to name output of my shell that is bam file with same name what shouuld I do ?.

                Thanks,
                Anusha.Ch

                Comment


                • #9
                  This would be an excellent time for you to do a bash tutorial.

                  $variable is an initialized variable (that's the dollar sign). So for instance, one can design a for-loop that changes the value of $variable for each cycle, which is critical to your problem.

                  For instance, changing variable from SRR1, SRR2, SRR3, SRR4, you could write your workflow like so:


                  for variable in SRR1 SRR2 SRR3 ...
                  {
                  fastq-dump ./blablabla/$variable/$variable.sra | stuff | stuff
                  }

                  where the for-loop repeats the contents of the for-loop (fastq-dump ./blabla...) over and over for each valuable of $variable.

                  You could probably do something like

                  for NAME in SRR203400 SRR203401 SRR203402 SRR203403 SRR203404 SRR203405 SRR203406 SRR203407 SRR203408
                  do
                  fastq-dump /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/$NAME/$NAME.sra > $NAME.fastq
                  bowtie2 -x /raid/references-and-indexes/hg19/bowtie-indexes/hg19 -U $NAME.fastq -S $NAME.sam
                  samtools view -bS $NAME.sam | samtools sort - $NAME.sorted
                  done

                  I am not sure if bowtie can take piped stdout from a command and feed it in as stdout. But you can pipe stdout directly from bowtie2 to samtools, and obviate the need to write output to a sam file before sending it to samtools.

                  I don't have any sra files on hand for me to test at the moment, but this should be enough to get the ball rolling. You already know what directory structure you have in common (/raid/databases.../SRX062364. And considering the parent name of the directory and the name of the sra are exactly the same, you can store those as a character when iterating through.

                  The only caveat here is that you're doing all of this in serial, which may not be very time-efficient. It would be more efficient to execute these as one job at a time on cluster computing.
                  Last edited by winsettz; 10-08-2013, 02:14 PM.

                  Comment


                  • #10
                    Hi ,
                    Firstly i don't want hardcode file names because latter i want to extend my shell many programs .
                    this is my shell


                    anusha@cn1:/raid/development/anusha/python_test/shelltest> cat SRAtoBAM.copy.sh
                    #!/bin/bash
                    # generating a fastq file using fastq dump
                    sraip=$1
                    hg19ref=$2
                    fastq-dump -Z $1 | bowtie -v 3 -k 2 --sam $2 - | samtools view -bS -o $3 -

                    i am not using bowtie2 but instead i am using old bow tie .

                    Is there any way to write generic instead of giving hardcore subdirectory names ?

                    Thanks,
                    Anusha.Ch

                    Comment


                    • #11
                      To be specific

                      /raid/development/anusha/python_test/shelltest> ./SRAtoBAM.wc1.sh find /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062 -type f /raid/references-and-indexes/hg19/bowtie-indexes/hg19 ?
                      what should i give if i want to name my output with same name as sra file name?
                      Thanks,
                      Anusha.ch

                      Comment


                      • #12
                        In bash, given a filename with the full path stored in a variable called $f, you can strip the path and extension using:
                        Code:
                        f=${f##*/} #Strip the path
                        ${f%.sra} #strip the extension
                        BTW, the google search for this is "bash basename" and the first has the syntax and examples (I use this in a lot of scripts but always forget the exact syntax myself since I don't actually write it frequently). Much of figuring things out is just learning where to look.

                        Comment


                        • #13
                          Originally posted by dpryan View Post
                          In bash, given a filename with the full path stored in a variable called $f, you can strip the path and extension using:
                          Code:
                          f=${f##*/} #Strip the path
                          ${f%.sra} #strip the extension
                          BTW, the google search for this is "bash basename" and the first has the syntax and examples (I use this in a lot of scripts but always forget the exact syntax myself since I don't actually write it frequently). Much of figuring things out is just learning where to look.
                          Fascinating. Learn something new everyday.

                          Comment


                          • #14
                            Originally posted by dpryan View Post
                            In bash, given a filename with the full path stored in a variable called $f, you can strip the path and extension using:
                            Code:
                            f=${f##*/} #Strip the path
                            ${f%.sra} #strip the extension
                            BTW, the google search for this is "bash basename" and the first has the syntax and examples (I use this in a lot of scripts but always forget the exact syntax myself since I don't actually write it frequently). Much of figuring things out is just learning where to look.
                            tcsh equivalent: http://stackoverflow.com/questions/1...a-path-in-tcsh

                            Comment


                            • #15
                              Hi ,
                              I #bin/bash
                              set $f =/raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203400/SRR203400.sra
                              f=${f##*/} #Strip the path
                              ${f%.sra} #strip the extension
                              $echo $f
                              It did not work for me
                              I tried to
                              set j = `basename ~/foo/bar.c`
                              echo $j
                              I did try like this also
                              set i = ~/foo/bar.c
                              I did try like this

                              #!/bin/bash
                              #basename /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRX062364/SRR203400/SRR203400.sra
                              #set f=basename /raid/databases/ROADMAP/DGF/ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX062/SRR203400/SRR203400.sra
                              filename=`rev <<< "$1" | cut -d"." -f2- | rev`
                              why it is not working ?

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              21 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X