  • #16
    I'm LMAO since it appears that I'm coming to these revelations at just about the exact same time you were/are. It's definitely comforting to see somebody figuring out the exact same stuff that I am at basically the same time. Every time I talk about this stuff here, people look at me like I'm ALF or something.



    • #17
      Do yourself a favor & swear off shell scripts for Perl/Python/Ruby -- once you start trying to have any sort of conditional programming, shell becomes a nightmare.

      Converting your script above to Perl is pretty easy (but left as an exercise for the student :-)



      • #18
        Sounds like I should finish working through the "Unix and Perl for Biologists" course to see if I can figure Perl out. I managed to figure out a solution for my problem about halfway through the Unix section, so I started with that, but maybe I shouldn't have jumped the gun so much. I'll see what I can come up with this weekend.



        • #19
          Aaah, I'm too big a softie. Also, IMHO the Perl is more readable than the shell for understanding what is going on (yikes! something less readable than Perl!)

          Personally, I prefer directory names in all lowercase. Brevity is valuable too -- "bam" & "sam" would be my choices for those directories & I probably wouldn't bother having separate ones anyway. But that is all personal preference & you should pick what suits you.

          Also: DO NOT try to use a subdirectory to backup your scripts -- it is easy to install & learn git (or a similar system) to control versions. It is much too easy to accidentally copy your file the wrong way and wipe something out; version control programs track all your changes & allow you to safely move backwards & forwards through the history of your code. They can also be a handy way to replicate code across machines (and keep them in sync) as well as a path towards collaborative coding.
          Code:
          #!/usr/bin/perl
          use strict;
          print "***Creating Pipeline Directory Structure***\n"; # \n is newline
          my $root="NGS";
          system("mkdir $root");
          
          # this line only makes directories with no subdirectories
          &makeSubdirs($root,
                                    "AnalysisNotes", "ApplicationDownloads",
                                    "InputSequence","SAMfiles");
          &makeSubdirs("$root/BAMfiles","Merged","Original","Sorted");
          &makeSubdirs("$root/FinalOutputs",
                               "AlignmentResults","Illumina","SangerFasta",
                                 "SortedBAMfiles","MergedBAMfiles");
          &makeSubdirs("$root/RefGenomes",
                             "BFAST_Indexed","BOWTIE_Indexed","GenomeDownloads");
          &makeSubdirs("$root/Scripts","ScriptBackups");  # DANGEROUS!!!
          print "Pipeline Directory Structure Created\n";
          
          sub makeSubdirs
          {
            my @subdirs=@_;
            my $parentDir=shift(@subdirs);
            mkdir($parentDir) if (! -d $parentDir);
            foreach my $subDir(@subdirs)
            {
               mkdir("$parentDir/$subDir");
            }
          }
          
          #cd ../Scripts
          #mkdir ScriptBackups
          #cd ../..
          #cp Create_NGS_DirectoryStructureV1.sh NGS/Scripts/ScriptBackups/
          #cd NGS/
          #pwd
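The version-control suggestion above can be sketched like this (a minimal demo, assuming git is installed; the directory and file names are placeholders, not part of the pipeline):

```shell
# Sketch: track script versions with git instead of a ScriptBackups folder.
# Done in a throwaway directory here; in practice you would run this inside NGS/Scripts.
dir=$(mktemp -d)
cd "$dir"
git init -q
git config user.email "you@example.com"       # placeholder identity for the demo
git config user.name "Your Name"
printf '#!/bin/sh\necho hello\n' > setup.sh   # stand-in for a real pipeline script
git add setup.sh
git commit -q -m "first working version"
git log --oneline
```

After each meaningful change, `git add` plus `git commit` records a snapshot you can diff against or roll back to, which replaces hand-copied backups entirely.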



          • #20
            Okay, still working on figuring out the Perl version, but it seems pretty straightforward aside from some of the Perl syntax. However, I'm nothing if not competitive, so here is the three-line version. Not sure if "krobinson's" Perl version is fewer characters or not, but this must win for number of lines.
            I guess reading the manuals for functions helps; mkdir -p sure helps! I'll work on figuring out git version control and a bit more Perl, but I can't wait to see krobinson's response to the 400+ line pipeline shell script. At least I took the suggestion to ditch the upper and lower case filenames, so everything is now lower case.

            NOTE - THIS SCRIPT HAS BEEN UPDATED TO VERSION 3 IN A SUBSEQUENT POST

            Code:
            #!/bin/sh
            
            #Create_NGS_DirectoryStructureV2.sh
            #Written by Jonathan Keats 04/12/2010
            
            echo ***Creating Pipeline Directory Structure***
            mkdir -p ngs/analysisnotes ngs/applicationdownloads ngs/bamfiles/merged ngs/bamfiles/original ngs/bamfiles/sorted ngs/finaloutputs/alignmentresults ngs/finaloutputs/illumina/read1 ngs/finaloutputs/illumina/read2 ngs/finaloutputs/mergedbamfiles ngs/finaloutputs/sangerfastq ngs/finaloutputs/sortedbamfiles ngs/inputsequences/illumina/read1 ngs/inputsequences/illumina/read2 ngs/inputsequences/sangerfastq ngs/refgenomes/bfast_indexed ngs/refgenomes/bowtie_indexed ngs/refgenomes/bwa_indexed ngs/refgenomes/genomedownloads ngs/samfiles ngs/scripts/scriptbackups
            cp Create_NGS_DirectoryStructureV2.sh ngs/scripts/scriptbackups/
            echo ***Pipeline Directory Structure Created***
            Last edited by Jon_Keats; 09-08-2010, 12:31 PM. Reason: Updating to match subsequent posts



            • #21
              Better metrics have to do with readability & maintainability rather than utter brevity -- it is quite possible to write extremely short code in Perl which looks like hieroglyphics.

              Your program is straightforward -- but as it grows it will get less so. One thing in particular I addressed is renaming: in your shell script some of the directories are listed twice, which is an opportunity for error should you decide to rename them.
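For the shell version, one way to remove the duplicated directory names is to hold each parent path in a variable, so a rename only happens in one place. A sketch (the root name here is hypothetical, to avoid touching a real "ngs" folder):

```shell
# Each directory name appears exactly once, so a rename means editing one line.
root=ngs_demo
bam=$root/bamfiles
out=$root/finaloutputs
mkdir -p "$bam/merged" "$bam/original" "$bam/sorted" \
         "$out/mergedbamfiles" "$out/sortedbamfiles"
ls "$bam"
```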

              P.S. Only a terminal -- no internal -- Asparagine on my name!



              • #22
                Hi all,

                Let me do my best to help out with the Perl.

                A lot of basic operations have already been written in Perl. You can easily use these bits of code. If you need something done, check http://search.cpan.org/. This goes for bioinformatics things as well: check http://www.bioperl.org for stuff like dealing with sequence files.

                For your Perl script I used File::Path (http://search.cpan.org/~jhi/perl-5.8.0/lib/File/Path.pm). It took me about 5 minutes to write.
                Code:
                #!/usr/bin/perl
                use strict;
                use File::Path; # CPAN modules are this easy to use
                
                # the dirs
                my $dirs = [ 'NGS/AnalysisNotes',
                	  'NGS/ApplicationDownloads',
                	  'NGS/InputSequence',
                	  'NGS/SAMfiles', 
                	  'NGS/BAMfiles/Merged', 'NGS/BAMfiles/Original', 'NGS/BAMfiles/Sorted', 
                	  'NGS/FinalOutputs/AlignmentResults', 'NGS/FinalOutputs/Illumina', 'NGS/FinalOutputs/SangerFasta', 'NGS/FinalOutputs/SortedBAMfiles', 'NGS/FinalOutputs/MergedBAMfiles', 
                	  'NGS/RefGenomes/BFAST_Indexed', 'NGS/RefGenomes/BOWTIE_Indexed', 'NGS/RefGenomes/GenomeDownloads',
                	  'NGS/Scripts/ScriptBackups' ];
                
                print "***Creating Pipeline Directory Structure***\n";
                
                # eval will evaluate whatever is inside the {}. Any errors are stored in $@
                # mkpath is the File::Path method that will create the given directories
                # it takes three arguments in total: the dirs, whether you wish to print the created directories (in that case, 1) or not (0) and the last argument has to do with file permissions, I omitted it
                eval { mkpath ( $dirs, 1 ); };
                
                # now check if there were errors. If so, print.
                if ( $@ ) {
                  print "Couldn't create NGS directories: $@\n";
                }
                print "Pipeline Directory Structure Created\n";
                Hope this helps?

                Cheers,
                Wil



                • #23
                  To Genome or Not to Genome

                  I'm sure I'm not the only person facing a conundrum on which genome version to use for alignment. In general I'm a believer in the "out with the old, in with the new" mentality so my gut is pushing me to use the newest genome build GRCh37/hg19/Ensembl55+ over the old version NCBI36/hg18/Ensembl38-54. Plus CCDS has been on GRCh37 for nearly a year now and they just released dbSNP131 on GRCh37 so it's getting harder and harder to ignore the newest build. BUT, for my dataset I have paired copy number data on Agilent arrays, which still use hg18 design files. So after wasting a couple of days aligning things to GRCh37 (1000 genomes version) I'm going back to redo it with hg18 so my coordinates match between the two datasets, though wasting might be the wrong word as I'm betting I'll want to be on the new version in a month. To get this genome reference file I decided to use the UCSC chromosome files (http://hgdownload.cse.ucsc.edu/golde...8/chromosomes/) since I'd seen someone mentioning the advantage of their simple naming structure. I did not use the various haplotype builds or the random contigs.
                  - Downloaded each individual chromosome file Chr1-Chr22, ChrX, ChrY, and ChrM.
                  - Placed them all in my "ngs/refgenomes/genome_downloads/hg18" folder
                  - Navigate to the folder in Terminal "cd ngs/refgenomes/genome_downloads/hg18"
                  - Decompress the files using gunzip, type "gunzip *.fa.gz"
                  - Then merge the files using unix cat command [cat (infile1) (infile2) (infileN) > (outfile)]
                  - Type "cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa > hg18.fasta"
                  - Check the output file contains all chromosomes in the correct order using the unix grep command to pull out the fasta header lines
                  - Type "grep ">" hg18.fasta"
                  - Which should output the following if correct:
                  >chr1
                  >chr2
                  >chr3
                  >chr4
                  >chr5
                  >chr6
                  >chr7
                  >chr8
                  >chr9
                  >chr10
                  >chr11
                  >chr12
                  >chr13
                  >chr14
                  >chr15
                  >chr16
                  >chr17
                  >chr18
                  >chr19
                  >chr20
                  >chr21
                  >chr22
                  >chrX
                  >chrY
                  >chrM
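If your shell is bash, the long cat command can be shortened with brace expansion, since chr{1..22} expands in numeric order. A sketch using tiny stand-in files rather than the real UCSC chromosome downloads:

```shell
mkdir -p fasta_demo && cd fasta_demo
# create 25 stand-in chromosome files (one header line each)
for c in $(seq 1 22) X Y M; do echo ">chr$c" > "chr$c.fa"; done
# chr{1..22} is bash brace expansion, not POSIX sh
cat chr{1..22}.fa chrX.fa chrY.fa chrM.fa > hg18.fasta
grep -c ">" hg18.fasta   # prints 25
```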

                  A similar process should be possible with any genome version or source, like Ensembl. Just make sure the end platform supports the version you are using. And be careful with the annotation files if you get them from UCSC, as the positions are 0-based.
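On that last point, UCSC interval annotations (e.g. BED) are 0-based and half-open, so converting a start coordinate to the 1-based system most tools display just means adding 1. A sketch with a made-up three-column file:

```shell
# hypothetical BED file: chrom, 0-based start, end
printf 'chr1\t0\t100\nchr2\t49\t60\n' > demo.bed
# shift the start column up by one to get 1-based coordinates
awk 'BEGIN{OFS="\t"} {print $1, $2+1, $3}' demo.bed
```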
                  Last edited by Jon_Keats; 09-08-2010, 12:35 PM. Reason: Found error in suggested cat command



                  • #24
                    The Shell Script from Hell or My Masterpiece you Decide

                    I'll preface this by acknowledging all the help from various people who have replied to this thread, and I promise to try to figure out Perl over shell scripts, but two weeks ago the "Hello World!" script in any programming language was Greek to me, so bear with my ignorance for a week or two more. But in the interest of sharing, here is my current bwa samse pipeline. I'm stuck at not being able to merge the sorted BAM files, but at least this lets me get back to the lab and opens up some of my time to write the backlog of papers I need to polish off so I can get my own lab and make someone else deal with these issues.
                    ***Currently, this uses all default settings so modify commands as you see fit***

                    Code:
                    #!/bin/sh
                    
                    # Batch_BWAsamse_V1.sh
                    # Created by Jonathan Keats on 04/05/10.
                    # This script is designed to take a batch of raw Illumina 1.3+ reads to sorted BAM files.
                    # It is designed to be initiated from a folder called "NGS" with a specific subdirectory structure
                    # Use the script called "Create_NGS_DirectoryStructureV1.sh" or the other variations (see Jon_Keats SeqAnswers Introduction Thread) to create the analysis directory structure
                    # To run this script you MUST first place your reference file in "NGS/RefGenomes/BWA_Indexed" and have run the "bwa index" command to create the BWT index files
                    # If you are NOT using a reference genome called "hg18.fa" you will NEED to edit lines 124 and 150 accordingly!
                    # The script is based on having ***RENAMED*** Illumina files in NGS/InputSequence/Illumina/Read1 and if available NGS/InputSequence/Illumina/Read2 if paired end runs
                    # The renamed format MUST be "YourSampleName_RunX_LaneX_R1.txt" and "YourSampleName_RunX_LaneX_R2.txt" otherwise files will be overwritten, paired-end analysis will not be possible, and lane merging will not be possible 
                    # At each step it queries specific folders for available files and passes them to the next analysis module
                    # After each step the filename extension of the output files are corrected. (ie. "MySequenceFile.txt.fastq" to "MySequenceFile.fastq")
                    # Order of Embedded Steps	- Converts Illumina 1.3+ fastq files "s_1_sequence.txt" to Sanger fastq using "maq ill2sanger" command
                    #							- Aligns created fastq files to reference genome using "bwa aln" command
                    #							- Generates SAM files from alignment files using "bwa samse" command
                    #							- Converts SAM files to BAM files using "samtools view" command
                    #							- Sorts BAM files using "samtools sort" command
                    #							- Final output files are archived then the input and analysis directories are cleaned-up and readied for the next analysis batch
                    #    ***DURING CLEAN-UP THE ENTIRE CONTENTS OF SEVERAL FOLDERS ARE DELETED, SO ONLY PLACE ADDITIONAL FILES IN THE "NGS/FINALOUTPUTS" DIRECTORY OR OUTSIDE THE NGS FOLDER SUBDIRECTORIES***
                    # The script creates a log file in "NGS/AnalysisNotes" to track the steps completed and the time each step started and finished
                    # Some of the log events will print to both the terminal screen and the log file so you can see what is going on
                    # On our workstation this will process two lanes of PE data overnight or a full run over the weekend
                    
                    # Should you merge files (lane 1 and 2) before analysis as illumina fastq files or after sorting BAM files?
                    # If you merge before analysis does this mess up the a paired-end analysis?
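                    # One possible answer (untested sketch): keep the lanes separate through alignment and
                    # sorting, then combine the coordinate-sorted per-lane BAMs with samtools merge, e.g.
                    #   samtools merge Sample1_merged.bam Sample1_Run1_Lane1_R1_bwase_sorted.bam Sample1_Run1_Lane2_R1_bwase_sorted.bam
                    # Merging after sorting also keeps the per-lane files around for troubleshooting.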
                    
                    # Carabelle - Can you have a fail safe start query to kill the script if it is not initiated from the "NGS" folder? Probably doesn't run anyways
                    # Carabelle - Also can you check if destination folders are empty and if not kill the script?
                    # Carabelle - Such as if directory NGS/SAMfiles is NOT empty kill script and print "The NGS/SAMfiles directory is not empty please remove files otherwise they will be deleted"
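                    # A possible fail-safe (untested sketch) answering the two questions above:
                    #   if [ "$(basename "$PWD")" != "NGS" ]; then
                    #       echo "Please run this script from the NGS folder"; exit 1
                    #   fi
                    #   if [ -n "$(ls -A SAMfiles 2>/dev/null)" ]; then
                    #       echo "The NGS/SAMfiles directory is not empty please remove files otherwise they will be deleted"; exit 1
                    #   fi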
                    # Ligne=Line en francais (I had a significant amount of help from our French post-doc's husband, who is a unix programmer in Paris)
                    
                    #Starting directory = NGS/
                    
                    echo ***Starting Analysis Batch***
                    date '+%m/%d/%y %H:%M:%S'
                    
                    #The following step creates the log file in the AnalysisNotes subdirectory the first time the script is run
                    #On subsequent runs the results are printed at the bottom of the pre-existing log file
                    
                    echo ***Starting Analysis Batch*** >> AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> AnalysisNotes/Analysis.log
                    
                    #In this step we will convert the original Illumina 1.3+ fastq format files to Sanger fastq format
                    #Input files from NGS/InputSequences/Illumina/Read1 and NGS/InputSequences/Illumina/Read2
                    #Output files to NGS/InputSequences/SangerFastq
                    
                    echo Starting Step1a - Read1 Fastq Conversion with maq ill2sanger
                    date '+%m/%d/%y %H:%M:%S'
                    echo Starting Step1a - Read1 Fastq Conversion with maq ill2sanger >> AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> AnalysisNotes/Analysis.log
                    cd InputSequence/Illumina/Read1
                    #Current directory = NGS/InputSequence/Illumina/Read1
                    echo Converting the following Illumina files from the Read1 folder:
                    for ligne in `ls *.txt`
                    do                                                                     
                    echo $ligne
                    done
                    echo Converting the following Illumina files from the Read1 folder: >> ../../../AnalysisNotes/Analysis.log
                    for ligne in `ls *.txt`
                    do
                    echo $ligne >> ../../../AnalysisNotes/Analysis.log
                    done
                    for ligne in `ls *.txt`
                    do
                    maq ill2sanger $ligne ../../SangerFastq/$ligne.fastq
                    done
                    echo Finished Step1a - Read1 Fastq Conversion
                    date '+%m/%d/%y %H:%M:%S'
                    echo Finished Step1a - Read1 Fastq Conversion >> ../../../AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> ../../../AnalysisNotes/Analysis.log
                    cd ../Read2
                    #Current directory = NGS/InputSequence/Illumina/Read2
                    echo Starting Step1b - Read2 Fastq Conversion with maq ill2sanger
                    date '+%m/%d/%y %H:%M:%S'
                    echo Starting Step1b - Read2 Fastq Conversion with maq ill2sanger >> ../../../AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> ../../../AnalysisNotes/Analysis.log
                    echo Converting the following Illumina files from the Read2 folder:
                    for ligne in `ls *.txt`
                    do                                                                     
                    echo $ligne
                    done
                    echo Converting the following Illumina files from the Read2 folder: >> ../../../AnalysisNotes/Analysis.log
                    for ligne in `ls *.txt`
                    do
                    echo $ligne >> ../../../AnalysisNotes/Analysis.log
                    done
                    for ligne in `ls *.txt`
                    do
                    maq ill2sanger $ligne ../../SangerFastq/$ligne.fastq
                    done
                    cd ../../SangerFastq/
                    #Current directory = NGS/InputSequence/SangerFastq
                    #The next step will change the file extensions of the Sanger Fastq outputs from "File_Run1_Lane1_R1.txt.fastq" to "File_Run1_Lane1_R1.fastq"
                    old_ext=txt.fastq
                    new_ext=fastq
                    find . -type f -name "*.$old_ext" -print | while read file
                    do
                        mv $file ${file%${old_ext}}${new_ext}
                    done
                    echo Finished Step1b - Read2 Fastq Conversion
                    date '+%m/%d/%y %H:%M:%S'
                    echo Finished Step1b - Read2 Fastq Conversion >> ../../AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
                    
                    #In the next step we will align the converted sanger fastq format files to the reference genome (hg18.fa)
                    
                    echo Starting Step2 - bwa aln process
                    date '+%m/%d/%y %H:%M:%S'
                    echo Starting Step2 - bwa aln process >> ../../AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
                    echo The following fastq files will be aligned:
                    for ligne in `ls *.fastq`
                    do                                                                     
                    echo $ligne
                    done
                    echo The following fastq files will be aligned: >> ../../AnalysisNotes/Analysis.log
                    for ligne in `ls *.fastq`
                    do
                    echo $ligne >> ../../AnalysisNotes/Analysis.log
                    done
                    for ligne in `ls *.fastq`
                    do
                    bwa aln ../../RefGenomes/BWA_Indexed/hg18.fa $ligne > $ligne.sai 	 
                    done
                    echo Finished Step2 - bwa aln process
                    date '+%m/%d/%y %H:%M:%S'
                    echo Finished Step2 - bwa aln process >> ../../AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
                    
                    #In the next step we will generate SAM files for the alignments using bwa samse
                    
                    echo Starting Step3 - bwa samse process
                    date '+%m/%d/%y %H:%M:%S'
                    echo Starting Step3 - bwa samse process >> ../../AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
                    echo The following alignment files will be converted to SAM files:
                    for ligne in `ls *.sai`
                    do                                                                     
                    echo $ligne
                    done
                    echo The following alignment files will be converted to SAM files: >> ../../AnalysisNotes/Analysis.log
                    for ligne in `ls *.sai`
                    do
                    echo $ligne >> ../../AnalysisNotes/Analysis.log
                    done
                    #To get this step to work I had to trick the system to get the conventional (bwa samse <database.fasta> <aln.sai> <input.fastq> > aln.sam) command as I can only pull up one list at a time to automatically feed files into the command
                    for ligne in `ls *.fastq`
                    do
                    bwa samse ../../RefGenomes/BWA_Indexed/hg18.fa $ligne.sai $ligne > ../../SAMfiles/$ligne.sam
                    done
                    #This step renames the bwa aln output files from "File_Run1_Lane1_R1.fastq.sai" to "File_Run1_Lane1_R1.sai"
                    #This is done on purpose at this point to allow the preceding step to work
                    old_ext=fastq.sai
                    new_ext=sai
                    find . -type f -name "*.$old_ext" -print | while read file
                    do
                        mv $file ${file%${old_ext}}${new_ext}
                    done 
                    cd ../../SAMfiles/
                    #Current directory = NGS/SAMfiles
                    #This step renames the bwa samse output files from "File_Run1_Lane1_R1.fastq.sam" to "File_Run1_Lane1_R1_bwase.sam"
                    #We put in the "_bwase" suffix to indicate the files are bwa alignments converted to SAM files with the bwa samse command
                    old_ext=.fastq.sam
                    new_ext=_bwase.sam
                    find . -type f -name "*$old_ext" -print | while read file
                    do
                        mv $file ${file%${old_ext}}${new_ext}
                    done 
                    echo Finished Step3 - bwa samse process
                    date '+%m/%d/%y %H:%M:%S'
                    echo Finished Step3 - bwa samse process >> ../AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> ../AnalysisNotes/Analysis.log
                    
                    #In the next step we will convert each SAM file to a BAM file
                    
                    echo Starting Step4 - samtools SAM to BAM conversion
                    date '+%m/%d/%y %H:%M:%S'
                    echo Starting Step4 - samtools SAM to BAM conversion >> ../AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> ../AnalysisNotes/Analysis.log
                    echo The following SAM files will be converted to BAM files:
                    for ligne in `ls *.sam`
                    do                                                                     
                    echo $ligne
                    done
                    echo The following SAM files will be converted to BAM files: >> ../AnalysisNotes/Analysis.log
                    for ligne in `ls *.sam`
                    do                                                                     
                    echo $ligne >> ../AnalysisNotes/Analysis.log
                    done
                    for ligne in `ls *.sam`
                    do
                    samtools view -bS -o ../BAMfiles/Original/$ligne.bam $ligne
                    done
                    cd ../BAMfiles/Original
                    #Current directory = NGS/BAMfiles/Original
                    #This step renames the samtools view output files from "File_Run1_Lane1_R1_bwase.sam.bam" to "File_Run1_Lane1_R1_bwase.bam"
                    old_ext=sam.bam
                    new_ext=bam
                    find . -type f -name "*.$old_ext" -print | while read file
                    do
                        mv $file ${file%${old_ext}}${new_ext}
                    done 
                    echo Finished Step4 - samtools SAM to BAM conversion
                    date '+%m/%d/%y %H:%M:%S'
                    echo Finished Step4 - samtools SAM to BAM conversion >> ../../AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
                    
                    #In the next step we will sort the BAM file by chromosome coordinate
                    
                    echo Starting Step5 - samtools BAM sorting process
                    date '+%m/%d/%y %H:%M:%S'
                    echo Starting Step5 - samtools BAM sorting process >> ../../AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
                    echo The following BAM files will be sorted:
                    for ligne in `ls *.bam`
                    do                                                                     
                    echo $ligne
                    done
                    echo The following BAM files will be sorted: >> ../../AnalysisNotes/Analysis.log
                    for ligne in `ls *.bam`
                    do                                                                     
                    echo $ligne >> ../../AnalysisNotes/Analysis.log
                    done
                    for ligne in `ls *.bam`
                    do                                                                     
                    samtools sort $ligne ../Sorted/$ligne
                    done
                    cd ../Sorted
                    #Current directory = NGS/BAMfiles/Sorted
                    #This step renames the samtools sort output files from "File_Run1_Lane1_R1_bwase.bam.bam" to "File_Run1_Lane1_R1_bwase_sorted.bam"
                    old_ext=.bam.bam
                    new_ext=_sorted.bam
                    find . -type f -name "*$old_ext" -print | while read file
                    do
                        mv $file ${file%${old_ext}}${new_ext}
                    done 
                    echo Finished Step5 - samtools BAM sorting process
                    date '+%m/%d/%y %H:%M:%S'
                    echo Finished Step5 - samtools BAM sorting process >> ../../AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
                    cd ../..
                    #Current directory = NGS/
                    
                    #In the next step we will move files from the input and analysis directories to their respective "NGS/FinalOutputs" subdirectory so the pipeline can be used another time.
                    
                    echo Cleaning up Input and Analysis Directories
                    date '+%m/%d/%y %H:%M:%S'
                    echo Cleaning up Input and Analysis Directories >> AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> AnalysisNotes/Analysis.log
                    cd InputSequence/Illumina/Read1
                    #Current directory = NGS/InputSequence/Illumina/Read1
                    echo Moving the following Illumina Fastq Files from NGS/InputSequence/Illumina/Read1 to NGS/FinalOutputs/Illumina/Read1:
                    for ligne in `ls *.txt`
                    do
                    echo $ligne
                    done
                    echo Moving Illumina Fastq Files from NGS/InputSequence/Illumina/Read1 to NGS/FinalOutputs/Illumina/Read1 >> ../../../AnalysisNotes/Analysis.log
                    for ligne in `ls *.txt`
                    do
                    echo $ligne >> ../../../AnalysisNotes/Analysis.log
                    done
                    for ligne in `ls *.txt`
                    do
                    mv $ligne ../../../FinalOutputs/Illumina/Read1/
                    done
                    cd ../Read2
                    #Current directory = NGS/InputSequence/Illumina/Read2
                    echo Moving the following Illumina Fastq Files from NGS/InputSequence/Illumina/Read2 to NGS/FinalOutputs/Illumina/Read2:
                    for ligne in `ls *.txt`
                    do
                    echo $ligne
                    done
                    echo Moving Illumina Fastq Files from NGS/InputSequence/Illumina/Read2 to NGS/FinalOutputs/Illumina/Read2 >> ../../../AnalysisNotes/Analysis.log
                    for ligne in `ls *.txt`
                    do
                    echo $ligne >> ../../../AnalysisNotes/Analysis.log
                    done
                    for ligne in `ls *.txt`
                    do
                    mv $ligne ../../../FinalOutputs/Illumina/Read2/
                    done
                    echo Moving Illumina Input Files Complete
                    date '+%m/%d/%y %H:%M:%S'
                    echo Moving Illumina Input Files Complete >> ../../../AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> ../../../AnalysisNotes/Analysis.log
                    cd ../../SangerFastq
                    #Current directory = NGS/InputSequence/SangerFastq
                    echo Moving the following Sanger Format Fastq Files from NGS/InputSequence/SangerFastq to NGS/FinalOutputs/SangerFastq:
                    echo Moving the following Sanger Format Fastq Files from NGS/InputSequence/SangerFastq to NGS/FinalOutputs/SangerFastq: >> ../../AnalysisNotes/Analysis.log
                    for ligne in *.fastq
                    do
                    echo "$ligne"
                    echo "$ligne" >> ../../AnalysisNotes/Analysis.log
                    mv "$ligne" ../../FinalOutputs/SangerFastq/
                    done
                    echo Moving Sanger Format Fastq Files Complete
                    date '+%m/%d/%y %H:%M:%S'
                    echo Moving Sanger Format Fastq Files Complete >> ../../AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
                    echo Moving the following Alignment .sai Files from NGS/InputSequence/SangerFastq to NGS/FinalOutputs/AlignmentResults:
                    echo Moving the following Alignment .sai Files from NGS/InputSequence/SangerFastq to NGS/FinalOutputs/AlignmentResults: >> ../../AnalysisNotes/Analysis.log
                    for ligne in *.sai
                    do
                    echo "$ligne"
                    echo "$ligne" >> ../../AnalysisNotes/Analysis.log
                    mv "$ligne" ../../FinalOutputs/AlignmentResults/
                    done
                    echo Moving Alignment Results Files Complete
                    date '+%m/%d/%y %H:%M:%S'
                    echo Moving Alignment Results Files Complete >> ../../AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
                    cd ../../BAMfiles/Sorted/
                    #Current directory = NGS/BAMfiles/Sorted/
                    echo Moving the following Sorted BAM files from NGS/BAMfiles/Sorted/ to NGS/FinalOutputs/SortedBAMfiles:
                    echo Moving the following Sorted BAM files from NGS/BAMfiles/Sorted/ to NGS/FinalOutputs/SortedBAMfiles: >> ../../AnalysisNotes/Analysis.log
                    for ligne in *.bam
                    do
                    echo "$ligne"
                    echo "$ligne" >> ../../AnalysisNotes/Analysis.log
                    mv "$ligne" ../../FinalOutputs/SortedBAMfiles/
                    done
                    echo Moving Sorted BAM Files Complete
                    date '+%m/%d/%y %H:%M:%S'
                    echo Moving Sorted BAM Files Complete >> ../../AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
                    cd ../../SAMfiles
                    #Current directory = NGS/SAMfiles
                    echo Deleting the following SAM Files from NGS/SAMfiles:
                    echo Deleting the following SAM Files from NGS/SAMfiles: >> ../AnalysisNotes/Analysis.log
                    for ligne in *.sam
                    do
                    echo "$ligne"
                    echo "$ligne" >> ../AnalysisNotes/Analysis.log
                    rm "$ligne"
                    done
                    echo Deleting SAM Files Complete
                    date '+%m/%d/%y %H:%M:%S'
                    echo Deleting SAM Files Complete >> ../AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> ../AnalysisNotes/Analysis.log
                    cd ../BAMfiles/Original/
                    #Current directory = NGS/BAMfiles/Original
                    echo Deleting the following BAM Files from NGS/BAMfiles/Original:
                    echo Deleting the following BAM Files from NGS/BAMfiles/Original: >> ../../AnalysisNotes/Analysis.log
                    for ligne in *.bam
                    do
                    echo "$ligne"
                    echo "$ligne" >> ../../AnalysisNotes/Analysis.log
                    rm "$ligne"
                    done
                    echo Deleting BAM Files from NGS/BAMfiles/Original Complete
                    date '+%m/%d/%y %H:%M:%S'
                    echo Deleting BAM Files from NGS/BAMfiles/Original Complete >> ../../AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
                    cd ../..
                    #Current directory = NGS/
                    echo Cleaning up Input and Analysis Directories Complete
                    date '+%m/%d/%y %H:%M:%S'
                    echo Cleaning up Input and Analysis Directories Complete >> AnalysisNotes/Analysis.log
                    date '+%m/%d/%y %H:%M:%S' >> AnalysisNotes/Analysis.log
                    echo "***Analysis Batch Complete***"
                    echo "***Analysis Batch Complete***" >> AnalysisNotes/Analysis.log
                    
                    # Future development: Merge BAM files based on filename structure "RunX_LaneX_RX", Index files for IGV viewing
                    # Figure out Perl or Python so I don't look like such an amateur :)

                    Comment


                    • #25
                      Originally posted by Jon_Keats View Post
                      The Shell Script from Hell or My Masterpiece you Decide

                      I'll preface this with acknowledging all the help from various people who have replied to this thread, and I'll promise to try and figure out Perl over shell scripts, but two weeks ago the "Hello World!" script in any programming language was "Greek" to me, so bear with my ignorance for a week or two more. But in the interest of sharing, here is my current bwa samse pipeline. I'm stuck at not being able to merge the sorted BAM files, but at least this lets me get back to the lab and opens up some of my time to write the backlog of papers I need to polish off so I can get my own lab and make someone else deal with these issues.
                      ***Currently, this uses all default settings so modify commands as you see fit***
                      lol, one of the main benefits of new-PIship is being able to tell someone else to do all the boring stuff; in fact, a lot of science revolves around finding other people to do your work.

                      That said, I do actually kind of enjoy dealing with these kinds of informatics problems and I'm glad you are providing an introduction that many folks will find useful. I wanted to point out another alignment program you haven't mentioned, Mosaik (from the Marth lab at BC), which has some of the best documentation of any of the short read aligners I've used. The other advantage of Mosaik is that it allows much higher levels of divergence between the read and the reference than BWA/Bowtie do. I've found this useful as I'm working with reads from related bacterial strains that can diverge substantially from the reference, yet still have alignments supported with high depth. Mosaik alignments can be easily converted to SAM/BAM format for viewing in IGV and SNP calling with SAMtools, and there is also an associated SNP caller (GigaBayes).
                      I am also a big fan of IGV due to its flexibility. It is relatively easy to write Perl scripts to create your own custom tracks for viewing using the GFF format. For example, I created a custom track that shows me the location of all SNPs above certain depth/quality thresholds.

                      Comment


                      • #26
                        Thank god we have CLCBio for assembly and variant detection. This post has shown how archaic the freeware tools are.

                        We can assemble and detect variants on 30x human whole exome data overnight (admittedly that is the easy part now).

                        Comment


                        • #27
                          Making Some Progress

                          Getting ready to head out for AACR and get depressed about how far ahead the big sequencing centers are in the sequencing world. I'm betting we'll see at least 100 cancer genomes being discussed; any other guesses? Well, I know of at least 40, so that might be on the low side.
                          The good news on the home front is that I've cleaned up a bunch of disk space on our workstation and will have the poor beast going for the next 3-4 days processing data. So there should be a big pile of data to sift through when I get back. One lesson I learned is that pipelines are great but you need to remember how much disk space the whole pipeline uses....or your morning starts with an error message of "0 bytes remaining". What the hell are bytes? I know of these things called megabytes, gigabytes, or terabytes, and I roughly remember kilobytes...I thought the sticker on our machine said Mac Pro not Apple IIe. Therefore, version 2 of the pipeline will need some kind of a check for the available amount of disk space relative to the number of samples to be processed. Oh goodie, one more thing to learn and I thought I was smart enough already!
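For what it's worth, the disk-space check for version 2 could be just a few lines at the top of the pipeline. This is only a sketch; the per-sample figure is a made-up placeholder you would tune to your own pipeline's actual footprint:

```shell
#!/bin/sh
# Sketch of a pre-flight disk check: abort before the pipeline starts rather
# than die mid-run with "0 bytes remaining". Numbers below are placeholders.
SAMPLES=1                # how many samples this batch will process
KB_PER_SAMPLE=1048576    # assumed scratch space per sample, in 1K blocks (here 1GB)
NEEDED=`expr $SAMPLES \* $KB_PER_SAMPLE`
# Portable df: available 1K blocks are column 4 of the second output line
AVAIL=`df -Pk . | awk 'NR==2 {print $4}'`
if [ "$AVAIL" -lt "$NEEDED" ]
then
    echo "Not enough disk space: need $NEEDED KB, have $AVAIL KB" >&2
    exit 1
fi
echo "Disk check passed: $AVAIL KB available"
```

Running it before kicking off the batch turns the mid-run crash into an up-front error message.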

                          RE: Comments

                          MOSAIK: funny you mention it, the guys were up at the Mothership giving a talk on Wednesday that we tuned into over the webcast. It sounded pretty good from what they showed, but it did sound like it is a bit too much of a memory hog for our current workstation. However, they mentioned that an upcoming release will be less memory intensive.

                          CLCBio: I'm all for turnkey solutions and would be the first to say that if you can drop thousands of dollars on data generation you need to be willing to pay for analysis...otherwise what is the point? But as a new user it seems simple for most sequencing cores to have in place basic pipelines, using open source software, that can generate the basic outputs most investigators are looking to get back.

                          Important Note: If you use the open source software packages, please remember someone put a great deal of work into even the bad ones, and many of the developers are great citizens on this forum. So if you see requests like Nils' request for grant support letters for BFAST, or similar requests from the other developers, please take a couple of minutes to help the people that help you!

                          Comment


                          • #28
                            Remove the Duplicates

                            Well, the meeting is over and I'm disappointed we didn't see a bigger impact of sequencing. Boo hoo to all the people who only talked about published datasets; what's the point of a meeting if people only talk about their Nature paper from 3 months ago...I read that already! But a couple of little birdies told me about several big, multi-patient papers that are already accepted for publication...so does that mean 40 tumor-normal pairs is an Oncogene paper now? Okay, enough complaining.

                            So if you have been following the thread, I've managed to get from some raw Illumina reads to sorted BAM files by pipelining the following commands in a Unix shell script:
                            bwa index -a bwtsw hg18.fasta (done manually as it only needs to be done once)
                            maq ill2sanger s_x_sequence.txt s_x_sequence.fastq
                            bwa aln hg18.fasta s_x_sequence.fastq > s_x_sequence.sai
                            bwa samse hg18.fasta s_x_sequence.sai s_x_sequence.fastq > s_x_sequence.sam
                            samtools view -bS -o s_x_sequence.bam s_x_sequence.sam
                            samtools sort s_x_sequence.bam s_x_sequence_sorted
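These per-lane steps could eventually be wrapped in one loop rather than written out per file. A sketch, assuming lane files named like s_1_sequence.txt, an already-indexed hg18.fasta in the working directory, and maq/bwa/samtools on the PATH:

```shell
#!/bin/sh
# Sketch: run the per-lane steps above for every Illumina lane file present.
for txt in s_*_sequence.txt
do
    [ -e "$txt" ] || continue            # plain sh has no nullglob; skip if no match
    base=${txt%.txt}                     # e.g. s_1_sequence
    maq ill2sanger "$txt" "$base.fastq"
    bwa aln hg18.fasta "$base.fastq" > "$base.sai"
    bwa samse hg18.fasta "$base.sai" "$base.fastq" > "$base.sam"
    samtools view -bS -o "$base.bam" "$base.sam"
    samtools sort "$base.bam" "${base}_sorted"   # samtools appends .bam to the prefix
done
```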

                            But I'm still stuck at having multiple BAM files per sample and having to manually merge the files after the current pipeline finishes using:

                            samtools merge PatientX_MergedBAM.bam s_1_sequence_sorted.bam s_2_sequence_sorted.bam

                            Then I create index files (needed for IGV) using:

                            samtools index PatientX_MergedBAM.bam
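The manual merge-and-index step could also be scripted. A sketch, assuming the sorted per-lane files follow the s_*_sequence_sorted.bam naming pattern; PatientX is a placeholder name you would set per patient:

```shell
#!/bin/sh
# Sketch: merge all sorted per-lane BAMs for one patient, then index for IGV.
# PATIENT is an assumed placeholder; set it per sample/patient.
PATIENT=PatientX
set -- s_*_sequence_sorted.bam           # gather the sorted lane files
if [ -e "$1" ]
then
    samtools merge "${PATIENT}_MergedBAM.bam" "$@"
    samtools index "${PATIENT}_MergedBAM.bam"
else
    echo "no sorted BAM files found to merge" >&2
fi
```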

                            And then I drop them into the IGV browser (anyone else find it is faster to navigate around with chromosome coordinates than gene names?). FYI - I use the binary distribution version so I can use more than 2GB of RAM, since I can't use the 10GB version on our 8GB workstation. I've modified the igv_mac-intel.sh script to use -Xmx6144m and then launch it from Terminal to open IGV with 6GB of RAM.

                            After looking at the data in IGV I came to appreciate how many duplicates are present in my dataset... So I'm back to Google and SEQanswers trying to figure out what to do. What I've come to understand is that most people, including Heng Li who built bwa/samtools, recommend using Picard to mark/remove duplicates over samtools, as Picard also catches duplicates whose mates map to different chromosomes. So back to sourceforge.net to download Picard (http://sourceforge.net/projects/picard/files/). I downloaded the current version, placed the folder in my "NGS/ApplicationDownloads/" folder, and decompressed it. Right now I can't figure out how to put Java .jar files on my classpath (I'm sure I can figure it out but I'm getting lazy and just copy the .jar files around as needed, bad I know).

                            I've encountered one issue: my BWA-samtools BAM files give an error because unmapped reads still have characters in the CIGAR string, which makes Picard grumpy. So you need to add one additional option to the command to set the file validation stringency to silent:

                            java -Xmx4g -jar MarkDuplicates.jar INPUT=PatientX_MergedBAM.bam OUTPUT=PatientX_MergedBAM_NoDups.bam METRICS_FILE=PatientX_NoDups_Stats.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT
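As an aside on the jar-copying problem above: java -jar takes an explicit path (the classpath is ignored when -jar is used), so one workaround is to keep a single copy of Picard and point a variable at it. A sketch; the PICARD location is a guess at my own directory layout:

```shell
#!/bin/sh
# Sketch: reference one central Picard install instead of copying jars around.
# PICARD is an assumed location; adjust to wherever the download was unpacked.
PICARD="$HOME/NGS/ApplicationDownloads/picard"
if [ -f "$PICARD/MarkDuplicates.jar" ]
then
    java -Xmx4g -jar "$PICARD/MarkDuplicates.jar" \
        INPUT=PatientX_MergedBAM.bam \
        OUTPUT=PatientX_MergedBAM_NoDups.bam \
        METRICS_FILE=PatientX_NoDups_Stats.txt \
        REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT
else
    echo "MarkDuplicates.jar not found under $PICARD" >&2
fi
```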

                            Next getting piled up in my pileup files...
                            Last edited by Jon_Keats; 09-02-2010, 07:43 PM. Reason: updated to match entire thread

                            Comment


                            • #29
                              I'll one up you...just a cleaner mkdir command.

                              #!/bin/sh

                              #Create_NGS_DirectoryStructureV3.sh
                              #Written by Ryan Golhar 05/05/2010

                              echo ***Creating Pipeline Directory Structure***
                              mkdir -p ngs/{analysisnotes,applicationdownloads,samfiles}
                              mkdir -p ngs/bamfiles/{merged,original,sorted}
                              mkdir -p ngs/finaloutputs/{alignmentresults,mergedbamfiles,sangerfastq,sortedbamfiles}
                              mkdir -p ngs/finaloutputs/illumina/{read1,read2}
                              mkdir -p ngs/inputsequences/illumina/{read1,read2,sangerfastq}
                              mkdir -p ngs/refgenomes/{bfast_indexed,bowtie_indexed,bwa_indexed,genomesdownloads}
                              mkdir -p ngs/scripts/scriptbackups

                              cp Create_NGS_DirectoryStructureV3.sh ngs/scripts/scriptbackups/
                              echo ***Pipeline Directory Structure Created***

                              Comment


                              • #30
                                Re: Removing the duplicates

                                Noticed this a second ago. It looks like you are working with single-end sequencing data, for which you wouldn't want to remove duplicates.

                                (See the SAMtools page on SourceForge, item #6.)

                                The duplicate removal across chromosomes would only apply for paired end data.

                                Comment
