I'm LMAO since it appears that I'm coming to these revelations at just about the exact same time you were/are. It's definitely comforting to see somebody figuring out the exact same stuff that I am at basically the same time. Every time I talk about this stuff here, people look at me like I'm ALF or something.
-
Do yourself a favor & swear off shell scripts in favor of Perl/Python/Ruby -- once you start trying to do any sort of conditional programming, shell becomes a nightmare.
Converting your script above to Perl is pretty easy (but left as an exercise for the student :-)
Comment
-
Sounds like I should finish working through the "Unix and Perl for Biologists" course to see if I can figure Perl out. I managed to figure out a solution to my problem about halfway through the Unix section, so I started with that, but maybe I shouldn't have jumped the gun so much. I'll see what I can come up with this weekend.
Comment
-
Aaah, I'm too big a softie. Also, IMHO the Perl is more readable than shell for understanding what is going on (yikes! something less readable than Perl!).
Personally, I prefer directory names in all lowercase. Brevity is valuable too -- "bam" & "sam" would be my choices for those directories & I probably wouldn't bother having separate ones anyway. But that is all personal preference & you should pick what suits you.
Also: DO NOT try to use a subdirectory to back up your scripts -- it is easy to install & learn git (or a similar system) to control versions. It is much too easy to accidentally copy your file the wrong way and wipe something out; version control programs track all your changes & allow you to safely move backwards & forwards through the history of your code. They can also be a handy way to replicate code across machines (and keep them in sync) as well as a path towards collaborative coding.
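To make the git suggestion concrete, here is a minimal sketch (assuming git is already installed; the script name is just an example):
Code:
cd NGS/Scripts
git init                                    # turn the directory into a repository
git add Create_NGS_DirectoryStructureV1.sh
git commit -m "Initial version of directory-setup script"
# after each round of edits:
git commit -am "Describe what you changed"
git log --oneline                           # browse the history of your changes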
Code:
#!/usr/bin/perl
use strict;

print "***Creating Pipeline Directory Structure***\n"; # \n is newline
my $root="NGS";
system("mkdir $root"); # this line only makes directories with no subdirectories
&makeSubdirs($root, "AnalysisNotes", "ApplicationDownloads", "InputSequence", "SAMfiles");
&makeSubdirs("$root/BAMfiles", "Merged", "Original", "Sorted");
&makeSubdirs("$root/FinalOutputs", "AlignmentResults", "Illumina", "SangerFasta", "SortedBAMfiles", "MergedBAMfiles");
&makeSubdirs("$root/RefGenomes", "BFAST_Indexed", "BOWTIE_Indexed", "GenomeDownloads");
&makeSubdirs("$root/Scripts", "ScriptBackups"); # DANGEROUS!!!
print "Pipeline Directory Structure Created\n";

sub makeSubdirs {
    my @subdirs=@_;
    my $parentDir=shift(@subdirs);
    mkdir($parentDir) if (! -d $parentDir);
    foreach my $subDir (@subdirs) {
        mkdir("$parentDir/$subDir");
    }
}

#cd ../Scripts
#mkdir ScriptBackups
#cd ../..
#cp Create_NGS_DirectoryStructureV1.sh NGS/Scripts/ScriptBackups/
#cd NGS/
#pwd
Comment
-
Okay, still working on figuring out the Perl version, but it seems pretty obvious apart from some of the Perl syntax. However, I'm nothing if not competitive, so here is the three-line version. Not sure if krobinson's Perl version is fewer characters or not, but this must win for number of lines.
I guess reading the manuals for functions helps -- mkdir -p sure helps! I'll work on figuring out git version control and a bit more Perl, but I can't wait to see krobinson's response to the 400+ line pipeline shell script. At least I took the suggestion to ditch the upper- and lowercase filenames, so everything is now lowercase.
NOTE - THIS SCRIPT HAS BEEN UPDATED TO VERSION 3 IN A SUBSEQUENT POST
Code:
#!/bin/sh
#Create_NGS_DirectoryStructureV2.sh
#Written by Jonathan Keats 04/12/2010
echo ***Creating Pipeline Directory Structure***
mkdir -p ngs/analysisnotes ngs/applicationdownloads \
    ngs/bamfiles/merged ngs/bamfiles/original ngs/bamfiles/sorted \
    ngs/finaloutputs/alignmentresults ngs/finaloutputs/illumina/read1 ngs/finaloutputs/illumina/read2 \
    ngs/finaloutputs/mergedbamfiles ngs/finaloutputs/sangerfastq ngs/finaloutputs/sortedbamfiles \
    ngs/inputsequences/illumina/read1 ngs/inputsequences/illumina/read2 ngs/inputsequences/sangerfastq \
    ngs/refgenomes/bfast_indexed ngs/refgenomes/bowtie_indexed ngs/refgenomes/bwa_indexed ngs/refgenomes/genomedownloads \
    ngs/samfiles ngs/scripts/scriptbackups
cp Create_NGS_DirectoryStructureV2.sh ngs/scripts/scriptbackups/
echo ***Pipeline Directory Structure Created***
Comment
-
Better metrics have to do with readability & maintainability rather than utter brevity -- it is quite possible to write extremely short code in Perl which looks like hieroglyphics.
Your program is straightforward -- but as it grows it will get less so. One thing in particular I addressed is what happens if you change the name of certain of your directories -- in your shell script some of the directories are listed twice, which is an opportunity for error should you decide to rename them.
P.S. Only a terminal -- no internal -- Asparagine on my name!
Comment
-
Hi all,
Let me do my best to help out with the Perl.
A lot of basic operations have already been written in Perl. You can easily use these bits of code. If you need something done, check http://search.cpan.org/. This goes for bioinformatics things as well: check http://www.bioperl.org for stuff like dealing with sequence files.
For your Perl script I used File::Path (http://search.cpan.org/~jhi/perl-5.8.0/lib/File/Path.pm). It took me about 5 minutes to write.
Code:
#!/usr/bin/perl
use strict;
use File::Path; # CPAN modules are this easy to use

# the dirs
my $dirs = [
    'NGS/AnalysisNotes',
    'NGS/ApplicationDownloads',
    'NGS/InputSequence',
    'NGS/SAMfiles',
    'NGS/BAMfiles/Merged',
    'NGS/BAMfiles/Original',
    'NGS/BAMfiles/Sorted',
    'NGS/FinalOutputs/AlignmentResults',
    'NGS/FinalOutputs/Illumina',
    'NGS/FinalOutputs/SangerFasta',
    'NGS/FinalOutputs/SortedBAMfiles',
    'NGS/FinalOutputs/MergedBAMfiles',
    'NGS/RefGenomes/BFAST_Indexed',
    'NGS/RefGenomes/BOWTIE_Indexed',
    'NGS/RefGenomes/GenomeDownloads',
    'NGS/Scripts/ScriptBackups'
];

print "***Creating Pipeline Directory Structure***\n";

# eval will evaluate whatever is inside the {}. Any errors are stored in $@
# mkpath is the File::Path method that will create the given directories
# it takes three arguments in total: the dirs, whether you wish to print the
# created directories (in that case, 1) or not (0), and the last argument has
# to do with file permissions; I omitted it
eval {
    mkpath( $dirs, 1 );
};

# now check if there were errors. If so, print.
if ( $@ ) {
    print "Couldn't create NGS directories: $@\n";
}

print "Pipeline Directory Structure Created\n";
Cheers,
Wil
Comment
-
To Genome or Not to Genome
I'm sure I'm not the only person facing a conundrum about which genome version to use for alignment. In general I'm a believer in the "out with the old, in with the new" mentality, so my gut is pushing me to use the newest genome build (GRCh37/hg19/Ensembl55+) over the old version (NCBI36/hg18/Ensembl38-54). Plus, CCDS has been on GRCh37 for nearly a year now, and they just released dbSNP131 on GRCh37, so it's getting harder and harder to ignore the newest build. BUT, for my dataset I have paired copy-number data on Agilent arrays, which still use hg18 design files. So after wasting a couple of days aligning things to GRCh37 (1000 Genomes version), I'm going back to redo it with hg18 so my coordinates match between the two datasets -- though "wasting" might be the wrong word, as I'm betting I'll want to be on the new version in a month. To get this genome reference file I decided to use the UCSC chromosome files (http://hgdownload.cse.ucsc.edu/golde...8/chromosomes/) since I'd seen someone mention the advantage of their simple naming structure. I did not use the various haplotype builds or the random contigs.
- Downloaded each individual chromosome file Chr1-Chr22, ChrX, ChrY, and ChrM.
- Placed them all in my "ngs/refgenomes/genome_downloads/hg18" folder
- Navigate to the folder in Terminal "cd ngs/refgenomes/genome_downloads/hg18"
- Decompress the files using gunzip, type "gunzip *.fa.gz"
- Then merge the files using unix cat command [cat (infile1) (infile2) (infileN) > (outfile)]
- Type "cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa > hg18.fasta"
- Check the output file contains all chromosomes in the correct order using the Unix grep command to pull out the fasta header lines
- Type "grep ">" hg18.fasta"
- Which should output the following if correct:
>chr1
>chr2
>chr3
>chr4
>chr5
>chr6
>chr7
>chr8
>chr9
>chr10
>chr11
>chr12
>chr13
>chr14
>chr15
>chr16
>chr17
>chr18
>chr19
>chr20
>chr21
>chr22
>chrX
>chrY
>chrM
A similar process should be possible with any genome version or source, like Ensembl. Just make sure the end platform supports the version you are using. And be careful with the annotation files if you get them from UCSC, as the positions are 0-based. (The download/merge steps above can also be scripted; see the sketch below.)
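A minimal sketch of the scripted version, assuming the standard UCSC goldenPath layout for hg18 and that wget and seq are available on your machine:
Code:
#!/bin/sh
BASE=http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes
cd ngs/refgenomes/genome_downloads/hg18
files=""
for chr in `seq 1 22` X Y M
do
    wget $BASE/chr$chr.fa.gz
    files="$files chr$chr.fa"
done
gunzip *.fa.gz
# concatenate in karyotype order rather than shell glob order
cat $files > hg18.fasta
grep ">" hg18.fasta    # verify the header order matches the list above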
Comment
-
The Shell Script from Hell or My Masterpiece you Decide
I'll preface this by acknowledging all the help from the various people who have replied to this thread, and I'll promise to try and figure out Perl over shell scripts. But two weeks ago the "Hello World!" script in any programming language was Greek to me, so bear with my ignorance for a week or two more. In the interest of sharing, here is my current bwa samse pipeline. I'm stuck at not being able to merge the sorted BAM files, but at least this lets me get back to the lab and opens up some of my time to write the backlog of papers I need to polish off so I can get my own lab and make someone else deal with these issues.
***Currently, this uses all default settings, so modify commands as you see fit***
Code:
#!/bin/sh
# Batch_BWAsamse_V1.sh
# Created by Jonathan Keats on 04/05/10.
# This script is designed to take a batch of raw Illumina 1.3+ reads to sorted BAM files.
# It is designed to be initiated from a folder called "NGS" with a specific subdirectory structure
# Use the script called "Create_NGS_DirectoryStructureV1.sh" or the other variations (see Jon_Keats SeqAnswers Introduction Thread) to create the analysis directory structure
# To run this script you MUST first place your reference file in "NGS/RefGenomes/BWA_Indexed" and have run the "bwa index" command to create the BWT index files
# If you are NOT using a reference genome called "hg18.fa" you will NEED to edit the "bwa aln" and "bwa samse" commands below accordingly!
# The script is based on having ***RENAMED*** Illumina files in NGS/InputSequence/Illumina/Read1 and, if available, NGS/InputSequence/Illumina/Read2 for paired-end runs
# The renamed format MUST be "YourSampleName_RunX_LaneX_R1.txt" and "YourSampleName_RunX_LaneX_R2.txt" otherwise files will be overwritten, paired-end analysis will not be possible, and lane merging will not be possible
# At each step it queries specific folders for available files and passes them to the next analysis module
# After each step the filename extension of the output files is corrected (i.e. "MySequenceFile.txt.fastq" to "MySequenceFile.fastq")
# Order of Embedded Steps:
# - Converts Illumina 1.3+ fastq files "s_1_sequence.txt" to Sanger fastq using "maq ill2sanger" command
# - Aligns created fastq files to reference genome using "bwa aln" command
# - Generates SAM files from alignment files using "bwa samse" command
# - Converts SAM files to BAM files using "samtools view" command
# - Sorts BAM files using "samtools sort" command
# - Final output files are archived then the input and analysis directories are cleaned up and readied for the next analysis batch
# ***DURING CLEAN-UP THE ENTIRE CONTENTS OF SEVERAL FOLDERS ARE DELETED, SO ONLY PLACE ADDITIONAL FILES IN THE "NGS/FINALOUTPUTS" DIRECTORY OR OUTSIDE THE NGS FOLDER SUBDIRECTORIES***
# The script creates a log file in "NGS/AnalysisNotes" to track the steps completed and the time each step started and finished
# Some of the log events will print to both the terminal screen and the log file so you can see what is going on
# On our workstation this will process two lanes of PE data overnight or a full run over the weekend
# Should you merge files (lane 1 and 2) before analysis as Illumina fastq files or after sorting BAM files?
# If you merge before analysis does this mess up a paired-end analysis?
# Carabelle - Can you have a fail-safe start query to kill the script if it is not initiated from the "NGS" folder? Probably doesn't run anyways
# Carabelle - Also can you check if destination folders are empty and if not kill the script?
# Carabelle - Such as if directory NGS/SAMfiles is NOT empty kill script and print "The NGS/SAMfiles directory is not empty please remove files otherwise they will be deleted"
# Ligne = "line" in French (I had a significant amount of help from our French post-doc's husband who is a Unix programmer in Paris)

#Starting directory = NGS/
echo ***Starting Analysis Batch***
date '+%m/%d/%y %H:%M:%S'
#The following step creates the log file in the AnalysisNotes subdirectory the first time the script is run
#On subsequent runs the results are printed at the bottom of the pre-existing log file
echo ***Starting Analysis Batch*** >> AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> AnalysisNotes/Analysis.log

#In this step we will convert the original Illumina 1.3+ fastq format files to Sanger fastq format
#Input files from NGS/InputSequence/Illumina/Read1 and NGS/InputSequence/Illumina/Read2
#Output files to NGS/InputSequence/SangerFastq
echo Starting Step1a - Read1 Fastq Conversion with maq ill2sanger
date '+%m/%d/%y %H:%M:%S'
echo Starting Step1a - Read1 Fastq Conversion with maq ill2sanger >> AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> AnalysisNotes/Analysis.log
cd InputSequence/Illumina/Read1
#Current directory = NGS/InputSequence/Illumina/Read1
echo Converting the following Illumina files from the Read1 folder:
for ligne in `ls *.txt`
do
    echo $ligne
done
echo Converting the following Illumina files from the Read1 folder: >> ../../../AnalysisNotes/Analysis.log
for ligne in `ls *.txt`
do
    echo $ligne >> ../../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.txt`
do
    maq ill2sanger $ligne ../../SangerFastq/$ligne.fastq
done
echo Finished Step1a - Read1 Fastq Conversion
date '+%m/%d/%y %H:%M:%S'
echo Finished Step1a - Read1 Fastq Conversion >> ../../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../../AnalysisNotes/Analysis.log
cd ../Read2
#Current directory = NGS/InputSequence/Illumina/Read2
echo Starting Step1b - Read2 Fastq Conversion with maq ill2sanger
date '+%m/%d/%y %H:%M:%S'
echo Starting Step1b - Read2 Fastq Conversion with maq ill2sanger >> ../../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../../AnalysisNotes/Analysis.log
echo Converting the following Illumina files from the Read2 folder:
for ligne in `ls *.txt`
do
    echo $ligne
done
echo Converting the following Illumina files from the Read2 folder: >> ../../../AnalysisNotes/Analysis.log
for ligne in `ls *.txt`
do
    echo $ligne >> ../../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.txt`
do
    maq ill2sanger $ligne ../../SangerFastq/$ligne.fastq
done
cd ../../SangerFastq/
#Current directory = NGS/InputSequence/SangerFastq
#The next step will change the file extensions of the Sanger Fastq outputs from "File_Run1_Lane1_R1.txt.fastq" to "File_Run1_Lane1_R1.fastq"
old_ext=txt.fastq
new_ext=fastq
find . -type f -name "*.$old_ext" -print | while read file
do
    mv $file ${file%${old_ext}}${new_ext}
done
echo Finished Step1b - Read2 Fastq Conversion
date '+%m/%d/%y %H:%M:%S'
echo Finished Step1b - Read2 Fastq Conversion >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log

#In the next step we will align the converted Sanger fastq format files to the reference genome (hg18.fa)
echo Starting Step2 - bwa aln process
date '+%m/%d/%y %H:%M:%S'
echo Starting Step2 - bwa aln process >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
echo The following fastq files will be aligned:
for ligne in `ls *.fastq`
do
    echo $ligne
done
echo The following fastq files will be aligned: >> ../../AnalysisNotes/Analysis.log
for ligne in `ls *.fastq`
do
    echo $ligne >> ../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.fastq`
do
    bwa aln ../../RefGenomes/BWA_Indexed/hg18.fa $ligne > $ligne.sai
done
echo Finished Step2 - bwa aln process
date '+%m/%d/%y %H:%M:%S'
echo Finished Step2 - bwa aln process >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log

#In the next step we will generate SAM files for the alignments using bwa samse
echo Starting Step3 - bwa samse process
date '+%m/%d/%y %H:%M:%S'
echo Starting Step3 - bwa samse process >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
echo The following alignment files will be converted to SAM files:
for ligne in `ls *.sai`
do
    echo $ligne
done
echo The following alignment files will be converted to SAM files: >> ../../AnalysisNotes/Analysis.log
for ligne in `ls *.sai`
do
    echo $ligne >> ../../AnalysisNotes/Analysis.log
done
#To get this step to work I had to trick the system to get the conventional (bwa samse <database.fasta> <aln.sai> <input.fastq> > aln.sam) command as I can only pull up one list at a time to automatically feed files into the command
for ligne in `ls *.fastq`
do
    bwa samse ../../RefGenomes/BWA_Indexed/hg18.fa $ligne.sai $ligne > ../../SAMfiles/$ligne.sam
done
#This step renames the bwa aln output files from "File_Run1_Lane1_R1.fastq.sai" to "File_Run1_Lane1_R1.sai"
#This is done on purpose at this point to allow the preceding step to work
old_ext=fastq.sai
new_ext=sai
find . -type f -name "*.$old_ext" -print | while read file
do
    mv $file ${file%${old_ext}}${new_ext}
done
cd ../../SAMfiles/
#Current directory = NGS/SAMfiles
#This step renames the bwa samse output files from "File_Run1_Lane1_R1.fastq.sam" to "File_Run1_Lane1_R1_bwase.sam"
#We put in the "_bwase" to indicate the files are bwa alignments converted to SAM files with the bwa samse command
old_ext=.fastq.sam
new_ext=_bwase.sam
find . -type f -name "*$old_ext" -print | while read file
do
    mv $file ${file%${old_ext}}${new_ext}
done
echo Finished Step3 - bwa samse process
date '+%m/%d/%y %H:%M:%S'
echo Finished Step3 - bwa samse process >> ../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../AnalysisNotes/Analysis.log

#In the next step we will convert each SAM file to a BAM file
echo Starting Step4 - samtools SAM to BAM conversion
date '+%m/%d/%y %H:%M:%S'
echo Starting Step4 - samtools SAM to BAM conversion >> ../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../AnalysisNotes/Analysis.log
echo The following SAM files will be converted to BAM files:
for ligne in `ls *.sam`
do
    echo $ligne
done
echo The following SAM files will be converted to BAM files: >> ../AnalysisNotes/Analysis.log
for ligne in `ls *.sam`
do
    echo $ligne >> ../AnalysisNotes/Analysis.log
done
for ligne in `ls *.sam`
do
    samtools view -bS -o ../BAMfiles/Original/$ligne.bam $ligne
done
cd ../BAMfiles/Original
#Current directory = NGS/BAMfiles/Original
#This step renames the samtools view output files from "File_Run1_Lane1_R1_bwase.sam.bam" to "File_Run1_Lane1_R1_bwase.bam"
old_ext=sam.bam
new_ext=bam
find . -type f -name "*.$old_ext" -print | while read file
do
    mv $file ${file%${old_ext}}${new_ext}
done
echo Finished Step4 - samtools SAM to BAM conversion
date '+%m/%d/%y %H:%M:%S'
echo Finished Step4 - samtools SAM to BAM conversion >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log

#In the next step we will sort the BAM files by chromosome coordinate
echo Starting Step5 - samtools BAM sorting process
date '+%m/%d/%y %H:%M:%S'
echo Starting Step5 - samtools BAM sorting process >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
echo The following BAM files will be sorted:
for ligne in `ls *.bam`
do
    echo $ligne
done
echo The following BAM files will be sorted: >> ../../AnalysisNotes/Analysis.log
for ligne in `ls *.bam`
do
    echo $ligne >> ../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.bam`
do
    samtools sort $ligne ../Sorted/$ligne
done
cd ../Sorted
#Current directory = NGS/BAMfiles/Sorted
#This step renames the samtools sort output files from "File_Run1_Lane1_R1_bwase.bam.bam" to "File_Run1_Lane1_R1_bwase_sorted.bam"
old_ext=.bam.bam
new_ext=_sorted.bam
find . -type f -name "*$old_ext" -print | while read file
do
    mv $file ${file%${old_ext}}${new_ext}
done
echo Finished Step5 - samtools BAM sorting process
date '+%m/%d/%y %H:%M:%S'
echo Finished Step5 - samtools BAM sorting process >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
cd ../..
#Current directory = NGS/

#In the next step we will move files from the input and analysis directories to their respective "NGS/FinalOutputs" subdirectory so the pipeline can be used another time.
echo Cleaning up Input and Analysis Directories
date '+%m/%d/%y %H:%M:%S'
echo Cleaning up Input and Analysis Directories >> AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> AnalysisNotes/Analysis.log
cd InputSequence/Illumina/Read1
#Current directory = NGS/InputSequence/Illumina/Read1
echo Moving the following Illumina Fastq Files from NGS/InputSequence/Illumina/Read1 to NGS/FinalOutputs/Illumina/Read1:
for ligne in `ls *.txt`
do
    echo $ligne
done
echo Moving Illumina Fastq Files from NGS/InputSequence/Illumina/Read1 to NGS/FinalOutputs/Illumina/Read1 >> ../../../AnalysisNotes/Analysis.log
for ligne in `ls *.txt`
do
    echo $ligne >> ../../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.txt`
do
    mv $ligne ../../../FinalOutputs/Illumina/Read1/
done
cd ../Read2
#Current directory = NGS/InputSequence/Illumina/Read2
echo Moving the following Illumina Fastq Files from NGS/InputSequence/Illumina/Read2 to NGS/FinalOutputs/Illumina/Read2:
for ligne in `ls *.txt`
do
    echo $ligne
done
echo Moving Illumina Fastq Files from NGS/InputSequence/Illumina/Read2 to NGS/FinalOutputs/Illumina/Read2 >> ../../../AnalysisNotes/Analysis.log
for ligne in `ls *.txt`
do
    echo $ligne >> ../../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.txt`
do
    mv $ligne ../../../FinalOutputs/Illumina/Read2/
done
echo Moving Illumina Input Files Complete
date '+%m/%d/%y %H:%M:%S'
echo Moving Illumina Input Files Complete >> ../../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../../AnalysisNotes/Analysis.log
cd ../../SangerFastq
#Current directory = NGS/InputSequence/SangerFastq
echo Moving the following Sanger Format Fastq Files from NGS/InputSequence/SangerFastq to NGS/FinalOutputs/SangerFastq:
for ligne in `ls *.fastq`
do
    echo $ligne
done
echo Moving the following Sanger Format Fastq Files from NGS/InputSequence/SangerFastq to NGS/FinalOutputs/SangerFastq: >> ../../AnalysisNotes/Analysis.log
for ligne in `ls *.fastq`
do
    echo $ligne >> ../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.fastq`
do
    mv $ligne ../../FinalOutputs/SangerFastq/
done
echo Moving Sanger Format Fastq Files Complete
date '+%m/%d/%y %H:%M:%S'
echo Moving Sanger Format Fastq Files Complete >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
echo Moving the following Alignment .sai Files from NGS/InputSequence/SangerFastq to NGS/FinalOutputs/AlignmentResults:
for ligne in `ls *.sai`
do
    echo $ligne
done
echo Moving the following Alignment .sai Files from NGS/InputSequence/SangerFastq to NGS/FinalOutputs/AlignmentResults: >> ../../AnalysisNotes/Analysis.log
for ligne in `ls *.sai`
do
    echo $ligne >> ../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.sai`
do
    mv $ligne ../../FinalOutputs/AlignmentResults/
done
echo Moving Alignment Results Files Complete
date '+%m/%d/%y %H:%M:%S'
echo Moving Alignment Results Files Complete >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
cd ../../BAMfiles/Sorted/
#Current directory = NGS/BAMfiles/Sorted/
echo Moving the following Sorted BAM files from NGS/BAMfiles/Sorted/ to NGS/FinalOutputs/SortedBAMfiles:
for ligne in `ls *.bam`
do
    echo $ligne
done
echo Moving the following Sorted BAM files from NGS/BAMfiles/Sorted/ to NGS/FinalOutputs/SortedBAMfiles: >> ../../AnalysisNotes/Analysis.log
for ligne in `ls *.bam`
do
    echo $ligne >> ../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.bam`
do
    mv $ligne ../../FinalOutputs/SortedBAMfiles/
done
echo Moving Sorted BAM Files Complete
date '+%m/%d/%y %H:%M:%S'
echo Moving Sorted BAM Files Complete >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
cd ../../SAMfiles
#Current directory = NGS/SAMfiles
echo Deleting the following SAM Files from NGS/SAMfiles:
for ligne in `ls *.sam`
do
    echo $ligne
done
echo Deleting the following SAM Files from NGS/SAMfiles: >> ../AnalysisNotes/Analysis.log
for ligne in `ls *.sam`
do
    echo $ligne >> ../AnalysisNotes/Analysis.log
done
for ligne in `ls *.sam`
do
    rm $ligne
done
echo Deleting SAM Files Complete
date '+%m/%d/%y %H:%M:%S'
echo Deleting SAM Files Complete >> ../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../AnalysisNotes/Analysis.log
cd ../BAMfiles/Original/
#Current directory = NGS/BAMfiles/Original
echo Deleting the following BAM Files from NGS/BAMfiles/Original:
for ligne in `ls *.bam`
do
    echo $ligne
done
echo Deleting the following BAM Files from NGS/BAMfiles/Original: >> ../../AnalysisNotes/Analysis.log
for ligne in `ls *.bam`
do
    echo $ligne >> ../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.bam`
do
    rm $ligne
done
echo Deleting BAM Files from NGS/BAMfiles/Original Complete
date '+%m/%d/%y %H:%M:%S'
echo Deleting BAM Files from NGS/BAMfiles/Original Complete >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
cd ../..
#Current directory = NGS/
echo Cleaning up Input and Analysis Directories Complete
date '+%m/%d/%y %H:%M:%S'
echo Cleaning up Input and Analysis Directories Complete >> AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> AnalysisNotes/Analysis.log
echo ***Analysis Batch Complete***
echo ***Analysis Batch Complete*** >> AnalysisNotes/Analysis.log
# Future development: Merge BAM files based on filename structure "RunX_LaneX_RX", Index files for IGV viewing
# Figure out Perl or Python so I don't look like such an amateur :)
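As a partial answer to the fail-safe questions in the script's header comments, a guard block along these lines could go at the top (a sketch, not part of the original script):
Code:
# abort unless the script is launched from a directory named NGS
case `pwd` in
    */NGS) ;;
    *) echo "Please run this script from the NGS directory"; exit 1 ;;
esac
# abort if the SAMfiles staging directory is not empty
if [ -n "`ls SAMfiles 2>/dev/null`" ]
then
    echo "The NGS/SAMfiles directory is not empty; please remove the files or they will be deleted"
    exit 1
fi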
Comment
-
Originally posted by Jon_Keats: The Shell Script from Hell or My Masterpiece You Decide
I'll preface this with acknowledging all the help from various people who have replied to this thread... I'm stuck at not being able to merge the sorted BAM files but at least this lets me get back to the lab... so I can get my own lab and make someone else deal with these issues
That said, I do actually kind of enjoy dealing with these kinds of informatics problems, and I'm glad you are providing an introduction that many folks will find useful. I wanted to point out another alignment program you haven't mentioned, Mosaik (from the Marth lab at BC), which has some of the best documentation of any of the short-read aligners I've used. The other advantage of Mosaik is that it allows much higher levels of divergence between the read and the reference than BWA/Bowtie do. I've found this useful as I'm working with reads from related bacterial strains that can diverge substantially from the reference, yet still have alignments supported with high depth. Mosaik alignments can be easily converted to SAM/BAM format for viewing in IGV / SNP calling with SAMtools, and there is also an associated SNP caller (GigaBayes).
I am also a big fan of IGV due to its flexibility. It is relatively easy to write Perl scripts to create your own custom tracks for viewing using the GFF format. For example, I created a custom track that shows me the location of all SNPs above certain depth/quality thresholds.
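For anyone who hasn't seen the format, a GFF custom track is just nine tab-separated columns (seqname, source, feature, start, end, score, strand, frame, attributes); a hypothetical line for a SNP track like the one described might look like this (coordinates and attributes made up for illustration):
Code:
chr1    my_snp_caller    SNP    12345    12345    42.0    .    .    depth=88;quality=42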
Comment
-
Making Some Progress
Getting ready to head out for AACR and get depressed about how far ahead the big sequencing centers are in the sequencing world. I'm betting we'll be seeing at least 100 cancer genomes being discussed; any other guesses? Well, I know of at least 40, so that might be on the low side.
The good news on the home front is that I've cleaned up a bunch of disk space on our workstation and will have the poor beast going for the next 3-4 days processing data, so there should be a big pile of data to sift through when I get back. One lesson I learned is that pipelines are great, but you need to remember how much disk space the whole pipeline uses... or your morning starts with an error message of "0 bytes remaining". What the hell are bytes? I know of these things called megabytes, gigabytes, or terabytes, and I roughly remember kilobytes... I thought the sticker on our machine said Mac Pro, not Apple IIe. Therefore, version 2 of the pipeline will need some kind of check for the available disk space relative to the number of samples to be processed. Oh goodie, one more thing to learn, and I thought I was smart enough already!
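The disk-space check mentioned above could look something like this in sh (a sketch: the required-space figure is a made-up number you would tune to your batch size, and the df column position may vary by platform):
Code:
REQUIRED_KB=500000000   # hypothetical estimate: ~500 GB per batch
AVAILABLE_KB=`df -k . | tail -1 | awk '{print $4}'`
if [ $AVAILABLE_KB -lt $REQUIRED_KB ]
then
    echo "Only $AVAILABLE_KB KB free but $REQUIRED_KB KB needed - aborting"
    exit 1
fi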
RE: Comments
MOSAIK: funny you mention it, the guys were up at the Mothership giving a talk on Wednesday that we tuned into over the webcast. Sounded pretty good from what they showed, but it did sound like it is a bit too much of a memory hog for our current workstation. However, they mentioned that an upcoming release will be less memory intensive.
CLCBio: I'm all for turnkey solutions and would be the first to say that if you can drop thousands of dollars on data generation, you need to be willing to pay for analysis... otherwise what is the point? But as a new user, it seems simple enough for most sequencing cores to have in place basic pipelines using open-source software that can generate the basic outputs most investigators are looking to get back.
Important Note: If you use the open-source software packages, please remember someone put a great deal of work into even the bad ones, and many of the developers are great citizens on this forum. So if you see requests like Nils' request for grant support letters for BFAST, or similar ones from the other developers, please take a couple of minutes to help the people that help you!
Comment
-
Remove the Duplicates
Well, the meeting is over and I'm disappointed we didn't see a bigger impact of sequencing. Boo hoo to all the people who only talked about published datasets; what's the point of a meeting if people only talk about their Nature paper from 3 months ago... I read that already! But a couple of little birdies told me about several big, multi-patient papers that are already accepted for publication... so does that mean 40 tumor-normal pairs is an Oncogene paper now? Okay, enough complaining.
So if you have been following the thread, I've managed to get from some raw Illumina reads to sorted BAM files by pipelining the following commands in a Unix shell script:
bwa index -a bwtsw hg18.fasta (done manually as it only needs to be done once)
maq ill2sanger s_x_sequence.txt s_x_sequence.fastq
bwa aln hg18.fasta s_x_sequence.fastq > s_x_sequence.sai
bwa samse hg18.fasta s_x_sequence.sai s_x_sequence.fastq > s_x_sequence.sam
samtools view -bS -o s_x_sequence.bam s_x_sequence.sam
samtools sort s_x_sequence.bam s_x_sequence_sorted (note: old-style samtools sort takes an output prefix and appends .bam itself)
But I'm still stuck at having multiple BAM files per sample and having to manually merge the files after the current pipeline finishes using:
samtools merge PatientX_MergedBAM.bam s_1_sequence_sorted.bam s_2_sequence_sorted.bam
Then I create index files (needed for IGV) using:
samtools index PatientX_MergedBAM.bam
And then I drop them into the IGV browser (anyone else find it is faster to navigate around with chromosome coordinates than gene names?). FYI: I use the binary distribution version so I can use more than 2GB of RAM, since I can't use the 10GB version on our 8GB workstation. I've modified the igv_mac-intel.sh script to -Xmx6144m and then launch it from Terminal to open IGV with 6.144GB of RAM.
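Since the merge and index steps above are the manual part, a loop along these lines could automate them, assuming the sorted BAMs follow the "Sample_RunX_LaneX_RX_bwase_sorted.bam" naming convention from the pipeline script (a sketch, not tested on real file sets):
Code:
cd NGS/FinalOutputs/SortedBAMfiles
for sample in `ls *_sorted.bam | sed 's/_Run.*//' | sort -u`
do
    samtools merge ../MergedBAMfiles/${sample}_merged.bam ${sample}_Run*_sorted.bam
    samtools index ../MergedBAMfiles/${sample}_merged.bam
done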
After looking at the data in IGV I came to appreciate how many duplicates are present in my dataset... So I'm back to Google and SEQanswers trying to figure out how to deal with them. What I've come to understand is that most people, including Heng Li who built bwa/samtools, recommend using Picard over samtools to mark/remove duplicates, as Picard also handles duplicates involving multiple chromosomes. So back to SourceForge to download Picard (http://sourceforge.net/projects/picard/files/). I downloaded the current version, placed the folder in my "NGS/ApplicationDownloads/" folder, and decompressed it. Right now I can't figure out how to put Java .jar files in my Java path (I'm sure I can figure it out, but I'm getting lazy and just copy the .jar files around as needed; bad, I know).
I've encountered one issue with my BWA/samtools BAM files giving an error because the unmapped reads have characters in the CIGAR string, which makes Picard grumpy. So you need to add one additional option to the command to set the file validation stringency to silent:
java -Xmx4g -jar MarkDuplicates.jar INPUT=PatientX_MergedBAM.bam OUTPUT=PatientX_MergedBAM_NoDups.bam METRICS_FILE=PatientX_NoDups_Stats.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT
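On the .jar question: .jar files don't go on a PATH the way binaries do; you just point java at them. So one way to stop copying them around (a sketch; the install path is whatever you decompressed to) is:
Code:
PICARD_DIR=$HOME/NGS/ApplicationDownloads/picard-tools   # hypothetical install location
java -Xmx4g -jar $PICARD_DIR/MarkDuplicates.jar \
    INPUT=PatientX_MergedBAM.bam \
    OUTPUT=PatientX_MergedBAM_NoDups.bam \
    METRICS_FILE=PatientX_NoDups_Stats.txt \
    REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT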
Next getting piled up in my pileup files...
Comment
-
I'll one up you...just a cleaner mkdir command.
#!/bin/bash
# (bash, not sh: the brace expansion below is a bashism)
#Create_NGS_DirectoryStructureV3.sh
#Written by Ryan Golhar 05/05/2010
echo ***Creating Pipeline Directory Structure***
mkdir -p ngs/{analysisnotes,applicationdownloads,samfiles}
mkdir -p ngs/bamfiles/{merged,original,sorted}
mkdir -p ngs/finaloutputs/{alignmentresults,mergedbamfiles,sangerfastq,sortedbamfiles}
mkdir -p ngs/finaloutputs/illumina/{read1,read2}
mkdir -p ngs/inputsequences/illumina/{read1,read2} ngs/inputsequences/sangerfastq
mkdir -p ngs/refgenomes/{bfast_indexed,bowtie_indexed,bwa_indexed,genomedownloads}
mkdir -p ngs/scripts/scriptbackups
cp Create_NGS_DirectoryStructureV3.sh ngs/scripts/scriptbackups/
echo ***Pipeline Directory Structure Created***
Comment
-
Re: Removing the duplicates
Noticed this a second ago. It looks like you are working with single-end sequencing data, in which case you wouldn't want to remove duplicates.
(See the SAMtools FAQ, item #6.)
The duplicate removal across chromosomes would only apply to paired-end data.
Comment