I'm LMAO since it appears that I'm coming to these revelations at just about the exact same time you were/are. It's definitely comforting to see somebody figuring out the exact same stuff that I am at basically the same time. Every time I talk about this stuff here, people look at me like I'm ALF or something.
-
Do yourself a favor & swear off shell scripts for Perl/Python/Ruby -- once you start trying to have any sort of conditional programming, shell becomes a nightmare.
Converting your script above to Perl is pretty easy (but left as an exercise for the student :-)
-
Sounds like I should finish working through the "Unix and Perl for Biologists" course to see if I can figure Perl out. I managed to figure out a solution to my problem about halfway through the Unix section, so I started with that, but maybe I shouldn't have jumped the gun so much. I'll see what I can come up with this weekend.
-
Aaah, I'm too big a softie. Also, IMHO the Perl is more readable than the shell for understanding what is going on (yikes! something less readable than Perl!).
Personally, I prefer directory names in all lowercase. Brevity is valuable too -- "bam" & "sam" would be my choices for those directories & I probably wouldn't bother having separate ones anyway. But that is all personal preference & you should pick what suits you.
Also: DO NOT try to use a subdirectory to back up your scripts -- it is easy to install and learn git (or a similar system) to control versions. It is much too easy to accidentally copy your file the wrong way and wipe something out; version control programs track all your changes and let you safely move backwards and forwards through the history of your code. They are also a handy way to replicate code across machines (and keep them in sync), as well as a path towards collaborative coding.
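To make the version-control suggestion concrete, here is a minimal sketch of a git workflow (the script name is just a placeholder):
Code:
# one-time setup, run inside your scripts directory
cd NGS/Scripts
git init
git add MyPipeline.sh
git commit -m "initial version"
# after each edit, record a new version:
#   git add MyPipeline.sh
#   git commit -m "describe what changed"
# "git log" lists the saved history; to recover an old version:
#   git checkout <commit-id> -- MyPipeline.sh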
Code:
#!/usr/bin/perl
use strict;

print "***Creating Pipeline Directory Structure***\n"; # \n is newline

my $root = "NGS";
system("mkdir $root"); # this line only makes directories with no subdirectories

&makeSubdirs($root, "AnalysisNotes", "ApplicationDownloads", "InputSequence", "SAMfiles");
&makeSubdirs("$root/BAMfiles", "Merged", "Original", "Sorted");
&makeSubdirs("$root/FinalOutputs", "AlignmentResults", "Illumina", "SangerFasta", "SortedBAMfiles", "MergedBAMfiles");
&makeSubdirs("$root/RefGenomes", "BFAST_Indexed", "BOWTIE_Indexed", "GenomeDownloads");
&makeSubdirs("$root/Scripts", "ScriptBackups"); # DANGEROUS!!!

print "Pipeline Directory Structure Created\n";

sub makeSubdirs {
    my @subdirs = @_;
    my $parentDir = shift(@subdirs);
    mkdir($parentDir) if (! -d $parentDir);
    foreach my $subDir (@subdirs) {
        mkdir("$parentDir/$subDir");
    }
}

#cd ../Scripts
#mkdir ScriptBackups
#cd ../..
#cp Create_NGS_DirectoryStructureV1.sh NGS/Scripts/ScriptBackups/
#cd NGS/
#pwd
-
Okay, still working on figuring out the Perl version; it seems pretty clear apart from some of the Perl syntax. However, I'm nothing if not competitive, so here is the three-line version. Not sure if "krobinson"'s Perl version is fewer characters or not, but this must win for number of lines.
I guess reading the manuals for functions helps; mkdir -p sure helps! I'll work on figuring out git version control and a bit more Perl, but I can't wait to see krobinson's response to the 400+ line pipeline shell script. At least I took the suggestion to ditch the mixed upper- and lowercase filenames, so everything is now lowercase.
NOTE - THIS SCRIPT HAS BEEN UPDATED TO VERSION 3 IN A SUBSEQUENT POST
Code:
#!/bin/sh
#Create_NGS_DirectoryStructureV2.sh
#Written by Jonathan Keats 04/12/2010
echo ***Creating Pipeline Directory Structure***
mkdir -p ngs/analysisnotes ngs/applicationdownloads ngs/bamfiles/merged ngs/bamfiles/original ngs/bamfiles/sorted ngs/finaloutputs/alignmentresults ngs/finaloutputs/illumina/read1 ngs/finaloutputs/illumina/read2 ngs/finaloutputs/mergedbamfiles ngs/finaloutputs/sangerfastq ngs/finaloutputs/sortedbamfiles ngs/inputsequences/illumina/read1 ngs/inputsequences/illumina/read2 ngs/inputsequences/sangerfastq ngs/refgenomes/bfast_indexed ngs/refgenomes/bowtie_indexed ngs/refgenomes/bwa_indexed ngs/refgenomes/genomedownloads ngs/samfiles ngs/scripts/scriptbackups
cp Create_NGS_DirectoryStructureV2.sh ngs/scripts/scriptbackups/
echo ***Pipeline Directory Structure Created***
-
Better metrics have to do with readability & maintainability rather than utter brevity -- it is quite possible to write extremely short code in Perl which looks like hieroglyphics.
Your program is straightforward -- but as it grows it will get less so. One thing in particular I addressed is what happens if you rename some of your directories -- in your shell script several directories are listed twice, which is an opportunity for error should you decide to rename them.
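To make that concrete in shell, here is a sketch where each directory name is defined exactly once, so a rename is a single edit (bash syntax; the names are borrowed from the scripts above):
Code:
#!/bin/bash
root=NGS
bam=$root/BAMfiles
final=$root/FinalOutputs
# every later reference goes through the variables, so renaming
# BAMfiles (say) means changing only the line that defines $bam
mkdir -p $bam/Merged $bam/Original $bam/Sorted
mkdir -p $final/AlignmentResults $final/MergedBAMfiles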
P.S. Only a terminal -- no internal -- Asparagine on my name!
-
Hi all,
Let me do my best to help out with the Perl.
A lot of basic operations have already been written in Perl. You can easily use these bits of code. If you need something done, check http://search.cpan.org/. This goes for bioinformatics things as well: check http://www.bioperl.org for stuff like dealing with sequence files.
For your Perl script I used File::Path (http://search.cpan.org/~jhi/perl-5.8.0/lib/File/Path.pm). It took me about 5 minutes to write.
Code:
#!/usr/bin/perl
use strict;
use File::Path; # CPAN modules are this easy to use

# the dirs
my $dirs = [
    'NGS/AnalysisNotes',
    'NGS/ApplicationDownloads',
    'NGS/InputSequence',
    'NGS/SAMfiles',
    'NGS/BAMfiles/Merged',
    'NGS/BAMfiles/Original',
    'NGS/BAMfiles/Sorted',
    'NGS/FinalOutputs/AlignmentResults',
    'NGS/FinalOutputs/Illumina',
    'NGS/FinalOutputs/SangerFasta',
    'NGS/FinalOutputs/SortedBAMfiles',
    'NGS/FinalOutputs/MergedBAMfiles',
    'NGS/RefGenomes/BFAST_Indexed',
    'NGS/RefGenomes/BOWTIE_Indexed',
    'NGS/RefGenomes/GenomeDownloads',
    'NGS/Scripts/ScriptBackups'
];

print "***Creating Pipeline Directory Structure***\n";

# eval will evaluate whatever is inside the {}. Any errors are stored in $@
# mkpath is the File::Path method that will create the given directories
# it takes three arguments in total: the dirs, whether you wish to print the
# created directories (in that case, 1) or not (0), and the last argument has
# to do with file permissions; I omitted it
eval { mkpath($dirs, 1); };

# now check if there were errors. If so, print.
if ($@) {
    print "Couldn't create NGS directories: $@\n";
}

print "Pipeline Directory Structure Created\n";
Cheers,
Wil
-
To Genome or Not to Genome
I'm sure I'm not the only person facing a conundrum over which genome version to use for alignment. In general I'm a believer in the "out with the old, in with the new" mentality, so my gut is pushing me to use the newest genome build (GRCh37/hg19/Ensembl55+) over the old version (NCBI36/hg18/Ensembl38-54). Plus, CCDS has been on GRCh37 for nearly a year now, and dbSNP131 was just released on GRCh37, so the newest build is getting harder and harder to ignore. BUT, for my dataset I have paired copy-number data on Agilent arrays, which still use hg18 design files. So after wasting a couple of days aligning things to GRCh37 (the 1000 Genomes version), I'm going back to redo it with hg18 so my coordinates match between the two datasets; though "wasting" might be the wrong word, as I'm betting I'll want to be on the new version in a month. To get this genome reference file I decided to use the UCSC chromosome files (http://hgdownload.cse.ucsc.edu/golde...8/chromosomes/), since I'd seen someone mention the advantage of their simple naming structure. I did not use the various haplotype builds or the random contigs.
- Downloaded each individual chromosome file Chr1-Chr22, ChrX, ChrY, and ChrM.
- Placed them all in my "ngs/refgenomes/genome_downloads/hg18" folder
- Navigate to the folder in Terminal "cd ngs/refgenomes/genome_downloads/hg18"
- Decompress the files using gunzip, type "gunzip *.fa.gz"
- Then merge the files using unix cat command [cat (infile1) (infile2) (infileN) > (outfile)]
- Type "cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa > hg18.fasta"
- Check that the output file contains all chromosomes in the correct order using the unix grep command to pull out the fasta header lines
- Type "grep ">" hg18.fasta"
- Which should output the following if correct:
>chr1
>chr2
>chr3
>chr4
>chr5
>chr6
>chr7
>chr8
>chr9
>chr10
>chr11
>chr12
>chr13
>chr14
>chr15
>chr16
>chr17
>chr18
>chr19
>chr20
>chr21
>chr22
>chrX
>chrY
>chrM
A similar process should be possible with any genome version or source, like Ensembl; just make sure your downstream platform supports the version you are using. And be careful with annotation files from UCSC, as the positions are 0-based.
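For convenience, the steps above can be collapsed into one short script. This is just a sketch, assuming the compressed UCSC chromosome files are already in the download folder (brace expansion is a bash feature; chr{1..22}.fa expands in numeric order):
Code:
#!/bin/bash
cd ngs/refgenomes/genome_downloads/hg18
gunzip *.fa.gz
# concatenate chr1-chr22 in numeric order, then X, Y, M
cat chr{1..22}.fa chrX.fa chrY.fa chrM.fa > hg18.fasta
# verify: should print the 25 headers in the order shown above
grep ">" hg18.fasta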
-
The Shell Script from Hell, or My Masterpiece: You Decide
I'll preface this by acknowledging all the help from the various people who have replied to this thread, and I promise to try to figure out Perl over shell scripts; but two weeks ago the "Hello World!" script in any programming language was Greek to me, so bear with my ignorance for a week or two more. In the interest of sharing, here is my current bwa samse pipeline. I'm stuck at not being able to merge the sorted BAM files, but at least this lets me get back to the lab and opens up some of my time to write the backlog of papers I need to polish off so I can get my own lab and make someone else deal with these issues.
***Currently, this uses all default settings, so modify commands as you see fit***
Code:
#!/bin/sh
# Batch_BWAsamse_V1.sh
# Created by Jonathan Keats on 04/05/10.
# This script is designed to take a batch of raw Illumina 1.3+ reads to sorted BAM files.
# It is designed to be initiated from a folder called "NGS" with a specific subdirectory structure
# Use the script called "Create_NGS_DirectoryStructureV1.sh" or the other variations (see Jon_Keats SeqAnswers Introduction Thread) to create the analysis directory structure
# To run this script you MUST first place your reference file in "NGS/RefGenomes/BWA_Indexed" and have run the "bwa index" command to create the BWT index files
# If you are NOT using a reference genome called "hg18.fa" you will NEED to edit the "bwa aln" and "bwa samse" commands below accordingly!
# The script is based on having ***RENAMED*** Illumina files in NGS/InputSequence/Illumina/Read1 and, for paired-end runs, NGS/InputSequence/Illumina/Read2
# The renamed format MUST be "YourSampleName_RunX_LaneX_R1.txt" and "YourSampleName_RunX_LaneX_R2.txt" otherwise files will be overwritten, paired-end analysis will not be possible, and lane merging will not be possible
# At each step it queries specific folders for available files and passes them to the next analysis module
# After each step the filename extensions of the output files are corrected (ie. "MySequenceFile.txt.fastq" to "MySequenceFile.fastq")
# Order of embedded steps:
# - Converts Illumina 1.3+ fastq files "s_1_sequence.txt" to Sanger fastq using the "maq ill2sanger" command
# - Aligns created fastq files to the reference genome using the "bwa aln" command
# - Generates SAM files from alignment files using the "bwa samse" command
# - Converts SAM files to BAM files using the "samtools view" command
# - Sorts BAM files using the "samtools sort" command
# - Final output files are archived, then the input and analysis directories are cleaned up and readied for the next analysis batch
# ***DURING CLEAN-UP THE ENTIRE CONTENTS OF SEVERAL FOLDERS ARE DELETED, SO ONLY PLACE ADDITIONAL FILES IN THE "NGS/FINALOUTPUTS" DIRECTORY OR OUTSIDE THE NGS FOLDER SUBDIRECTORIES***
# The script creates a log file in "NGS/AnalysisNotes" to track the steps completed and the time each step started and finished
# Some of the log events will print to both the terminal screen and the log file so you can see what is going on
# On our workstation this will process two lanes of PE data overnight or a full run over the weekend
# Should you merge files (lane 1 and 2) before analysis as Illumina fastq files or after sorting BAM files?
# If you merge before analysis does this mess up a paired-end analysis?
# Carabelle - Can you have a fail-safe start query to kill the script if it is not initiated from the "NGS" folder? Probably doesn't run anyways
# Carabelle - Also can you check if destination folders are empty and if not kill the script?
# Carabelle - Such as: if directory NGS/SAMfiles is NOT empty, kill the script and print "The NGS/SAMfiles directory is not empty please remove files otherwise they will be deleted"
# Ligne=Line en francais (I had a significant amount of help from our French post-doc's husband, who is a Unix programmer in Paris)

#Starting directory = NGS/
echo ***Starting Analysis Batch***
date '+%m/%d/%y %H:%M:%S'
#The following step creates the log file in the AnalysisNotes subdirectory the first time the script is run
#On subsequent runs the results are printed at the bottom of the pre-existing log file
echo ***Starting Analysis Batch*** >> AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> AnalysisNotes/Analysis.log

#In this step we convert the original Illumina 1.3+ fastq format files to Sanger fastq format
#Input files from NGS/InputSequence/Illumina/Read1 and NGS/InputSequence/Illumina/Read2
#Output files to NGS/InputSequence/SangerFastq
echo Starting Step1a - Read1 Fastq Conversion with maq ill2sanger
date '+%m/%d/%y %H:%M:%S'
echo Starting Step1a - Read1 Fastq Conversion with maq ill2sanger >> AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> AnalysisNotes/Analysis.log
cd InputSequence/Illumina/Read1
#Current directory = NGS/InputSequence/Illumina/Read1
echo Converting the following Illumina files from the Read1 folder:
for ligne in `ls *.txt`
do
  echo $ligne
done
echo Converting the following Illumina files from the Read1 folder: >> ../../../AnalysisNotes/Analysis.log
for ligne in `ls *.txt`
do
  echo $ligne >> ../../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.txt`
do
  maq ill2sanger $ligne ../../SangerFastq/$ligne.fastq
done
echo Finished Step1a - Read1 Fastq Conversion
date '+%m/%d/%y %H:%M:%S'
echo Finished Step1a - Read1 Fastq Conversion >> ../../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../../AnalysisNotes/Analysis.log
cd ../Read2
#Current directory = NGS/InputSequence/Illumina/Read2
echo Starting Step1b - Read2 Fastq Conversion with maq ill2sanger
date '+%m/%d/%y %H:%M:%S'
echo Starting Step1b - Read2 Fastq Conversion with maq ill2sanger >> ../../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../../AnalysisNotes/Analysis.log
echo Converting the following Illumina files from the Read2 folder:
for ligne in `ls *.txt`
do
  echo $ligne
done
echo Converting the following Illumina files from the Read2 folder: >> ../../../AnalysisNotes/Analysis.log
for ligne in `ls *.txt`
do
  echo $ligne >> ../../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.txt`
do
  maq ill2sanger $ligne ../../SangerFastq/$ligne.fastq
done
cd ../../SangerFastq/
#Current directory = NGS/InputSequence/SangerFastq
#The next step changes the file extensions of the Sanger Fastq outputs from "File_Run1_Lane1_R1.txt.fastq" to "File_Run1_Lane1_R1.fastq"
old_ext=txt.fastq
new_ext=fastq
find . -type f -name "*.$old_ext" -print | while read file
do
  mv $file ${file%${old_ext}}${new_ext}
done
echo Finished Step1b - Read2 Fastq Conversion
date '+%m/%d/%y %H:%M:%S'
echo Finished Step1b - Read2 Fastq Conversion >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log

#In the next step we align the converted Sanger fastq format files to the reference genome (hg18.fa)
echo Starting Step2 - bwa aln process
date '+%m/%d/%y %H:%M:%S'
echo Starting Step2 - bwa aln process >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
echo The following fastq files will be aligned:
for ligne in `ls *.fastq`
do
  echo $ligne
done
echo The following fastq files will be aligned: >> ../../AnalysisNotes/Analysis.log
for ligne in `ls *.fastq`
do
  echo $ligne >> ../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.fastq`
do
  bwa aln ../../RefGenomes/BWA_Indexed/hg18.fa $ligne > $ligne.sai
done
echo Finished Step2 - bwa aln process
date '+%m/%d/%y %H:%M:%S'
echo Finished Step2 - bwa aln process >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log

#In the next step we generate SAM files for the alignments using bwa samse
echo Starting Step3 - bwa samse process
date '+%m/%d/%y %H:%M:%S'
echo Starting Step3 - bwa samse process >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
echo The following alignment files will be converted to SAM files:
for ligne in `ls *.sai`
do
  echo $ligne
done
echo The following alignment files will be converted to SAM files: >> ../../AnalysisNotes/Analysis.log
for ligne in `ls *.sai`
do
  echo $ligne >> ../../AnalysisNotes/Analysis.log
done
#To get this step to work I had to trick the system to get the conventional (bwa samse <database.fasta> <aln.sai> <input.fastq> > aln.sam) command, as I can only pull up one list at a time to automatically feed files into the command
for ligne in `ls *.fastq`
do
  bwa samse ../../RefGenomes/BWA_Indexed/hg18.fa $ligne.sai $ligne > ../../SAMfiles/$ligne.sam
done
#This step renames the bwa aln output files from "File_Run1_Lane1_R1.fastq.sai" to "File_Run1_Lane1_R1.sai"
#This is done on purpose at this point to allow the preceding step to work
old_ext=fastq.sai
new_ext=sai
find . -type f -name "*.$old_ext" -print | while read file
do
  mv $file ${file%${old_ext}}${new_ext}
done
cd ../../SAMfiles/
#Current directory = NGS/SAMfiles
#This step renames the bwa samse output files from "File_Run1_Lane1_R1.fastq.sam" to "File_Run1_Lane1_R1_bwase.sam"
#We put in the "_bwase" to indicate the files are bwa alignments converted to SAM files with the bwa samse command
old_ext=.fastq.sam
new_ext=_bwase.sam
find . -type f -name "*$old_ext" -print | while read file
do
  mv $file ${file%${old_ext}}${new_ext}
done
echo Finished Step3 - bwa samse process
date '+%m/%d/%y %H:%M:%S'
echo Finished Step3 - bwa samse process >> ../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../AnalysisNotes/Analysis.log

#In the next step we convert each SAM file to a BAM file
echo Starting Step4 - samtools SAM to BAM conversion
date '+%m/%d/%y %H:%M:%S'
echo Starting Step4 - samtools SAM to BAM conversion >> ../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../AnalysisNotes/Analysis.log
echo The following SAM files will be converted to BAM files:
for ligne in `ls *.sam`
do
  echo $ligne
done
echo The following SAM files will be converted to BAM files: >> ../AnalysisNotes/Analysis.log
for ligne in `ls *.sam`
do
  echo $ligne >> ../AnalysisNotes/Analysis.log
done
for ligne in `ls *.sam`
do
  samtools view -bS -o ../BAMfiles/Original/$ligne.bam $ligne
done
cd ../BAMfiles/Original
#Current directory = NGS/BAMfiles/Original
#This step renames the samtools view output files from "File_Run1_Lane1_R1_bwase.sam.bam" to "File_Run1_Lane1_R1_bwase.bam"
old_ext=sam.bam
new_ext=bam
find . -type f -name "*.$old_ext" -print | while read file
do
  mv $file ${file%${old_ext}}${new_ext}
done
echo Finished Step4 - samtools SAM to BAM conversion
date '+%m/%d/%y %H:%M:%S'
echo Finished Step4 - samtools SAM to BAM conversion >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log

#In the next step we sort the BAM files by chromosome coordinate
echo Starting Step5 - samtools BAM sorting process
date '+%m/%d/%y %H:%M:%S'
echo Starting Step5 - samtools BAM sorting process >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
echo The following BAM files will be sorted:
for ligne in `ls *.bam`
do
  echo $ligne
done
echo The following BAM files will be sorted: >> ../../AnalysisNotes/Analysis.log
for ligne in `ls *.bam`
do
  echo $ligne >> ../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.bam`
do
  samtools sort $ligne ../Sorted/$ligne
done
cd ../Sorted
#Current directory = NGS/BAMfiles/Sorted
#This step renames the samtools sort output files from "File_Run1_Lane1_R1_bwase.bam.bam" to "File_Run1_Lane1_R1_bwase_sorted.bam"
old_ext=.bam.bam
new_ext=_sorted.bam
find . -type f -name "*$old_ext" -print | while read file
do
  mv $file ${file%${old_ext}}${new_ext}
done
echo Finished Step5 - samtools BAM sorting process
date '+%m/%d/%y %H:%M:%S'
echo Finished Step5 - samtools BAM sorting process >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
cd ../..
#Current directory = NGS/

#In the next step we move files from the input and analysis directories to their respective "NGS/FinalOutputs" subdirectory so the pipeline can be used another time.
echo Cleaning up Input and Analysis Directories
date '+%m/%d/%y %H:%M:%S'
echo Cleaning up Input and Analysis Directories >> AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> AnalysisNotes/Analysis.log
cd InputSequence/Illumina/Read1
#Current directory = NGS/InputSequence/Illumina/Read1
echo Moving the following Illumina Fastq Files from NGS/InputSequence/Illumina/Read1 to NGS/FinalOutputs/Illumina/Read1:
for ligne in `ls *.txt`
do
  echo $ligne
done
echo Moving Illumina Fastq Files from NGS/InputSequence/Illumina/Read1 to NGS/FinalOutputs/Illumina/Read1 >> ../../../AnalysisNotes/Analysis.log
for ligne in `ls *.txt`
do
  echo $ligne >> ../../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.txt`
do
  mv $ligne ../../../FinalOutputs/Illumina/Read1/
done
cd ../Read2
#Current directory = NGS/InputSequence/Illumina/Read2
echo Moving the following Illumina Fastq Files from NGS/InputSequence/Illumina/Read2 to NGS/FinalOutputs/Illumina/Read2:
for ligne in `ls *.txt`
do
  echo $ligne
done
echo Moving Illumina Fastq Files from NGS/InputSequence/Illumina/Read2 to NGS/FinalOutputs/Illumina/Read2 >> ../../../AnalysisNotes/Analysis.log
for ligne in `ls *.txt`
do
  echo $ligne >> ../../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.txt`
do
  mv $ligne ../../../FinalOutputs/Illumina/Read2/
done
echo Moving Illumina Input Files Complete
date '+%m/%d/%y %H:%M:%S'
echo Moving Illumina Input Files Complete >> ../../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../../AnalysisNotes/Analysis.log
cd ../../SangerFastq
#Current directory = NGS/InputSequence/SangerFastq
echo Moving the following Sanger Format Fastq Files from NGS/InputSequence/SangerFastq to NGS/FinalOutputs/SangerFastq:
for ligne in `ls *.fastq`
do
  echo $ligne
done
echo Moving the following Sanger Format Fastq Files from NGS/InputSequence/SangerFastq to NGS/FinalOutputs/SangerFastq: >> ../../AnalysisNotes/Analysis.log
for ligne in `ls *.fastq`
do
  echo $ligne >> ../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.fastq`
do
  mv $ligne ../../FinalOutputs/SangerFastq/
done
echo Moving Sanger Format Fastq Files Complete
date '+%m/%d/%y %H:%M:%S'
echo Moving Sanger Format Fastq Files Complete >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
echo Moving the following Alignment .sai Files from NGS/InputSequence/SangerFastq to NGS/FinalOutputs/AlignmentResults:
for ligne in `ls *.sai`
do
  echo $ligne
done
echo Moving the following Alignment .sai Files from NGS/InputSequence/SangerFastq to NGS/FinalOutputs/AlignmentResults: >> ../../AnalysisNotes/Analysis.log
for ligne in `ls *.sai`
do
  echo $ligne >> ../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.sai`
do
  mv $ligne ../../FinalOutputs/AlignmentResults/
done
echo Moving Alignment Results Files Complete
date '+%m/%d/%y %H:%M:%S'
echo Moving Alignment Results Files Complete >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
cd ../../BAMfiles/Sorted/
#Current directory = NGS/BAMfiles/Sorted/
echo Moving the following Sorted BAM files from NGS/BAMfiles/Sorted/ to NGS/FinalOutputs/SortedBAMfiles:
for ligne in `ls *.bam`
do
  echo $ligne
done
echo Moving the following Sorted BAM files from NGS/BAMfiles/Sorted/ to NGS/FinalOutputs/SortedBAMfiles: >> ../../AnalysisNotes/Analysis.log
for ligne in `ls *.bam`
do
  echo $ligne >> ../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.bam`
do
  mv $ligne ../../FinalOutputs/SortedBAMfiles/
done
echo Moving Sorted BAM Files Complete
date '+%m/%d/%y %H:%M:%S'
echo Moving Sorted BAM Files Complete >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
cd ../../SAMfiles
#Current directory = NGS/SAMfiles
echo Deleting the following SAM Files from NGS/SAMfiles:
for ligne in `ls *.sam`
do
  echo $ligne
done
echo Deleting the following SAM Files from NGS/SAMfiles: >> ../AnalysisNotes/Analysis.log
for ligne in `ls *.sam`
do
  echo $ligne >> ../AnalysisNotes/Analysis.log
done
for ligne in `ls *.sam`
do
  rm $ligne
done
echo Deleting SAM Files Complete
date '+%m/%d/%y %H:%M:%S'
echo Deleting SAM Files Complete >> ../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../AnalysisNotes/Analysis.log
cd ../BAMfiles/Original/
#Current directory = NGS/BAMfiles/Original
echo Deleting the following BAM Files from NGS/BAMfiles/Original:
for ligne in `ls *.bam`
do
  echo $ligne
done
echo Deleting the following BAM Files from NGS/BAMfiles/Original: >> ../../AnalysisNotes/Analysis.log
for ligne in `ls *.bam`
do
  echo $ligne >> ../../AnalysisNotes/Analysis.log
done
for ligne in `ls *.bam`
do
  rm $ligne
done
echo Deleting BAM Files from NGS/BAMfiles/Original Complete
date '+%m/%d/%y %H:%M:%S'
echo Deleting BAM Files from NGS/BAMfiles/Original Complete >> ../../AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> ../../AnalysisNotes/Analysis.log
cd ../..
#Current directory = NGS/
echo Cleaning up Input and Analysis Directories Complete
date '+%m/%d/%y %H:%M:%S'
echo Cleaning up Input and Analysis Directories Complete >> AnalysisNotes/Analysis.log
date '+%m/%d/%y %H:%M:%S' >> AnalysisNotes/Analysis.log
echo ***Analysis Batch Complete***
echo ***Analysis Batch Complete*** >> AnalysisNotes/Analysis.log
# Future development: Merge BAM files based on filename structure "RunX_LaneX_RX", Index files for IGV viewing
# Figure out Perl or Python so I don't look like such an amateur :)
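As a partial answer to the "Carabelle" questions in the script header, here is a hedged sketch of a fail-safe block that could go near the top of the script; it assumes that being launched from the NGS folder means AnalysisNotes/ and SAMfiles/ are directly visible:
Code:
# abort if the script was not started from the NGS folder
if [ ! -d AnalysisNotes ] || [ ! -d SAMfiles ]; then
  echo "Please start this script from the NGS folder"
  exit 1
fi
# abort if NGS/SAMfiles already contains files (they would be deleted during clean-up)
if [ "`ls SAMfiles`" != "" ]; then
  echo "The NGS/SAMfiles directory is not empty, please remove files otherwise they will be deleted"
  exit 1
fi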
-
Originally posted by Jon_Keats:
...so I can get my own lab and make someone else deal with these issues.
That said, I actually kind of enjoy dealing with these kinds of informatics problems, and I'm glad you are providing an introduction that many folks will find useful. I wanted to point out another alignment program you haven't mentioned: Mosaik (from the Marth lab at BC), which has some of the best documentation of any of the short-read aligners I've used. The other advantage of Mosaik is that it allows much higher levels of divergence between the read and the reference than BWA/Bowtie do. I've found this useful as I'm working with reads from related bacterial strains that can diverge substantially from the reference, yet still have alignments supported with high depth. Mosaik alignments can be easily converted to SAM/BAM format for viewing in IGV and SNP calling with SAMtools, and there is also an associated SNP caller (GigaBayes).
I am also a big fan of IGV due to its flexibility. It is relatively easy to write Perl scripts that create your own custom tracks for viewing, using the GFF format. For example, I created a custom track that shows me the location of all SNPs above certain depth/quality thresholds.
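As a sketch of that idea (the input format here is made up: a tab-separated file of chromosome, position, depth, and quality per SNP call), even an awk one-liner can emit a 9-column GFF track that IGV will load:
Code:
# keep SNPs with depth >= 10 and quality >= 30; one GFF feature per SNP
awk -F'\t' '$3 >= 10 && $4 >= 30 {
  print $1 "\tsnpcalls\tSNP\t" $2 "\t" $2 "\t" $4 "\t.\t.\tdepth=" $3
}' snps.txt > snps.gff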
-
Making Some Progress
Getting ready to head out for AACR and get depressed about how far ahead the big sequencing centers are in the sequencing world. I'm betting we'll see at least 100 cancer genomes being discussed; any other guesses? Well, I know of at least 40, so that might be on the low side.
The good news on the home front is that I've cleaned up a bunch of disk space on our workstation and will have the poor beast going for the next 3-4 days processing data, so there should be a big pile of data to sift through when I get back. One lesson I learned is that pipelines are great, but you need to remember how much disk space the whole pipeline uses....or your morning starts with an error message of "0 bytes remaining". What the hell are bytes? I know of these things called megabytes, gigabytes, and terabytes, and I roughly remember kilobytes...I thought the sticker on our machine said Mac Pro, not Apple IIe. Therefore, version 2 of the pipeline will need some kind of check of the available disk space relative to the number of samples to be processed. Oh goodie, one more thing to learn, and I thought I was smart enough already!
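A hedged sketch of such a check, assuming (as a made-up figure) roughly 30 GB of intermediate files per input lane, with paths following the lowercase directory layout from the V2 script:
Code:
# compare required vs. available space, both in 1K blocks
lanes=`ls inputsequences/illumina/read1/*.txt | wc -l`
required=`expr $lanes \* 31457280`   # ~30 GB per lane, in 1K blocks
available=`df -k . | awk 'NR==2 {print $4}'`
if [ $available -lt $required ]; then
  echo "Only $available KB free but ~$required KB needed, aborting"
  exit 1
fi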
RE: Comments
MOSAIK: funny you mention it; the guys were up at the Mothership giving a talk on Wednesday that we tuned into over the webcast. Sounded pretty good from what they showed, but it did sound like it is a bit too much of a memory hog for our current workstation. However, they mentioned that an upcoming release will be less memory intensive.
CLCBio: I'm all for turnkey solutions, and would be the first to say that if you can drop thousands of dollars on data generation, you need to be willing to pay for analysis...otherwise what is the point. But as a new user, it seems simple for most sequencing cores to have in place basic pipelines using open-source software that can generate the basic outputs most investigators are looking to get back.
Important Note: If you use the open-source software packages, please remember someone put a great deal of work into even the bad ones, and many of the developers are great citizens on this forum. So if you see requests like Nils' request for grant support letters for BFAST, or similar requests from other developers, please take a couple of minutes to help the people who help you!
-
Remove the Duplicates
Well, the meeting is over and I'm disappointed we didn't see a bigger impact of sequencing. Boo hoo to all the people who only talked about published datasets; what's the point of a meeting if people only talk about their Nature paper from 3 months ago...I read that already! But a couple of little birdies told me about several big, multi-patient papers that are already accepted for publication...so does that mean 40 tumor/normal pairs is an Oncogene paper now? Okay, enough complaining.
So if you have been following the thread, I've managed to get from raw Illumina reads to sorted BAM files by pipelining the following commands in a unix shell script:
bwa index -a bwtsw hg18.fasta (done manually as it only needs to be done once)
maq ill2sanger s_x_sequence.txt s_x_sequence.fastq
bwa aln hg18.fasta s_x_sequence.fastq > s_x_sequence.sai
bwa samse hg18.fasta s_x_sequence.sai s_x_sequence.fastq > s_x_sequence.sam
samtools view -bS -o s_x_sequence.bam s_x_sequence.sam
samtools sort s_x_sequence.bam s_x_sequence_sorted (note: samtools sort takes an output prefix and appends ".bam" itself, producing s_x_sequence_sorted.bam)
But I'm still stuck at having multiple BAM files per sample and having to manually merge the files after the current pipeline finishes using:
samtools merge PatientX_MergedBAM.bam s_1_sequence_sorted.bam s_2_sequence_sorted.bam
Then I create index files (needed for IGV) using:
samtools index PatientX_MergedBAM.bam
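To automate those two steps per patient, here is a sketch that groups the sorted BAMs by sample prefix, assuming the "SampleName_RunX_LaneX_R1_bwase_sorted.bam" naming convention from the batch script:
Code:
for sample in `ls *_sorted.bam | sed 's/_Run.*//' | sort -u`
do
  samtools merge ${sample}_merged.bam ${sample}_Run*_sorted.bam
  samtools index ${sample}_merged.bam
done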
And then I drop them into the IGV browser (anyone else find it is faster to navigate around with chromosome coordinates than gene names?) FYI - I use the binary distribution version so I can use more than 2GB of RAM, since I can't use the 10GB version on our 8GB workstation. I've modified the igv_mac-intel.sh script to -Xmx6144m and then launch it from Terminal to open IGV with 6.144GB of RAM.
After looking at the data in IGV, I came to appreciate how many duplicates are present in my dataset... So I'm back to Google and SEQanswers trying to figure out how to do something about it. What I've come to understand is that most people, including Heng Li who built bwa/samtools, recommend using Picard over samtools to mark/remove duplicates, as Picard handles duplicates whose reads map to different chromosomes. So back to SourceForge to download Picard (http://sourceforge.net/projects/picard/files/). I downloaded the current version, placed the folder in my "NGS/ApplicationDownloads/" folder, and decompressed it. Right now I can't figure out how to put Java .jar files in my Java path (I'm sure I can figure it out, but I'm getting lazy and just copy the .jar files around as needed; bad, I know).
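One low-tech alternative to copying the jars around: set a shell variable to the Picard folder (the path below is hypothetical) and call each jar by its full path:
Code:
# e.g. in ~/.profile
PICARD=$HOME/ngs/applicationdownloads/picard
# then any tool can be run from anywhere without copying the jar:
java -Xmx4g -jar $PICARD/MarkDuplicates.jar INPUT=in.bam OUTPUT=out.bam METRICS_FILE=metrics.txt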
I've encountered one issue with my BWA-Samtools BAM files giving an error because the unmapped reads have characters in the CIGAR string which makes Picard grumpy. So you need to add one additional option to the command to set the file validation stringency to silent:
java -Xmx4g -jar MarkDuplicates.jar INPUT=PatientX_MergedBAM.bam OUTPUT=PatientX_MergedBAM_NoDups.bam METRICS_FILE=PatientX_NoDups_Stats.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT
Next: getting piled up in my pileup files...
-
I'll one up you...just a cleaner mkdir command.
#!/bin/bash   # brace expansion below requires bash
#Create_NGS_DirectoryStructureV3.sh
#Written by Ryan Golhar 05/05/2010
echo ***Creating Pipeline Directory Structure***
mkdir -p ngs/{analysisnotes,applicationdownloads,samfiles}
mkdir -p ngs/bamfiles/{merged,original,sorted}
mkdir -p ngs/finaloutputs/{alignmentresults,mergedbamfiles,sangerfastq,sortedbamfiles}
mkdir -p ngs/finaloutputs/illumina/{read1,read2}
mkdir -p ngs/inputsequences/illumina/{read1,read2} ngs/inputsequences/sangerfastq
mkdir -p ngs/refgenomes/{bfast_indexed,bowtie_indexed,bwa_indexed,genomedownloads}
mkdir -p ngs/scripts/scriptbackups
cp Create_NGS_DirectoryStructureV3.sh ngs/scripts/scriptbackups/
echo ***Pipeline Directory Structure Created***
-
Re: Removing the duplicates
Noticed this a second ago. It looks like you are working with single-end sequencing data, in which case you wouldn't want to remove duplicates. See the SAMtools documentation on SourceForge (item #6): the duplicate removal across chromosomes would only apply to paired-end data.