Old 09-25-2011, 09:34 PM   #61
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default Getting Better with Variables

Now that I have the lab (www.keatslab.org) moving forward to some extent, I'm back to working on my analysis methods. For a couple of reasons I've been looking for ways to improve on the previous methods so that even less user input is required, which saves me time running things myself and worrying about other users making mistakes. So here is what seems like a good version that relies on very limited user input. It takes advantage of a common parameters file, plus a series of basic user queries as the script launches, to set up the tophat and cufflinks run. It also uses bwa to find the insert size and standard deviation of the library.
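
For anyone wondering what the common parameters file looks like, it is just a shell fragment that gets sourced with ". $1" at the top of the script, one variable per line. The variable names below are the ones the script actually reads (they all get echoed into the run report at the end), but every value is a made-up placeholder, so swap in your own paths and settings.

Code:
# rnaseq_common_parameters.ini - example values only, edit to match your setup
LAUNCH_DIR="${HOME}/ngs/tophat"
READ_EXT=".fastq"
READ1_ID="_1.fastq"
READ2_ID="_2.fastq"
COMBINED_READ_LENGTH="200"          # ie. 2x100bp reads
CPU_CORES="8"
GENOME_NAME="GRCh37"
BOWTIE_INDEX="${HOME}/ngs/refgenomes/bowtie_indexed/grch37"
BWA_GENOME_FILE="${HOME}/ngs/refgenomes/bwa_indexed/grch37_transcriptome.fa"
REFERENCE_FILE="${HOME}/ngs/refgenomes/downloads/grch37_hg19/reference_sequences/grch37.fa"
GTF_FILE="${HOME}/ngs/refgenomes/downloads/grch37_hg19/annotation_files/ensembl_v60.gtf"
MASK_FILE="${HOME}/ngs/refgenomes/downloads/grch37_hg19/annotation_files/mask.gtf"
REFFLAT_FILE="${HOME}/ngs/refgenomes/downloads/grch37_hg19/annotation_files/refflat.txt"
RIBOSOME_LIST="${HOME}/ngs/refgenomes/downloads/grch37_hg19/annotation_files/ribosome.interval_list"
GENES_OF_INTEREST="${HOME}/ngs/run_parameters/genes_of_interest.txt"
READ1_BWA_FASTQ="_1_subset.fastq"
READ2_BWA_FASTQ="_2_subset.fastq"
READ1_BWA_SAI="_1_subset.sai"
READ2_BWA_SAI="_2_subset.sai"
BWA_SAMPE="_bwa"
SAMTOOLS_SORT_ID="_sorted"
READ_QUALITY=""                     # or a tophat quality flag such as --solexa1.3-quals
LIBRARY_FORMAT="fr-unstranded"
MAX_INSERT_SIZE="3"
MAX_DELETION_SIZE="3"
TEMP_DIRECTORY="${HOME}/ngs/temp"
COMPLETED_RNASEQ_READ_FILE_DIRECTORY="${HOME}/ngs/analyzed_read_files/rnaseq"
COMPLETED_RNASEQ_RESULTS_DIRECTORY="${HOME}/ngs/finaloutputs/rnaseq"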

Requirements:
1) The Create_Directory_Structure_V4 directory structure; the pipeline is flexible enough that this is not a hard requirement, it just simplifies things
2) bwa transcriptome index file for genome
3) bowtie index file for genome
4) Annotation files for Picard CollectRnaSeqMetrics.jar to function properly

Options:
1) It accepts gzip- or bzip2-compressed read files, or uncompressed read files with any extension
2) It creates a summary file at the end of each run, and you have the option to output a list of genes and their associated FPKM values to this summary file. So if you are looking at an siRNA experiment you could drop in the values for your gene of interest across samples, or any list of genes you would like a quick look at.
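
To make option 2 concrete, the gene list is just a plain text file with one gene symbol per line, which the script word-match greps against the cufflinks genes.expr output. A rough sketch (file name and genes made up for illustration):

Code:
# genes_of_interest.txt
NRAS
KRAS
CCND1

# which the script turns into something like:
grep -wf genes_of_interest.txt SampleA_GRCh37_TophatPE_DeNovo_Cufflinks_DeNovo_NoMask_genes.expr | awk '{print $1, $10}'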

General Outline:
1) Queries the user to define the setup, which sets the naming convention for each sample, the tophat run format (two options), the cufflinks run format (six options), and the gene list added to the summary file
2) Counts the number of reads in each input file (used to calculate % aligned)
3) Cuts out the first 1 million reads for bwa alignment against a transcriptome reference to define the insert size and standard deviation (see the worked example after this list)
4) Runs tophat
5) Runs cufflinks
6) Runs a series of picard metrics programs
7) Summarizes metrics and gene values
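
To make step 3 concrete: tophat wants the inner mate distance (-r), not the raw insert size, so the script subtracts the combined read length from the mean insert size Picard reports. A quick worked example with made-up numbers, assuming a 2x100bp library and the Picard v1.52 column layout this was tested with:

Code:
# Picard CollectInsertSizeMetrics puts MEAN_INSERT_SIZE in column 5 of data line 8 (v1.52)
# If the mean insert size is 280bp and COMBINED_READ_LENGTH=200:
#     inner mate distance = 280 - 200 = 80   -->   tophat ... -r 80 --mate-std-dev <stdev>
head -n 8 SampleA_Picard_InsertMetrics.txt | tail -n 1 | cut -f5 \
	| awk -v var1=200 '{printf "%.0f\n", $1-var1}'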

create_ngs_directorystructure_v4.sh

Code:
#!/bin/sh

# Create_NGS_DirectoryStructure_V4.sh
# 
#
# Created by Jonathan Keats on 09/25/11
# Translational Genomics Research Institute

#########################################################################
#  CREATES A DIRECTORY STRUCTURE TO SUPPORT A VARIETY OF NGS PIPELINES  #
#########################################################################

# Designed for a Mac OS environment and requires initiation from your home folder (/Users/You/)

# Check to confirm current location is $HOME/ (ie. /Users/You/)

echo "Confirming Script Initiation Directory"
var1=$HOME
if [ "`pwd`" != "$var1" ] 
	then 
	echo " The script must be launched from your home directory "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi
echo "1) Launch Location is Correct ($HOME/)"

# Create required directories to support pipelines (e.g. RNAseq_MensFormal_Mac.sh)

echo "***Creating Pipeline Directory Structure***"
mkdir -p ngs/{applications,bwa,run_parameters,scripts,temp,tophat,tophat_fusion}
mkdir -p ngs/analyzed_read_files/{chipseq,exomes,genomes,matepair,rnaseq}
mkdir -p ngs/finaloutputs/{chipseq,exomes,genomes,matepair,rnaseq}
mkdir -p ngs/refgenomes/{bfast_indexed,bowtie_indexed,bwa_indexed,downloads}
mkdir -p ngs/refgenomes/downloads/{ncbi36_hg18,grch37_hg19}
mkdir -p ngs/refgenomes/downloads/ncbi36_hg18/{annotation_files,reference_sequences}
mkdir -p ngs/refgenomes/downloads/grch37_hg19/{annotation_files,reference_sequences}
mv create_ngs_directorystructure_v4.sh ngs/scripts/
echo "***Pipeline Directory Structure Created***"
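
If you want to try it, the directory script must be launched from your home directory (it checks and exits otherwise), and once the structure exists the pipeline script is launched from ngs/tophat with the parameters file as its only argument. Something like the following, assuming you keep the .ini in the run_parameters folder:

Code:
cd $HOME
sh create_ngs_directorystructure_v4.sh

cd $HOME/ngs/tophat
sh $HOME/ngs/scripts/RNAseq_MensFormal_Mac.sh $HOME/ngs/run_parameters/rnaseq_common_parameters.ini
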
RNAseq_MensFormal_Mac.sh
Code:
#!/bin/sh

# RNAseq_MensFormal_Mac.sh
# 
#
# Created by Jonathan Keats on 9/21/11.
# Copyright 2011 Translational Genomics Research Institute. All rights reserved.

# Outline - Automated pipeline requiring limited input for multiple RNAseq samples using tophat and cufflinks

# Usage = RNAseq_MensFormal_Mac.sh <rnaseq_common_parameters.ini>

######################################################################################################
##																									##
##	Dependencies - Unix (mv, cd, find, basename, mkdir, echo, wc, head, tail, cut, awk, sed)		##
##				 - BWA (version with -I option)	(Tested with v0.5.9)								##
##				 - SAMTOOLS (Tested with v0.1.12a)													##
##				 - PICARD Tools (Tested with v1.52)													##
##					- CollectInsertSizeMetrics.jar (WATCH VERSION AS COLUMN OUTPUT DIFFERS)			##
##					- CollectRNAseqMetrics.jar														##
##					- MarkDuplicates.jar															##
##					- CollectAlignmentSummaryMetrics.jar											##
##					- MeanQualityByCycle.jar														##
##				 - TOPHAT (Tested with v1.3.2)														##
##				 - CUFFLINKS (Tested with v1.1.0)													##
##																									##
##	Outline - Place uniformly named read file pairs in TOPHAT folder								##
##					- This method is dependent on XXXX folder structure								##
##			- Update "tophat_run_parameters.ini" as appropriate to your samples						##
##			- Launch script which queries the user to define methods used and renaming convention   ##
##			  then does the following steps automatically:											##
##				- Create custom directory for each sample											##
##				- Create custom .ini file for each sample											##
##				- Cut out first 1 million reads from each read file									##
##				- Align to transcriptome reference using BWA										##
##				- Calculate inner-mate-distance and standard deviation								##
##					** inner-mate-distance = (insert size) - (combined read length)					##
##				- Run TOPHAT																		##
##				- Run CUFFLINKS																		##
##				- Calculate metrics 																##
##																									##
##   Notes	- Extensive use of common variables is used to facilitate automation					##
##			- There are some naming convention requirements but they are flexible					##
##					- Paired samples must follow the same convention:								##
##							--> SampleA_1.fastq  and  SampleA_2.fastq								##
##												OR													##
##							--> SampleA_Time1_1.txt and SampleA_Time1_2.txt							##
##					- You can mix different basenames but not common extensions						##
##							--> SampleB_1.fa / SampleB_2.fa											##
##							--> SampleC_Time1_1.fa / SampleC_Time1_2.fa								##
##		** This is fine: Common Extension = .fa ; Read Extension = _1.fa and _2.fa **				##
##							--> SampleB_1.fastq / SampleB_2.fastq									##
##							--> SampleC_1.fa / SampleC_2.fa											##
##							--> SampleD_R1.fa / SampleD_R2.fa										##
##      ** These can not be mixed, no "Common Extension" or common "Read Extension" **				##
##																									##
######################################################################################################

# Read in the tophat_common_parameters file
. $1

#Starting Directory=ngs/tophat/
echo
echo "*** STARTING SCRIPT ***"
date '+%m/%d/%y %H:%M:%S'
echo

##############################################################
##  Check that the script was launched from ngs/tophat dir  ##
##############################################################

echo Checking Launch Location
date '+%m/%d/%y %H:%M:%S'
echo

# Checking if launch location is correct

if [ "`pwd`" != "${LAUNCH_DIR}" ] 
	then 
	echo " The script must be launched from the ${LAUNCH_DIR} directory "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi
echo "1) Launch Location is Correct ($HOME/ngs/tophat)"

# Checking if launch directory contains any .ini files (No such files are expected)

if [ `ls *.ini 2>/dev/null | wc -l` != 0 ]       
	then 
	echo " The tophat directory contains unexpected .ini files - These must be removed before the script starts "
	echo " ERROR - The script was automatically killed due to a launch error - See Above Error Message"
	exit 2                     
fi
echo "2) No .ini Files Found in Launch Directory as Expected"
echo

echo Directory Check Completed
date '+%m/%d/%y %H:%M:%S'
echo

##############################################################
##  Query User for tophat and cufflinks configurations      ##
##############################################################

echo Starting User Configuration Query Step
date '+%m/%d/%y %H:%M:%S'
echo

# Ask user if the input read files are compressed or not
echo "Are the input read files compressed? (ie. gzip or bzip2) [Yes/No]"
read COMPRESSION_QUERY

# Test to see if the "COMPRESSION QUERY" Query Response was appropriate
## NOTE - In the IF statement use && as AND and || as OR in new shells (old OR is -o)
if [[ ${COMPRESSION_QUERY} = "Yes" || ${COMPRESSION_QUERY} = "No" ]]
	then
	echo 
	else
	echo
	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi

# Ask user for the compression type if they indicated that the read files are compressed
if [ ${COMPRESSION_QUERY} = "Yes" ]
	then
	echo "Please indicate the type of compression. [gzip/bzip2]"
	read COMPRESSION_TYPE
	if [[ ${COMPRESSION_TYPE} = "gzip" || ${COMPRESSION_TYPE} = "bzip2" ]]
		then
		echo 
		else
		echo
		echo " Incorrect Response Option, Please use gzip or bzip2 (Case-Sensitive) "
		echo " The script was automatically killed due to a launch error - See Above Error Message" 
		exit 2                              
	fi
fi
######################

echo "Would you like to run tophat with a gtf reference file? [Yes/No]"
read TOPHAT_QUERY_GTF

# Test to see if the Query Response to the use of a GTF file during tophat alignment was correct or not
## NOTE - In the IF statement use && as AND and || as OR in new shells (old OR is -o)
if [[ ${TOPHAT_QUERY_GTF} = "Yes" || ${TOPHAT_QUERY_GTF} = "No" ]]
	then
	echo 
	else
	echo
	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi

# Ask user for a short GTF Identifier for renaming purposes
if [ ${TOPHAT_QUERY_GTF} = "Yes" ]
	then
	echo "Please provide a short descriptor for the GTF file used during tophat alignment"
	echo "Such as ensembl v60 (E60), this is not standardized but for file sorting purposes you should be consistent between runs"
	echo Please Enter Response Now:
	read TOPHAT_GTF_SHORT
	echo                              
fi

echo   The following notes pertain to how cufflinks estimates abundances
echo -----------------------------------------------------------------------------------------
echo
echo A GTF file can be used in two mutually exclusive fashions -G/--GTF or -g/--GTF-guide
echo
echo "1) To limit abundance estimates to just the isoforms in the GTF file use -G/--GTF"
echo "*** In this script this is called LIMIT ABUNDANCE ESTIMATION to GTF reference ***"
echo NOTE: This limits the estimates to a consistent set of known isoforms
echo 
echo "2) To guide BUT NOT limit the abundance estimates to the GTF file use -g/--GTF-guide"
echo "*** In this script this is called GUIDE ABUNDANCE ESTIMATION to GTF reference ***"
echo NOTE: This allows identification of known and novel isoforms
echo -----------------------------------------------------------------------------------------
echo

echo "Would you like to LIMIT ABUNDANCE ESTIMATION with cufflinks to a gtf reference file? [Yes/No]"
read CUFFLINKS_QUERY_LIMIT_GTF

# Test to see if the Query Response to the use of a LIMIT ABUNDANCE ESTIMATION GTF file during cufflinks isoform assembly was correct or not
if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" || ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" ]]
	then
	echo 
	else
	echo
	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi

echo "Would you like to GUIDE ABUNDANCE ESTIMATION with cufflinks to a gtf reference file? [Yes/No]"
read CUFFLINKS_QUERY_GUIDE_GTF

# Test to see if the Query Response to the use of a GUIDE ABUNDANCE ESTIMATION GTF file during cufflinks isoform assembly was correct or not
if [[ ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" || ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" ]]
	then
	echo 
	else
	echo
	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi

# Test to ensure the user did not say "Yes" to the use of both the LIMIT and GUIDE GTF options
if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" ]]
	then
	echo " Incorrect Response Combination, You Can Not Say Yes to both the LIMIT and GUIDE ABUNDANCE ESTIMATION Options!! "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi

# Ask user for a short GTF Identifier for renaming purposes
if [ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" ]
	then
	echo "Please provide a short descriptor for the GTF file used to LIMIT ABUNDANCE ESTIMATION"
	echo "Such as ensembl v60 (E60), this is not standardized but for file sorting purposes you should be consistent between runs"
	echo Please Enter Response Now:
	read LIMIT_GTF_SHORT
	echo                              
fi

if [ ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" ]
	then
	echo "Please provide a short descriptor for the GTF file used to GUIDE ABUNDANCE ESTIMATION"
	echo "Such as ensembl v60 (E60), this is not standardized but for file sorting purposes you should be consistent between runs"
	echo Please Enter Response Now:
	read GUIDE_GTF_SHORT
	echo                              
fi

echo "Would you like to run cufflinks with an abundance estimate mask file? [Yes/No]"
read CUFFLINKS_QUERY_MASK

# Test to see if the Query Response to the use of a mask file during cufflinks isoform assembly was correct or not
if [[ ${CUFFLINKS_QUERY_MASK} = "Yes" || ${CUFFLINKS_QUERY_MASK} = "No" ]]
	then
	echo 
	else
	echo
	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi

# Ask user for a short mask identifier for renaming purposes
if [ ${CUFFLINKS_QUERY_MASK} = "Yes" ]
	then
	echo "Please provide a short descriptor for the mask file used to prevent certain transcripts from being used during abundance estimation"
	echo "Such as MT for mitochondria, IG for immunoglobulin, this is not standardized but for file sorting purposes you should be consistent between runs"
	echo Please Enter Response Now:
	read MASK_SHORT
	echo                              
fi

# Ask the user if they want to extract a list of genes and associated FPKM values from the cufflinks output into the run report
echo "Would you like to extract a gene list, defined by GENES_OF_INTEREST in the .ini file, into the run report? [Yes/No]"
read GENESOFINTEREST_QUERY

# Test to see if the Query Response to the genes-of-interest extraction question was correct or not
if [[ ${GENESOFINTEREST_QUERY} = "Yes" || ${GENESOFINTEREST_QUERY} = "No" ]]
	then
	echo 
	else
	echo
	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi

#Final User Query to ensure the variables in the Tophat_Run_Parameters.ini file match the user's responses

echo "FINAL CHECK: Are you sure the Tophat_Run_Parameters.ini file is updated to match the user responses you provided? [Yes/No]"
read FINAL_CHECK

# Test to see if the Query Response was correct or not
if [[ ${FINAL_CHECK} = "Yes" || ${FINAL_CHECK} = "No" ]]
	then
	echo 
	else
	echo
	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi

# Test to see if the user wants to stop the script (ie. they entered "No" as a response to the FINAL CHECK)

if [ ${FINAL_CHECK} = "Yes" ]
	then
	echo "Thanks... Your User Configured Run Will Start in a Moment"
	echo
	else
	echo
	echo "You Entered No to the query... The Script Will Stop Now"
	exit 2                              
fi

echo Finished User Configuration Query Step
date '+%m/%d/%y %H:%M:%S'
echo

##############################################################
##  Define the renaming tags based on the user's responses	##
##############################################################

echo Defining Renaming Convention Based on User Configuration Inputs
date '+%m/%d/%y %H:%M:%S'
echo

# TOPHAT_TAG options are TophatPE_DeNovo or TophatPE_GTF
if [ ${TOPHAT_QUERY_GTF} = "Yes" ]
	then
	TOPHAT_TAG="TophatPE_${TOPHAT_GTF_SHORT}GTF" 
	else
	TOPHAT_TAG="TophatPE_DeNovo"                              
fi
echo The tophat tag has been set to ${TOPHAT_TAG}

# CUFFLINKS_TAG options are Cufflinks_DeNovo_NoMask, Cufflinks_DeNovo_Masked, Cufflinks_Limited_NoMask, Cufflinks_Limited_Masked, Cufflinks_Guided_NoMask, Cufflinks_Guided_Masked

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
	then
	CUFFLINKS_TAG="Cufflinks_DeNovo_NoMask"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
	then
	CUFFLINKS_TAG="Cufflinks_DeNovo_${MASK_SHORT}Mask"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
	then
	CUFFLINKS_TAG="Cufflinks_${LIMIT_GTF_SHORT}Limited_NoMask"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
	then
	CUFFLINKS_TAG="Cufflinks_${LIMIT_GTF_SHORT}Limited_${MASK_SHORT}Mask"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
	then
	CUFFLINKS_TAG="Cufflinks_${GUIDE_GTF_SHORT}Guided_NoMask"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
	then
	CUFFLINKS_TAG="Cufflinks_${GUIDE_GTF_SHORT}Guided_${MASK_SHORT}Mask"                             
fi
echo The cufflinks tag has been set to ${CUFFLINKS_TAG}
echo

##############################################################
##  Create a custom analysis directory for each sample      ##
##############################################################

## Several tricks in this step: the find command, basename, and learning how \, ", and ' quoting differ, which lets you write special characters while still expanding variables

echo Creating Custom Directories and Splitting Files
date '+%m/%d/%y %H:%M:%S'

for file in `find . -type f -iname "*${READ1_ID}"`
do
filename=`basename $file "${READ1_ID}"`
mkdir $filename
mv $filename${READ1_ID} $filename
mv $filename${READ2_ID} $filename
echo "SAMPLE_NAME=\"$filename\"" >> $filename.ini
done

echo Directory Construction and File Splitting Complete
date '+%m/%d/%y %H:%M:%S'
echo

##############################################################
##  Analysis Pipeline - BWA, TOPHAT, CUFFLINKS, QC metrics  ##
##############################################################

# Initiate analysis of each identified sample one at a time

echo Starting Individual Sample Analysis
date '+%m/%d/%y %H:%M:%S'
echo

for ini in `ls *.ini`
do
#Read in the .ini file created during the directory construction period
. $ini
mv $ini ${SAMPLE_NAME}
cd ${SAMPLE_NAME}

#Current Directory=ngs/tophat/SampleX
echo Analyzing ${SAMPLE_NAME} | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

#Extract and count reads in the input files; three versions depending on compression status and compression type
if [ ${COMPRESSION_QUERY} = "No" ]
	then
	#Calculate the number of reads present in each file
	echo Starting to count the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
	date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
	echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	COUNT_READ1=`wc -l ${SAMPLE_NAME}${READ1_ID} | awk '{print $1/4}'`
	echo The forward or read1 fastq contains ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	COUNT_READ2=`wc -l ${SAMPLE_NAME}${READ2_ID} | awk '{print $1/4}'`
	echo The reverse or read2 fastq contains ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	echo Finished identifying the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

	#Cut out first 1 million reads from forward and reverse to estimate inner mate distance and STD
	echo Creating Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	head -n 4000000 ${SAMPLE_NAME}${READ1_ID} > ${SAMPLE_NAME}${READ1_BWA_FASTQ}
	echo Finished Creating Read1 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	head -n 4000000 ${SAMPLE_NAME}${READ2_ID} > ${SAMPLE_NAME}${READ2_BWA_FASTQ}
	echo Finished Creating Read2 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
	else
	if [ ${COMPRESSION_TYPE} = "gzip" ]
		then
		echo Starting to count the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		COUNT_READ1=`gunzip -c ${SAMPLE_NAME}${READ1_ID} | wc -l | awk '{print $1/4}'`
		echo The forward or read1 fastq contains ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		COUNT_READ2=`gunzip -c ${SAMPLE_NAME}${READ2_ID} | wc -l | awk '{print $1/4}'`
		echo The reverse or read2 fastq contains ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		echo Finished identifying the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

		#Cut out first 1 million reads from forward and reverse to estimate inner mate distance and STD
		echo Creating Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		gunzip -c ${SAMPLE_NAME}${READ1_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ1_BWA_FASTQ}
		echo Finished Creating Read1 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		gunzip -c ${SAMPLE_NAME}${READ2_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ2_BWA_FASTQ}
		echo Finished Creating Read2 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	fi
	if [ ${COMPRESSION_TYPE} = "bzip2" ]
		then
		echo Starting to count the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		# NOTE: bunzip2 needs -c to stream to stdout; -k alone writes a file and sends nothing down the pipe
		COUNT_READ1=`bunzip2 -c ${SAMPLE_NAME}${READ1_ID} | wc -l | awk '{print $1/4}'`
		echo The forward or read1 fastq contains ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		COUNT_READ2=`bunzip2 -c ${SAMPLE_NAME}${READ2_ID} | wc -l | awk '{print $1/4}'`
		echo The reverse or read2 fastq contains ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		echo Finished identifying the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

		#Cut out first 1 million reads from forward and reverse to estimate inner mate distance and STD
		echo Creating Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		bunzip2 -c ${SAMPLE_NAME}${READ1_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ1_BWA_FASTQ}
		echo Finished Creating Read1 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		bunzip2 -c ${SAMPLE_NAME}${READ2_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ2_BWA_FASTQ}
		echo Finished Creating Read2 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	fi
fi

#Align subset reads to transcriptome reference using BWA aln
echo Starting Alignment of Read1 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
bwa aln -t ${CPU_CORES} -I ${BWA_GENOME_FILE} ${SAMPLE_NAME}${READ1_BWA_FASTQ} > ${SAMPLE_NAME}${READ1_BWA_SAI}
echo Finished Aligning Read1 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo Starting Alignment of Read2 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
bwa aln -t ${CPU_CORES} -I ${BWA_GENOME_FILE} ${SAMPLE_NAME}${READ2_BWA_FASTQ} > ${SAMPLE_NAME}${READ2_BWA_SAI}
echo Finished Aligning Read2 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

#Generate .sam file using BWA sampe
echo Starting BWA SAMPE Assembly of Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
bwa sampe ${BWA_GENOME_FILE} ${SAMPLE_NAME}${READ1_BWA_SAI} ${SAMPLE_NAME}${READ2_BWA_SAI} ${SAMPLE_NAME}${READ1_BWA_FASTQ} ${SAMPLE_NAME}${READ2_BWA_FASTQ} > ${SAMPLE_NAME}${BWA_SAMPE}.sam
echo Finished BWA SAMPE Assembly Phase | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

#Convert BWA sam file to bam file using SAMTOOLS
echo Starting SAM to BAM conversion with Samtools | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
samtools view -bS -o ${SAMPLE_NAME}${BWA_SAMPE}.bam ${SAMPLE_NAME}${BWA_SAMPE}.sam
echo Finished SAM to BAM conversion with Samtools | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

#Sort BWA bam file using SAMTOOLS
echo Starting BAM file sorting step | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
samtools sort ${SAMPLE_NAME}${BWA_SAMPE}.bam ${SAMPLE_NAME}${BWA_SAMPE}${SAMTOOLS_SORT_ID}
echo Finished sorting BAM file | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

#Determine insert metrics using PICARD
echo Starting Picard CollectInsertSizeMetrics step | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
java -Xmx2g -jar $HOME/local/bin/CollectInsertSizeMetrics.jar INPUT=${SAMPLE_NAME}${BWA_SAMPE}${SAMTOOLS_SORT_ID}.bam OUTPUT=${SAMPLE_NAME}_Picard_InsertMetrics.txt HISTOGRAM_FILE=${SAMPLE_NAME}_Picard_InsertMetrics.pdf VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY}
echo Finished collecting insert metrics step | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

#Calculate "inner mate distance" and standard deviation
echo Starting to calculate insert metrics | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

INNER_DISTANCE=`head -n 8 ${SAMPLE_NAME}_Picard_InsertMetrics.txt | tail -n 1 | cut -f5 | awk -v var1="$COMBINED_READ_LENGTH" '{printf "%.0f\n", $1-var1}'`
echo The mean inner-mate-distance was identified as being ${INNER_DISTANCE}bp | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

STDEV=`head -n 8 ${SAMPLE_NAME}_Picard_InsertMetrics.txt | tail -n 1 | cut -f6 | awk '{printf "%.0f\n", $1}'`
echo The standard deviation of the inner-mate-distance was identified as being ${STDEV}bp | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

echo Finished calculating insert metrics | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

#Run tophat analysis
echo Starting Tophat Alignment of ${SAMPLE_NAME} | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

if [ ${TOPHAT_QUERY_GTF} = "Yes" ]       
	then 
	TOPHAT_COMMAND="tophat --num-threads ${CPU_CORES} -r ${INNER_DISTANCE} --mate-std-dev ${STDEV} --max-insertion-length ${MAX_INSERT_SIZE} --max-deletion-length ${MAX_DELETION_SIZE} ${READ_QUALITY} --library-type ${LIBRARY_FORMAT} -G ${GTF_FILE} ${BOWTIE_INDEX} ${SAMPLE_NAME}${READ1_ID} ${SAMPLE_NAME}${READ2_ID}"
fi

if [ ${TOPHAT_QUERY_GTF} = "No" ]       
	then 
	TOPHAT_COMMAND="tophat --num-threads ${CPU_CORES} -r ${INNER_DISTANCE} --mate-std-dev ${STDEV} --max-insertion-length ${MAX_INSERT_SIZE} --max-deletion-length ${MAX_DELETION_SIZE} ${READ_QUALITY} --library-type ${LIBRARY_FORMAT} ${BOWTIE_INDEX} ${SAMPLE_NAME}${READ1_ID} ${SAMPLE_NAME}${READ2_ID}"
fi

echo ${TOPHAT_COMMAND} | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
${TOPHAT_COMMAND}

## This creates a subfolder called "tophat_out/"
## This folder will contain:	accepted_hits.bam (coordinate-sorted, which the MarkDuplicates step below assumes)
##								deletions.bed
##								insertions.bed
##								junctions.bed

cd tophat_out/
#Current Directory=ngs/tophat/SampleX/tophat_out

mv accepted_hits.bam ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam
mv deletions.bed ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_deletions.bed
mv insertions.bed ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_insertions.bed
mv junctions.bed ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_junctions.bed

echo Finished Tophat Alignment | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

#Run cufflinks analysis
echo Starting Cufflinks Abundance Estimation of ${SAMPLE_NAME} | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
	then
	CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
	then
	CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} -M ${MASK_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
	then
	CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF ${GTF_FILE} --compatible-hits-norm --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
	then
	CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF ${GTF_FILE} --compatible-hits-norm -M ${MASK_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
	then
	CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF-guide ${GTF_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
	then
	CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF-guide ${GTF_FILE} -M ${MASK_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"                             
fi

echo ${CUFFLINKS_COMMAND} | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
${CUFFLINKS_COMMAND}

## This creates a subfolder called "cufflinks_out/" in the sample directory
## This folder will contain:	genes.fpkm_tracking
##								isoforms.fpkm_tracking
##								transcripts.gtf

#In the next step we will rename the expression estimate files

cd ../cufflinks_out/
#Current Directory=ngs/tophat/SampleX/cufflinks_out

mv genes.fpkm_tracking ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_genes.expr
mv isoforms.fpkm_tracking ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_transcripts.expr
mv transcripts.gtf ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_transcripts.gtf

echo Finished Cufflinks Abundance Estimation | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

##############################################################
##  Calculate metrics using Samtools and Picard				##
##############################################################

cd ../tophat_out/
#Current Directory=ngs/tophat/SampleX/tophat_out

#Determine the RNA Seq Metrics using Picard "CollectRNASeqMetrics.jar"
echo Starting Picard CollectRnaSeqMetrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
java -Xmx2g -jar $HOME/local/bin/CollectRnaSeqMetrics.jar REF_FLAT=${REFFLAT_FILE} RIBOSOMAL_INTERVALS=${RIBOSOME_LIST} STRAND_SPECIFICITY=NONE REFERENCE_SEQUENCE=${REFERENCE_FILE} INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt CHART_OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.pdf VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY}
echo Finished collecting RNAseq metrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

#Mark the duplicate reads using Picard "MarkDuplicates.jar"
echo Starting Picard MarkDuplicates step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
## NOTE: "-Dsnappy.disable=true" is passed to this command to prevent the use of snappy-java by MarkDuplicates as I can not figure out how to add it to the JAVA CLASSPATH 
java -Dsnappy.disable=true -Xmx2g -jar $HOME/local/bin/MarkDuplicates.jar INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam METRICS_FILE=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_MarkDuplicates_Metrics.txt REMOVE_DUPLICATES=false ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY}
echo Finished MarkDuplicates step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

#Create an index file for the bam file generated from the MarkDuplicates step using SAMTOOLS INDEX
echo Starting bam file indexing step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
samtools index ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam
echo Finished bam file indexing step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

#Calculate mapping statistics using SAMTOOLS FLAGSTAT
echo Starting Samtools Flagstat step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
samtools flagstat ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam > ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_SamtoolsMappingStats.txt
echo Finished Samtools Flagstat step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

#Calculate mapping statistics using Picard "CollectAlignmentSummaryMetrics.jar"
echo Starting Picard CollectAlignmentSummaryMetrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
java -Xmx2g -jar $HOME/local/bin/CollectAlignmentSummaryMetrics.jar INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt VALIDATION_STRINGENCY=SILENT REFERENCE_SEQUENCE=${REFERENCE_FILE} ASSUME_SORTED=true IS_BISULFITE_SEQUENCED=false TMP_DIR=${TEMP_DIRECTORY}
echo Finished Picard CollectAlignmentSummaryMetrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

#Calculate read quality stats using Picard "MeanQualityByCycle.jar"
echo Starting Picard MeanQualityByCycle step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
java -Xmx2g -jar $HOME/local/bin/MeanQualityByCycle.jar INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardQualByCycle.txt CHART_OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardQualByCycle.pdf VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY}
echo Finished Picard MeanQualityByCycle step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

##############################################################
##  Write out summary file (Stats/Metrics/Indexes?)			##
##############################################################

#Create a summary text file
echo Starting Creation of Summary File | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

echo "*** Analysis Run Report ***" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo Sample ${SAMPLE_NAME} was Analyzed | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

# Test to see if the user wants to extract FPKM values for a list of genes into the run report
if [ ${GENESOFINTEREST_QUERY} = "Yes" ]
	then
	echo FPKM Values for Common Genes of Interest: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
	grep -wf ${GENES_OF_INTEREST} ../cufflinks_out/${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_genes.expr | awk '{print $1, $10}' | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
	echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
fi

echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

echo Sample Summary: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "The two read files processed were ${SAMPLE_NAME}${READ1_ID} and ${SAMPLE_NAME}${READ2_ID}" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${SAMPLE_NAME}${READ1_ID} contained ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${SAMPLE_NAME}${READ2_ID} contained ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo After aligning the first one million reads to a transcriptome reference using BWA | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo the inner-mate-distance was determined to be ${INNER_DISTANCE}bp with a standard deviation of ${STDEV} | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "The alignment metrics for this sample were as follows:" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

#Extract Alignment Metrics
UNIQUE_READ1_READS_ALIGNED=`cut -f3 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | head -n 8 | tail -n 1`
UNIQUE_READ2_READS_ALIGNED=`cut -f3 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | head -n 9 | tail -n 1`
TOTAL_READ1_ALIGNMENTS=`awk '{print $1}' ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_SamtoolsMappingStats.txt | head -n 5 | tail -n 1`
TOTAL_READ2_ALIGNMENTS=`awk '{print $1}' ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_SamtoolsMappingStats.txt | head -n 6 | tail -n 1`
PERCENT_READ1_ALIGNED=`echo ${COUNT_READ1} | awk -v var2="${UNIQUE_READ1_READS_ALIGNED}" '{print var2/$1}'`
PERCENT_READ2_ALIGNED=`echo ${COUNT_READ2} | awk -v var3="${UNIQUE_READ2_READS_ALIGNED}" '{print var3/$1}'`
PERCENT_READ1_Q20_BASES=`head -n 8 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | tail -n 1 | awk '{print $11/$8}'`
PERCENT_READ2_Q20_BASES=`head -n 9 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | tail -n 1 | awk '{print $11/$8}'`
PERCENT_PAIRED_READ_ALIGNMENT=`cut -f17 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | head -n 8 | tail -n 1 | awk -v var4="${COUNT_READ1}" '{print $1/var4}'`

echo Alignment Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${UNIQUE_READ1_READS_ALIGNED} - Number of Unique Reads from Read1 Aligned | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${UNIQUE_READ2_READS_ALIGNED} - Number of Unique Reads from Read2 Aligned | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${TOTAL_READ1_ALIGNMENTS} - Total Number of Alignment Events from the Unique Reads in Read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${TOTAL_READ2_ALIGNMENTS} - Total Number of Alignment Events from the Unique Reads in Read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_READ1_ALIGNED} - Percentage of Unique Read1 Reads Aligned to at least one Position | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_READ2_ALIGNED} - Percentage of Unique Read2 Reads Aligned to at least one Position | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_READ1_Q20_BASES} - Percentage of Aligned Bases from Read1, which are greater than Q20 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_READ2_Q20_BASES} - Percentage of Aligned Bases from Read2, which are greater than Q20 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_PAIRED_READ_ALIGNMENT} - Percentage of Reads Aligned as Pairs | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

#Extract RNAseq Metrics
PERCENT_RIBOSOME_BASES=`cut -f11 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
PERCENT_CODING_BASES=`cut -f12 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
PERCENT_UTR_BASES=`cut -f13 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
PERCENT_INTRONIC_BASES=`cut -f14 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
PERCENT_INTERGENIC_BASES=`cut -f15 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
PERCENT_MRNA_BASES=`cut -f16 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
MEDIAN_5PRIME_BIAS=`cut -f20 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
MEDIAN_3PRIME_BIAS=`cut -f21 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
MEDIAN_5PRIME_3PRIME_BIAS_RATIO=`cut -f22 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`

echo Transcriptome Sequencing Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_RIBOSOME_BASES} - Percent Ribosomal Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_CODING_BASES} - Percent Coding Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_UTR_BASES} - Percent UTR Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_INTRONIC_BASES} - Percent Intronic Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_INTERGENIC_BASES} - Percent Intergenic Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_MRNA_BASES} - Percent mRNA Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${MEDIAN_5PRIME_BIAS} - Median 5 Prime Coverage Bias | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${MEDIAN_3PRIME_BIAS} - Median 3 Prime Coverage Bias | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${MEDIAN_5PRIME_3PRIME_BIAS_RATIO} - Median 5/3 Coverage Ratio | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

#Extract Duplication Metrics
PERCENT_DUPLICATION=`head -n 8 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_MarkDuplicates_Metrics.txt | tail -n 1 | cut -f8`
LIBRARY_SIZE=`head -n 8 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_MarkDuplicates_Metrics.txt | tail -n 1 | cut -f9`

echo Duplication Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${LIBRARY_SIZE} - Estimated fragments in Library | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_DUPLICATION} - Percent Duplicate Reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

#Calculate Quality Metrics (first/last cycle and mean per read)
QUALITY_CAPTURE_LINES=`echo ${COMBINED_READ_LENGTH} | awk '{print $1+8}'`
QUALITY_CAPTURE_READS=`echo ${COMBINED_READ_LENGTH} | awk '{print $1/2}'`
head -n ${QUALITY_CAPTURE_LINES} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardQualByCycle.txt | tail -n ${COMBINED_READ_LENGTH} > run_qualities.txt
head -n ${QUALITY_CAPTURE_READS} run_qualities.txt > read1_qualities.txt
tail -n ${QUALITY_CAPTURE_READS} run_qualities.txt > read2_qualities.txt
QUALITY_READ1_FIRST_CYCLE1=`cut -f2 read1_qualities.txt | head -n 1`
QUALITY_READ1_LAST_CYCLE1=`cut -f2 read1_qualities.txt | tail -n 1`
QUALITY_READ1_AVERAGE=`awk '{s+=$2} END {print s/NR}' < read1_qualities.txt`
QUALITY_READ2_FIRST_CYCLE1=`cut -f2 read2_qualities.txt | head -n 1`
QUALITY_READ2_LAST_CYCLE1=`cut -f2 read2_qualities.txt | tail -n 1`
QUALITY_READ2_AVERAGE=`awk '{s+=$2} END {print s/NR}' < read2_qualities.txt`

echo Quality Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo This run consists of paired ${QUALITY_CAPTURE_READS}x${QUALITY_CAPTURE_READS}bp reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${QUALITY_READ1_AVERAGE} - Mean phred score for read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${QUALITY_READ1_FIRST_CYCLE1} - Mean phred score for the first cycle of read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${QUALITY_READ1_LAST_CYCLE1} - Mean phred score for the last cycle of read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${QUALITY_READ2_AVERAGE} - Mean phred score for read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${QUALITY_READ2_FIRST_CYCLE1} - Mean phred score for the first cycle of read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${QUALITY_READ2_LAST_CYCLE1} - Mean phred score for the last cycle of read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

#Output complete list of variables used in the run
echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo   THE FOLLOWING VARIABLES WERE GENERATED DURING THE ANALYSIS OF THIS SAMPLE: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

echo "COMPRESSION_QUERY=\"${COMPRESSION_QUERY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COMPRESSION_TYPE=\"${COMPRESSION_TYPE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "TOPHAT_QUERY_GTF=\"${TOPHAT_QUERY_GTF}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "TOPHAT_GTF_SHORT=\"${TOPHAT_GTF_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "CUFFLINKS_QUERY_LIMIT_GTF=\"${CUFFLINKS_QUERY_LIMIT_GTF}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "CUFFLINKS_QUERY_GUIDE_GTF=\"${CUFFLINKS_QUERY_GUIDE_GTF}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "LIMIT_GTF_SHORT=\"${LIMIT_GTF_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "GUIDE_GTF_SHORT=\"${GUIDE_GTF_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "CUFFLINKS_QUERY_MASK=\"${CUFFLINKS_QUERY_MASK}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "MASK_SHORT=\"${MASK_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "GENESOFINTEREST_QUERY=\"${GENESOFINTEREST_QUERY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "FINAL_CHECK=\"${FINAL_CHECK}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "TOPHAT_TAG=\"${TOPHAT_TAG}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "CUFFLINKS_TAG=\"${CUFFLINKS_TAG}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "SAMPLE_NAME=\"${SAMPLE_NAME}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COUNT_READ1=\"${COUNT_READ1}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COUNT_READ2=\"${COUNT_READ2}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "INNER_DISTANCE=\"${INNER_DISTANCE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "STDEV=\"${STDEV}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "TOPHAT_COMMAND=\"${TOPHAT_COMMAND}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "CUFFLINKS_COMMAND=\"${CUFFLINKS_COMMAND}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo THE FOLLOWING VARIABLES WERE LOADED FROM THE RNAseq_MensFormalWear_Parameters.ini DURING THE ANALYSIS OF THIS SAMPLE: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo They may or may not have been used depending on your query responses | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "LAUNCH_DIR=\"${LAUNCH_DIR}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ_EXT=\"${READ_EXT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ1_ID=\"${READ1_ID}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ2_ID=\"${READ2_ID}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COMBINED_READ_LENGTH=\"${COMBINED_READ_LENGTH}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "CPU_CORES=\"${CPU_CORES}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "GENOME_NAME=\"${GENOME_NAME}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "REFERENCE_FILE=\"${REFERENCE_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "BWA_GENOME_FILE=\"${BWA_GENOME_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ1_BWA_FASTQ=\"${READ1_BWA_FASTQ}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ2_BWA_FASTQ=\"${READ2_BWA_FASTQ}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ1_BWA_SAI=\"${READ1_BWA_SAI}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ2_BWA_SAI=\"${READ2_BWA_SAI}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "BWA_SAMPE=\"${BWA_SAMPE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "SAMTOOLS_SORT_ID=\"${SAMTOOLS_SORT_ID}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "BOWTIE_INDEX=\"${BOWTIE_INDEX}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "GTF_FILE=\"${GTF_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ_QUALITY=\"${READ_QUALITY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "LIBRARY_FORMAT=\"${LIBRARY_FORMAT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "MAX_INSERT_SIZE=\"${MAX_INSERT_SIZE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "MAX_DELETION_SIZE=\"${MAX_DELETION_SIZE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "MASK_FILE=\"${MASK_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "TEMP_DIRECTORY=\"${TEMP_DIRECTORY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "REFFLAT_FILE=\"${REFFLAT_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "RIBOSOME_LIST=\"${RIBOSOME_LIST}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "GENES_OF_INTEREST=\"${GENES_OF_INTEREST}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COMPLETED_RNASEQ_READ_FILE_DIRECTORY=\"${COMPLETED_RNASEQ_READ_FILE_DIRECTORY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COMPLETED_RNASEQ_RESULTS_DIRECTORY=\"${COMPLETED_RNASEQ_RESULTS_DIRECTORY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

echo Finished Creating the Summary File | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

##############################################################
##  Clean up analysis directory and Move files to Archive	##
##############################################################

echo Starting to Clean up Analysis Directory | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

#Remove temp files created during calculation steps
rm run_qualities.txt
rm read1_qualities.txt
rm read2_qualities.txt

#Move tophat outputs and metric files to archive directory
mv ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
mv ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam.bai ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
mv *.bed ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
mv *.pdf ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
mv *.txt ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}

#Move cufflinks outputs to archive directory
cd ../cufflinks_out
#Current Directory=ngs/tophat/SampleX/cufflinks_out
mv *.expr ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
mv *.gtf ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}

#Move fastq, ini, and log file to archive directories
cd ../
#Current Directory=ngs/tophat/SampleX
mv ${SAMPLE_NAME}${READ1_ID} ${COMPLETED_RNASEQ_READ_FILE_DIRECTORY}
mv ${SAMPLE_NAME}${READ2_ID} ${COMPLETED_RNASEQ_READ_FILE_DIRECTORY}

echo Finished Cleaning up Analysis Directory | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

mv *.log ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}

#Move back to launch directory to start next sample in batch
cd ../
#Current Directory=ngs/tophat/

#Remove custom analysis directory and non-required files
rm -rf ${SAMPLE_NAME}

done
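As an aside for anyone adapting this script: the long run of repeated "| tee -a ...RunReport.txt" lines can be collapsed with a small helper function. This is just a sketch of the idea (the "report" function name and REPORT_FILE variable are ours, not part of the script above):
Code:
# Define once, after the sample/genome/tag variables are set
REPORT_FILE=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

report() {
	# Print to the terminal and append to the per-sample run report
	echo "$1" | tee -a ${REPORT_FILE}
}

# Each summary line then becomes a single short call
report "SAMPLE_NAME=\"${SAMPLE_NAME}\""
report "TOPHAT_COMMAND=\"${TOPHAT_COMMAND}\""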
rnaseq_common_parameters.ini
Code:
## RNAseq Common parameters file ##

# Use this variable to indicate the expected launch directory
# (Update as appropriate)
LAUNCH_DIR=/Users/jkeats/ngs/tophat

##################################################################################
## 										##	
##    Common Parameters used at multiple steps or used to name output files	##
##				(Update as appropriate)				##
##										##
##################################################################################

#sequence read file extension
READ_EXT=".fastq.gz"

#forward or read1 identifier
READ1_ID="_1.fastq.gz"

#reverse or read2 identifier
READ2_ID="_2.fastq.gz"

#What is the combined read length of the two read pairs (ie. 50x50=100, 100x100=200)
COMBINED_READ_LENGTH="100"

#How many cores/processors are available for the multi-threaded phases of BWA, TOPHAT, CUFFLINKS
CPU_CORES="1"

#This is used to rename result files to include the genome version
GENOME_NAME=hg19

#Provide the path and full file name to the genome reference file used to create the bowtie index
REFERENCE_FILE=/Users/jkeats/ngs/refgenomes/bowtie_indexed/igenomes_hg19/genome.fa

##################################################################################
##										##
##    Common BWA parameters (Only the BWA_GENOME_FILE needs to be updated)	##
##										##
##################################################################################

BWA_GENOME_FILE="/Users/jkeats/ngs/refgenomes/bwa_indexed/GRCh37_E60cDNA.fa"

READ1_BWA_FASTQ="_1_bwa.fastq"
READ2_BWA_FASTQ="_2_bwa.fastq"

READ1_BWA_SAI="_1_bwa.sai"
READ2_BWA_SAI="_2_bwa.sai"

BWA_SAMPE="_bwa_sampe"

##################################################################################
##										##
##    Common Samtools parameters (Shouldn't need to be modified)		##
##										##	
##################################################################################

SAMTOOLS_SORT_ID="_sorted"

##################################################################################
## 										##
##    Common Tophat Parameters (Update as appropriate)				##
##										##
##################################################################################

#Bowtie index to align data against [ <filename> ]
BOWTIE_INDEX="/Users/jkeats/ngs/refgenomes/bowtie_indexed/igenomes_hg19/genome"

#GTF file to use for LIMITED ABUNDANCE ESTIMATION (-G/--GTF) or GUIDED ABUNDANCE ESTIMATION ( -g/--GTF-guide)
GTF_FILE="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/hg19_genes.gtf"

#Quality type [ --solexa-quals  OR  --solexa1.3-quals (same as phred64-quals)  OR  --phred64-quals (same as solexa1.3-quals) ]
READ_QUALITY="--solexa1.3-quals"

#Library type [ fr-unstranded  OR  fr-firststrand  OR  fr-secondstrand ]
LIBRARY_FORMAT="fr-unstranded"

#Maximum insertion size allowed IF allowing indels [ --max-insertion-length     <int>     default: 3 ]
MAX_INSERT_SIZE="3"

#Maximum deletion size allowed IF allowing indels [ --max-deletion-length     <int>     default: 3 ]
MAX_DELETION_SIZE="3"

##################################################################################
##										##
##   Common Cufflinks Parameters (Update as appropriate)			##
##										##
##################################################################################

#This should be used if using an exclusion file for cufflinks abundance estimates
MASK_FILE="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/GRCh37_E60_CufflinksExcludedTranscripts.gtf"

##################################################################################
##										##
##   Common Picard Parameters (Update as appropriate)				##
##										##
##################################################################################

#Provide a directory for temporary files (Not always needed but on multi-harddrive systems JAVA may use an unexpected directory)
TEMP_DIRECTORY="/Users/jkeats/ngs/refgenomes/temp"

#This is a refFlat format table outlining the transcript space of the genome analyzed
REFFLAT_FILE="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/hg19_refFlat.txt"

#This is the ribosome exclusion file, a list of ribosome gene locations to report the percentage of reads related to ribosomal transcripts
RIBOSOME_LIST="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/hg19_ribosome_gene_locations.txt"

##################################################################################
##										##
##   Gene List of Interest (Update as appropriate)				##
##										##
##################################################################################

#List of genes to extract
GENES_OF_INTEREST="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/TC_Class_Genes.txt"

##################################################################################
##										##
##   Final Directories (Update as appropriate)					##
##										##
##################################################################################

#Provide the path to the folder where you want to archive the read files
COMPLETED_RNASEQ_READ_FILE_DIRECTORY="/Users/jkeats/ngs/analyzed_read_files/rnaseq/"

#Provide the path to the folder where you want to archive the result files
COMPLETED_RNASEQ_RESULTS_DIRECTORY="/Users/jkeats/ngs/finaloutputs/rnaseq/"
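For anyone wiring this parameters file into their own scripts: since it is just plain shell variable assignments, a pipeline can load it with the shell's "." (source) builtin. A minimal sketch, assuming the file sits in the launch directory (the existence check is ours, not part of the pipeline above):
Code:
#!/bin/sh
PARAMS="rnaseq_common_parameters.ini"
if [ ! -f "${PARAMS}" ]
	then
	echo "Parameters file ${PARAMS} not found - exiting"
	exit 2
fi
. ./${PARAMS}
echo "Loaded parameters: genome=${GENOME_NAME}, cores=${CPU_CORES}, launch=${LAUNCH_DIR}"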

Last edited by Jon_Keats; 11-26-2011 at 09:07 PM. Reason: updated script bug
Jon_Keats is offline   Reply With Quote
Old 09-26-2011, 08:23 AM   #62
LOH
Registered Vendor
 
Location: CA

Join Date: Jul 2010
Posts: 23
Default

I learned a lot from your post. And the website for your lab looks great.
keep up with the good work.
LOH is offline   Reply With Quote
Old 11-11-2011, 10:44 PM   #63
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

GATK Trial and Error

I'm finally getting into using the GATK package. My first impression is that I find it much more intimidating than any of the Picard tools, and the documentation is a bit harder to follow. Though I think that is largely the difference between a wiki-style interface and a single website page.

Notes:
1) Most analysis methods require indexed bam files. So get into a habit of just indexing every bam file. Typically I only indexed the final bam file for viewing in IGV
2) Related to 1) almost every file seems to require indexes. The reference genome (genome.fasta) file needs an index (genome.fasta.fai), and a dictionary file (genome.dict)
3) Related to 1) and 2) ....see a trend. The input vcf/rod files (file.vcf) need to be indexed (file.vcf.idx)

The frustrating thing is I know how to make the .fai file
Code:
samtools faidx <myFASTA>
I know how to make the dictionary file using picard
Code:
java -jar CreateSequenceDictionary.jar REFERENCE=genome.fasta OUTPUT=genome.dict
But I have no idea how to create the .idx index file for the vcf file. So right now I'm limited to the files provided on the GATK website, as they provide complete indexes for each.

4) Fixing mates with Picard is no longer required after indel realignment as GATK does it on the fly now

Much more to come I have a feeling
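For completeness on point 1, indexing a bam file is a one-liner with samtools (standard samtools usage, shown here for anyone new to it; the bam must be coordinate-sorted first):
Code:
# Creates MyFile.bam.bai alongside the sorted bam
samtools index MyFile.bam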

Last edited by Jon_Keats; 11-16-2011 at 10:21 AM. Reason: fixed error
Jon_Keats is offline   Reply With Quote
Old 11-18-2011, 07:46 AM   #64
ahmetz
Member
 
Location: new york

Join Date: Jun 2011
Posts: 23
Default

I'm glad to hear that I'm not the only one intimidated by the documentation, or the lack thereof. I feel like a big portion of TopHat/Cufflinks users are using those tools because of the nice, relatively easy-to-understand documentation.

Please keep on posting, Jon. I had given up on GATK and was trying to learn Python, but now I feel like I should get back into it!
ahmetz is offline   Reply With Quote
Old 11-20-2011, 04:33 PM   #65
DineshCyanam
Compendia Bio
 
Location: Ann Arbor

Join Date: Oct 2010
Posts: 35
Default

Thanks a lot Jon for all the scripts... Really appreciate your work... Makes my life super easy... I have lost count of how many times I have referred back to this thread...

- Dinesh
DineshCyanam is offline   Reply With Quote
Old 11-26-2011, 08:28 PM   #66
DineshCyanam
Compendia Bio
 
Location: Ann Arbor

Join Date: Oct 2010
Posts: 35
Default

I believe line #316 in the RNAseq_MensFormal_Mac.sh should be as below. Or else no matter what user response you give (Yes/No), the tophat tag will always be set to TophatPE_GTF.
Code:
# TOPHAT_TAG options are TophatPE_DeNovo or TophatPE_GTF
if [[ ${TOPHAT_QUERY_GTF} = "Yes" ]]
    then
    TOPHAT_TAG="TophatPE_${TOPHAT_GTF_SHORT}GTF" 
fi

if [[ ${TOPHAT_QUERY_GTF} = "No" ]]
    then
    TOPHAT_TAG="TophatPE_DeNovo"
fi
echo The tophat tag has been set to ${TOPHAT_TAG}
DineshCyanam is offline   Reply With Quote
Old 11-26-2011, 09:52 PM   #67
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

Hi Dinesh,

Thanks for the good catch. I'd made this update on my local version but forgot to update this thread and the version on the website. The version above is updated now and the website version should be done tonight.

Updated as follows:
Code:
# TOPHAT_TAG options are TophatPE_DeNovo or TophatPE_GTF
if [ ${TOPHAT_QUERY_GTF} = "Yes" ]
	then
	TOPHAT_TAG="TophatPE_${TOPHAT_GTF_SHORT}GTF" 
	else
	TOPHAT_TAG="TophatPE_DeNovo"                              
fi
echo The tophat tag has been set to ${TOPHAT_TAG}
I got your email about the mask file. I've avoided providing one, not because I don't think people should use one but because the contents may be specific to different applications. In my case we exclude immunoglobulin regions, which is a disease specific issue, but things like ribosomal and mitochondrial genes are likely things that 99% of users will exclude so I've placed a copy of the file I use on the website (www.keatslab.org).

To generate a similar file with the elements
Code:
#see what RNA types exist in your gtf file
cut -f 2 ensembl_for_your_genome.gtf | sort | uniq

#create a text file with the RNA types you want to exclude, such as
#to_exclude.txt
Mt_rRNA
Mt_tRNA
Mt_tRNA_pseudogene
rRNA
rRNA_pseudogene
MT

#note: grep -f treats each line of to_exclude.txt as a pattern matched anywhere
#in a line, so short entries like "MT" can over-match; anchor patterns if needed
grep -f to_exclude.txt ensembl_for_your_genome.gtf > CufflinksExcludedTranscripts.gtf
Jon_Keats is offline   Reply With Quote
Old 11-26-2011, 11:24 PM   #68
DineshCyanam
Compendia Bio
 
Location: Ann Arbor

Join Date: Oct 2010
Posts: 35
Thumbs up

Thank You, Jon for fixing the bug and for the cufflinks_exclude files.

I was about to post about HiSeq data and Illumina's switch to Sanger quality scores (as I had to update your script to work with HiSeq data) and see that you have posted it on your website. So thanks once again.

- Dinesh
DineshCyanam is offline   Reply With Quote
Old 12-05-2011, 04:39 AM   #69
ETHANol
Senior Member
 
Location: Western Australia

Join Date: Feb 2010
Posts: 308
Default

Hi Jon,
I nominate this as the best running thread on SeqAnswers. I'm going to be looking at a lot of RNA-seq data in the coming months, and I was wondering why you use Cufflinks over DESeq or edgeR?

Thanks for the scripts. I'll give them a go once I start getting some data back.

I've been embedding too much UNIX in my Perl scripts, and your scripts are a nice example of how to write shell scripts. Do you or anyone else know a good reference on writing shell scripts beyond the 'Unix and Perl tutorial for biologists'?
__________________
--------------
Ethan
ETHANol is offline   Reply With Quote
Old 12-13-2011, 11:57 AM   #70
anjulka
Member
 
Location: Cambridge, MA

Join Date: Oct 2011
Posts: 11
Default

Quote:
Originally Posted by Jon_Keats View Post
GATK Trial and Error
4) Fixing mates with Picard is no longer required after indel realignment as GATK does it on the fly now
How did you find this out? (I don't mean to doubt you, but I've been led astray before. Plus, I need to learn how to find out change information.)

Thanks.
anjulka is offline   Reply With Quote
Old 12-14-2011, 01:32 PM   #71
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

Hi Anjulka,

I completely understand. The source was not obvious, I was following the GATK best practice guide which no longer suggests the picard fix-mate step. Similar to your concern I spent some time searching for the reason and finally found this thread on the GATK Get Satisfaction help page (http://getsatisfaction.com/gsa/topic...el_realignment). See the first reply by Eric Banks (GATK developer)
Jon_Keats is offline   Reply With Quote
Old 01-17-2012, 11:27 AM   #72
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

I'm a bit amazed by how many people follow or have read this thread, especially with it recently passing ECO's introduction sticky thread. I wonder if I can reference this thread to the grant reviewers who say I need more informatics help on my grants... Oh well, enough complaining.

I've been finalizing my exome pipeline for one of the projects, which deals with human myeloma cell lines where no germline/constitutional DNA sample is available. The primary goal is to identify variants in genes found in our patient sample screens, in order to identify potential model systems and appropriate controls.

I'll post the entire pipeline and examples of all the required input files, but the pipeline looks like the following (see below) and generates the Summary Run Report shown below after this command is initiated in a directory full of fastq files (which can be a total mixed bag: uncompressed, gzip, or bzip2 compressed)

<BWA_exome_pipeline.sh> <run_parameters.ini> <sample_sheet.txt>
http://www.keatslab.org/resources/ngs-tools

Code:
*** Analysis Run Report ***

Sample KMS11_JPN_p3_TruSeq70MbExome was Analyzed

-----------------------------------------------------------------------------

*** Lane Alignment Run Report ***

Sample KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L7 was Analyzed

The two read files processed were KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L7_R1.fastq.gz and KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L7_R2.fastq.gz
KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L7_R1.fastq.gz contained 50457788 reads
KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L7_R2.fastq.gz contained 50457788 reads
-----------------------------------------------------------------------------

*** Lane Alignment Run Report ***

Lane KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L8 was Analyzed
The two read files processed were KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L8_R1.fastq.gz and KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L8_R2.fastq.gz
KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L8_R1.fastq.gz contained 50279910 reads
KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L8_R2.fastq.gz contained 50279910 reads
-----------------------------------------------------------------------------

Sample Summary:

After alignment with BWA the mean insert size was determined to be 270.820553bp with a standard deviation of 91.662134

The alignment metrics for this sample are as follows:

Alignment Metrics:
83 - Average Read Length
201475396 - Total Number of Passing Filter Reads Aligned with bwa
199321958 - Number of Reads Aligned
99991035 - Number of Reads from Read1 Aligned
99330923 - Number of Reads from Read2 Aligned
0.989312 - Percentage of Unique Reads Aligned to at least one Position
0.992588 - Percentage of Unique Read1 Reads Aligned to at least one Position
0.986035 - Percentage of Unique Read2 Reads Aligned to at least one Position
0.844223 - Percentage of Aligned Bases, which are greater than Q20
0.99472 - Percentage of Reads Aligned as Pairs
0.000131 - Percentage of Reads which appear to be Adaptor Reads

Duplication Metrics:
500572402 - Estimated fragments in Library
0.095995 - Percent Duplicate Reads

Transcriptome Metrics:
0.56342 - Percent mRNA Bases
0.247456 - Percent Coding Bases
0.315964 - Percent UTR Bases
0.293039 - Percent Intronic Bases
0.143454 - Percent Intergenic Bases
0.661081 - Median 5 Prime Coverage Bias
0.730395 - Median 3 Prime Coverage Bias
0.938235 - Median 5/3 Coverage Ratio

Capture Metrics:
182341432 - Number of unique alignment events [non-duplicates]
0.905031 - Percent unique alignment events [non-duplicates]
162369533 - Number of unique alignment events with MAPQ>0
0.89047 - Percent of reads which are unique alignments with MAPQ>0
13.446 - Gigabases of unique alignments with MAPQ>0
0.696883 - Percent of aligned bases within padded bait region
0.303117 - Percent of aligned bases not associated with a bait region
101.257402 - Mean coverage of the target region
0.376064 - Percentage of bases on target
5.057 - Gigabases of on target base pairs
23.291078 - Fold enrichment
0.003347 - Percent of undetected targets
0.95061 - Percent of targets with greater than 10x coverage
0.923333 - Percent of targets with greater than 20x coverage
0.886545 - Percent of targets with greater than 30x coverage
0.03663 - Percent of low GC targets which are under represented
5.835684 - Percent of high GC targets which are under represented

Quality Metrics:
This run consists of paired 83x83bp reads
32.6069 - Mean phred score for read1
32.216839 - Mean phred score for the first cycle of read1
30.498119 - Mean phred score for the last cycle of read1
31.2961 - Mean phred score for read2
30.722569 - Mean phred score for the first cycle of read2
28.530552 - Mean phred score for the last cycle of read2

Variant Calling Metrics:
40956 - Number of high-confidence variants called by both GATK and Samtools
2.63 - Ti/Tv Ratio for high-confidence variants called by both GATK and Samtools
2059 - Number of high-confidence variants that do not exist in dbsnp
2.19 - Ti/Tv Ratio for high-confidence variants called by both GATK and Samtools that do not exist in dbsnp
3266 - Number of high-confidence variants that do not exist in 1000G project
2.00 - Ti/Tv Ratio for high-confidence variants called by both GATK and Samtools that do not exist in 1000G project
94.97 - Percentage of high-confidence variants that exist in dbsnp
92.60 - Percentage of variants called by GATK and filtered by Samtools that exist in dbsnp
81.86 - Percentage of variants called by Samtools and filtered by GATK that exist in dbsnp
87.19 - Percentage of variants called exclusively by GATK that exist in dbsnp
41.50 - Percentage of variants called exclusively by Samtools that exist in dbsnp
92.95 - Percentage of variants called by but filtered by GATK and Samtools that exist in dbsnp

581 of the 3266 high-confidence variants absent from 1000G are non-synonymous changes

The 581 non-synonymous changes includes the following genes, which are mutated in the MMGI Study:
ZNF804A	-697T	2	185802211	C	CACA	[0, 0, 78, 75]	rs5836928

The 581 non-synonymous changes includes the following positions, which exist in the Cosmic database:
ZNF804A	-697T	2	185802211	C	CACA	[0, 0, 78, 75]	rs5836928
FGFR3	Y373C	4	1806099	A	G	[8, 4, 14, 5]	rs121913485

The 581 non-synonymous changes includes the following nonsense changes:
RP4-788L13.1	E298*	1	99771640	G	T	[28, 35, 7, 11]	.
TAF1A	Y36*	1	222761798	G	T	[53, 28, 12, 11]	.
MAP4K4	Q488*	2	102482984	C	T	[8, 19, 7, 14]	.
PTPRN	Q22*	2	220172517	G	A	[39, 75, 19, 15]	.
PAQR3	S133*	4	79851430	G	C	[29, 26, 25, 27]	.
CYFIP2	W204*	5	156786100	G	A	[7, 13, 9, 11]	.
ANKS1A	E420*	6	34957049	G	T	[26, 23, 12, 9]	.
PLEKHA2	C182*	8	38810216	T	A	[52, 33, 33, 15]	.
SOX6	R596*	11	16010642	G	A	[32, 26, 15, 11]	.
CD4	W53*	12	6909582	G	A	[12, 35, 7, 16]	.
GCOM1	Y122*	15	57913853	T	G	[39, 26, 17, 16]	.
MEF2A	K25*	15	100185784	A	T	[41, 5, 18, 1]	.
NOXO1	W269*	16	2029435	C	T	[19, 27, 24, 33]	.
SRCAP	Q2*	16	30712149	C	T	[17, 29, 8, 9]	.
ZNF613	E134*	19	52447644	G	T	[81, 31, 18, 9]	.
ZIM2	R335*	19	57286637	G	A	[37, 32, 34, 16]	.
FTHL17	E148*	X	31089629	C	A	[0, 0, 30, 8]	rs7057057
MAGEB16	R272*	X	35821127	C	T	[0, 0, 60, 63]	rs4829392
MAGEE2	E120*	X	75004529	C	A	[0, 0, 27, 28]	rs1343879
UBE2NL	L89*	X	142967468	T	G	[0, 0, 49, 62]	rs237520

The 581 non-synonymous changes includes the following frameshift changes:
PROX1	NA	1	214170645	AG	A	[18, 17, 62, 63]	.
DNAH14	NA	1	225328438	G	GAT	[10, 0, 29, 1]	.
FGFR3	NA	4	1809127	CGT	C	[0, 0, 27, 26]	rs34003391
NR3C2	NA	4	149002016	AT	A	[0, 0, 11, 12]	rs10708334
ZNF608	NA	5	123982405	CT	C	[26, 3, 10, 3]	.
CYFIP2	NA	5	156721863	T	TC	[0, 0, 16, 19]	rs76159126
STEAP2	NA	7	89854652	ACAAAT	A	[20, 27, 33, 45]	.
OR13C2	NA	9	107367392	TGTTA	T	[23, 19, 25, 39]	rs71991336
C9orf173	NA	9	140147822	GATGCACCCTTCTGGCTGGCCAACCCTTCT	G	[11, 15, 11, 18]	.
CREB3L1	NA	11	46342259	A	AG	[0, 0, 54, 2]	rs71038900
DIXDC1	NA	11	111853106	G	GC	[0, 0, 33, 51]	rs67635503
TREH	NA	11	118529044	C	CG	[0, 0, 17, 24]	rs11448549
HSP90B1	NA	12	104326109	TAGAG	T	[12, 19, 20, 16]	.
PKD1L2	NA	16	81242148	GTT	G	[0, 0, 61, 26]	rs62734462
LRRC59	NA	17	48474672	TGGTCA	T	[9, 14, 2, 7]	.
HSH2D	NA	19	16268207	TA	T	[0, 0, 7, 18]	rs5827321
LTBP4	NA	19	41123093	A	AG	[0, 0, 18, 43]	rs67216392
RASIP1	NA	19	49230400	GA	G	[22, 49, 5, 18]	.
CD33	NA	19	51729577	ACCCAACAACTGGTATCTTT	A	[37, 19, 24, 14]	.
ADAMTS1	NA	21	28210040	GA	G	[69, 35, 19, 8]	.
C21orf49	NA	21	34169317	C	CT	[0, 0, 48, 55]	rs5843567
COL6A2	NA	21	47545369	A	AC	[0, 1, 29, 69]	.
CLTCL1	NA	22	19189003	A	AC	[0, 0, 1, 22]	rs78649162
EFHC2	NA	X	44202890	G	GC	[0, 0, 2, 35]	rs5902346
SSX9	NA	X	48163733	TC	T	[0, 0, 57, 22]	rs112256157
NUDT11	NA	X	51239295	ATCCTCGAGGCAGCC	A	[0, 0, 4, 12]	rs78182391
TCEAL6	NA	X	101395780	T	TG	[0, 0, 27, 90]	rs11408120
TEX13A	NA	X	104464276	CA	C	[0, 0, 36, 7]	rs66463437

The 581 non-synonymous changes includes the following start codon losses:
TLR8	M1V	X	12924826	A	G	[0, 0, 17, 48]	rs3764880

The 581 non-synonymous changes includes the following stop codon losses:
MPL	*636W	1	43818443	A	G	[31, 17, 45, 40]	.
PCSK2	*604Q	20	17462713	T	C	[26, 13, 18, 14]	.
Attached Files
File Type: pdf BWA_Exome_Pipeline.pdf (376.8 KB, 106 views)
Jon_Keats is offline   Reply With Quote
Old 01-17-2012, 12:59 PM   #73
golharam
Member
 
Location: Philadelphia, PA

Join Date: Dec 2009
Posts: 55
Default

Hi Jon - The BWA Exome Pipeline looks great. I assume the RNA-Seq pipeline is different just by looking at the pdf you posted. I assume the last RNA-Seq pipeline was from 09-26-2011. I also see that the Circos stuff wasn't included. Is there a reason why?
golharam is offline   Reply With Quote
Old 01-20-2012, 08:13 AM   #74
ahmetz
Member
 
Location: new york

Join Date: Jun 2011
Posts: 23
Default

Hey Jon,

While going over your pipeline, I noticed something. With this line:

INNER_DISTANCE=`head -n 8 ${SAMPLE_NAME}_Picard_InsertMetrics.txt | tail -n 1 | cut -f5 | awk -v var1="$COMBINED_READ_LENGTH" '{printf "%.0f\n", $1-var1}'`

you're calculating the insert size, as you use the INNER_DISTANCE value with the -r option in Tophat. However, in the Picard output the 5th column is already MEAN_INSERT_SIZE, so why are you subtracting COMBINED_READ_LENGTH from the mean insert size again?
ahmetz is offline   Reply With Quote
Old 01-20-2012, 08:15 PM   #75
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

Hi Ahmetz,

Tophat -r is looking for the "mate-inner-distance", which is the distance between the ends of read 1 and read 2, not the distance between adaptor1 and adaptor2. We have talked about having a wiki locally for terminology, but this is what we often use.

[Adaptor1]>>>>read1>>>---------<<<read2<<<<[Adaptor2]

Fragment Size = distance from beginning of adaptor 1 to the end of adaptor 2
Insert Size = size of DNA between adaptor 1 and adaptor 2
Inner Distance (what tophat wants) = the distance between the end of read1 and the end of read2. In the example above, if you consider each - as a nucleotide and the ends of each read defined by > and <, then the inner-distance is 9.

Golharam,

We are currently updating the RNAseq pipeline to work with Tophat 1.4 to take advantage of the new align-to-transcriptome step that I've always pushed to have in an RNAseq aligner. It will also look a bit more similar to our exome pipeline so that it can add read groups.
Ultimately, I'd envision there being an exome pipeline, RNAseq pipeline, mate-pair pipeline, and then a summary pipeline that will merge all three results and create the circos plot. It should be fairly easy to pipe the results of the current pipeline into circos, but I always plot more than a single data type, so I didn't bother to include it. But it's not a bad idea to include it as a launch query so a user could pipe the results directly.
Jon_Keats is offline   Reply With Quote
Old 01-21-2012, 05:00 PM   #76
mitochy
Member
 
Location: one does not simply approximate location

Join Date: Dec 2011
Posts: 10
Default

Why is this post not stickied in the bioinformatics forum? This is really great how-to info for noobs like me. It would have saved a couple days of googling, trial and error.
mitochy is offline   Reply With Quote
Old 01-23-2012, 07:25 AM   #77
ahmetz
Member
 
Location: new york

Join Date: Jun 2011
Posts: 23
Default

Hey Jon,

We are in total agreement on the definition of those terms, except for what you get out of Picard insert metrics. I agree that tophat wants inner distance as the distance between the reads, and I believe that is what you get from Picard as well. I ran a pipeline over the weekend with Illumina's Body Map skeletal muscle paired-end sample. The Picard inner distance was 178bp. When Cufflinks was run on the same sample, it calculated the fragment length distribution automatically (for the -m/--frag-len-mean value) and it was 176. I am inclined to believe -r/--mate-inner-dist for tophat and -m/--frag-len-mean for cufflinks refer to the same distance, and therefore the INNER_DISTANCE you get from Picard insert metrics should be left alone, without subtracting the read lengths. What do you think? I'm not trying to prove you wrong, merely trying to clear up any confusion either of us might have.
ahmetz is offline   Reply With Quote
Old 06-25-2012, 10:39 AM   #78
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

From a BAM file you created using the same reference genome the following command should do it

Code:
#note: assumes the header is an @HD line followed only by @SQ lines; if @RG/@PG
#lines are present, add "grep '^@SQ' |" after the samtools view command
samtools view -H YourFile.bam | cut -f2- | sed 's/SN://g' | sed 's/LN://g' | awk 'NR > 1 {OFS="\t" ; print $1, "1", $2}' > YourChromosomeList.bed
Jon_Keats is offline   Reply With Quote
Old 09-02-2012, 08:31 AM   #79
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

Hello everyone, it's been a very long time since I updated this thread. Over the last year I've had people in my lab maintain a detailed protocol on how to build up a new machine to do most NGS analysis, well at least the analyses we do in my group. So thanks to David and Teja we have a pretty comprehensive instruction set. Sadly, some of it is already a bit out of date, and I'll try to have a test done on the next machine in our lab and update as needed.

FULL INSTRUCTION LIST: How To Transform Your Mac Into A Sequencing Analysis Machine

Introduction

I’m a newly hired RA from Jonathan Keats’s lab who will be helping with a bunch of new sequencing stuff. I have been working on installing the suite of sequencing programs on our new workstation. Before I started, I knew virtually nothing about Terminal, Unix, or manipulating sequencing files. (In my mind, Terminal was where you board trains and Unix was some Talaxian from Star Trek: Voyager.) The learning curve has been steep, obviously, but Jonathan’s previous posts have been invaluable in making the adjustment.

I wanted to update those posts, however, because (a) some of the instructions have changed as newer versions of applications have appeared; (b) posts could be combined into one gigantic “master-post”; (c) some of the instructions are much more advanced/complicated than others; and (d) some helpful instructions for certain applications weren’t included.

To make things easier on the next bright-eyed generation of programming-illiterate biologists, I have included specific code instructions at practically every step of the installation process. After a couple times mentioning a particular command, I will stop including it to save space, so if you’re starting from the middle of the instruction set, refer to previous instructions for more information.

If you find this compilation of instructions frustratingly simplistic, then I suggest you read through the previous posts, if only to read through Jonathan’s wry comments about the entire bioinformatics process. Hopefully, this post will be helpful to extreme sequencing/Unix novices like myself.

Please let me know if you have any questions, good luck, and happy hunting!

David K. Edwards V and Jonathan Keats

Before You Begin: Programs

Unix

Before you begin, you should familiarize yourself with Terminal (Applications>Utilities>Terminal). Or better yet, you should invest some time working through the Unix portion of the "Unix and Perl for Biologists" course (http://groups.google.com/group/unix-...for-biologists), made public by Keith Bradnam and Ian Korf at UC Davis. Or preorder their book on Amazon: http://www.amazon.com/UNIX-Perl-Resc...0189572&sr=8-1. Tell your PI it will be the best $50 investment of their career!

It’s really helpful for beginners to understand non-GUI file manipulation, and it gives you a good list of important Unix commands. (If you’re completely new to programming, it might be too confusing or complicated, but nobody said this was going to be easy.)

Download the entire course package: http://korflab.ucdavis.edu/Unix_and_Perl/index.html.

Here is a general list of helpful Unix commands:
  • To get a manual on any command, type "man command". Type "space" to page down, "b" to back-up, and "q" to quit.
  • To see what folder you are in currently, type "pwd".
  • To see what folders and files exist in the current directory, type "ls".
  • To move into a folder in the current directory, type "cd myfolder". (Note: You can move multiple levels downstream with "cd myfolder/myfolder2".)
  • To go back one directory, type "cd ..". (Note: You can move back multiple levels upstream with "cd ../..".)
  • To copy a file from the current directory to a downstream folder, type "cp myfile myfolder/". (Note: You can copy a file up one directory with "cp myfile ../".)
  • To move a file from the current directory, type "mv" instead of typing “cp”.
  • A folder immediately downstream of the root directory (i.e. absolute top of the tree) is always defined by "command /folder". (This means if you type "cd /something", it looks for the folder "something" downstream of the root directory.)
  • To note the current directory, type ".".
  • To change the permissions of the compiled applications, type "chmod 755 myfile". (This makes the file readable and executable by everyone but only writable by you. To allow everybody to do everything to the file, type “chmod 777 myfile”.)
  • To become a super user for a particular command (and become Superman!), type “sudo”.
  • To decompress a tarball file, type “tar -xvzf file.tar.gz”, where “file.tar.gz” is the compressed file.


Xcode (http://developer.apple.com/technolog...ols/xcode.html)

NOTE: For some reason Apple has decided to mess with you, and recent versions of Xcode (the OS Lion and OS Mountain Lion compatible versions) no longer install some essential command-line tools like "make", which you will use extensively to build the applications. However, there is an extremely simple solution to install these tools from within Xcode.

To install command line tools see (http://slashusr.wordpress.com/2012/0...nd-line-tools/)
  • Launch Xcode
  • Go to Preferences
  • Go to Downloads
  • Click the "Command Line Tools" radio button
  • Follow Prompts

You need to install Xcode on your computer so you can compile the various applications and if you start writing your own scripts it is a nice text editor in our opinion.

The newest version available on the App Store, Xcode 4.3, requires OS X Lion (10.7.x) or later. If you have Leopard (10.5.x) or Snow Leopard (10.6.x), then you can install the package from your OS installation disks. Insert Mac OS X Install Disc 2, open the “Xcode Tools” folder, and double click “XCodeTools.mpkg”. Otherwise, you need to sign up to be a developer and download it from the website.

MacPorts (http://www.macports.org/)

You need to install some packages to run certain applications. There are two programs to install those packages, Fink and MacPorts. There isn’t much difference between the two; in general, Fink is more conservative about upgrading packages than MacPorts, but both are perfectly acceptable. I simply chose MacPorts for this protocol.

R and Bioconductor

R: You will need R to perform statistical computations and generate graphs from your data. To install, visit http://www.r-project.org/, then select preferred CRAN mirror and follow the instructions.

Bioconductor: You will probably need Bioconductor to analyze your high-throughput genomic data. To install Bioconductor, you must have the most recent release version of R. The most common packages you will need to install are affy, simpleaffy, and gcrma.

To install these packages, starting first with affy, simply start R and type in the following:

Code:
source("http://bioconductor.org/biocLite.R")
biocLite("affy")
Press enter. R will automatically install the dependencies ‘Biobase’, ‘affyio’, and ‘preprocessCore’ during this installation.

To install simpleaffy, replace “affy” with “simpleaffy” in the above code and press enter. R will automatically install the dependencies ‘DBI’, ‘RSQLite’, ‘xtable’, ‘IRanges’, ‘AnnotationDbi’, ‘annotate’, ‘Biostrings’, ‘genefilter’, and ‘gcrma’.

There are three other dependencies you should install:

(NOTE: The version of cummeRbund that is installed through the current BioConductor development version is 1.0.0. The latest version, version 1.1.3 will be available as part of the Bioconductor development version 2.10, which will be made available in April 2012. For more information, please visit: http://compbio.mit.edu/cummeRbund/index.html.)

For more installation instructions, visit http://www.bioconductor.org/install/. (For this protocol, the current release version of R is 2.14, and the currently released Bioconductor version is 2.9.)

Before You Begin: Folders

You should establish a series of folders to manage your sequencing data and move around after each step is completed. You don’t necessarily have to follow this system of folders and subfolders, but all of our instructions for installing programs are based on this file hierarchy, so if you want to avoid confusion, and jump on our awesome folder-managing bandwagon, then read carefully!

Here is our system of folders and subfolders:

We have a main working directory called "ngs" in our $HOME directory (Users/YourUserName/). This is our home base for data analysis, and all of our steps and scripts will be called from this folder. Here are our subfolders within “ngs”:
  • ngs/{applications,bwa,run_parameters,scripts,temp,tophat,tophat_fusion}
  • ngs/analyzed_read_files/{chipseq,exomes,genomes,matepair,rnaseq}
  • ngs/finaloutputs/{chipseq,exomes,genomes,matepair,rnaseq}
  • ngs/refgenomes/{bfast_indexed,bowtie_indexed,bwa_indexed,downloads}
  • ngs/refgenomes/downloads/{ncbi36_hg18,grch37_hg19}
  • ngs/refgenomes/downloads/ncbi36_hg18/{annotation_files,reference_sequences}
  • ngs/refgenomes/downloads/grch37_hg19/{annotation_files,reference_sequences}

Each of these folders has subfolders of its own, so instead of listing everything here, please visit the script “create_ngs_directorystructure_v4.sh” (http://seqanswers.com/forums/showthr...?t=4589&page=4, post #61) for more information. (To run the script, simply copy and paste the code included in that post when you immediately start Terminal, or when you are in the home directory. The corresponding files and folders will be created.)
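If you would rather create the top of the tree by hand, each bullet above is literally a brace-expansion argument to mkdir -p. For example, the first two entries become (bash brace expansion, as used in the script):
Code:
# -p creates intermediate folders and skips any that already exist
mkdir -p ngs/{applications,bwa,run_parameters,scripts,temp,tophat,tophat_fusion}
mkdir -p ngs/analyzed_read_files/{chipseq,exomes,genomes,matepair,rnaseq}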

Before You Begin: Picking Genome Files

[Maq is no longer included in this protocol because of recent improvements to BWA. If you need to install Maq, please see Jon’s preceding post on how to install it.]

This step is important and can be the source of most issues. You need to pick a single source for all genome sequence files and annotations. We use ensembl over UCSC for many reasons. For human genome reference files, we recommend the 1000 genomes versions. They think about the human genome much more than you do, so give them some credit. Besides, many of the applications you will use are published by those groups, so running them is streamlined and less complicated.

We will be using BWA to align our sequencing data against the reference genome (see BWA installation instructions under “Installing Programs”). You might think to use ensembl (http://www.ensembl.org/info/data/ftp/index.html) to get your reference genome, but the full human genome file (Homo_sapiens.GRCh37.66.dna_rm.toplevel.fa.gz) exceeds the maximum character length allowed by BWA’s index command.

Instead, you should use the 1000 Genomes reference genome (ftp://ftp.sanger.ac.uk/pub/1000genom...ect_reference/). You need to save the reference genome onto your computer:
  • Copy the file “human_g1k_v37.fasta.gz” to your “ngs/refgenomes” folder.
  • Decompress the file by double clicking on it.
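The decompression can also be done in Terminal instead of by double clicking (same result; gunzip replaces the .gz file with the decompressed fasta):
Code:
cd ngs/refgenomes
gunzip human_g1k_v37.fasta.gz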

Installing Applications

Welcome to the meat-and-potatoes of this somewhat bloated post: program installation. This section has been written in chronological order, meaning that I started with the first program and proceeded onward to the last program. Some of the programs require that you have installed other programs, and unfortunately, unless explicitly mentioned, I don’t know which programs have those requirements.

Therefore, I recommend you follow the same installation order for your own computer. This will certainly make things simpler for newbies like myself, especially since I included the commonly used programs (e.g. BWA) before the less commonly used programs (e.g. Cairo).

As mentioned above, if you’re skipping around, I have written next to each application if it requires one of the preceding applications. However, I can’t be sure that this information is correct, so if you encounter a problem during installation, please let us know and we can amend our instructions.

Final note: The version numbers of programs might be out-of-date, so please change the instructions based on those new version numbers. We will try to update this document periodically to avoid this problem, but you should be forewarned!

Setting Your Path Directory

To run many of the applications, you will need to either place the applications in the PATH, define additional PATH locations, or note the location of the application each time you call it. To see the current PATH directories used by Unix, type "echo $PATH". You should see something similar to the following:

Code:
/sw/bin:/sw/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:/usr/X11R6/bin
These folders are directly below the root directory and represent the places Unix looks for executables when you run a command. Before you can run any of the NGS applications, you must download and compile them. Before you begin installing any application, do the following (thanks to Nils Homer for the suggestion):

Create a directory in your home directory for the applications:
Code:
mkdir -p $HOME/local/bin
Edit your .profile file so this directory is in your PATH. First confirm the file exists by listing hidden files (you should see a file called “.profile”):
Code:
ls -a
Open it with nano by typing:
Code:
nano .profile
Add the following line to your .profile file, but DO NOT remove anything already in it (you don’t need “sudo”):
Code:
export PATH=$HOME/local/bin:$PATH
Save your changes by typing "control O".
Exit nano by typing "control X".

Additionally, when you install applications, place the executable files in this directory so they are in a $PATH directory. You can either copy the application to the directory $HOME/local/bin or install using install script "./configure --prefix=$HOME/local”
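After saving .profile, the change only applies to new Terminal windows. To pick it up in the current session and confirm the new directory is searched first, you can do the following quick check (optional, not required):
Code:
# Re-read the profile in the current shell
source ~/.profile
# The first entry printed should now be your $HOME/local/bin
echo $PATH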

BWA (http://sourceforge.net/projects/bio-bwa/files/; change naming in instructions based on BWA version)
NOTE: Reference indexes created in previous versions do not work in version 6 so you need to reindex each reference if you have worked with previous versions or more importantly if you are setting up and someone is providing you pre-index reference files
  • Click on the link above and download the newest version (called "bwa-0.6.1.tar.bz2").
  • Move the "bwa-0.6.1.tar.bz2" file to your "ngs/applications" folder.
  • Decompress the file by double clicking on it.
  • Open Terminal (if previously open, ensure you are in your home directory).
  • Navigate to the decompressed folder by typing:
Code:
cd ngs/applications/bwa-0.6.1
Compile the application by typing:
Code:
make
Lines of code will start appearing under your command. Make sure that no errors are listed!

You can confirm that the installation was successful by typing:
Code:
./bwa
This should bring up a window with the BWA command options. (The first line is “Program: bwa (alignment via Burrows-Wheeler transformation)”.)

Copy "bwa" to your path directory by typing:
Code:
cp bwa $HOME/local/bin
Now Typing "bwa" into terminal at any point in any folder will launch the bwa program

SAMtools (http://sourceforge.net/projects/samtools/; change naming in instructions based on SAMtools version)
  • Click on the link above and download the newest version (called "samtools-0.1.18.tar.bz2").
  • Move the "samtools-0.1.18.tar.bz2" file to your "ngs/applications" folder.
  • Decompress the file by double clicking on it.
  • Open Terminal (if previously open, ensure you are in your home directory).
  • Navigate to the decompressed folder by typing:

Code:
cd ngs/applications/samtools-0.1.18
Compile the application by typing:
Code:
make
Lines of code will start appearing under your command. Make sure that no errors are listed!

You can confirm that the installation was successful by typing:
Code:
./samtools
This should bring up a window with the SAMtools command options. (The first line is “Program: samtools (Tools for alignments in the SAM format)”.)

Copy "samtools" and other valuable applicationsto your path directory by typing:
Code:
cp samtools $HOME/local/bin
cp bcftools $HOME/local/bin
cp vcfutils.pl $HOME/local/bin
(We are assuming you followed our path directory here. If not, then change “$HOME/local/bin” to your location of choice.)
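While you are here, it is worth creating the fasta index that GATK (discussed earlier in this thread) and other tools expect, now that samtools is on your path. One command, and it writes a .fai file next to the fasta:
Code:
cd ~/ngs/refgenomes
samtools faidx human_g1k_v37.fasta
# Produces human_g1k_v37.fasta.fai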

Note: To save space, we have reduced the number of specific instructions, so instead of writing the exact lines of code required for commands, we will simply summarize them. This applies to decompressing the file, navigating to the decompressed folder, compiling the application, and copying to your path directory.

GATK (ftp://ftp.broadinstitute.org/pub/gsa...latest.tar.bz2)

According to the website (http://www.broadinstitute.org/gsa/wi...ading_the_GATK, “Outside the Broad Institute”), before you install GATK, you need to install three applications: JVM (Java Virtual Machine), Apache Ant, and Git. GATK requires that your version of JVM is 1.6 or greater, and your version of Apache Ant is 1.7.1 or greater.

JVM (Java Virtual Machine)

You should have JVM already installed on your computer. To confirm this, open Terminal and type:
Code:
java -version
Three lines of code should appear, starting with java version “1.6.0_29”. To update Java, search “Java” on the Apple website and find the most recent version that corresponds to your operating system.

Ant (http://ant.apache.org/)

You should already have Apache Ant installed on your computer. To confirm this, open Terminal and type:
Code:
ant -version
You should see something like this: “Apache Ant(TM) version 1.8.2 compiled on October 14 2011”. If that doesn’t work, here’s how to install Ant manually:

Click on the link above and download the latest version. (This version will probably be “apache-ant-1.8.2-bin.tar.bz2”.)
Move the “apache-ant-1.8.2-bin.tar.bz2” file to your “ngs/applications” folder.
Decompress the file.
Follow the somewhat complex instructions in the manual. To access the manual, click on the decompressed folder and look under docs/manual/install.html.

Git (http://git-scm.com/download)

Click on the link above and download the latest version. (This version will probably be “git-1.7.9.1-intel-universal-snow-leopard.dmg”.)
Install like any ordinary Mac application. (You thought it would be more complicated, right? You’re welcome!)

Now, onto installing GATK:

Click on the link above.
Move the “GenomeAnalysisTK-latest.tar.bz2” file to your "ngs/applications" folder.
Decompress the file and navigate to it.

To confirm the installation, type in “java -jar GenomeAnalysisTK.jar --help”. (Do not copy this text! You will need to hand-type it.)

You should see a message like: The Genome Analysis Toolkit (GATK) v1.4-30-gf2ef8d1, Compiled 2012/02/17 20:18:04.


Bowtie (http://sourceforge.net/projects/bowtie-bio/files/bowtie)

Click on the link above and download the latest version. (This version will probably be “bowtie-0.12.7-src.zip”.)
Move the “bowtie-0.12.7-src.zip” file to your “ngs/applications” folder.
Decompress the file and navigate to it.
Compile the application (“make”).
Copy "bowtie", "bowtie-build", and "bowtie-inspect" to your path directory.

To test the installation, navigate to the bowtie folder and type:
Code:
bowtie indexes/e_coli reads/e_coli_1000.fq
You should see a bunch of information stream onto the screen, and at the bottom, you should see:

Code:
# reads processed: 1000
# reads with at least one reported alignment: 699 (69.90%)
# reads that failed to align: 301 (30.10%)
Reported 699 alignments to 1 output stream(s)
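The Tophat runs described earlier in this thread also need a bowtie index of your reference genome. Building one is a single (slow) command; the output prefix at the end is your choice and is what you later pass to tophat. The paths here follow our folder layout:
Code:
cd ~/ngs/refgenomes/bowtie_indexed
# Writes <prefix>.1.ebwt through <prefix>.rev.2.ebwt into this folder
bowtie-build ../human_g1k_v37.fasta human_g1k_v37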

Boost (http://www.boost.org/) [Prerequisites: SAMtools, $PATH configuration.]

WARNING: Do not download the newest version of Boost (i.e., version 1.48.0)! This version will not natively work with this protocol. Instead, install any earlier version of Boost (we recommend version 1.47.0) and follow the instructions below. (For more information, and instructions on how to modify the latest version of Boost, please visit: http://seqanswers.com/forums/showthread.php?t=16637.)

Click on the link above and download the latest version. (MAKE SURE THIS IS VERSION “boost_1_47_0.tar.bz2” OR EARLIER.)
Move the “boost_1_47_0.tar.bz2” file to your “ngs/applications” folder.
Decompress the file and navigate to it.
Build/bootstrap the package by typing:
Code:
./bootstrap.sh
Type in the following command:
Code:
./bjam --prefix=$HOME/local --toolset=darwin architecture=x86 address-model=32_64 link=static runtime-link=static --layout=versioned stage install
This command will take a while, so take your coworkers out for cappuccinos or something while you wait. Once it’s finished, the command will create “include” and “lib” subfolders in $HOME/local. You might get some error messages about targets that failed or were skipped, but ignore them because they won’t affect your other applications.
In the new "include" folder, create a subfolder "bam".
Using Terminal, navigate to the SAMtools folder within ngs/applications.
Copy the "libbam.a" file in the SAMtools folder to $HOME/local/lib by typing:
Code:
cp libbam.a $HOME/local/lib
Copy the header files (files ending in .h) in the SAMtools folder to $HOME/local/include/bam by typing:
Code:
cp *.h $HOME/local/include/bam
Tophat (http://tophat.cbcb.umd.edu/) [Prerequisites: Bowtie, SAMtools.]


Click on the link above and download the latest version. (This version will probably be “tophat-1.4.1.tar.gz”. Click on the option that says “Source Code.”)
Move the “tophat-1.4.1.tar.gz” file to your “ngs/applications” folder.
Decompress the file and navigate to it.
Build the package by typing
Code:
./configure --prefix=$HOME/local --with-bam=$HOME/local

Compile the application (by typing “make”).
Make the executable available in your $PATH directory by typing:
Code:
make install
To test the Tophat installation, please visit the download website (http://tophat.cbcb.umd.edu/tutorial.html; search under “Testing the installation”) and follow these instructions:

Click on the link above and download the file. (This file will probably be “test_data.tar.gz”.)
Decompress the folder and navigate to it.
To process the data, type:
Code:
tophat -r 20 test_ref reads_1.fq reads_2.fq
You should see lines of code after your command, beginning with something like the following:
Code:
[Mon May  4 11:07:23 2009] Beginning TopHat run (v1.1.1)
-----------------------------------------------
Cufflinks (http://cufflinks.cbcb.umd.edu/tutorial.html) [Prerequisites: Boost (SAMtools).]

Click on the link above and download the latest version. (This version will probably be “cufflinks-1.3.0.tar.gz”. Click on the option that says “Source Code.”)
Move the “cufflinks-1.3.0.tar.gz” file to your “ngs/applications” folder.
Decompress the file and navigate to it.
Build the package (with Boost, so different from Tophat instructions!) by typing
Code:
./configure --prefix=$HOME/local --with-boost=$HOME/local --with-bam=$HOME/local

Compile the application (by typing “make”).
Make the executable available in your $PATH directory by typing:
Code:
make install
To test the installation, you will need to download the cufflinks test data (http://cufflinks.cbcb.umd.edu/tutorial.html#ref; look under “Testing the installation”). You can download the test text file anywhere (e.g. within your username folder) and navigate to that folder.

Process the test data by typing:
Code:
cufflinks test_data.sam
You should see the following at the beginning of your output:
Code:
You are using Cufflinks v1.3.0, which is the most recent release.
[bam_header_read] EOF marker is absent. The input is probably truncated.
VarScan (http://varscan.sourceforge.net/) [Prerequisites: SAMtools?]

  • Click on the link above and download the latest version. (This version will probably be “VarScan.v2.2.8.jar”.)
  • Move the “VarScan.v2.2.8.jar” file to your “ngs/applications” folder.
  • Navigate to your “applications” folder.

To test the installation, type:

Code:
java -jar VarScan.v2.2.8.jar
You should see the following at the beginning of your output:

Code:
VarScan v2.2

USAGE: java net.sf.varscan.VarScan [COMMAND] [OPTIONS]
Picard (http://picard.sourceforge.net/)

  • Click on the link above and download the latest version. (This version will probably be “picard-tools-1.62.zip”.)
  • Move the “picard-tools-1.62.zip” file to your “ngs/applications” folder.
  • Decompress the file.
  • Copy all .jar applications to your $PATH directory by typing:

Code:
cp *.jar $HOME/local/bin
While this step isn’t required, it makes things easier and the pipelines we provide use this concept.
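To confirm the jars run from their new location, you can re-use the CreateSequenceDictionary example from earlier in this thread (the fasta name below is a placeholder for your own reference):
Code:
java -jar $HOME/local/bin/CreateSequenceDictionary.jar REFERENCE=human_g1k_v37.fasta OUTPUT=human_g1k_v37.dict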

snpEff (http://snpeff.sourceforge.net/download.html)

To install snpEff, you must install both the program and the corresponding reference genome. These instructions include installing the most recent human genome from Ensembl (which is provided on their website). If you use a different genome, make sure that your genome version matches your snpEff version. (In other words, in this example, the genome version is for “v2_0_5” and the snpEff version is for “v2_0_5d”.)

[ol]
[li]Click on the link above and download the latest version of snpEff. (This version will probably be “snpEff_v2_0_5d_core.zip”.)[/li]
[li]Move the “snpEff_v2_0_5d_core.zip” file to your “ngs/applications” folder.[/li]
[li]Decompress the file.[/li]
[li]In the link above, download the latest version of the reference genome (This version will probably be “snpEff_v2_0_5_GRCh37.65.zip”.)[/li]
[li]Move the “snpEff_v2_0_5_GRCh37.65.zip” file to your “ngs/applications” folder.[/li]
[li]Decompress the file.[/li]
[/ol]
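To spot-check the snpEff install, run the jar with no arguments from inside its folder; it should print a usage message. (A sketch; it assumes the archive decompressed into a folder named “snpEff_2_0_5d”.)

Code:
cd ngs/applications/snpEff_2_0_5d
java -jar snpEff.jar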


At this point, you’re probably feeling comfortable with these instructions, maybe even patting yourself on the back for understanding them. Well, prepare for more confusion, because we’re entering the wonderful seafaring world of ports!

For the following applications, you will need to install additional ports on your computer. There are two websites you can use to install them: MacPorts (http://www.macports.org/) and fink (http://www.finkproject.org/). The difference between them is that, in general, fink is more conservative about upgrading packages than MacPorts, so while the MacPorts version will be newer, the fink version might be more stable. We selected MacPorts for installing our packages, so our instructions will be tailored toward that program.

MacPorts (http://www.macports.org/install.php) [Prerequisites: XCode.]

To install MacPorts, please visit that website. Choose your operating system under the “Mac OS X Package (.pkg) Installer” section. Install like any ordinary software application.

To test the installation, completely quit Terminal and reopen it so the MacPorts changes to your environment take effect. To start the program, type “sudo port”. You should see:

Code:
MacPorts 2.0.3
Entering interactive mode... ("help" for help, "quit" to quit)
To install any port, type:

Code:
install program
where “program” is the name of the port you’re installing. This is the method for installing any of the ports used by the subsequent applications. As the program indicates, to exit MacPorts, type “quit” and press enter.
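For example, the “pkgconfig” port that FastX needs below can be installed either from the interactive prompt or in one shot from the shell:

Code:
sudo port install pkgconfig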

FastX (http://hannonlab.cshl.edu/fastx_toolkit/download.html)

MacPorts: Install “pkgconfig”. (The port is called “pkgconfig”; you can find it by searching the MacPorts ports list.)

[ol]
[li]Click on the link above and download libgtextutils. (This version will probably be “libgtextutils-0.6.tar.bz2”.)[/li]
[li]Move the “libgtextutils-0.6.tar.bz2” file to your “ngs/applications” folder.[/li]
[li]Decompress the file and navigate to it.[/li]
[li]To install the program, as with the previous programs, type in “./configure” and press enter.[/li]
[li]To compile the application completely, type in “make” and press enter, then type in “sudo make install” and press enter.[/li]
[li]Make sure pkg-config can find “gtextutils” by typing (see the note after this list about making this setting persistent):

Code:
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH
[/li]
[li]Once that command is processed, type:

Code:
pkg-config --cflags gtextutils-0.1
You should see the following response:

Code:
-I/usr/local/include/gtextutils-0.1/
(If you have any questions about this step, or have any troubleshooting concerns about installing this application, please visit: http://hannonlab.cshl.edu/fastx_tool...nfig_email.txt.)[/li]

[li]Click on the link above and download the latest version of FastX. (This version will probably be “fastx_toolkit-0.0.13.tar.bz2”.)[/li]
[li]Move the “fastx_toolkit-0.0.13.tar.bz2” file to your “ngs/applications” folder.[/li]
[li]Decompress the file and navigate to it.[/li]
[li]To install the program, as before, type in “./configure”, then “make”, then “make install”.[/li]
[/ol]
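One caveat about the PKG_CONFIG_PATH export in the list above: it only lasts for the current Terminal session. To make it persistent, you can append the same line to your shell startup file (a sketch; it assumes you use bash with a ~/.bash_profile):

Code:
echo 'export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH' >> ~/.bash_profile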
Circos (http://mkweb.bcgsc.ca/circos/software/download/)
Before installing Circos, you will need to add a number of packages to your Perl distribution. To open the CPAN shell that installs them, type the following in Terminal:

Code:
sudo perl -MCPAN -e shell
When it asks whether you would like the program to configure things automatically and choose the best CPAN mirror sites, type “yes”.

To install any package, type:

Code:
install program
where “program” is the name of the package you’re installing. Before installing these packages, however, you will need to install GD. (I know, it’s like Inception, with a program installation within a program installation within a program….)

GD (http://code.google.com/p/google-desk...tar.gz&can=2&q)

[ol]
[li]Click on the link above and download the latest version. (This version will probably be “gd-2.0.35.tar.gz”.)[/li]
[li]Move the “gd-2.0.35.tar.gz” file to your “ngs/applications” folder.[/li]
[li]Decompress the file and navigate to it.[/li]
[li]To install the program, like installing previous programs, type in “./configure”, then “make”, then “sudo make install”.[/li]
[/ol]

Here is a list of the packages you will need to install (please install them in the following order because some of the packages require other packages):

YAML
Config::General (v2.50 or later)
GD::Polyline (requires YAML)
List::MoreUtils
Math::Bezier
Math::Round
Math::VecStat
Params::Validate
Readonly
Regexp::Common
Set::IntSpan (v1.16 or later)
Clone
Text::Format
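If you would rather not type “install” thirteen times at the cpan prompt, the same list can be driven by a shell loop; each iteration is equivalent to one interactive install command (a sketch):

Code:
# installs each Perl module in order via the CPAN one-liner
for m in YAML Config::General GD::Polyline List::MoreUtils Math::Bezier \
    Math::Round Math::VecStat Params::Validate Readonly Regexp::Common \
    Set::IntSpan Clone Text::Format; do
    sudo perl -MCPAN -e "install '$m'"
done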

Also, if you see a message like this:

Code:
New CPAN.pm version (v1.9800) available.
  [Currently running version is v1.9456]
then type “install CPAN”, then “reload CPAN”, to update to the latest CPAN version. (This process takes a couple of minutes.)

All right, here are the instructions for installing Circos:

[ol]
[li]Click on the link above and download the bug-fix version. (This version will be something like “circos-0.56-1.tgz”.)[/li]
[li]Move the “circos-0.56-1.tgz” file to your “ngs/applications” folder.[/li]
[li]Decompress the file.[/li]
[li]Click on the link above and download the latest version. (This version will probably be “circos-0.56.tgz”.)[/li]
[li]Move the “circos-0.56.tgz” file to your “ngs/applications” folder.[/li]
[li]Decompress the file.[/li]
[li]Drag the contents of the decompressed bug-fix folder into the folder of the latest Circos version. When prompted, choose “replace”.[/li]
[/ol]

To test the Circos installation, please visit this website (http://circos.ca/software/download/tutorials/) and follow these instructions:

[ol]
[li]Click on the link above and download the tutorial file. (This version will be something like “circos-tutorials-0.56.tgz”.)[/li]
[li]Move the “circos-tutorials-0.56.tgz” file to your “ngs/applications” folder.[/li]
[li]Decompress the file.[/li]
[li]Drag the decompressed tutorial folder into the folder of the latest Circos version. When prompted, choose “replace”.[/li]
[li]Navigate to the “circos-0.56” folder.[/li]
[li]Access the tutorial by typing:

Code:
cd tutorials/2/2
[/li]
[li]Test the tutorial by typing:

Code:
../../../bin/circos -conf ./circos.conf
[/li]
[/ol]

You should see a series of commands flash onto the screen, eventually ending with:

Code:
debuggroup summary,output 4.85s created PNG image ./circos.png (839 kb)
debuggroup summary,output 4.86s created SVG image ./circos.svg (356 kb)
If you navigate to that folder manually (“circos-0.56/tutorials/2/2”) and click on the “circos.png” file, you should see a circular graph of each human chromosome in different colors.

Finally, copy the binary and library files into place so you can just type "circos" instead of "bin/circos" each time you run the program. If you follow our folder hierarchy, type the following commands in sequential order:

Code:
cd ngs/applications/circos-0.56/bin
cp circos $HOME/local/bin
cd ../lib
cp circos.pm $HOME/local/lib
Also, to create a couple of directories within the circos folder for your personal use, type the following commands in sequential order:

Code:
cd $HOME/ngs/applications/circos-0.56
mkdir my_plots
mkdir my_reference_files
mkdir my_config_files
mkdir my_data_files
Once you’ve created those directories, you need to populate your reference files. (For more information, please visit: http://circos.ca/tutorials/.) When you visit that website, you can download the hg19 karyotype, decompress the corresponding file, and drag it into your newly created “my_reference_files” folder.
BEDTools (http://code.google.com/p/bedtools/)

[ol]
[li]Click on the link above and download the latest version. (This version will probably be “BEDTools.v2.15.0.tar.gz”.)[/li]
[li]Move the “BEDTools.v2.15.0.tar.gz” file to your “ngs/applications” folder.[/li]
[li]Decompress the file and navigate to it. (NOTE: The resulting folder will be named something like “BEDTools-Version-2.15.0”.)[/li]
[li]To install the program, type in “make clean”, then “make all”. You should see a series of commands being processed.[/li]
[li]To list the available binaries and confirm that they installed, type “ls bin”. You should see columns of files beginning with “annotateBed” in the upper lefthand corner and ending with “windowMaker” in the lower righthand corner.[/li]
[li]Copy the binaries to your PATH directory by typing:
Code:
cp bin/* $HOME/local/bin
[/li]
[/ol]
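As a quick check that the copied binaries are reachable, run one of them with no arguments; each BEDTools program should print a usage summary when launched bare (intersectBed is just one example):

Code:
which intersectBed
intersectBed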
Pairoscope (http://pairoscope.sourceforge.net/) [Prerequisite: SAMTools]
Truthfully, installing this program is difficult, so brace yourselves, folks. Or as Samuel L. Jackson says in Jurassic Park, “Hold onto your butts.”

Before installing pairoscope, you need to install Cairo. To install Cairo, type:

Code:
sudo port install cairo
You should get the following response:

Code:
--->  Computing dependencies for cairo
--->  Cleaning cairo
Also, before installing pairoscope, you need to install CMake (http://www.cmake.org/cmake/help/install.html). To install the program, click on the link above and download the latest version. (This version will probably be “cmake-2.8.7-Darwin64-universal.dmg”.) Simply install like you would a normal application. (Oh, and when the bouncing colorful triangle appears on your Dock, click to “install command line links”.)
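To confirm the command line links were installed, you can ask CMake for its version from Terminal:

Code:
cmake --version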

Finally, here are the instructions to install pairoscope:

[ol]
[li]Click on the link above and download the latest version. (This version will probably be “pairoscope-0.2.tgz”.)[/li]
[li]Move the “pairoscope-0.2.tgz” file to your “ngs/applications” folder.[/li]
[li]Decompress the file and navigate to the applications folder.[/li]
[/ol]

To install pairoscope, type:

Code:
ccmake pairoscope-0.2
The screen will transform and you will see a series of capitalized variable names on the left and their corresponding values on the right. To toggle advanced mode, type “t”.

Scroll all the way down with the arrow keys until you reach “Page 2 of 2”. (NOTE: The following instructions are based on our folder architecture and assume that you followed our instructions for installing SAMTools. If your folder architecture is different, please point ccmake to your corresponding SAMTools directories.)

To edit the Samtools include and library locations, follow these instructions:

[ul]
[li]Under “Samtools_INCLUDE_DIR”, type “/-----/local/include/bam”.[/li]
[li]Under “Samtools_LIBRARY”, type “/-----/local/lib/libbam.a”.[/li]
[/ul]

where “-----” is the exact folder hierarchy of your computer. (To find it, type “cd” on the command line to return to your home folder, then type “pwd”. Paste the resulting path into the “-----” section described above.)

To configure, type “c”. You should see a warning appear that starts with:

Code:
CMake Warning (dev) in CMakeLists.txt:
You can ignore this warning, so type “e”. To generate and exit, type “g”.

Now pairoscope is ready to build. Navigate to the “applications” folder and type:

Code:
cmake pairoscope-0.2
You should see a series of commands ending with:

Code:
 -- Build files have been written to: /-----/ngs/applications
where the “-----“ is the same prefix described above.

A new folder called “CMakeFiles” has been created in the “applications” folder. To build, stay in the “applications” folder and type “make”. You will see a series of purple and green compiler messages beginning with:

Code:
Scanning dependencies of target pairoscope
Copy the newly-created pairoscope program to your $PATH by typing:

Code:
cp pairoscope $HOME/local/bin
To test the installation, type “pairoscope”. You should see a series of commands beginning with the following:

Code:
Usage:   pairoscope [options] <align.bam> <chr> <start> <end> <align2.bam> <chr2> <start2> <end2>
FastQC (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/)
[ol]
[li]Click on the link above and download the latest version. (This version will probably be “Source Code for FastQC v0.10.0 (zip file)”. Please download the Source Code version.)[/li]
[li]Move the “fastqc_v0.10.0_source.zip” file to your “ngs/applications” folder.[/li]
[li]Decompress the file and navigate to it. (NOTE: The resulting folder will be named “FastQC”.)[/li]
[/ol]

And that’s it! (Seriously! According to the installation files: “Once unzipped it's ready to go.”)
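One small caveat: depending on how the zip was extracted, the launcher script may not be marked executable. If FastQC refuses to start, the following should fix it (a sketch; run from inside the “FastQC” folder, and the --help flag is assumed to be supported by this version):

Code:
# make the launcher executable, then confirm it runs
chmod +x fastqc
./fastqc --help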
HTSeq (http://pypi.python.org/pypi/HTSeq)
[ol]
[li]Click on the link above and download the latest version. (This version will probably be “HTSeq-0.5.3p3.tar.gz”.)[/li]
[li]Move the “HTSeq-0.5.3p3.tar.gz” file to your “ngs/applications” folder.[/li]
[li]Decompress the file and navigate to it.[/li]
[li]Install the program by typing:
Code:
sudo python setup.py install
[/li]
[/ol]

You should see a series of commands being processed and ending with:
Code:
Finished processing dependencies for HTSeq==0.5.3p3
(For more information about program installation, please visit: http://www-huber.embl.de/users/ander.../overview.html.)
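Mirroring the chimerascan test below, you can also confirm that Python can import the module and report the expected version (this uses Python 2 print syntax, matching the Python 2.6 interpreter shown later):

Code:
python -c "import HTSeq; print HTSeq.__version__"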
chimerascan (http://code.google.com/p/chimerascan/)
[ol]
[li]Click on the link above and download the latest version. (This version will probably be “chimerascan-0.4.5a.tar.gz”.)[/li]
[li]Move the “chimerascan-0.4.5a.tar.gz” file to your “ngs/applications” folder.[/li]
[li]Decompress the file and navigate to it.[/li]
[li]Build the program by typing:
Code:
python setup.py build
[/li]
[li]Install the program by typing:
Code:
sudo python setup.py install
[/li]
[/ol]

To test the installation, you need to start the Python interpreter. To do that, leave the directory (you can type “cd ../” to move into the “applications” folder) and type:

Code:
python
You should see something like:

Code:
Python 2.6.1
Type "help", "copyright", "credits" or "license" for more information.
To test that the chimerascan libraries are in your PYTHONPATH, type “import chimerascan”, then “chimerascan.__version__”. (Just in case that last command is obscured, you should type in “chimerascan” followed by a period, followed by two underscores, then “version”, then two underscores.) You should see the following:

Code:
'0.4.5'
Success! To exit python, type:

Code:
exit()
Congratulations! You now have a working computer that can handle just about any sequencing data you throw at it!
If you have any problems during the installation process, I recommend that you search online for the error message you received. That’s how I managed to resolve many of the difficulties I encountered during this whole process.
Additionally, when you have problems, you should read the README files (you can do so by typing “less README” from the program’s directory), because they often contain helpful information about what’s going wrong with that program.
Finally, please remember that this document is a work in progress. Right now, we have created a system that can manage the installation of the current application versions, but these versions often change, and with those changes come new program requirements or permissions. If you encounter any problems with future versions, please respond to this thread (preferably with a solution!) and we will make the corresponding updates.
(This document was made with help from Venkata Yellapantula.)
Last updated: March 7, 2012

Last edited by Jon_Keats; 09-02-2012 at 12:01 PM.
Old 09-13-2012, 07:05 AM   #80
IBseq
Member
 
Location: uk

Join Date: Jul 2012
Posts: 56
Default

Hi Jon,
Have you ever used cummeRbund? I am new to R, Linux, and everything that is not Windows... I like your posts and was wondering if you could share some useful commands...

best,
irene