Unconfigured Ad

**Jon_Keats** · 09-25-2011, 09:34 PM

Getting Better with Variables

Now that I have the lab (www.keatslab.org) moving forward to some extent I'm back to working on my analysis methods. For a couple of reasons I've been searching for ways to improve the previous methods so that even less user input is required and thus saving me time running things myself or worrying about other users making mistakes. So here is what seems like a good version that relies on very limited user input. It takes advantage of a common parameters file, and a series of basic user queries as the script launches to setup the tophat and cufflinks run. It also uses bwa to find the insert size and standard deviation of the library.

Requirements:
1) Create_Directory_Structure_V4 directory structure, but the pipeline is so flexible this is not really a requirement just simplifies effort.
2) bwa transcriptome index file for genome
3) bowtie index file for genome
4) Annotation files for Picard CollectRnaSeqMetrics.jar to function properly

Options:
1) It accepts compressed gzip or bzip2 files or standard read files with any extension
2) It creates a summary file at the end of each run and you have the option to output a list of genes and associated FPKM values to this summary file. So if you are looking at an siRNA experiment you could drop the gene of interest values for your samples or a list of random genes you would like a quick look at.

General Outline:
1) Query User to define setup, which sets the naming convention for each sample, the tophat run format (two options), the cufflinks run format (six options), the gene list added to the summary file
2) Counts the number of reads in each input file (used to calculate % aligned)
3) Cuts out the first 1 million reads for bwa alignment against a transcriptome reference to define the insert size and standard deviation
4) Runs tophat
5) Runs cufflinks
6) Runs a series of picard metrics programs
7) Summarizes metrics and gene values

create_ngs_directorystructure_v4.sh

Code:

#!/bin/sh

# Create_NGS_DirectoryStructure_V4.sh
# 
#
# Created by Jonathan Keats on 09/25/11
# Translational Genomics Research Institute

#########################################################################
#  CREATES A DIRECTORY STURCTURE TO SUPPORT A VARIETY OF NGS PIPELINES  #
#########################################################################

# Designed for a Mac OS enviroment and requires initiation from your home folder (/User/You/)

# Check to confirm current location is $HOME/ (ie. /User/You/)

echo "Confirming Script Initiation Directory"
var1=$HOME
if [ "`pwd`" != "$var1" ] 
	then 
	echo " The script must be launched from your home directory "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi
echo "1) Launch Location is Correct ($HOME/)"

# Create required directories to support pipelines (RNAseq_MensFormal_)

echo ***Creating Pipeline Directory Structure***
mkdir -p ngs/{applications,bwa,run_parameters,run_parameters,scripts,temp,tophat,tophat_fusion}
mkdir -p ngs/analyzed_read_files/{chipseq,exomes,genomes,matepair,rnaseq}
mkdir -p ngs/finaloutputs/{chipseq,exomes,genomes,matepair,rnaseq}
mkdir -p ngs/refgenomes/{bfast_indexed,bowtie_indexed,bwa_indexed,downloads}
mkdir -p ngs/refgenomes/downloads/{ncbi36_hg18,grch37_hg19}
mkdir -p ngs/refgenomes/downloads/ncbi36_hg18/{annotation_files,reference_sequences}
mkdir -p ngs/refgenomes/downloads/grch37_hg19/{annotation_files,reference_sequences}
mv create_ngs_directorystructure_v4.sh ngs/scripts/
echo ***Pipeline Directory Structure Created***

RNAseq_MensFormal_Mac.sh

Code:

#!/bin/sh

# RNAseq_MensFormal_Mac.sh
# 
#
# Created by Jonathan Keats on 9/21/11.
# Copyright 2011 Translational Genomics Research Institute. All rights reserved.

# Outline - Automated pipeline requiring limited input for multiple RNAseq samples using tophat and cufflinks

# Usage = RNAseq_MensFormal_Mac.sh <rnaseq_common_parameters.ini>

######################################################################################################
##																									##
##	Dependencies - Unix (mv, cd, find, basename, mkdir, echo, wc, head, tail, cut, awk, sed)		##
##				 - BWA (version with -I option)	(Tested with v0.5.9)								##
##				 - SAMTOOLS (Tested with v0.1.12a)													##
##				 - PICARD Tools (Tested with v1.52)													##
##					- CollectInsertSizeMetrics.jar (WATCH VERSION AS COLUMN OUTPUT DIFFERS)			##
##					- CollectRNAseqMetrics.jar														##
##					- MarkDuplicates.jar															##
##					- CollectAlignmentSummaryMetrics.jar											##
##					- MeanQualityByCycle.jar														##
##				 - TOPHAT (Tested with v1.3.2)														##
##				 - CUFFLINKS (Tested with v1.1.0)													##
##																									##
##	Outline - Place uniformly named read file pairs in TOPHAT folder								##
##					- This method is dependent on XXXX folder structure								##
##			- Update "tophat_run_parameters.ini" as appropriate to your samples						##
##			- Launch script which queries the user to define methods used and renaming convention   ##
##			  then does the following steps automatically:											##
##				- Create custom directory for each sample											##
##				- Create custom .ini file for each sample											##
##				- Cut out first 1 million reads from each read file									##
##				- Align to transcriptome reference using BWA										##
##				- Calculate inner-mate-distance and standard deviation								##
##					** inner-mate-distance = (insert size) - (combined read length)					##
##				- Run TOPHAT																		##
##				- Run CUFFLINKS																		##
##				- Calculate metrics 																##
##																									##
##   Notes	- Extensive use of common variables is used to facilitate automation					##
##			- There is some naming convention requirements but they are flexible					##
##					- Paired samples must follow the same convention:								##
##							--> SampleA_1.fastq  and  SampleA_2.fastq								##
##												OR													##
##							--> SampleA_Time1_1.txt and SampleA_Time1_2.txt							##
##					- You can mix different basenames but not common extensions						##
##							--> SampleB_1.fa / SampleB_2.fa											##
##							--> SampleC_Time1_1.fa / SampleC_Time1_2.fa								##
##		** This is fine: Common Extension = .fa ; Read Extension = _1.fa and _2.fa **				##
##							--> SampleB_1.fastq / SampleB_2.fastq									##
##							--> SampleC_1.fa / SampleC_2.fa											##
##							--> SampleD_R1.fa / SampleD_R2.fa										##
##      ** These can not be mixed, no "Common Extension" or common "Read Extension" **				##
##																									##
######################################################################################################

# Read in the tophat_common_parameters file
. $1

#Starting Directory=ngs/tophat/
echo
echo "*** STARTING SCRIPT ***"
date '+%m/%d/%y %H:%M:%S'
echo

##############################################################
##  Check that the script was launched from ngs/tophat dir  ##
##############################################################

echo Checking Launch Location
date '+%m/%d/%y %H:%M:%S'
echo

# Checking if launch location is correct

if [ "`pwd`" != "${LAUNCH_DIR}" ] 
	then 
	echo " The script must be launched from the ${LAUNCH_DIR} directory "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi
echo "1) Launch Location is Correct ($HOME/ngs/tophat)"

# Checking if launch directory contains any .ini files (No such files are expected)

if [ `ls *.ini | wc -l` != 0 ]       
	then 
	echo " The tophat directory contains unexpected .ini files - These must be removed before the script starts "
	echo " ERROR - The script was automatically killed due to a launch error - See Above Error Message"
	exit 2                     
fi
echo "2) No .ini Files Found in Launch Directory as Expected"
echo

echo Directory Check Completed
date '+%m/%d/%y %H:%M:%S'
echo

##############################################################
##  Query User for tophat and cufflinks configurations      ##
##############################################################

echo Starting User Configuration Query Step
date '+%m/%d/%y %H:%M:%S'
echo

# Ask user if the input read files are compressed or not
echo "Are the input read files compressed? (ie. gzip or bzip2) [Yes/No]"
read COMPRESSION_QUERY

# Test to see if the "COMPRESSION QUERY" Query Response was appropriate
## NOTE - In the IF statement use && as AND and || as OR in new shells (old OR is -o)
if [[ ${COMPRESSION_QUERY} = "Yes" || ${COMPRESSION_QUERY} = "No" ]]
	then
	echo 
	else
	echo
	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi

# Ask user for the compression type if they indicated that the read files are compressed
if [ ${COMPRESSION_QUERY} = "Yes" ]
	then
	echo Please indicate the type of compression. [gzip/bzip2] 
	read COMPRESSION_TYPE
	if [[ ${COMPRESSION_TYPE} = "gzip" || ${COMPRESSION_QUERY} = "bzip2" ]]
		then
		echo 
		else
		echo
		echo " Incorrect Response Option, Please use gzip or bzip2 (Case-Sensitive) "
		echo " The script was automatically killed due to a launch error - See Above Error Message" 
		exit 2                              
	fi
fi
######################

echo Would you like to run tophat with a gtf reference file? [Yes/No]
read TOPHAT_QUERY_GTF

# Test to see if the Query Response to the use of a GTF file during tophat alignment was correct or not
## NOTE - In the IF statement use && as AND and || as OR in new shells (old OR is -o)
if [[ ${TOPHAT_QUERY_GTF} = "Yes" || ${TOPHAT_QUERY_GTF} = "No" ]]
	then
	echo 
	else
	echo
	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi

# Ask user for a short GTF Identifier for renaming purposes
if [ ${TOPHAT_QUERY_GTF} = "Yes" ]
	then
	echo "Please provide a short descriptor for the GTF file used during tophat alignment"
	echo "Such as ensembl v60 (E60), this is not standardized but for file sorting purposes you should be consistent between runs"
	echo Please Enter Response Now:
	read TOPHAT_GTF_SHORT
	echo                              
fi

echo   Notes regarding the following questions pertain to how cufflinks estimates abundances
echo -----------------------------------------------------------------------------------------
echo
echo A GTF file can be used in two mutually exclusive fashions -G/--GTF or -g/--GTF-guide
echo
echo "1) To limit abundance estimates to just the isoforms in the GTF file use -G/--GTF"
echo "*** In this script this is called LIMIT ABUNDANCE ESTIMATION to GTF reference ***"
echo NOTE: This limits the estimates to a consistent set of known isoforms
echo 
echo "2) To guide BUT NOT limit the abundance estimates to the GTF file use -g/--GTF-guide"
echo "*** In this script this is called GUIDE ABUNDANCE ESTIMATION to GTF reference ***"
echo NOTE: This allows identification of known and novel isoforms
echo -----------------------------------------------------------------------------------------
echo

echo Would you like to LIMIT ABUNDANCE ESTIMATION with cufflinks to a gtf reference file? [Yes/No]
read CUFFLINKS_QUERY_LIMIT_GTF

# Test to see if the Query Response to the use of a LIMIT ABUNDANCE ESTIMATION GTF file during cufflinks isoform assembly was correct or not
if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" || ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" ]]
	then
	echo 
	else
	echo
	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi

echo Would you like to GUIDE ABUNDANCE ESTIMATION with cufflinks to a gtf reference file? [Yes/No]
read CUFFLINKS_QUERY_GUIDE_GTF

# Test to see if the Query Response to the use of a GUIDE ABUNDANCE ESTIMATION GTF file during cufflinks isoform assembly was correct or not
if [[ ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" || ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" ]]
	then
	echo 
	else
	echo
	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi

# Test to ensure the user did not say "Yes" to the use both the LIMIT and GUIDE GTF options
if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" ]]
	then
	echo " Incorrect Response Combination, You Can Not Say Yes to both the LIMIT and GUIDE ABUNDANCE ESTIMATION Options!! "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi

# Ask user for a short GTF Identifier for renaming purposes
if [ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" ]
	then
	echo "Please provide a short descriptor for the GTF file used to LIMIT ABUNDANCE ESTIMATION"
	echo "Such as ensembl v60 (E60), this is not standardized but for file sorting purposes you should be consistent between runs"
	echo Please Enter Response Now:
	read LIMIT_GTF_SHORT
	echo                              
fi

if [ ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" ]
	then
	echo "Please provide a short descriptor for the GTF file used to GUIDE ABUNDANCE ESTIMATION"
	echo "Such as ensembl v60 (E60), this is not standardized but for file sorting purposes you should be consistent between runs"
	echo Please Enter Response Now:
	read GUIDE_GTF_SHORT
	echo                              
fi

echo Would you like to run cufflinks with an abundance estimate mask file? [Yes/No]
read CUFFLINKS_QUERY_MASK

# Test to see if the Query Response to the use of a mask file during cufflinks isoform assembly was correct or not
if [[ ${CUFFLINKS_QUERY_MASK} = "Yes" || ${CUFFLINKS_QUERY_MASK} = "No" ]]
	then
	echo 
	else
	echo
	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi

# Ask user for a short mask identifier for renaming purposes
if [ ${CUFFLINKS_QUERY_MASK} = "Yes" ]
	then
	echo "Please provide a short descriptor for the mask file used to prevent certain transscripts from being used during abundance estimation"
	echo "Such as MT for mitochorndia, IG for immunoglobulin, this is not standardized but for file sorting purposes you should be consistent between runs"
	echo Please Enter Response Now:
	read MASK_SHORT
	echo                              
fi

# Ask the user if they want to extract a list of Genes and associated FPKM vales from the cufflinks output into the run report
echo Would you like to extract a gene list, defined by GENES_OF_INTEREST in .ini file into the run report? [Yes/No]
read GENESOFINTEREST_QUERY

# Test to see if the Query Response to the use of a mask file during cufflinks isoform assembly was correct or not
if [[ ${GENESOFINTEREST_QUERY} = "Yes" || ${GENESOFINTEREST_QUERY} = "No" ]]
	then
	echo 
	else
	echo
	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi

#Final User Query to ensure the variables in the Tophat_Run_Parameters.ini file match the users responses

echo FINAL CHECK: Are you sure the Tophat_Run_Parameters.ini file is updated to match the user responses you provided? [Yes/No]
read FINAL_CHECK

# Test to see if the Query Response was correct or not
if [[ ${FINAL_CHECK} = "Yes" || ${FINAL_CHECK} = "No" ]]
	then
	echo 
	else
	echo
	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
	echo " The script was automatically killed due to a launch error - See Above Error Message" 
	exit 2                              
fi

# Test to see if the user wants to stop the script (ie. they entered "No" as a response to the FINAL CHECK

if [ ${FINAL_CHECK} = "Yes" ]
	then
	echo "Thanks... Your User Configured Run Will Start in a Moment"
	echo
	else
	echo
	echo You Enter No to the query... The Script Will Stop Now
	exit 2                              
fi

echo Finished User Configuration Query Step
date '+%m/%d/%y %H:%M:%S'
echo

##############################################################
##  Define the renaming tags based on the users responses	##
##############################################################

echo Defining Renaming Convention Based on User Configuration Inputs
date '+%m/%d/%y %H:%M:%S'
echo

# TOPHAT_TAG options are TophatPE_DeNovo or TophatPE_GTF
if [ ${TOPHAT_QUERY_GTF} = "Yes" ]
	then
	TOPHAT_TAG="TophatPE_${TOPHAT_GTF_SHORT}GTF" 
	else
	TOPHAT_TAG="TophatPE_DeNovo"                              
fi
echo The tophat tag has been set to ${TOPHAT_TAG}

# CUFFLINKS_TAG options are Cufflinks_DeNovo_NoMask, Cufflinks_DeNovo_Masked, Cufflinks_Limited_NoMask, Cufflinks_Limited_Masked, Cufflinks_Guided_NoMask, Cufflinks_Guided_Masked

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
	then
	CUFFLINKS_TAG="Cufflinks_DeNovo_NoMask"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
	then
	CUFFLINKS_TAG="Cufflinks_DeNovo_${MASK_SHORT}Mask"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
	then
	CUFFLINKS_TAG="Cufflinks_${LIMIT_GTF_SHORT}Limited_NoMask"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
	then
	CUFFLINKS_TAG="Cufflinks_${LIMIT_GTF_SHORT}Limited_${MASK_SHORT}Mask"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
	then
	CUFFLINKS_TAG="Cufflinks_${GUIDE_GTF_SHORT}Guided_NoMask"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
	then
	CUFFLINKS_TAG="Cufflinks_${GUIDE_GTF_SHORT}Guided_${MASK_SHORT}Mask"                             
fi
echo The cufflinks tag has been set to ${CUFFLINKS_TAG}
echo

##############################################################
##  Create a custom analysis directory for each sample      ##
##############################################################

## Lots of tricks in this step, find function, basename, learning how \, ", and ' quotes differ allowing you to write special characters and use variables

echo Creating Custom Directories and Splitting Files
date '+%m/%d/%y %H:%M:%S'

for file in `find . -type f -iname "*${READ1_ID}"`
do
filename=`basename $file "${READ1_ID}"`
mkdir $filename
mv $filename${READ1_ID} $filename
mv $filename${READ2_ID} $filename
echo "SAMPLE_NAME=\"$filename\"" >> $filename.ini
done

echo Directory Constuction and File Splitting Complete
date '+%m/%d/%y %H:%M:%S'
echo

##############################################################
##  Analysis Pipeline - BWA, TOPHAT, CUFFLINKS, QC metrics  ##
##############################################################

# Initiate analysis of each identified sample one at a time

echo Starting Individual Sample Analysis
date '+%m/%d/%y %H:%M:%S'
echo

for ini in `ls *.ini`
do
#Read in the .ini file created during the directory construction period
. $ini
mv $ini ${SAMPLE_NAME}
cd ${SAMPLE_NAME}

#Current Directory=ngs/tophat/SampleX/tophat_out
echo Analyzing ${SAMPLE_NAME} | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

#Extract and Count reads in input files, 3 parse version depending on compression and type of compression
if [ ${COMPRESSION_QUERY} = "No" ]
	then
	#Calculate the number of reads present in each file
	echo Starting to count the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
	date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
	echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	COUNT_READ1=`wc -l ${SAMPLE_NAME}${READ1_ID} | awk '{print $1/4}'`
	echo The forward or read1 fastq contains ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	COUNT_READ2=`wc -l ${SAMPLE_NAME}${READ2_ID} | awk '{print $1/4}'`
	echo The reverse or read2 fastq contains ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	echo Finished  identifying the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

	#Cut out first 1 million reads from forward and reverse to estimate inner mate distance and STD
	echo Creating Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	head -n 4000000 ${SAMPLE_NAME}${READ1_ID} > ${SAMPLE_NAME}${READ1_BWA_FASTQ}
	echo Finished Creating Read1 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	head -n 4000000 ${SAMPLE_NAME}${READ2_ID} > ${SAMPLE_NAME}${READ2_BWA_FASTQ}
	echo Finished Creating Read2 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
	else
	if [ ${COMPRESSION_TYPE} = "gzip" ]
		then
		echo Starting to count the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		COUNT_READ1=`gunzip -c ${SAMPLE_NAME}${READ1_ID} | wc -l | awk '{print $1/4}'`
		echo The forward or read1 fastq contains ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		COUNT_READ2=`gunzip -c ${SAMPLE_NAME}${READ2_ID} | wc -l | awk '{print $1/4}'`
		echo The reverse or read2 fastq contains ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		echo Finished  identifying the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

		#Cut out first 1 million reads from forward and reverse to estimate inner mate distance and STD
		echo Creating Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		gunzip -c ${SAMPLE_NAME}${READ1_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ1_BWA_FASTQ}
		echo Finished Creating Read1 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		gunzip -c ${SAMPLE_NAME}${READ2_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ2_BWA_FASTQ}
		echo Finished Creating Read2 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	fi
	if [ ${COMPRESSION_TYPE} = "bzip2" ]
		then
		echo Starting to count the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		COUNT_READ1=`bunzip2 -k ${SAMPLE_NAME}${READ1_ID} | wc -l | awk '{print $1/4}'`
		echo The forward or read1 fastq contains ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		COUNT_READ2=`bunzip2 -k ${SAMPLE_NAME}${READ2_ID} | wc -l | awk '{print $1/4}'`
		echo The reverse or read2 fastq contains ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		echo Finished  identifying the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

		#Cut out first 1 million reads from forward and reverse to estimate inner mate distance and STD
		echo Creating Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		bunzip2 -k ${SAMPLE_NAME}${READ1_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ1_BWA_FASTQ}
		echo Finished Creating Read1 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		bunzip2 -k ${SAMPLE_NAME}${READ2_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ2_BWA_FASTQ}
		echo Finished Creating Read2 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
	fi
fi

#Align subset reads to transcriptome reference using BWA aln
echo Starting Alignment of Read1 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
bwa aln -t ${CPU_CORES} -I ${BWA_GENOME_FILE} ${SAMPLE_NAME}${READ1_BWA_FASTQ} > ${SAMPLE_NAME}${READ1_BWA_SAI}
echo Finished Aligning Read1 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo Starting Alignment of Read2 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
bwa aln -t ${CPU_CORES} -I ${BWA_GENOME_FILE} ${SAMPLE_NAME}${READ2_BWA_FASTQ} > ${SAMPLE_NAME}${READ2_BWA_SAI}
echo Finished Aligning Read2 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

#Generate .sam file using BWA sampe
echo Starting BWA SAMPE Assembly of Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
bwa sampe ${BWA_GENOME_FILE} ${SAMPLE_NAME}${READ1_BWA_SAI} ${SAMPLE_NAME}${READ2_BWA_SAI} ${SAMPLE_NAME}${READ1_BWA_FASTQ} ${SAMPLE_NAME}${READ2_BWA_FASTQ} > ${SAMPLE_NAME}${BWA_SAMPE}.sam
echo Finished BWA SAMPE Assembly Phase | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

#Convert BWA sam file to bam file using SAMTOOLS
echo Starting SAM to BAM conversion with Samtools | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
samtools view -bS -o ${SAMPLE_NAME}${BWA_SAMPE}.bam ${SAMPLE_NAME}${BWA_SAMPE}.sam
echo Finished SAM to BAM conversion wth Samtools | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

#Sort BWA bam file using SAMTOOLS
echo Starting BAM file sorting step | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
samtools sort ${SAMPLE_NAME}${BWA_SAMPE}.bam ${SAMPLE_NAME}${BWA_SAMPE}${SAMTOOLS_SORT_ID}
echo Finished sorting BAM file | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

#Determine insert metrics using PICARD
echo Starting Picard CollectInsertSizeMetrics step | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
java -Xmx2g -jar $HOME/local/bin/CollectInsertSizeMetrics.jar INPUT=${SAMPLE_NAME}${BWA_SAMPE}${SAMTOOLS_SORT_ID}.bam OUTPUT=${SAMPLE_NAME}_Picard_InsertMetrics.txt HISTOGRAM_FILE=${SAMPLE_NAME}_Picard_InsertMetrics.pdf VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY}
echo Finished collecting insert metrics step | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

#Calculate "inner mate distance" and standard deviation
echo Starting to calculate insert metrics | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

INNER_DISTANCE=`head -n 8 ${SAMPLE_NAME}_Picard_InsertMetrics.txt | tail -n 1 | cut -f5 | awk -v var1="$COMBINED_READ_LENGTH" '{printf "%.0f\n", $1-var1}'`
echo The mean inner-mate-distance was identified as being ${INNER_DISTANCE}bp | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

STDEV=`head -n 8 ${SAMPLE_NAME}_Picard_InsertMetrics.txt | tail -n 1 | cut -f6 | awk '{printf "%.0f\n", $1}'`
echo The standard deviation of the inner-mate-distance was identified as being ${STDEV}bp | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

echo Finished calculating insert metrics | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

#Run tophat analysis
echo Starting Tophat Alignment of ${SAMPLE_NAME} | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

if [ ${TOPHAT_QUERY_GTF} = "Yes" ]       
	then 
	TOPHAT_COMMAND="tophat --num-threads ${CPU_CORES} -r ${INNER_DISTANCE} --mate-std-dev ${STDEV} --max-insertion-length ${MAX_INSERT_SIZE} --max-deletion-length ${MAX_DELETION_SIZE} ${READ_QUALITY} --library-type ${LIBRARY_FORMAT} -G ${GTF_FILE} ${BOWTIE_INDEX} ${SAMPLE_NAME}${READ1_ID} ${SAMPLE_NAME}${READ2_ID}"
fi

if [ ${TOPHAT_QUERY_GTF} = "No" ]       
	then 
	TOPHAT_COMMAND="tophat --num-threads ${CPU_CORES} -r ${INNER_DISTANCE} --mate-std-dev ${STDEV} --max-insertion-length ${MAX_INSERT_SIZE} --max-deletion-length ${MAX_DELETION_SIZE} ${READ_QUALITY} --library-type ${LIBRARY_FORMAT} ${BOWTIE_INDEX} ${SAMPLE_NAME}${READ1_ID} ${SAMPLE_NAME}${READ2_ID}"
fi

echo ${TOPHAT_COMMAND} | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
${TOPHAT_COMMAND}

## This creates a subfolder called "tophat_out/"
## This folder will contain:	accepted_hits.bam (is it sorted?)
##								deletions.bed
##								insertions.bed
##								junctions.bed

cd tophat_out/
#Current Directory=ngs/tophat/SampleX/tophat_out

mv accepted_hits.bam ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam
mv deletions.bed ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_deletions.bed
mv insertions.bed ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_insertions.bed
mv junctions.bed ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_junctions.bed

echo Finished Tophat Alignment | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

#Run cufflinks analysis
echo Starting Cufflinks Abundance Estimation of ${SAMPLE_NAME} | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
	then
	CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
	then
	CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} -M ${MASK_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
	then
	CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF ${GTF_FILE} --compatible-hits-norm --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
	then
	CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF ${GTF_FILE} --compatible-hits-norm -M ${MASK_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
	then
	CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF-guide ${GTF_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"                             
fi

if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
	then
	CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF-guide ${GTF_FILE} -M ${MASK_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"                             
fi

echo ${CUFFLINKS_COMMAND} | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
${CUFFLINKS_COMMAND}

## This creates a subfolder called "cufflinks_out/" in the sample directory
## This folder will contain:	genes.expr
##								transcripts.expr
##								transcripts.gtf

#In the next step we will rename the expression estimate files

cd ../cufflinks_out/
#Current Directory=ngs/tophat/SampleX/cufflinks_out

mv genes.fpkm_tracking ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_genes.expr
mv isoforms.fpkm_tracking ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_transcripts.expr
mv transcripts.gtf ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_transcripts.gtf

echo Finished Cufflinks Abundance Estimation | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

##############################################################
##  Calculate metrics using Samtools and Picard				##
##############################################################

cd ../tophat_out/
#Current Directory=ngs/tophat/SampleX/tophat_out

#Determine the RNA Seq Metrics using Picard "CollectRNASeqMetrics.jar"
echo Starting Picard CollectRnaSeqMetrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
java -Xmx2g -jar $HOME/local/bin/CollectRnaSeqMetrics.jar REF_FLAT=${REFFLAT_FILE} RIBOSOMAL_INTERVALS=${RIBOSOME_LIST} STRAND_SPECIFICITY=NONE REFERENCE_SEQUENCE=${REFERENCE_FILE} INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt CHART_OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.pdf VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY}
echo Finished collecting RNAseq metrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

#Mark the duplicate reads using Picard "MarkDuplicates.jar"
echo Starting Picard MarkDuplicates step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
## NOTE: "-Dsnappy.disable=true" is passed to this command to prevent the use of snappy-java by MarkDuplicates as I can not figure out how to add it to the JAVA CLASSPATH 
java -Dsnappy.disable=true -Xmx2g -jar $HOME/local/bin/MarkDuplicates.jar INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam METRICS_FILE=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_MarkDuplicates_Metrics.txt REMOVE_DUPLICATES=false ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY}
echo Finished MarkDuplicates step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

#Create an index file for the bam file generated from the MarkDuplicates step using SAMTOOLS INDEX
echo Starting bam file indexing step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
samtools index ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam
echo Finished bam file indexing step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

#Calculate mapping statistics using SAMTOOLS FlAGSTAT
echo Starting Samtools Flagstat step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
samtools flagstat ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam > ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_SamtoolsMappingStats.txt
echo Finished Samtools Flagstat step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

#Calculate mapping statistics using Picard "CollectAlignmentSummaryMetrics.jar"
echo Starting Picard CollectAlignmentSummaryMetrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
java -Xmx2g -jar $HOME/local/bin/CollectAlignmentSummaryMetrics.jar INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt VALIDATION_STRINGENCY=SILENT REFERENCE_SEQUENCE=${REFERENCE_FILE} ASSUME_SORTED=true IS_BISULFITE_SEQUENCED=false TMP_DIR=${TEMP_DIRECTORY}
echo Finished Picard CollectAlignmentSummaryMetrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

#Calculate read quality stats using Picard "MeanQualityByCycle.jar"
echo Starting Picard MeanQualityByCycle step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
java -Xmx2g -jar $HOME/local/bin/MeanQualityByCycle.jar INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardQualByCycle.txt CHART_OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardQualByCycle.pdf VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY}
echo Finished Picard MeanQualityByCycle step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

##############################################################
##  Write out summary file (Stats/Metrics/Indexes?)			##
##############################################################

#Create a summary text file
echo Starting Creation of Summary File | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

echo "*** Analysis Run Report ***" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo Sample ${SAMPLE_NAME} was Analyzed | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

# Test to see if the user wants a to extract FPKM values for a list of genes into the run report
if [ ${GENESOFINTEREST_QUERY} = "Yes" ]
	then
	echo FPKM Values for Common genes of Interests: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
	grep -wf ${GENES_OF_INTEREST} ../cufflinks_out/${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_genes.expr | awk '{print $1, $10}' | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
	echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
fi

echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

echo Sample Summary: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "The two read files processed were ${SAMPLE_NAME}${READ1_ID} and ${SAMPLE_NAME}${READ2_ID}" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${SAMPLE_NAME}${READ1_ID} contained ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${SAMPLE_NAME}${READ2_ID} contained ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo After aligning the first one million reads to a transcriptome reference using BWA | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo the inner-mate-distance was determined to be ${INNER_DISTANCE}bp with a standard deviation of ${STDEV} | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "The alignment metrics for this sample were as follows:" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

#Extract Alignment Metrics
UNIQUE_READ1_READS_ALIGNED=`cut -f3 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | head -n 8 | tail -n 1`
UNIQUE_READ2_READS_ALIGNED=`cut -f3 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | head -n 9 | tail -n 1`
TOTAL_READ1_ALIGNMENTS=`awk '{print $1}' ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_SamtoolsMappingStats.txt | head -n 5 | tail -n 1`
TOTAL_READ2_ALIGNMENTS=`awk '{print $1}' ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_SamtoolsMappingStats.txt | head -n 6 | tail -n 1`
PERCENT_READ1_ALIGNED=`echo ${COUNT_READ1} | awk -v var2="${UNIQUE_READ1_READS_ALIGNED}" '{print var2/$1}'`
PERCENT_READ2_ALIGNED=`echo ${COUNT_READ2} | awk -v var3="${UNIQUE_READ2_READS_ALIGNED}" '{print var3/$1}'`
PERCENT_READ1_Q20_BASES=`head -n 8 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | tail -n 1 | awk '{print $11/$8}'`
PERCENT_READ2_Q20_BASES=`head -n 9 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | tail -n 1 | awk '{print $11/$8}'`
PERCENT_PAIRED_READ_ALIGNMENT=`cut -f17 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | head -n 8 | tail -n 1 | awk -v var4="${COUNT_READ1}" '{print $1/var4}'`

echo Alignment Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${UNIQUE_READ1_READS_ALIGNED} - Number of Unique Reads from Read1 Aligned | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${UNIQUE_READ2_READS_ALIGNED} - Number of Unique Reads from Read2 Aligned | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${TOTAL_READ1_ALIGNMENTS} - Total Number of Alignment Events from the Unique Reads in Read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${TOTAL_READ2_ALIGNMENTS} - Total Number of Alignment Events from the Unique Reads in Read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_READ1_ALIGNED} - Percentage of Unique Read1 Reads Aligned to at least one Position | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_READ2_ALIGNED} - Percentage of Unique Read2 Reads Aligned to at least one Position | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_READ1_Q20_BASES} - Percentage of Aligned Bases from Read1, which are greater than Q20 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_READ2_Q20_BASES} - Percentage of Aligned Bases from Read2,  which are greater than Q20 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_PAIRED_READ_ALIGNMENT} - Percentage of Reads Aligned as Pairs | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

#Extract RNAseq Metrics
PERCENT_RIBOSOME_BASES=`cut -f11 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
PERCENT_CODING_BASES=`cut -f12 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
PERCENT_UTR_BASES=`cut -f13 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
PERCENT_INTRONIC_BASES=`cut -f14 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
PERCENT_INTERGENIC_BASES=`cut -f15 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
PERCENT_MRNA_BASES=`cut -f16 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
MEDIAN_5PRIME_BIAS=`cut -f20 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
MEDIAN_3PRIME_BIAS=`cut -f21 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
MEDIAN_5PRIME_3PRIME_BIAS_RATIO=`cut -f22 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`

echo Transcriptome Sequencing Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_RIBOSOME_BASES} - Percent Ribosomal Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_CODING_BASES} - Percent Coding Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_UTR_BASES} - Percent UTR Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_INTRONIC_BASES} - Percent Intronic Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_INTERGENIC_BASES} - Percent Intergenic Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_MRNA_BASES} - Percent mRNA Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${MEDIAN_5PRIME_BIAS} - Median 5 Prime Coverage Bias | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${MEDIAN_3PRIME_BIAS} - Median 3 Prime Coverage Bias | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${MEDIAN_5PRIME_3PRIME_BIAS_RATIO} - Median 5/3 Coverage Ratio | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

#Extract Duplication Metrics
PERCENT_DUPLICATION=`head -n 8 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_MarkDuplicates_Metrics.txt | tail -n 1 | cut -f8`
LIBRARY_SIZE=`head -n 8 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_MarkDuplicates_Metrics.txt | tail -n 1 | cut -f9`

echo Duplication Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${LIBRARY_SIZE} - Estimated fragments in Library | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_DUPLICATION} - Percent Duplicate Reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

#Calculate Quality Metrics (first/last cycle and mean per read)
QUALITY_CAPTURE_LINES=`echo ${COMBINED_READ_LENGTH} | awk '{print $1+8}'`
QUALITY_CAPTURE_READS=`echo ${COMBINED_READ_LENGTH} | awk '{print $1/2}'`
head -n ${QUALITY_CAPTURE_LINES} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardQualByCycle.txt | tail -n ${COMBINED_READ_LENGTH} > run_qualities.txt
head -n ${QUALITY_CAPTURE_READS} run_qualities.txt > read1_qualities.txt
tail -n ${QUALITY_CAPTURE_READS} run_qualities.txt > read2_qualities.txt
QUALITY_READ1_FIRST_CYCLE1=`cut -f2 read1_qualities.txt | head -n 1`
QUALITY_READ1_LAST_CYCLE1=`cut -f2 read1_qualities.txt | tail -n 1`
QUALITY_READ1_AVERAGE=`awk '{s+=$2} END {print s/NR}' < read1_qualities.txt`
QUALITY_READ2_FIRST_CYCLE1=`cut -f2 read2_qualities.txt | head -n 1`
QUALITY_READ2_LAST_CYCLE1=`cut -f2 read2_qualities.txt | tail -n 1`
QUALITY_READ2_AVERAGE=`awk '{s+=$2} END {print s/NR}' < read2_qualities.txt`

echo Quality Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo This run consists of paired ${QUALITY_CAPTURE_READS}x${QUALITY_CAPTURE_READS}bp reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${QUALITY_READ1_AVERAGE} - Mean phred score for read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${QUALITY_READ1_FIRST_CYCLE1} - Mean phred score for the first cycle of read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${QUALITY_READ1_LAST_CYCLE1} - Mean phred score for the last cycle of read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${QUALITY_READ2_AVERAGE} - Mean phred score for read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${QUALITY_READ2_FIRST_CYCLE1} - Mean phred score for the first cycle of read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${QUALITY_READ2_LAST_CYCLE1} - Mean phred score for the last cycle of read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

#Output complete list of variables used in the run
echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo   THE FOLLOWING VARIABLES WERE GENERATED DURING THE ANALYSIS OF THIS SAMPLE: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

echo "COMPRESSION_QUERY=\"${COMPRESSION_QUERY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COMPRESSION_TYPE=\"${COMPRESSION_TYPE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "TOPHAT_QUERY_GTF=\"${TOPHAT_QUERY_GTF}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "TOPHAT_GTF_SHORT=\"${TOPHAT_GTF_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "CUFFLINKS_QUERY_LIMIT_GTF=\"${CUFFLINKS_QUERY_LIMIT_GTF}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "CUFFLINKS_QUERY_GUIDE_GTF=\"${CUFFLINKS_QUERY_GUIDE_GTF}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "LIMIT_GTF_SHORT=\"${LIMIT_GTF_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "GUIDE_GTF_SHORT=\"${GUIDE_GTF_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "CUFFLINKS_QUERY_MASK=\"${CUFFLINKS_QUERY_MASK}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "MASK_SHORT=\"${MASK_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "GENESOFINTEREST_QUERY=\"${GENESOFINTEREST_QUERY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "FINAL_CHECK=\"${FINAL_CHECK}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "TOPHAT_TAG=\"${TOPHAT_TAG}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "CUFFLINKS_TAG=\"${CUFFLINKS_TAG}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "SAMPLE_NAME=\"${SAMPLE_NAME}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COUNT_READ1=\"${COUNT_READ1}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COUNT_READ2=\"${COUNT_READ2}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "INNER_DISTANCE=\"${INNER_DISTANCE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "STDEV=\"${STDEV}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "TOPHAT_COMMAND=\"${TOPHAT_COMMAND}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "CUFFLINKS_COMMAND=\"${CUFFLINKS_COMMAND}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo THE FOLLOWING VARIABLES WERE LOADED FROM THE RNAseq_MensFormalWear_Parameters.ini DURING THE ANALYSIS OF THIS SAMPLE: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo They may or may not have been used depending on your query respones
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "LAUNCH_DIR=\"${LAUNCH_DIR}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ_EXT=\"${READ_EXT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ1_ID=\"${READ1_ID}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ2_ID=\"${READ2_ID}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COMBINED_READ_LENGTH=\"${COMBINED_READ_LENGTH}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "CPU_CORES=\"${CPU_CORES}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "GENOME_NAME=\"${GENOME_NAME}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "REFERENCE_FILE=\"${REFERENCE_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "BWA_GENOME_FILE=\"${BWA_GENOME_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ1_BWA_FASTQ=\"${READ1_BWA_FASTQ}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ2_BWA_FASTQ=\"${READ2_BWA_FASTQ}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ1_BWA_SAI=\"${READ1_BWA_SAI}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ2_BWA_SAI=\"${READ2_BWA_SAI}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "BWA_SAMPE=\"${BWA_SAMPE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "SAMTOOLS_SORT_ID=\"${SAMTOOLS_SORT_ID}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "BOWTIE_INDEX=\"${BOWTIE_INDEX}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "GTF_FILE=\"${GTF_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ_QUALITY=\"${READ_QUALITY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "LIBRARY_FORMAT=\"${LIBRARY_FORMAT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "MAX_INSERT_SIZE=\"${MAX_INSERT_SIZE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "MAX_DELETION_SIZE=\"${MAX_DELETION_SIZE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "MASK_FILE=\"${MASK_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "TEMP_DIRECTORY=\"${TEMP_DIRECTORY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "REFFLAT_FILE=\"${REFFLAT_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "RIBOSOME_LIST=\"${RIBOSOME_LIST}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "GENES_OF_INTEREST=\"${GENES_OF_INTEREST}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COMPLETED_RNASEQ_READ_FILE_DIRECTORY=\"${COMPLETED_RNASEQ_READ_FILE_DIRECTORY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COMPLETED_RNASEQ_RESULTS_DIRECTORY=\"${COMPLETED_RNASEQ_RESULTS_DIRECTORY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt

echo Finished Creating the Summary File | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

##############################################################
##  Clean up analysis directory and Move files to Archive	##
##############################################################

echo Starting to Clean up Analysis Directory | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log

#Remove temp files created during calculations steps
rm run_qualities.txt
rm read1_qualities.txt
rm read2_qualities.txt

#Move tophat outputs and metric files to archive directory
mv ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
mv ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam.bai ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
mv *.bed ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
mv *.pdf ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
mv *.txt ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}

#Move cufflinks outputs to archive directory
cd ../cufflinks_out
#Current Directory=ngs/tophat/SampleX/cufflinks_out
mv *.expr ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
mv *.gtf ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}

#Move fastq, ini, and log file to archive directories
cd ../
#Current Directory=ngs/tophat/SampleX
mv ${SAMPLE_NAME}${READ1_ID} ${COMPLETED_RNASEQ_READ_FILE_DIRECTORY}
mv ${SAMPLE_NAME}${READ2_ID} ${COMPLETED_RNASEQ_READ_FILE_DIRECTORY}

echo Finished Cleaning up Analysis Directory | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log

mv *.log ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}

#Move back to launch directory to start next sample in batch
cd ../
#Current Directory=ngs/tophat/

#Remove Custom analysis directory and none required files
rm -rf ${SAMPLE_NAME}

done

rnaseq_common_parameters.ini

Code:

## RNAseq Common parameters file ##

# Use this variable to indicate the expected launch directory
# (Update as appropriate)
LAUNCH_DIR=/Users/jkeats/ngs/tophat

##################################################################################
## 										##	
##    Common Parameters used at multiple steps or used to name output files	##
##				(Update as appropriate)				##
##										##
##################################################################################

#sequence read file extension
READ_EXT=".fastq.gz"

#forward or read1 identifier
READ1_ID="_1.fastq.gz"

#reverse or read2 identifier
READ2_ID="_2.fastq.gz"

#What is the combined read length of the two read pairs (ie. 50x50=100, 100x100=200)
COMBINED_READ_LENGTH="100"

#How many cores/processors are available for the multi-threaded phases of BWA, TOPHAT, CUFFLINKS
CPU_CORES="1"

#This is used to rename result files to include the genome version
GENOME_NAME=hg19

#Provide the path and full file name to the genome reference file used to create the bowtie index
REFERENCE_FILE=/Users/jkeats/ngs/refgenomes/bowtie_indexed/igenomes_hg19/genome.fa

##################################################################################
##										##
##    Common BWA parameters (Only the BWA_GENOME_FILE needs to be updated)	##
##										##
##################################################################################

BWA_GENOME_FILE="/Users/jkeats/ngs/refgenomes/bwa_indexed/GRCh37_E60cDNA.fa"

READ1_BWA_FASTQ="_1_bwa.fastq"
READ2_BWA_FASTQ="_2_bwa.fastq"

READ1_BWA_SAI="_1_bwa.sai"
READ2_BWA_SAI="_2_bwa.sai"

BWA_SAMPE="_bwa_sampe"

##################################################################################
##										##
##    Common Samtools parameters (Shouldn't need to be modified)		##
##										##	
##################################################################################

SAMTOOLS_SORT_ID="_sorted"

##################################################################################
## 										##
##    Common Tophat Parameters (Update as appropriate)				##
##										##
##################################################################################

#Bowtie index to align data against [ <filename> ]
BOWTIE_INDEX="/Users/jkeats/ngs/refgenomes/bowtie_indexed/igenomes_hg19/genome"

#GTF file to use for LIMITED ABUNDANCE ESTIMATION (-G/--GTF) or GUIDED ABUNDANCE ESTIMATION ( -g/--GTF-guide)
GTF_FILE="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/hg19_genes.gtf"

#Quality type [ --solexa-quals  OR  --solexa1.3-quals (same as phred64-quals)  OR  --phred64-quals (same as solexa1.3-quals) ]
READ_QUALITY="--solexa1.3-quals"

#Library type [ fr-unstranded  OR  fr-firststrand  OR  fr-secondstrand)
LIBRARY_FORMAT="fr-unstranded"

#Maximum insertion size allow IF allowing indels [ --max-insertion-length     <int>     default: 3 ]
MAX_INSERT_SIZE="3"

#Maximum deletion size allowed IF allowing indels [ --max-deletion-length     <int>     default: 3 ]
MAX_DELETION_SIZE="3"

##################################################################################
##										##
##   Common Cufflinks Parameters (Update as appropriate)			##
##										##
##################################################################################

#This should be used if using an exclusion file for cufflinks abundance estimates
MASK_FILE="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/GRCh37_E60_CufflinksExcludedTranscripts.gtf"

##################################################################################
##										##
##   Common Picard Parameters (Update as appropriate)				##
##										##
##################################################################################

#Provide a directory for temporary files (Not always needed but on multi-harddrive systems JAVA may use an unexpected directory)
TEMP_DIRECTORY="/Users/jkeats/ngs/refgenomes/temp"

#This is a refFlat format table outlining the transcript space of the genome analyzed
REFFLAT_FILE="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/hg19_refFlat.txt"

#This is the ribosome exclusion file, a list of ribosome gene locations to report the percentage of reads related to ribosomal transcripts
RIBOSOME_LIST="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/hg19_ribosome_gene_locations.txt"

##################################################################################
##										##
##   Gene List of Interest (Update as appropriate)				##
##										##
##################################################################################

#List of genes to extract
GENES_OF_INTEREST="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/TC_Class_Genes.txt"

##################################################################################
##										##
##   Final Directories (Update as appropriate)					##
##										##
##################################################################################

#Provide the path to the folder were you want to archive the read files
COMPLETED_RNASEQ_READ_FILE_DIRECTORY="/Users/jkeats/ngs/analyzed_read_files/rnaseq/"

#Provide the path to the folder were you want to archive the read files
COMPLETED_RNASEQ_RESULTS_DIRECTORY="/Users/jkeats/ngs/finaloutputs/rnaseq/"

**LOH** · 09-26-2011, 08:23 AM

I learned a lot from your post. And the website for your lab looks great.
keep up with the good work.

**Jon_Keats** · 11-11-2011, 11:44 PM

GATK Trial and Error

I'm finally getting into using the GATK package. My first impression is that I find it much more intimidating that any of the Picard tools and the documentation is a bit harder to follow. Though I think that is largely the difference between a wiki style interface and a singular website page.

Notes:
1) Most analysis methods require indexed bam files. So get into a habit of just indexing every bam file. Typically I only indexed the final bam file for viewing in IGV
2) Related to 1) almost every file seems to require indexes. The reference genome (genome.fasta) file needs an index (genome.fasta.fai), and a dictionary file (genome.dict)
3) Related to 1) and 2) ....see a trend. The input vcf/rod files (file.vcf) need to be indexed (file.vcf.idx)

The frustrating thing is I know how to make the .fai file

Code:

samtools faidx <myFASTQ>

I know how to make the dictionary file using picard

Code:

java -jar CreateSequenceDictionary.jar REFERENCE=genome.fasta OUTPUT=genome.dict

But I have no idea how to create the .idx index file of the vcf file. So right now I'm limited to the file provided on the GATK website, as they provide complete indexes for each.

4) Fixing mates with Picard is no longer required after indel realignment as GATK does it on the fly now

Much more to come I have a feeling

**ahmetz** · 11-18-2011, 08:46 AM

I'm glad to hear that i'm not the only one intimidated by the documentation or there lack of. I feel like a big portion of TopHat/Cufflinks users are using those because of nice, relatively easy to understand documentation.

please keep on posting Jon, as I had given up on GATK, trying to learn python, but now i feel like i should get back into it!

**DineshCyanam** · 11-20-2011, 05:33 PM

Thanks a lot Jon for all the scripts... Really appreciate your work... Makes my life super easy... I have lost count of how many times I have referred back to this thread...

- Dinesh

**DineshCyanam** · 11-26-2011, 09:28 PM

I believe line #316 in the RNAseq_MensFormal_Mac.sh should be as below. Or else no matter what user response you give (Yes/No), the tophat tag will always be set to TophatPE_GTF.

Code:

# TOPHAT_TAG options are TophatPE_DeNovo or TophatPE_GTF
if [[ ${TOPHAT_QUERY_GTF} = "Yes" ]]
    then
    TOPHAT_TAG="TophatPE_${TOPHAT_GTF_SHORT}GTF" 
fi

if [[ ${TOPHAT_QUERY_GTF} = "No" ]]
    then
    TOPHAT_TAG="TophatPE_DeNovo"
fi
echo The tophat tag has been set to ${TOPHAT_TAG}

**Jon_Keats** · 11-26-2011, 10:52 PM

Hi Dinesh,

Thanks for the good catch. I'd made this update on my local version but forgot to update this thread and the version on the website. The version above is updated now and the website version should be done tonight.

Updated as follows:

Code:

# TOPHAT_TAG options are TophatPE_DeNovo or TophatPE_GTF
if [ ${TOPHAT_QUERY_GTF} = "Yes" ]
	then
	TOPHAT_TAG="TophatPE_${TOPHAT_GTF_SHORT}GTF" 
	else
	TOPHAT_TAG="TophatPE_DeNovo"                              
fi
echo The tophat tag has been set to ${TOPHAT_TAG}

I got your email about the mask file. I've avoided providing one, not because I don't think people should use one but because the contents may be specific to different applications. In my case we exclude immunoglobulin regions, which is a disease specific issue, but things like ribosomal and mitochondrial genes are likely things that 99% of users will exclude so I've placed a copy of the file I use on the website (www.keatslab.org).

To generate a similar file with the elements

Code:

#see what RNA types exist in your gtf file
cut -f 2 ensembl_for_your_genome.gtf | sort | uniq

#create a text file with the RNA types you want to exclude, such as
#to_exclude.txt
Mt_rRNA
Mt_tRNA
Mt_tRNA_pseudogene
rRNA
rRNA_pseudogene
MT

grep -f to_exclude.txt ensembl_for_your_genome.gtf > CufflinksExcludedTranscripts.gtf

**DineshCyanam** · 11-27-2011, 12:24 AM

Thank You, Jon for fixing the bug and for the cufflinks_exclude files.

I was about to post about HiSeq data and Illumina's switch to Sanger quality scores (as I had to update your script to work with HiSeq data) and see that you have posted it on your website. So thanks once again.

- Dinesh

**ETHANol** · 12-05-2011, 05:39 AM

Hi Jon,
I nominate this as the best running thread on SeqAnswers. I'm going to be looking at a lot of RNA-seq data in the next months and I was wondering why you use Cufflinks over DESeq or EdgeR?

Thanks for the scripts. I'll give them a go once I start getting some data back.

I've been embedding too much UNIX in my Perl scripts and your scripts are a nice example on how to write shell scripts. Do you or anyone else know a good reference on writing shell scripts beyond the 'Unix and perl tutorial for biologists'.

**anjulka** · 12-13-2011, 12:57 PM

Originally posted by Jon_Keats View Post

GATK Trial and Error
4) Fixing mates with Picard is no longer required after indel realignment as GATK does it on the fly now

How did you find this out? (I don't mean to doubt you, but I've been led astray before. Plus, I need to learn how to find out change information.)

Thanks.

**Jon_Keats** · 12-14-2011, 02:32 PM

Hi Anjulka,

I completely understand. The source was not obvious, I was following the GATK best practice guide which no longer suggests the picard fix-mate step. Similar to your concern I spent some time searching for the reason and finally found this thread on the GATK Get Satisfaction help page (http://getsatisfaction.com/gsa/topic...el_realignment). See the first reply by Eric Banks (GATK developer)

**Jon_Keats** · 01-17-2012, 12:27 PM

I'm a bit amazed by how many people follow or have read this thread, especially with it recently passing ECO introduction sticky thread. I wonder if I can reference this thread to the grant reviewers that say I need more informatics help on my grants... Oh well enough complaining.

I've been finalizing my exome pipeline for one of the projects that deals with human myeloma cell lines were there is not germline/constitutional DNA sample available. The primary goal is to identify variants in genes identified in our patient sample screens to identify potential model systems and appropriate controls.

I'll post the entire pipeline and examples of all the required input files but the pipeline looks like the following (see below) and generates a Summary Run Report shown (see below) after the following command is initiated in a directory full of fastq files (can be a total mixed bag and either uncompressed, gzip, or bzip2 compressed)

<BWA_exome_pipeline.sh> <run_parameters.ini> <sample_sheet.txt>

Keats Lab

http://www.keatslab.org/resources/ngs-tools

Code:

*** Analysis Run Report ***

Sample KMS11_JPN_p3_TruSeq70MbExome was Analyzed

-----------------------------------------------------------------------------

*** Lane Alignment Run Report ***

Sample KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L7 was Analyzed

The two read files processed were KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L7_R1.fastq.gz and KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L7_R2.fastq.gz
KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L7_R1.fastq.gz contained 50457788 reads
KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L7_R2.fastq.gz contained 50457788 reads
-----------------------------------------------------------------------------

*** Lane Alignment Run Report ***

Lane KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L8 was Analyzed
The two read files processed were KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L8_R1.fastq.gz and KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L8_R2.fastq.gz
KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L8_R1.fastq.gz contained 50279910 reads
KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L8_R2.fastq.gz contained 50279910 reads
-----------------------------------------------------------------------------

Sample Summary:

After alignment with BWA the mean insert size was determined to be 270.820553bp with a standard deviation of 91.662134

The alignment metrics for this sample are as follows:

Alignment Metrics:
83 - Average Read Length
201475396 - Total Number of Passing Filter Reads Aligned with bwa
199321958 - Number of Reads Aligned
99991035 - Number of Reads from Read1 Aligned
99330923 - Number of Reads from Read2 Aligned
0.989312 - Percentage of Unique Reads Aligned to at least one Position
0.992588 - Percentage of Unique Read1 Reads Aligned to at least one Position
0.986035 - Percentage of Unique Read2 Reads Aligned to at least one Position
0.844223 - Percentage of Aligned Bases, which are greater than Q20
0.99472 - Percentage of Reads Aligned as Pairs
0.000131 - Percentage of Reads which appear to be Adaptor Reads

Duplication Metrics:
500572402 - Estimated fragments in Library
0.095995 - Percent Duplicate Reads

Transcriptome Metrics:
0.56342 - Percent mRNA Bases
0.247456 - Percent Coding Bases
0.315964 - Percent UTR Bases
0.293039 - Percent Intronic Bases
0.143454 - Percent Intergenic Bases
0.661081 - Median 5 Prime Coverage Bias
0.730395 - Median 3 Prime Coverage Bias
0.938235 - Median 5/3 Coverage Ratio

Capture Metrics:
182341432 - Number of unique alignment events [non-duplicates]
0.905031 - Percent unique alignment events [non-duplicates]
162369533 - Number of unique alignment events with MAPQ>0
0.89047 - Percent of reads which are unique alignments with MAPQ>0
13.446 - Gigabases of unique alignments with MAPQ>0
0.696883 - Percent of aligned bases within padded bait region
0.303117 - Percent of aligned bases not associated with a bait region
101.257402 - Mean coverage of the target region
0.376064 - Percentage of bases on target
5.057 - Gigabases of on target base pairs
23.291078 - Fold enrichment
0.003347 - Percent of undetected targets
0.95061 - Percent of targets with greater than 10x coverage
0.923333 - Percent of targets with greater than 20x coverage
0.886545 - Percent of targets with greater than 30x coverage
0.03663 - Percent of low GC targets which are under represented
5.835684 - Percent of high GC targets which are under represented

Quality Metrics:
This run consists of paired 83x83bp reads
32.6069 - Mean phred score for read1
32.216839 - Mean phred score for the first cycle of read1
30.498119 - Mean phred score for the last cycle of read1
31.2961 - Mean phred score for read2
30.722569 - Mean phred score for the first cycle of read2
28.530552 - Mean phred score for the last cycle of read2

Variant Calling Metrics:
40956 - Number of high-confidence variants called by both GATK and Samtools
2.63 - Ti/Tv Ratio for high-confidence variants called by both GATK and Samtools
2059 - Number of high-confidence variants that do not exist in dbsnp
2.19 - Ti/Tv Ratio for high-confidence variants called by both GATK and Samtools that do not exist in dbsnp
3266 - Number of high-confidence variants that do not exist in 1000G project
2.00 - Ti/Tv Ratio for high-confidence variants called by both GATK and Samtools that do not exist in 1000G project
94.97 - Percentage of high-confidence variants that exist in dbsnp
92.60 - Percentage of variants called by GATK and filtered by Samtools that exist in dbsnp
81.86 - Percentage of variants called by Samtools and filtered by GATK that exist in dbsnp
87.19 - Percentage of variants called exclusively by GATK that exist in dbsnp
41.50 - Percentage of variants called exclusively by Samtools that exist in dbsnp
92.95 - Percentage of variants called by but filtered by GATK and Samtools that exist in dbsnp

581 of the 3266 high-confidence variants absent from 1000G are non-synonymous changes

The 581 non-synonymous changes includes the following genes, which are mutated in the MMGI Study:
ZNF804A	-697T	2	185802211	C	CACA	[0, 0, 78, 75]	rs5836928

The 581 non-synonymous changes includes the following positions, which exist in the Cosmic database:
ZNF804A	-697T	2	185802211	C	CACA	[0, 0, 78, 75]	rs5836928
FGFR3	Y373C	4	1806099	A	G	[8, 4, 14, 5]	rs121913485

The 581 non-synonymous changes includes the following nonsense changes:
RP4-788L13.1	E298*	1	99771640	G	T	[28, 35, 7, 11]	.
TAF1A	Y36*	1	222761798	G	T	[53, 28, 12, 11]	.
MAP4K4	Q488*	2	102482984	C	T	[8, 19, 7, 14]	.
PTPRN	Q22*	2	220172517	G	A	[39, 75, 19, 15]	.
PAQR3	S133*	4	79851430	G	C	[29, 26, 25, 27]	.
CYFIP2	W204*	5	156786100	G	A	[7, 13, 9, 11]	.
ANKS1A	E420*	6	34957049	G	T	[26, 23, 12, 9]	.
PLEKHA2	C182*	8	38810216	T	A	[52, 33, 33, 15]	.
SOX6	R596*	11	16010642	G	A	[32, 26, 15, 11]	.
CD4	W53*	12	6909582	G	A	[12, 35, 7, 16]	.
GCOM1	Y122*	15	57913853	T	G	[39, 26, 17, 16]	.
MEF2A	K25*	15	100185784	A	T	[41, 5, 18, 1]	.
NOXO1	W269*	16	2029435	C	T	[19, 27, 24, 33]	.
SRCAP	Q2*	16	30712149	C	T	[17, 29, 8, 9]	.
ZNF613	E134*	19	52447644	G	T	[81, 31, 18, 9]	.
ZIM2	R335*	19	57286637	G	A	[37, 32, 34, 16]	.
FTHL17	E148*	X	31089629	C	A	[0, 0, 30, 8]	rs7057057
MAGEB16	R272*	X	35821127	C	T	[0, 0, 60, 63]	rs4829392
MAGEE2	E120*	X	75004529	C	A	[0, 0, 27, 28]	rs1343879
UBE2NL	L89*	X	142967468	T	G	[0, 0, 49, 62]	rs237520

The 581 non-synonymous changes includes the following frameshift changes:
PROX1	NA	1	214170645	AG	A	[18, 17, 62, 63]	.
DNAH14	NA	1	225328438	G	GAT	[10, 0, 29, 1]	.
FGFR3	NA	4	1809127	CGT	C	[0, 0, 27, 26]	rs34003391
NR3C2	NA	4	149002016	AT	A	[0, 0, 11, 12]	rs10708334
ZNF608	NA	5	123982405	CT	C	[26, 3, 10, 3]	.
CYFIP2	NA	5	156721863	T	TC	[0, 0, 16, 19]	rs76159126
STEAP2	NA	7	89854652	ACAAAT	A	[20, 27, 33, 45]	.
OR13C2	NA	9	107367392	TGTTA	T	[23, 19, 25, 39]	rs71991336
C9orf173	NA	9	140147822	GATGCACCCTTCTGGCTGGCCAACCCTTCT	G	[11, 15, 11, 18]	.
CREB3L1	NA	11	46342259	A	AG	[0, 0, 54, 2]	rs71038900
DIXDC1	NA	11	111853106	G	GC	[0, 0, 33, 51]	rs67635503
TREH	NA	11	118529044	C	CG	[0, 0, 17, 24]	rs11448549
HSP90B1	NA	12	104326109	TAGAG	T	[12, 19, 20, 16]	.
PKD1L2	NA	16	81242148	GTT	G	[0, 0, 61, 26]	rs62734462
LRRC59	NA	17	48474672	TGGTCA	T	[9, 14, 2, 7]	.
HSH2D	NA	19	16268207	TA	T	[0, 0, 7, 18]	rs5827321
LTBP4	NA	19	41123093	A	AG	[0, 0, 18, 43]	rs67216392
RASIP1	NA	19	49230400	GA	G	[22, 49, 5, 18]	.
CD33	NA	19	51729577	ACCCAACAACTGGTATCTTT	A	[37, 19, 24, 14]	.
ADAMTS1	NA	21	28210040	GA	G	[69, 35, 19, 8]	.
C21orf49	NA	21	34169317	C	CT	[0, 0, 48, 55]	rs5843567
COL6A2	NA	21	47545369	A	AC	[0, 1, 29, 69]	.
CLTCL1	NA	22	19189003	A	AC	[0, 0, 1, 22]	rs78649162
EFHC2	NA	X	44202890	G	GC	[0, 0, 2, 35]	rs5902346
SSX9	NA	X	48163733	TC	T	[0, 0, 57, 22]	rs112256157
NUDT11	NA	X	51239295	ATCCTCGAGGCAGCC	A	[0, 0, 4, 12]	rs78182391
TCEAL6	NA	X	101395780	T	TG	[0, 0, 27, 90]	rs11408120
TEX13A	NA	X	104464276	CA	C	[0, 0, 36, 7]	rs66463437

The 581 non-synonymous changes includes the following start codon losses:
TLR8	M1V	X	12924826	A	G	[0, 0, 17, 48]	rs3764880

The 581 non-synonymous changes includes the following stop codon losses:
MPL	*636W	1	43818443	A	G	[31, 17, 45, 40]	.
PCSK2	*604Q	20	17462713	T	C	[26, 13, 18, 14]	.

Attached Files

BWA_Exome_Pipeline.pdf (376.8 KB, 284 views)

**golharam** · 01-17-2012, 01:59 PM

Hi Jon - The BWA Exome Pipeline looks great. I assume the RNA-Seq pipeline is different just by looking at the pdf you posted. I assume the last RNA-Seq pipeline was from 09-26-2011. I also see that the Circos stuff wasn't included. Is there a reason why?

**ahmetz** · 01-20-2012, 09:13 AM

Hey Jon,

While going over your pipeline, I noticed something. With this line:

INNER_DISTANCE=`head -n 8 ${SAMPLE_NAME}_Picard_InsertMetrics.txt | tail -n 1 | cut -f5 | awk -v var1="$COMBINED_READ_LENGTH" '{printf "%.0f\n", $1-var1}'`

you're calculating the insert size as you use the INNER_DISTANCE value with -r option in Tophat. However, the Picard output, the 5th column is already MEAN_INSERT_SIZE, why are you subtracting COMBINED_READ_LENGTH from the mean insert size again?

**Jon_Keats** · 01-20-2012, 09:15 PM

Hi Ahmetz,

Tophat -r is looking for the "mate-inner-distance" which is the distance between the ends of read 1 and read 2 not the distance between adaptor1 and adaptor2. We have talked about having a wiki locally for terminology but this is what we often use.

[Adaptor1]>>>>read1>>>---------<<<read2<<<<[Adaptor2]

Fragment Size = distance from beginning of adaptor 1 to the end of adaptor 2
Insert Size = size of DNA between adaptor 1 and adaptor 2
Inner Distance (what tophat wants) = the distance between the end of read1 and the end or read2. In the example above if you consider each - as a nucleotide and the ends of each read defined by > and < then the inner-distance is 9.

Golharam,

We are currently updating the RNAseq pipeline to work with Topat 1.4 to take advantage of the new align to transcriptome step that I've always pushed to have in an RNAseq aligner. It will also look a bit more similar to our exome pipeline so that it can add read groups.
Ultimately, I'd envision there being a exome pipeline, RNAseq pipeline, mate-pair pipeline, and then a summary pipeline that will merge all three results and create the circos plot. Should be fairly easy to pipe the results of the current pipeline into circos but I always plot more than a single data type so I didn't bother to included it. But not a bad idea to included as a launch query so a user could pipe the results directly.

Topics	Statistics	Last Post
A New Method Makes Hantavirus Genome Analysis Faster and More Accessible by SEQadmin2 Started by SEQadmin2, Yesterday, 10:09 AM	0 responses 9 views 0 reactions	Last Post by SEQadmin2 Yesterday, 10:09 AM
A New Single-Cell Method Maps DNA-Protein Interactions by SEQadmin2 Started by SEQadmin2, 06-04-2026, 08:59 AM	0 responses 17 views 0 reactions	Last Post by SEQadmin2 06-04-2026, 08:59 AM
Long-Read RNA Sequencing Uncovers a Hidden Layer of Immune Cell Regulation by SEQadmin2 Started by SEQadmin2, 06-02-2026, 12:03 PM	0 responses 26 views 0 reactions	Last Post by SEQadmin2 06-02-2026, 12:03 PM
DNA Methylation Study Reveals How Epigenetic Changes Pass Between Generations by SEQadmin2 Started by SEQadmin2, 06-02-2026, 11:40 AM	0 responses 21 views 0 reactions	Last Post by SEQadmin2 06-02-2026, 11:40 AM

Unconfigured Ad

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News