Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #61
    Getting Better with Variables

    Now that I have the lab (www.keatslab.org) moving forward to some extent I'm back to working on my analysis methods. For a couple of reasons I've been searching for ways to improve the previous methods so that even less user input is required and thus saving me time running things myself or worrying about other users making mistakes. So here is what seems like a good version that relies on very limited user input. It takes advantage of a common parameters file, and a series of basic user queries as the script launches to setup the tophat and cufflinks run. It also uses bwa to find the insert size and standard deviation of the library.

    Requirements:
    1) Create_Directory_Structure_V4 directory structure, but the pipeline is so flexible this is not really a requirement just simplifies effort.
    2) bwa transcriptome index file for genome
    3) bowtie index file for genome
    4) Annotation files for Picard CollectRnaSeqMetrics.jar to function properly

    Options:
    1) It accepts compressed gzip or bzip2 files or standard read files with any extension
    2) It creates a summary file at the end of each run and you have the option to output a list of genes and associated FPKM values to this summary file. So if you are looking at an siRNA experiment you could drop the gene of interest values for your samples or a list of random genes you would like a quick look at.

    General Outline:
    1) Query User to define setup, which sets the naming convention for each sample, the tophat run format (two options), the cufflinks run format (six options), the gene list added to the summary file
    2) Counts the number of reads in each input file (used to calculate % aligned)
    3) Cuts out the first 1 million reads for bwa alignment against a transcriptome reference to define the insert size and standard deviation
    4) Runs tophat
    5) Runs cufflinks
    6) Runs a series of picard metrics programs
    7) Summarizes metrics and gene values

    create_ngs_directorystructure_v4.sh

    Code:
    #!/bin/sh
    
    # Create_NGS_DirectoryStructure_V4.sh
    # 
    #
    # Created by Jonathan Keats on 09/25/11
    # Translational Genomics Research Institute
    
    #########################################################################
    #  CREATES A DIRECTORY STURCTURE TO SUPPORT A VARIETY OF NGS PIPELINES  #
    #########################################################################
    
    # Designed for a Mac OS enviroment and requires initiation from your home folder (/User/You/)
    
    # Check to confirm current location is $HOME/ (ie. /User/You/)
    
    echo "Confirming Script Initiation Directory"
    var1=$HOME
    if [ "`pwd`" != "$var1" ] 
    	then 
    	echo " The script must be launched from your home directory "
    	echo " The script was automatically killed due to a launch error - See Above Error Message" 
    	exit 2                              
    fi
    echo "1) Launch Location is Correct ($HOME/)"
    
    # Create required directories to support pipelines (RNAseq_MensFormal_)
    
    echo ***Creating Pipeline Directory Structure***
    mkdir -p ngs/{applications,bwa,run_parameters,run_parameters,scripts,temp,tophat,tophat_fusion}
    mkdir -p ngs/analyzed_read_files/{chipseq,exomes,genomes,matepair,rnaseq}
    mkdir -p ngs/finaloutputs/{chipseq,exomes,genomes,matepair,rnaseq}
    mkdir -p ngs/refgenomes/{bfast_indexed,bowtie_indexed,bwa_indexed,downloads}
    mkdir -p ngs/refgenomes/downloads/{ncbi36_hg18,grch37_hg19}
    mkdir -p ngs/refgenomes/downloads/ncbi36_hg18/{annotation_files,reference_sequences}
    mkdir -p ngs/refgenomes/downloads/grch37_hg19/{annotation_files,reference_sequences}
    mv create_ngs_directorystructure_v4.sh ngs/scripts/
    echo ***Pipeline Directory Structure Created***
    RNAseq_MensFormal_Mac.sh
    Code:
    #!/bin/sh
    
    # RNAseq_MensFormal_Mac.sh
    # 
    #
    # Created by Jonathan Keats on 9/21/11.
    # Copyright 2011 Translational Genomics Research Institute. All rights reserved.
    
    # Outline - Automated pipeline requiring limited input for multiple RNAseq samples using tophat and cufflinks
    
    # Usage = RNAseq_MensFormal_Mac.sh <rnaseq_common_parameters.ini>
    
    ######################################################################################################
    ##																									##
    ##	Dependencies - Unix (mv, cd, find, basename, mkdir, echo, wc, head, tail, cut, awk, sed)		##
    ##				 - BWA (version with -I option)	(Tested with v0.5.9)								##
    ##				 - SAMTOOLS (Tested with v0.1.12a)													##
    ##				 - PICARD Tools (Tested with v1.52)													##
    ##					- CollectInsertSizeMetrics.jar (WATCH VERSION AS COLUMN OUTPUT DIFFERS)			##
    ##					- CollectRNAseqMetrics.jar														##
    ##					- MarkDuplicates.jar															##
    ##					- CollectAlignmentSummaryMetrics.jar											##
    ##					- MeanQualityByCycle.jar														##
    ##				 - TOPHAT (Tested with v1.3.2)														##
    ##				 - CUFFLINKS (Tested with v1.1.0)													##
    ##																									##
    ##	Outline - Place uniformly named read file pairs in TOPHAT folder								##
    ##					- This method is dependent on XXXX folder structure								##
    ##			- Update "tophat_run_parameters.ini" as appropriate to your samples						##
    ##			- Launch script which queries the user to define methods used and renaming convention   ##
    ##			  then does the following steps automatically:											##
    ##				- Create custom directory for each sample											##
    ##				- Create custom .ini file for each sample											##
    ##				- Cut out first 1 million reads from each read file									##
    ##				- Align to transcriptome reference using BWA										##
    ##				- Calculate inner-mate-distance and standard deviation								##
    ##					** inner-mate-distance = (insert size) - (combined read length)					##
    ##				- Run TOPHAT																		##
    ##				- Run CUFFLINKS																		##
    ##				- Calculate metrics 																##
    ##																									##
    ##   Notes	- Extensive use of common variables is used to facilitate automation					##
    ##			- There is some naming convention requirements but they are flexible					##
    ##					- Paired samples must follow the same convention:								##
    ##							--> SampleA_1.fastq  and  SampleA_2.fastq								##
    ##												OR													##
    ##							--> SampleA_Time1_1.txt and SampleA_Time1_2.txt							##
    ##					- You can mix different basenames but not common extensions						##
    ##							--> SampleB_1.fa / SampleB_2.fa											##
    ##							--> SampleC_Time1_1.fa / SampleC_Time1_2.fa								##
    ##		** This is fine: Common Extension = .fa ; Read Extension = _1.fa and _2.fa **				##
    ##							--> SampleB_1.fastq / SampleB_2.fastq									##
    ##							--> SampleC_1.fa / SampleC_2.fa											##
    ##							--> SampleD_R1.fa / SampleD_R2.fa										##
    ##      ** These can not be mixed, no "Common Extension" or common "Read Extension" **				##
    ##																									##
    ######################################################################################################
    
    # Read in the tophat_common_parameters file
    . $1
    
    #Starting Directory=ngs/tophat/
    echo
    echo "*** STARTING SCRIPT ***"
    date '+%m/%d/%y %H:%M:%S'
    echo
    
    ##############################################################
    ##  Check that the script was launched from ngs/tophat dir  ##
    ##############################################################
    
    echo Checking Launch Location
    date '+%m/%d/%y %H:%M:%S'
    echo
    
    # Checking if launch location is correct
    
    if [ "`pwd`" != "${LAUNCH_DIR}" ] 
    	then 
    	echo " The script must be launched from the ${LAUNCH_DIR} directory "
    	echo " The script was automatically killed due to a launch error - See Above Error Message" 
    	exit 2                              
    fi
    echo "1) Launch Location is Correct ($HOME/ngs/tophat)"
    
    # Checking if launch directory contains any .ini files (No such files are expected)
    
    if [ `ls *.ini | wc -l` != 0 ]       
    	then 
    	echo " The tophat directory contains unexpected .ini files - These must be removed before the script starts "
    	echo " ERROR - The script was automatically killed due to a launch error - See Above Error Message"
    	exit 2                     
    fi
    echo "2) No .ini Files Found in Launch Directory as Expected"
    echo
    
    echo Directory Check Completed
    date '+%m/%d/%y %H:%M:%S'
    echo
    
    ##############################################################
    ##  Query User for tophat and cufflinks configurations      ##
    ##############################################################
    
    echo Starting User Configuration Query Step
    date '+%m/%d/%y %H:%M:%S'
    echo
    
    # Ask user if the input read files are compressed or not
    echo "Are the input read files compressed? (ie. gzip or bzip2) [Yes/No]"
    read COMPRESSION_QUERY
    
    # Test to see if the "COMPRESSION QUERY" Query Response was appropriate
    ## NOTE - In the IF statement use && as AND and || as OR in new shells (old OR is -o)
    if [[ ${COMPRESSION_QUERY} = "Yes" || ${COMPRESSION_QUERY} = "No" ]]
    	then
    	echo 
    	else
    	echo
    	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
    	echo " The script was automatically killed due to a launch error - See Above Error Message" 
    	exit 2                              
    fi
    
    # Ask user for the compression type if they indicated that the read files are compressed
    if [ ${COMPRESSION_QUERY} = "Yes" ]
    	then
    	echo Please indicate the type of compression. [gzip/bzip2] 
    	read COMPRESSION_TYPE
    	if [[ ${COMPRESSION_TYPE} = "gzip" || ${COMPRESSION_QUERY} = "bzip2" ]]
    		then
    		echo 
    		else
    		echo
    		echo " Incorrect Response Option, Please use gzip or bzip2 (Case-Sensitive) "
    		echo " The script was automatically killed due to a launch error - See Above Error Message" 
    		exit 2                              
    	fi
    fi
    ######################
    
    echo Would you like to run tophat with a gtf reference file? [Yes/No]
    read TOPHAT_QUERY_GTF
    
    # Test to see if the Query Response to the use of a GTF file during tophat alignment was correct or not
    ## NOTE - In the IF statement use && as AND and || as OR in new shells (old OR is -o)
    if [[ ${TOPHAT_QUERY_GTF} = "Yes" || ${TOPHAT_QUERY_GTF} = "No" ]]
    	then
    	echo 
    	else
    	echo
    	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
    	echo " The script was automatically killed due to a launch error - See Above Error Message" 
    	exit 2                              
    fi
    
    # Ask user for a short GTF Identifier for renaming purposes
    if [ ${TOPHAT_QUERY_GTF} = "Yes" ]
    	then
    	echo "Please provide a short descriptor for the GTF file used during tophat alignment"
    	echo "Such as ensembl v60 (E60), this is not standardized but for file sorting purposes you should be consistent between runs"
    	echo Please Enter Response Now:
    	read TOPHAT_GTF_SHORT
    	echo                              
    fi
    
    echo   Notes regarding the following questions pertain to how cufflinks estimates abundances
    echo -----------------------------------------------------------------------------------------
    echo
    echo A GTF file can be used in two mutually exclusive fashions -G/--GTF or -g/--GTF-guide
    echo
    echo "1) To limit abundance estimates to just the isoforms in the GTF file use -G/--GTF"
    echo "*** In this script this is called LIMIT ABUNDANCE ESTIMATION to GTF reference ***"
    echo NOTE: This limits the estimates to a consistent set of known isoforms
    echo 
    echo "2) To guide BUT NOT limit the abundance estimates to the GTF file use -g/--GTF-guide"
    echo "*** In this script this is called GUIDE ABUNDANCE ESTIMATION to GTF reference ***"
    echo NOTE: This allows identification of known and novel isoforms
    echo -----------------------------------------------------------------------------------------
    echo
    
    echo Would you like to LIMIT ABUNDANCE ESTIMATION with cufflinks to a gtf reference file? [Yes/No]
    read CUFFLINKS_QUERY_LIMIT_GTF
    
    # Test to see if the Query Response to the use of a LIMIT ABUNDANCE ESTIMATION GTF file during cufflinks isoform assembly was correct or not
    if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" || ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" ]]
    	then
    	echo 
    	else
    	echo
    	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
    	echo " The script was automatically killed due to a launch error - See Above Error Message" 
    	exit 2                              
    fi
    
    echo Would you like to GUIDE ABUNDANCE ESTIMATION with cufflinks to a gtf reference file? [Yes/No]
    read CUFFLINKS_QUERY_GUIDE_GTF
    
    # Test to see if the Query Response to the use of a GUIDE ABUNDANCE ESTIMATION GTF file during cufflinks isoform assembly was correct or not
    if [[ ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" || ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" ]]
    	then
    	echo 
    	else
    	echo
    	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
    	echo " The script was automatically killed due to a launch error - See Above Error Message" 
    	exit 2                              
    fi
    
    # Test to ensure the user did not say "Yes" to the use both the LIMIT and GUIDE GTF options
    if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" ]]
    	then
    	echo " Incorrect Response Combination, You Can Not Say Yes to both the LIMIT and GUIDE ABUNDANCE ESTIMATION Options!! "
    	echo " The script was automatically killed due to a launch error - See Above Error Message" 
    	exit 2                              
    fi
    
    # Ask user for a short GTF Identifier for renaming purposes
    if [ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" ]
    	then
    	echo "Please provide a short descriptor for the GTF file used to LIMIT ABUNDANCE ESTIMATION"
    	echo "Such as ensembl v60 (E60), this is not standardized but for file sorting purposes you should be consistent between runs"
    	echo Please Enter Response Now:
    	read LIMIT_GTF_SHORT
    	echo                              
    fi
    
    if [ ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" ]
    	then
    	echo "Please provide a short descriptor for the GTF file used to GUIDE ABUNDANCE ESTIMATION"
    	echo "Such as ensembl v60 (E60), this is not standardized but for file sorting purposes you should be consistent between runs"
    	echo Please Enter Response Now:
    	read GUIDE_GTF_SHORT
    	echo                              
    fi
    
    echo Would you like to run cufflinks with an abundance estimate mask file? [Yes/No]
    read CUFFLINKS_QUERY_MASK
    
    # Test to see if the Query Response to the use of a mask file during cufflinks isoform assembly was correct or not
    if [[ ${CUFFLINKS_QUERY_MASK} = "Yes" || ${CUFFLINKS_QUERY_MASK} = "No" ]]
    	then
    	echo 
    	else
    	echo
    	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
    	echo " The script was automatically killed due to a launch error - See Above Error Message" 
    	exit 2                              
    fi
    
    # Ask user for a short mask identifier for renaming purposes
    if [ ${CUFFLINKS_QUERY_MASK} = "Yes" ]
    	then
    	echo "Please provide a short descriptor for the mask file used to prevent certain transscripts from being used during abundance estimation"
    	echo "Such as MT for mitochorndia, IG for immunoglobulin, this is not standardized but for file sorting purposes you should be consistent between runs"
    	echo Please Enter Response Now:
    	read MASK_SHORT
    	echo                              
    fi
    
    # Ask the user if they want to extract a list of Genes and associated FPKM vales from the cufflinks output into the run report
    echo Would you like to extract a gene list, defined by GENES_OF_INTEREST in .ini file into the run report? [Yes/No]
    read GENESOFINTEREST_QUERY
    
    # Test to see if the Query Response to the use of a mask file during cufflinks isoform assembly was correct or not
    if [[ ${GENESOFINTEREST_QUERY} = "Yes" || ${GENESOFINTEREST_QUERY} = "No" ]]
    	then
    	echo 
    	else
    	echo
    	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
    	echo " The script was automatically killed due to a launch error - See Above Error Message" 
    	exit 2                              
    fi
    
    #Final User Query to ensure the variables in the Tophat_Run_Parameters.ini file match the users responses
    
    echo FINAL CHECK: Are you sure the Tophat_Run_Parameters.ini file is updated to match the user responses you provided? [Yes/No]
    read FINAL_CHECK
    
    # Test to see if the Query Response was correct or not
    if [[ ${FINAL_CHECK} = "Yes" || ${FINAL_CHECK} = "No" ]]
    	then
    	echo 
    	else
    	echo
    	echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
    	echo " The script was automatically killed due to a launch error - See Above Error Message" 
    	exit 2                              
    fi
    
    # Test to see if the user wants to stop the script (ie. they entered "No" as a response to the FINAL CHECK
    
    if [ ${FINAL_CHECK} = "Yes" ]
    	then
    	echo "Thanks... Your User Configured Run Will Start in a Moment"
    	echo
    	else
    	echo
    	echo You Enter No to the query... The Script Will Stop Now
    	exit 2                              
    fi
    
    echo Finished User Configuration Query Step
    date '+%m/%d/%y %H:%M:%S'
    echo
    
    ##############################################################
    ##  Define the renaming tags based on the users responses	##
    ##############################################################
    
    echo Defining Renaming Convention Based on User Configuration Inputs
    date '+%m/%d/%y %H:%M:%S'
    echo
    
    # TOPHAT_TAG options are TophatPE_DeNovo or TophatPE_GTF
    if [ ${TOPHAT_QUERY_GTF} = "Yes" ]
    	then
    	TOPHAT_TAG="TophatPE_${TOPHAT_GTF_SHORT}GTF" 
    	else
    	TOPHAT_TAG="TophatPE_DeNovo"                              
    fi
    echo The tophat tag has been set to ${TOPHAT_TAG}
    
    # CUFFLINKS_TAG options are Cufflinks_DeNovo_NoMask, Cufflinks_DeNovo_Masked, Cufflinks_Limited_NoMask, Cufflinks_Limited_Masked, Cufflinks_Guided_NoMask, Cufflinks_Guided_Masked
    
    if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
    	then
    	CUFFLINKS_TAG="Cufflinks_DeNovo_NoMask"                             
    fi
    
    if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
    	then
    	CUFFLINKS_TAG="Cufflinks_DeNovo_${MASK_SHORT}Mask"                             
    fi
    
    if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
    	then
    	CUFFLINKS_TAG="Cufflinks_${LIMIT_GTF_SHORT}Limited_NoMask"                             
    fi
    
    if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
    	then
    	CUFFLINKS_TAG="Cufflinks_${LIMIT_GTF_SHORT}Limited_${MASK_SHORT}Mask"                             
    fi
    
    if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
    	then
    	CUFFLINKS_TAG="Cufflinks_${GUIDE_GTF_SHORT}Guided_NoMask"                             
    fi
    
    if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
    	then
    	CUFFLINKS_TAG="Cufflinks_${GUIDE_GTF_SHORT}Guided_${MASK_SHORT}Mask"                             
    fi
    echo The cufflinks tag has been set to ${CUFFLINKS_TAG}
    echo
    
    ##############################################################
    ##  Create a custom analysis directory for each sample      ##
    ##############################################################
    
    ## Lots of tricks in this step, find function, basename, learning how \, ", and ' quotes differ allowing you to write special characters and use variables
    
    echo Creating Custom Directories and Splitting Files
    date '+%m/%d/%y %H:%M:%S'
    
    for file in `find . -type f -iname "*${READ1_ID}"`
    do
    filename=`basename $file "${READ1_ID}"`
    mkdir $filename
    mv $filename${READ1_ID} $filename
    mv $filename${READ2_ID} $filename
    echo "SAMPLE_NAME=\"$filename\"" >> $filename.ini
    done
    
    echo Directory Constuction and File Splitting Complete
    date '+%m/%d/%y %H:%M:%S'
    echo
    
    ##############################################################
    ##  Analysis Pipeline - BWA, TOPHAT, CUFFLINKS, QC metrics  ##
    ##############################################################
    
    # Initiate analysis of each identified sample one at a time
    
    echo Starting Individual Sample Analysis
    date '+%m/%d/%y %H:%M:%S'
    echo
    
    for ini in `ls *.ini`
    do
    #Read in the .ini file created during the directory construction period
    . $ini
    mv $ini ${SAMPLE_NAME}
    cd ${SAMPLE_NAME}
    
    #Current Directory=ngs/tophat/SampleX/tophat_out
    echo Analyzing ${SAMPLE_NAME} | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    
    #Extract and Count reads in input files, 3 parse version depending on compression and type of compression
    if [ ${COMPRESSION_QUERY} = "No" ]
    	then
    	#Calculate the number of reads present in each file
    	echo Starting to count the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
    	date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
    	echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    	COUNT_READ1=`wc -l ${SAMPLE_NAME}${READ1_ID} | awk '{print $1/4}'`
    	echo The forward or read1 fastq contains ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    	COUNT_READ2=`wc -l ${SAMPLE_NAME}${READ2_ID} | awk '{print $1/4}'`
    	echo The reverse or read2 fastq contains ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    	echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    	echo Finished  identifying the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    	date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    	echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    
    	#Cut out first 1 million reads from forward and reverse to estimate inner mate distance and STD
    	echo Creating Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    	date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    	head -n 4000000 ${SAMPLE_NAME}${READ1_ID} > ${SAMPLE_NAME}${READ1_BWA_FASTQ}
    	echo Finished Creating Read1 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    	date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    	head -n 4000000 ${SAMPLE_NAME}${READ2_ID} > ${SAMPLE_NAME}${READ2_BWA_FASTQ}
    	echo Finished Creating Read2 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    	date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    	echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
    	else
    	if [ ${COMPRESSION_TYPE} = "gzip" ]
    		then
    		echo Starting to count the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
    		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
    		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		COUNT_READ1=`gunzip -c ${SAMPLE_NAME}${READ1_ID} | wc -l | awk '{print $1/4}'`
    		echo The forward or read1 fastq contains ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		COUNT_READ2=`gunzip -c ${SAMPLE_NAME}${READ2_ID} | wc -l | awk '{print $1/4}'`
    		echo The reverse or read2 fastq contains ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		echo Finished  identifying the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    
    		#Cut out first 1 million reads from forward and reverse to estimate inner mate distance and STD
    		echo Creating Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		gunzip -c ${SAMPLE_NAME}${READ1_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ1_BWA_FASTQ}
    		echo Finished Creating Read1 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		gunzip -c ${SAMPLE_NAME}${READ2_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ2_BWA_FASTQ}
    		echo Finished Creating Read2 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    	fi
    	if [ ${COMPRESSION_TYPE} = "bzip2" ]
    		then
    		echo Starting to count the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
    		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log 
    		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		COUNT_READ1=`bunzip2 -k ${SAMPLE_NAME}${READ1_ID} | wc -l | awk '{print $1/4}'`
    		echo The forward or read1 fastq contains ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		COUNT_READ2=`bunzip2 -k ${SAMPLE_NAME}${READ2_ID} | wc -l | awk '{print $1/4}'`
    		echo The reverse or read2 fastq contains ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		echo Finished  identifying the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    
    		#Cut out first 1 million reads from forward and reverse to estimate inner mate distance and STD
    		echo Creating Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		bunzip2 -k ${SAMPLE_NAME}${READ1_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ1_BWA_FASTQ}
    		echo Finished Creating Read1 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		bunzip2 -k ${SAMPLE_NAME}${READ2_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ2_BWA_FASTQ}
    		echo Finished Creating Read2 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    		echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    	fi
    fi
    
    #Align subset reads to transcriptome reference using BWA aln
    echo Starting Alignment of Read1 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    bwa aln -t ${CPU_CORES} -I ${BWA_GENOME_FILE} ${SAMPLE_NAME}${READ1_BWA_FASTQ} > ${SAMPLE_NAME}${READ1_BWA_SAI}
    echo Finished Aligning Read1 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    echo Starting Alignment of Read2 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    bwa aln -t ${CPU_CORES} -I ${BWA_GENOME_FILE} ${SAMPLE_NAME}${READ2_BWA_FASTQ} > ${SAMPLE_NAME}${READ2_BWA_SAI}
    echo Finished Aligning Read2 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    
    #Generate .sam file using BWA sampe
    echo Starting BWA SAMPE Assembly of Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    bwa sampe ${BWA_GENOME_FILE} ${SAMPLE_NAME}${READ1_BWA_SAI} ${SAMPLE_NAME}${READ2_BWA_SAI} ${SAMPLE_NAME}${READ1_BWA_FASTQ} ${SAMPLE_NAME}${READ2_BWA_FASTQ} > ${SAMPLE_NAME}${BWA_SAMPE}.sam
    echo Finished BWA SAMPE Assembly Phase | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    
    #Convert BWA sam file to bam file using SAMTOOLS
    echo Starting SAM to BAM conversion with Samtools | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    samtools view -bS -o ${SAMPLE_NAME}${BWA_SAMPE}.bam ${SAMPLE_NAME}${BWA_SAMPE}.sam
    echo Finished SAM to BAM conversion wth Samtools | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    
    #Sort BWA bam file using SAMTOOLS
    echo Starting BAM file sorting step | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    samtools sort ${SAMPLE_NAME}${BWA_SAMPE}.bam ${SAMPLE_NAME}${BWA_SAMPE}${SAMTOOLS_SORT_ID}
    echo Finished sorting BAM file | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    
    #Determine insert metrics using PICARD
    echo Starting Picard CollectInsertSizeMetrics step | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    java -Xmx2g -jar $HOME/local/bin/CollectInsertSizeMetrics.jar INPUT=${SAMPLE_NAME}${BWA_SAMPE}${SAMTOOLS_SORT_ID}.bam OUTPUT=${SAMPLE_NAME}_Picard_InsertMetrics.txt HISTOGRAM_FILE=${SAMPLE_NAME}_Picard_InsertMetrics.pdf VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY}
    echo Finished collecting insert metrics step | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    
    #Calculate "inner mate distance" and standard deviation
    echo Starting to calculate insert metrics | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    
    INNER_DISTANCE=`head -n 8 ${SAMPLE_NAME}_Picard_InsertMetrics.txt | tail -n 1 | cut -f5 | awk -v var1="$COMBINED_READ_LENGTH" '{printf "%.0f\n", $1-var1}'`
    echo The mean inner-mate-distance was identified as being ${INNER_DISTANCE}bp | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    
    STDEV=`head -n 8 ${SAMPLE_NAME}_Picard_InsertMetrics.txt | tail -n 1 | cut -f6 | awk '{printf "%.0f\n", $1}'`
    echo The standard deviation of the inner-mate-distance was identified as being ${STDEV}bp | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    
    echo Finished calculating insert metrics | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    
    #Run tophat analysis
    echo Starting Tophat Alignment of ${SAMPLE_NAME} | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    
    if [ ${TOPHAT_QUERY_GTF} = "Yes" ]       
    	then 
    	TOPHAT_COMMAND="tophat --num-threads ${CPU_CORES} -r ${INNER_DISTANCE} --mate-std-dev ${STDEV} --max-insertion-length ${MAX_INSERT_SIZE} --max-deletion-length ${MAX_DELETION_SIZE} ${READ_QUALITY} --library-type ${LIBRARY_FORMAT} -G ${GTF_FILE} ${BOWTIE_INDEX} ${SAMPLE_NAME}${READ1_ID} ${SAMPLE_NAME}${READ2_ID}"
    fi
    
    if [ ${TOPHAT_QUERY_GTF} = "No" ]       
    	then 
    	TOPHAT_COMMAND="tophat --num-threads ${CPU_CORES} -r ${INNER_DISTANCE} --mate-std-dev ${STDEV} --max-insertion-length ${MAX_INSERT_SIZE} --max-deletion-length ${MAX_DELETION_SIZE} ${READ_QUALITY} --library-type ${LIBRARY_FORMAT} ${BOWTIE_INDEX} ${SAMPLE_NAME}${READ1_ID} ${SAMPLE_NAME}${READ2_ID}"
    fi
    
    echo ${TOPHAT_COMMAND} | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    ${TOPHAT_COMMAND}
    
    ## This creates a subfolder called "tophat_out/"
    ## This folder will contain:	accepted_hits.bam (is it sorted?)
    ##								deletions.bed
    ##								insertions.bed
    ##								junctions.bed
    
    cd tophat_out/
    #Current Directory=ngs/tophat/SampleX/tophat_out
    
    mv accepted_hits.bam ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam
    mv deletions.bed ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_deletions.bed
    mv insertions.bed ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_insertions.bed
    mv junctions.bed ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_junctions.bed
    
    echo Finished Tophat Alignment | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    
    #Run cufflinks analysis
    echo Starting Cufflinks Abundance Estimation of ${SAMPLE_NAME} | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    
    if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
    	then
    	CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"                             
    fi
    
    if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
    	then
    	CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} -M ${MASK_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"                             
    fi
    
    if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
    	then
    	CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF ${GTF_FILE} --compatible-hits-norm --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"                             
    fi
    
    if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
    	then
    	CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF ${GTF_FILE} --compatible-hits-norm -M ${MASK_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"                             
    fi
    
    if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
    	then
    	CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF-guide ${GTF_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"                             
    fi
    
    if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
    	then
    	CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF-guide ${GTF_FILE} -M ${MASK_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"                             
    fi
    
    echo ${CUFFLINKS_COMMAND} | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    ${CUFFLINKS_COMMAND}
    
    ## This creates a subfolder called "cufflinks_out/" in the sample directory
    ## This folder will contain:	genes.expr
    ##								transcripts.expr
    ##								transcripts.gtf
    
    #In the next step we will rename the expression estimate files
    
    cd ../cufflinks_out/
    #Current Directory=ngs/tophat/SampleX/cufflinks_out
    
    mv genes.fpkm_tracking ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_genes.expr
    mv isoforms.fpkm_tracking ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_transcripts.expr
    mv transcripts.gtf ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_transcripts.gtf
    
    echo Finished Cufflinks Abundance Estimation | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    
    ##############################################################
    ##  Calculate metrics using Samtools and Picard				##
    ##############################################################
    
    cd ../tophat_out/
    #Current Directory=ngs/tophat/SampleX/tophat_out
    
    #Determine the RNA Seq Metrics using Picard "CollectRNASeqMetrics.jar"
    echo Starting Picard CollectRnaSeqMetrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    java -Xmx2g -jar $HOME/local/bin/CollectRnaSeqMetrics.jar REF_FLAT=${REFFLAT_FILE} RIBOSOMAL_INTERVALS=${RIBOSOME_LIST} STRAND_SPECIFICITY=NONE REFERENCE_SEQUENCE=${REFERENCE_FILE} INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt CHART_OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.pdf VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY}
    echo Finished collecting RNAseq metrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    
    #Mark the duplicate reads using Picard "MarkDuplicates.jar"
    echo Starting Picard MarkDuplicates step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    ## NOTE: "-Dsnappy.disable=true" is passed to this command to prevent the use of snappy-java by MarkDuplicates as I can not figure out how to add it to the JAVA CLASSPATH 
    java -Dsnappy.disable=true -Xmx2g -jar $HOME/local/bin/MarkDuplicates.jar INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam METRICS_FILE=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_MarkDuplicates_Metrics.txt REMOVE_DUPLICATES=false ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY}
    echo Finished MarkDuplicates step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    
    #Create an index file for the bam file generated from the MarkDuplicates step using SAMTOOLS INDEX
    echo Starting bam file indexing step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    samtools index ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam
    echo Finished bam file indexing step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    
    #Calculate mapping statistics using SAMTOOLS FlAGSTAT
    echo Starting Samtools Flagstat step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    samtools flagstat ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam > ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_SamtoolsMappingStats.txt
    echo Finished Samtools Flagstat step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    
    #Calculate mapping statistics using Picard "CollectAlignmentSummaryMetrics.jar"
    echo Starting Picard CollectAlignmentSummaryMetrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    java -Xmx2g -jar $HOME/local/bin/CollectAlignmentSummaryMetrics.jar INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt VALIDATION_STRINGENCY=SILENT REFERENCE_SEQUENCE=${REFERENCE_FILE} ASSUME_SORTED=true IS_BISULFITE_SEQUENCED=false TMP_DIR=${TEMP_DIRECTORY}
    echo Finished Picard CollectAlignmentSummaryMetrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    
    #Calculate read quality stats using Picard "MeanQualityByCycle.jar"
    echo Starting Picard MeanQualityByCycle step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    java -Xmx2g -jar $HOME/local/bin/MeanQualityByCycle.jar INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardQualByCycle.txt CHART_OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardQualByCycle.pdf VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY}
    echo Finished Picard MeanQualityByCycle step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    
    ##############################################################
    ##  Write out summary file (Stats/Metrics/Indexes?)			##
    ##############################################################
    
    #Create a summary text file
    echo Starting Creation of Summary File | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    
    echo "*** Analysis Run Report ***" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo Sample ${SAMPLE_NAME} was Analyzed | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    
    # Test to see if the user wants a to extract FPKM values for a list of genes into the run report
    if [ ${GENESOFINTEREST_QUERY} = "Yes" ]
    	then
    	echo FPKM Values for Common genes of Interests: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    	grep -wf ${GENES_OF_INTEREST} ../cufflinks_out/${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_genes.expr | awk '{print $1, $10}' | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    	echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    fi
    
    echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    
    echo Sample Summary: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "The two read files processed were ${SAMPLE_NAME}${READ1_ID} and ${SAMPLE_NAME}${READ2_ID}" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${SAMPLE_NAME}${READ1_ID} contained ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${SAMPLE_NAME}${READ2_ID} contained ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo After aligning the first one million reads to a transcriptome reference using BWA | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo the inner-mate-distance was determined to be ${INNER_DISTANCE}bp with a standard deviation of ${STDEV} | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "The alignment metrics for this sample were as follows:" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    
    #Extract Alignment Metrics
    UNIQUE_READ1_READS_ALIGNED=`cut -f3 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | head -n 8 | tail -n 1`
    UNIQUE_READ2_READS_ALIGNED=`cut -f3 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | head -n 9 | tail -n 1`
    TOTAL_READ1_ALIGNMENTS=`awk '{print $1}' ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_SamtoolsMappingStats.txt | head -n 5 | tail -n 1`
    TOTAL_READ2_ALIGNMENTS=`awk '{print $1}' ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_SamtoolsMappingStats.txt | head -n 6 | tail -n 1`
    PERCENT_READ1_ALIGNED=`echo ${COUNT_READ1} | awk -v var2="${UNIQUE_READ1_READS_ALIGNED}" '{print var2/$1}'`
    PERCENT_READ2_ALIGNED=`echo ${COUNT_READ2} | awk -v var3="${UNIQUE_READ2_READS_ALIGNED}" '{print var3/$1}'`
    PERCENT_READ1_Q20_BASES=`head -n 8 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | tail -n 1 | awk '{print $11/$8}'`
    PERCENT_READ2_Q20_BASES=`head -n 9 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | tail -n 1 | awk '{print $11/$8}'`
    PERCENT_PAIRED_READ_ALIGNMENT=`cut -f17 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | head -n 8 | tail -n 1 | awk -v var4="${COUNT_READ1}" '{print $1/var4}'`
    
    echo Alignment Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${UNIQUE_READ1_READS_ALIGNED} - Number of Unique Reads from Read1 Aligned | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${UNIQUE_READ2_READS_ALIGNED} - Number of Unique Reads from Read2 Aligned | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${TOTAL_READ1_ALIGNMENTS} - Total Number of Alignment Events from the Unique Reads in Read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${TOTAL_READ2_ALIGNMENTS} - Total Number of Alignment Events from the Unique Reads in Read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${PERCENT_READ1_ALIGNED} - Percentage of Unique Read1 Reads Aligned to at least one Position | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${PERCENT_READ2_ALIGNED} - Percentage of Unique Read2 Reads Aligned to at least one Position | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${PERCENT_READ1_Q20_BASES} - Percentage of Aligned Bases from Read1, which are greater than Q20 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${PERCENT_READ2_Q20_BASES} - Percentage of Aligned Bases from Read2,  which are greater than Q20 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${PERCENT_PAIRED_READ_ALIGNMENT} - Percentage of Reads Aligned as Pairs | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    
    #Extract RNAseq Metrics
    PERCENT_RIBOSOME_BASES=`cut -f11 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
    PERCENT_CODING_BASES=`cut -f12 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
    PERCENT_UTR_BASES=`cut -f13 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
    PERCENT_INTRONIC_BASES=`cut -f14 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
    PERCENT_INTERGENIC_BASES=`cut -f15 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
    PERCENT_MRNA_BASES=`cut -f16 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
    MEDIAN_5PRIME_BIAS=`cut -f20 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
    MEDIAN_3PRIME_BIAS=`cut -f21 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
    MEDIAN_5PRIME_3PRIME_BIAS_RATIO=`cut -f22 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
    
    echo Transcriptome Sequencing Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${PERCENT_RIBOSOME_BASES} - Percent Ribosomal Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${PERCENT_CODING_BASES} - Percent Coding Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${PERCENT_UTR_BASES} - Percent UTR Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${PERCENT_INTRONIC_BASES} - Percent Intronic Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${PERCENT_INTERGENIC_BASES} - Percent Intergenic Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${PERCENT_MRNA_BASES} - Percent mRNA Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${MEDIAN_5PRIME_BIAS} - Median 5 Prime Coverage Bias | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${MEDIAN_3PRIME_BIAS} - Median 3 Prime Coverage Bias | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${MEDIAN_5PRIME_3PRIME_BIAS_RATIO} - Median 5/3 Coverage Ratio | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    
    #Extract Duplication Metrics
    PERCENT_DUPLICATION=`head -n 8 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_MarkDuplicates_Metrics.txt | tail -n 1 | cut -f8`
    LIBRARY_SIZE=`head -n 8 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_MarkDuplicates_Metrics.txt | tail -n 1 | cut -f9`
    
    echo Duplication Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${LIBRARY_SIZE} - Estimated fragments in Library | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${PERCENT_DUPLICATION} - Percent Duplicate Reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    
    #Calculate Quality Metrics (first/last cycle and mean per read)
    QUALITY_CAPTURE_LINES=`echo ${COMBINED_READ_LENGTH} | awk '{print $1+8}'`
    QUALITY_CAPTURE_READS=`echo ${COMBINED_READ_LENGTH} | awk '{print $1/2}'`
    head -n ${QUALITY_CAPTURE_LINES} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardQualByCycle.txt | tail -n ${COMBINED_READ_LENGTH} > run_qualities.txt
    head -n ${QUALITY_CAPTURE_READS} run_qualities.txt > read1_qualities.txt
    tail -n ${QUALITY_CAPTURE_READS} run_qualities.txt > read2_qualities.txt
    QUALITY_READ1_FIRST_CYCLE1=`cut -f2 read1_qualities.txt | head -n 1`
    QUALITY_READ1_LAST_CYCLE1=`cut -f2 read1_qualities.txt | tail -n 1`
    QUALITY_READ1_AVERAGE=`awk '{s+=$2} END {print s/NR}' < read1_qualities.txt`
    QUALITY_READ2_FIRST_CYCLE1=`cut -f2 read2_qualities.txt | head -n 1`
    QUALITY_READ2_LAST_CYCLE1=`cut -f2 read2_qualities.txt | tail -n 1`
    QUALITY_READ2_AVERAGE=`awk '{s+=$2} END {print s/NR}' < read2_qualities.txt`
    
    echo Quality Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo This run consists of paired ${QUALITY_CAPTURE_READS}x${QUALITY_CAPTURE_READS}bp reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${QUALITY_READ1_AVERAGE} - Mean phred score for read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${QUALITY_READ1_FIRST_CYCLE1} - Mean phred score for the first cycle of read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${QUALITY_READ1_LAST_CYCLE1} - Mean phred score for the last cycle of read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${QUALITY_READ2_AVERAGE} - Mean phred score for read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${QUALITY_READ2_FIRST_CYCLE1} - Mean phred score for the first cycle of read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ${QUALITY_READ2_LAST_CYCLE1} - Mean phred score for the last cycle of read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    
    #Output complete list of variables used in the run
    echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo   THE FOLLOWING VARIABLES WERE GENERATED DURING THE ANALYSIS OF THIS SAMPLE: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    
    echo "COMPRESSION_QUERY=\"${COMPRESSION_QUERY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "COMPRESSION_TYPE=\"${COMPRESSION_TYPE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "TOPHAT_QUERY_GTF=\"${TOPHAT_QUERY_GTF}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "TOPHAT_GTF_SHORT=\"${TOPHAT_GTF_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "CUFFLINKS_QUERY_LIMIT_GTF=\"${CUFFLINKS_QUERY_LIMIT_GTF}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "CUFFLINKS_QUERY_GUIDE_GTF=\"${CUFFLINKS_QUERY_GUIDE_GTF}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "LIMIT_GTF_SHORT=\"${LIMIT_GTF_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "GUIDE_GTF_SHORT=\"${GUIDE_GTF_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "CUFFLINKS_QUERY_MASK=\"${CUFFLINKS_QUERY_MASK}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "MASK_SHORT=\"${MASK_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "GENESOFINTEREST_QUERY=\"${GENESOFINTEREST_QUERY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "FINAL_CHECK=\"${FINAL_CHECK}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "TOPHAT_TAG=\"${TOPHAT_TAG}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "CUFFLINKS_TAG=\"${CUFFLINKS_TAG}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "SAMPLE_NAME=\"${SAMPLE_NAME}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "COUNT_READ1=\"${COUNT_READ1}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "COUNT_READ2=\"${COUNT_READ2}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "INNER_DISTANCE=\"${INNER_DISTANCE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "STDEV=\"${STDEV}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "TOPHAT_COMMAND=\"${TOPHAT_COMMAND}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "CUFFLINKS_COMMAND=\"${CUFFLINKS_COMMAND}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo THE FOLLOWING VARIABLES WERE LOADED FROM THE RNAseq_MensFormalWear_Parameters.ini DURING THE ANALYSIS OF THIS SAMPLE: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo They may or may not have been used depending on your query respones
    echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "LAUNCH_DIR=\"${LAUNCH_DIR}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "READ_EXT=\"${READ_EXT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "READ1_ID=\"${READ1_ID}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "READ2_ID=\"${READ2_ID}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "COMBINED_READ_LENGTH=\"${COMBINED_READ_LENGTH}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "CPU_CORES=\"${CPU_CORES}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "GENOME_NAME=\"${GENOME_NAME}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "REFERENCE_FILE=\"${REFERENCE_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "BWA_GENOME_FILE=\"${BWA_GENOME_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "READ1_BWA_FASTQ=\"${READ1_BWA_FASTQ}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "READ2_BWA_FASTQ=\"${READ2_BWA_FASTQ}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "READ1_BWA_SAI=\"${READ1_BWA_SAI}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "READ2_BWA_SAI=\"${READ2_BWA_SAI}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "BWA_SAMPE=\"${BWA_SAMPE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "SAMTOOLS_SORT_ID=\"${SAMTOOLS_SORT_ID}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "BOWTIE_INDEX=\"${BOWTIE_INDEX}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "GTF_FILE=\"${GTF_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "READ_QUALITY=\"${READ_QUALITY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "LIBRARY_FORMAT=\"${LIBRARY_FORMAT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "MAX_INSERT_SIZE=\"${MAX_INSERT_SIZE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "MAX_DELETION_SIZE=\"${MAX_DELETION_SIZE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "MASK_FILE=\"${MASK_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "TEMP_DIRECTORY=\"${TEMP_DIRECTORY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "REFFLAT_FILE=\"${REFFLAT_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "RIBOSOME_LIST=\"${RIBOSOME_LIST}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "GENES_OF_INTEREST=\"${GENES_OF_INTEREST}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "COMPLETED_RNASEQ_READ_FILE_DIRECTORY=\"${COMPLETED_RNASEQ_READ_FILE_DIRECTORY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo "COMPLETED_RNASEQ_RESULTS_DIRECTORY=\"${COMPLETED_RNASEQ_RESULTS_DIRECTORY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    
    echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
    
    echo Finished Creating the Summary File | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    
    ##############################################################
    ##  Clean up analysis directory and Move files to Archive	##
    ##############################################################
    
    echo Starting to Clean up Analysis Directory | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
    
    #Remove temp files created during calculations steps
    rm run_qualities.txt
    rm read1_qualities.txt
    rm read2_qualities.txt
    
    #Move tophat outputs and metric files to archive directory
    mv ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
    mv ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam.bai ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
    mv *.bed ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
    mv *.pdf ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
    mv *.txt ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
    
    #Move cufflinks outputs to archive directory
    cd ../cufflinks_out
    #Current Directory=ngs/tophat/SampleX/cufflinks_out
    mv *.expr ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
    mv *.gtf ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
    
    #Move fastq, ini, and log file to archive directories
    cd ../
    #Current Directory=ngs/tophat/SampleX
    mv ${SAMPLE_NAME}${READ1_ID} ${COMPLETED_RNASEQ_READ_FILE_DIRECTORY}
    mv ${SAMPLE_NAME}${READ2_ID} ${COMPLETED_RNASEQ_READ_FILE_DIRECTORY}
    
    echo Finished Cleaning up Analysis Directory | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
    
    mv *.log ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
    
    #Move back to launch directory to start next sample in batch
    cd ../
    #Current Directory=ngs/tophat/
    
    #Remove Custom analysis directory and none required files
    rm -rf ${SAMPLE_NAME}
    
    done
    rnaseq_common_parameters.ini
    Code:
    ## RNAseq Common parameters file ##
    
    # Use this variable to indicate the expected launch directory
    # (Update as appropriate)
    LAUNCH_DIR=/Users/jkeats/ngs/tophat
    
    ##################################################################################
    ## 										##	
    ##    Common Parameters used at multiple steps or used to name output files	##
    ##				(Update as appropriate)				##
    ##										##
    ##################################################################################
    
    #sequence read file extension
    READ_EXT=".fastq.gz"
    
    #forward or read1 identifier
    READ1_ID="_1.fastq.gz"
    
    #reverse or read2 identifier
    READ2_ID="_2.fastq.gz"
    
    #What is the combined read length of the two read pairs (ie. 50x50=100, 100x100=200)
    COMBINED_READ_LENGTH="100"
    
    #How many cores/processors are available for the multi-threaded phases of BWA, TOPHAT, CUFFLINKS
    CPU_CORES="1"
    
    #This is used to rename result files to include the genome version
    GENOME_NAME=hg19
    
    #Provide the path and full file name to the genome reference file used to create the bowtie index
    REFERENCE_FILE=/Users/jkeats/ngs/refgenomes/bowtie_indexed/igenomes_hg19/genome.fa
    
    ##################################################################################
    ##										##
    ##    Common BWA parameters (Only the BWA_GENOME_FILE needs to be updated)	##
    ##										##
    ##################################################################################
    
    BWA_GENOME_FILE="/Users/jkeats/ngs/refgenomes/bwa_indexed/GRCh37_E60cDNA.fa"
    
    READ1_BWA_FASTQ="_1_bwa.fastq"
    READ2_BWA_FASTQ="_2_bwa.fastq"
    
    READ1_BWA_SAI="_1_bwa.sai"
    READ2_BWA_SAI="_2_bwa.sai"
    
    BWA_SAMPE="_bwa_sampe"
    
    ##################################################################################
    ##										##
    ##    Common Samtools parameters (Shouldn't need to be modified)		##
    ##										##	
    ##################################################################################
    
    SAMTOOLS_SORT_ID="_sorted"
    
    ##################################################################################
    ## 										##
    ##    Common Tophat Parameters (Update as appropriate)				##
    ##										##
    ##################################################################################
    
    #Bowtie index to align data against [ <filename> ]
    BOWTIE_INDEX="/Users/jkeats/ngs/refgenomes/bowtie_indexed/igenomes_hg19/genome"
    
    #GTF file to use for LIMITED ABUNDANCE ESTIMATION (-G/--GTF) or GUIDED ABUNDANCE ESTIMATION ( -g/--GTF-guide)
    GTF_FILE="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/hg19_genes.gtf"
    
    #Quality type [ --solexa-quals  OR  --solexa1.3-quals (same as phred64-quals)  OR  --phred64-quals (same as solexa1.3-quals) ]
    READ_QUALITY="--solexa1.3-quals"
    
    #Library type [ fr-unstranded  OR  fr-firststrand  OR  fr-secondstrand)
    LIBRARY_FORMAT="fr-unstranded"
    
    #Maximum insertion size allow IF allowing indels [ --max-insertion-length     <int>     default: 3 ]
    MAX_INSERT_SIZE="3"
    
    #Maximum deletion size allowed IF allowing indels [ --max-deletion-length     <int>     default: 3 ]
    MAX_DELETION_SIZE="3"
    
    ##################################################################################
    ##										##
    ##   Common Cufflinks Parameters (Update as appropriate)			##
    ##										##
    ##################################################################################
    
    #This should be used if using an exclusion file for cufflinks abundance estimates
    MASK_FILE="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/GRCh37_E60_CufflinksExcludedTranscripts.gtf"
    
    ##################################################################################
    ##										##
    ##   Common Picard Parameters (Update as appropriate)				##
    ##										##
    ##################################################################################
    
    #Provide a directory for temporary files (Not always needed but on multi-harddrive systems JAVA may use an unexpected directory)
    TEMP_DIRECTORY="/Users/jkeats/ngs/refgenomes/temp"
    
    #This is a refFlat format table outlining the transcript space of the genome analyzed
    REFFLAT_FILE="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/hg19_refFlat.txt"
    
    #This is the ribosome exclusion file, a list of ribosome gene locations to report the percentage of reads related to ribosomal transcripts
    RIBOSOME_LIST="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/hg19_ribosome_gene_locations.txt"
    
    ##################################################################################
    ##										##
    ##   Gene List of Interest (Update as appropriate)				##
    ##										##
    ##################################################################################
    
    #List of genes to extract
    GENES_OF_INTEREST="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/TC_Class_Genes.txt"
    
    ##################################################################################
    ##										##
    ##   Final Directories (Update as appropriate)					##
    ##										##
    ##################################################################################
    
    #Provide the path to the folder were you want to archive the read files
    COMPLETED_RNASEQ_READ_FILE_DIRECTORY="/Users/jkeats/ngs/analyzed_read_files/rnaseq/"
    
    #Provide the path to the folder were you want to archive the read files
    COMPLETED_RNASEQ_RESULTS_DIRECTORY="/Users/jkeats/ngs/finaloutputs/rnaseq/"
    Last edited by Jon_Keats; 11-26-2011, 10:07 PM. Reason: updated script bug

    Comment


    • #62
      I learned a lot from your post. And the website for your lab looks great.
      keep up with the good work.

      Comment


      • #63
        GATK Trial and Error

        I'm finally getting into using the GATK package. My first impression is that I find it much more intimidating that any of the Picard tools and the documentation is a bit harder to follow. Though I think that is largely the difference between a wiki style interface and a singular website page.

        Notes:
        1) Most analysis methods require indexed bam files. So get into a habit of just indexing every bam file. Typically I only indexed the final bam file for viewing in IGV
        2) Related to 1) almost every file seems to require indexes. The reference genome (genome.fasta) file needs an index (genome.fasta.fai), and a dictionary file (genome.dict)
        3) Related to 1) and 2) ....see a trend. The input vcf/rod files (file.vcf) need to be indexed (file.vcf.idx)

        The frustrating thing is I know how to make the .fai file
        Code:
        samtools faidx <myFASTQ>
        I know how to make the dictionary file using picard
        Code:
        java -jar CreateSequenceDictionary.jar REFERENCE=genome.fasta OUTPUT=genome.dict
        But I have no idea how to create the .idx index file of the vcf file. So right now I'm limited to the file provided on the GATK website, as they provide complete indexes for each.

        4) Fixing mates with Picard is no longer required after indel realignment as GATK does it on the fly now

        Much more to come I have a feeling
        Last edited by Jon_Keats; 11-16-2011, 11:21 AM. Reason: fixed error

        Comment


        • #64
          I'm glad to hear that i'm not the only one intimidated by the documentation or there lack of. I feel like a big portion of TopHat/Cufflinks users are using those because of nice, relatively easy to understand documentation.

          please keep on posting Jon, as I had given up on GATK, trying to learn python, but now i feel like i should get back into it!

          Comment


          • #65
            Thanks a lot Jon for all the scripts... Really appreciate your work... Makes my life super easy... I have lost count of how many times I have referred back to this thread...

            - Dinesh

            Comment


            • #66
              I believe line #316 in the RNAseq_MensFormal_Mac.sh should be as below. Or else no matter what user response you give (Yes/No), the tophat tag will always be set to TophatPE_GTF.
              Code:
              # TOPHAT_TAG options are TophatPE_DeNovo or TophatPE_GTF
              if [[ ${TOPHAT_QUERY_GTF} = "Yes" ]]
                  then
                  TOPHAT_TAG="TophatPE_${TOPHAT_GTF_SHORT}GTF" 
              fi
              
              if [[ ${TOPHAT_QUERY_GTF} = "No" ]]
                  then
                  TOPHAT_TAG="TophatPE_DeNovo"
              fi
              echo The tophat tag has been set to ${TOPHAT_TAG}

              Comment


              • #67
                Hi Dinesh,

                Thanks for the good catch. I'd made this update on my local version but forgot to update this thread and the version on the website. The version above is updated now and the website version should be done tonight.

                Updated as follows:
                Code:
                # TOPHAT_TAG options are TophatPE_DeNovo or TophatPE_GTF
                if [ ${TOPHAT_QUERY_GTF} = "Yes" ]
                	then
                	TOPHAT_TAG="TophatPE_${TOPHAT_GTF_SHORT}GTF" 
                	else
                	TOPHAT_TAG="TophatPE_DeNovo"                              
                fi
                echo The tophat tag has been set to ${TOPHAT_TAG}
                I got your email about the mask file. I've avoided providing one, not because I don't think people should use one but because the contents may be specific to different applications. In my case we exclude immunoglobulin regions, which is a disease specific issue, but things like ribosomal and mitochondrial genes are likely things that 99% of users will exclude so I've placed a copy of the file I use on the website (www.keatslab.org).

                To generate a similar file with the elements
                Code:
                #see what RNA types exist in your gtf file
                cut -f 2 ensembl_for_your_genome.gtf | sort | uniq
                
                #create a text file with the RNA types you want to exclude, such as
                #to_exclude.txt
                Mt_rRNA
                Mt_tRNA
                Mt_tRNA_pseudogene
                rRNA
                rRNA_pseudogene
                MT
                
                grep -f to_exclude.txt ensembl_for_your_genome.gtf > CufflinksExcludedTranscripts.gtf

                Comment


                • #68
                  Thank You, Jon for fixing the bug and for the cufflinks_exclude files.

                  I was about to post about HiSeq data and Illumina's switch to Sanger quality scores (as I had to update your script to work with HiSeq data) and see that you have posted it on your website. So thanks once again.

                  - Dinesh

                  Comment


                  • #69
                    Hi Jon,
                    I nominate this as the best running thread on SeqAnswers. I'm going to be looking at a lot of RNA-seq data in the next months and I was wondering why you use Cufflinks over DESeq or EdgeR?

                    Thanks for the scripts. I'll give them a go once I start getting some data back.

                    I've been embedding too much UNIX in my Perl scripts and your scripts are a nice example on how to write shell scripts. Do you or anyone else know a good reference on writing shell scripts beyond the 'Unix and perl tutorial for biologists'.
                    --------------
                    Ethan

                    Comment


                    • #70
                      Originally posted by Jon_Keats View Post
                      GATK Trial and Error
                      4) Fixing mates with Picard is no longer required after indel realignment as GATK does it on the fly now
                      How did you find this out? (I don't mean to doubt you, but I've been led astray before. Plus, I need to learn how to find out change information.)

                      Thanks.

                      Comment


                      • #71
                        Hi Anjulka,

                        I completely understand. The source was not obvious, I was following the GATK best practice guide which no longer suggests the picard fix-mate step. Similar to your concern I spent some time searching for the reason and finally found this thread on the GATK Get Satisfaction help page (http://getsatisfaction.com/gsa/topic...el_realignment). See the first reply by Eric Banks (GATK developer)

                        Comment


                        • #72
                          I'm a bit amazed by how many people follow or have read this thread, especially with it recently passing ECO introduction sticky thread. I wonder if I can reference this thread to the grant reviewers that say I need more informatics help on my grants... Oh well enough complaining.

                          I've been finalizing my exome pipeline for one of the projects that deals with human myeloma cell lines were there is not germline/constitutional DNA sample available. The primary goal is to identify variants in genes identified in our patient sample screens to identify potential model systems and appropriate controls.

                          I'll post the entire pipeline and examples of all the required input files but the pipeline looks like the following (see below) and generates a Summary Run Report shown (see below) after the following command is initiated in a directory full of fastq files (can be a total mixed bag and either uncompressed, gzip, or bzip2 compressed)

                          <BWA_exome_pipeline.sh> <run_parameters.ini> <sample_sheet.txt>
                          http://www.keatslab.org/resources/ngs-tools

                          Code:
                          *** Analysis Run Report ***
                          
                          Sample KMS11_JPN_p3_TruSeq70MbExome was Analyzed
                          
                          -----------------------------------------------------------------------------
                          
                          *** Lane Alignment Run Report ***
                          
                          Sample KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L7 was Analyzed
                          
                          The two read files processed were KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L7_R1.fastq.gz and KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L7_R2.fastq.gz
                          KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L7_R1.fastq.gz contained 50457788 reads
                          KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L7_R2.fastq.gz contained 50457788 reads
                          -----------------------------------------------------------------------------
                          
                          *** Lane Alignment Run Report ***
                          
                          Lane KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L8 was Analyzed
                          The two read files processed were KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L8_R1.fastq.gz and KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L8_R2.fastq.gz
                          KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L8_R1.fastq.gz contained 50279910 reads
                          KMS11_JPN_p3_TruSeq70MbExome_GCCAAT_L8_R2.fastq.gz contained 50279910 reads
                          -----------------------------------------------------------------------------
                          
                          Sample Summary:
                          
                          After alignment with BWA the mean insert size was determined to be 270.820553bp with a standard deviation of 91.662134
                          
                          The alignment metrics for this sample are as follows:
                          
                          Alignment Metrics:
                          83 - Average Read Length
                          201475396 - Total Number of Passing Filter Reads Aligned with bwa
                          199321958 - Number of Reads Aligned
                          99991035 - Number of Reads from Read1 Aligned
                          99330923 - Number of Reads from Read2 Aligned
                          0.989312 - Percentage of Unique Reads Aligned to at least one Position
                          0.992588 - Percentage of Unique Read1 Reads Aligned to at least one Position
                          0.986035 - Percentage of Unique Read2 Reads Aligned to at least one Position
                          0.844223 - Percentage of Aligned Bases, which are greater than Q20
                          0.99472 - Percentage of Reads Aligned as Pairs
                          0.000131 - Percentage of Reads which appear to be Adaptor Reads
                          
                          Duplication Metrics:
                          500572402 - Estimated fragments in Library
                          0.095995 - Percent Duplicate Reads
                          
                          Transcriptome Metrics:
                          0.56342 - Percent mRNA Bases
                          0.247456 - Percent Coding Bases
                          0.315964 - Percent UTR Bases
                          0.293039 - Percent Intronic Bases
                          0.143454 - Percent Intergenic Bases
                          0.661081 - Median 5 Prime Coverage Bias
                          0.730395 - Median 3 Prime Coverage Bias
                          0.938235 - Median 5/3 Coverage Ratio
                          
                          Capture Metrics:
                          182341432 - Number of unique alignment events [non-duplicates]
                          0.905031 - Percent unique alignment events [non-duplicates]
                          162369533 - Number of unique alignment events with MAPQ>0
                          0.89047 - Percent of reads which are unique alignments with MAPQ>0
                          13.446 - Gigabases of unique alignments with MAPQ>0
                          0.696883 - Percent of aligned bases within padded bait region
                          0.303117 - Percent of aligned bases not associated with a bait region
                          101.257402 - Mean coverage of the target region
                          0.376064 - Percentage of bases on target
                          5.057 - Gigabases of on target base pairs
                          23.291078 - Fold enrichment
                          0.003347 - Percent of undetected targets
                          0.95061 - Percent of targets with greater than 10x coverage
                          0.923333 - Percent of targets with greater than 20x coverage
                          0.886545 - Percent of targets with greater than 30x coverage
                          0.03663 - Percent of low GC targets which are under represented
                          5.835684 - Percent of high GC targets which are under represented
                          
                          Quality Metrics:
                          This run consists of paired 83x83bp reads
                          32.6069 - Mean phred score for read1
                          32.216839 - Mean phred score for the first cycle of read1
                          30.498119 - Mean phred score for the last cycle of read1
                          31.2961 - Mean phred score for read2
                          30.722569 - Mean phred score for the first cycle of read2
                          28.530552 - Mean phred score for the last cycle of read2
                          
                          Variant Calling Metrics:
                          40956 - Number of high-confidence variants called by both GATK and Samtools
                          2.63 - Ti/Tv Ratio for high-confidence variants called by both GATK and Samtools
                          2059 - Number of high-confidence variants that do not exist in dbsnp
                          2.19 - Ti/Tv Ratio for high-confidence variants called by both GATK and Samtools that do not exist in dbsnp
                          3266 - Number of high-confidence variants that do not exist in 1000G project
                          2.00 - Ti/Tv Ratio for high-confidence variants called by both GATK and Samtools that do not exist in 1000G project
                          94.97 - Percentage of high-confidence variants that exist in dbsnp
                          92.60 - Percentage of variants called by GATK and filtered by Samtools that exist in dbsnp
                          81.86 - Percentage of variants called by Samtools and filtered by GATK that exist in dbsnp
                          87.19 - Percentage of variants called exclusively by GATK that exist in dbsnp
                          41.50 - Percentage of variants called exclusively by Samtools that exist in dbsnp
                          92.95 - Percentage of variants called by but filtered by GATK and Samtools that exist in dbsnp
                          
                          581 of the 3266 high-confidence variants absent from 1000G are non-synonymous changes
                          
                          The 581 non-synonymous changes includes the following genes, which are mutated in the MMGI Study:
                          ZNF804A	-697T	2	185802211	C	CACA	[0, 0, 78, 75]	rs5836928
                          
                          The 581 non-synonymous changes includes the following positions, which exist in the Cosmic database:
                          ZNF804A	-697T	2	185802211	C	CACA	[0, 0, 78, 75]	rs5836928
                          FGFR3	Y373C	4	1806099	A	G	[8, 4, 14, 5]	rs121913485
                          
                          The 581 non-synonymous changes includes the following nonsense changes:
                          RP4-788L13.1	E298*	1	99771640	G	T	[28, 35, 7, 11]	.
                          TAF1A	Y36*	1	222761798	G	T	[53, 28, 12, 11]	.
                          MAP4K4	Q488*	2	102482984	C	T	[8, 19, 7, 14]	.
                          PTPRN	Q22*	2	220172517	G	A	[39, 75, 19, 15]	.
                          PAQR3	S133*	4	79851430	G	C	[29, 26, 25, 27]	.
                          CYFIP2	W204*	5	156786100	G	A	[7, 13, 9, 11]	.
                          ANKS1A	E420*	6	34957049	G	T	[26, 23, 12, 9]	.
                          PLEKHA2	C182*	8	38810216	T	A	[52, 33, 33, 15]	.
                          SOX6	R596*	11	16010642	G	A	[32, 26, 15, 11]	.
                          CD4	W53*	12	6909582	G	A	[12, 35, 7, 16]	.
                          GCOM1	Y122*	15	57913853	T	G	[39, 26, 17, 16]	.
                          MEF2A	K25*	15	100185784	A	T	[41, 5, 18, 1]	.
                          NOXO1	W269*	16	2029435	C	T	[19, 27, 24, 33]	.
                          SRCAP	Q2*	16	30712149	C	T	[17, 29, 8, 9]	.
                          ZNF613	E134*	19	52447644	G	T	[81, 31, 18, 9]	.
                          ZIM2	R335*	19	57286637	G	A	[37, 32, 34, 16]	.
                          FTHL17	E148*	X	31089629	C	A	[0, 0, 30, 8]	rs7057057
                          MAGEB16	R272*	X	35821127	C	T	[0, 0, 60, 63]	rs4829392
                          MAGEE2	E120*	X	75004529	C	A	[0, 0, 27, 28]	rs1343879
                          UBE2NL	L89*	X	142967468	T	G	[0, 0, 49, 62]	rs237520
                          
                          The 581 non-synonymous changes includes the following frameshift changes:
                          PROX1	NA	1	214170645	AG	A	[18, 17, 62, 63]	.
                          DNAH14	NA	1	225328438	G	GAT	[10, 0, 29, 1]	.
                          FGFR3	NA	4	1809127	CGT	C	[0, 0, 27, 26]	rs34003391
                          NR3C2	NA	4	149002016	AT	A	[0, 0, 11, 12]	rs10708334
                          ZNF608	NA	5	123982405	CT	C	[26, 3, 10, 3]	.
                          CYFIP2	NA	5	156721863	T	TC	[0, 0, 16, 19]	rs76159126
                          STEAP2	NA	7	89854652	ACAAAT	A	[20, 27, 33, 45]	.
                          OR13C2	NA	9	107367392	TGTTA	T	[23, 19, 25, 39]	rs71991336
                          C9orf173	NA	9	140147822	GATGCACCCTTCTGGCTGGCCAACCCTTCT	G	[11, 15, 11, 18]	.
                          CREB3L1	NA	11	46342259	A	AG	[0, 0, 54, 2]	rs71038900
                          DIXDC1	NA	11	111853106	G	GC	[0, 0, 33, 51]	rs67635503
                          TREH	NA	11	118529044	C	CG	[0, 0, 17, 24]	rs11448549
                          HSP90B1	NA	12	104326109	TAGAG	T	[12, 19, 20, 16]	.
                          PKD1L2	NA	16	81242148	GTT	G	[0, 0, 61, 26]	rs62734462
                          LRRC59	NA	17	48474672	TGGTCA	T	[9, 14, 2, 7]	.
                          HSH2D	NA	19	16268207	TA	T	[0, 0, 7, 18]	rs5827321
                          LTBP4	NA	19	41123093	A	AG	[0, 0, 18, 43]	rs67216392
                          RASIP1	NA	19	49230400	GA	G	[22, 49, 5, 18]	.
                          CD33	NA	19	51729577	ACCCAACAACTGGTATCTTT	A	[37, 19, 24, 14]	.
                          ADAMTS1	NA	21	28210040	GA	G	[69, 35, 19, 8]	.
                          C21orf49	NA	21	34169317	C	CT	[0, 0, 48, 55]	rs5843567
                          COL6A2	NA	21	47545369	A	AC	[0, 1, 29, 69]	.
                          CLTCL1	NA	22	19189003	A	AC	[0, 0, 1, 22]	rs78649162
                          EFHC2	NA	X	44202890	G	GC	[0, 0, 2, 35]	rs5902346
                          SSX9	NA	X	48163733	TC	T	[0, 0, 57, 22]	rs112256157
                          NUDT11	NA	X	51239295	ATCCTCGAGGCAGCC	A	[0, 0, 4, 12]	rs78182391
                          TCEAL6	NA	X	101395780	T	TG	[0, 0, 27, 90]	rs11408120
                          TEX13A	NA	X	104464276	CA	C	[0, 0, 36, 7]	rs66463437
                          
                          The 581 non-synonymous changes includes the following start codon losses:
                          TLR8	M1V	X	12924826	A	G	[0, 0, 17, 48]	rs3764880
                          
                          The 581 non-synonymous changes includes the following stop codon losses:
                          MPL	*636W	1	43818443	A	G	[31, 17, 45, 40]	.
                          PCSK2	*604Q	20	17462713	T	C	[26, 13, 18, 14]	.
                          Attached Files

                          Comment


                          • #73
                            Hi Jon - The BWA Exome Pipeline looks great. I assume the RNA-Seq pipeline is different just by looking at the pdf you posted. I assume the last RNA-Seq pipeline was from 09-26-2011. I also see that the Circos stuff wasn't included. Is there a reason why?

                            Comment


                            • #74
                              Hey Jon,

                              While going over your pipeline, I noticed something. With this line:

                              INNER_DISTANCE=`head -n 8 ${SAMPLE_NAME}_Picard_InsertMetrics.txt | tail -n 1 | cut -f5 | awk -v var1="$COMBINED_READ_LENGTH" '{printf "%.0f\n", $1-var1}'`

                              you're calculating the insert size as you use the INNER_DISTANCE value with -r option in Tophat. However, the Picard output, the 5th column is already MEAN_INSERT_SIZE, why are you subtracting COMBINED_READ_LENGTH from the mean insert size again?

                              Comment


                              • #75
                                Hi Ahmetz,

                                Tophat -r is looking for the "mate-inner-distance" which is the distance between the ends of read 1 and read 2 not the distance between adaptor1 and adaptor2. We have talked about having a wiki locally for terminology but this is what we often use.

                                [Adaptor1]>>>>read1>>>---------<<<read2<<<<[Adaptor2]

                                Fragment Size = distance from beginning of adaptor 1 to the end of adaptor 2
                                Insert Size = size of DNA between adaptor 1 and adaptor 2
                                Inner Distance (what tophat wants) = the distance between the end of read1 and the end or read2. In the example above if you consider each - as a nucleotide and the ends of each read defined by > and < then the inner-distance is 9.

                                Golharam,

                                We are currently updating the RNAseq pipeline to work with Topat 1.4 to take advantage of the new align to transcriptome step that I've always pushed to have in an RNAseq aligner. It will also look a bit more similar to our exome pipeline so that it can add read groups.
                                Ultimately, I'd envision there being a exome pipeline, RNAseq pipeline, mate-pair pipeline, and then a summary pipeline that will merge all three results and create the circos plot. Should be fairly easy to pipe the results of the current pipeline into circos but I always plot more than a single data type so I didn't bother to included it. But not a bad idea to included as a launch query so a user could pipe the results directly.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM
                                • seqadmin
                                  The Impact of AI in Genomic Medicine
                                  by seqadmin



                                  Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                  02-26-2024, 02:07 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 03-14-2024, 06:13 AM
                                0 responses
                                32 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-08-2024, 08:03 AM
                                0 responses
                                71 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-07-2024, 08:13 AM
                                0 responses
                                80 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-06-2024, 09:51 AM
                                0 responses
                                68 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X