Getting Better with Variables
Now that I have the lab (www.keatslab.org) moving forward to some extent I'm back to working on my analysis methods. For a couple of reasons I've been searching for ways to improve the previous methods so that even less user input is required and thus saving me time running things myself or worrying about other users making mistakes. So here is what seems like a good version that relies on very limited user input. It takes advantage of a common parameters file, and a series of basic user queries as the script launches to setup the tophat and cufflinks run. It also uses bwa to find the insert size and standard deviation of the library.
Requirements:
1) Create_Directory_Structure_V4 directory structure, but the pipeline is so flexible this is not really a requirement just simplifies effort.
2) bwa transcriptome index file for genome
3) bowtie index file for genome
4) Annotation files for Picard CollectRnaSeqMetrics.jar to function properly
Options:
1) It accepts compressed gzip or bzip2 files or standard read files with any extension
2) It creates a summary file at the end of each run and you have the option to output a list of genes and associated FPKM values to this summary file. So if you are looking at an siRNA experiment you could drop the gene of interest values for your samples or a list of random genes you would like a quick look at.
General Outline:
1) Query User to define setup, which sets the naming convention for each sample, the tophat run format (two options), the cufflinks run format (six options), the gene list added to the summary file
2) Counts the number of reads in each input file (used to calculate % aligned)
3) Cuts out the first 1 million reads for bwa alignment against a transcriptome reference to define the insert size and standard deviation
4) Runs tophat
5) Runs cufflinks
6) Runs a series of picard metrics programs
7) Summarizes metrics and gene values
create_ngs_directorystructure_v4.sh
RNAseq_MensFormal_Mac.sh
rnaseq_common_parameters.ini
Now that I have the lab (www.keatslab.org) moving forward to some extent I'm back to working on my analysis methods. For a couple of reasons I've been searching for ways to improve the previous methods so that even less user input is required and thus saving me time running things myself or worrying about other users making mistakes. So here is what seems like a good version that relies on very limited user input. It takes advantage of a common parameters file, and a series of basic user queries as the script launches to setup the tophat and cufflinks run. It also uses bwa to find the insert size and standard deviation of the library.
Requirements:
1) Create_Directory_Structure_V4 directory structure, but the pipeline is so flexible this is not really a requirement just simplifies effort.
2) bwa transcriptome index file for genome
3) bowtie index file for genome
4) Annotation files for Picard CollectRnaSeqMetrics.jar to function properly
Options:
1) It accepts compressed gzip or bzip2 files or standard read files with any extension
2) It creates a summary file at the end of each run and you have the option to output a list of genes and associated FPKM values to this summary file. So if you are looking at an siRNA experiment you could drop the gene of interest values for your samples or a list of random genes you would like a quick look at.
General Outline:
1) Query User to define setup, which sets the naming convention for each sample, the tophat run format (two options), the cufflinks run format (six options), the gene list added to the summary file
2) Counts the number of reads in each input file (used to calculate % aligned)
3) Cuts out the first 1 million reads for bwa alignment against a transcriptome reference to define the insert size and standard deviation
4) Runs tophat
5) Runs cufflinks
6) Runs a series of picard metrics programs
7) Summarizes metrics and gene values
create_ngs_directorystructure_v4.sh
Code:
#!/bin/sh
# Create_NGS_DirectoryStructure_V4.sh
#
#
# Created by Jonathan Keats on 09/25/11
# Translational Genomics Research Institute
#########################################################################
# CREATES A DIRECTORY STURCTURE TO SUPPORT A VARIETY OF NGS PIPELINES #
#########################################################################
# Designed for a Mac OS enviroment and requires initiation from your home folder (/User/You/)
# Check to confirm current location is $HOME/ (ie. /User/You/)
echo "Confirming Script Initiation Directory"
var1=$HOME
if [ "`pwd`" != "$var1" ]
then
echo " The script must be launched from your home directory "
echo " The script was automatically killed due to a launch error - See Above Error Message"
exit 2
fi
echo "1) Launch Location is Correct ($HOME/)"
# Create required directories to support pipelines (RNAseq_MensFormal_)
echo ***Creating Pipeline Directory Structure***
mkdir -p ngs/{applications,bwa,run_parameters,run_parameters,scripts,temp,tophat,tophat_fusion}
mkdir -p ngs/analyzed_read_files/{chipseq,exomes,genomes,matepair,rnaseq}
mkdir -p ngs/finaloutputs/{chipseq,exomes,genomes,matepair,rnaseq}
mkdir -p ngs/refgenomes/{bfast_indexed,bowtie_indexed,bwa_indexed,downloads}
mkdir -p ngs/refgenomes/downloads/{ncbi36_hg18,grch37_hg19}
mkdir -p ngs/refgenomes/downloads/ncbi36_hg18/{annotation_files,reference_sequences}
mkdir -p ngs/refgenomes/downloads/grch37_hg19/{annotation_files,reference_sequences}
mv create_ngs_directorystructure_v4.sh ngs/scripts/
echo ***Pipeline Directory Structure Created***
Code:
#!/bin/sh
# RNAseq_MensFormal_Mac.sh
#
#
# Created by Jonathan Keats on 9/21/11.
# Copyright 2011 Translational Genomics Research Institute. All rights reserved.
# Outline - Automated pipeline requiring limited input for multiple RNAseq samples using tophat and cufflinks
# Usage = RNAseq_MensFormal_Mac.sh <rnaseq_common_parameters.ini>
######################################################################################################
## ##
## Dependencies - Unix (mv, cd, find, basename, mkdir, echo, wc, head, tail, cut, awk, sed) ##
## - BWA (version with -I option) (Tested with v0.5.9) ##
## - SAMTOOLS (Tested with v0.1.12a) ##
## - PICARD Tools (Tested with v1.52) ##
## - CollectInsertSizeMetrics.jar (WATCH VERSION AS COLUMN OUTPUT DIFFERS) ##
## - CollectRNAseqMetrics.jar ##
## - MarkDuplicates.jar ##
## - CollectAlignmentSummaryMetrics.jar ##
## - MeanQualityByCycle.jar ##
## - TOPHAT (Tested with v1.3.2) ##
## - CUFFLINKS (Tested with v1.1.0) ##
## ##
## Outline - Place uniformly named read file pairs in TOPHAT folder ##
## - This method is dependent on XXXX folder structure ##
## - Update "tophat_run_parameters.ini" as appropriate to your samples ##
## - Launch script which queries the user to define methods used and renaming convention ##
## then does the following steps automatically: ##
## - Create custom directory for each sample ##
## - Create custom .ini file for each sample ##
## - Cut out first 1 million reads from each read file ##
## - Align to transcriptome reference using BWA ##
## - Calculate inner-mate-distance and standard deviation ##
## ** inner-mate-distance = (insert size) - (combined read length) ##
## - Run TOPHAT ##
## - Run CUFFLINKS ##
## - Calculate metrics ##
## ##
## Notes - Extensive use of common variables is used to facilitate automation ##
## - There is some naming convention requirements but they are flexible ##
## - Paired samples must follow the same convention: ##
## --> SampleA_1.fastq and SampleA_2.fastq ##
## OR ##
## --> SampleA_Time1_1.txt and SampleA_Time1_2.txt ##
## - You can mix different basenames but not common extensions ##
## --> SampleB_1.fa / SampleB_2.fa ##
## --> SampleC_Time1_1.fa / SampleC_Time1_2.fa ##
## ** This is fine: Common Extension = .fa ; Read Extension = _1.fa and _2.fa ** ##
## --> SampleB_1.fastq / SampleB_2.fastq ##
## --> SampleC_1.fa / SampleC_2.fa ##
## --> SampleD_R1.fa / SampleD_R2.fa ##
## ** These can not be mixed, no "Common Extension" or common "Read Extension" ** ##
## ##
######################################################################################################
# Read in the tophat_common_parameters file
. $1
#Starting Directory=ngs/tophat/
echo
echo "*** STARTING SCRIPT ***"
date '+%m/%d/%y %H:%M:%S'
echo
##############################################################
## Check that the script was launched from ngs/tophat dir ##
##############################################################
echo Checking Launch Location
date '+%m/%d/%y %H:%M:%S'
echo
# Checking if launch location is correct
if [ "`pwd`" != "${LAUNCH_DIR}" ]
then
echo " The script must be launched from the ${LAUNCH_DIR} directory "
echo " The script was automatically killed due to a launch error - See Above Error Message"
exit 2
fi
echo "1) Launch Location is Correct ($HOME/ngs/tophat)"
# Checking if launch directory contains any .ini files (No such files are expected)
if [ `ls *.ini | wc -l` != 0 ]
then
echo " The tophat directory contains unexpected .ini files - These must be removed before the script starts "
echo " ERROR - The script was automatically killed due to a launch error - See Above Error Message"
exit 2
fi
echo "2) No .ini Files Found in Launch Directory as Expected"
echo
echo Directory Check Completed
date '+%m/%d/%y %H:%M:%S'
echo
##############################################################
## Query User for tophat and cufflinks configurations ##
##############################################################
echo Starting User Configuration Query Step
date '+%m/%d/%y %H:%M:%S'
echo
# Ask user if the input read files are compressed or not
echo "Are the input read files compressed? (ie. gzip or bzip2) [Yes/No]"
read COMPRESSION_QUERY
# Test to see if the "COMPRESSION QUERY" Query Response was appropriate
## NOTE - In the IF statement use && as AND and || as OR in new shells (old OR is -o)
if [[ ${COMPRESSION_QUERY} = "Yes" || ${COMPRESSION_QUERY} = "No" ]]
then
echo
else
echo
echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
echo " The script was automatically killed due to a launch error - See Above Error Message"
exit 2
fi
# Ask user for the compression type if they indicated that the read files are compressed
if [ ${COMPRESSION_QUERY} = "Yes" ]
then
echo Please indicate the type of compression. [gzip/bzip2]
read COMPRESSION_TYPE
if [[ ${COMPRESSION_TYPE} = "gzip" || ${COMPRESSION_QUERY} = "bzip2" ]]
then
echo
else
echo
echo " Incorrect Response Option, Please use gzip or bzip2 (Case-Sensitive) "
echo " The script was automatically killed due to a launch error - See Above Error Message"
exit 2
fi
fi
######################
echo Would you like to run tophat with a gtf reference file? [Yes/No]
read TOPHAT_QUERY_GTF
# Test to see if the Query Response to the use of a GTF file during tophat alignment was correct or not
## NOTE - In the IF statement use && as AND and || as OR in new shells (old OR is -o)
if [[ ${TOPHAT_QUERY_GTF} = "Yes" || ${TOPHAT_QUERY_GTF} = "No" ]]
then
echo
else
echo
echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
echo " The script was automatically killed due to a launch error - See Above Error Message"
exit 2
fi
# Ask user for a short GTF Identifier for renaming purposes
if [ ${TOPHAT_QUERY_GTF} = "Yes" ]
then
echo "Please provide a short descriptor for the GTF file used during tophat alignment"
echo "Such as ensembl v60 (E60), this is not standardized but for file sorting purposes you should be consistent between runs"
echo Please Enter Response Now:
read TOPHAT_GTF_SHORT
echo
fi
echo Notes regarding the following questions pertain to how cufflinks estimates abundances
echo -----------------------------------------------------------------------------------------
echo
echo A GTF file can be used in two mutually exclusive fashions -G/--GTF or -g/--GTF-guide
echo
echo "1) To limit abundance estimates to just the isoforms in the GTF file use -G/--GTF"
echo "*** In this script this is called LIMIT ABUNDANCE ESTIMATION to GTF reference ***"
echo NOTE: This limits the estimates to a consistent set of known isoforms
echo
echo "2) To guide BUT NOT limit the abundance estimates to the GTF file use -g/--GTF-guide"
echo "*** In this script this is called GUIDE ABUNDANCE ESTIMATION to GTF reference ***"
echo NOTE: This allows identification of known and novel isoforms
echo -----------------------------------------------------------------------------------------
echo
echo Would you like to LIMIT ABUNDANCE ESTIMATION with cufflinks to a gtf reference file? [Yes/No]
read CUFFLINKS_QUERY_LIMIT_GTF
# Test to see if the Query Response to the use of a LIMIT ABUNDANCE ESTIMATION GTF file during cufflinks isoform assembly was correct or not
if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" || ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" ]]
then
echo
else
echo
echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
echo " The script was automatically killed due to a launch error - See Above Error Message"
exit 2
fi
echo Would you like to GUIDE ABUNDANCE ESTIMATION with cufflinks to a gtf reference file? [Yes/No]
read CUFFLINKS_QUERY_GUIDE_GTF
# Test to see if the Query Response to the use of a GUIDE ABUNDANCE ESTIMATION GTF file during cufflinks isoform assembly was correct or not
if [[ ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" || ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" ]]
then
echo
else
echo
echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
echo " The script was automatically killed due to a launch error - See Above Error Message"
exit 2
fi
# Test to ensure the user did not say "Yes" to the use both the LIMIT and GUIDE GTF options
if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" ]]
then
echo " Incorrect Response Combination, You Can Not Say Yes to both the LIMIT and GUIDE ABUNDANCE ESTIMATION Options!! "
echo " The script was automatically killed due to a launch error - See Above Error Message"
exit 2
fi
# Ask user for a short GTF Identifier for renaming purposes
if [ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" ]
then
echo "Please provide a short descriptor for the GTF file used to LIMIT ABUNDANCE ESTIMATION"
echo "Such as ensembl v60 (E60), this is not standardized but for file sorting purposes you should be consistent between runs"
echo Please Enter Response Now:
read LIMIT_GTF_SHORT
echo
fi
if [ ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" ]
then
echo "Please provide a short descriptor for the GTF file used to GUIDE ABUNDANCE ESTIMATION"
echo "Such as ensembl v60 (E60), this is not standardized but for file sorting purposes you should be consistent between runs"
echo Please Enter Response Now:
read GUIDE_GTF_SHORT
echo
fi
echo Would you like to run cufflinks with an abundance estimate mask file? [Yes/No]
read CUFFLINKS_QUERY_MASK
# Test to see if the Query Response to the use of a mask file during cufflinks isoform assembly was correct or not
if [[ ${CUFFLINKS_QUERY_MASK} = "Yes" || ${CUFFLINKS_QUERY_MASK} = "No" ]]
then
echo
else
echo
echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
echo " The script was automatically killed due to a launch error - See Above Error Message"
exit 2
fi
# Ask user for a short mask identifier for renaming purposes
if [ ${CUFFLINKS_QUERY_MASK} = "Yes" ]
then
echo "Please provide a short descriptor for the mask file used to prevent certain transscripts from being used during abundance estimation"
echo "Such as MT for mitochorndia, IG for immunoglobulin, this is not standardized but for file sorting purposes you should be consistent between runs"
echo Please Enter Response Now:
read MASK_SHORT
echo
fi
# Ask the user if they want to extract a list of Genes and associated FPKM vales from the cufflinks output into the run report
echo Would you like to extract a gene list, defined by GENES_OF_INTEREST in .ini file into the run report? [Yes/No]
read GENESOFINTEREST_QUERY
# Test to see if the Query Response to the use of a mask file during cufflinks isoform assembly was correct or not
if [[ ${GENESOFINTEREST_QUERY} = "Yes" || ${GENESOFINTEREST_QUERY} = "No" ]]
then
echo
else
echo
echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
echo " The script was automatically killed due to a launch error - See Above Error Message"
exit 2
fi
#Final User Query to ensure the variables in the Tophat_Run_Parameters.ini file match the users responses
echo FINAL CHECK: Are you sure the Tophat_Run_Parameters.ini file is updated to match the user responses you provided? [Yes/No]
read FINAL_CHECK
# Test to see if the Query Response was correct or not
if [[ ${FINAL_CHECK} = "Yes" || ${FINAL_CHECK} = "No" ]]
then
echo
else
echo
echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) "
echo " The script was automatically killed due to a launch error - See Above Error Message"
exit 2
fi
# Test to see if the user wants to stop the script (ie. they entered "No" as a response to the FINAL CHECK
if [ ${FINAL_CHECK} = "Yes" ]
then
echo "Thanks... Your User Configured Run Will Start in a Moment"
echo
else
echo
echo You Enter No to the query... The Script Will Stop Now
exit 2
fi
echo Finished User Configuration Query Step
date '+%m/%d/%y %H:%M:%S'
echo
##############################################################
## Define the renaming tags based on the users responses ##
##############################################################
echo Defining Renaming Convention Based on User Configuration Inputs
date '+%m/%d/%y %H:%M:%S'
echo
# TOPHAT_TAG options are TophatPE_DeNovo or TophatPE_GTF
if [ ${TOPHAT_QUERY_GTF} = "Yes" ]
then
TOPHAT_TAG="TophatPE_${TOPHAT_GTF_SHORT}GTF"
else
TOPHAT_TAG="TophatPE_DeNovo"
fi
echo The tophat tag has been set to ${TOPHAT_TAG}
# CUFFLINKS_TAG options are Cufflinks_DeNovo_NoMask, Cufflinks_DeNovo_Masked, Cufflinks_Limited_NoMask, Cufflinks_Limited_Masked, Cufflinks_Guided_NoMask, Cufflinks_Guided_Masked
if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
then
CUFFLINKS_TAG="Cufflinks_DeNovo_NoMask"
fi
if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
then
CUFFLINKS_TAG="Cufflinks_DeNovo_${MASK_SHORT}Mask"
fi
if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
then
CUFFLINKS_TAG="Cufflinks_${LIMIT_GTF_SHORT}Limited_NoMask"
fi
if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
then
CUFFLINKS_TAG="Cufflinks_${LIMIT_GTF_SHORT}Limited_${MASK_SHORT}Mask"
fi
if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
then
CUFFLINKS_TAG="Cufflinks_${GUIDE_GTF_SHORT}Guided_NoMask"
fi
if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
then
CUFFLINKS_TAG="Cufflinks_${GUIDE_GTF_SHORT}Guided_${MASK_SHORT}Mask"
fi
echo The cufflinks tag has been set to ${CUFFLINKS_TAG}
echo
##############################################################
## Create a custom analysis directory for each sample ##
##############################################################
## Lots of tricks in this step, find function, basename, learning how \, ", and ' quotes differ allowing you to write special characters and use variables
echo Creating Custom Directories and Splitting Files
date '+%m/%d/%y %H:%M:%S'
for file in `find . -type f -iname "*${READ1_ID}"`
do
filename=`basename $file "${READ1_ID}"`
mkdir $filename
mv $filename${READ1_ID} $filename
mv $filename${READ2_ID} $filename
echo "SAMPLE_NAME=\"$filename\"" >> $filename.ini
done
echo Directory Constuction and File Splitting Complete
date '+%m/%d/%y %H:%M:%S'
echo
##############################################################
## Analysis Pipeline - BWA, TOPHAT, CUFFLINKS, QC metrics ##
##############################################################
# Initiate analysis of each identified sample one at a time
echo Starting Individual Sample Analysis
date '+%m/%d/%y %H:%M:%S'
echo
for ini in `ls *.ini`
do
#Read in the .ini file created during the directory construction period
. $ini
mv $ini ${SAMPLE_NAME}
cd ${SAMPLE_NAME}
#Current Directory=ngs/tophat/SampleX/tophat_out
echo Analyzing ${SAMPLE_NAME} | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
#Extract and Count reads in input files, 3 parse version depending on compression and type of compression
if [ ${COMPRESSION_QUERY} = "No" ]
then
#Calculate the number of reads present in each file
echo Starting to count the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
COUNT_READ1=`wc -l ${SAMPLE_NAME}${READ1_ID} | awk '{print $1/4}'`
echo The forward or read1 fastq contains ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
COUNT_READ2=`wc -l ${SAMPLE_NAME}${READ2_ID} | awk '{print $1/4}'`
echo The reverse or read2 fastq contains ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo Finished identifying the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
#Cut out first 1 million reads from forward and reverse to estimate inner mate distance and STD
echo Creating Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
head -n 4000000 ${SAMPLE_NAME}${READ1_ID} > ${SAMPLE_NAME}${READ1_BWA_FASTQ}
echo Finished Creating Read1 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
head -n 4000000 ${SAMPLE_NAME}${READ2_ID} > ${SAMPLE_NAME}${READ2_BWA_FASTQ}
echo Finished Creating Read2 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
else
if [ ${COMPRESSION_TYPE} = "gzip" ]
then
echo Starting to count the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
COUNT_READ1=`gunzip -c ${SAMPLE_NAME}${READ1_ID} | wc -l | awk '{print $1/4}'`
echo The forward or read1 fastq contains ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
COUNT_READ2=`gunzip -c ${SAMPLE_NAME}${READ2_ID} | wc -l | awk '{print $1/4}'`
echo The reverse or read2 fastq contains ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo Finished identifying the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
#Cut out first 1 million reads from forward and reverse to estimate inner mate distance and STD
echo Creating Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
gunzip -c ${SAMPLE_NAME}${READ1_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ1_BWA_FASTQ}
echo Finished Creating Read1 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
gunzip -c ${SAMPLE_NAME}${READ2_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ2_BWA_FASTQ}
echo Finished Creating Read2 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
fi
if [ ${COMPRESSION_TYPE} = "bzip2" ]
then
echo Starting to count the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
COUNT_READ1=`bunzip2 -k ${SAMPLE_NAME}${READ1_ID} | wc -l | awk '{print $1/4}'`
echo The forward or read1 fastq contains ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
COUNT_READ2=`bunzip2 -k ${SAMPLE_NAME}${READ2_ID} | wc -l | awk '{print $1/4}'`
echo The reverse or read2 fastq contains ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo Finished identifying the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
#Cut out first 1 million reads from forward and reverse to estimate inner mate distance and STD
echo Creating Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
bunzip2 -k ${SAMPLE_NAME}${READ1_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ1_BWA_FASTQ}
echo Finished Creating Read1 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
bunzip2 -k ${SAMPLE_NAME}${READ2_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ2_BWA_FASTQ}
echo Finished Creating Read2 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
fi
fi
#Align subset reads to transcriptome reference using BWA aln
echo Starting Alignment of Read1 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
bwa aln -t ${CPU_CORES} -I ${BWA_GENOME_FILE} ${SAMPLE_NAME}${READ1_BWA_FASTQ} > ${SAMPLE_NAME}${READ1_BWA_SAI}
echo Finished Aligning Read1 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo Starting Alignment of Read2 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
bwa aln -t ${CPU_CORES} -I ${BWA_GENOME_FILE} ${SAMPLE_NAME}${READ2_BWA_FASTQ} > ${SAMPLE_NAME}${READ2_BWA_SAI}
echo Finished Aligning Read2 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
#Generate .sam file using BWA sampe
echo Starting BWA SAMPE Assembly of Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
bwa sampe ${BWA_GENOME_FILE} ${SAMPLE_NAME}${READ1_BWA_SAI} ${SAMPLE_NAME}${READ2_BWA_SAI} ${SAMPLE_NAME}${READ1_BWA_FASTQ} ${SAMPLE_NAME}${READ2_BWA_FASTQ} > ${SAMPLE_NAME}${BWA_SAMPE}.sam
echo Finished BWA SAMPE Assembly Phase | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
#Convert BWA sam file to bam file using SAMTOOLS
echo Starting SAM to BAM conversion with Samtools | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
samtools view -bS -o ${SAMPLE_NAME}${BWA_SAMPE}.bam ${SAMPLE_NAME}${BWA_SAMPE}.sam
echo Finished SAM to BAM conversion wth Samtools | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
#Sort BWA bam file using SAMTOOLS
echo Starting BAM file sorting step | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
samtools sort ${SAMPLE_NAME}${BWA_SAMPE}.bam ${SAMPLE_NAME}${BWA_SAMPE}${SAMTOOLS_SORT_ID}
echo Finished sorting BAM file | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
#Determine insert metrics using PICARD
echo Starting Picard CollectInsertSizeMetrics step | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
java -Xmx2g -jar $HOME/local/bin/CollectInsertSizeMetrics.jar INPUT=${SAMPLE_NAME}${BWA_SAMPE}${SAMTOOLS_SORT_ID}.bam OUTPUT=${SAMPLE_NAME}_Picard_InsertMetrics.txt HISTOGRAM_FILE=${SAMPLE_NAME}_Picard_InsertMetrics.pdf VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY}
echo Finished collecting insert metrics step | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
#Calculate "inner mate distance" and standard deviation
echo Starting to calculate insert metrics | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
INNER_DISTANCE=`head -n 8 ${SAMPLE_NAME}_Picard_InsertMetrics.txt | tail -n 1 | cut -f5 | awk -v var1="$COMBINED_READ_LENGTH" '{printf "%.0f\n", $1-var1}'`
echo The mean inner-mate-distance was identified as being ${INNER_DISTANCE}bp | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
STDEV=`head -n 8 ${SAMPLE_NAME}_Picard_InsertMetrics.txt | tail -n 1 | cut -f6 | awk '{printf "%.0f\n", $1}'`
echo The standard deviation of the inner-mate-distance was identified as being ${STDEV}bp | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo Finished calculating insert metrics | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
#Run tophat analysis
echo Starting Tophat Alignment of ${SAMPLE_NAME} | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
if [ ${TOPHAT_QUERY_GTF} = "Yes" ]
then
TOPHAT_COMMAND="tophat --num-threads ${CPU_CORES} -r ${INNER_DISTANCE} --mate-std-dev ${STDEV} --max-insertion-length ${MAX_INSERT_SIZE} --max-deletion-length ${MAX_DELETION_SIZE} ${READ_QUALITY} --library-type ${LIBRARY_FORMAT} -G ${GTF_FILE} ${BOWTIE_INDEX} ${SAMPLE_NAME}${READ1_ID} ${SAMPLE_NAME}${READ2_ID}"
fi
if [ ${TOPHAT_QUERY_GTF} = "No" ]
then
TOPHAT_COMMAND="tophat --num-threads ${CPU_CORES} -r ${INNER_DISTANCE} --mate-std-dev ${STDEV} --max-insertion-length ${MAX_INSERT_SIZE} --max-deletion-length ${MAX_DELETION_SIZE} ${READ_QUALITY} --library-type ${LIBRARY_FORMAT} ${BOWTIE_INDEX} ${SAMPLE_NAME}${READ1_ID} ${SAMPLE_NAME}${READ2_ID}"
fi
echo ${TOPHAT_COMMAND} | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
${TOPHAT_COMMAND}
## This creates a subfolder called "tophat_out/"
## This folder will contain: accepted_hits.bam (is it sorted?)
## deletions.bed
## insertions.bed
## junctions.bed
cd tophat_out/
#Current Directory=ngs/tophat/SampleX/tophat_out
mv accepted_hits.bam ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam
mv deletions.bed ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_deletions.bed
mv insertions.bed ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_insertions.bed
mv junctions.bed ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_junctions.bed
echo Finished Tophat Alignment | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
#Run cufflinks analysis
echo Starting Cufflinks Abundance Estimation of ${SAMPLE_NAME} | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
then
CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"
fi
if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
then
CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} -M ${MASK_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"
fi
if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
then
CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF ${GTF_FILE} --compatible-hits-norm --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"
fi
if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
then
CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF ${GTF_FILE} --compatible-hits-norm -M ${MASK_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"
fi
if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "No" ]]
then
CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF-guide ${GTF_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"
fi
if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]]
then
CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF-guide ${GTF_FILE} -M ${MASK_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam"
fi
echo ${CUFFLINKS_COMMAND} | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
${CUFFLINKS_COMMAND}
## This creates a subfolder called "cufflinks_out/" in the sample directory
## This folder will contain: genes.expr
## transcripts.expr
## transcripts.gtf
#In the next step we will rename the expression estimate files
cd ../cufflinks_out/
#Current Directory=ngs/tophat/SampleX/cufflinks_out
mv genes.fpkm_tracking ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_genes.expr
mv isoforms.fpkm_tracking ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_transcripts.expr
mv transcripts.gtf ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_transcripts.gtf
echo Finished Cufflinks Abundance Estimation | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
##############################################################
## Calculate metrics using Samtools and Picard ##
##############################################################
cd ../tophat_out/
#Current Directory=ngs/tophat/SampleX/tophat_out
#Determine the RNA Seq Metrics using Picard "CollectRNASeqMetrics.jar"
echo Starting Picard CollectRnaSeqMetrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
java -Xmx2g -jar $HOME/local/bin/CollectRnaSeqMetrics.jar REF_FLAT=${REFFLAT_FILE} RIBOSOMAL_INTERVALS=${RIBOSOME_LIST} STRAND_SPECIFICITY=NONE REFERENCE_SEQUENCE=${REFERENCE_FILE} INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt CHART_OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.pdf VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY}
echo Finished collecting RNAseq metrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
#Mark the duplicate reads using Picard "MarkDuplicates.jar"
echo Starting Picard MarkDuplicates step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
## NOTE: "-Dsnappy.disable=true" is passed to this command to prevent the use of snappy-java by MarkDuplicates as I can not figure out how to add it to the JAVA CLASSPATH
java -Dsnappy.disable=true -Xmx2g -jar $HOME/local/bin/MarkDuplicates.jar INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam METRICS_FILE=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_MarkDuplicates_Metrics.txt REMOVE_DUPLICATES=false ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY}
echo Finished MarkDuplicates step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
#Create an index file for the bam file generated from the MarkDuplicates step using SAMTOOLS INDEX
echo Starting bam file indexing step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
samtools index ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam
echo Finished bam file indexing step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
#Calculate mapping statistics using SAMTOOLS FlAGSTAT
echo Starting Samtools Flagstat step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
samtools flagstat ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam > ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_SamtoolsMappingStats.txt
echo Finished Samtools Flagstat step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
#Calculate mapping statistics using Picard "CollectAlignmentSummaryMetrics.jar"
echo Starting Picard CollectAlignmentSummaryMetrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
java -Xmx2g -jar $HOME/local/bin/CollectAlignmentSummaryMetrics.jar INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt VALIDATION_STRINGENCY=SILENT REFERENCE_SEQUENCE=${REFERENCE_FILE} ASSUME_SORTED=true IS_BISULFITE_SEQUENCED=false TMP_DIR=${TEMP_DIRECTORY}
echo Finished Picard CollectAlignmentSummaryMetrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
#Calculate read quality stats using Picard "MeanQualityByCycle.jar"
echo Starting Picard MeanQualityByCycle step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
java -Xmx2g -jar $HOME/local/bin/MeanQualityByCycle.jar INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardQualByCycle.txt CHART_OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardQualByCycle.pdf VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY}
echo Finished Picard MeanQualityByCycle step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
##############################################################
## Write out summary file (Stats/Metrics/Indexes?) ##
##############################################################
#Create a summary text file
echo Starting Creation of Summary File | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo "*** Analysis Run Report ***" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo Sample ${SAMPLE_NAME} was Analyzed | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
# Test to see if the user wants a to extract FPKM values for a list of genes into the run report
if [ ${GENESOFINTEREST_QUERY} = "Yes" ]
then
echo FPKM Values for Common genes of Interests: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
grep -wf ${GENES_OF_INTEREST} ../cufflinks_out/${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_genes.expr | awk '{print $1, $10}' | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
fi
echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo Sample Summary: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "The two read files processed were ${SAMPLE_NAME}${READ1_ID} and ${SAMPLE_NAME}${READ2_ID}" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${SAMPLE_NAME}${READ1_ID} contained ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${SAMPLE_NAME}${READ2_ID} contained ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo After aligning the first one million reads to a transcriptome reference using BWA | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo the inner-mate-distance was determined to be ${INNER_DISTANCE}bp with a standard deviation of ${STDEV} | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "The alignment metrics for this sample were as follows:" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
#Extract Alignment Metrics
UNIQUE_READ1_READS_ALIGNED=`cut -f3 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | head -n 8 | tail -n 1`
UNIQUE_READ2_READS_ALIGNED=`cut -f3 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | head -n 9 | tail -n 1`
TOTAL_READ1_ALIGNMENTS=`awk '{print $1}' ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_SamtoolsMappingStats.txt | head -n 5 | tail -n 1`
TOTAL_READ2_ALIGNMENTS=`awk '{print $1}' ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_SamtoolsMappingStats.txt | head -n 6 | tail -n 1`
PERCENT_READ1_ALIGNED=`echo ${COUNT_READ1} | awk -v var2="${UNIQUE_READ1_READS_ALIGNED}" '{print var2/$1}'`
PERCENT_READ2_ALIGNED=`echo ${COUNT_READ2} | awk -v var3="${UNIQUE_READ2_READS_ALIGNED}" '{print var3/$1}'`
PERCENT_READ1_Q20_BASES=`head -n 8 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | tail -n 1 | awk '{print $11/$8}'`
PERCENT_READ2_Q20_BASES=`head -n 9 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | tail -n 1 | awk '{print $11/$8}'`
PERCENT_PAIRED_READ_ALIGNMENT=`cut -f17 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | head -n 8 | tail -n 1 | awk -v var4="${COUNT_READ1}" '{print $1/var4}'`
echo Alignment Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${UNIQUE_READ1_READS_ALIGNED} - Number of Unique Reads from Read1 Aligned | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${UNIQUE_READ2_READS_ALIGNED} - Number of Unique Reads from Read2 Aligned | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${TOTAL_READ1_ALIGNMENTS} - Total Number of Alignment Events from the Unique Reads in Read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${TOTAL_READ2_ALIGNMENTS} - Total Number of Alignment Events from the Unique Reads in Read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_READ1_ALIGNED} - Percentage of Unique Read1 Reads Aligned to at least one Position | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_READ2_ALIGNED} - Percentage of Unique Read2 Reads Aligned to at least one Position | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_READ1_Q20_BASES} - Percentage of Aligned Bases from Read1, which are greater than Q20 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_READ2_Q20_BASES} - Percentage of Aligned Bases from Read2, which are greater than Q20 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_PAIRED_READ_ALIGNMENT} - Percentage of Reads Aligned as Pairs | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
#Extract RNAseq Metrics
PERCENT_RIBOSOME_BASES=`cut -f11 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
PERCENT_CODING_BASES=`cut -f12 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
PERCENT_UTR_BASES=`cut -f13 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
PERCENT_INTRONIC_BASES=`cut -f14 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
PERCENT_INTERGENIC_BASES=`cut -f15 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
PERCENT_MRNA_BASES=`cut -f16 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
MEDIAN_5PRIME_BIAS=`cut -f20 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
MEDIAN_3PRIME_BIAS=`cut -f21 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
MEDIAN_5PRIME_3PRIME_BIAS_RATIO=`cut -f22 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1`
echo Transcriptome Sequencing Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_RIBOSOME_BASES} - Percent Ribosomal Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_CODING_BASES} - Percent Coding Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_UTR_BASES} - Percent UTR Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_INTRONIC_BASES} - Percent Intronic Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_INTERGENIC_BASES} - Percent Intergenic Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_MRNA_BASES} - Percent mRNA Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${MEDIAN_5PRIME_BIAS} - Median 5 Prime Coverage Bias | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${MEDIAN_3PRIME_BIAS} - Median 3 Prime Coverage Bias | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${MEDIAN_5PRIME_3PRIME_BIAS_RATIO} - Median 5/3 Coverage Ratio | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
#Extract Duplication Metrics
PERCENT_DUPLICATION=`head -n 8 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_MarkDuplicates_Metrics.txt | tail -n 1 | cut -f8`
LIBRARY_SIZE=`head -n 8 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_MarkDuplicates_Metrics.txt | tail -n 1 | cut -f9`
echo Duplication Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${LIBRARY_SIZE} - Estimated fragments in Library | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${PERCENT_DUPLICATION} - Percent Duplicate Reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
#Calculate Quality Metrics (first/last cycle and mean per read)
QUALITY_CAPTURE_LINES=`echo ${COMBINED_READ_LENGTH} | awk '{print $1+8}'`
QUALITY_CAPTURE_READS=`echo ${COMBINED_READ_LENGTH} | awk '{print $1/2}'`
head -n ${QUALITY_CAPTURE_LINES} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardQualByCycle.txt | tail -n ${COMBINED_READ_LENGTH} > run_qualities.txt
head -n ${QUALITY_CAPTURE_READS} run_qualities.txt > read1_qualities.txt
tail -n ${QUALITY_CAPTURE_READS} run_qualities.txt > read2_qualities.txt
QUALITY_READ1_FIRST_CYCLE1=`cut -f2 read1_qualities.txt | head -n 1`
QUALITY_READ1_LAST_CYCLE1=`cut -f2 read1_qualities.txt | tail -n 1`
QUALITY_READ1_AVERAGE=`awk '{s+=$2} END {print s/NR}' < read1_qualities.txt`
QUALITY_READ2_FIRST_CYCLE1=`cut -f2 read2_qualities.txt | head -n 1`
QUALITY_READ2_LAST_CYCLE1=`cut -f2 read2_qualities.txt | tail -n 1`
QUALITY_READ2_AVERAGE=`awk '{s+=$2} END {print s/NR}' < read2_qualities.txt`
echo Quality Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo This run consists of paired ${QUALITY_CAPTURE_READS}x${QUALITY_CAPTURE_READS}bp reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${QUALITY_READ1_AVERAGE} - Mean phred score for read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${QUALITY_READ1_FIRST_CYCLE1} - Mean phred score for the first cycle of read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${QUALITY_READ1_LAST_CYCLE1} - Mean phred score for the last cycle of read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${QUALITY_READ2_AVERAGE} - Mean phred score for read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${QUALITY_READ2_FIRST_CYCLE1} - Mean phred score for the first cycle of read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ${QUALITY_READ2_LAST_CYCLE1} - Mean phred score for the last cycle of read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
#Output complete list of variables used in the run
echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo THE FOLLOWING VARIABLES WERE GENERATED DURING THE ANALYSIS OF THIS SAMPLE: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COMPRESSION_QUERY=\"${COMPRESSION_QUERY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COMPRESSION_TYPE=\"${COMPRESSION_TYPE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "TOPHAT_QUERY_GTF=\"${TOPHAT_QUERY_GTF}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "TOPHAT_GTF_SHORT=\"${TOPHAT_GTF_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "CUFFLINKS_QUERY_LIMIT_GTF=\"${CUFFLINKS_QUERY_LIMIT_GTF}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "CUFFLINKS_QUERY_GUIDE_GTF=\"${CUFFLINKS_QUERY_GUIDE_GTF}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "LIMIT_GTF_SHORT=\"${LIMIT_GTF_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "GUIDE_GTF_SHORT=\"${GUIDE_GTF_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "CUFFLINKS_QUERY_MASK=\"${CUFFLINKS_QUERY_MASK}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "MASK_SHORT=\"${MASK_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "GENESOFINTEREST_QUERY=\"${GENESOFINTEREST_QUERY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "FINAL_CHECK=\"${FINAL_CHECK}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "TOPHAT_TAG=\"${TOPHAT_TAG}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "CUFFLINKS_TAG=\"${CUFFLINKS_TAG}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "SAMPLE_NAME=\"${SAMPLE_NAME}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COUNT_READ1=\"${COUNT_READ1}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COUNT_READ2=\"${COUNT_READ2}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "INNER_DISTANCE=\"${INNER_DISTANCE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "STDEV=\"${STDEV}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "TOPHAT_COMMAND=\"${TOPHAT_COMMAND}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "CUFFLINKS_COMMAND=\"${CUFFLINKS_COMMAND}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo THE FOLLOWING VARIABLES WERE LOADED FROM THE RNAseq_MensFormalWear_Parameters.ini DURING THE ANALYSIS OF THIS SAMPLE: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo They may or may not have been used depending on your query respones
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "LAUNCH_DIR=\"${LAUNCH_DIR}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ_EXT=\"${READ_EXT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ1_ID=\"${READ1_ID}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ2_ID=\"${READ2_ID}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COMBINED_READ_LENGTH=\"${COMBINED_READ_LENGTH}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "CPU_CORES=\"${CPU_CORES}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "GENOME_NAME=\"${GENOME_NAME}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "REFERENCE_FILE=\"${REFERENCE_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "BWA_GENOME_FILE=\"${BWA_GENOME_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ1_BWA_FASTQ=\"${READ1_BWA_FASTQ}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ2_BWA_FASTQ=\"${READ2_BWA_FASTQ}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ1_BWA_SAI=\"${READ1_BWA_SAI}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ2_BWA_SAI=\"${READ2_BWA_SAI}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "BWA_SAMPE=\"${BWA_SAMPE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "SAMTOOLS_SORT_ID=\"${SAMTOOLS_SORT_ID}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "BOWTIE_INDEX=\"${BOWTIE_INDEX}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "GTF_FILE=\"${GTF_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "READ_QUALITY=\"${READ_QUALITY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "LIBRARY_FORMAT=\"${LIBRARY_FORMAT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "MAX_INSERT_SIZE=\"${MAX_INSERT_SIZE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "MAX_DELETION_SIZE=\"${MAX_DELETION_SIZE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "MASK_FILE=\"${MASK_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "TEMP_DIRECTORY=\"${TEMP_DIRECTORY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "REFFLAT_FILE=\"${REFFLAT_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "RIBOSOME_LIST=\"${RIBOSOME_LIST}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "GENES_OF_INTEREST=\"${GENES_OF_INTEREST}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COMPLETED_RNASEQ_READ_FILE_DIRECTORY=\"${COMPLETED_RNASEQ_READ_FILE_DIRECTORY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo "COMPLETED_RNASEQ_RESULTS_DIRECTORY=\"${COMPLETED_RNASEQ_RESULTS_DIRECTORY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt
echo Finished Creating the Summary File | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
##############################################################
## Clean up analysis directory and Move files to Archive ##
##############################################################
echo Starting to Clean up Analysis Directory | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log
#Remove temp files created during calculations steps
rm run_qualities.txt
rm read1_qualities.txt
rm read2_qualities.txt
#Move tophat outputs and metric files to archive directory
mv ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
mv ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam.bai ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
mv *.bed ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
mv *.pdf ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
mv *.txt ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
#Move cufflinks outputs to archive directory
cd ../cufflinks_out
#Current Directory=ngs/tophat/SampleX/cufflinks_out
mv *.expr ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
mv *.gtf ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
#Move fastq, ini, and log file to archive directories
cd ../
#Current Directory=ngs/tophat/SampleX
mv ${SAMPLE_NAME}${READ1_ID} ${COMPLETED_RNASEQ_READ_FILE_DIRECTORY}
mv ${SAMPLE_NAME}${READ2_ID} ${COMPLETED_RNASEQ_READ_FILE_DIRECTORY}
echo Finished Cleaning up Analysis Directory | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log
mv *.log ${COMPLETED_RNASEQ_RESULTS_DIRECTORY}
#Move back to launch directory to start next sample in batch
cd ../
#Current Directory=ngs/tophat/
#Remove Custom analysis directory and none required files
rm -rf ${SAMPLE_NAME}
done
Code:
## RNAseq Common parameters file ## # Use this variable to indicate the expected launch directory # (Update as appropriate) LAUNCH_DIR=/Users/jkeats/ngs/tophat ################################################################################## ## ## ## Common Parameters used at multiple steps or used to name output files ## ## (Update as appropriate) ## ## ## ################################################################################## #sequence read file extension READ_EXT=".fastq.gz" #forward or read1 identifier READ1_ID="_1.fastq.gz" #reverse or read2 identifier READ2_ID="_2.fastq.gz" #What is the combined read length of the two read pairs (ie. 50x50=100, 100x100=200) COMBINED_READ_LENGTH="100" #How many cores/processors are available for the multi-threaded phases of BWA, TOPHAT, CUFFLINKS CPU_CORES="1" #This is used to rename result files to include the genome version GENOME_NAME=hg19 #Provide the path and full file name to the genome reference file used to create the bowtie index REFERENCE_FILE=/Users/jkeats/ngs/refgenomes/bowtie_indexed/igenomes_hg19/genome.fa ################################################################################## ## ## ## Common BWA parameters (Only the BWA_GENOME_FILE needs to be updated) ## ## ## ################################################################################## BWA_GENOME_FILE="/Users/jkeats/ngs/refgenomes/bwa_indexed/GRCh37_E60cDNA.fa" READ1_BWA_FASTQ="_1_bwa.fastq" READ2_BWA_FASTQ="_2_bwa.fastq" READ1_BWA_SAI="_1_bwa.sai" READ2_BWA_SAI="_2_bwa.sai" BWA_SAMPE="_bwa_sampe" ################################################################################## ## ## ## Common Samtools parameters (Shouldn't need to be modified) ## ## ## ################################################################################## SAMTOOLS_SORT_ID="_sorted" ################################################################################## ## ## ## Common Tophat Parameters (Update as appropriate) ## ## ## ################################################################################## #Bowtie index to align data against [ <filename> ] BOWTIE_INDEX="/Users/jkeats/ngs/refgenomes/bowtie_indexed/igenomes_hg19/genome" #GTF file to use for LIMITED ABUNDANCE ESTIMATION (-G/--GTF) or GUIDED ABUNDANCE ESTIMATION ( -g/--GTF-guide) GTF_FILE="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/hg19_genes.gtf" #Quality type [ --solexa-quals OR --solexa1.3-quals (same as phred64-quals) OR --phred64-quals (same as solexa1.3-quals) ] READ_QUALITY="--solexa1.3-quals" #Library type [ fr-unstranded OR fr-firststrand OR fr-secondstrand) LIBRARY_FORMAT="fr-unstranded" #Maximum insertion size allow IF allowing indels [ --max-insertion-length <int> default: 3 ] MAX_INSERT_SIZE="3" #Maximum deletion size allowed IF allowing indels [ --max-deletion-length <int> default: 3 ] MAX_DELETION_SIZE="3" ################################################################################## ## ## ## Common Cufflinks Parameters (Update as appropriate) ## ## ## ################################################################################## #This should be used if using an exclusion file for cufflinks abundance estimates MASK_FILE="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/GRCh37_E60_CufflinksExcludedTranscripts.gtf" ################################################################################## ## ## ## Common Picard Parameters (Update as appropriate) ## ## ## ################################################################################## #Provide a directory for temporary files (Not always needed but on multi-harddrive systems JAVA may use an unexpected directory) TEMP_DIRECTORY="/Users/jkeats/ngs/refgenomes/temp" #This is a refFlat format table outlining the transcript space of the genome analyzed REFFLAT_FILE="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/hg19_refFlat.txt" #This is the ribosome exclusion file, a list of ribosome gene locations to report the percentage of reads related to ribosomal transcripts RIBOSOME_LIST="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/hg19_ribosome_gene_locations.txt" ################################################################################## ## ## ## Gene List of Interest (Update as appropriate) ## ## ## ################################################################################## #List of genes to extract GENES_OF_INTEREST="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/TC_Class_Genes.txt" ################################################################################## ## ## ## Final Directories (Update as appropriate) ## ## ## ################################################################################## #Provide the path to the folder were you want to archive the read files COMPLETED_RNASEQ_READ_FILE_DIRECTORY="/Users/jkeats/ngs/analyzed_read_files/rnaseq/" #Provide the path to the folder were you want to archive the read files COMPLETED_RNASEQ_RESULTS_DIRECTORY="/Users/jkeats/ngs/finaloutputs/rnaseq/"

Comment