Hi all,
In order to perform some post-run QA analysis I wrote a hack that creates a QA report which gives a quick overview. The report include read stats (number of reads, min, max, and mean lengths), plots of lengths and quality scores with and without clipping, a table of located MID tag stats, and a table of nucleotide frequencies for the first 50 read positions. Here is an example of such a report:
The hack is written using sff_extract which is a 3rd party tool in the MIRA assembler package (http://www.chevreux.org/projects_mira.html) and Biopieces (www.biopieces.org). If anyone finds this useful, help yourself. The hack is below.
Cheers,
Martin
In order to perform some post-run QA analysis I wrote a hack that creates a QA report which gives a quick overview. The report include read stats (number of reads, min, max, and mean lengths), plots of lengths and quality scores with and without clipping, a table of located MID tag stats, and a table of nucleotide frequencies for the first 50 read positions. Here is an example of such a report:
Code:
QA 454 Report ============= Date: Tue Nov 9 16:03:58 CET 2010 File: /home/data/R_2010_11_05_07_48_05_FLX05080367_adminrig_101105Rahsidwaleed/D_2010_11_05_21_11_20_node1_signalProcessing/sff/GQNPAFI01.sff Sequence analysis ----------------- The below table contains some basic info: COUNT is the number of sequences in the file. MIN is the minimum sequence length found. MAX is the maximum sequence length found. MEAN is the mean sequence length found. SUM is the total number of bases in the file. KEY SEQ TYPE alph COUNT 606265 MIN 50 MAX 1200 SUM 314496570.00 MEAN 518.74 Sequence composition -------------------- The below table contains composition analysis of the sequences: GC%_MEAN is the mean GC content. HARD_MASK%_MEAN is the mean of hard masked sequence (i.e. % of N's). SOFT_MASK%_MEAN is the mean of soft masked sequence (i.e. lowercase residues = clipped sequence). #GC%_MEAN HARD_MASK%_MEAN SOFT_MASK%_MEAN 58.85 0.41 26.26 Sequence length distribution ---------------------------- The length distribution of unclipped reads where the lengths are binned in buckets of size 50: Length Distribution - unclipped Count 250000 ++-------------------------*------------------------------------++ | * | | * | 200000 ++ * * ++ | * * | | * * | | * * | 150000 ++ * * ++ | * * | | * * | 100000 ++ * * ++ | * * | | * * * | | * * * | 50000 ++ * * * ++ | * * * * | | * * * * * * | 0 ****************************************************************** + + + + + + + + + + + + + 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 50 nucleotide bins The length distribution of clipped reads where the lengths are binned in buckets of size 50: Length Distribution - clipped Count 180000 ++--------------------------------------------------------------++ | | 160000 ++ * ++ | * | 140000 ++ * ++ | * * | 120000 ++ * * ++ | * * | 100000 ++ * * ++ 80000 ++ * * * ++ | * * * | 60000 ++ * * * * ++ | * * * * | 40000 ++ * * * * * ++ | * * * * * * * | 20000 ++ * * * * * * * * * * ++ | * * * * * * * * * * | 0 ****************************************************************** + + + + + + + + + + + + + + + + + + 0 50 100 150200 250 300 350 400450 500 550 600 650 700750 800 850 50 nucleotide bins Quality score means ------------------- The mean scores of the unclipped sequences: Quality Scores Score 40 ++------------------------------------------------------------------++ ** | |***** | 35 ++ ****** ++ | *** | | **** | 30 ++ *** ++ | ** | | ** | 25 ++ ** ++ | *** | | ** | 20 ++ * ++ | ** | | ** | 15 ++ *** ++ | **** * | ************************************** 10 ++----------+----------+-----------+----------+-----------+----******* 0 200 400 600 800 1000 1200 Sequence length The mean scores of the clipped sequences: Quality Scores Score 40 ++------------------------------------------------------------------++ ** | |*********** | 35 ++ ****** ++ | ****** | | **** | 30 ++ *** ++ | ***** | | ** * * | 25 ++ * *** ++ | ****** *** * *** | | **** * **** * **** | 20 ++ **** * ** * ** *** ++ | ***** * * ** **** | | ***** ***** ** *** | 15 ++ ** ***** ** *** ++ | * **** *** *** | | * ************ | 10 ++------+------+-------+-------+------+-------+-------+------+------++ 0 100 200 300 400 500 600 700 800 900 Sequence length Histogram of bins with mean quality scores: Mean score bins Count 160000 ++--------------------------------------------------------------++ | *** *** | 140000 ++ * * * * ++ | * * * * **** | 120000 ++ * * * * * * ++ | *** * * * * * * | | * * * * * * * * | 100000 ++ * * * * * * * * ++ | * * * * * * * * | 80000 ++ * * * * * * * * ++ | * * * * * * * * | 60000 ++ * * * * * * * * ++ | **** * * * * * * * * | 40000 ++ * * * * * * * * * * ++ | * * * * * * * * * * | | * * * * * * * * * * | 20000 ++ * * * * * * * * * * ++ | * * * * * * * * * * | 0 ++------****----****----****-----***-----***-----***-----****---++ 5 10 15 20 25 30 35 Bins (size 5) Number of sequences with a mean score >= 20: RECORDS_COUNT: 552980 --- MID tag analysis ---------------- The below table contains the identified MID tags and the number of times they were found: MID_NUM is the MID tag identifier. MID_SEQ is the sequence of the MID tag. TOTAL is the number of times this MID tag was found. L250 is the a subset count of TOTAL af sequences longer than 250 bases L250_S20 is a subset count of L250 af sequences with a mean score above 20 #MID_NUM MID_SEQ TOTAL L250 L250_S20 76 ACATGACGAC 10 10 5 154 ACACGACGAC 30629 30222 21538 155 ACACGTAGTA 32239 32052 26666 156 ACACTACTCG 532410 528673 490788 162 ACGTAGATCG 2 2 2 Residue frequency analysis -------------------------- The below table contains the residue frequency (in percent) of the first 50 bases: A C G N T 0 0 100 0 0 100 0 0 0 0 0 100 0 0 0 0 0 0 0 100 100 0 0 0 0 1 99 0 0 0 99 1 0 0 0 0 99 0 0 1 1 0 10 0 89 94 1 0 0 5 5 94 0 0 1 0 1 10 0 89 5 89 1 0 6 5 5 88 0 1 0 1 1 0 98 34 17 42 0 7 23 23 27 0 27 22 29 24 0 25 20 28 30 0 22 22 27 29 0 22 20 30 27 0 22 22 25 31 0 22 20 31 25 0 24 20 29 30 0 21 21 29 30 0 20 21 29 29 0 22 19 30 29 0 21 20 29 30 0 20 21 29 29 0 21 20 30 29 0 20 21 29 30 0 20 21 29 29 0 21 20 30 29 0 20 21 29 31 0 20 21 29 29 0 21 20 30 29 0 20 21 30 30 0 20 21 29 29 0 21 20 30 30 0 20 20 29 30 0 20 21 29 29 0 21 20 30 30 0 20 20 29 30 0 20 21 29 29 0 21 20 30 30 0 20 21 29 30 0 20 21 29 29 0 21 20 30 30 0 20 21 29 30 0 20 21 29 29 0 21 end.
The hack is written using sff_extract which is a 3rd party tool in the MIRA assembler package (http://www.chevreux.org/projects_mira.html) and Biopieces (www.biopieces.org). If anyone finds this useful, help yourself. The hack is below.
Cheers,
Martin
Code:
#!/bin/bash sff_files=$@ if [ ! $1 ]; then echo "Usage: `basename $0` <sff file(s)>" echo exit fi function puts { msg=$1 echo $msg >> $out_file } function pcat { file=$1 cat $file >> $out_file } for sff_file in $sff_files; do base=`basename $sff_file .sff` tmp_dir="$HOME/QA_454_report_$base" if [ ! -d $tmp_dir ]; then mkdir $tmp_dir fi fq_file="$tmp_dir/$base.fq" xml_file="$tmp_dir/$base.xml" out_file="$tmp_dir/QA_454_report_$base.txt" if [ -f $out_file ]; then mv $out_file "$out_file.bak" fi analysis_vals="$tmp_dir/analysis_vals.txt" analysis_seqs="$tmp_dir/analysis_seqs.txt" plot_lendist_unclipped="$tmp_dir/plot_lendist_unclipped.txt" plot_lendist_clipped="$tmp_dir/plot_lendist_clipped.txt" plot_scores_unclipped="$tmp_dir/plot_scores_unclipped.txt" plot_scores_clipped="$tmp_dir/plot_scores_clipped.txt" plot_mean_scores="$tmp_dir/plot_mean_scores.txt" count_score_mean="$tmp_dir/count_score_mean.txt" table_mid="$tmp_dir/table_mid.tab" table_mid_len="$tmp_dir/table_mid_len.tab" table_mid_len_score="$tmp_dir/table_mid_len_score.tab" table_mid_join="$tmp_dir/table_mid_join.tab" table_freq="$tmp_dir/table_freq.tab" # sff_extract is a 3rd party tool from the MIRA package. # http://sourceforge.net/projects/mira-assembler/files/ echo -n "Converting sff file $sff_file to FASTQ format ... " sff_extract --fastq $sff_file --seq_file $fq_file --xml_file $xml_file > /dev/null 2>&1 echo "done." # Using Biopieces -> www.biopieces.org echo "" && echo "Running composition analysis on sequences ... " read_fastq -i $fq_file | progress_meter | analyze_vals -k SEQ -o $analysis_vals | analyze_seq | mean_vals -k 'GC%,HARD_MASK%,SOFT_MASK%' | grab -e 'REC_TYPE eq MEAN' | write_tab -ck 'GC%_MEAN,HARD_MASK%_MEAN,SOFT_MASK%_MEAN' -o $analysis_seqs -x echo "" && echo "Plotting length distributions and scores before and after clipping ..." read_fastq -i $fq_file | progress_meter | bin_vals -k SEQ_LEN -b 50 | plot_lendist -k SEQ_LEN_BIN -T "Length Distribution - unclipped" -X "50 nucleotide bins" -Y "Count" -o $plot_lendist_unclipped | plot_scores -o $plot_scores_unclipped -X "Sequence length" -Y "Score" | clip_seq | bin_vals -k SEQ_LEN -b 50 | plot_lendist -k SEQ_LEN_BIN -T "Length Distribution - clipped" -X "50 nucleotide bins" -Y "Count" -o $plot_lendist_clipped | plot_scores -o $plot_scores_clipped -X "Sequence length" -Y "Score" -x echo "" && echo "Plotting mean score bins and counting mean scores greater than 20 ... " read_fastq -i $fq_file | progress_meter | mean_scores | bin_vals -k SCORES_MEAN -b 5 | plot_histogram -s num -k SCORES_MEAN_BIN -T "Mean score bins" -X "Bins (size 5)" -Y "Count" -o $plot_mean_scores | grab -e 'SCORES_MEAN >= 20' | count_records -o $count_score_mean -x echo "" && echo "Locating and counting MID tags ... " read_fastq -i $fq_file | progress_meter | find_mids | write_tab -o $table_mid -c -k MID_NUM,MID_SEQ,MID_COUNT -x echo "" && echo "Locating and counting MID tags for sequences longer than 250 ... " read_fastq -i $fq_file | progress_meter | grab -e 'SEQ_LEN >= 250' | find_mids | write_tab -o $table_mid_len -c -k MID_NUM,MID_SEQ,MID_COUNT -x echo "" && echo "Locating and counting MID tags for sequences longer than 250 and mean score above 20 ... " read_fastq -i $fq_file | progress_meter | grab -e 'SEQ_LEN >= 250' | mean_scores | grab -e 'SCORES_MEAN >= 20' | find_mids | write_tab -o $table_mid_len_score -c -k MID_NUM,MID_SEQ,MID_COUNT -x echo "" && echo -n "Joining MID tables ... " read_tab -i $table_mid | rename_keys -k MID_NUM,A | rename_keys -k MID_COUNT,TOTAL | read_tab -i $table_mid_len | rename_keys -k MID_NUM,B | rename_keys -k MID_COUNT,L250 | merge_records -k A,B | read_tab -i $table_mid_len_score | rename_keys -k MID_NUM,C | rename_keys -k MID_COUNT,L250_S20 | merge_records -k A,C | rename_keys -k A,MID_NUM | sort_records -k MID_NUMn | write_tab -o $table_mid_join -c -k MID_NUM,MID_SEQ,TOTAL,L250,L250_S20 -x echo "done." echo "" && echo "Creating residue frequency table ... " read_fastq -i $fq_file | progress_meter | extract_seq -l 50 | uppercase_seq | create_weight_matrix -p | flip_tab | write_tab -o $table_freq -x echo "" && echo -n "Generating report ... " puts "" puts "" puts "" puts "QA 454 Report" puts "=============" puts "" puts "" puts "" puts "Date: `date`" puts "" puts "File: `pwd`/$sff_file" puts "" puts "" puts "Sequence analysis" puts "-----------------" puts "" puts "" puts "The below table contains some basic info:" puts "" puts " COUNT is the number of sequences in the file." puts " MIN is the minimum sequence length found." puts " MAX is the maximum sequence length found." puts " MEAN is the mean sequence length found." puts " SUM is the total number of bases in the file." puts "" pcat $analysis_vals puts "" puts "" puts "Sequence composition" puts "--------------------" puts "" puts "" puts "The below table contains composition analysis of the sequences:" puts "" puts " GC%_MEAN is the mean GC content." puts " HARD_MASK%_MEAN is the mean of hard masked sequence (i.e. % of N's)." puts " SOFT_MASK%_MEAN is the mean of soft masked sequence (i.e. lowercase residues = clipped sequence)." puts "" pcat $analysis_seqs puts "" puts "" puts "Sequence length distribution" puts "----------------------------" puts "" puts "" puts "The length distribution of unclipped reads where the lengths are binned in buckets of size 50:" puts "" pcat $plot_lendist_unclipped puts "" puts "The length distribution of clipped reads where the lengths are binned in buckets of size 50:" puts "" pcat $plot_lendist_clipped puts "" puts "" puts "Quality score means" puts "-------------------" puts "" puts "" puts "The mean scores of the unclipped sequences:" puts "" pcat $plot_scores_unclipped puts "" puts "The mean scores of the clipped sequences:" puts "" pcat $plot_scores_clipped puts "" puts "Histogram of bins with mean quality scores:" puts "" pcat $plot_mean_scores puts "" puts "Number of sequences with a mean score >= 20:" puts "" pcat $count_score_mean puts "" puts "" puts "MID tag analysis" puts "----------------" puts "" puts "" puts "The below table contains the identified MID tags and the number of times they were found:" puts "" puts " MID_NUM is the MID tag identifier." puts " MID_SEQ is the sequence of the MID tag." puts " TOTAL is the number of times this MID tag was found." puts " L250 is the a subset count of TOTAL af sequences longer than 250 bases" puts " L250_S20 is a subset count of L250 af sequences with a mean score above 20" puts "" pcat $table_mid_join puts "" puts "" puts "Residue frequency analysis" puts "--------------------------" puts "" puts "" puts "The below table contains the residue frequency (in percent) of the first 50 bases:" puts "" pcat $table_freq puts "" puts "end." rm $fq_file rm $xml_file rm $analysis_vals rm $analysis_seqs rm $plot_lendist_unclipped rm $plot_lendist_clipped rm $plot_scores_unclipped rm $plot_scores_clipped rm $plot_mean_scores rm $count_score_mean rm $table_mid rm $table_mid_len rm $table_mid_len_score rm $table_mid_join rm $table_freq echo "done." echo "" echo "Report located here: $out_file" echo "" done echo "All done."