QA of 454/FLX runs

maasha

Senior Member

Join Date: Apr 2009
Posts: 153

11-15-2010, 02:16 AM

Hi all,

In order to perform some post-run QA analysis I wrote a hack that creates a QA report which gives a quick overview. The report include read stats (number of reads, min, max, and mean lengths), plots of lengths and quality scores with and without clipping, a table of located MID tag stats, and a table of nucleotide frequencies for the first 50 read positions. Here is an example of such a report:

Code:

QA 454 Report
=============



Date: Tue Nov 9 16:03:58 CET 2010

File: /home/data/R_2010_11_05_07_48_05_FLX05080367_adminrig_101105Rahsidwaleed/D_2010_11_05_21_11_20_node1_signalProcessing/sff/GQNPAFI01.sff


Sequence analysis
-----------------


The below table contains some basic info:

COUNT is the number of sequences in the file.
MIN is the minimum sequence length found.
MAX is the maximum sequence length found.
MEAN is the mean sequence length found.
SUM is the total number of bases in the file.

KEY              SEQ
TYPE            alph
COUNT         606265
MIN               50
MAX             1200
SUM     314496570.00
MEAN          518.74


Sequence composition
--------------------


The below table contains composition analysis of the sequences:

GC%_MEAN is the mean GC content.
HARD_MASK%_MEAN is the mean of hard masked sequence (i.e. % of N's).
SOFT_MASK%_MEAN is the mean of soft masked sequence (i.e. lowercase residues = clipped sequence).

#GC%_MEAN       HARD_MASK%_MEAN SOFT_MASK%_MEAN
58.85   0.41    26.26


Sequence length distribution
----------------------------


The length distribution of unclipped reads where the lengths are binned in buckets of size 50:


                            Length Distribution - unclipped
  Count
    250000 ++-------------------------*------------------------------------++
           |                          *                                     |
           |                          *                                     |
    200000 ++                      *  *                                    ++
           |                       *  *                                     |
           |                       *  *                                     |
           |                       *  *                                     |
    150000 ++                      *  *                                    ++
           |                       *  *                                     |
           |                       *  *                                     |
    100000 ++                      *  *                                    ++
           |                       *  *                                     |
           |                       *  *  *                                  |
           |                       *  *  *                                  |
     50000 ++                      *  *  *                                 ++
           |                       *  *  *  *                               |
           |                     * *  *  *  * *                             |
         0 ******************************************************************
           +    +     +    +     +    +     +    +    +     +    +     +    +
           0   100   200  300   400  500   600  700  800   900  1000  1100 1200
                                  50 nucleotide bins


The length distribution of clipped reads where the lengths are binned in buckets of size 50:


                             Length Distribution - clipped
  Count
    180000 ++--------------------------------------------------------------++
           |                                                                |
    160000 ++                                *                             ++
           |                                 *                              |
    140000 ++                                *                             ++
           |                              *  *                              |
    120000 ++                             *  *                             ++
           |                              *  *                              |
    100000 ++                             *  *                             ++
     80000 ++                         *   *  *                             ++
           |                          *   *  *                              |
     60000 ++                     *   *   *  *                             ++
           |                      *   *   *  *                              |
     40000 ++                 *   *   *   *  *                             ++
           |              *   *   *   *   *  *   *                          |
     20000 ++  *   *   *  *   *   *   *   *  *   *                         ++
           |   *   *   *  *   *   *   *   *  *   *                          |
         0 ******************************************************************
           +   +   +   +  +   +   +   +   +  +   +   +   +   +   +  +   +   +
           0   50 100 150200 250 300 350 400450 500 550 600 650 700750 800 850
                                  50 nucleotide bins



Quality score means
-------------------


The mean scores of the unclipped sequences:


                                  Quality Scores
  Score
    40 ++------------------------------------------------------------------++
       **                                                                   |
       |*****                                                               |
    35 ++   ******                                                         ++
       |         ***                                                        |
       |           ****                                                     |
    30 ++             ***                                                  ++
       |                **                                                  |
       |                  **                                                |
    25 ++                  **                                              ++
       |                    ***                                             |
       |                      **                                            |
    20 ++                      *                                           ++
       |                       **                                           |
       |                         **                                         |
    15 ++                         ***                                      ++
       |                            ****                                    *
       |                               **************************************
    10 ++----------+----------+-----------+----------+-----------+----*******
       0          200        400         600        800         1000       1200
                                  Sequence length


The mean scores of the clipped sequences:


                                  Quality Scores
  Score
    40 ++------------------------------------------------------------------++
       **                                                                   |
       |***********                                                         |
    35 ++         ******                                                   ++
       |               ******                                               |
       |                     ****                                           |
    30 ++                       ***                                        ++
       |                          *****                                     |
       |                              **               * *                  |
    25 ++                              *               ***                 ++
       |                               ******          ***   *         ***  |
       |                                    ****     * ****  *        ****  |
    20 ++                                      ****  * ** *  **       ***  ++
       |                                         ***** *  *  **      ****   |
       |                                           *****  *****  ** ***     |
    15 ++                                            **   *****  ** ***    ++
       |                                              *    **** *** ***     |
       |                                              *    ************     |
    10 ++------+------+-------+-------+------+-------+-------+------+------++
       0      100    200     300     400    500     600     700    800     900
                                  Sequence length


Histogram of bins with mean quality scores:


                                    Mean score bins
  Count
    160000 ++--------------------------------------------------------------++
           |                                        ***     ***             |
    140000 ++                                       * *     * *            ++
           |                                        * *     * *     ****    |
    120000 ++                                       * *     * *     *  *   ++
           |                                ***     * *     * *     *  *    |
           |                                * *     * *     * *     *  *    |
    100000 ++                               * *     * *     * *     *  *   ++
           |                                * *     * *     * *     *  *    |
     80000 ++                               * *     * *     * *     *  *   ++
           |                                * *     * *     * *     *  *    |
     60000 ++                               * *     * *     * *     *  *   ++
           |                       ****     * *     * *     * *     *  *    |
     40000 ++                      *  *     * *     * *     * *     *  *   ++
           |                       *  *     * *     * *     * *     *  *    |
           |                       *  *     * *     * *     * *     *  *    |
     20000 ++                      *  *     * *     * *     * *     *  *   ++
           |                       *  *     * *     * *     * *     *  *    |
         0 ++------****----****----****-----***-----***-----***-----****---++
                   5      10      15       20      25      30      35
                                     Bins (size 5)


Number of sequences with a mean score >= 20:

RECORDS_COUNT: 552980
---


MID tag analysis
----------------


The below table contains the identified MID tags and the number of times they were found:

MID_NUM is the MID tag identifier.
MID_SEQ is the sequence of the MID tag.
TOTAL is the number of times this MID tag was found.
L250 is the a subset count of TOTAL af sequences longer than 250 bases
L250_S20 is a subset count of L250 af sequences with a mean score above 20

#MID_NUM        MID_SEQ TOTAL   L250    L250_S20
76      ACATGACGAC      10      10      5
154     ACACGACGAC      30629   30222   21538
155     ACACGTAGTA      32239   32052   26666
156     ACACTACTCG      532410  528673  490788
162     ACGTAGATCG      2       2       2


Residue frequency analysis
--------------------------


The below table contains the residue frequency (in percent) of the first 50 bases:

A       C       G       N       T
0       0       100     0       0
100     0       0       0       0
0       100     0       0       0
0       0       0       0       100
100     0       0       0       0
1       99      0       0       0
99      1       0       0       0
0       99      0       0       1
1       0       10      0       89
94      1       0       0       5
5       94      0       0       1
0       1       10      0       89
5       89      1       0       6
5       5       88      0       1
0       1       1       0       98
34      17      42      0       7
23      23      27      0       27
22      29      24      0       25
20      28      30      0       22
22      27      29      0       22
20      30      27      0       22
22      25      31      0       22
20      31      25      0       24
20      29      30      0       21
21      29      30      0       20
21      29      29      0       22
19      30      29      0       21
20      29      30      0       20
21      29      29      0       21
20      30      29      0       20
21      29      30      0       20
21      29      29      0       21
20      30      29      0       20
21      29      31      0       20
21      29      29      0       21
20      30      29      0       20
21      30      30      0       20
21      29      29      0       21
20      30      30      0       20
20      29      30      0       20
21      29      29      0       21
20      30      30      0       20
20      29      30      0       20
21      29      29      0       21
20      30      30      0       20
21      29      30      0       20
21      29      29      0       21
20      30      30      0       20
21      29      30      0       20
21      29      29      0       21

end.

The hack is written using sff_extract which is a 3rd party tool in the MIRA assembler package (http://www.chevreux.org/projects_mira.html) and Biopieces (www.biopieces.org). If anyone finds this useful, help yourself. The hack is below.

Cheers,

Martin

Code:

#!/bin/bash

sff_files=$@

if [ ! $1 ]; then
    echo "Usage: `basename $0` <sff file(s)>"
    echo
    exit
fi

function puts
{
    msg=$1

    echo $msg >> $out_file
}

function pcat
{
    file=$1

    cat $file >> $out_file
}

for sff_file in $sff_files; do
    base=`basename $sff_file .sff`

    tmp_dir="$HOME/QA_454_report_$base"

    if [ ! -d $tmp_dir ]; then
        mkdir $tmp_dir
    fi

    fq_file="$tmp_dir/$base.fq"
    xml_file="$tmp_dir/$base.xml"
    out_file="$tmp_dir/QA_454_report_$base.txt"

    if [ -f $out_file ]; then
        mv $out_file "$out_file.bak"
    fi

    analysis_vals="$tmp_dir/analysis_vals.txt"
    analysis_seqs="$tmp_dir/analysis_seqs.txt"
    plot_lendist_unclipped="$tmp_dir/plot_lendist_unclipped.txt"
    plot_lendist_clipped="$tmp_dir/plot_lendist_clipped.txt"
    plot_scores_unclipped="$tmp_dir/plot_scores_unclipped.txt"
    plot_scores_clipped="$tmp_dir/plot_scores_clipped.txt"
    plot_mean_scores="$tmp_dir/plot_mean_scores.txt"
    count_score_mean="$tmp_dir/count_score_mean.txt"
    table_mid="$tmp_dir/table_mid.tab"
    table_mid_len="$tmp_dir/table_mid_len.tab"
    table_mid_len_score="$tmp_dir/table_mid_len_score.tab"
    table_mid_join="$tmp_dir/table_mid_join.tab"
    table_freq="$tmp_dir/table_freq.tab"

    # sff_extract is a 3rd party tool from the MIRA package.
    # http://sourceforge.net/projects/mira-assembler/files/
    echo -n "Converting sff file $sff_file to FASTQ format ... "
    sff_extract --fastq $sff_file --seq_file $fq_file --xml_file $xml_file > /dev/null 2>&1
    echo "done."

    # Using Biopieces -> www.biopieces.org

    echo "" && echo "Running composition analysis on sequences ... "
    read_fastq -i $fq_file |
    progress_meter |
    analyze_vals -k SEQ -o $analysis_vals |
    analyze_seq |
    mean_vals -k 'GC%,HARD_MASK%,SOFT_MASK%' |
    grab -e 'REC_TYPE eq MEAN' |
    write_tab -ck 'GC%_MEAN,HARD_MASK%_MEAN,SOFT_MASK%_MEAN' -o $analysis_seqs -x

    echo "" && echo "Plotting length distributions and scores before and after clipping ..."
    read_fastq -i $fq_file |
    progress_meter |
    bin_vals -k SEQ_LEN -b 50 |
    plot_lendist -k SEQ_LEN_BIN -T "Length Distribution - unclipped" -X "50 nucleotide bins" -Y "Count" -o $plot_lendist_unclipped |
    plot_scores -o $plot_scores_unclipped -X "Sequence length" -Y "Score" |
    clip_seq |
    bin_vals -k SEQ_LEN -b 50 |
    plot_lendist -k SEQ_LEN_BIN -T "Length Distribution - clipped" -X "50 nucleotide bins" -Y "Count" -o $plot_lendist_clipped |
    plot_scores -o $plot_scores_clipped -X "Sequence length" -Y "Score" -x

    echo "" && echo "Plotting mean score bins and counting mean scores greater than 20 ... "
    read_fastq -i $fq_file |
    progress_meter |
    mean_scores |
    bin_vals -k SCORES_MEAN -b 5 |
    plot_histogram -s num -k SCORES_MEAN_BIN -T "Mean score bins" -X "Bins (size 5)" -Y "Count" -o $plot_mean_scores |
    grab -e 'SCORES_MEAN >= 20' |
    count_records -o $count_score_mean -x

    echo "" && echo "Locating and counting MID tags ... "
    read_fastq -i $fq_file |
    progress_meter |
    find_mids |
    write_tab -o $table_mid -c -k MID_NUM,MID_SEQ,MID_COUNT -x

    echo "" && echo "Locating and counting MID tags for sequences longer than 250 ... "
    read_fastq -i $fq_file |
    progress_meter |
    grab -e 'SEQ_LEN >= 250' |
    find_mids |
    write_tab -o $table_mid_len -c -k MID_NUM,MID_SEQ,MID_COUNT -x

    echo "" && echo "Locating and counting MID tags for sequences longer than 250 and mean score above 20 ... "
    read_fastq -i $fq_file |
    progress_meter |
    grab -e 'SEQ_LEN >= 250' |
    mean_scores |
    grab -e 'SCORES_MEAN >= 20' |
    find_mids |
    write_tab -o $table_mid_len_score -c -k MID_NUM,MID_SEQ,MID_COUNT -x

    echo "" && echo -n "Joining MID tables ... "
    read_tab -i $table_mid |
    rename_keys -k MID_NUM,A |
    rename_keys -k MID_COUNT,TOTAL |
    read_tab -i $table_mid_len |
    rename_keys -k MID_NUM,B |
    rename_keys -k MID_COUNT,L250 |
    merge_records -k A,B |
    read_tab -i $table_mid_len_score |
    rename_keys -k MID_NUM,C |
    rename_keys -k MID_COUNT,L250_S20 |
    merge_records -k A,C |
    rename_keys -k A,MID_NUM |
    sort_records -k MID_NUMn |
    write_tab -o $table_mid_join -c -k MID_NUM,MID_SEQ,TOTAL,L250,L250_S20 -x
    echo "done."

    echo "" && echo "Creating residue frequency table ... "
    read_fastq -i $fq_file |
    progress_meter |
    extract_seq -l 50 |
    uppercase_seq |
    create_weight_matrix -p |
    flip_tab |
    write_tab -o $table_freq -x

    echo "" && echo -n "Generating report ... "
    puts ""
    puts ""
    puts ""
    puts "QA 454 Report"
    puts "============="
    puts ""
    puts ""
    puts ""
    puts "Date: `date`"
    puts ""
    puts "File: `pwd`/$sff_file"
    puts ""
    puts ""
    puts "Sequence analysis"
    puts "-----------------"
    puts ""
    puts ""
    puts "The below table contains some basic info:"
    puts ""
    puts "  COUNT is the number of sequences in the file."
    puts "  MIN   is the minimum sequence length found."
    puts "  MAX   is the maximum sequence length found."
    puts "  MEAN  is the mean    sequence length found."
    puts "  SUM   is the total number of bases in the file."
    puts ""
    pcat $analysis_vals
    puts ""
    puts ""
    puts "Sequence composition"
    puts "--------------------"
    puts ""
    puts ""
    puts "The below table contains composition analysis of the sequences:"
    puts ""
    puts "  GC%_MEAN is the mean GC content."
    puts "  HARD_MASK%_MEAN is the mean of hard masked sequence (i.e. % of N's)."
    puts "  SOFT_MASK%_MEAN is the mean of soft masked sequence (i.e. lowercase residues = clipped sequence)."
    puts ""
    pcat $analysis_seqs
    puts ""
    puts ""
    puts "Sequence length distribution"
    puts "----------------------------"
    puts ""
    puts ""
    puts "The length distribution of unclipped reads where the lengths are binned in buckets of size 50:"
    puts ""
    pcat $plot_lendist_unclipped
    puts ""
    puts "The length distribution of clipped reads where the lengths are binned in buckets of size 50:"
    puts ""
    pcat $plot_lendist_clipped
    puts ""
    puts ""
    puts "Quality score means"
    puts "-------------------"
    puts ""
    puts ""
    puts "The mean scores of the unclipped sequences:"
    puts ""
    pcat $plot_scores_unclipped
    puts ""
    puts "The mean scores of the clipped sequences:"
    puts ""
    pcat $plot_scores_clipped
    puts ""
    puts "Histogram of bins with mean quality scores:"
    puts ""
    pcat $plot_mean_scores
    puts ""
    puts "Number of sequences with a mean score >= 20:"
    puts ""
    pcat $count_score_mean
    puts ""
    puts ""
    puts "MID tag analysis"
    puts "----------------"
    puts ""
    puts ""
    puts "The below table contains the identified MID tags and the number of times they were found:"
    puts ""
    puts "  MID_NUM is the MID tag identifier."
    puts "  MID_SEQ is the sequence of the MID tag."
    puts "  TOTAL is the number of times this MID tag was found."
    puts "  L250 is the a subset count of TOTAL af sequences longer than 250 bases"
    puts "  L250_S20 is a subset count of L250 af sequences with a mean score above 20"
    puts ""
    pcat $table_mid_join
    puts ""
    puts ""
    puts "Residue frequency analysis"
    puts "--------------------------"
    puts ""
    puts ""
    puts "The below table contains the residue frequency (in percent) of the first 50 bases:"
    puts ""
    pcat $table_freq
    puts ""
    puts "end."

    rm $fq_file
    rm $xml_file
    rm $analysis_vals
    rm $analysis_seqs
    rm $plot_lendist_unclipped
    rm $plot_lendist_clipped
    rm $plot_scores_unclipped
    rm $plot_scores_clipped
    rm $plot_mean_scores
    rm $count_score_mean
    rm $table_mid
    rm $table_mid_len
    rm $table_mid_len_score
    rm $table_mid_join
    rm $table_freq

    echo "done."

    echo ""
    echo "Report located here: $out_file"
    echo ""
done

echo "All done."

Tags: 454, biopieces, flx, quality assurance

Previous template Next

Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM
Strategies for Sequencing Challenging Samples

by seqadmin

Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
- Channel: Articles
03-22-2024, 06:39 AM

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

QA of 454/FLX runs

Latest Articles

ad_right_rmr

News