Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • QA of 454/FLX runs

    Hi all,

    In order to perform some post-run QA analysis I wrote a hack that creates a QA report which gives a quick overview. The report include read stats (number of reads, min, max, and mean lengths), plots of lengths and quality scores with and without clipping, a table of located MID tag stats, and a table of nucleotide frequencies for the first 50 read positions. Here is an example of such a report:

    Code:
    QA 454 Report
    =============
    
    
    
    Date: Tue Nov 9 16:03:58 CET 2010
    
    File: /home/data/R_2010_11_05_07_48_05_FLX05080367_adminrig_101105Rahsidwaleed/D_2010_11_05_21_11_20_node1_signalProcessing/sff/GQNPAFI01.sff
    
    
    Sequence analysis
    -----------------
    
    
    The below table contains some basic info:
    
    COUNT is the number of sequences in the file.
    MIN is the minimum sequence length found.
    MAX is the maximum sequence length found.
    MEAN is the mean sequence length found.
    SUM is the total number of bases in the file.
    
    KEY              SEQ
    TYPE            alph
    COUNT         606265
    MIN               50
    MAX             1200
    SUM     314496570.00
    MEAN          518.74
    
    
    Sequence composition
    --------------------
    
    
    The below table contains composition analysis of the sequences:
    
    GC%_MEAN is the mean GC content.
    HARD_MASK%_MEAN is the mean of hard masked sequence (i.e. % of N's).
    SOFT_MASK%_MEAN is the mean of soft masked sequence (i.e. lowercase residues = clipped sequence).
    
    #GC%_MEAN       HARD_MASK%_MEAN SOFT_MASK%_MEAN
    58.85   0.41    26.26
    
    
    Sequence length distribution
    ----------------------------
    
    
    The length distribution of unclipped reads where the lengths are binned in buckets of size 50:
    
    
                                Length Distribution - unclipped
      Count
        250000 ++-------------------------*------------------------------------++
               |                          *                                     |
               |                          *                                     |
        200000 ++                      *  *                                    ++
               |                       *  *                                     |
               |                       *  *                                     |
               |                       *  *                                     |
        150000 ++                      *  *                                    ++
               |                       *  *                                     |
               |                       *  *                                     |
        100000 ++                      *  *                                    ++
               |                       *  *                                     |
               |                       *  *  *                                  |
               |                       *  *  *                                  |
         50000 ++                      *  *  *                                 ++
               |                       *  *  *  *                               |
               |                     * *  *  *  * *                             |
             0 ******************************************************************
               +    +     +    +     +    +     +    +    +     +    +     +    +
               0   100   200  300   400  500   600  700  800   900  1000  1100 1200
                                      50 nucleotide bins
    
    
    The length distribution of clipped reads where the lengths are binned in buckets of size 50:
    
    
                                 Length Distribution - clipped
      Count
        180000 ++--------------------------------------------------------------++
               |                                                                |
        160000 ++                                *                             ++
               |                                 *                              |
        140000 ++                                *                             ++
               |                              *  *                              |
        120000 ++                             *  *                             ++
               |                              *  *                              |
        100000 ++                             *  *                             ++
         80000 ++                         *   *  *                             ++
               |                          *   *  *                              |
         60000 ++                     *   *   *  *                             ++
               |                      *   *   *  *                              |
         40000 ++                 *   *   *   *  *                             ++
               |              *   *   *   *   *  *   *                          |
         20000 ++  *   *   *  *   *   *   *   *  *   *                         ++
               |   *   *   *  *   *   *   *   *  *   *                          |
             0 ******************************************************************
               +   +   +   +  +   +   +   +   +  +   +   +   +   +   +  +   +   +
               0   50 100 150200 250 300 350 400450 500 550 600 650 700750 800 850
                                      50 nucleotide bins
    
    
    
    Quality score means
    -------------------
    
    
    The mean scores of the unclipped sequences:
    
    
                                      Quality Scores
      Score
        40 ++------------------------------------------------------------------++
           **                                                                   |
           |*****                                                               |
        35 ++   ******                                                         ++
           |         ***                                                        |
           |           ****                                                     |
        30 ++             ***                                                  ++
           |                **                                                  |
           |                  **                                                |
        25 ++                  **                                              ++
           |                    ***                                             |
           |                      **                                            |
        20 ++                      *                                           ++
           |                       **                                           |
           |                         **                                         |
        15 ++                         ***                                      ++
           |                            ****                                    *
           |                               **************************************
        10 ++----------+----------+-----------+----------+-----------+----*******
           0          200        400         600        800         1000       1200
                                      Sequence length
    
    
    The mean scores of the clipped sequences:
    
    
                                      Quality Scores
      Score
        40 ++------------------------------------------------------------------++
           **                                                                   |
           |***********                                                         |
        35 ++         ******                                                   ++
           |               ******                                               |
           |                     ****                                           |
        30 ++                       ***                                        ++
           |                          *****                                     |
           |                              **               * *                  |
        25 ++                              *               ***                 ++
           |                               ******          ***   *         ***  |
           |                                    ****     * ****  *        ****  |
        20 ++                                      ****  * ** *  **       ***  ++
           |                                         ***** *  *  **      ****   |
           |                                           *****  *****  ** ***     |
        15 ++                                            **   *****  ** ***    ++
           |                                              *    **** *** ***     |
           |                                              *    ************     |
        10 ++------+------+-------+-------+------+-------+-------+------+------++
           0      100    200     300     400    500     600     700    800     900
                                      Sequence length
    
    
    Histogram of bins with mean quality scores:
    
    
                                        Mean score bins
      Count
        160000 ++--------------------------------------------------------------++
               |                                        ***     ***             |
        140000 ++                                       * *     * *            ++
               |                                        * *     * *     ****    |
        120000 ++                                       * *     * *     *  *   ++
               |                                ***     * *     * *     *  *    |
               |                                * *     * *     * *     *  *    |
        100000 ++                               * *     * *     * *     *  *   ++
               |                                * *     * *     * *     *  *    |
         80000 ++                               * *     * *     * *     *  *   ++
               |                                * *     * *     * *     *  *    |
         60000 ++                               * *     * *     * *     *  *   ++
               |                       ****     * *     * *     * *     *  *    |
         40000 ++                      *  *     * *     * *     * *     *  *   ++
               |                       *  *     * *     * *     * *     *  *    |
               |                       *  *     * *     * *     * *     *  *    |
         20000 ++                      *  *     * *     * *     * *     *  *   ++
               |                       *  *     * *     * *     * *     *  *    |
             0 ++------****----****----****-----***-----***-----***-----****---++
                       5      10      15       20      25      30      35
                                         Bins (size 5)
    
    
    Number of sequences with a mean score >= 20:
    
    RECORDS_COUNT: 552980
    ---
    
    
    MID tag analysis
    ----------------
    
    
    The below table contains the identified MID tags and the number of times they were found:
    
    MID_NUM is the MID tag identifier.
    MID_SEQ is the sequence of the MID tag.
    TOTAL is the number of times this MID tag was found.
    L250 is the a subset count of TOTAL af sequences longer than 250 bases
    L250_S20 is a subset count of L250 af sequences with a mean score above 20
    
    #MID_NUM        MID_SEQ TOTAL   L250    L250_S20
    76      ACATGACGAC      10      10      5
    154     ACACGACGAC      30629   30222   21538
    155     ACACGTAGTA      32239   32052   26666
    156     ACACTACTCG      532410  528673  490788
    162     ACGTAGATCG      2       2       2
    
    
    Residue frequency analysis
    --------------------------
    
    
    The below table contains the residue frequency (in percent) of the first 50 bases:
    
    A       C       G       N       T
    0       0       100     0       0
    100     0       0       0       0
    0       100     0       0       0
    0       0       0       0       100
    100     0       0       0       0
    1       99      0       0       0
    99      1       0       0       0
    0       99      0       0       1
    1       0       10      0       89
    94      1       0       0       5
    5       94      0       0       1
    0       1       10      0       89
    5       89      1       0       6
    5       5       88      0       1
    0       1       1       0       98
    34      17      42      0       7
    23      23      27      0       27
    22      29      24      0       25
    20      28      30      0       22
    22      27      29      0       22
    20      30      27      0       22
    22      25      31      0       22
    20      31      25      0       24
    20      29      30      0       21
    21      29      30      0       20
    21      29      29      0       22
    19      30      29      0       21
    20      29      30      0       20
    21      29      29      0       21
    20      30      29      0       20
    21      29      30      0       20
    21      29      29      0       21
    20      30      29      0       20
    21      29      31      0       20
    21      29      29      0       21
    20      30      29      0       20
    21      30      30      0       20
    21      29      29      0       21
    20      30      30      0       20
    20      29      30      0       20
    21      29      29      0       21
    20      30      30      0       20
    20      29      30      0       20
    21      29      29      0       21
    20      30      30      0       20
    21      29      30      0       20
    21      29      29      0       21
    20      30      30      0       20
    21      29      30      0       20
    21      29      29      0       21
    
    end.

    The hack is written using sff_extract which is a 3rd party tool in the MIRA assembler package (http://www.chevreux.org/projects_mira.html) and Biopieces (www.biopieces.org). If anyone finds this useful, help yourself. The hack is below.



    Cheers,



    Martin



    Code:
    #!/bin/bash
    
    sff_files=$@
    
    if [ ! $1 ]; then
        echo "Usage: `basename $0` <sff file(s)>"
        echo
        exit
    fi
    
    function puts
    {
        msg=$1
    
        echo $msg >> $out_file
    }
    
    function pcat
    {
        file=$1
    
        cat $file >> $out_file
    }
    
    for sff_file in $sff_files; do
        base=`basename $sff_file .sff`
    
        tmp_dir="$HOME/QA_454_report_$base"
    
        if [ ! -d $tmp_dir ]; then
            mkdir $tmp_dir
        fi
    
        fq_file="$tmp_dir/$base.fq"
        xml_file="$tmp_dir/$base.xml"
        out_file="$tmp_dir/QA_454_report_$base.txt"
    
        if [ -f $out_file ]; then
            mv $out_file "$out_file.bak"
        fi
    
        analysis_vals="$tmp_dir/analysis_vals.txt"
        analysis_seqs="$tmp_dir/analysis_seqs.txt"
        plot_lendist_unclipped="$tmp_dir/plot_lendist_unclipped.txt"
        plot_lendist_clipped="$tmp_dir/plot_lendist_clipped.txt"
        plot_scores_unclipped="$tmp_dir/plot_scores_unclipped.txt"
        plot_scores_clipped="$tmp_dir/plot_scores_clipped.txt"
        plot_mean_scores="$tmp_dir/plot_mean_scores.txt"
        count_score_mean="$tmp_dir/count_score_mean.txt"
        table_mid="$tmp_dir/table_mid.tab"
        table_mid_len="$tmp_dir/table_mid_len.tab"
        table_mid_len_score="$tmp_dir/table_mid_len_score.tab"
        table_mid_join="$tmp_dir/table_mid_join.tab"
        table_freq="$tmp_dir/table_freq.tab"
    
        # sff_extract is a 3rd party tool from the MIRA package.
        # http://sourceforge.net/projects/mira-assembler/files/
        echo -n "Converting sff file $sff_file to FASTQ format ... "
        sff_extract --fastq $sff_file --seq_file $fq_file --xml_file $xml_file > /dev/null 2>&1
        echo "done."
    
        # Using Biopieces -> www.biopieces.org
    
        echo "" && echo "Running composition analysis on sequences ... "
        read_fastq -i $fq_file |
        progress_meter |
        analyze_vals -k SEQ -o $analysis_vals |
        analyze_seq |
        mean_vals -k 'GC%,HARD_MASK%,SOFT_MASK%' |
        grab -e 'REC_TYPE eq MEAN' |
        write_tab -ck 'GC%_MEAN,HARD_MASK%_MEAN,SOFT_MASK%_MEAN' -o $analysis_seqs -x
    
        echo "" && echo "Plotting length distributions and scores before and after clipping ..."
        read_fastq -i $fq_file |
        progress_meter |
        bin_vals -k SEQ_LEN -b 50 |
        plot_lendist -k SEQ_LEN_BIN -T "Length Distribution - unclipped" -X "50 nucleotide bins" -Y "Count" -o $plot_lendist_unclipped |
        plot_scores -o $plot_scores_unclipped -X "Sequence length" -Y "Score" |
        clip_seq |
        bin_vals -k SEQ_LEN -b 50 |
        plot_lendist -k SEQ_LEN_BIN -T "Length Distribution - clipped" -X "50 nucleotide bins" -Y "Count" -o $plot_lendist_clipped |
        plot_scores -o $plot_scores_clipped -X "Sequence length" -Y "Score" -x
    
        echo "" && echo "Plotting mean score bins and counting mean scores greater than 20 ... "
        read_fastq -i $fq_file |
        progress_meter |
        mean_scores |
        bin_vals -k SCORES_MEAN -b 5 |
        plot_histogram -s num -k SCORES_MEAN_BIN -T "Mean score bins" -X "Bins (size 5)" -Y "Count" -o $plot_mean_scores |
        grab -e 'SCORES_MEAN >= 20' |
        count_records -o $count_score_mean -x
    
        echo "" && echo "Locating and counting MID tags ... "
        read_fastq -i $fq_file |
        progress_meter |
        find_mids |
        write_tab -o $table_mid -c -k MID_NUM,MID_SEQ,MID_COUNT -x
    
        echo "" && echo "Locating and counting MID tags for sequences longer than 250 ... "
        read_fastq -i $fq_file |
        progress_meter |
        grab -e 'SEQ_LEN >= 250' |
        find_mids |
        write_tab -o $table_mid_len -c -k MID_NUM,MID_SEQ,MID_COUNT -x
    
        echo "" && echo "Locating and counting MID tags for sequences longer than 250 and mean score above 20 ... "
        read_fastq -i $fq_file |
        progress_meter |
        grab -e 'SEQ_LEN >= 250' |
        mean_scores |
        grab -e 'SCORES_MEAN >= 20' |
        find_mids |
        write_tab -o $table_mid_len_score -c -k MID_NUM,MID_SEQ,MID_COUNT -x
    
        echo "" && echo -n "Joining MID tables ... "
        read_tab -i $table_mid |
        rename_keys -k MID_NUM,A |
        rename_keys -k MID_COUNT,TOTAL |
        read_tab -i $table_mid_len |
        rename_keys -k MID_NUM,B |
        rename_keys -k MID_COUNT,L250 |
        merge_records -k A,B |
        read_tab -i $table_mid_len_score |
        rename_keys -k MID_NUM,C |
        rename_keys -k MID_COUNT,L250_S20 |
        merge_records -k A,C |
        rename_keys -k A,MID_NUM |
        sort_records -k MID_NUMn |
        write_tab -o $table_mid_join -c -k MID_NUM,MID_SEQ,TOTAL,L250,L250_S20 -x
        echo "done."
    
        echo "" && echo "Creating residue frequency table ... "
        read_fastq -i $fq_file |
        progress_meter |
        extract_seq -l 50 |
        uppercase_seq |
        create_weight_matrix -p |
        flip_tab |
        write_tab -o $table_freq -x
    
        echo "" && echo -n "Generating report ... "
        puts ""
        puts ""
        puts ""
        puts "QA 454 Report"
        puts "============="
        puts ""
        puts ""
        puts ""
        puts "Date: `date`"
        puts ""
        puts "File: `pwd`/$sff_file"
        puts ""
        puts ""
        puts "Sequence analysis"
        puts "-----------------"
        puts ""
        puts ""
        puts "The below table contains some basic info:"
        puts ""
        puts "  COUNT is the number of sequences in the file."
        puts "  MIN   is the minimum sequence length found."
        puts "  MAX   is the maximum sequence length found."
        puts "  MEAN  is the mean    sequence length found."
        puts "  SUM   is the total number of bases in the file."
        puts ""
        pcat $analysis_vals
        puts ""
        puts ""
        puts "Sequence composition"
        puts "--------------------"
        puts ""
        puts ""
        puts "The below table contains composition analysis of the sequences:"
        puts ""
        puts "  GC%_MEAN is the mean GC content."
        puts "  HARD_MASK%_MEAN is the mean of hard masked sequence (i.e. % of N's)."
        puts "  SOFT_MASK%_MEAN is the mean of soft masked sequence (i.e. lowercase residues = clipped sequence)."
        puts ""
        pcat $analysis_seqs
        puts ""
        puts ""
        puts "Sequence length distribution"
        puts "----------------------------"
        puts ""
        puts ""
        puts "The length distribution of unclipped reads where the lengths are binned in buckets of size 50:"
        puts ""
        pcat $plot_lendist_unclipped
        puts ""
        puts "The length distribution of clipped reads where the lengths are binned in buckets of size 50:"
        puts ""
        pcat $plot_lendist_clipped
        puts ""
        puts ""
        puts "Quality score means"
        puts "-------------------"
        puts ""
        puts ""
        puts "The mean scores of the unclipped sequences:"
        puts ""
        pcat $plot_scores_unclipped
        puts ""
        puts "The mean scores of the clipped sequences:"
        puts ""
        pcat $plot_scores_clipped
        puts ""
        puts "Histogram of bins with mean quality scores:"
        puts ""
        pcat $plot_mean_scores
        puts ""
        puts "Number of sequences with a mean score >= 20:"
        puts ""
        pcat $count_score_mean
        puts ""
        puts ""
        puts "MID tag analysis"
        puts "----------------"
        puts ""
        puts ""
        puts "The below table contains the identified MID tags and the number of times they were found:"
        puts ""
        puts "  MID_NUM is the MID tag identifier."
        puts "  MID_SEQ is the sequence of the MID tag."
        puts "  TOTAL is the number of times this MID tag was found."
        puts "  L250 is the a subset count of TOTAL af sequences longer than 250 bases"
        puts "  L250_S20 is a subset count of L250 af sequences with a mean score above 20"
        puts ""
        pcat $table_mid_join
        puts ""
        puts ""
        puts "Residue frequency analysis"
        puts "--------------------------"
        puts ""
        puts ""
        puts "The below table contains the residue frequency (in percent) of the first 50 bases:"
        puts ""
        pcat $table_freq
        puts ""
        puts "end."
    
        rm $fq_file
        rm $xml_file
        rm $analysis_vals
        rm $analysis_seqs
        rm $plot_lendist_unclipped
        rm $plot_lendist_clipped
        rm $plot_scores_unclipped
        rm $plot_scores_clipped
        rm $plot_mean_scores
        rm $count_score_mean
        rm $table_mid
        rm $table_mid_len
        rm $table_mid_len_score
        rm $table_mid_join
        rm $table_freq
    
        echo "done."
    
        echo ""
        echo "Report located here: $out_file"
        echo ""
    done
    
    echo "All done."

Latest Articles

Collapse

  • seqadmin
    Current Approaches to Protein Sequencing
    by seqadmin


    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
    04-04-2024, 04:25 PM
  • seqadmin
    Strategies for Sequencing Challenging Samples
    by seqadmin


    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
    03-22-2024, 06:39 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 04-11-2024, 12:08 PM
0 responses
30 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 10:19 PM
0 responses
32 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 09:21 AM
0 responses
28 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-04-2024, 09:00 AM
0 responses
53 views
0 likes
Last Post seqadmin  
Working...
X