
  • Novoalign MPI homopolymer filter

    Hi,

    I ran a sample through novoalign (# novoalign (2.06.09MT - Jun 16 2010 @ 12:36:05)) and the mapping stats were as follows:
    # Paired Reads: 15258295
    # Pairs Aligned: 13278014
    # Read Sequences: 30516590
    # Aligned: 30005116
    # Unique Alignment: 27625593
    # Gapped Alignment: 321486
    # Quality Filter: 108560
    # Homopolymer Filter: 64
    # Elapsed Time: 5472,836s

    I then ran the same sample through the MPI version of novoalign (# novoalignMPI (V2.07.11 - Build May 27 2011 @ 15:31:23)), on a different computational cluster, and got the following stats:
    # Paired Reads: 15258295
    # Pairs Aligned: 13138914
    # Read Sequences: 30516590
    # Aligned: 29643668
    # Unique Alignment: 27110885
    # Gapped Alignment: 258400
    # Quality Filter: 222384
    # Homopolymer Filter: 2105
    # Elapsed Time: 881.205 (sec.)
    # CPU Time: 545.9 (min.)


    The number of sequences aligned is lower, but in general the values are similar, except for the homopolymer filter, which is quite different: 64 versus 2105.

    Can anyone tell me...
    what is an expected number for the homopolymer filter?
    Should I be worried that the numbers are so different?
    Does it seem right that fewer sequences aligned or should I expect exactly the same numbers?
    Is this likely to be due to different versions of novoalign?
    or the single versus multithreaded MPI version?

    I'd be glad of any input.
    Thanks,
    Jane
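For readers who want to compare the two logs side by side, here is a minimal Python sketch that parses the `# Key: value` lines quoted above and reports each count as a share of the read sequences. The helper names `parse_stats` and `compare` are made up for illustration and are not part of Novoalign; the numbers are copied from the two runs in this post.

```python
import re

# Stats copied from the post above (non-MPI run, then MPI run).
SINGLE = """
# Paired Reads: 15258295
# Pairs Aligned: 13278014
# Read Sequences: 30516590
# Aligned: 30005116
# Unique Alignment: 27625593
# Gapped Alignment: 321486
# Quality Filter: 108560
# Homopolymer Filter: 64
"""

MPI = """
# Paired Reads: 15258295
# Pairs Aligned: 13138914
# Read Sequences: 30516590
# Aligned: 29643668
# Unique Alignment: 27110885
# Gapped Alignment: 258400
# Quality Filter: 222384
# Homopolymer Filter: 2105
"""


def parse_stats(text):
    """Collect '# Key: integer' lines into a dict; other lines are skipped."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*#\s*([^:]+):\s*(\d+)\s*$", line)
        if m:
            stats[m.group(1).strip()] = int(m.group(2))
    return stats


def compare(a, b, total_key="Read Sequences"):
    """Return {key: (a, b, b - a, a as % of reads, b as % of reads)}."""
    rows = {}
    for key in a:
        if key in b:
            rows[key] = (
                a[key],
                b[key],
                b[key] - a[key],
                100.0 * a[key] / a[total_key],
                100.0 * b[key] / b[total_key],
            )
    return rows


single = parse_stats(SINGLE)
mpi = parse_stats(MPI)
table = compare(single, mpi)

for key, (x, y, diff, px, py) in table.items():
    print(f"{key:20s} {x:>10d} {y:>10d} {diff:>+9d} {px:8.4f}% {py:8.4f}%")
```

Expressed as a percentage of the 30,516,590 read sequences, the homopolymer counts are a very small share in both runs, which puts the 64 versus 2105 gap in context.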

  • #2
    Hi Jane,

    There are a few things that might cause slightly different results. First would be the setting of insert size & standard deviation. In Novoalign this is used to set initial limits, and as more reads are processed the actual distribution of insert lengths is used. With MPI each process maintains its own fragment length table, so there might be small differences and it will take longer for the actual distribution to take effect.
    Also, if you use quality calibration the MPI processes each maintain their own mismatch counts so quality calibration may be slightly different and will take longer to kick in.
    With regard to the homopolymer filter and quality filter, reads are first identified as homopolymer and/or having low quality bases. This will stop them being used in the first single end phase of alignment; however, they will still be used in the paired end search if the mate was successfully mapped. If this results in a proper pair then the read won't be counted as homopolymer or low quality.
    I'd like to see your command line and also the insert size reported by novoalign. The differences should be reduced if you set the -i option more accurately.
    There's no need to be concerned about the differences, other than to check that -i was set at least such that mean + 6 std dev is sufficient to cover your fragments.
    The actual alignment code is identical between the different versions of Novoalign; the differences all relate to the fragment length distribution and the quality calibration function.
    You can remove quality calibration differences by first running a sample of reads (say 100K), saving the table using -K <qcal.csv>, and then using this table in subsequent runs with -k <qcal.csv>.

    Colin


    • #3
      Hi,

      Thanks for the reply. I feel more comfortable with the data now.

      My command line is:
      #mpiexec -f hostfile -n $nprocs -launcher rsh -iface ib0 $run_exe \
      mpiexec -f ibhostfile -n $nprocs $run_exe --mmapoff \
      -d /temp/EXOME_DATA/REF_GENOMES/HG18/hg18.nix \
      -f /temp/EXOME_DATA//RESULTS/03/FASTQ/WTCHG_22039_06_1_sequence.txt.gz /temp/EXOME_DATA//RESULTS/03/FASTQ/WTCHG_22039_06_2_sequence.txt.gz \
      -F ILMFQ -i 200 30 -o SAM -o SoftClip -k -a -g 65 -x 7 \
      > SOTON0003a_aligned.sam 2> SOTON0003a_mapping.stats


      I don't know where the insert size is output...

      Jane
      Last edited by jgSoton; 10-04-2011, 06:02 AM. Reason: too much info in file path


      • #4
        Hi Jane,

        The insert size will be reported near the end of the log file, SOTON0003a_mapping.stats

        Colin


        • #5
          Ahh,

          # Mean 201, Std Dev 53.7

          Jane


          • #6
            As you used -i 200 30, the range of fragment lengths for proper pairs would be 0 to 480. It should be OK, as penalties will adjust to the actual distribution. However, a few long fragments may not have been flagged as proper pairs.
            The -k option and the -i difference will likely explain the small differences in results between the MPI and non-MPI runs.
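As a rough way to sanity-check an -i setting against the insert size reported in the log, the sketch below applies the "mean + 6 std dev" rule of thumb from reply #2 and estimates what share of fragments would exceed a given length. It assumes fragment lengths are roughly normally distributed, which is a simplification, and it does not reproduce how Novoalign derives its internal proper-pair range (the 0 to 480 figure above is taken as stated). The function names are made up for illustration.

```python
import math


def upper_limit(mean, sd, k=6.0):
    """Rule-of-thumb upper fragment length: mean + k standard deviations."""
    return mean + k * sd


def normal_tail(x, mean, sd):
    """P(X > x) for a normal distribution (an assumption, not Novoalign's model)."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))


# Values from this thread: -i 200 30 on the command line,
# and "Mean 201, Std Dev 53.7" reported in the log.
set_mean, set_sd = 200.0, 30.0
obs_mean, obs_sd = 201.0, 53.7

limit_from_setting = upper_limit(set_mean, set_sd)    # mean + 6 sd of the -i setting
limit_from_observed = upper_limit(obs_mean, obs_sd)   # mean + 6 sd of the observed values

print("mean + 6 sd from -i setting:      ", limit_from_setting)
print("mean + 6 sd from observed values: ", round(limit_from_observed, 1))

# Estimated share of fragments longer than each candidate limit,
# under the normal assumption and the observed mean / std dev.
for limit in (limit_from_setting, 480.0):
    share = normal_tail(limit, obs_mean, obs_sd)
    print(f"estimated share of fragments > {limit:.0f}: {share:.2e}")
```

If the observed standard deviation is much larger than the one given to -i, as it is here, passing values closer to the observed ones on the next run is in line with the advice in reply #2.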
