  • jgSoton
    Member
    • Sep 2011
    • 12

    Novoalign MPI homopolymer filter

    Hi,

    I ran a sample through novoalign (# novoalign (2.06.09MT - Jun 16 2010 @ 12:36:05)) and the mapping stats were as follows:
    # Paired Reads: 15258295
    # Pairs Aligned: 13278014
    # Read Sequences: 30516590
    # Aligned: 30005116
    # Unique Alignment: 27625593
    # Gapped Alignment: 321486
    # Quality Filter: 108560
    # Homopolymer Filter: 64
    # Elapsed Time: 5472,836s

    I then ran the same sample through the MPI version of novoalign (# novoalignMPI (V2.07.11 - Build May 27 2011 @ 15:31:23), on a different computational cluster) and got the following stats:
    # Paired Reads: 15258295
    # Pairs Aligned: 13138914
    # Read Sequences: 30516590
    # Aligned: 29643668
    # Unique Alignment: 27110885
    # Gapped Alignment: 258400
    # Quality Filter: 222384
    # Homopolymer Filter: 2105
    # Elapsed Time: 881.205 (sec.)
    # CPU Time: 545.9 (min.)


    The number of sequences aligned is lower, but in general the values are similar, except for the homopolymer filter, which is quite different: 64 versus 2105.

    Can anyone tell me:
    What is an expected number for the homopolymer filter?
    Should I be worried that the numbers are so different?
    Does it seem right that fewer sequences aligned, or should I expect exactly the same numbers?
    Is this likely to be due to the different versions of novoalign,
    or the single versus multithreaded MPI version?

    I'd be glad of any input.
    Thanks,
    Jane
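
To put the two summaries on a common scale, the counts quoted above can be expressed as percentages of the 30516590 read sequences. This is only a rough sketch: the dictionary keys and the helper name `as_percent_of_reads` are made up for illustration, and the numbers are copied from the two logs in this post.

```python
# Counts copied from the two Novoalign log summaries quoted in the post.
runs = {
    "novoalign 2.06.09MT": {
        "reads": 30516590,
        "aligned": 30005116,
        "unique": 27625593,
        "quality_filter": 108560,
        "homopolymer_filter": 64,
    },
    "novoalignMPI 2.07.11": {
        "reads": 30516590,
        "aligned": 29643668,
        "unique": 27110885,
        "quality_filter": 222384,
        "homopolymer_filter": 2105,
    },
}


def as_percent_of_reads(stats):
    """Express every count except 'reads' as a percentage of total reads."""
    total = stats["reads"]
    return {key: 100.0 * value / total
            for key, value in stats.items() if key != "reads"}


percentages = {name: as_percent_of_reads(stats) for name, stats in runs.items()}

for name, pct in percentages.items():
    print(name)
    for key, value in pct.items():
        print(f"  {key:20s} {value:8.4f} %")
```

On these counts the two runs align roughly 98.3% and 97.1% of reads, and the homopolymer filter accounts for less than 0.01% of reads in either run.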
  • sparks
    Senior Member
    • Mar 2008
    • 126

    #2
    Hi Jane,

    There are a few things that might cause slightly different results. First would be the setting of insert size & standard deviation. In Novoalign this is used to set initial limits, and as more reads are processed the actual distribution of insert lengths is used. With MPI each process maintains its own fragment length table, so there might be small differences and it will take longer for the actual distribution to take effect.
    Also, if you use quality calibration the MPI processes each maintain their own mismatch counts so quality calibration may be slightly different and will take longer to kick in.
    With regard to the homopolymer filter and quality filter, reads are first identified as homopolymer and/or having low quality bases. This will stop them being used in the first, single-end phase of alignment; however, they will still be used in the paired-end search if the mate was successfully mapped. If this results in a proper pair then the read won't be counted as homopolymer or low quality.
    I'd like to see your command line and also the insert size reported by novoalign. The differences should be reduced if you set the -i option more accurately.
    There's no need to be concerned about the differences, other than to check that -i was set at least such that mean + 6 std dev is sufficient to cover your fragments.
    The actual alignment code is identical between the different versions of Novoalign; the differences all relate to the fragment length distribution and the quality calibration function.
    You can remove quality calibration differences by first running a sample of reads (say 100K), saving the table using -K <qcal.csv>, and then using this table in subsequent runs with -k <qcal.csv>.

    Colin
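
One way to prepare the "sample of reads (say 100K)" for the calibration run is to copy the first records of each FASTQ file into a smaller file. The sketch below assumes 4-line FASTQ records without wrapped sequence lines; the helper names `open_text` and `head_fastq` are hypothetical and not part of Novoalign, and taking the head of the file is just the simplest choice, not necessarily a representative sample.

```python
import gzip
from itertools import islice


def open_text(path, mode):
    """Open a plain or gzip-compressed file in text mode, based on the .gz suffix."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)


def head_fastq(src, dst, n_reads=100_000):
    """Copy the first n_reads records (4 lines each) from src to dst.

    Returns the number of complete records written.
    """
    lines = 0
    with open_text(src, "r") as fin, open_text(dst, "w") as fout:
        for line in islice(fin, 4 * n_reads):
            fout.write(line)
            lines += 1
    return lines // 4
```

For paired-end data the same `n_reads` would be applied to both files so the mates stay in step. The two sampled files would then go through one run with -K <qcal.csv>, and the full run would reuse the saved table with -k <qcal.csv>, as described above.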

    • jgSoton
      Member
      • Sep 2011
      • 12

      #3
      Hi,

      Thanks for the reply. I feel more comfortable with the data now.

      My command line is:
      #mpiexec -f hostfile -n $nprocs -launcher rsh -iface ib0 $run_exe \
      mpiexec -f ibhostfile -n $nprocs $run_exe --mmapoff \
      -d /temp/EXOME_DATA/REF_GENOMES/HG18/hg18.nix \
      -f /temp/EXOME_DATA//RESULTS/03/FASTQ/WTCHG_22039_06_1_sequence.txt.gz /temp/EXOME_DATA//RESULTS/03/FASTQ/WTCHG_22039_06_2_sequence.txt.gz \
      -F ILMFQ -i 200 30 -o SAM -o SoftClip -k -a -g 65 -x 7 \
      > SOTON0003a_aligned.sam 2> SOTON0003a_mapping.stats


      I don't know where the insert size is output...

      Jane
      Last edited by jgSoton; 10-04-2011, 06:02 AM. Reason: too much info in file path

      • sparks
        Senior Member
        • Mar 2008
        • 126

        #4
        Hi Jane,

        The insert size will be reported near the end of the log file, SOTON0003a_mapping.stats.

        Colin

        • jgSoton
          Member
          • Sep 2011
          • 12

          #5
          Ahh,

          # Mean 201, Std Dev 53.7

          Jane

          • sparks
            Senior Member
            • Mar 2008
            • 126

            #6
            As you used -i 200 30, the range of fragment lengths for proper pairs would be 0 to 480. It should be OK, as penalties will adjust to the actual distribution. However, a few long fragments may not have been flagged as proper pairs.
            The -k option and the -i difference will likely explain the small differences in results between the MPI and non-MPI runs.
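
As a rough check of the point about long fragments, the 0 to 480 range quoted above can be compared with the "mean + 6 std dev" guideline from earlier in the thread, applied to the distribution Novoalign actually reported (mean 201, std dev 53.7). This is a sketch only: the function names are illustrative, and the 480 limit is taken as given from this post rather than derived.

```python
def upper_bound(mean, sd, n_sd=6):
    """Fragment length reached at n_sd standard deviations above the mean."""
    return mean + n_sd * sd


def sd_units_above_mean(limit, mean, sd):
    """How many standard deviations a given limit lies above the mean."""
    return (limit - mean) / sd


# Values reported in the thread.
observed_mean, observed_sd = 201.0, 53.7   # from the log file (post #5)
stated_upper_limit = 480.0                 # proper-pair range quoted in post #6

needed = upper_bound(observed_mean, observed_sd)
z = sd_units_above_mean(stated_upper_limit, observed_mean, observed_sd)

print(f"mean + 6 std dev of observed fragments: {needed:.1f}")
print(f"stated limit {stated_upper_limit:.0f} is {z:.2f} std dev above the observed mean")
```

With these values the observed distribution would need a limit of about 523 to reach six standard deviations, while 480 sits a little over five standard deviations above the observed mean, which fits the remark that only a few long fragments may have missed the proper-pair flag.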
