SEQanswers

09-29-2011, 02:31 AM   #1
jgSoton
Member
 
Location: UK

Join Date: Sep 2011
Posts: 12
Novoalign MPI homopolymer filter

Hi,

I ran a sample through novoalign (# novoalign (2.06.09MT - Jun 16 2010 @ 12:36:05)) and the mapping stats were as follows:
# Paired Reads: 15258295
# Pairs Aligned: 13278014
# Read Sequences: 30516590
# Aligned: 30005116
# Unique Alignment: 27625593
# Gapped Alignment: 321486
# Quality Filter: 108560
# Homopolymer Filter: 64
# Elapsed Time: 5472,836s

I then ran the same sample through the MPI version of novoalign (# novoalignMPI (V2.07.11 - Build May 27 2011 @ 15:31:23)) on a different computational cluster and got the following stats:
# Paired Reads: 15258295
# Pairs Aligned: 13138914
# Read Sequences: 30516590
# Aligned: 29643668
# Unique Alignment: 27110885
# Gapped Alignment: 258400
# Quality Filter: 222384
# Homopolymer Filter: 2105
# Elapsed Time: 881.205 (sec.)
# CPU Time: 545.9 (min.)


The number of sequences aligned is lower, but in general the values are similar, except for the homopolymer filter, which is quite different: 64 versus 2105.

Can anyone tell me...
What is an expected number for the homopolymer filter?
Should I be worried that the numbers are so different?
Does it seem right that fewer sequences aligned, or should I expect exactly the same numbers?
Is this likely to be due to different versions of novoalign,
or the single versus multithreaded MPI version?

I'd be glad of any input.
Thanks,
Jane
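
To put the two runs side by side, here is a minimal Python sketch that parses the two stats blocks quoted in the post above and prints each counter with its relative change. The helper names `parse_stats` and `relative_change` are illustrative only and are not part of Novoalign; the timing lines are left out because their formats differ between the two versions.

```python
import re

# Counters copied from the two runs quoted in the post above.
RUN_MT = """\
# Paired Reads: 15258295
# Pairs Aligned: 13278014
# Read Sequences: 30516590
# Aligned: 30005116
# Unique Alignment: 27625593
# Gapped Alignment: 321486
# Quality Filter: 108560
# Homopolymer Filter: 64
"""

RUN_MPI = """\
# Paired Reads: 15258295
# Pairs Aligned: 13138914
# Read Sequences: 30516590
# Aligned: 29643668
# Unique Alignment: 27110885
# Gapped Alignment: 258400
# Quality Filter: 222384
# Homopolymer Filter: 2105
"""


def parse_stats(text):
    """Return {label: count} for lines shaped like '# Label: 12345'."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"#\s*(.+?):\s*(\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def relative_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


mt = parse_stats(RUN_MT)
mpi = parse_stats(RUN_MPI)

for label, old in mt.items():
    new = mpi[label]
    print(f"{label:<20}{old:>12}{new:>12}{relative_change(old, new):>+10.2f}%")

# Fraction of read sequences aligned in each run.
for name, run in (("MT", mt), ("MPI", mpi)):
    print(f"{name}: {run['Aligned'] / run['Read Sequences']:.2%} of reads aligned")
```

On these numbers the aligned fraction works out to about 98.3% for the multithreaded run and about 97.1% for the MPI run, so the homopolymer count is the only counter that differs by a large factor, and it is tiny relative to 30.5 million reads.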
10-03-2011, 05:55 PM   #2
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126

Hi Jane,

There are a few things that might cause slightly different results. First would be the setting of insert size & standard deviation. In Novoalign this is used to set initial limits, and as more reads are processed the actual distribution of insert lengths is used. With MPI each process maintains its own fragment length table, so there might be small differences and it will take longer for the actual distribution to take effect.
Also, if you use quality calibration, the MPI processes each maintain their own mismatch counts, so quality calibration may be slightly different and will take longer to kick in.
With regard to the homopolymer filter and quality filter, reads are first identified as homopolymer and/or having low quality bases. This will stop them being used in the first single end phase of alignment; however, they will still be used in the paired end search if the mate was successfully mapped. If this results in a proper pair then the read won't be counted as homopolymer or low quality.
I'd like to see your command line and also the insert size reported by novoalign. The differences should be reduced if you set the -i option more accurately.
There's no need to be concerned about the differences, other than to check that -i was set at least such that mean + 6 std dev is sufficient to cover your fragments.
The actual alignment code is identical between the different versions of Novoalign; the differences all relate to fragment length distribution and the quality calibration function.
You can remove quality calibration differences by first running a sample of reads (say 100K) and saving the table using -K <qcal.csv>, and then using it in subsequent runs with -k <qcal.csv>.

Colin
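
The two-step calibration suggested above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `take_reads` and `calibration_commands` are hypothetical helpers, the file names are placeholders, and taking the first 100K reads of a file is just one simple way to get "a sample of reads" (it may not be representative of the whole run). Only options that appear in this thread (-d, -f, -F, -K, -k) are used, and the commands are built but not executed.

```python
import gzip
from itertools import islice


def take_reads(src_gz, dst_fastq, n_reads=100_000):
    """Copy the first n_reads FASTQ records (4 lines each) from a gzipped file."""
    with gzip.open(src_gz, "rt") as src, open(dst_fastq, "w") as dst:
        dst.writelines(islice(src, 4 * n_reads))


def calibration_commands(index, sample_1, sample_2, full_1, full_2,
                         qcal="qcal.csv"):
    """Build (but do not run) the two argument lists.

    Step 1 saves the calibration table with -K on the sample;
    step 2 reuses the saved table with -k on the full data set.
    """
    common = ["-d", index, "-F", "ILMFQ"]
    save = ["novoalign", *common, "-f", sample_1, sample_2, "-K", qcal]
    reuse = ["novoalign", *common, "-f", full_1, full_2, "-k", qcal]
    return save, reuse


# Placeholder file names, purely for illustration.
save_cmd, reuse_cmd = calibration_commands(
    "ref.nix", "sample_1.fastq", "sample_2.fastq",
    "reads_1.fastq.gz", "reads_2.fastq.gz",
)
print(" ".join(save_cmd))
print(" ".join(reuse_cmd))
```

For paired-end data the same number of records would need to be taken from both files so that mates stay in step. With a fixed table passed via -k, every MPI process starts from the same calibration rather than building its own mismatch counts.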
10-04-2011, 05:23 AM   #3
jgSoton
Member
 
Location: UK

Join Date: Sep 2011
Posts: 12

Hi,

Thanks for the reply. I feel more comfortable with the data now.

My command line is:
#mpiexec -f hostfile -n $nprocs -launcher rsh -iface ib0 $run_exe \
mpiexec -f ibhostfile -n $nprocs $run_exe --mmapoff \
-d /temp/EXOME_DATA/REF_GENOMES/HG18/hg18.nix \
-f /temp/EXOME_DATA//RESULTS/03/FASTQ/WTCHG_22039_06_1_sequence.txt.gz /temp/EXOME_DATA//RESULTS/03/FASTQ/WTCHG_22039_06_2_sequence.txt.gz \
-F ILMFQ -i 200 30 -o SAM -o SoftClip -k -a -g 65 -x 7 \
> SOTON0003a_aligned.sam 2> SOTON0003a_mapping.stats


I don't know where the insert size is output...

Jane

Last edited by jgSoton; 10-04-2011 at 07:02 AM. Reason: too much info in file path
10-04-2011, 06:57 AM   #4
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126

Hi Jane,

The insert size will be reported near the end of the log file, SOTON0003a_mapping.stats.

Colin
10-04-2011, 07:01 AM   #5
jgSoton
Member
 
Location: UK

Join Date: Sep 2011
Posts: 12

Ahh,

# Mean 201, Std Dev 53.7

Jane
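
For anyone wanting to pull this value out of the log automatically, a small Python sketch follows. It assumes the line always looks like the one quoted above (`# Mean <number>, Std Dev <number>`), which may not hold for other Novoalign versions; `find_insert_size` is an illustrative name, not a Novoalign tool.

```python
import re

# Pattern modelled on the single log line quoted above; other formats are not handled.
INSERT_RE = re.compile(r"#\s*Mean\s+(\d+(?:\.\d+)?),\s*Std Dev\s+(\d+(?:\.\d+)?)")


def find_insert_size(log_text):
    """Return (mean, std_dev) from the last matching line, or None if absent."""
    matches = INSERT_RE.findall(log_text)
    if not matches:
        return None
    mean, std_dev = matches[-1]
    return float(mean), float(std_dev)


print(find_insert_size("# Mean 201, Std Dev 53.7"))  # (201.0, 53.7)
```

To use it on a real run, the text of the stats file (SOTON0003a_mapping.stats in this thread) would be read and passed in as `log_text`.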
10-04-2011, 07:18 AM   #6
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126

As you used -i 200 30, the range of fragment lengths for proper pairs would be 0 to 480. It should be OK, as penalties will adjust to the actual distribution. However, a few long fragments may not have been flagged as proper pairs.
The -k option and the -i difference will likely explain the small difference in results between the MPI and non-MPI runs.
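
As a rough check of the point about long fragments, the Python sketch below applies the "mean + 6 std dev" rule of thumb from earlier in the thread to the observed distribution (mean 201, std dev 53.7) and compares the result with the 0 to 480 range quoted above. The function name `upper_limit` is illustrative, and the comparison says nothing about how Novoalign derives its limits internally.

```python
def upper_limit(mean, std_dev, k=6.0):
    """Fragment length k standard deviations above the mean."""
    return mean + k * std_dev


observed_mean, observed_sd = 201.0, 53.7  # from the log line reported in the thread
proper_pair_max = 480                     # upper end of the range quoted above

needed = upper_limit(observed_mean, observed_sd)
z = (proper_pair_max - observed_mean) / observed_sd

print(f"mean + 6 sd for the observed distribution: {needed:.1f}")
print(f"{proper_pair_max} is {z:.2f} sd above the observed mean")
print("covered" if proper_pair_max >= needed else "not fully covered")
```

Here the observed distribution would want an upper limit of about 523, so the 480 limit falls a little short (roughly 5.2 standard deviations above the mean), which fits the remark that only a few long fragments would be affected.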