**zee** · 09-13-2008, 06:44 AM

I haven't tried this, but have you attempted to run indelpe on single-end mapping results from novoalign converted to .map format?
Novoalign's mapping qualities are not recalculated for paired-end reads, and you should see this in the `maq mapstat` output.
I think Colin will be able to shed more light on this.

Originally posted by myrna

Oh no! I was just reveling in the fact that novo2maq did set flags as paired in single end data. This glitch allowed me to run indelpe and find some very convincing indels. Not sure how many of them are real, but looking at the coverage a lot are convincing by eye. Without the ability to run indelpe, many of these sites are mistakenly called SNPs. Is there another option to pull the indels from a novoalign output? I understand the rationale that Maq only trusts indels from paired data, but I would like to get Colin's opinion about whether we can trust indels from single end reads (and if so, what mapping quality thresholds?)

Thanks,

Ryan

**myrna** · 09-13-2008, 09:57 AM

indelpe on single end data

I have done this and it seemed to work well (which was quite satisfying). I just want to know whether I can trust the resulting indels, or whether I should pre-filter the alignments at some mapping quality threshold before converting them to .map format. Do you have any sense of the sensitivity and specificity at different coverages?

Thanks.

**zee** · 09-13-2008, 10:02 AM

I have seen some papers that use MAQ filter out anything below a mapping quality of 10 before doing further analysis. With novoalign you should get good-quality matches using this sort of filter.
For assembly and SNP calling it's better to use a high quality threshold; again, anything over 10 should suffice, but I'm sure other users could add more insight.
If many of your indels are in this high-quality range then they should be reliable. You could always confirm them by other means, such as multiple sequence alignment of those regions, pileup, etc.

**myrna** · 09-13-2008, 10:16 AM

Pileup

On a separate yet related note, does anyone know what is done with flag-130 reads (gapped alignments) when a pileup file is made? It looks as if they are being included without being gapped (which makes sense since the pileup format does not have a way of representing gaps, though maybe it should?). However with the much larger number of gapped alignments in the novoalign output, this seems to be giving me problems when trying to identify SNPs from the pileup file. Has anyone else observed this?

Thanks

**sparks** · 09-14-2008, 03:35 PM

Hi myrna,
I think you can trust indel calls on single-end reads, and here's why...
I think we have to allow for them in alignment as they are real. Looking at Craig Venter's genome we can see that short indels are fairly frequent. See Table 6. http://biology.plosjournals.org/perl...50254&id=12379

Also, from an information-content point of view, a single-base indel isn't much harder to align than a single-base mismatch. Consider a 32bp read with one mismatch. The mismatch can be at any of 32 positions in the read and take any of the other 3 bases, so there are 3*32 = 96 possible sequences that match with one mismatch (about 6.6 bits of information consumed). Now consider an insert of one base. It could be at any of 32 positions and take any of 4 bases, so there are 4*32 = 128 possible sequences that match with a one-base insert (7 bits of information consumed). Not much difference.
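The arithmetic in this argument can be checked with a few lines of Python. This is only a sketch of the counting described above; the function names are made up for illustration, and the count deliberately follows the post's simple positions-times-bases model.

```python
import math


def one_edit_variants(read_length: int, choices_per_position: int) -> int:
    """Count sequences reachable by one edit: positions times base choices."""
    return read_length * choices_per_position


def bits_consumed(variant_count: int) -> float:
    """Information (in bits) needed to single out one of the variants."""
    return math.log2(variant_count)


# One mismatch in a 32bp read: 32 positions, 3 alternative bases each.
mismatch_variants = one_edit_variants(32, 3)
# One inserted base: 32 positions, 4 possible bases each.
insert_variants = one_edit_variants(32, 4)

print(mismatch_variants, round(bits_consumed(mismatch_variants), 1))
print(insert_variants, round(bits_consumed(insert_variants), 1))
```

The gap between the two cases is under half a bit, which is the point being made.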
With short reads on a human-size genome you should be able to detect indels and SNPs, at least in high-complexity sequence (and easily on smaller genomes). Obviously it won't work in repeats, but the alignment quality in maq and novo... should cover that.
The novo2maq conversion will extract gapped alignments (status 130) from single end reads and you can run indelpe against a converted file.
With regard to quality, it depends on coverage and sample. If coverage is fairly high (>10) and the sample is from one diploid individual, then I'd only accept reads with quality > 10, and then I'd also apply a quality filter to SNPs and indels based on Bayesian posterior probability.

**lh3** · 09-14-2008, 08:00 PM

I also think that, when properly handled, it is possible to find reliable indels from single-ended reads. However, you need to carefully postprocess the indelpe results. Here are the reasons:

Firstly, my experience is that with short reads you will miss about half of the indels that are close to short tandem repeats, while with long reads you will have little problem detecting most of them. So we should probably expect a 1:10 indel-to-substitution ratio from short-read alignment with high depth rather than 1:5, and this is what I have seen on real data with PE reads.

Secondly, I know a group who tried to find indels from single-ended reads with soap, but in the end they decided to drop all such indels when they did experimental validation. Probably they could improve their method, but this also shows that you should be careful when finding indels with single-ended reads.

Thirdly, even if you simulate reads without any indels, you will find a lot of alignments with indels, especially >3bp indels, while you will find far fewer from paired-end alignment. You need to filter properly to get accurate results.

Fourthly, Phil Green comments in his new cross_match documentation that finding indels longer than 2bp needs particular care. Although this is partly due to a limitation of the new algorithm in cross_match, he would not make such a comment unless he thought there was some truth in it.

**sparks** · 09-15-2008, 01:21 AM

I agree with Heng Li that indel calling is prone to problems, but I think it can be done with appropriate care.
I have 1 lane (single end) of data from a 1Mbp region of human (pooled from multiple individuals). Just using indelpe on the novo2maq file and then selecting indels with high cover on both strands, we get ~100 indels. It remains to be seen if these validate, but they look pretty convincing.

Here's one example (best viewed in a fixed-pitch font):
AACTCCTAGAGTGTGCTGTACCCAGAAGAAGACAGAATGGCAGGGTATCC (reference)
AaCTCCTAGAGTGTGCTGTACCCGGAAGA CA
AACtCCTAGAGtGTGCTGTACCAAGAAGA CA
ACTCCTAGaGTGtGCTGTACccaGaaGa cAgaat
TCCTaGAGtGTGCTGTACCcaGaaGA cAGaatggc
...
ccAGaAGa CAGAAtGGCAGGGTATCCTTTGGTCT
AGA CAGAatGGCAGGGTATCCTTTGGT
AGA CAGAATGGCAGGGtATcCTTTggtcTGtaaTt

Quite a few of the indels are in short 3-6bp homopolymers; PCR will tell if they are valid.

**rs705** · 09-19-2008, 05:49 AM

You mention that novoalign is free to non-profits. Do you intend to sell it to commercial companies and if so can you give an estimate of the cost?

**zee** · 09-19-2008, 06:53 AM

Commercial licenses are available for a small fee. We offer single server and site wide licenses and these are quite competitive.
Anybody is free to mail sales - at - novocraft - dot - com for a pricing quote and a list of the extra features available.

**valeu** · 11-03-2008, 09:05 AM

Hi Colin!

I ran Novoalign with "-r None", then with the "-r Random" option. I got the same alignment in both cases. Could you please tell me what I did wrong?

Thanks in advance,
Valentina

**sparks** · 11-03-2008, 05:18 PM

Hi Valentina,
The difference is how we treat a read that has multiple alignment locations. In this example with -rNone, if a read has multiple alignment locations then none of the alignment locations are reported. The read is still reported with a status of 'R'.

@071113_EAS56_0053:2:1:205:775 S GGAATGGAATAGAATGGAATGGAATCGAATGGAAAG IIIIIIIIIIII-AIGI)>8@4'2.,0&-+(3!&%( R 27
@071113_EAS56_0053:2:1:208:823 S GTTGTGTCAATGCTATGTTCTCTTAACTACTATAGG IIIIIIIII0IIII(DI1III@>I)-:G-37&&)'% U 10 90 >gi|89161207|ref|NC_000004.10|NC_000004 115114504 R
@071113_EAS56_0053:2:1:216:778 S GGAGGGGGGAGGGATACCATTAGGAGATATACCTAC IIIIIIIII+III,801.,.109/#-$).5+*'&(" R 20
@071113_EAS56_0053:2:1:220:530 S GGAGGGATGAGTGTGGCCGCCTGAGCCAGGGCCGGG IIIIII,9;AI1C35=$+*!'&(%*#)#&&%%!$!% U 56 0 >gi|89161205|ref|NC_000003.10|NC_000003 113204473 F
@071113_EAS56_0053:2:1:222:845 S GAATTTGCATTTCTCCTAAGTTCCCAGGTGGTGCAC I2IIIIII;IIIIIIII),?3C<48%.,(+1&*&%* U 12 82 >gi|89161210|ref|NC_000006.10|NC_000006 27620264 F
@071113_EAS56_0053:2:1:223:509 S GATGAAATAATCTGTACAACAAACCCCCCTGCCACA I>II@>AIIIIIII:;E+>5*2,,4+50$&&"+'+% R 265

This is the same set of reads with -rR. In this case one of the alignment locations will be chosen at random (based on probability of being the correct one) and reported.

@071113_EAS56_0053:2:1:205:775 S GGAATGGAATAGAATGGAATGGAATCGAATGGAAAG IIIIIIIIIIII-AIGI)>8@4'2.,0&-+(3!&%( R 16 0 >gi|89161220|ref|NC_000024.8|NC_000024 57288157 R
@071113_EAS56_0053:2:1:208:823 S GTTGTGTCAATGCTATGTTCTCTTAACTACTATAGG IIIIIIIII0IIII(DI1III@>I)-:G-37&&)'% U 10 67 >gi|89161207|ref|NC_000004.10|NC_000004 115114504 R
@071113_EAS56_0053:2:1:216:778 S GGAGGGGGGAGGGATACCATTAGGAGATATACCTAC IIIIIIIII+III,801.,.109/#-$).5+*'&(" R 19 0 >gi|89161216|ref|NC_000009.10|NC_000009 88834386 F
@071113_EAS56_0053:2:1:220:530 S GGAGGGATGAGTGTGGCCGCCTGAGCCAGGGCCGGG IIIIII,9;AI1C35=$+*!'&(%*#)#&&%%!$!% U 56 0 >gi|89161205|ref|NC_000003.10|NC_000003 113204473 F
@071113_EAS56_0053:2:1:222:845 S GAATTTGCATTTCTCCTAAGTTCCCAGGTGGTGCAC I2IIIIII;IIIIIIII),?3C<48%.,(+1&*&%* U 12 60 >gi|89161210|ref|NC_000006.10|NC_000006 27620264 F
@071113_EAS56_0053:2:1:223:509 S GATGAAATAATCTGTACAACAAACCCCCCTGCCACA I>II@>AIIIIIII:;E+>5*2,,4+50$&&"+'+% R 17 0 >gi|51511721|ref|NC_000005.8|NC_000005 130493655 F

The difference is that the status 'R' reads have now reported an alignment location.
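The comparison between the two listings can be sketched in Python. The column layout used here is inferred only from the sample lines in this post (read ID, read type, sequence, base qualities, status, then score, quality, reference, position, and strand when a location is reported) and is an assumption, not a documented format. `parse_line` and `newly_located` are hypothetical helper names.

```python
from typing import List, NamedTuple, Optional, Tuple


class Record(NamedTuple):
    read_id: str
    status: str
    location: Optional[Tuple[str, int, str]]  # (reference, position, strand)


def parse_line(line: str) -> Record:
    """Parse one whitespace-separated report line (layout assumed from the samples)."""
    fields = line.split()
    location = None
    # Lines that report a location carry ten fields in the samples above.
    if len(fields) >= 10:
        location = (fields[7], int(fields[8]), fields[9])
    return Record(fields[0], fields[4], location)


def newly_located(run_none: List[str], run_random: List[str]) -> List[str]:
    """Read IDs that have a location in the second run but not in the first."""
    located_before = {r.read_id for r in map(parse_line, run_none) if r.location}
    return [
        r.read_id
        for r in map(parse_line, run_random)
        if r.location and r.read_id not in located_before
    ]


# Two reads copied from each listing above.
NONE_RUN = [
    "@071113_EAS56_0053:2:1:205:775 S GGAATGGAATAGAATGGAATGGAATCGAATGGAAAG IIIIIIIIIIII-AIGI)>8@4'2.,0&-+(3!&%( R 27",
    "@071113_EAS56_0053:2:1:208:823 S GTTGTGTCAATGCTATGTTCTCTTAACTACTATAGG IIIIIIIII0IIII(DI1III@>I)-:G-37&&)'% U 10 90 >gi|89161207|ref|NC_000004.10|NC_000004 115114504 R",
]
RANDOM_RUN = [
    "@071113_EAS56_0053:2:1:205:775 S GGAATGGAATAGAATGGAATGGAATCGAATGGAAAG IIIIIIIIIIII-AIGI)>8@4'2.,0&-+(3!&%( R 16 0 >gi|89161220|ref|NC_000024.8|NC_000024 57288157 R",
    "@071113_EAS56_0053:2:1:208:823 S GTTGTGTCAATGCTATGTTCTCTTAACTACTATAGG IIIIIIIII0IIII(DI1III@>I)-:G-37&&)'% U 10 67 >gi|89161207|ref|NC_000004.10|NC_000004 115114504 R",
]

print(newly_located(NONE_RUN, RANDOM_RUN))
```

On these sample lines, only the status 'R' read gains a location in the second run, which matches the explanation above.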

Hope this helps explain it.

Best Regards, Colin

**valeu** · 11-04-2008, 06:52 AM

Hi Colin! Thank you for your reply!

Have I understood correctly that there is no difference between "-rR" and "-r Random"?

I think I found out why I don't get 'random' reads. This is because I use the "-Q 70" flag, and 'random' reads have Q=0.
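The effect described here can be reproduced as a post-hoc filter on report lines. This is a sketch only and not how Novoalign implements -Q. It assumes the seventh whitespace-separated field is the alignment quality, which is inferred from the sample lines earlier in the thread. `passes_quality` is a hypothetical helper.

```python
from typing import List


def passes_quality(line: str, min_quality: int) -> bool:
    """Keep a line only if it reports an alignment at or above min_quality."""
    fields = line.split()
    if len(fields) < 7:
        return False  # no alignment location reported on this line
    # Assumption: field 7 (index 6) is the alignment quality.
    return int(fields[6]) >= min_quality


def filter_lines(lines: List[str], min_quality: int) -> List[str]:
    return [line for line in lines if passes_quality(line, min_quality)]


# Two lines copied from the -rR listing earlier in the thread.
RANDOM_RUN = [
    "@071113_EAS56_0053:2:1:205:775 S GGAATGGAATAGAATGGAATGGAATCGAATGGAAAG IIIIIIIIIIII-AIGI)>8@4'2.,0&-+(3!&%( R 16 0 >gi|89161220|ref|NC_000024.8|NC_000024 57288157 R",
    "@071113_EAS56_0053:2:1:208:823 S GTTGTGTCAATGCTATGTTCTCTTAACTACTATAGG IIIIIIIII0IIII(DI1III@>I)-:G-37&&)'% U 10 67 >gi|89161207|ref|NC_000004.10|NC_000004 115114504 R",
]

kept = filter_lines(RANDOM_RUN, min_quality=10)
print(len(kept))
```

Any positive threshold drops the randomly placed read, since its quality field is 0 in the sample.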

Cheers,
Valentina

**valeu** · 11-04-2008, 07:06 AM

Hey Colin,

Is there still no news about a precompiled version of Novo* for Solaris?

Valentina

**sparks** · 11-04-2008, 07:09 PM

Hi Valentina,

You're right on both counts. For options, in most cases the space between the option letter and the value is optional. And for the -o & -r options you only need to enter enough letters to uniquely identify the option value.
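Unique-prefix matching of this kind can be sketched as follows. This is an illustration of the general technique, not Novoalign's actual parser. The list of values contains only the two mentioned in this thread and is likely incomplete, and case-sensitive matching is an assumption.

```python
from typing import Sequence


def resolve_option_value(abbrev: str, choices: Sequence[str]) -> str:
    """Return the single choice that starts with abbrev, or raise ValueError."""
    matches = [choice for choice in choices if choice.startswith(abbrev)]
    if len(matches) != 1:
        raise ValueError(
            f"{abbrev!r} matches {len(matches)} of {list(choices)}; need exactly one"
        )
    return matches[0]


# Only the -r values that appear in this thread (assumed incomplete).
R_VALUES = ["None", "Random"]

print(resolve_option_value("R", R_VALUES))
print(resolve_option_value("None", R_VALUES))
```

With these two values a single letter is already enough, which is consistent with "-rR" behaving like "-r Random".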

With regard to Solaris, I've installed OpenSolaris under VMware on my workstation but it has a few problems: it's not recognising my network or my USB drive, so I haven't been able to transfer any files to it.
I have no trouble with VMware and other flavours of Linux.

Colin

**seq_GA** · 07-05-2009, 08:28 PM

Hi Colin,

I am wondering whether Novocraft version 2.04 is free to download for research purposes?
Are all the features listed at http://www.novocraft.com/downloads/downloadpage.php available in the free version?
Please confirm.

Thanks
