
  • #31
    I haven't tried this, but have you attempted to run indelpe on single-end mapping results from novoalign converted to .map format?
    Novoalign's mapping qualities are not recalculated for paired-end and you should see this from the `maq mapstat` output.
    I think Colin will be able to shed more light on this.

    Originally posted by myrna View Post
    Oh no! I was just reveling in the fact that novo2maq did set flags as paired in single end data. This glitch allowed me to run indelpe and find some very convincing indels. Not sure how many of them are real, but looking at the coverage a lot are convincing by eye. Without the ability to run indelpe, many of these sites are mistakenly called SNPs. Is there another option to pull the indels from a novoalign output? I understand the rationale that Maq only trusts indels from paired data, but I would like to get Colin's opinion about whether we can trust indels from single end reads (and if so, what mapping quality thresholds?)

    Thanks,

    Ryan

    Comment


    • #32
      indelpe on single end data

      Originally posted by zee View Post
      I haven't tried this, but have you attempted to run indelpe on single-end mapping results from novoalign converted to .map format?
      Novoalign's mapping qualities are not recalculated for paired-end and you should see this from the `maq mapstat` output.
      I think Colin will be able to shed more light on this.
      I have done this and it seemed to work well (which was quite satisfying). I just want to know whether I can trust the results, or whether I should pre-filter the alignments at some mapping quality threshold before converting them to .map format. Do you have any sense of the sensitivity and specificity at different coverages?

      Thanks.

      Comment


      • #33
        I have seen some papers that use MAQ, filter out anything below a mapping quality of 10, and then do further analysis. With novoalign you should have good quality matches using this sort of filter.
        For assembly and SNP calling it's better to use a high quality threshold; again, anything over 10 should suffice, but I'm sure other users could add more insight.
        If many of your indels are in this high quality range then it should be reliable. You could always confirm by doing other things like multiple sequence alignment of those regions, pileup, etc.

        Comment


        • #34
          Pileup

          On a separate yet related note, does anyone know what is done with flag-130 reads (gapped alignments) when a pileup file is made? It looks as if they are being included without being gapped (which makes sense since the pileup format does not have a way of representing gaps, though maybe it should?). However with the much larger number of gapped alignments in the novoalign output, this seems to be giving me problems when trying to identify SNPs from the pileup file. Has anyone else observed this?

          Thanks

          Comment


          • #35
            Hi myrna,
            I think you can trust indel calls on single end reads and here's why...
            I think we have to allow for them in alignment, as they are real. Looking at Craig Venter's genome, we can see that short indels are fairly frequent. See Table 6. http://biology.plosjournals.org/perl...50254&id=12379

            Also, from an information content point of view, a single base indel isn't much harder to align than a single base mismatch. Consider a 32bp read with one mismatch. The mismatch can be at any of 32 positions in the read and take any of the other 3 bases, so there are 3*32 = 96 (6.6 bits of information consumed) possible sequences that match with one mismatch. Now consider an insert of one base. It could be at any of 32 positions and take any of 4 bases, so there are 4*32 = 128 (7 bits of information consumed) possible sequences that match with a one base insert. Not much difference.
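
The counting argument above can be checked with a few lines of Python. This is only a sketch that mirrors the simple counting used in the post (it does not try to remove duplicate sequences, for example); the function names are made up for illustration.

```python
import math


def one_mismatch_variants(read_length: int, alphabet_size: int = 4) -> int:
    """Sequences reachable with one mismatch: any position, any *other* base."""
    return read_length * (alphabet_size - 1)


def one_base_insert_variants(read_length: int, alphabet_size: int = 4) -> int:
    """Sequences reachable with a one-base insert, counted as in the post:
    any position, any base."""
    return read_length * alphabet_size


def bits(n_variants: int) -> float:
    """Information consumed by choosing one of n_variants possibilities."""
    return math.log2(n_variants)


mismatch = one_mismatch_variants(32)    # 3 * 32
insert = one_base_insert_variants(32)   # 4 * 32
print(mismatch, round(bits(mismatch), 1))
print(insert, round(bits(insert), 1))
```

The gap between the two cases is well under half a bit, which is the point being made: a one-base insert costs only slightly more information than a one-base mismatch.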
            With short reads on a human-size genome you should be able to detect indels and SNPs, at least in high-complexity sequence (and easily on smaller genomes). Obviously it won't work in repeats, but the alignment quality in maq and novo... should cover that.
            The novo2maq conversion will extract gapped alignments (status 130) from single end reads and you can run indelpe against a converted file.
            With regard to quality, it depends on cover and sample. If cover is fairly high (>10) and the sample is from one diploid individual, then I'd only accept reads with quality > 10, and then I'd also apply a quality filter to SNPs and indels based on Bayesian posterior probability.
            Last edited by sparks; 09-14-2008, 03:41 PM. Reason: Added bit about quality

            Comment


            • #36
              I also think that, when properly handled, it is possible to find reliable indels from single-ended reads. However, you need to carefully postprocess the indelpe results. Here are the reasons:

              Firstly, in my experience, with short reads you will miss about half of the indels that are close to short tandem repeats, while with long reads you will have little problem detecting most of them. So we should probably expect a 1:10 indel-to-substitution ratio from high-depth short read alignment rather than 1:5, and this is what I have seen on real data with PE reads.
              Secondly, I know a group who tried to find indels from single-ended reads with soap, but in the end they decided to drop all such indels when they did experimental validation. Probably they could improve their method, but this also shows that you should be careful when finding indels with single-ended reads.
              Thirdly, even if you simulate reads without any indels, you will find a lot of alignments with indels, especially >3bp indels, while you will find far fewer from paired-end alignment. You need to filter properly to get accurate results.
              Fourthly, Phil Green comments in his new cross_match documentation that finding indels longer than 2bp needs particular care. Although this is partly due to a limitation of the new algorithm in cross_match, he would not make such a comment unless he thought there was some truth in it.

              Comment


              • #37
                I agree with Heng Li that indel calling is prone to problems, but I think it can be done with appropriate care.
                I have 1 lane (single end) of data from a 1Mbp region of human (pooled from multiple individuals). Just using indelpe on the novo2maq file and then selecting indels with high cover on both strands, we get ~100 indels. It remains to be seen if these validate, but they look pretty convincing.

                Here's one example (best viewed in a fixed-pitch font):
                AACTCCTAGAGTGTGCTGTACCCAGAAGAAGACAGAATGGCAGGGTATCC (reference)
                AaCTCCTAGAGTGTGCTGTACCCGGAAGA CA
                AACtCCTAGAGtGTGCTGTACCAAGAAGA CA
                ACTCCTAGaGTGtGCTGTACccaGaaGa cAgaat
                TCCTaGAGtGTGCTGTACCcaGaaGA cAGaatggc
                ...
                ccAGaAGa CAGAAtGGCAGGGTATCCTTTGGTCT
                AGA CAGAatGGCAGGGTATCCTTTGGT
                AGA CAGAATGGCAGGGtATcCTTTggtcTGtaaTt

                Quite a few of the indels are in short 3-6bp homopolymers; PCR will tell if they are valid.
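
A rough sketch of the "high cover on both strands" selection described in this post. The record layout (`IndelCandidate`, `fwd_reads`, `rev_reads`), the threshold of 3 and the example numbers are all hypothetical: the post gives neither the cutoff used nor the indelpe columns, so real output would first have to be mapped onto a structure like this.

```python
from typing import List, NamedTuple


class IndelCandidate(NamedTuple):
    # Hypothetical record; not the actual indelpe output layout.
    position: int
    fwd_reads: int  # reads supporting the indel on the forward strand
    rev_reads: int  # reads supporting the indel on the reverse strand


def supported_on_both_strands(
    candidates: List[IndelCandidate], min_per_strand: int = 3
) -> List[IndelCandidate]:
    """Keep candidates with at least min_per_strand supporting reads on
    each strand (the default of 3 is an arbitrary illustration)."""
    return [
        c
        for c in candidates
        if c.fwd_reads >= min_per_strand and c.rev_reads >= min_per_strand
    ]


# Made-up candidates, purely to exercise the filter.
candidates = [
    IndelCandidate(position=1000, fwd_reads=6, rev_reads=5),
    IndelCandidate(position=2000, fwd_reads=9, rev_reads=0),  # one strand only
    IndelCandidate(position=3000, fwd_reads=2, rev_reads=7),  # weak forward support
]
kept = supported_on_both_strands(candidates)
print([c.position for c in kept])
```

Requiring support on each strand separately, rather than a total count, is what discards candidates seen in only one orientation.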
                Last edited by sparks; 09-15-2008, 02:01 AM. Reason: Added example

                Comment


                • #38
                  You mention that novoalign is free to non-profits. Do you intend to sell it to commercial companies and if so can you give an estimate of the cost?

                  Comment


                  • #39
                    Commercial licenses are available for a small fee. We offer single server and site wide licenses and these are quite competitive.
                    Anybody is free to mail sales - at - novocraft - dot - com for a pricing quote and a list of the extra features available.
                    Last edited by zee; 09-19-2008, 11:01 AM.

                    Comment


                    • #40
                      Hi Colin!

                      I ran Novoalign with "-r None", then with the "-r Random" option. I got the same alignment in both cases. Could you please tell me what I did wrong?

                      Thanks in advance,
                      Valentina

                      Comment


                      • #41
                        Hi Valentina,
                        The difference is how we treat a read that has multiple alignment locations. In this example with -rNone, if a read has multiple alignment locations then none of the alignment locations are reported. The read is still reported with a status of 'R'.

                        @071113_EAS56_0053:2:1:205:775 S GGAATGGAATAGAATGGAATGGAATCGAATGGAAAG IIIIIIIIIIII-AIGI)>8@4'2.,0&-+(3!&%( R 27
                        @071113_EAS56_0053:2:1:208:823 S GTTGTGTCAATGCTATGTTCTCTTAACTACTATAGG IIIIIIIII0IIII(DI1III@>I)-:G-37&&)'% U 10 90 >gi|89161207|ref|NC_000004.10|NC_000004 115114504 R
                        @071113_EAS56_0053:2:1:216:778 S GGAGGGGGGAGGGATACCATTAGGAGATATACCTAC IIIIIIIII+III,801.,.109/#-$).5+*'&(" R 20
                        @071113_EAS56_0053:2:1:220:530 S GGAGGGATGAGTGTGGCCGCCTGAGCCAGGGCCGGG IIIIII,9;AI1C35=$+*!'&(%*#)#&&%%!$!% U 56 0 >gi|89161205|ref|NC_000003.10|NC_000003 113204473 F
                        @071113_EAS56_0053:2:1:222:845 S GAATTTGCATTTCTCCTAAGTTCCCAGGTGGTGCAC I2IIIIII;IIIIIIII),?3C<48%.,(+1&*&%* U 12 82 >gi|89161210|ref|NC_000006.10|NC_000006 27620264 F
                        @071113_EAS56_0053:2:1:223:509 S GATGAAATAATCTGTACAACAAACCCCCCTGCCACA I>II@>AIIIIIII:;E+>5*2,,4+50$&&"+'+% R 265

                        This is the same set of reads with -rR. In this case one of the alignment locations will be chosen at random (based on probability of being the correct one) and reported.

                        @071113_EAS56_0053:2:1:205:775 S GGAATGGAATAGAATGGAATGGAATCGAATGGAAAG IIIIIIIIIIII-AIGI)>8@4'2.,0&-+(3!&%( R 16 0 >gi|89161220|ref|NC_000024.8|NC_000024 57288157 R
                        @071113_EAS56_0053:2:1:208:823 S GTTGTGTCAATGCTATGTTCTCTTAACTACTATAGG IIIIIIIII0IIII(DI1III@>I)-:G-37&&)'% U 10 67 >gi|89161207|ref|NC_000004.10|NC_000004 115114504 R
                        @071113_EAS56_0053:2:1:216:778 S GGAGGGGGGAGGGATACCATTAGGAGATATACCTAC IIIIIIIII+III,801.,.109/#-$).5+*'&(" R 19 0 >gi|89161216|ref|NC_000009.10|NC_000009 88834386 F
                        @071113_EAS56_0053:2:1:220:530 S GGAGGGATGAGTGTGGCCGCCTGAGCCAGGGCCGGG IIIIII,9;AI1C35=$+*!'&(%*#)#&&%%!$!% U 56 0 >gi|89161205|ref|NC_000003.10|NC_000003 113204473 F
                        @071113_EAS56_0053:2:1:222:845 S GAATTTGCATTTCTCCTAAGTTCCCAGGTGGTGCAC I2IIIIII;IIIIIIII),?3C<48%.,(+1&*&%* U 12 60 >gi|89161210|ref|NC_000006.10|NC_000006 27620264 F
                        @071113_EAS56_0053:2:1:223:509 S GATGAAATAATCTGTACAACAAACCCCCCTGCCACA I>II@>AIIIIIII:;E+>5*2,,4+50$&&"+'+% R 17 0 >gi|51511721|ref|NC_000005.8|NC_000005 130493655 F

                        The difference is that the status 'R' reads have now reported an alignment location.
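
To see this difference programmatically, here is a small sketch that parses report lines like the ones above and lists the reads that only have a location in the second run. The field positions are an assumption inferred from the sample lines in this post (whitespace-separated: read id, read type, sequence, base qualities, status, then numbers and, when an alignment is reported, reference, offset and strand); check them against the Novoalign documentation before relying on them.

```python
from typing import List, NamedTuple, Optional


class Alignment(NamedTuple):
    read_id: str
    status: str              # 'U' or 'R' in the sample above
    location: Optional[str]  # "reference:offset:strand", or None if not reported


def parse_line(line: str) -> Alignment:
    fields = line.split()
    read_id, status = fields[0], fields[4]
    # Assumed layout: lines with a reported alignment carry reference,
    # offset and strand in fields 7-9; lines without one stop earlier.
    location = ":".join(fields[7:10]) if len(fields) >= 10 else None
    return Alignment(read_id, status, location)


def newly_located(first_run: List[str], second_run: List[str]) -> List[str]:
    """Read ids with no location in first_run but one in second_run."""
    first = {a.read_id: a for a in map(parse_line, first_run)}
    return [
        a.read_id
        for a in map(parse_line, second_run)
        if a.location is not None and first[a.read_id].location is None
    ]


# First two reads from each listing above.
R_NONE = [
    "@071113_EAS56_0053:2:1:205:775 S GGAATGGAATAGAATGGAATGGAATCGAATGGAAAG IIIIIIIIIIII-AIGI)>8@4'2.,0&-+(3!&%( R 27",
    "@071113_EAS56_0053:2:1:208:823 S GTTGTGTCAATGCTATGTTCTCTTAACTACTATAGG IIIIIIIII0IIII(DI1III@>I)-:G-37&&)'% U 10 90 >gi|89161207|ref|NC_000004.10|NC_000004 115114504 R",
]
R_RANDOM = [
    "@071113_EAS56_0053:2:1:205:775 S GGAATGGAATAGAATGGAATGGAATCGAATGGAAAG IIIIIIIIIIII-AIGI)>8@4'2.,0&-+(3!&%( R 16 0 >gi|89161220|ref|NC_000024.8|NC_000024 57288157 R",
    "@071113_EAS56_0053:2:1:208:823 S GTTGTGTCAATGCTATGTTCTCTTAACTACTATAGG IIIIIIIII0IIII(DI1III@>I)-:G-37&&)'% U 10 67 >gi|89161207|ref|NC_000004.10|NC_000004 115114504 R",
]

print(newly_located(R_NONE, R_RANDOM))
```

Only the status 'R' read shows up in the result; the 'U' read has the same location in both runs.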

                        Hope this helps explain it.

                        Best Regards, Colin

                        Comment


                        • #42
                          Hi Colin! Thank you for your reply!

                          Have I understood correctly that there is no difference between "-rR" and "-r Random"?

                          I think I found out why I don't get 'random' reads. This is because I use the "-Q 70" flag, and 'random' reads have Q=0.
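
A minimal illustration of why a quality threshold hides the randomly placed reads. The (read, status, quality) values are taken from the -rR sample output earlier in the thread; treating the threshold as "keep quality >= cutoff" is an assumption made for this sketch, not a statement of how Novoalign implements -Q.

```python
from typing import List, Tuple

Record = Tuple[str, str, int]  # (read id suffix, status, alignment quality)

# Status and quality columns from the -rR listing earlier in the thread.
RANDOM_RUN: List[Record] = [
    ("205:775", "R", 0),
    ("208:823", "U", 67),
    ("216:778", "R", 0),
    ("220:530", "U", 0),
    ("222:845", "U", 60),
    ("223:509", "R", 0),
]


def passes_quality(records: List[Record], min_quality: int) -> List[Record]:
    """Keep records whose alignment quality is at least min_quality
    (assumed semantics of a quality cutoff)."""
    return [r for r in records if r[2] >= min_quality]


for cutoff in (10, 70):
    kept = passes_quality(RANDOM_RUN, cutoff)
    print(cutoff, [read_id for read_id, _, _ in kept])
```

Because every status 'R' read in the sample has quality 0, any positive cutoff removes all of them, so the filtered output no longer shows what -rR added.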

                          Cheers,
                          Valentina

                          Comment


                          • #43
                            Hey Colin,

                            Is there still no news about a precompiled version of Novo* on Solaris?

                            Valentina

                            Comment


                            • #44
                              Hi Valentina,

                              You're right on both counts. For options, in most cases the space between the option letter and the value is optional. And for the -o & -r options you only need to enter enough letters to uniquely identify the option value.
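
The "enough letters to uniquely identify the option value" rule is ordinary unique-prefix matching. The sketch below shows the general idea only; it is not Novoalign's code, the value list contains just the two -r values mentioned in this thread, and ignoring case is an assumption.

```python
from typing import Sequence


def resolve_prefix(prefix: str, choices: Sequence[str]) -> str:
    """Return the single choice that starts with prefix (case-insensitive).

    Raises ValueError if the prefix matches no choice or more than one.
    """
    wanted = prefix.lower()
    # An exact match wins even if it is also a prefix of a longer choice.
    for choice in choices:
        if choice.lower() == wanted:
            return choice
    matches = [c for c in choices if c.lower().startswith(wanted)]
    if len(matches) != 1:
        raise ValueError(f"{prefix!r} matches {matches or 'nothing'} in {list(choices)}")
    return matches[0]


# Only the two -r values named in this thread; the real list may be longer.
R_VALUES = ["None", "Random"]

print(resolve_prefix("R", R_VALUES))
print(resolve_prefix("None", R_VALUES))
```

With this rule "-rR" and "-r Random" select the same value, and an abbreviation stops working only once it fits more than one value.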

                              With regard to Solaris, I've installed OpenSolaris under VMware on my workstation but it has a few problems: it's not recognising my network or my USB drive, so I haven't been able to transfer any files to it.
                              I have no trouble with VMware and other flavours of Linux.

                              Colin

                              Comment


                              • #45
                                Hi Colin,

                                I am wondering whether novocraft version 2.04 is free to download for research purposes?
                                Are all the features listed at http://www.novocraft.com/downloads/downloadpage.php available in the free version?
                                Please confirm.

                                Thanks
                                Last edited by seq_GA; 07-05-2009, 08:33 PM.

                                Comment
