  • SNP/Mutation Detection Using Illumina Paired-end Data

    This website is really useful and I've found answers to many questions I had by just reading the existing threads.

    However, I have not found an answer to this question: what program or pipeline should I use to map Illumina paired-end data (about 50 bp per end, genomic) to a genome in order to find SNPs/mutations?

    From what I've read in the thread "Software packages for next gen sequence analysis", which is excellent btw, it seems there is no package that detects SNPs while also taking advantage of the distance information from the paired-end data. If that is the case, should I use a package that does paired-end mapping (e.g. novocraft) with a loose cut-off, and then feed the hits into a SNP detection package, e.g. ssahaSNP, to refine the data? Will I miss a significant amount of SNP information with this alternative?

    Any comments or suggestions are appreciated! Thanks in advance ~

  • #2
    I think MAQ can deal with Illumina paired-end data and output the SNPs. Maybe I misunderstand what you mean by "detects SNP while also taking advantage of the distance information from the paired-end data". I took it to mean mapping the reads as pairs, then using the mapping results to extract SNPs. Or do you want the SNPs on the two reads of one pair to be correlated in some way? I am adding a SNP detection part to ZOOM that extracts SNP information from the alignment and assembly results. So could you give me more information? Thanks.



    • #3
      Thank you for your reply!

      Your first understanding is right: mapping the reads as pairs, then using the mapping results to extract SNPs.

      For MAQ, does it do the mapping first and then the SNP detection, i.e. two filtering steps to reach the final SNP result?

      Is there any software that can do the two at the same time, i.e. put them in the same model and thus get a single p-value for each SNP detected? Is it possible to do that at all?



      • #4
        Hi qqcandy,

        I have written software that uses MAQ paired alignments to do SNP calling - though maq now also has processes to do the same thing.

        I don't know of any process that provides a p-value for SNPs, but you could calculate it yourself with the information available.

        As for wanting to do alignment and SNP calling at one time, I only know of one piece of software that does that (Slider), but I consider that to be a terrible design decision, and I would stay away from any software that tries to do everything in one step. (The opportunity for bugs to be present increases dramatically as the complexity of the software increases.)
        The more you know, the more you know you don't know. —Aristotle
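One way to "calculate it yourself", as a minimal sketch rather than what MAQ or any other tool actually does: treat every non-reference read at a position as a possible sequencing error and ask how likely it is to see at least that many by chance. The function names and the 1% per-base error rate below are assumptions for illustration only.

```python
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def snp_p_value(alt_count, depth, error_rate=0.01):
    """P-value for seeing alt_count or more non-reference reads out of depth
    reads if the site is really reference and mismatches are only errors.

    error_rate is an assumed, uniform per-base error rate (hypothetical).
    """
    return binomial_tail(alt_count, depth, error_rate)


# 3 mismatching reads out of 57 is plausible noise; 54 out of 57 is not.
p_noise = snp_p_value(3, 57)
p_real = snp_p_value(54, 57)
```

A small p-value only says the mismatches are unlikely to be random errors under this simple model; it ignores mapping errors and per-base quality scores.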



        • #5
          Yes. MAQ takes two steps to get the final SNP results. I agree with apfejes that it's not good to do alignment and SNP calling at the same time, because you need to map all possible reads to a position before you can tell whether that position is a SNP candidate, based on information such as the frequency of the different nucleotides at the position and the quality of the mapped reads. ZOOM will also adopt the two-step approach.
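The two-step idea can be sketched as follows: first collect every aligned read, then summarise each position before making any SNP decision. This is only an illustration; the `(start, sequence)` input format is an assumption (ungapped alignments on one strand), not the output format of MAQ or ZOOM.

```python
from collections import Counter, defaultdict


def pileup(alignments):
    """Count the bases observed at each reference position.

    alignments: iterable of (start_position, read_sequence) pairs,
    assumed to be ungapped alignments (hypothetical input format).
    """
    columns = defaultdict(Counter)
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    return columns


# Step 1 (alignment) is assumed to have produced these three reads.
reads = [(100, "ACGT"), (101, "CGTA"), (102, "TTAC")]

# Step 2 only starts once all reads are placed: inspect each column.
cols = pileup(reads)
```

Position 102 here is covered by all three reads, two showing G and one showing T, which is the kind of per-position summary a SNP caller works from.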



          • #6
            Thanks a lot for the suggestions!

            Now I understand that doing the alignment while assessing SNPs is not a good idea. It is better to have the reads aligned first, then take all the reads that align to a region to assess SNPs.

            As far as I know, there are software packages/algorithms that give a p-value for each SNP reported, given the alignment. They were designed for EST analysis but I think they can also be applied to short-read analysis. (Both kinds of data come with scores for each nucleotide in the read.)



            There is another related question: if the organism is diploid, the reference has C at a position, and the sample has one copy with G and the other copy with T, how does the SNP detection program (e.g. the one in MAQ) deal with it? Can the program report two SNPs at that position for the sample? (Assuming we have enough reads for both G and T.)
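To make the G/T-versus-reference-C case concrete, here is a minimal sketch of a diploid genotype call. It is not MAQ's model; it simply scores all ten unordered genotypes with a uniform error rate (an assumed 1%) and keeps the most likely one. The read counts are made up for illustration.

```python
from itertools import combinations_with_replacement
from math import log

BASES = "ACGT"


def genotype_log_likelihood(counts, genotype, error_rate=0.01):
    """Log-likelihood of the observed base counts under a diploid genotype.

    Each read is assumed to come from either allele with probability 1/2,
    and a miscall is assumed to hit each other base with error_rate / 3.
    """
    total = 0.0
    for base, n in counts.items():
        p = sum(
            (1 - error_rate) if base == allele else error_rate / 3
            for allele in genotype
        ) / 2
        total += n * log(p)
    return total


def best_genotype(counts, error_rate=0.01):
    """Return the most likely of the 10 unordered diploid genotypes."""
    genotypes = combinations_with_replacement(BASES, 2)
    return max(genotypes, key=lambda g: genotype_log_likelihood(counts, g, error_rate))


# Hypothetical counts: reference is C, sample reads show mostly G and T.
counts = {"A": 0, "C": 1, "G": 20, "T": 18}
genotype = best_genotype(counts)
non_ref = [allele for allele in sorted(set(genotype)) if allele != "C"]
```

With enough reads for both G and T, the heterozygous G/T genotype wins and both alleles are reported as non-reference. A model like this assumes exactly two copies, so it does not cover the trisomy or duplication cases.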



            • #7
              There are likely many software packages out there that do SNP calling - and each one will be different. In fact, you can write your own SNP caller in a few days, or modify one of the available ones out there to do what you want.

              Whether you get a p-value or whether the software can do multiple snps at the same location depends heavily on what application you choose to use.

              The one I've written doesn't give a p-value, but does call multiple snps at one location. However, by calling multiple snps at one location, you start to wander into muddy waters. What if you have a trisomy, or duplication event? It becomes less and less clear what the real answer is.



              • #8
                I think MAQ's paper takes the diploid problem into account. However, I haven't tried it, so I have no idea what its output looks like.

                You can also try apfejes's software. Hi apfejes, what is the name of your software? I didn't find it on your blog. Is it included in the FindPeaks package?

                The SNP detection part of the next release of ZOOM will output the SNPs automatically, including the diploid ones. But for now, if you want to write your own SNP caller based on the alignment results, ZOOM's output may be helpful, because ZOOM outputs a frequency file together with the mapping results and the assembled consensus. The frequency file records the frequency of the different nucleotides at each position. It looks like this:

                position  A  C   G   T   deletion  insertion  coverage
                33113     0  1   54  2   0         0          0
                4192402   0  43  0   53  0         0          0

                Then you can write your own Perl script to decide what the SNP is by comparing the frequencies.
                Last edited by spirit; 10-03-2008, 01:13 PM.
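A script of the kind described above could look roughly like this. It is a sketch in Python rather than Perl, and the thresholds (`min_depth`, `min_fraction`) and function names are assumptions for illustration, not part of ZOOM. Because the coverage column is 0 in the sample rows, depth is taken as the sum of the A, C, G and T counts instead.

```python
BASES = ("A", "C", "G", "T")


def parse_frequency_line(line):
    """Parse 'position A C G T deletion insertion coverage' into
    (position, {base: count}); indel and coverage columns are ignored."""
    fields = line.split()
    position = int(fields[0])
    counts = {base: int(n) for base, n in zip(BASES, fields[1:5])}
    return position, counts


def call_alleles(counts, min_depth=10, min_fraction=0.2):
    """Return the bases supported by at least min_fraction of the reads.

    Both thresholds are arbitrary, hypothetical defaults.
    """
    depth = sum(counts.values())
    if depth < min_depth:
        return []
    return sorted(b for b, n in counts.items() if n / depth >= min_fraction)


def call_snp(counts, ref_base, **thresholds):
    """Return called alleles that differ from the (externally supplied) reference base."""
    return [b for b in call_alleles(counts, **thresholds) if b != ref_base]


rows = ["33113 0 1 54 2 0 0 0", "4192402 0 43 0 53 0 0 0"]
calls = {}
for row in rows:
    pos, counts = parse_frequency_line(row)
    calls[pos] = call_alleles(counts)
```

On the two sample rows this keeps only G at 33113 and both C and T at 4192402. The frequency file does not carry the reference base, so `call_snp` needs it from elsewhere.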



                • #9
                  Hi spirit,

                  My software is part of the Vancouver Short Read Analysis package (sourceforge), and is undergoing a lot of development right now.

                  The SNP caller itself is in pretty good shape, but has been tailored to take advantage of WTSS data, using a Maq alignment pipeline developed by another graduate student here. However, it would be relatively easy to make that part optional, which would allow the package to be used for general-purpose SNP calling. Since SNP callers are relatively easy to write, I wasn't expecting much interest in this, but I'm happy to make the minor changes required if someone intends to use it.

                  The only caveat is that anyone who'd like to try it would need to download the source and compile it. It's only two commands ("svn checkout <path/trunk>", and "ant buildall"), but I know the command line can be scary for some people.

                  I'll do a file release and a manual for it in the near future, but that would obviously be accelerated if there's interest.

                  In the meantime, you can also let me know if you want a feature list, as it's a pretty "full featured" snp caller.

                  Cheers.
                  Last edited by apfejes; 10-03-2008, 01:39 PM. Reason: add in one last sentance.



                  • #10
                    Originally posted by spirit:
                    The frequency file records the frequency of the different nucleotides at each position. It looks like this:

                    position  A  C   G   T   deletion  insertion  coverage
                    33113     0  1   54  2   0         0          0
                    4192402   0  43  0   53  0         0          0

                    Then you can write your own Perl script to decide what the SNP is by comparing the frequencies.

                    For more sophisticated SNP detection (although in most cases not needed), it may be useful to have the quality score for each call. E.g. for position 33113, what is the score for the one C, and what are the scores for the 54 Gs? In this case the situation is clear, but if there are 2 Cs and 3 Gs, the score for each call can make a big difference. It would be nice if we could have that form of output too.
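The 2 Cs versus 3 Gs case can be illustrated with a small sketch that weights each base call by its Phred quality instead of just counting reads. The per-call `(base, quality)` input is hypothetical: it is the kind of output being asked for here, not something ZOOM is known to produce, and the quality values are invented.

```python
from collections import defaultdict


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def quality_weighted_support(calls):
    """Sum Phred qualities per base, so confident calls count for more."""
    support = defaultdict(int)
    for base, q in calls:
        support[base] += q
    return dict(support)


# Hypothetical column: 2 high-quality Cs and 3 low-quality Gs.
calls = [("C", 35), ("C", 38), ("G", 5), ("G", 8), ("G", 6)]
support = quality_weighted_support(calls)
best = max(support, key=support.get)
```

By raw count G would win 3 to 2, but the summed qualities favour C, which is why per-call scores can change the outcome at low coverage.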



                    • #11
                      Just found this, which may be useful to people who are interested in SNP calling using short-read data:

                      Conversion of Novoalign to Maq's .map format (calling SNPs and Indels)



                      "At present most people are used to using the maq (http://maq.sourceforge.net) tools for alignment, SNP calling, assembly, etc. We plan to develop our own flavour of these but in the meantime it's possible to get more 'good' alignments than maq using novoalign, and then presumably use the maq tools to call assemble reads/call SNPs from novo* alignments."



                      • #12
                        Yes. That's right. That's the reason why we are developing the new SNP detector for ZOOM. The SNP detection part will consider the many factors that affect how likely a position is to be a true SNP, such as the quality score at the mismatched position and the mapping probability of the read...

                        Everybody is rushing to do better!!



                        Originally posted by qqcandy:
                        For more sophisticated SNP detection (although in most cases not needed), it may be useful to have the quality score for each call. E.g. for position 33113, what is the score for the one C, and what are the scores for the 54 Gs? In this case the situation is clear, but if there are 2 Cs and 3 Gs, the score for each call can make a big difference. It would be nice if we could have that form of output too.
                        Last edited by spirit; 10-06-2008, 12:47 PM.



                        • #13
                          I've already heard of 4 or more good SNP callers. What we try to do with novoalign is simply maximize the number of good read alignments.
                          IMO the best way to validate a SNP caller is to use experimental validation and not all of us are in a position to do that.
                          I expect people to introduce more flavours of aligners/SNP detectors especially with the advent of longer read lengths and better sequencing protocols.



                          • #14
                            Has anyone used Bowtie for alignment and then done SNP calling with MAQ? Is this possible, to take advantage of the speed of Bowtie and the functions of MAQ?



                            • #15
                              Originally posted by qqcandy:

                              There is another related question: if the organism is diploid, the reference has C at a position, and the sample has one copy with G and the other copy with T, how does the SNP detection program (e.g. the one in MAQ) deal with it? Can the program report two SNPs at that position for the sample? (Assuming we have enough reads for both G and T.)


                              Hello qqcandy,

                              I came across your posting from 10 months ago and I am wondering whether you have resolved the issue. If so, how did you do it?

                              I am using maq to find SNPs and for the most part get only 2 calls per site (on my diploid organism), but a certain percentage of sites have 3-allele calls as well, where the third allele is for the most part far less frequent than the major and minor alleles.

                              Thanks,
                              Anamika
