  • De novo SNP calling in absence of complete reference assembly

    Dear all,

    I would like to call SNPs on a diploid genome in the absence of a reference genome or complete assembly. I know that it is possible to do this with Cortex, but it has lower sensitivity than mapping-based approaches (and also requires around 30x coverage).

    The final assembly for our species is not ready yet. We have access to about 1 million scaffolds at the moment. Additional WGS reads at 20x coverage are available for several individuals. I plan to start the SNP calling now. My idea is to map the reads against the scaffolds and then use FreeBayes for calling. What do you think? Would SAMtools work with 1 million scaffolds?

    I would really appreciate receiving any advice or comments.

    Cheers,
    fcr
    Last edited by fcr; 02-16-2012, 04:49 AM. Reason: incomplete post

  • #2
    I think you should map your reads to the assembly and then do SNP calling. SAMtools should in principle work, but I have not tried.
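A minimal sketch of the map-then-call workflow suggested here, written in Python as a thin wrapper that only builds the command lines and runs nothing. It assumes BWA-MEM for mapping (the thread does not name a mapper), SAMtools for sorting and indexing, and FreeBayes for calling, as proposed in the first post. The file names and the `build_pipeline` helper are hypothetical.

```python
import shlex


def build_pipeline(ref, reads1, reads2, sample, threads=4):
    """Return shell command strings for one sample; nothing is executed.

    ref      -- FASTA of scaffolds used as the provisional reference (assumed name)
    reads1/2 -- paired-end FASTQ files for the sample (assumed names)
    sample   -- label used to name the output BAM and VCF
    """
    q = shlex.quote
    bam = f"{sample}.sorted.bam"
    vcf = f"{sample}.vcf"
    return [
        # Index the scaffolds for the mapper and for the caller.
        f"bwa index {q(ref)}",
        f"samtools faidx {q(ref)}",
        # Map the reads and sort the alignments by coordinate.
        f"bwa mem -t {threads} {q(ref)} {q(reads1)} {q(reads2)}"
        f" | samtools sort -o {q(bam)} -",
        f"samtools index {q(bam)}",
        # Call variants against the scaffolds.
        f"freebayes -f {q(ref)} {q(bam)} > {q(vcf)}",
    ]


if __name__ == "__main__":
    for cmd in build_pipeline("scaffolds.fa", "ind1_R1.fq.gz", "ind1_R2.fq.gz", "ind1"):
        print(cmd)
```

Printing the commands first makes it easy to check them on a small subset of scaffolds before committing to the full set of roughly 1 million.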



    • #3
      Re Cortex:
      1. You have much more than 30x coverage if you have many samples at 20x
      2. It's not as simple as "you need 30x" for Cortex. But you are absolutely right that an assembly approach will be less sensitive to SNPs.


      Re what to do
      - It depends on what you want to achieve. Do you want a small, conservative set of SNPs for building a genetic map, or a large, sensitive set for some other purpose?

      If you have the time, then try both methods (mapping/assembly) and compare. If you are doing population genetic studies, then experience suggests that you will need to be very careful with SNP calls based on an assembly that is not high quality, as it is easy for assembly artefacts to look like interesting scientific finds in your SNPs.



      • #4
        Hi Zam,

        Thanks a lot.

        Cortex:
        1. True, the distribution of coverage will include regions above 30x.
        2. What are the computational needs for 10 individuals with a 2.9 Gbp genome? You stated "10 humans on a 256Gb RAM server". How long does this take? Would it be possible to call SNPs with less RAM?

        What to do:
        This is a 60x coverage genome. I would assume that many of the scaffolds are bona fide and that many of the changes (adding more libraries) are going to affect mainly the connections among scaffolds rather than disrupt them... but I might be wrong and shouldn't guess. The main interests are: 1. to develop a genome-wide set of markers and 2. to do some population inferences by estimating Fst, Pi and Ne.

        So you think using scaffolds is too risky?


        Cheers,
        Fernando
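As a rough illustration of two of the statistics mentioned above, the sketch below computes per-site nucleotide diversity (Pi) and a Hudson-style Fst from alternate-allele counts at biallelic SNPs. The function names and input layout are hypothetical, and the Hudson estimator is only one of several ways to estimate Fst; this is meant for intuition, not as what any particular tool implements.

```python
def site_pi(alt_count, n_chrom):
    """Unbiased heterozygosity at one biallelic site: 2pq * n / (n - 1)."""
    if n_chrom < 2:
        raise ValueError("need at least two sampled chromosomes")
    p = alt_count / n_chrom
    return 2.0 * p * (1.0 - p) * n_chrom / (n_chrom - 1)


def mean_pi(alt_counts, n_chrom, n_sites):
    """Average diversity per site over n_sites callable sites.

    alt_counts lists only the variable sites; monomorphic sites add 0,
    so they enter through the denominator n_sites alone.
    """
    return sum(site_pi(c, n_chrom) for c in alt_counts) / n_sites


def hudson_fst(sites):
    """Ratio-of-averages Hudson Fst between two populations.

    sites -- iterable of (alt1, n1, alt2, n2): alternate-allele count and
             number of sampled chromosomes in each population.
    """
    num = 0.0
    den = 0.0
    for alt1, n1, alt2, n2 in sites:
        p1 = alt1 / n1
        p2 = alt2 / n2
        # Squared frequency difference, corrected for finite sample size.
        num += (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
        den += p1 * (1 - p2) + p2 * (1 - p1)
    return num / den if den > 0 else float("nan")
```

With 10 diploid individuals, `n_chrom` would be 20. Both statistics depend on the allele counts being unbiased, which is why artefacts from an unfinished assembly matter for this kind of inference.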



        • #5
          Hi Fernando


          >True, the distribution of coverage will include regions above 30x.
          For example, one of the examples in our paper is of SNP calling in 10 samples, each sequenced to 6x.

          2. Actually, you could call on 10 individuals with much less than 256Gb of RAM. You need 256Gb to hold ALL of their genomes at the same time. But lots of the genome is either monomorphic, or doesn't consist of things Cortex can call. So you could do those 10 samples in ~80Gb of RAM (for comparison, I've just done 85 humans in 320Gb of RAM).
          The trick is to call on the joint graph (1 colour, probably needs 80Gb RAM) and then pull out just the variants and make a graph just of the variants. Then "multicolourise" the graph and make a 10-colour graph of the variants only, and genotype everyone in that.
          Uses far less memory.

          How much coverage do your 10 samples have? Is the 60x individual a different sample?

          I'm not saying it is too risky with scaffolds, just that if you find something really exciting, you need to do some work making sure it's not an artefact. I've seen people have to work very hard to avoid problems with the chimp genome.

          best

          Zam



          • #6
            With 60X, you should be able to get an assembly decent enough for most analyses. This is true for human. Nonetheless, Zam is right that misassembly may cause artifacts. You have to live with it. If you are careful enough, you can greatly reduce the effect of that. Also beware that there will be reference bias when estimating population statistics (i.e. individuals closer to the reference will be mapped better).



            • #7
              Just to clarify one thing (and agree with Heng): my understanding is that Fernando doesn't want to have to wait until his assembly is finished (I mean done/completed, not finished by manual finishers), and wants to get on with it and start calling now. That's what got me nervous about artefacts.



              • #8
                Hi,

                Yes, Zam got it right. I want to start calling SNPs now. The assembly is unfinished and polishing it is going to take time (~1,000,000 scaffolds now). In response to Zam, the assembly is based on one individual, and the estimated coverage is 60x.

                The other 10 individuals have 20x coverage, and I want to use them for SNP calling and perhaps "pilot" genotype calling. I think it is worth moving ahead with this, even if a second round of calling based on the finished assembly is later needed to verify or reject candidate regions of interest.

                lh3: Thanks for your comment about the reference bias when estimating the population statistics...I will keep that in mind.

                Cheers,
                Fernando



                • #9
                  De novo SNP calling in absence of complete reference assembly

                  Hi all,

                  What about an incomplete reference genome, like that of papaya? The information available for papaya consists of scaffolds and contigs. Is it possible to use the papaya scaffolds as a reference to align my reads against? In my case, the objective is to discover SNPs.



                  • #10
                    Hi Rururara
                    Are you working on the same project as Fernando or a different one? If different, how many samples are you trying to discover SNPs in, what are their depths of coverage, and what technology were they sequenced with? Finally, sorry for my ignorance, but what is the ploidy of papaya?
                    regards
                    Zam



                    • #11
                      Hi Zam,

                      Rururara is not working on the same project as me. If papaya is a diploid, he could probably use the papaya scaffolds with the "Coordinates Only" option during the calling with cortex_var (actually an accompanying script called runcalls.pl). Right?

                      Cheers,
                      Fernando



                      • #12
                        Yes, and to explain that in more detail:
                        Rururara:

                        1. If you have one diploid sample, you can discover variants de novo using Cortex, and then use your contigs/scaffolds to assign them coordinates. This is what Fernando meant by "CoordinatesOnly", an option for Cortex's new wrapper script.

                        2. If you have several samples, then you can do two things
                        a) You can also use the Cortex "population filter" to classify putative variants as repeat/error/polymorphism - this method is robust to reference assembly errors - it catches missing collapsed repeats in the reference - and this will give you a high-quality set of variants
                        b) you could use this method to look into the quality of the reference and annotate regions which you trust and do not trust.

                        Zam



                        • #13
                          Hi Zam & fcr,

                          Yup, we are not on the same team. Hehe. Papaya is diploid. I have 3 samples, and one of the samples is a parental line. I'm not yet sure of the depth of coverage, as I have not received any sequencing information from the company, but I will soon. The papaya was sequenced on the HiSeq platform.



                          • #14
                            Hi there- when you say one of the samples is parental, does that mean you have two parents and 2 F1 samples, and you have sequenced one parent and both progeny?
                            Zam



                            • #15
                              Definitely yes. Is there any concern about that? Do you mind sharing? Anyway, I would like to try an approach in which I assemble the parental reads with the scaffolds and use the result as a reference sequence to align the two progeny against. What do you think?
