  • Assisted de novo genome assembly? Create new consensus mapping reads to reference?

    Greetings.

    My issue:

    I need the SNP-difference(s) between clone 1 and clone 2 of a haploid eukaryote. I do not have an assembled genome of this species for mapping, but there is one that is closely related with 100% synteny (I'll call it X, here).

    I have Illumina 50 bp paired-end reads. I tried mapping clone 1 and clone 2 to X separately, then extracting the SNPs, removing the intersection of the two sets, and keeping only what remained. But there are too many SNPs (about 1 per 100 bp), and the sequence quality differs problematically between the samples (the clone 1 sample is gorgeous; clone 2 is a little iffy but has higher coverage). As a result, the SNP list for clone 2 is more than twice as long as the one for clone 1, yet many of the spurious SNPs have coverage > 100 and Q scores over 100.

    (Pipeline = BWA -> sampe -> SAMtools mpileup -> BCFtools vcfutils.pl -> SNPs)

    I'm thinking of assembling one clone to X, exporting that as a new sequence, and then mapping clone 2 to it, cutting out the middle man. I am the only one in my immediate area working on this and I am just a mapping monkey: I don't know much about how to assemble a new genome and use it as a reference. Any advice on how to do this, or on alternative ways of solving the "find the SNPs between clone 1 and clone 2" problem, would be much appreciated.
    Last edited by zmartine; 02-07-2012, 09:00 AM.
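
A minimal sketch of the set comparison described in the post, assuming each clone's calls are available as VCF text with the standard tab-separated CHROM, POS, ID, REF, ALT, QUAL columns. The function names and the toy records are hypothetical. SNPs shared by both clones are differences from X and drop out; what remains in each direction is a candidate clone-specific SNP. A site can also appear "private" simply because it was not called in the other sample, so the output is a starting list, not a final answer.

```python
from typing import Iterable, Set, Tuple

Snp = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)
BASES = {"A", "C", "G", "T"}


def parse_vcf_snps(lines: Iterable[str], min_qual: float = 0.0) -> Set[Snp]:
    """Collect single-base substitutions from VCF lines.

    Header lines, indels, and multi-allelic records are skipped.
    Records whose QUAL is '.' (missing) are kept regardless of min_qual.
    """
    snps: Set[Snp] = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
        if ref.upper() not in BASES or alt.upper() not in BASES:
            continue
        if qual != "." and float(qual) < min_qual:
            continue
        snps.add((chrom, int(pos), ref.upper(), alt.upper()))
    return snps


def private_snps(a: Set[Snp], b: Set[Snp]) -> Tuple[Set[Snp], Set[Snp]]:
    """Return (calls only in a, calls only in b); shared calls are dropped."""
    return a - b, b - a


# Toy records standing in for the two call sets against X.
clone1_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t60\t.\t.",
    "chr1\t250\t.\tC\tT\t99\t.\t.",
]
clone2_vcf = [
    "chr1\t100\t.\tA\tG\t80\t.\t.",
    "chr1\t400\t.\tG\tA\t45\t.\t.",
]

only1, only2 = private_snps(parse_vcf_snps(clone1_vcf), parse_vcf_snps(clone2_vcf))
print(sorted(only1))
print(sorted(only2))
```

Raising `min_qual` for the noisier sample is one way to shrink its list, at the cost of losing real low-quality calls.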

  • #2
    This is a bit tricky but I would say your most likely options are:

    1) do a true de novo assembly of clone 1, exclude the repetitive contigs using coverage depth as a guide and call SNPs by mapping clone 2 against those contigs - you could use Velvet, SOAPdenovo etc. for the assembly, and your regular aligner for mapping

    2) do a reference-guided de novo assembly of clone 1 and then map clone 2 against that - you could use CLC Bio or MIRA for that

    3) do a true de novo assembly of clone 1 with the same method as 1), scaffold the contigs against your reference using something like BAMBUS and map SNPs back

    Have a look at http://www.molecularevolution.org/re...owtie_activity for a simple tutorial on Velvet.
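
For option 1, the "exclude the repetitive contigs using coverage depth as a guide" step could look roughly like the sketch below. It assumes you already have a mean coverage value per contig (most assemblers report one); the contig names, the numbers, and the 1.5x cutoff are all hypothetical and would need tuning for real data. Contigs with coverage far above the median are treated as likely collapsed repeats.

```python
from statistics import median
from typing import Dict, List


def single_copy_contigs(coverage: Dict[str, float], max_ratio: float = 1.5) -> List[str]:
    """Keep contigs whose coverage is at most max_ratio times the median.

    The median is used as the 'typical' single-copy depth because it is
    not pulled upwards by a few very deep repeat contigs.
    """
    if not coverage:
        return []
    typical = median(coverage.values())
    return [name for name, cov in coverage.items() if cov <= max_ratio * typical]


# Hypothetical per-contig mean coverage values.
contig_cov = {
    "contig_1": 48.0,
    "contig_2": 52.0,
    "contig_3": 50.0,
    "contig_4": 210.0,  # roughly 4x the typical depth: likely a collapsed repeat
    "contig_5": 47.0,
}

kept = single_copy_contigs(contig_cov)
print(kept)
```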


    • #3
      If you want a non de novo method, you could try aligning the higher quality data to your reference, make a new reference by correcting for the SNPs you found, and then realign to that corrected reference. Hopefully, you should find that the number of new SNPs is drastically lower. You can try iterating that another time, then align sample 2 to your corrected reference.
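
The "make a new reference by correcting for the SNPs you found" step can be sketched as below, assuming the reference is held in memory as a name-to-sequence mapping and the SNPs are simple single-base substitutions with 1-based positions (as in VCF). The function name and toy data are hypothetical; indels are deliberately not handled, since they would shift coordinates.

```python
from typing import Dict, Iterable, Tuple

Snp = Tuple[str, int, str, str]  # (chrom, 1-based pos, ref base, alt base)


def apply_snps(reference: Dict[str, str], snps: Iterable[Snp]) -> Dict[str, str]:
    """Return a copy of the reference with each SNP's alt base substituted.

    Raises ValueError if a SNP's ref base does not match the reference,
    which usually means the calls were made against a different sequence.
    """
    corrected = {name: list(seq) for name, seq in reference.items()}
    for chrom, pos, ref, alt in snps:
        bases = corrected[chrom]
        index = pos - 1  # convert 1-based position to 0-based index
        if bases[index].upper() != ref.upper():
            raise ValueError(
                f"{chrom}:{pos} expected {ref} but reference has {bases[index]}"
            )
        bases[index] = alt
    return {name: "".join(bases) for name, bases in corrected.items()}


# Toy reference and SNP calls from the higher-quality sample.
reference = {"chr1": "ACGTACGT"}
snps = [("chr1", 2, "C", "T"), ("chr1", 8, "T", "A")]

corrected = apply_snps(reference, snps)
print(corrected["chr1"])  # ATGTACGA
```

Each iteration would then realign the same reads to `corrected`, call SNPs again, and repeat until few or no new SNPs appear.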


      • #4
        I'm beginning to feel a bit self-conscious about the fact that I seem to do so much self-promotion on SEQanswers recently. Apologies to those of you bored of hearing from me; I'll be brief. Martine - there is a de novo assembly variant caller, called Cortex, designed for this kind of question, which works on haploid and diploid organisms.
        See

        and the paper here:

        It will assemble and look for differences directly, and spit out variants in flank-allele1-allele2-flank format, which if you want can then be turned into VCF with respect to your X outgroup/related genome, or VCF with respect to a consensus.

        If, as seems to be the case, you have a tonne of coverage, you can tell it to only pay attention to high quality reads.


        • #5
          That sounds cool Zam, I definitely will be giving it a try for my own projects.


          • #6
            Zam, this looks like just what I was searching for!


            • #7
              error

              Originally posted by nickloman:
              That sounds cool Zam, I definitely will be giving it a try for my own projects.
              Hello Zam, I am working with Martine in trying to run cortex. I ran the following line and received the error below:

              command:
              cortex_var_31_c1 --pe_list PyN67C_R1.fastq,PyN67C_R2.fastq --format FASTQ --quality_score_threshold 5 --remove_pcr_duplicates --remove_seq_errors --dump_binary PyN67C_R1-R2.ctx --kmer_size 21 --max_read_len 50

              error:
              Start loading @HWI-ST183:319:c02a2acxx:4:1101:1017:2087 1:Y:0: and @HWI-ST183:319:c02a2acxx:4:1101:1017:2087 2:Y:0:
              cannot open file:@HWI-ST183:319:c02a2acxx:4:1101:1017:2087 1:Y:0:


              • #8
                Hi!
                The --pe_list option wants a pair of filelists, i.e. two files that each list FASTQ files. You have given it a pair of FASTQ files. The manual goes through some explicit examples. It is also worth doing the first two examples in the demo directory.
                Cheers
                Zam
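
The error in #7 fits this explanation: the FASTQ was read as if it were a filelist, so its first line (a read header) was taken as a file name. A minimal sketch of the fix is below, assuming one FASTQ path per line in each list and that the two lists must be in matching order. The list file names `pe_list_1.txt` and `pe_list_2.txt` and the helper function are hypothetical; only the `--pe_list` option and the FASTQ names come from the thread, and the Cortex manual is the place to confirm the exact filelist conventions.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def write_pe_filelists(r1_fastqs: List[str], r2_fastqs: List[str], list_dir: str) -> Tuple[Path, Path]:
    """Write two plain-text filelists, one FASTQ path per line.

    Line i of the first list is assumed to be the mate file of line i
    of the second list, so both lists must have the same length.
    """
    if len(r1_fastqs) != len(r2_fastqs):
        raise ValueError("read 1 and read 2 lists must have the same length")
    out_dir = Path(list_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    left = out_dir / "pe_list_1.txt"   # hypothetical file name
    right = out_dir / "pe_list_2.txt"  # hypothetical file name
    left.write_text("".join(f"{path}\n" for path in r1_fastqs))
    right.write_text("".join(f"{path}\n" for path in r2_fastqs))
    return left, right


# Demonstration in a temporary directory, using the FASTQ names from #7.
with tempfile.TemporaryDirectory() as tmp:
    left, right = write_pe_filelists(["PyN67C_R1.fastq"], ["PyN67C_R2.fastq"], tmp)
    # The pair of filelists, not the FASTQ files, is what goes after --pe_list.
    print(f"--pe_list {left},{right}")
    print(left.read_text(), end="")
```

The rest of the command from #7 would stay as it was, with only the argument to `--pe_list` swapped for the two list files.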


                • #9
                  You could also look into ICORN http://icorn.sourceforge.net/ as this is an iterative correction of the reference to pull in more data and to find more SNPs. It should work well on closely related species, but I'm not sure how far it can push things.
