  • Targeted de novo assembly

    Hello,

    I have an interesting problem and am looking for some advice.

    I have whole-genome paired-end (PE) Illumina resequencing data and I am interested in doing a de novo assembly of three particular genes. Thus far, we have done assemblies with bwa against the genome sequence of a closely related species (4-5% divergence). However, the genes I am interested in are newly inserted in our species, so they are absent from our current assembly. These genes are also rapidly evolving, and I expect a lot of structural rearrangements relative to the gene sequences I already have. My plan had been this:

    1. Filter my reads for quality using the FASTX-Toolkit and build a BLAST database of the reads.
    2. BLAST the reads against a reference sequence of my genes to identify the subset of reads that map to this region (and their mates).
    3. Do a de novo assembly of those reads (we have used SOAPdenovo in our lab; other suggestions?).

    However, simply building the BLAST database of the reads is taking more than 12 hours, and I imagine the BLAST search itself will be even slower. Is there a better way to pull out the reads that map to my genes of interest? Should I just do a bwa alignment using my three genes as the reference instead of BLAST?

    Thanks!
    Sarah Kingan
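Step 2 of the plan above (collecting the hit reads together with their mates) could be sketched roughly as follows. This is only an illustration under stated assumptions: the names of the hit reads have already been gathered from whatever aligner is used, the two FASTQ files list the pairs in the same order, and mate names differ only by an optional /1 or /2 suffix. The function names are hypothetical, not part of any tool mentioned in this thread.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (header, seq, plus, qual) from a 4-line-per-record FASTQ stream.

    `handle` is assumed to be a file-like object (an iterator over lines).
    """
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return
        yield tuple(record)


def base_name(header):
    """Drop the leading '@', any comment, and a trailing /1 or /2 suffix."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def extract_pairs(fq1, fq2, wanted):
    """Return both mates of every pair whose base name is in `wanted`.

    Assumption: fq1 and fq2 hold the pairs in the same order.
    """
    kept = []
    for r1, r2 in zip(read_fastq(fq1), read_fastq(fq2)):
        if base_name(r1[0]) in wanted or base_name(r2[0]) in wanted:
            kept.append((r1, r2))
    return kept
```

With real data, `fq1` and `fq2` would be open file handles and `wanted` a set of read names parsed from the alignment output; keeping both mates means a read that did not hit the reference is still carried along if its partner did.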

  • #2
    So you have reference sequences for the genes you are looking for; what % identity do you expect? Blasting all your reads against your reference genes does not seem to be the smartest way. ;-) Using bwa or vmatch might be a lot faster, but of course your results depend on your sequence identity.

    • #3
      Hi Thorondor,
      The % identity should be very high, <3% divergence for the orthologous sequences. The problem is that there are many repeat elements in and around the genes, so the structure is not conserved. Right now I am pulling the reads that align to the flanking sequence in my bwa alignment and will do a de novo assembly of those reads using MIRA. MIRA claims to be good at assembling repetitive sequence and difficult-to-align regions.

      I may then do another iteration. Using bwa, I will map the previously unmapped reads to the contig(s) I built with MIRA. Then I will pull the singletons whose mates mapped and do another de novo assembly with MIRA.

      Sarah
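The "pull the singletons whose mates mapped" step can be illustrated with the FLAG field of SAM output. The sketch below is a minimal example, assuming the bwa alignments are available as plain SAM text lines; it relies only on the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The function name is hypothetical.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def half_mapped_names(sam_lines):
    """Return names of pairs in which exactly one mate is mapped."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:      # not a complete alignment record
            continue
        flag = int(fields[1])
        if not flag & PAIRED:
            continue
        read_unmapped = bool(flag & UNMAPPED)
        mate_unmapped = bool(flag & MATE_UNMAPPED)
        # Keep the pair only when one mate mapped and the other did not.
        if read_unmapped != mate_unmapped:
            names.add(fields[0])
    return names
```

The returned names identify both the anchored read and its unmapped partner, so both can be fed into the next round of assembly.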

      • #4
        Originally posted by skingan
        I may then do another iteration. Using bwa, I will map the previously unmapped reads to the contig(s) I built with mira. Then pull the singletons whose mates mapped and do another de novo assembly with mira
        I've been doing this to manually close gaps that can't be assembled using various short-read assemblers, and it generally works great if you restrict the new de novo assembly to the regions where you have "problems".

        E.g. using only the reads in the vicinity of where you expect your gene to be.

        We normally use CLC as it is extremely fast and memory efficient (and expensive..). However, most assemblers should be able to handle the repeats if it is just locally. In my experience the problem arises when you have the same repeat regions in multiple areas of the genome, and that is solved by doing the local assembly.

        rgds
        Mads
        Last edited by MadsAlbertsen; 05-02-2011, 08:11 AM. Reason: Clarifying..
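Restricting the assembly to reads near the expected gene location could look something like the sketch below. It is an illustration only: it assumes the alignments are plain SAM text lines, uses just the leftmost mapping position (POS) rather than the full alignment span, and the function name and the default `flank` of 1000 bp are hypothetical choices.

```python
def names_near_region(sam_lines, chrom, start, end, flank=1000):
    """Return names of mapped reads whose POS lies within the padded region.

    Coordinates are 1-based, as in SAM. Only the leftmost aligned position
    is checked, which is a simplification.
    """
    lo, hi = max(1, start - flank), end + flank
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:      # not a complete alignment record
            continue
        flag, rname, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & 0x4:            # skip unmapped reads
            continue
        if rname == chrom and lo <= pos <= hi:
            names.add(fields[0])
    return names
```

Because mates share a read name, collecting names from the mapped reads also picks up partners that did not map, provided both mates are later extracted by name.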

        • #5
          Well, if your genes of interest are not well covered, you might also take a look at LOCAS for your assembly:

          • #6
            Originally posted by skingan
            Hi Thorondor,
            The % identity should be very high, <3% divergence for the orthologous sequences. The problem is that there are many repeat elements in and around the genes so the structure is not conserved. Right now I am pulling the reads that align to the flanking sequence in my bwa alignment and will do a de novo assembly of those reads using mira. Mira claims to be good at assembling repetitive sequence and difficult to align regions.

            I may then do another iteration. Using bwa, I will map the previously unmapped reads to the contig(s) I built with mira. Then pull the singletons whose mates mapped and do another de novo assembly with mira.

            Sarah
            I'm performing a similar analysis. I did find that targeted de novo assembly deals with short tandem repeats very nicely. I'm now wondering whether there is software that integrates the results of a targeted de novo assembly with the reference genome, so that I can still use samtools, for example, for SNP calling. Thanks in advance for any information.

            Sue
            Last edited by shiva; 06-20-2011, 08:37 PM.

            • #7
              Dear all,

              I'd like to mention a tool called mapsembler. It takes some sequence fragments and a set of (Illumina) reads. It tries to reconstruct each sequence fragment using the reads (allowing some substitutions), and each sequence it manages to reconstruct is then extended to the left and right by targeted assembly.

              The output may be either a FASTA file (a contig containing the sequence) or a graph that shows indels, SNPs, or more complex events like gene fusion, exon skipping...

              The tool and documentation are accessible here: http://alcovna.genouest.org/mapsembler/

              Any comment / feedback welcome.

              Pierre

              • #8
                mapsembler usage

                Pierre
                Mapsembler sounds like it may work for one of my projects.
                Have you used it before?
                Can I import the output into a viewer so that I can see how it attempted to assemble the sequences around the 'starter'?

                thanks

                • #9
                  Dear salmonella,

                  Sorry for this late answer...

                  I'm one of the authors of Mapsembler.
                  The output of mapsembler, when the graph output is used, can be viewed in any viewer able to deal with the XGMML format (I'm using Cytoscape) or .graphml (I'm using Gephi).

                  Pierre
