  • Failed Genome assembly

    Dear all,

    I'm trying to assemble the genome of a Drosophila species, but I'm running into serious problems.

    I have 4 Illumina libraries: 1 paired-end (PE) library (100 bp reads, 300 bp insert size) and 3 mate-pair (MP) libraries (100 bp reads; insert sizes 1.5 kb, 4.5 kb and 9 kb).

    I started with two assemblers: SOAPdenovo2 (with different k-mer sizes) and IDBA-UD. In parallel I'm also trying MaSuRCA and CABOG (Celera Assembler), but I'm still waiting for those results. I'm using PE or PE + 1.5 kb MP for contig construction and all MPs for scaffolding.
    From SOAPdenovo2 and IDBA-UD I get assemblies with a very small N50 of ~160 bp. Changing the k-mer size or the libraries used for contig construction does not change the results.
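For anyone following along, here is a minimal sketch of how N50 can be computed from a list of contig lengths. The function name `n50` and the example lengths are made up for illustration; the assemblers report this statistic themselves, so this is only meant to show what the ~160 bp figure measures.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L
    together cover at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths, not taken from the assemblies in this thread.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))
```

An N50 of ~160 bp with 100 bp reads means that most contigs are barely longer than a single read.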

    So I went back to the libraries and plotted the k-mer spectrum for each of the four. I used Jellyfish + KAT to plot the spectra (PDF attached). The plots look quite bad: the characteristic peak is absent or masked by a very high peak of rare k-mers. Also, according to FastQC, ~30% of reads are duplicated in the PE library, and 20%, 20% and 40% in the 1.5 kb, 4.5 kb and 9 kb MP libraries respectively. I'm guessing there is PCR bias.
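As a toy illustration of what a k-mer spectrum is, the sketch below counts canonical k-mers in a handful of reads and tallies how many distinct k-mers occur at each multiplicity. This is not how Jellyfish or KAT work internally; the names `canonical` and `kmer_spectrum` are hypothetical, and a pure-Python counter like this will not scale to real Illumina libraries.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_spectrum(reads, k):
    """Map multiplicity -> number of distinct canonical k-mers seen that many times."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" in kmer:
                continue  # skip k-mers containing ambiguous bases
            counts[canonical(kmer)] += 1
    return Counter(counts.values())


# Tiny made-up reads, only to show the shape of the result.
spectrum = kmer_spectrum(["ACGTA", "ACGTA"], k=3)
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

In a clean library the histogram shows a peak near the sequencing coverage; a dominant spike at multiplicity 1 or 2, as described above, usually points to many error-containing k-mers.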

    Any advice?

    Thanks a lot.

  • #2
    Could you post the FastQC reports?
    Have you tried aligning your reads to another Drosophila assembly? Brian's BBMap will quickly give you some very helpful error profiles ( http://seqanswers.com/forums/showpos...25&postcount=1 ).
    I have never seen IDBA applied to eukaryotic genomes before; SOAPdenovo should work, of course.
    Last edited by luc; 10-19-2014, 09:35 PM.



    • #3
      I would suggest using one tool at a time. MaSuRCA and SOAPdenovo take a lot of memory. I presume you are not on a cluster/server but on a default 4 GB machine.

      Make sure your config files are correct.

      PS:
      I was unable to run either tool successfully because of one error or another. I was running out of time, so I later used Velvet. Velvet is easy to run.
      You will have to play around with it a lot. The default k-mer settings might not work with your sample.
      Last edited by bio_informatics; 10-19-2014, 08:03 AM.
      Bioinformaticscally calm



      • #4
        Which Drosophila species are you trying to assemble? I have assembled a few Drosophila genomes.

        I would suggest using Velvet, and running one tool at a time when assembling any genome, unless you are using online/cloud-based resources like Galaxy and iPlant.



        • #5
          Hi guys,

          Thanks for your help, it is really appreciated. The FastQC profiles are attached (fastqc_data_R1-2.zip).

          My species is D. nigrosparsa. The most closely related species for which a genome is available is D. grimshawi. I know this because I already built the phylogeny using the mtDNA genome assembled from these reads (data not yet published). The most recent common ancestor of the two species is quite old, around 20 Mya (data not yet published). So I think using it as a reference would not be very useful, but I will give it a try. I have also attached another file (Dgri.vs.ShortInsert_reads.31.1.pdf), also produced with KAT, which plots the k-mers shared between my short-insert reads and the D. grimshawi genome. As you can see, the shared fraction is very small.
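To make "fraction in common" concrete, here is a rough sketch of one way to measure k-mer sharing between a read set and a reference: collect the distinct canonical k-mers on each side and report what share of the read k-mers also occur in the reference. This is an assumption about the kind of quantity the KAT plot summarises, not a reimplementation of KAT; `kmer_set` and `shared_fraction` are hypothetical names and the sequences are toy data.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def kmer_set(sequences, k):
    """Collect the distinct canonical k-mers found in a list of sequences."""
    kmers = set()
    for sequence in sequences:
        seq = sequence.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                kmers.add(canonical(kmer))
    return kmers


def shared_fraction(reads, reference, k):
    """Fraction of distinct read k-mers that are also present in the reference."""
    read_kmers = kmer_set(reads, k)
    if not read_kmers:
        return 0.0
    reference_kmers = kmer_set(reference, k)
    return len(read_kmers & reference_kmers) / len(read_kmers)


# Toy sequences only; real genomes need a disk-backed counter such as Jellyfish.
print(shared_fraction(["ACGTA"], ["ACGTT"], k=3))
```

With species that diverged long ago, long k-mers rarely survive intact, so a small shared fraction is expected and does not by itself say anything about read quality.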

          As for computational power, I'm running these analyses on a cluster, so I have enough resources to run two or three assemblies in parallel.

          Yes, I'm also not so sure about the MaSuRCA config file (see attachment: MaSuRCA_Dnig.conf.txt). Maybe somebody could take a look at it.

          What do you think, guys?

          Thanks a lot


          • #6
            Could you help me understand how and why the JUMP keyword is used in the MaSuRCA config file?
            Apologies for taking this a bit off topic.
            Bioinformaticscally calm



            • #7
              Originally posted by bio_informatics
              Could you help me understand how and why the JUMP keyword is used in the MaSuRCA config file?
              Apologies for taking this a bit off topic.
              Maybe I got it wrong, but I understood that JUMP stands for jumping libraries, i.e. mate pairs. Is that wrong?



              • #8
                I do not know; the manual only says that it is used for libraries.
                Let's wait for more experienced users to enlighten us here.
                Bioinformaticscally calm



                • #9
                  I can tell you, they are the same:

                  "[...] mate-pair sequencing, which is basically a combination of Next Generation Sequencing with jumping libraries"

                  F



                  • #10
                    Thank you.
                    Originally posted by francicco
                    I can tell you, they are the same:

                    "[...] mate-pair sequencing, which is basically a combination of Next Generation Sequencing with jumping libraries"

                    F
                    Bioinformaticscally calm
