Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Assembly MiSeq

    I am trying to assembly an oomycete genome ~40 Mb. I have around 20 millons of 250 bp PE-reads. I tried Velvet at the first time but I got a very fragmented assembly. I tried different K-mers (till 125) but still the contigs were very small. I know that MIRA could be an alternative for MiSeq. I tried to input MIRA with joined reads (using FLASH) but only 18% of the reads overlap. My sequencing facility said that the library insert is about 300 bp. This is weird !!!! considering the the small amount of overlapped reads.
    Any comments ?

  • #2
    You can use velvet to get an estimate of the insert length. I think it calculates it from the reads that it does manage to assemble. There is a script in the contrib folder in velvet.

    Comment


    • #3
      Flash joins paired ends. This is used to extend contigs of a targeted amplicons.
      Just because your paired end fragments do not overlap, this does not mean that sequences among paired end reads do not over lap. they most certainly will, and whole genome sequencing usually has a known insert size. This guilds alignment software in contig assembly. You shouldn't use FLASH on whole genome data.

      There are many NGS assembler out there, BOWTIE 2 being one good example.

      Have a play around with the paired end data, specifying your insert size.

      Comment


      • #4
        Bowtie2 isn't an assembler, it's a mapper. You can "sort of" assemble with a mapper by generating consensus sequences based on a reference assembly, but I'm guessing that this oomycete assembly needs to be de-novo.

        FWIW, an insert size of 300bp may mean an average fragment size of 800bp, depending on how your sequencing facility defines their numbers.

        For long reads I'd recommend using a non-kmer approach for assembly (i.e. not velvet), because splitting up into kmers will lose the benefit of 250bp sequencing (or longer if it's paired end with overlap -- a fragment size of less than 500bp).

        Comment


        • #5
          Sorry, My mistake. Indeed bowtie requires a reference to map to. I have performed de novo assemblies with Mira and Newbler with pretty good success.

          Comment


          • #6
            Originally posted by joxcargator73 View Post
            I am trying to assembly an oomycete genome ~40 Mb. I have around 20 millons of 250 bp PE-reads. I tried Velvet at the first time but I got a very fragmented assembly. I tried different K-mers (till 125) but still the contigs were very small. I know that MIRA could be an alternative for MiSeq. I tried to input MIRA with joined reads (using FLASH) but only 18% of the reads overlap.
            If you use MIRA, you don't want to use reads joined by another program -- just give MIRA the raw files and let it take care of the details. You would use a manifest file something like:

            --------------------

            project = oomycete
            parameters = COMMON_SETTINGS \
            -GENERAL:number_of_threads=20 \
            -NW:cmrnl=no\
            SOLEXA_SETTINGS \
            -CLec
            job = genome,denovo,accurate
            readgroup = miseq500cycle
            data = xxx_R1_001.fastq xxx_R2_001.fastq
            technology = solexa
            template_size = 50 1000 autorefine
            segment_placement = ---> <---

            ------------------
            You just run mira the normal way. Put the stuff between the "----" in a file called something like "manifest.conf" and then run mira with:

            mira manifest.conf

            Possibly you would want to background the process and/or capture output to a log file.
            Always best to check your manifest for errors with:

            mira -m manifest.conf

            I am presuming you have a computer with 20 or more cores on it. So you may want to adjust that "number_of_threads"

            Anyway, I think the issue here would be the amount of time it take to get an assembly. Mira is very thorough. So you are probably looking at over a week, even on a powerful computer.

            If that is a problem, you could get a result much more quickly with abyss-pe. Just:

            abyss-pe k=61 name=Oomycete in='xxx_R1_001.fastq xxx_R2_001.fastq'

            You will likely get a result much more quickly.

            --
            Phillip

            Comment


            • #7
              Originally posted by pmiguel View Post
              If you use MIRA, you don't want to use reads joined by another program -- just give MIRA the raw files and let it take care of the details. You would use a manifest file something like:

              --------------------

              project = oomycete
              parameters = COMMON_SETTINGS \
              -GENERAL:number_of_threads=20 \
              -NW:cmrnl=no\
              SOLEXA_SETTINGS \
              -CLec
              job = genome,denovo,accurate
              readgroup = miseq500cycle
              data = xxx_R1_001.fastq xxx_R2_001.fastq
              technology = solexa
              template_size = 50 1000 autorefine
              segment_placement = ---> <---

              ------------------
              You just run mira the normal way. Put the stuff between the "----" in a file called something like "manifest.conf" and then run mira with:

              mira manifest.conf

              Possibly you would want to background the process and/or capture output to a log file.
              Always best to check your manifest for errors with:

              mira -m manifest.conf

              I am presuming you have a computer with 20 or more cores on it. So you may want to adjust that "number_of_threads"

              Anyway, I think the issue here would be the amount of time it take to get an assembly. Mira is very thorough. So you are probably looking at over a week, even on a powerful computer.

              If that is a problem, you could get a result much more quickly with abyss-pe. Just:

              abyss-pe k=61 name=Oomycete in='xxx_R1_001.fastq xxx_R2_001.fastq'

              You will likely get a result much more quickly.

              --
              Phillip
              Thanks I run a small set of reads ~ 8 millions PE with MIRA. I got a very fragmented assembly again ..N50 = 1,8 Kb. It took two days. Then I run all the reads 20 millons (PE). After 5 days I got better results N50 = 3.5 Kb. However It is still fragmented. I learned that High genome heterozygosity might create problems in the assembly. In fact we believe that the size of the genome is around 40 Mb, but the total consensus that MIRA outputs is around 92 Mb. I am doing some research for diploid genomes with high level of heterozygosity. I found a couple of tools that could help. One of these is hapsembler. Suggestions are welcome. Thanks again
              Last edited by joxcargator73; 02-28-2014, 02:56 PM.

              Comment


              • #8
                It is frequently the case that error correction can improve assembly. Have a go at using BayesHammer (i.e. just the 'error correction' part of SPAdes), or Quake to correct reads before assembly.

                Comment


                • #9
                  Originally posted by gringer View Post
                  It is frequently the case that error correction can improve assembly. Have a go at using BayesHammer (i.e. just the 'error correction' part of SPAdes), or Quake to correct reads before assembly.
                  Thanks, I was entertaining that idea too. I will try that too.

                  Comment


                  • #10
                    Originally posted by joxcargator73 View Post
                    Thanks, I was entertaining that idea too. I will try that too.
                    In my testing with Velvet and SoapDenovo, error correction has a minor positive effect and normalization has a greater positive effect (particularly on haploid genomes; less effective on diploids). If you're willing to try, I have a couple of tools that might help, available here. One is BBNorm, which does error correction and depth normalization. Velvet tends to produce the best assemblies at roughly 40x read coverage. To normalize to a target 40x kmer depth (at K=31), you would:

                    bbnorm.sh in=read1.fq in2=read2.fq out=norm1.fq out2=norm2.fq target=40

                    You can turn on error-correction with the flag "ecc=t"

                    Or you can run error-correction without normalization with the script "ecc.sh". They both call the same program. It's very fast.

                    I've had mixed results with read overlap merging. It is very useful in quickly plotting a library's insert size distribution or trimming adapters, but seems to reduce the quality of Soap's assemblies (not sure about Velvet). On the other hand, Ray seems to give better assemblies with merged reads. But if you do perform merging, my merger is much faster and substantially more accurate (in the sense of few false positives) than Flash, with a similar merging rate.

                    bbmerge.sh in=read1.fq in2=read2.fq out=merged.fq outsingle=outu=unmerged.fq outu2=unmerged2.fq hist=hist.txt


                    This command will also produce an insert-size histogram in hist.txt.

                    You can also try quality-trimming, if your 250bp library has a strong dropoff, and adapter removal if many reads are shorter than 250bp. I have a tool for that, too -

                    bbduk.sh in=read1.fq in2=read2.fq out=trimmed1.fq out2=trimmed2.fq k=25 ktrim=r hdist=1 qtrim=rl trimq=10 ref=adapters.fa.gz minlen=100

                    This will trim adapter sequence from the right end of the read (where it occurs if the insert size drops below read length) by matching kmers from the attached "adapters.fa.gz" (which contains TruSeq universal adapter sequences). It requires a 25-mer match allowing 1 mismatch (hdist=1). You can set the number of mismatches higher to account for errors, but it requires exponentially more memory. Then, the read will be quality-trimmed to Q10 on the right and left sides, and if either read is shorter than 100bp, the pair will be discarded. You can of course enable and disable all of these things. Note that read merging generally works best if you first remove adapters but don't quality-trim, then merge, then quality-trim the unmerged pairs, as bbmerge will internally account for base quality.

                    Good luck!

                    P.S. You can get help information on usage by running the shellscripts without any arguments.
                    Attached Files
                    Last edited by Brian Bushnell; 03-05-2014, 07:47 PM.

                    Comment


                    • #11
                      You can definitely give SPAdes a try. The instructions to assemble 2x250 reads can be found in the manual (http://spades.bioinf.spbau.ru/releas...al.html#sec3.4), though everything described there is just a default behavior as of SPAdes 3.0.

                      Feel free to contact SPAdes support in case of any problems.

                      Comment


                      • #12
                        Thanks for your help. I am trying now different options. I will post the results soon.

                        Comment


                        • #13
                          This is an update of the results using Spades.
                          I first usde spades with the following parameters
                          spades.py -k 21,33,55,77,99,127 --careful -1 forward -2 reverse -o spades_output

                          The output improved a lot in comparison to I had with other assemblers (Velvet and MIRA)

                          N50 11.9 kb
                          number of contigs: 18,063
                          Total base 49 Mb

                          The I tried DiSpades with two simple paratemers

                          dipspades.py -k 21,33,55,77,99,127 --careful -1 reads_left.fastq -2 reads_right.fastq --expect-rearrangements -o output_dir

                          N50 12.7 Kb
                          number of contigs 5,186
                          Total bases 33 Mb

                          I will continue trying other parameters, but I have a question with regard the difference in the total bases with SPdes and Dispades. I now that the second was designed for diploid polymorphic organisms. this is the case of my bug. I wonder if spades is producing extra contigs and these are artifacts. or Dispades is missing some contigs during the assembly. We now that the total genome of our bug is between 30 and 50 Mb. Both assemblies look promising but we have to confirm the real size of the gemome by PFGE.

                          On the other hand bbmap helped a lot to double check my reads. It is a very useful too. Thanks

                          Comment


                          • #14
                            I would suggest reading DipSPAdes manual (http://spades.bioinf.spbau.ru/releas...es_manual.html) in order to understand the differences between SPAdes and DipSPAdes output.

                            In short: since you have highly diploid genome, then the conventional assembler (like SPAdes) would produce you the so-called set of 'double' contigs, that is - each contig will be from the one of the haplomes. Surely, the total length of the assembly will be pretty close to the double of the expected genome size due to ploidy (if the k-mer is large enough and the ploidy level is more than 0.5-1%).

                            DipSPAdes uses haplocontigs to produce consensus contigs, which contain the consensus from both haplomes. Therefore, the expected genome size there will be close to the real genome size. Even more, DipSPAdes will also use the differences in the haplocontigs to resolve the additional repeats.

                            Hope this helps.

                            Comment


                            • #15
                              Thanks for your answer.
                              Yes we are aware of the differences between Spades and dispades. I just wanted to double check and share the results.
                              I am thinking to do do a couple of PacBio runs (long reads), to improve the assembly. I think that Dispades can take PacBio reads, even if there are not corrected.
                              I believe that your comments helped a lot.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM
                              • seqadmin
                                The Impact of AI in Genomic Medicine
                                by seqadmin



                                Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                02-26-2024, 02:07 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-14-2024, 06:13 AM
                              0 responses
                              33 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-08-2024, 08:03 AM
                              0 responses
                              72 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-07-2024, 08:13 AM
                              0 responses
                              81 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-06-2024, 09:51 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X