  • Velvet and long Illumina reads

    I have previously used Velvet for paired-end de novo assembly of bacterial genomes using 36 bp reads. We have now received some reads that are 72 bp long. Using Velvet on these data sets results in a very large number of contigs (several thousand). When the reads are cut down to 36 bp, the number of contigs becomes what I would expect (a few hundred, with the correct genome size). I have also tried splitting each 72 bp read into two 36 bp reads with good results (i.e. it is not due to low quality at the ends; a sketch of this splitting is shown below). Has anyone else had these problems with "long" Illumina reads, and how did you deal with it?
    I have used both Velvet 0.7.20 and 0.7.31.

    Best regards,

    Peter
    Last edited by Peter Bjarke Olsen; 04-15-2009, 03:48 AM.
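    A minimal sketch of the read-splitting workaround described above (not from the original posters), assuming plain single-end FASTQ input; the file names and the 36 bp half-length are illustrative assumptions.

    Code:
    HALF = 36

    def split_fastq(in_path, out_path, half=HALF):
        """Split each read in a FASTQ file into two half-length reads."""
        with open(in_path) as fin, open(out_path, "w") as fout:
            while True:
                header = fin.readline().rstrip()
                if not header:
                    break                     # end of file
                seq = fin.readline().rstrip()
                fin.readline()                # '+' separator line
                qual = fin.readline().rstrip()
                # write the 5' half and the 3' half as two independent reads
                halves = ((seq[:half], qual[:half]),
                          (seq[half:2 * half], qual[half:2 * half]))
                for i, (s, q) in enumerate(halves, start=1):
                    fout.write(f"{header}_part{i}\n{s}\n+\n{q}\n")

    if __name__ == "__main__":
        # hypothetical file names
        split_fastq("reads_72bp.fastq", "reads_36bp_split.fastq")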

  • #2
    I think that Velvet works best within a certain range of depth of coverage. Did you try using fewer of the 72 bp reads to assemble contigs? If using half of the 72 bp reads gives you the expected number of contigs, then the issue is not long reads but too much sequence causing very high depth of coverage (a subsampling sketch follows below).
    --
    bioinfosm
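    One way to try the "fewer reads" experiment suggested above is random subsampling of the FASTQ file. A rough sketch (not from the original posters), assuming single-end FASTQ and an arbitrary keep fraction of 0.5; file names are illustrative.

    Code:
    import random

    FRACTION = 0.5   # illustrative: keep roughly half of the reads

    def subsample_fastq(in_path, out_path, fraction=FRACTION, seed=42):
        """Randomly keep a fraction of the records in a FASTQ file."""
        rng = random.Random(seed)
        with open(in_path) as fin, open(out_path, "w") as fout:
            while True:
                record = [fin.readline() for _ in range(4)]   # one FASTQ record
                if not record[0]:
                    break                                     # end of file
                if rng.random() < fraction:
                    fout.writelines(record)

    if __name__ == "__main__":
        # hypothetical file names
        subsample_fastq("reads_72bp.fastq", "reads_72bp_subsampled.fastq")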



    • #3
      In mchaisson's paper, mate pairs significantly improve assembly, but read length doesn't help much until it passes a certain threshold (35 nt for E. coli and 60 nt for S. cerevisiae).

      Low coverage might be a problem. Longer reads can result in lower coverage in certain regions. My guess is that Velvet tries to maximize contig N50 and will not use reads in those poorly covered regions, so you get a fragmented assembly.

      Correct me if I'm wrong
      Melissa



      • #4
        Thanks for the answers. I think you are on to something, bioinfosm. I have reduced the raw data by 36% with the fastx-toolkit and get a more reasonable number of contigs with the expected genome size. I still find it strange that I can get the same (good) result by splitting the reads into 36 bp fragments.



        • #5
          I find that pretty strange too, Peter. Forgive me, but are you absolutely sure that what you have aren't paired 36bp reads? That would explain why leaving them together mucks up your assembly. Is it possible that your source did paired end reads and didn't pass that on to you?

          Another way to look at that would be to plot the average base quality versus cycle (a small plotting sketch follows below). If they're actually paired, then the most likely way you've received them would be that each "read" is actually cycles 1-72, but 1-36 are the forward read (5' to 3') and 37-72 are the reverse read (5' to 3'). So if you plot the qualities, or the base content (fraction A, T, C, or G), you should see a pattern going from 1-36 and then repeated from 37-72, e.g. mean quality decreasing until cycle 36, then jumping back up before decreasing along the same curve from 37 to 72.

          Sorry if I've run with an impossible theory here, but if I didn't know any better, it's something I'd suspect. (Except for the fact that ~1/3 of your "paired" reads actually result in a decent assembly ... that would seem to shoot my theory down) ...
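          A rough sketch of the per-cycle quality check suggested above (not from the original posters). It assumes Phred+33 (Sanger) quality encoding, which may need adjusting for older Illumina pipelines, and an illustrative input file name; concatenated pairs should show the quality curve dropping towards cycle 36 and then jumping back up at cycle 37.

          Code:
          import matplotlib.pyplot as plt

          def mean_quality_per_cycle(fastq_path, read_len=72, offset=33):
              """Return the mean Phred quality for each cycle (assumes Phred+33)."""
              sums = [0] * read_len
              counts = [0] * read_len
              with open(fastq_path) as fh:
                  for line_no, line in enumerate(fh):
                      if line_no % 4 == 3:                          # quality line
                          for cycle, ch in enumerate(line.rstrip()[:read_len]):
                              sums[cycle] += ord(ch) - offset
                              counts[cycle] += 1
              return [s / c if c else 0.0 for s, c in zip(sums, counts)]

          if __name__ == "__main__":
              means = mean_quality_per_cycle("reads_72bp.fastq")    # hypothetical file
              plt.plot(range(1, len(means) + 1), means)
              plt.xlabel("cycle")
              plt.ylabel("mean Phred quality")
              plt.savefig("quality_per_cycle.png")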



          • #6
            I understand your point. I also thought that it could be because the reads were two 36 bp reads instead of 72 bp reads, so it was one of the first things I checked. The quality drops gradually over the 72 bases, but nothing drastic.



            • #7
              I've noticed that there are enough extra errors at the ends of some of the 'longer' reads to muck up assemblies. Some of the coverage statistics used to remove erroneous edges that worked for shorter reads do not work as well with the longer ones. This is the case with Euler, and likely with Velvet as well.



              • #8
                That coverage thing is an aspect of Velvet that baffles me quite a bit:
                Test run for PhiX, 76 bp PE, 3.9 million read pairs: 1000+ contigs.
                20,000 reads out of the 3.9 million: ~40 contigs; with aggressive parameter tuning: 30, with N50 > 300.
                10,000 reads out of the 3.9 million: 1 contig of the perfect size, with 3 SNPs as BLAST discovers (and without any parameter tuning).

                I'm still wondering why Velvet is so coverage-sensitive.
                -Jonathan



                • #9
                  I have 75 bp paired-end read data for pollen beetle from an Illumina machine. I want to know whether Velvet will perform well for de novo assembly of these reads, and how much memory will be required: I have 8 GB of RAM and my two read files are 1.92 GB each, so Velvet gives a malloc/segmentation error.
                  Secondly, I want to know how we can convert the Velvet output .afg file to an .ace file.
                  Last edited by shahid.manzoor; 06-26-2009, 05:38 AM.



                  • #10
                    With 8 GB of RAM, you might be able to run the assembly for ~2 million reads single-end, or ~1 million read pairs, all using the highest k/hash value of 31 for the initial hashing step.

                    You might want to try ramping it up if that actually works, but I can tell you that 48 GB of RAM is not enough for 9.8 million reads (k = 31), i.e. 4.9 million read pairs (my guess is 55-70 GB for that amount of data).

                    BUT: this is just empirical, and only for my dataset; the internal graph structure of your assembly might be far better structured (by chance, mind you!) and consume less space, or more.

                    Edit:
                    Additionally, depending on the size of your organism, you might actually REALLY WANT to split the data, as Velvet tends to be finicky when it comes to 'deeper' coverage (I have not yet pinned down this soft border; a guess would be ~50x to 100x or more; see the back-of-the-envelope sketch below).

                    Best
                    -Jonathan
                    Last edited by Jonathan; 06-26-2009, 05:52 AM.
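                    As a back-of-the-envelope companion to the "split the data" advice (not from the original posters): the number of reads needed for a given target coverage is roughly target_coverage * genome_size / read_length. A small sketch with purely illustrative example numbers.

                    Code:
                    def reads_for_coverage(target_coverage, genome_size_bp, read_length_bp):
                        """Number of reads needed to reach a target depth of coverage."""
                        return int(target_coverage * genome_size_bp / read_length_bp)

                    if __name__ == "__main__":
                        # illustrative: a 5 Mb bacterial genome, 72 bp reads, aiming for ~50x
                        print(reads_for_coverage(50, 5_000_000, 72))   # ~3.5 million reads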



                    • #11
                      I'm working with shahid.manzoor on the same set of ~11 million 76 bp read pairs. We recently got Velvet working (we just needed to recompile for the correct 64-bit environment ... *facepalm*), and I have completed a few runs with k-mer sizes 55-63. I'm testing with large k-mer sizes mainly so that I can see results within 24 hours, until I get an idea of how to get useful data out of Velvet.
                      Our major problem right now is contig size and coverage. So far the largest contig I've obtained is 204 bp, and most contigs consist of only two overlapping reads.
                      So far I have avoided setting any parameters other than -ins_length 187, which was provided with the Illumina output. Which parameters would you recommend changing in order to get longer contigs?

                      Greetings,
                      Ingemar



                      • #12
                        cov_cutoff is one parameter that turns out to be important.
                        What kind of coverage depth do you expect with all these reads? Having it close to 40x helps with Velvet, in my experience (see the k-mer coverage sketch below).
                        --
                        bioinfosm
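                        Note that Velvet's cov_cutoff is expressed in k-mer coverage rather than nucleotide coverage; the usual conversion is Ck = C * (L - k + 1) / L for read length L and hash length k. A small sketch of that arithmetic (not from the original posters); the factor of 0.5 used to derive a starting cutoff and the example numbers are assumptions, not a recommendation from this thread.

                        Code:
                        def kmer_coverage(nt_coverage, read_length, k):
                            """Convert nucleotide coverage C to k-mer coverage: C * (L - k + 1) / L."""
                            return nt_coverage * (read_length - k + 1) / read_length

                        def suggested_cov_cutoff(nt_coverage, read_length, k, factor=0.5):
                            """A starting point for cov_cutoff: a fraction of the expected k-mer coverage."""
                            return factor * kmer_coverage(nt_coverage, read_length, k)

                        if __name__ == "__main__":
                            # illustrative: ~40x nucleotide coverage, 76 bp reads, hash length 31
                            print(kmer_coverage(40, 76, 31))          # ~24.2x expected k-mer coverage
                            print(suggested_cov_cutoff(40, 76, 31))   # ~12.1 as a first cov_cutoff guess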



                        • #13
                          Thanks for the tip, I'll try changing the coverage cutoff today.
                          Our coverage depth may turn out to be a problem. Assuming a genome size of about 200 Mb (like the model Tribolium castaneum), we have only about 5x coverage. =/ For some reason, the lab that did the sequencing did not provide any estimate of the genome size or coverage, so my assumptions are all I have to go on (a rough coverage calculation is sketched below).

                          EDIT:
                          Sorry everyone, false alarm. It turns out that the sequencing lab bungled the run by gathering data from the wrong lanes ... We have been trying to build a 200 Mb genome using reads from the PhiX control kit.
                          Last edited by ohlsson; 07-17-2009, 12:37 AM.
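                          The coverage estimate is just total sequenced bases divided by genome size. A quick sketch (not from the original posters); the example numbers are generic and illustrative, not the exact figures from this dataset.

                          Code:
                          def estimated_coverage(num_reads, read_length_bp, genome_size_bp):
                              """Depth of coverage = total sequenced bases / genome size."""
                              return num_reads * read_length_bp / genome_size_bp

                          if __name__ == "__main__":
                              # illustrative: 10 million 76 bp reads against a 150 Mb genome
                              print(estimated_coverage(10_000_000, 76, 150_000_000))   # ~5.1x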



                          • #14
                            Originally posted by ohlsson View Post
                            Assuming a genome size of about 200 Mb (like the model Tribolium castaneum), we have only about 5x coverage.
                            5x coverage?

                            Velvet is not going to like that. Velvet wants more like 50x to get nice big contigs.

                            11 million reads on a 5 Mb genome would probably velvet nicely. I think 20 Mb would be pushing it. And if you aren't getting any Velvet contigs ... that's probably why.



                            • #15
                              This thread is making me wonder about my data. I have 9x10^6 reads of 75 bp, and I want the plastid genome. These genomes are quite small (only ~135 kb), and DNA extractions usually contain LOTS of plastid DNA. I would expect to recover the plastid genome in relatively large pieces, but I'm not! The largest contigs that I am recovering are only around 3 kb.

                              Could it be that the coverage is too high and Velvet is having problems with this?
                              What do you all suggest?
                              Should I use the fastx toolkit to make shorter reads?
                              Should I use fewer reads and then combine contigs post-Velvet?
                              Other suggestions?

