  • First time with Velvet

    Hi,

    I'm new to SEQanswers and to bioinformatics, and I need a little help with Velvet. I keep getting large blocks of Ns in my paired-end (PE) assembly. The Ns go away when I do an assembly with just the single reads.

    Workflow:

    ~30 million PE 100 bp reads
    lowest average Sanger quality score is 34

    filtered using the FASTX-Toolkit: removed reads < 20 bp in length and reads without quality >= 30 over at least 90% of the read

    also had to remove reads that failed the Illumina chastity filter

    removed unpaired reads after filtering and joined the pairs into one file
    now have ~20 million reads (a rough sketch of these steps is below)
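
    A minimal sketch of that filtering, assuming the FASTX-Toolkit with Sanger (offset-33) qualities, CASAVA 1.8-style ':Y:' chastity flags in the headers, and hypothetical file names:

    # keep reads with Q >= 30 over at least 90% of their bases
    $fastq_quality_filter -Q 33 -q 30 -p 90 -i R1.fastq -o R1.qf.fastq
    $fastq_quality_filter -Q 33 -q 30 -p 90 -i R2.fastq -o R2.qf.fastq
    # drop reads whose header carries the failed-chastity flag ':Y:'
    $awk 'NR%4==1{keep=($0 !~ /:Y:/)} keep' R1.qf.fastq > R1.clean.fastq
    $awk 'NR%4==1{keep=($0 !~ /:Y:/)} keep' R2.qf.fastq > R2.clean.fastq
    # (length filtering and re-pairing of orphans omitted); interleave the
    # surviving pairs with the script shipped with Velvet
    $shuffleSequences_fastq.pl R1.clean.fastq R2.clean.fastq reads_interleaved.fastq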

    velveth with k = 31 and -shortPaired
    found the average ins_length using velvetg: 240, sd = 50
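
    Written out in full (directory name and input file are hypothetical), those two steps would look something like:

    $velveth output_31/ 31 -shortPaired -fastq reads_interleaved.fastq
    # a first velvetg pass; with no -ins_length given, the velvetg log reports
    # the observed insert length (here ~240 with sd ~50)
    $velvetg output_31/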

    experimenting with parameters, I found the two genomes of interest to have coverages of roughly 500X and 75X.

    I have been able to find 4-5 contigs per genome that cover nearly the entire length. The problem is that the contigs contain large blocks of Ns. For example:

    one assembly has 3120 contigs > 500 bp in length, but there are 11634 blocks of Ns with an average length of 41 bp

    in the 4 contigs that cover one of the genomes, there are 86 blocks of Ns that average 56 bp in length.

    How do I get rid of these Ns?

    Thanks,

    JT

  • #2
    This is a result of Velvet scaffolding contigs using paired-end information. Set "-scaffolding" to "no" if you don't want it to do that.
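
    For example, reusing the insert length reported above:

    $velvetg output_31/ -ins_length 240 -scaffolding no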

    Comment


    • #3
      okay, thanks.

      when scaffolding is off, I don't get large contigs. I guess I'll have to try changing parameters again.

      Comment


      • #4
        I would also look at changing your k-mer value. For 100 bp reads I use a k-mer value of 57; 31 seems more suited to 50 bp reads.

        Comment


        • #5
          You can tune the exp_cov and cov_cutoff parameters. You can plot the stats.txt file from the Velvet output directory in R and check for the best values of these parameters.
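
          A minimal sketch of that plot (directory name taken from post #6 below; weighting each node's coverage by its length keeps the many short spurious nodes from swamping the peak):

          $Rscript -e '
            s <- read.table("output_57/stats.txt", header = TRUE)
            # replicate each node coverage by node length, then look for peaks:
            # the main peak suggests exp_cov, the trough before it cov_cutoff
            cov <- rep(s$short1_cov, s$lgth)
            png("coverage_density.png")
            plot(density(cov[cov < 1000], na.rm = TRUE))
            dev.off()'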

          Comment


          • #6
            Thanks for the helpful suggestions, I really appreciate it.

            I tried kmer size 57

            then tried keeping everything set to auto

            $velvetg output_57/ -scaffolding no -ins_length 250 -exp_cov auto -cov_cutoff auto

            Final graph has 469034 nodes and n50 of 124, max 2556, total 20289293, using 7937596/38767538 reads

            when I plot the stats in R, I just get a decreasing curve that doesn't appear to have any peaks. Earlier, I found the expected coverage by BLASTing the contigs against my reference genomes and calculating the average coverage of the contigs that hit.

            jt

            Comment


            • #7
              You could also try VelvetOptimiser to help tune settings to your dataset:
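
              A sketch of a run scanning k = 31..61 (file name hypothetical; flag names per VelvetOptimiser's --help, so double-check them for your version):

              # try odd k-mer sizes from 31 to 61 in steps of 2, using 4 threads
              $VelvetOptimiser.pl -s 31 -e 61 -x 2 -t 4 \
                  -f '-shortPaired -fastq reads_interleaved.fastq'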

              Comment


              • #8
                I tried VelvetOptimiser, but it seems to favor low coverage. I have not tried it with my expected coverage specified.

                Comment


                • #9
                  one of the genomes is at 400X, the other is near 150X, I think. When I assemble with exp_cov 400 and cov_cutoff 50, the contig size increases to ~8000 bp.
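
                  For reference, a guess at the full command behind that run (directory and insert length reused from post #6; treat this as a sketch, not the exact command):

                  $velvetg output_57/ -ins_length 250 -exp_cov 400 -cov_cutoff 50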

                  Why won't Velvet put these contigs together better? I tried cutting the amount of data by a factor of 0.25, but it decreased the max contig size and N50.

                  Thanks again.

                  I know what I'll be doing this weekend.

                  jt

                  Comment


                  • #10
                    Why won't Velvet put these contigs together better?
                    I did not understand this statement.

                    Comment


                    • #11
                      Sorry, I meant to ask why the contigs are so small. 8000 bp is not even close to the size of the genome, and 400X coverage implies that there is sufficient data.

                      jt

                      Comment


                      • #12
                        Hi,

                        So, high coverage is good, but it does not imply a single giant contig. Several biological factors have an important impact on genome reconstruction. One very important factor is repetitive elements, and the size of those elements in the sequenced genome.

                        With mate-pair or paired-end libraries you can resolve repeats and assemble more contiguous sequences (scaffolds), but if a repeat is longer than the insert size of your library, the reconstruction will be fragmented.

                        Another factor: high coverage also introduces more errors into your reads, and the assembly process can end up with a lower N50. Each assembler (ABySS, SOAPdenovo, Velvet) has its N50 peak at a different coverage level; once that peak is reached, the contig N50 does not get any bigger and stagnates at a plateau. So coverage is an important factor up to a certain point (roughly 30-50X); beyond that it is useless.

                        Here are two papers that discuss this:

                        http://www.plosone.org/article/info:...l.pone.0008407



                        http://bioinformatics.oxfordjournals...tr319.abstract

                        Regards,

                        André

                        Comment


                        • #13
                          okay, thanks. I'll read the papers carefully.

                          I understand that optimal assembly occurs around 50X.

                          What I'm not entirely clear on yet is the effect of having scaffolding on or off. With scaffolding off, is Velvet basically assembling single reads instead of taking advantage of the pairs? If so, I need it on to get large contigs. But when I get large contigs with scaffolding on, I have lots of Ns in the sequence.

                          I think I may attempt to assemble all the reads as singles, then map the resulting contigs onto the large contigs with N gaps. Maybe the small contigs will cover the N gaps.
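
                          One way to sketch that mapping is with MUMmer's nucmer (my suggestion, not from this thread; file names hypothetical):

                          # align single-read contigs against the gapped scaffolds
                          $nucmer --prefix=gapfill scaffolds.fa single_read_contigs.fa
                          # tabulate hits to see which contigs span the N blocks
                          $show-coords -rcl gapfill.delta > gapfill.coords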

                          jt

                          Comment


                          • #14
                            I would try being very aggressive with quality filtering/trimming. Most accounts I have seen report no improvement beyond 50X coverage, so you should be able to be selective with your input data.
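
                            For example, seqtk (my suggestion, not mentioned in this thread) can downsample to ~50X; at ~400X that is a fraction of 50/400 = 0.125:

                            # the same seed (-s) on both files keeps mates paired
                            $seqtk sample -s100 R1.clean.fastq 0.125 > R1.sub.fastq
                            $seqtk sample -s100 R2.clean.fastq 0.125 > R2.sub.fastq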

                            Comment


                            • #15
                              So with 400X coverage and 20M 100 bp reads, that's 2 Gbp of data, which works out to a genome size of about 5 Mb (2 Gbp / 400). Is that correct?

                              Comment
