  • Very large contig files using Velvet

    Hi,

    I have just begun assembling some genomes with Velvet using a range of k-mer lengths (steps of 10 across k-mer lengths of 25-179). For all of these I am getting very large contigs.fa files, mostly between 0.5 and 50 gigs.

    The assemblies typically have an N50 of between 1000 and 5000 and a total sequence length of 30-50e6 bp.

    Here is an example of the command I am using to perform the assembly.

    "velveth assembled155_qual 155 -fmtAuto -create_binary -shortPaired -separate FS10.raw.1.fastq FS10.raw.2.fastq
    velvetg assembled155_qual -cov_cutoff auto -exp_cov auto -ins_length auto -read_trkg yes -amos_file yes"

    Each of the input files is approximately 1 gig.



    Whilst the total is quite large, I am not sure it should really equate to a 50 gig .fa file...

    Sorry if I have left out any info. Any help on this would be great!

    Cheers
    Last edited by Dagga; 02-13-2014, 09:37 PM.

  • #2
    Are you sure it's the contigs.fa file that is so large?

    Usually I find it's the .afg file that is very large.

    What is the size of the genome you're trying to assemble?

    You could reduce the size of the contigs.fa file by setting the -min_contig_lgth parameter so as to remove very short contigs.
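If the assemblies have already been run, a similar effect can be approximated afterwards by filtering contigs.fa on sequence length. This is only a sketch under assumptions: `filter_min_len` is a hypothetical helper (not part of Velvet), it expects an ordinary FASTA file, and it reads the file twice so that very long contigs are never held in memory.

```shell
# filter_min_len MINLEN FILE
# Hypothetical helper: print only the FASTA records whose total sequence
# length is at least MINLEN. The file is passed to awk twice on purpose.
filter_min_len() {
  awk -v min="$1" '
    NR == FNR {                      # first pass: measure every record
      if (/^>/) { id++ }
      else { len[id] += length($0) }
      next
    }
    /^>/ { rec++; keep = (len[rec] >= min) }   # second pass: decide per record
    keep                                       # print lines of kept records
  ' "$2" "$2"
}
```

For example, `filter_min_len 500 contigs.fa > contigs.min500.fa` would keep contigs of 500 bp or more; the 500 bp cutoff is an arbitrary illustration, not a recommendation. Note that this will not remove the very large contigs, only the very short ones.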

    Are you using a very new version of velvet? I am not familiar with the 'auto' setting for -ins_length.



    • #3
      Thanks!

      Yep! I am sure they are in the gigabyte range; they vary from 20-50 gigs depending on the k-mer setting of Velvet. As far as I know, it is the latest version of Velvet (v 1.2.10).

      All of the afg files are around 2.5gigs.

      There is only one genome of this species that has been sequenced and it is approx 12Mbp and I am expecting a genome length of about 8-12Mbp. However, I am sure there was some contamination in the sample so I am expecting a larger assembly.

      Thanks for the min length comment! I'll give that a try now and see if that helps.

      Cheers



      • #4
        Is it repetitive? Polyploid? Have you done a dotplot of your organism against the other related reference?



        • #5
          I don't think it is repetitive, and it is a bacterial genome so it should be relatively simple. I haven't done a dotplot, but I think there would be a few differences from the other genome, so I don't think it would be close enough to use as a reference.



          • #6
            I think I have found the reason why the contigs.fa files are so large - but I am still not sure how to fix it.

            I managed to open one of the smaller contigs.fa files (0.5gig) and have found there are several contigs with very large (30-40Mbp) spans of N's. This has happened for several different contigs and therefore I think this is why the files are so large.

            My question now is - does anyone know why this is happening and how I can fix it?

            I know about the -scaffolding no option, which will completely eliminate N's, but I think this is a bit drastic, as a few N's joining contigs are OK.

            Cheers!



            • #7
              Can you find out what fraction of the characters are N's in your files and what fraction are valid bases?

              Valid bases:

              Code:
              $ grep -v '^>' test.fa | tr -dc 'ACGTacgt' | wc -c
              The following should tell you how many N's there are:

              Code:
              $ grep -v '^>' test.fa | tr -dc 'Nn' | wc -c
              If the N's are outnumbering valid bases then perhaps the assembly is not right.
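The whole-file counts can hide the fact that only a few contigs are affected, so a per-contig breakdown may be more informative. A minimal sketch, assuming a plain FASTA file; `n_fraction` is a hypothetical helper name and the tab-separated output layout (name, length, N count, N fraction) is just one possible choice.

```shell
# n_fraction FILE
# Hypothetical helper: for each FASTA record print
#   name <TAB> length <TAB> number of Ns <TAB> fraction of Ns
n_fraction() {
  awk '
    function report() {
      if (name != "")
        printf "%s\t%d\t%d\t%.3f\n", name, len, n, (len ? n / len : 0)
    }
    /^>/ { report(); name = substr($0, 2); len = 0; n = 0; next }
    {
      len += length($0)
      n += gsub(/[Nn]/, "")   # gsub returns the number of replacements made
    }
    END { report() }
  ' "$1"
}
```

Something like `n_fraction contigs.fa | sort -t "$(printf '\t')" -k4,4nr | head` would then list the contigs with the highest proportion of N's first.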



              • #8
                Mauve is an excellent tool to try to visualize genomes against each other. Pick the closest species available and try your assembly against it. I have a feeling that if you have too many N's this would not work.



                • #9
                  I have a feeling the insert length could be an issue. We used a Nextera prep and, according to the sequencing centre, this has a variable insert length. I am attempting to reassemble with a fixed velvetg insert length to see if this helps things.

                  I will get back to you about the other questions asap



                  • #10
                    And it's > 90% N's for about 15 contigs. These are the very large contigs with a length of 15-50 Mbp...

                    The other contigs seem normal.



                    • #11
                      Put the "normal" contigs in a file and give mauve a try using a closely related species. That will give you some idea about the quality of the assembly.

                      Those large contigs with N's will hopefully be resolved with newer velvet runs.
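One way to put the "normal" contigs in their own file is to split contigs.fa on the fraction of N's per contig. This is a rough sketch, not a standard tool: `split_by_n` and its argument order are made up for illustration, and the cutoff is whatever you choose (with the N-rich contigs at over 90% N's, a value such as 0.5 should separate them, but that is an assumption about your data).

```shell
# split_by_n FILE MAXFRAC KEEP_OUT DROP_OUT
# Hypothetical helper: records whose N fraction is <= MAXFRAC are written
# to KEEP_OUT, all others to DROP_OUT. The file is read twice on purpose.
split_by_n() {
  awk -v max="$2" -v keep_out="$3" -v drop_out="$4" '
    NR == FNR {                       # first pass: count bases and Ns per record
      if (/^>/) { id++ }
      else { len[id] += length($0); n[id] += gsub(/[Nn]/, "") }
      next
    }
    /^>/ {                            # second pass: choose the output file
      rec++
      frac = (len[rec] > 0) ? n[rec] / len[rec] : 0
      out = (frac <= max) ? keep_out : drop_out
    }
    out != "" { print > out }
  ' "$1" "$1"
}
```

For example, `split_by_n contigs.fa 0.5 normal_contigs.fa n_rich_contigs.fa` would leave the contigs to load into Mauve in normal_contigs.fa.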



                      • #12
                        Great!

                        Thanks GenoMax, I'll give that a try.

