  • Assembling bacterial genome de novo

    Hi
    I am assembling a bacterial genome de novo. I have about 40 million Illumina reads, 75bp. These were paired-end, but for reasons I won't go into, the insert sizes were ~150bp, so the pairs overlap.
    I have tried Velvet and MIRA to some extent. My best Velvet run (I used VelvetOptimiser) gave me 10,500 contigs with an N50 of 1,455, longest contig 14,500, total bases in contigs 4.7M, and 1,500 contigs greater than 1kb, which span ~3M of the genome. The expected genome size is ~4.7M.
    Any recommendations on the next step to get to a build with fewer contigs of larger sizes?

    Thanks
    Noa

  • #2
    Those statistics are pretty poor for a typical bacterial genome, so you should be looking to improve on that N50 dramatically.

    Use only a subset of those reads and aim for 30x-100x coverage (you could filter for the highest-quality reads).
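
    For example, with seqtk you could subsample both ends with the same random seed so the pairing stays intact (a minimal sketch; the file names and the 1-million-pair count are placeholders, and 1 million pairs of 75bp reads works out to about 2 x 75 x 1,000,000 / 4,700,000 = 32x coverage):

      seqtk sample -s100 reads_1.fastq 1000000 > sub_1.fastq  # -s sets the random seed
      seqtk sample -s100 reads_2.fastq 1000000 > sub_2.fastq  # the same seed keeps the pairs in sync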

    Too many reads will slow down your experiments and won't improve the assembly. In fact, too many reads can cause the assembly to fragment due to a build-up of erroneous k-mers.

    As your pairs are overlapping, try joining them together with a tool like FLASH
    (http://bioinformatics.oxfordjournals...tr507.abstract) and then assemble these longer merged reads. Try both an OLC assembler like MIRA and a de Bruijn assembler like Velvet; with the longer reads you can increase the k-mer size.
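
    A minimal FLASH run might look like this (a sketch; the file names are placeholders, and -M is raised to 75 because 75bp reads from ~150bp inserts can overlap almost end to end):

      flash reads_1.fastq reads_2.fastq -M 75 -o merged
      # the joined reads end up in merged.extendedFrags.fastq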



    • #3
      Thanks for your help. I know the statistics are poor; happy for any advice, as I am new to all this!

      I have been working on a subset of the reads. I am using VelvetOptimiser, so I scanned all possible k-mers from 19 to 31 for each of 250,000, 1 million, or 2 million reads, and I tried both single-end and paired-end options. The best so far was 2 million paired-end (those are the stats I quoted). Should I use even more reads? I have been trimming the reads from base #5 to 60, which gives good-looking boxplots with everything at a quality score of at least 20. Happy for suggestions on other ways to filter.
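
      For reference, my VelvetOptimiser runs look roughly like this (a sketch; the file name is a placeholder):

        VelvetOptimiser.pl -s 19 -e 31 -f '-shortPaired -fastq shuffled_2m.fastq'  # -s/-e set the k-mer range to scan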

      I had also originally tried to work with reads that had already been joined using SHERA, but I didn't seem to get significantly improved results. I may readdress that now that I am slightly more proficient with Velvet.

      I also tried MIRA yesterday for the first time. With 2 million trimmed reads (55bp) I got 22,000 contigs covering a consensus of 4.8M; the largest contig is 3,200bp and the N50 is 287, which is also not great for a bacterial genome. Happy for advice there too!



      • #4
        I have to say your results are curiously poor. 2 million reads should give you about 30x coverage (2,000,000 reads x 75bp = 150Mbp, and 150Mbp / 4.7Mbp is roughly 32x), which I would expect to give a much better result than you are getting.

        What happens if you just take the first 2 million reads from the (hopefully correctly shuffled: pair /1 followed by /2) file, don't filter at all, and pass them to velveth with -shortPaired and a k-mer of 31, then to velvetg with -exp_cov auto -cov_cutoff auto -ins_length 150?
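
        That is, something along these lines (a sketch; the output directory and file name are placeholders):

          velveth asm_k31 31 -shortPaired -fastq shuffled_2m.fastq
          velvetg asm_k31 -exp_cov auto -cov_cutoff auto -ins_length 150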

        Also are you absolutely sure you have got the correct insert length?



        • #5
          I agree they are much worse results than expected.
          I just rechecked the shuffling and it is /1 and then /2 as expected (I used the shuffleSequences Perl script that comes with Velvet for that).
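
          For the record, the command was along these lines (the file names here are placeholders):

            shuffleSequences_fastq.pl reads_1.fastq reads_2.fastq shuffled.fastq  # interleaves /1 and /2 into one file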

          There are 4 million reads in total (2 million from each end), just to make sure that is clear.

          I just tried with the unfiltered sequences (which are longer, but the quality gets really bad after about 90bp), running as you suggested, but the process got killed. I will try to find a computer with more memory... Is it worth trying with half the sequences?

          Insert size is definitely ~150; I just double-checked with the guy who did the actual experiment, and that is also what VelvetOptimiser reports at the end of the run when I don't give it that info.



          • #6
            I thought your reads were 75bp long? Not sure what you mean by quality dropping off after 90bp.

            Quality may well be the problem. Can you post a FastQC chart?
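
            If you haven't run it yet, something like this will produce the per-base quality chart (a sketch; the output directory is a placeholder and must already exist):

              fastqc reads_1.fastq reads_2.fastq -o qc/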



            • #7
              And yes, try with 1/2 the sequences.



              • #8
                Try k-mers above 31: we got good results from 37 to 47, with -shortPaired and -exp_cov auto. Also try the maximum number of reads your computer can handle; we got the best results with 15+15 million and 20+20 million reads, and fewer was worse.
                Last edited by vtosha; 02-29-2012, 03:56 AM.



                • #9
                  Sorry, here is a boxplot. The original reads were 150bp (144bp after barcode removal); I trimmed them down to 56bp (starting at bp #5 and ending at bp #60).
                  Attached Files
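
                  For anyone curious, that trim window corresponds to something like this with seqtk (a sketch; the file names are placeholders, and the offsets assume the 144bp post-barcode reads):

                    seqtk trimfq -b 4 -e 84 reads.fastq > trimmed.fastq  # drop 4bp from the 5' end and 84bp from the 3' end: 144 - 4 - 84 = 56bp, i.e. bases 5-60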



                  • #10
                    I've tried larger k-mers and I wasn't getting much better results. VelvetOptimiser is such a useful tool, and right now it only works up to 31-mers. (Not sure if I can reinstall it to do higher k-mers? I have to ask the guy who installed it...)
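
                    From what I have read, the 31-mer ceiling comes from how the Velvet binaries themselves were compiled, so lifting it would mean rebuilding Velvet with a larger MAXKMERLENGTH (a sketch, assuming a source install of Velvet):

                      cd velvet/
                      make 'MAXKMERLENGTH=63'  # rebuilds velveth/velvetg to accept k-mers up to 63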



                    • #11
                      We scan them with a Perl script written by our bioinformatician. As I understand it, it is not too difficult for an IT person to write such a script to auto-screen Velvet's parameters.



                      • #12
                        Thanks; we don't have IT here to help, unfortunately. I am a wet-lab biologist learning the computational stuff on my own! (Actually enjoying it, but it is slow going!)
                        Just to make sure: you were doing 20 million x2 reads on what size genome?



                        • #13
                          Nick,
                          I still can't run with half the sequences. I will try to get hold of a stronger computer, but that has to wait until our sysadmin is back tomorrow. Any insights or things I can try in the meanwhile? Thanks so much for your help!



                          • #14
                            I just tried running as a 41-mer with the PE 2 million x2 reads, and I am getting an N50 of 243, max contig 5,556, and a total of 3.6M of genome, so that doesn't look too promising. (Maybe not surprising, since the entire read I am using is 56bp after trimming, so 41 may be too large a k-mer: each 56bp read yields only 56 - 41 + 1 = 16 k-mers, so the k-mer coverage drops sharply.)



                            • #15
                              Some bacterial genomes, near 5.5M, with 72nt paired-end reads; N50 from 100,000 to 170,000. Not the best possible, but our best. For one chloroplast genome, though, we couldn't get any long contigs despite excellent sequencing results, so the problem may lie in the library preparation or the DNA itself.
                              Yes, there may be an optimal k-mer for your own data, and longer k-mers are not necessarily better.
                              Last edited by vtosha; 02-29-2012, 04:19 AM.

