Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Genome assembly -hiseq illumina data

    Hi,

    I'm new to genome assembly, and am having problems.

    I will be grateful for tips on how to assemble a bacterial genome.

    I have 100bp hiseq illumina data. Genome size is ~4MB.

    I have been using Dynamic Trim, then LengthSort, then velvet or soapdenovo.

    The problem is that after running velvet, I get an N50 of 1Mb , but the total size of the genome come to 30MB, over 6 times more than it should.

    A friend suggested using a subsection of the data I have (cause I have 350x coverage for one of my genomes and 3048x coverage for another). I've even tried that but the total size of the genome is still more than it should be.

    Any help would be greatly appreciated!!

  • #2
    Getting a good assembly can be surprisingly challenging, even for bacteria. Downsampling your data is probably a good thing to keep compute requirements manageable but may cause you to lose coverage in regions of extreme GC content, resulting in a more fragmented assembly.

    Our group has had good luck with IDBA, and we have cobbled together an assembly pipeline called A5 that you could try. The A5 pipeline is supposed to be able to do bacterial-size assemblies directly from sequencer output by automating the adapter trimming, quality filtering, error correction, contigging, scaffolding, and some assembly QC. See here for more info: http://code.google.com/p/ngopt/

    It's still very much beta, and we could really use some feedback on whether it's working outside our lab. In our hands it's generating pretty good assemblies, as measured by comparison to a reference genome. On your data you would almost certainly have to downsample to get it to work with the A5 pipeline, unless you have some big iron to run it on.

    Comment


    • #3
      It seems to me that your depth is too high. Since errors accumulate, the more depth means more chances of introducing errors, hence breaking the assembly. I would suggest try lowing the depth and see if there is any improvements.

      Comment


      • #4
        Thank you Koadman and vamosia. Yes, I'll reduce the dept and try again.

        @Koadman: I've downloaded your A5 pipe line. Thanks. I'll keep you'll updated with the result.

        Bgansw

        Comment


        • #5
          Hi Koadman,

          I've installed the A5 pipeline in my bin folder n have tried to install it.

          The a5_pipeline.pl is blinking, and its highlighted in red n not working.

          How do I rectify this?

          Comment


          • #6
            blinking and highlighted in red? I don't know where to begin.

            Are you using a mac? linux? is it blinking in your terminal window when you run `ls`? Or in some GUI file manager?

            To start, it's not necessary to move any of the files in the ngopt pipeline, it can and should be run in place. If you move everything in a5's bin to a bin directory somewhere else (but still in your $PATH) it probably will run but will not be able to find the adapters fasta file used for adapter trimming. We should probably change this to look for adapters in $prefix/share/a5 to be more unixy.

            To be specific, you should only have to do the following to get an assembly (instructions for linux):

            Code:
            wget http://ngopt.googlecode.com/files/ngopt_a5pipeline_linux-x64_20111128.tar.gz
            tar xvzf ngopt_a5pipeline_linux-x64_20111128.tar.gz
            ngopt_a5pipeline_linux-x64/bin/a5_pipeline.pl <read 1 fastq> <read 2 fastq> <my_assembly>
            where <read 1 fastq> and <read 2 fastq> are the filesystem paths to your paired read fastq files (one file per read in a pair, do not include the <>! it is just a convention to denote a required argument) and my_assembly is the base name or path you want to give your assembly files.

            Instructions for mac (which doesn't ship with wget's holy awesomeness):

            Code:
            curl http://ngopt.googlecode.com/files/ngopt_a5pipeline_macOS-x64_20111128.tar.gz > ngopt_a5pipeline_macOS-x64_20111128.tar.gz
            tar xvzf ngopt_a5pipeline_macOS-x64_20111128.tar.gz
            ngopt_a5pipeline_macOS-x64/bin/a5_pipeline.pl <read 1 fastq> <read 2 fastq> my_assembly

            Comment


            • #7
              Hi Koadman,

              The A5 pipeline is running on our cluster right now. So far all good!

              I'll keep you updated with the results. Hopefully, it'll all run smoothly..

              Thanx a ton.

              Comment


              • #8
                Hi Koadman,

                Am getting this error message, with the pipeline. I suspect that its because the new Hiseq data has been processed by casava 1.8, and generates reads in sanger format (also known as illumina 1.9 format). Most of our scripts were designed for reads in illumina 1.3 format (cause we got the 3 genomes sequenced using the GA2 and not the hiseq).

                I had to convert the files to illumina 1.3 to get the pipeline to work on them. Have you used the A5_pipeline on Hiseq data before? If yes, do you know what format your hiseq data is in?

                This is the error I'm getting. I've only pasted the last bit of the file. I've attached the file, "nohup.out" as well, so you can see what ran and where it failed. I would really really appreciate your help.


                nohup.out
                [bgansw@bioinfo ngopt_a5pipeline_linux-x64]$ head nohup.out
                [a5] Found the following libraries:
                raw1:
                id=raw1
                p1=8RS_CAGATC_L005_R1_illuminaformat_sed.fastq
                p2=8RS_CAGATC_L005_R2_illumina_sed.fastq
                [a5] Cleaning reads with SGA
                148311540
                [a5] Found 1 libraries
                [a5] Starting pipeline at step 1
                [a5] Cleaning reads with SGA
                [bgansw@bioinfo ngopt_a5pipeline_linux-x64]$ tail nohup.out
                at java.security.SecureClassLoader.defineClass(libgcj.so.7rh)
                at java.net.URLClassLoader.findClass(libgcj.so.7rh)
                at java.lang.ClassLoader.loadClass(libgcj.so.7rh)
                at java.lang.ClassLoader.loadClass(libgcj.so.7rh)
                at org.halophiles.assembly.InsertSizeExporter.main(InsertSizeExporter.java:36)
                Use of uninitialized value in numeric gt (>) at ./bin/a5_pipeline.pl line 1205, <IN> line 55454.
                Use of uninitialized value in concatenation (.) or string at ./bin/a5_pipeline.pl line 1211, <IN> line 55454.
                [a5] Discarding estimate. Not enough data points:
                Use of uninitialized value in multiplication (*) at ./bin/a5_pipeline.pl line 1212, <IN> line 55454.
                [a5_preproc] Insert estimate for raw1 not reliable.[a5_preproc] No insert estimate for raw1. Please provide one in a library file. Exiting
                [bgansw@bioinfo ngopt_a5pipeline_linux-x64]$
                Attached Files

                Comment


                • #9
                  Thanks for sending the nohup.out, really helpful!
                  Reading through it I can see that your input files seem to be lacking the canonical /1 and /2 paired read identifiers in their names. a5 tries to add them since bwa insists on their presence when generating a paired-end sam file. a bit further down we see:

                  Warning, FASTQ quality string is not the same length as the sequence string for read @HWI-ST705:113:C026NACXX:5:1208:5169:188022/2

                  My guess is that a5's perl code to add the /1 and /2 identifiers is broken, and added a /2 to that read's quality line. The problem will probably go away if you are able to add the /1 and /2 identifiers to the reads yourself and rerun. In the meantime will go fix this part of a5.

                  Comment


                  • #10
                    Originally posted by koadman View Post
                    Thanks for sending the nohup.out, really helpful!
                    Reading through it I can see that your input files seem to be lacking the canonical /1 and /2 paired read identifiers in their names. a5 tries to add them since bwa insists on their presence when generating a paired-end sam file.
                    Umm, that's not true at all. I run bwa all the time, and my reads don't have read1 or read2 identifiers on them at all. In fact, some programs, like Picard's MarkDupliactes won't realize that the two reads are paired together unless their names are identical.

                    bwa just assumes that the first read of fastq1 is paried with the first read of fastq2.

                    Comment


                    • #11
                      Indeed you are correct! I must have been distracted or written that too hastily. It's not BWA that has that requirement, but rather SGA which shares a little bit of algorithm with BWA. My deepest apologies for the confusion. SGA's preprocess step, at least the version of the code embedded in the a5 pipeline, when run in paired-end mode, insists on having identifiers of the form /1 /2 if any are used. And lo, I just discovered it will happily run without any identifiers. So the little code fragment which mangled your file by trying to add an identifier does appear to be unnecessary. Deleting code is a win in my book, thank you.

                      I've posted a revised a5_pipeline.pl to the source code repository. You should be able to drop this in place of the existing a5_pipeline.pl and hopefully there won't be any more hiccups on your data.

                      Comment


                      • #12
                        How do you calculate the 30MB? In every assembly you could have excess contigs that you don't want to use. If the genome sequence is similar to an already published genome, you can use nucmer or promer to pick out big contigs that map to your genome. If they are much different, then I use a combination of binning my GC content, coverage, and tblastx hits.

                        jt

                        Comment


                        • #13
                          How would you deal with gene transfer in these instances? Horizontally acquired genes would have a different GC content, which would be eliminated using GC and blast binning.

                          Comment


                          • #14
                            I do inclusive searches. First take all contigs with similar coverage, then add all contigs with close GC, then add contigs that hit by blast, ect.

                            Although I am also new at this and still learning, my goal is to not exclude contigs. However, I work with samples that have multiple bacterial genomes and a large amount of non-bacterial sequences.

                            Does this sound reasonable? As I mentioned, I am still learning.

                            Comment


                            • #15
                              well, the largest currently known bacterial genomes are around 10Mbp, and most seem to be under 5Mbp so there's a good chance that 30Mbp is way too large. One simple method to eliminate material not part of your isolate is a coverage-based filter on contigs. For contigs that are sufficiently large, the variance in coverage due to library PCR bias and bridge PCR bias can be averaged out, and sequences that are part of the isolate will show much higher average coverage than contaminant or noisy contigs. This is the approach employed by IDBA and many other tools (the a5 pipeline includes this). I have seen this approach fail to produce the desired isolate genome in the past, and every time it turned out the lab had accidentally cocultured one or more other organisms with the target isolate.

                              I would be extremely wary of separating on the basis of GC (or higher order k-mer) content since as the previous poster pointed out, lateral transfer can introduce regions with unusual GC and you don't want to lose these!

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              18 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              17 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              48 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X