  • scaffolding GAII paired-end library with Hiseq mate-pairs

    Hello,
    we are a Belgian research team studying the +/- 4Mbp genome of a bacterial plant pathogen (and newbies in NGS data analysis). We are getting some unexpected results during de novo assembly of our target genome using a combined paired-end and mate-pair library. No good reference genome is available, so de novo assembly is our only option. We would like to share some of our results for your consideration. Maybe some of you can tell us if this is a normal result, or if we are doing something wrong here…
    First off, the data-sets:
    1. One Illumina GA, paired-end short read set (50bp reads, 350Mb, 375bp insert), which gives us a theoretical 70x coverage.
    2. One Illumina Hiseq, Mate-pair short read set (100bp reads, 500Mb, 5kb insert), which gives us a combined 160x coverage.
    When we used the PE-set alone for de novo assembly in CLC-Bio, we get 478 contigs with an N50 of +/-20kbp. When looking at the contigs, we saw repetitive fragments (IS-sequences) were the major cause for the contig break-up. Based on the literature, we thought most of these gaps could be closed if we combined the PE-set with an extra MP-dataset.
    However, if we combine both sets in a de novo assembly in CLC-Bio’s Beta-assembler (plugin in v4.8), we get 493 contigs and an N50 of 22kb.
    When we try to scaffold the 478 contigs of the PE-only assembly with the MP-set in SSPACE, we can reduce them to 63 scaffolds, but the program has to introduce some 300,000 Ns into the sequence (total 4.2 Mb) to accomplish it. DNAStar also has problems with the Illumina 1.9 format from the HiSeq 2000... does anybody have experience using HiSeq data with this software?
    Does anybody here have a clue what we are doing wrong and how we could improve this, or is there a logical explanation why the MP-set is not giving us a better gap closure?
    Thank you for any remarks/suggestions!
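
Since the post compares assemblies by contig count and N50, here is a minimal Python sketch of how N50 is commonly computed from a list of contig lengths. The function name `n50` and the toy lengths are illustrative assumptions, not output from CLC bio or SSPACE.

```python
def n50(lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together cover at least half of the total assembly."""
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy example with hypothetical contig lengths (in bp).
toy_contigs = [100, 80, 50, 30, 20, 20]
print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, a scaffolder can raise it sharply by joining contigs with runs of Ns, so it is worth reporting contig N50 and scaffold N50 separately.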

  • #2
    Hi Steve,

    Just wanted to say that it is important to set the insert size for SSPACE as accurately as possible. There are tools to determine the median/mean insert size and its deviation. One of them is included in the SSPACE premium package. Since you are a BaseClear customer, you can get the SSPACE premium version for free if you do not have it already.

    Furthermore, including the mate pairs in the initial assembly will usually not improve it, and can sometimes even make it worse. You already have enough coverage with your paired-end sequences. The main reason is that mate-pair sequencing gives biased coverage along the genome: some regions are covered more than others. I would suggest not including the mate pairs in the initial assembly. Use them only for scaffolding, together with the paired-end reads used in the initial assembly.

    What might improve the assembly is trimming low-quality nucleotides and removing low-quality reads using CLC bio's trimmer.

    Once you have obtained the scaffolds, you can fill the gaps (Ns) with tools like SOAP's GapCloser from BGI, or IMAGE. We are currently also working on a tool to do this.

    Regards,
    Marten Boetzer
    BaseClear
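
As a rough illustration of the insert-size estimate recommended in this reply, the sketch below collects template lengths from reads mapped back to the contigs and reports the median, mean, and standard deviation. It assumes the alignments are available as SAM text; the function names and the toy records are hypothetical, and this is not the tool shipped with SSPACE premium. For mate-pair data the aligner may not set the properly-paired flag, in which case the filter would need adjusting.

```python
import statistics


def insert_sizes_from_sam(lines):
    """Collect template lengths (SAM column 9, TLEN) from properly
    paired alignments, counting each pair once."""
    sizes = []
    for line in lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        tlen = int(fields[8])
        # 0x2 = properly paired; tlen > 0 keeps only the leftmost mate.
        if flag & 0x2 and tlen > 0:
            sizes.append(tlen)
    return sizes


def summarize(sizes):
    """Return (median, mean, population standard deviation)."""
    return (statistics.median(sizes),
            statistics.mean(sizes),
            statistics.pstdev(sizes))


def _record(name, flag, pos, mate_pos, tlen):
    # Build a minimal 11-column SAM record for the toy example.
    return "\t".join([name, str(flag), "ctg1", str(pos), "60", "50M",
                      "=", str(mate_pos), str(tlen), "*", "*"])


toy_sam = [
    "@HD\tVN:1.0",
    _record("r1", 99, 100, 400, 350), _record("r1", 147, 400, 100, -350),
    _record("r2", 99, 500, 850, 400), _record("r2", 147, 850, 500, -400),
    _record("r3", 99, 900, 1230, 380), _record("r3", 147, 1230, 900, -380),
]
sizes = insert_sizes_from_sam(toy_sam)
print(sizes, summarize(sizes))
```

The median is usually the safer value to pass to a scaffolder, because a few chimeric or mis-mapped pairs can pull the mean far from the true library size.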


    • #3
      You are using mate-pair data to scaffold (join) contigs together rather than actually closing gaps, so what you are seeing is not unusual. The Ns represent gaps whose length is estimated from the mate-pair spacing; most likely these are the repeat sequences that broke up the contigs.

      If you want to attempt to close gaps within those scaffolds, one option is a local assembly approach like GapCloser (part of SOAPdenovo), which I have had good results with. Be aware, though, that you won't be able to close all (or maybe even many) of the gaps this way.

      But in fact I'd try your assembly with Velvet or SOAPdenovo rather than CLC-Bio first off and see if it does a better job.



      • #4
        So, the process of replacing those "N's" between scaffolds with actual sequence using PE and MP data is apparently called "gap closing". Since you are using a commercial software package to do your assembly, etc., you might want to ask the vendor whether it includes a "gap closing" module.

        Otherwise, there are freely available programs for doing gap closing. (At least one is mentioned elsewhere in this thread.)
        --
        Phillip



        • #5
          Originally posted by pmiguel
          So, the process of replacing those "N's" between scaffolds with actual sequence using PE and MP data is apparently called "gap closing".
          That's what I call it anyway. One point about scaffolding (that perhaps is not well recognised) is that you don't usually end up with fewer gaps, just that the gaps become better characterised, e.g. you now know that contig A joins to contig B with a gap of N bases.
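
To see how well characterised the gaps are after scaffolding, one can list the runs of N in each scaffold. The following is a minimal Python sketch that assumes the scaffolds are in plain FASTA format; the function names and the toy sequences are made up for illustration.

```python
import re


def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gap_lengths(sequence):
    """Return the length of every run of N (or n) in a sequence."""
    return [len(m.group()) for m in re.finditer(r"[Nn]+", sequence)]


# Toy scaffolds (hypothetical); a gap may span a line break.
toy_fasta = [
    ">scaffold1",
    "ACGTNNNNNACGTNN",
    "NACGT",
    ">scaffold2",
    "ACGTACGT",
]
gaps = {name: gap_lengths(seq) for name, seq in read_fasta(toy_fasta)}
total_n = sum(sum(lengths) for lengths in gaps.values())
print(gaps, total_n)
```

Running this before and after a gap-closing step gives a simple count of how many gaps were closed and how many Ns remain.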



          • #6
            IMAGE2 gap closing

            Originally posted by boetsie
            Hi Steve,
            Once you have obtained the scaffolds, you can fill the gaps (Ns) with tools like SOAP's GapCloser from BGI, or IMAGE. We are currently also working on a tool to do this.
            BaseClear
            Hi Boetsie,
            We obtained very nice scaffolds using your SSPACE Premium v2 software (up to 937 kb, N50 = 275 kb). I tried using IMAGE2, but there is no 'readme' or 'install' file and I can't find any information that helps me run the software on the example provided with the program (the program runs but does not close the gaps). I tried to contact Jason Tsai but have had no reply so far.
            This is what I did:
            I downloaded the Dec. 2 version (v2.3) from Sourceforge, copied the precompiled binaries to /usr/local/bin and made them executable on a 64-bit Ubuntu 11.10 Linux system. I looked at the script run.sh and saw some variables that have to be declared (such as paths to velvet, ssaha, etc.), but I still do not get any gaps closed by iteration 10 (see output in attachment imagetest.txt).

            # software path
            # this is the path where the IMAGE path is
            # Please change it accordingly
            VELPATH=~/home/sbaeyen/Bio/IMAGE/IMAGE_version2/
            SSAHADIR=~/home/sbaeyen/Bio/IMAGE/IMAGE_version2/
            WALKPATH=~/home/sbaeyen/Bio/IMAGE/IMAGE_version2/
            Then I did:
            cd /home/sbaeyen/Bio/IMAGE/IMAGE_version2/example
            sbaeyen@PXLSEQ:~/Bio/IMAGE/IMAGE_version2/example$ home/sbaeyen/Bio/IMAGE/IMAGE_version2/image.pl -prefix 76bp -iteration 1 -all_iteration 10 -dir_prefix iteration > imagetest.txt
            When I run the 'image_run_summary.pl' script , I get:
            sbaeyen@PXLSEQ:~/Bio/IMAGE/IMAGE_version2/example$ perl /home/sbaeyen/Bio/IMAGE/IMAGE_version2/image_run_summary.pl iteration
            The prefix is : iteration
            iteration Starting_gaps Gap_closed Gap_extend_oneside Gap_extend_bothside
            1 5 0 0 0
            2 5 0 0 0
            3 5 0 0 0
            4 5 0 0 0
            5 5 0 0 0
            6 5 0 0 0
            7 5 0 0 0
            8 5 0 0 0
            9 5 0 0 0
            10 5 0 0 0
            Do you have any clue what I need to change to get this program running, or what I did wrong?
            Best regards and thanks (again) for any advice!
            Steve
            PS: if you want, I can send you the program output imagetest.txt



            • #7
              Hi Steve,

              I've tried to run IMAGE too, but did not succeed. The input is very complex and I even had to change the code to get it running, and even then it did not close any gaps. I've asked one of the authors but did not get a reply. I would go for GapCloser from SOAP, which is very good but does not include the remaining gaps and seems to join repeated regions. We have finished our own tool but are still working on the publication; it will be released after that.

              Regards,
              Boetsie



              • #8
                Hi Boetsie,
                thanks for the advice to use SOAP's GapCloser! Using the PE reads, I was able to close 161 of the 400 gaps (runs of Ns) in the scaffolds. Do you think the performance of GapFiller would be even better?
                Regards,
                Steve



                • #9
                  stevebaeyen: I recently started to use IMAGE2. It seems to work well for me. At least it finished the example correctly and closed gaps. But when it came to my own genome, it extended the contig ends but did not close very many gaps. Maybe the problem is with the data, not IMAGE2...

                  Did you make sure that velveth, velvetg, smalt, etc. are in your PATH?
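
One quick way to check the point about PATH is to ask which of the helper programs can actually be found. The sketch below is a small Python check, not part of IMAGE; the tool list is taken from this post and may differ for other IMAGE versions. It also prints how a path written as `~/home/sbaeyen/...` (as in the run.sh settings quoted earlier in the thread) expands: assuming run.sh is a bash or sh script, the tilde is replaced by the home directory, so the result may not be the intended `/home/sbaeyen/Bio/...` location.

```python
import os
import shutil


def missing_tools(tools):
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Tool names as listed in the post above; adjust to your installation.
required = ["velveth", "velvetg", "smalt"]
print("missing:", missing_tools(required))

# "~" expands to the home directory, so "~/home/<user>/..." ends up
# below the home directory rather than at "/home/<user>/...".
configured = "~/home/sbaeyen/Bio/IMAGE/IMAGE_version2/"
expanded = os.path.expanduser(configured)
print(expanded, "exists:", os.path.isdir(expanded))
```

If the expanded directory does not exist, writing the variables with an absolute path (or with `~/Bio/...`) would be the first thing to try.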



                  • #10
                    Has anyone ever succeeded in running IMAGE on their own data?

                    I want to run it with my scaffolds, but I'm having trouble creating the input files IMAGE requires. Does anyone have a script to generate these files automatically from the original scaffolds?

                    Regards,
                    Boetsie



                    • #11
                      Hi,
                      I had similar NGS data on a 3 Mbp bacterial pathogen, PE and MP Illumina data, and at least with an older version of the CLC GW assembler I also got much worse results with the combined assembly, even though I thought adding the MP data would significantly reduce the number of contigs. I have not tried combining these with CLC's new beta assembler, although it works better on my PE Illumina data alone. Have you tried that?
                      The solution for us was to use Velvet on both datasets and that brought our number of contigs down from approx. 70 to something like 15. These were verified by optical mapping, and we only saw one major error in these Velvet contigs... perhaps it is worth a try?



                      • #12
                        Hi, I tried a de novo assembly of the PE+MP datasets with the new CLC scaffolder but didn't get a huge improvement compared to the PE dataset alone. A successful scaffolding with a roughly 70% reduction was performed with SSPACE Premium v2, and gaps were closed with SOAP GapCloser. Thanks for the Velvet tip, I'll give it a try! Do you have a good reference concerning optical mapping?



                        • #13
                          My pleasure!
                          and yes I had a very good reference..



                          • #14
                            And can you give me a link to a review article about optical mapping?



                            • #15
                              How to close the CLC-bio contigs according to the reference genome sequence?

                              Hi, I used the CLC-bio de novo assembler on MiSeq 150 bp PE data and obtained 150 contigs. I also have the 170 kb reference genome sequence. I tried to use IMAGE to close the gaps, but it did not work for me. Can anyone suggest how to close the gaps? Thank you very much.
