  • Is SSPACE good for Abyss assemblies?

    Has anyone used SSPACE to scaffold Abyss data? Abyss already produces a .adj and a .dot file which might be as good as the scaffold is going to get.

    Opinions?

    --
    Phillip

  • #2
    Originally posted by pmiguel View Post
    Has anyone used SSPACE to scaffold Abyss data? Abyss already produces a .adj and a .dot file which might be as good as the scaffold is going to get.

    Opinions?

    --
    Phillip

    Ray (since v1.4.0) now includes a scaffolder (it is pretty good).

    See http://denovoassembler.sourceforge.net/ (open source and well-documented !)

    p.s.: I am the author of Ray (I am a PhD student).

    • #3
      Originally posted by pmiguel View Post
      Has anyone used SSPACE to scaffold Abyss data? Abyss already produces a .adj and a .dot file which might be as good as the scaffold is going to get.

      Opinions?

      --
      Phillip
      I've only tested ABYSS contigs myself for the E.coli dataset, and there it gave some very good results. I do recommend filtering out small contigs (e.g. keeping only contigs larger than 100 or 200 bp), since smaller contigs are likely to be repeats or misassemblies.

      For E.coli, scaffolding contigs with a minimum length of 100 bp reduced 595 contigs to 127 scaffolds. In addition, the N50 went from 18k to 94k. I've tested these scaffolds with MUMmer and all were valid.
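
A minimal sketch of the length filter recommended above, assuming the contigs are in a plain FASTA file. The function names `read_fasta` and `filter_contigs` are hypothetical helpers for illustration and are not part of SSPACE or ABySS.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(lines: Iterable[str], min_len: int = 200) -> List[Tuple[str, str]]:
    """Keep only contigs whose sequence is at least min_len bases long."""
    return [(h, s) for h, s in read_fasta(lines) if len(s) >= min_len]
```

To use it on a file, something like `filter_contigs(open("contigs.fa"), min_len=200)` should work, where `contigs.fa` is a placeholder file name.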

      I must say, I am the developer of SSPACE, so I'm a bit biased.

      Another post I found about ABYSS and SSPACE:



      Kind regards,
      Boetsie

      • #4
        Originally posted by seb567 View Post
        Ray (since v1.4.0) now includes a scaffolder (it is pretty good).

        See http://denovoassembler.sourceforge.net/ (open source and well-documented !)

        p.s.: I am the author of Ray (I am a PhD student).

        Hi Seb567,
        We did try Ray. Maybe we did not configure the Ray assembly correctly, but our Abyss results looked much better. For instance the following command:
        /programs/Ray-1.4.0/code/Ray \
        -k \
        43 \
        -i \
        ../FastQ/000617_TL3360_both.fastq \
        -o \
        000617_TL3360

        produced ~3400 contigs ranging from 130 bp to 8.6 kb, whereas Abyss produced 137 contigs ranging from 41 to 450165 bp using a similar kmer size (41).

        These were 2x100 bp reads from ~350bp fragment PEs -- about 200x coverage. The DNA was from the bacterium Salmonella.

        --
        Phillip

        • #5
          Originally posted by boetsie View Post
          I've only tested ABYSS contigs myself for the E.coli dataset, and there it gave some very good results. I do recommend filtering out small contigs (e.g. keeping only contigs larger than 100 or 200 bp), since smaller contigs are likely to be repeats or misassemblies.

          For E.coli, scaffolding contigs with a minimum length of 100 bp reduced 595 contigs to 127 scaffolds. In addition, the N50 went from 18k to 94k. I've tested these scaffolds with MUMmer and all were valid.

          I must say, I am the developer of SSPACE, so I'm a bit biased.
          [...]

          Kind regards,
          Boetsie
          Hi Boetsie,

          Yes, I should try it.

          After Abyss alone, our N50 for contigs >200 bases is already 17.5kb. (77 contigs, range 214-389830 bases, mean 58691 bases.) This was with setting the kmer higher (63) than the example I gave in the post above.

          I will post here the results after SSPACE.

          --
          Phillip

          • #6
            Hi Boetsie,
            Okay I ran SSPACE. Only one mysterious glitch in getting it to run (described below). I filtered my contigs by removing any shorter than 200 bases prior to running. Here are the initial and final results:

            Inserted contig file;
            Total number of contigs = 77
            Sum (bp) = 5456937
            Max contig size = 389830
            Min contig size = 214
            Average contig size = 70869
            N50 = 225952

            After extension;
            Total number of contigs = 77
            Sum (bp) = 5456953
            Max contig size = 389830
            Min contig size = 222
            Average contig size = 70869
            N50 = 225952

            After scaffolding lib1:
            Total number of scaffolds = 69
            Sum (bp) = 5457073
            Max scaffold size = 389830
            Min scaffold size = 680
            Average scaffold size = 79088
            N50 = 226679

            Overall, an increase of >10% in the average scaffold length over the initial contigs. Not bad! Actually I think I am likely coming up against a hard limit imposed by our library insert size.
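
For anyone wanting to check figures like these independently, here is a minimal sketch of an N50 calculation and of the percent increase in average length. It assumes the common definition of N50 (the length of the contig at which the running total of lengths, sorted longest first, reaches half of the assembly size); tools can differ slightly in how they handle ties and rounding, so this is not necessarily how SSPACE computes it.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths.

    Lengths are sorted from longest to shortest and summed until the
    running total reaches at least half of the total assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def percent_increase(before, after):
    """Return the relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Average sizes reported above: 70869 (contigs) and 79088 (scaffolds).
print(round(percent_increase(70869, 79088), 1))
```

With the averages reported above, the increase works out to roughly 11.6%.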

            Also it ran fast -- just a minute or two with -x 1 set.

            I did have one problem getting it to run. It took me about 30 minutes with the perl debugger to track down the issue. So I'll describe it and the simple solution for anyone googling the warning SSPACE gave. The warning was:

            Bowtie-build error; -1 at /bin/SSPACE/SSPACE-1.1_linux-x86_64/bin/mapWithBowtie.pl line 37.
            WARNING: No scaffolding, because no reads found on contigs


            This turned out to be because mapWithBowtie.pl was getting a permissions error when it attempted to run bowtie-build via a system call. So

            chmod +x /bin/SSPACE/SSPACE-1.1_linux-x86_64/bowtie/bow*

            fixed the issue. That is, the programs in the bowtie subdirectory needed to be given execute permission.
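
The same fix can be sketched in Python for anyone scripting an install. This is only an illustration: `make_executable` is a hypothetical helper, and it adds the execute bit for user, group, and other, which is roughly what `chmod +x` does under a typical umask.

```python
import stat
from pathlib import Path


def make_executable(directory, prefix="bow"):
    """Add execute permission to files in `directory` whose names start with `prefix`.

    Returns the names of the files whose mode was changed.
    """
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    changed = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file() or not path.name.startswith(prefix):
            continue
        mode = path.stat().st_mode
        # Only touch files that are missing at least one execute bit.
        if (mode & exec_bits) != exec_bits:
            path.chmod(mode | exec_bits)
            changed.append(path.name)
    return changed
```

For the install described above, the call would be `make_executable("/bin/SSPACE/SSPACE-1.1_linux-x86_64/bowtie")`.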

            --
            Phillip

            • #7
              Hi Phillip,

              Your results look OK; <70 contigs with only one paired-end library of 200bp is very good. I think there is not much more to gain from this library. The remaining contigs are probably repeats (especially the small contigs) or contigs/scaffolds that could not be combined with each other because the library insert size is too small.

              For example with E.coli we went from 127 to 89 scaffolds with a paired-end 500, and then to 9 scaffolds with a mate pair 2kb.

              I'm aware of the permissions problem, and I thought I had fixed it, but apparently I had not. The next release will hopefully not contain this error. Thanks for mentioning it!

              regards,
              Boetsie

              • #8
                Hi Boetsie,
                Actually the new TruSeq DNA library protocol recommends fragmenting DNA to a mean length of 300-400 bases for genomic DNA. Since our resulting sequence was at or above specifications for the instrument, I think the larger insert sizes are the way to go by default.
                Thanks for the info about the effect of mate end (ME) reads. I did not have any for this bacterium. We do have some for a fungal genome we assembled. But they are 454 MEs. We are giving those a shot.

                --
                Phillip

                • #9
                  Originally posted by pmiguel View Post
                  Hi Seb567,
                  We did try Ray. Maybe we did not configure the Ray assembly correctly, but our Abyss results looked much better. For instance the following command:
                  /programs/Ray-1.4.0/code/Ray \
                  -k \
                  43 \
                  -i \
                  ../FastQ/000617_TL3360_both.fastq \
                  -o \
                  000617_TL3360

                  produced ~3400 contigs ranging from 130 bp to 8.6 kb, whereas Abyss produced 137 contigs ranging from 41 to 450165 bp using a similar kmer size (41).

                  These were 2x100 bp reads from ~350bp fragment PEs -- about 200x coverage. The DNA was from the bacterium Salmonella.

                  --
                  Phillip
                  What is the content of these files:

                  000617_TL3360.CoverageDistributionAnalysis.txt
                  000617_TL3360.LibraryStatistics.txt

                  Thank you.

                  • #10
                    Originally posted by seb567 View Post
                    What is the content of these files:
                    000617_TL3360.CoverageDistributionAnalysis.txt
                    MinimumCoverage: 46
                    PeakCoverage: 159
                    RepeatCoverage: 160
                    Percentage of vertices with coverage 1: 87.6321%
                    DistributionFile: 000617_TL3360.CoverageDistribution.txt


                    Originally posted by seb567 View Post
                    000617_TL3360.LibraryStatistics.txt
                    File: ../FastQ/000617_TL3360_both.fastq
                    NumberOfSequences: 13001302

                    Total: 13001302

                    NumberOfPairedLibraries: 1

                    LibraryNumber: 0
                    InputFormat: Interleaved,Paired
                    DetectionType: Automatic
                    File: ../FastQ/000617_TL3360_both.fastq
                    NumberOfSequences: 13001302
                    AverageOuterDistance: 385
                    StandardDeviation: 628
                    DetectionFailure: Yes

                    --
                    Phillip

                    • #11
                      Originally posted by pmiguel View Post
                      MinimumCoverage: 46
                      PeakCoverage: 159
                      RepeatCoverage: 160
                      Percentage of vertices with coverage 1: 87.6321%
                      DistributionFile: 000617_TL3360.CoverageDistribution.txt




                      File: ../FastQ/000617_TL3360_both.fastq
                      NumberOfSequences: 13001302

                      Total: 13001302

                      NumberOfPairedLibraries: 1

                      LibraryNumber: 0
                      InputFormat: Interleaved,Paired
                      DetectionType: Automatic
                      File: ../FastQ/000617_TL3360_both.fastq
                      NumberOfSequences: 13001302
                      AverageOuterDistance: 385
                      StandardDeviation: 628
                      DetectionFailure: Yes

                      --
                      Phillip

                      The CoverageDistributionAnalysis.txt file points to a bad detection of the repeat coverage, so nothing will work correctly for sure after that.

                      MinimumCoverage: 46
                      PeakCoverage: 159
                      RepeatCoverage: 160 <----

                      Can you put the content of 000617_TL3360.CoverageDistribution.txt on http://pastebin.com/ and link it here ?

                      • #12
                        Originally posted by seb567 View Post
                        The CoverageDistributionAnalysis.txt file points to a bad detection of the repeat coverage, so nothing will work correctly for sure after that.

                        MinimumCoverage: 46
                        PeakCoverage: 159
                        RepeatCoverage: 160 <----

                        Can you put the content of 000617_TL3360.CoverageDistribution.txt on http://pastebin.com/ and link it here ?


                        Thanks
                        --
                        Phillip

                        • #13
                          Originally posted by pmiguel View Post
                          OK, problem solved.

                          This is your coverage distribution:


                          However, it confuses Ray because it is going up and down near the inflection point:

                          142 1002
                          143 2012
                          144 432
                          145 1032
                          146 1098
                          147 1088
                          148 1166
                          149 1454
                          150 778
                          151 1122
                          152 1146
                          153 -720
                          154 424
                          155 192
                          156 -64
                          157 552
                          158 418
                          159 -406 Peak Coverage
                          160 164
                          161 -826
                          162 -434
                          163 -190
                          164 -124
                          165 26
                          166 1014
                          167 -1100
                          168 -562
                          169 -1376
                          170 -1288
                          171 -336
                          172 -984
                          173 -500
                          174 -1064

                          I added data smoothing and it fixes the problem.

                          File= /home/boiseb01/coverage-pmiguel
                          MinCoverage= 45
                          PeakCoverage= 158
                          RepeatCoverage= 290
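
To illustrate the general idea of smoothing, here is a minimal sketch using a centered moving average. This is not Ray's actual code, and the window size is an arbitrary assumption. The second column of the values quoted above takes negative values, so it looks like a slope rather than a raw count (that reading is also an assumption); the sketch counts how often such a series flips sign before and after smoothing.

```python
def moving_average(values, window=5):
    """Smooth values with a centered moving average.

    Near the ends of the list the window shrinks to the values available.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def count_sign_changes(values):
    """Count how often consecutive non-zero values switch sign."""
    signs = [v > 0 for v in values if v != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


# Toy series (not the real data): noisy around the point where it turns negative.
noisy = [5, 3, -1, 2, -2, 1, -3, -4, -6]
print(count_sign_changes(noisy))
print(count_sign_changes(moving_average(noisy)))
```

A series that crosses zero only once after smoothing gives a single, unambiguous turning point, which is the property a peak detector needs.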


                          distribution. Thanks to pmiguel on SEQanswers for providing raw data points. http://seqanswers.com/forums/showthread.php?p=43979#post43979





                          seb

                          • #14
                            Hi,

                            I have used SSPACE with abyss output after assembly with 180 and 550 PE libraries. I filtered for contigs > 200 and below is the output from SSPACE. I have a quick question about the output relating to repeats. After scaffolding with the final library I get the following:
                            Number of repeats = 14553
                            Total size of repeats = 1494450560
                            What do these figures relate to? It's funny, because if I add the total size of repeats to the total size of the scaffolded assembly after the final library is added, I get 1494450560 + 1149222136 = 2643672696, which is the estimated size of my genome!


                            Inserted contig file;
                            Total number of contigs = 440783
                            Sum (bp) = 657546051
                            Max contig size = 39800
                            Min contig size = 200
                            Average contig size = 1491
                            N50 = 3535

                            After scaffolding lib1: 3kb
                            Total number of scaffolds = 326357
                            Sum (bp) = 844894494
                            Max scaffold size = 102863
                            Min scaffold size = 200
                            Average scaffold size = 2588
                            N50 = 10046

                            After scaffolding lib2: 5kb
                            Total number of scaffolds = 266348
                            Sum (bp) = 993616335
                            Max scaffold size = 164536
                            Min scaffold size = 200
                            Average scaffold size = 3730
                            N50 = 17281

                            After scaffolding lib3: 10kb
                            Total number of scaffolds = 232199
                            Sum (bp) = 1149222136
                            Max scaffold size = 303516
                            Min scaffold size = 200
                            Average scaffold size = 4949
                            N50 = 29100

                            • #15
                              It's a complicated calculation, but basically it counts the number of contigs that are linked left, and the number of contigs that are linked right from the contig.

                              Say that contigA has three contigs that are linked left and two contigs linked right. The repeat is the highest number of links, thus here 3. This contig is thus said to be repeated 3 times in the assembly.
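
The contigA example can be written as a tiny sketch. This only mirrors the description in this post; it is not SSPACE's own code, and `repeat_count` and the contig names are hypothetical.

```python
def repeat_count(left_links, right_links):
    """Return the larger of the number of contigs linked left and linked right."""
    return max(len(left_links), len(right_links))


# contigA: three contigs linked on the left, two on the right.
left = ["contigB", "contigC", "contigD"]
right = ["contigE", "contigF"]
print(repeat_count(left, right))
```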

                              Have a look at the *.repeat file in the intermediate_results folder. Here, all repeats are listed.

                              Remember, though, that one copy of each repeated element is also included in the final assembly, so the repeats should be subtracted from the final scaffolds. For example, if contigA has a size of 1300bp and is repeated 4 times, the 1300bp should be subtracted from the final assembly, since the contig is already present within the scaffolds.

                              To improve your assembly, try to include the PE libraries in SSPACE too. Scaffolding with a combination of paired-end and mate-pair libraries is very powerful.

                              Boetsie
