
  • #16
    Originally posted by boetsie View Post
    [...]
    Remember, though, that one copy of each repeated element is also included in the final assembly, so the repeats should be subtracted from the final scaffolds. So if contigA, with a size of 1300 bp, is repeated 4 times, the 1300 bp should be subtracted from the final assembly, since the contig is already present within the scaffolds.
    [...]
    Boetsie
    Hi Boetsie,
    If there were a repetitive element present in 10 copies in a genome that assembled into a single contig, would SSPACE only place a single copy of that element in the final assembly? Or am I misreading you?
    --
    Phillip

    Comment


    • #17
      Originally posted by pmiguel View Post
      Hi Boetsie,
      If there were a repetitive element present in 10 copies in a genome that assembled into a single contig, would SSPACE only place a single copy of that element in the final assembly? Or am I misreading you?
      --
      Phillip
      In short, yes

      Comment


      • #18
        Originally posted by seb567 View Post
        Hi seb,
        We still get the "DetectionFailure: Yes" line in the ".LibraryStatistics.txt" file.
        Also the ".RayVersion.txt" file gives "Ray version: 1.6.0". So maybe the link above is to an older version?

        --
        Phillip

        Comment


        • #19
          Originally posted by boetsie View Post
          In short, yes
          Just wondering what the current state of the art is in full genome assembly...

          If there were 10 identical copies of a 1000 bp element scattered across an otherwise single-copy genome, would an assembler be able to reconstruct the genome without gaps? Say none of the elements were near each other and sufficient mate-end coverage existed, that is, 30X coverage with 2 kb ME reads.

          In principle it seems this should be possible, but I don't know if modern assemblers would do so.

          If not, would SSPACE reconstruct a gapless genome, or would it still produce a set of 10 scaffolds with one copy of the repetitive element?

          --
          Phillip

          Comment


          • #20
            If the library is larger than the repeated element, SSPACE will probably generate a single scaffold, though with gaps. The repeated contig will be present only once though.
            If the library is smaller than the repeated element, SSPACE will generate 10 scaffolds, and the repeated contig will be present in only one of them.

            I'm not sure how other assemblers/scaffolders are doing this, if they include all repeats or not.

            The remaining gaps can then be filled through gap closing. I'm currently working on a script to do this.

            Boetsie

            Comment


            • #21
              How to span very large repetitive blocks.

              Hi Boetsie,
              So would "gap closing" take the form of pulling out all the mates of the reads within the library length at the ends of contigs and attempting to assemble them into a contig that can span the gap? Or would it involve looking for individual reads that span the actual junction between the repetitive region and the single copy one that flanks it?

              I would like to point out that even extremely large repetitive blocks might contain (small) segments that are, effectively, single copy. This is because large repetitive blocks are often formed by nested insertions of (high copy number) transposable elements (TEs).

              Just to be clear, imagine one TE inserting between two single copy genes in a genome. Then in a later generation, imagine another TE inserting into the first one. This process can, and does, continue until you might have a > 100 kb block of highly repetitive DNA separating the two genes.

              Because this block may comprise many TEs, many of which have copy numbers in the hundreds or thousands, it might seem hopeless to close the sequencing gap this represents. But it may not be hopeless.

              Even though the TEs have high copy numbers in the genome, their junction with the DNA into which they inserted will likely still be unique. The effect is that even if your two low copy contigs are separated by an "ocean" of repetitive sequence, there likely will be small unique insertion-site-junction sequence "stepping stones" that could allow this ocean to be traversed.

              TEs are rarely longer than 20 kb. So repetitive blocks formed by clusters of them may be traversable in this manner.

              --
              Phillip

              Comment


              • #22
                Dear Boetsie,

                I tried SSPACE on a SOAPdenovo contig file with a size of 6.2 GB. SSPACE crashed with an error saying that the input exceeded 2^32-1 characters! Does SSPACE not work for huge contig files?

                Also, when I specify gzipped FASTA files (.fa.gz) in the library file, I get a different result (N50 = 1440) than when I provide uncompressed files (.fa; N50 = 1990). So I believe SSPACE does not properly handle compressed files such as .gz as input.
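
For anyone wanting to check the two N50 figures independently of SSPACE, here is a minimal sketch of how N50 is usually computed from a FASTA file. The function names `fasta_lengths` and `n50` are made up for illustration, and this is not SSPACE's own code; it assumes the common definition of N50 (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly size).

```python
def fasta_lengths(lines):
    """Return the length of each record in an iterable of FASTA lines.

    Hypothetical helper: `lines` can be an open file handle, whether it
    comes from open() or from gzip.open(path, "rt").
    """
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0
```

Running both the `.fa` and the decompressed `.fa.gz` scaffold output through the same function would show whether the difference comes from the assembly itself rather than from how the statistic is reported.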

                Aby

                Comment


                • #23
                  Hi,

                  I've found BGI's gapcloser works fairly well in conjunction with SSPACE... but looking forward to the release of GAPCLOSURE...

                  SSPACE was useful to me for finishing off ABySS/Velvet assemblies using Illumina mate pairs... the mate-pair data I obtained appeared to have quite high levels of "shadow library" contamination, so SSPACE's requirements for correct read orientation and expected separation distance appear to be a good way of reducing mistakes due to this contamination.

                  Cheers,
                  James Hane

                  Comment


                  • #24
                    Hi James,
                    What do you mean by "shadow library"?

                    --
                    Phillip

                    Comment


                    • #25
                      Hi Phillip,

                      My service provider uses the term "shadow library" and the name stuck with me... I'd appreciate it if you could enlighten me as to its more common names.

                      During the construction of Illumina mate pair libraries (as I understand it) the termini of very large fragments are circularised together... these are then fragmented and shorter fragments containing the joined termini (which are eventually sequenced in the <-- --> orientation relative to the original genomic sequence) are purified. However this process is inefficient and can be contaminated to various degrees by (non-circularised) contaminating short fragments (still in the original --> <-- orientation and not separated by a large distance).

                      If you were to reverse complement your mate reads back to the --> <-- orientation (how I do it anyway) and align these back to a reference genome... the end result is some reads aligning large distances apart in the FR orientation and some contaminating reads aligning a short distance apart in the RF orientation. (I've noticed that as the mate-pair insert size gets bigger, the shadow library contamination gets bigger too. I would appreciate it if anyone else noticing this would share their experiences.)
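
The orientation-and-distance check described above can be sketched roughly as follows. This is only an illustration: `classify_pair`, its parameters, and the 25% tolerance are all assumptions, not anything taken from SSPACE or an aligner. It assumes each read is given by its leftmost alignment position and strand on the same reference sequence, after the mate reads have been reverse complemented as described.

```python
def classify_pair(pos1, strand1, pos2, strand2, read_len, expected_span, tolerance=0.25):
    """Label an aligned read pair as 'mate_pair', 'shadow' or 'other'.

    Hypothetical helper. Positions are leftmost alignment coordinates,
    strands are '+' or '-', and expected_span is the library's expected
    outer distance. The tolerance is an arbitrary illustrative value.
    """
    # Order the two reads by position so orientation is judged left to right.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pos1, strand1), (pos2, strand2)]
    )
    span = right_pos + read_len - left_pos

    if left_strand == "+" and right_strand == "-":
        orientation = "FR"  # reads point towards each other: --> <--
    elif left_strand == "-" and right_strand == "+":
        orientation = "RF"  # reads point away from each other: <-- -->
    else:
        return "other"      # same-strand pairs fit neither pattern

    low = expected_span * (1 - tolerance)
    high = expected_span * (1 + tolerance)
    if orientation == "FR" and low <= span <= high:
        return "mate_pair"  # correct orientation at the expected distance
    if orientation == "RF" and span < low:
        return "shadow"     # short-insert contaminant in the opposite orientation
    return "other"
```

Counting the fraction of pairs labelled `shadow` against a reference or a draft assembly would give a rough estimate of how contaminated a given mate-pair library is.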

                      This is pretty bad for scaffolding a de novo assembly... though some assemblers, e.g. Velvet, can allow for some level of contamination. SSPACE takes read pair alignments and the expected separation distances of pairs into account when it joins scaffold ends together, minimising the "shadow library" problem to some extent.

                      Cheers,
                      James

                      Comment


                      • #26
                        James,
                        I don't have a term for this phenomenon. So "shadow library" is fine with me. BTW, near as I could tell ABySS-PE also handles shadow library contamination without problems.

                        --
                        Phillip

                        Comment
