SSPACE libraries

AdrianP

07-02-2013, 12:16 PM

Hello,

I am currently trying to scaffold an assembly with SSPACE, and can already see some progress.

Code:

SUMMARY:
------------------------------------------------------------
        Inserted contig file;
                Total number of contigs = 5308
                Sum (bp) = 42486218
                        Total number of N's = 158184
                        Sum (bp) no N's = 42328034
                Max contig size = 134487
                Min contig size = 1000
                Average contig size = 8004
                N50 = 17056

        After scaffolding lib1:
                Total number of scaffolds = 3392
                Sum (bp) = 42503701
                        Total number of N's = 178406
                        Sum (bp) no N's = 42325295
                Max scaffold size = 190774
                Min scaffold size = 1000
                Average scaffold size = 12530
                N50 = 27904

        After scaffolding lib2:
                Total number of scaffolds = 1820
                Sum (bp) = 42986473
                        Total number of N's = 661945
                        Sum (bp) no N's = 42324528
                Max scaffold size = 365239
                Min scaffold size = 1000
                Average scaffold size = 23618
                N50 = 50457

------------------------------------------------------------

The first library is just paired end, and the second library is mate pairs.

This is my library file:

Code:

lib1 GDR-16_65bp_R1.fastq GDR-16_65bp_R2.fastq 280 0.8 FR
lib2 MPNC_65bp_R1.fastq MPNC_65bp_R2.fastq 2411 0.5 FR

I am a bit confused about the 5th column, which is the "deviation of the mean distance". According to the example in the tutorial, a deviation of 0.75 with a 200 bp insert accepts distances of 150 to 250. I assume that is because 0.75*200 = 150? I am asking because in both of my libraries a large percentage of pairs were reported as "calculated distances out-of-bounds".
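For reference, the stats output reports the target as "280 +/-224" for lib1 and "2411 +/-1205.5" for lib2, which is what you get from mean * deviation on either side of the mean. A minimal sketch of that arithmetic, assuming this is how the fifth column is applied (the function name `distance_window` is only for illustration and is not part of SSPACE):

```python
def distance_window(mean_insert, deviation):
    """Return (tolerance, low, high), ASSUMING tolerance = mean_insert * deviation."""
    tolerance = mean_insert * deviation
    return tolerance, mean_insert - tolerance, mean_insert + tolerance


# Columns 4 and 5 of the library file in this post.
libraries = [("lib1", 280, 0.8), ("lib2", 2411, 0.5)]

for name, mean_insert, deviation in libraries:
    tolerance, low, high = distance_window(mean_insert, deviation)
    print(f"{name}: {mean_insert} +/- {tolerance:g} -> accepts {low:g} to {high:g} bp")
```

Under that assumption, lib1 would accept distances of roughly 56 to 504 bp, which is a much wider window than the 150 to 250 I read out of the tutorial example.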

I know mate-pair libraries contain a lot of paired-end contamination, but the first library is paired end, and almost 50% of its pairs did not satisfy the distance. I am pasting the library stats:

Lib1, paired end:

Code:

LIBRARY lib1 STATS:
################################################################################

MAPPING READS TO CONTIGS:
------------------------------------------------------------
        Number of single reads found on contigs = 9149900
        Number of pairs used for pairing contigs / total pairs = 3517598 / 3646662
------------------------------------------------------------

READ PAIRS STATS:
        Assembled pairs: 3517598 (7035196 sequences)
                Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 280 +/-224): 1668422
                Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 5097
                Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 7824
                ---
                Satisfied in distance/logic within a given contig pair (pre-scaffold): 240094
                Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 1596161
                ---
        Total satisfied: 1908516        unsatisfied: 1609082


        Estimated insert size statistics (based on 1673519 pairs):
                Mean insert size = 240
                Median insert size = 230
REPEATS:
        Number of repeated edges = 1665
------------------------------------------------------------

################################################################################

Lib2, mate pair:

Code:

LIBRARY lib2 STATS:
################################################################################

MAPPING READS TO CONTIGS:
------------------------------------------------------------
        Number of single reads found on contigs = 5924956
        Number of pairs used for pairing contigs / total pairs = 1560927 / 1708281
------------------------------------------------------------

READ PAIRS STATS:
        Assembled pairs: 1560927 (3121854 sequences)
                Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 2411 +/-1205.5): 129649
                Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 40359
                Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 427578
                ---
                Satisfied in distance/logic within a given contig pair (pre-scaffold): 267259
                Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 696082
                ---
        Total satisfied: 396908 unsatisfied: 1164019


        Estimated insert size statistics (based on 170008 pairs):
                Mean insert size = 1951
                Median insert size = 2237
REPEATS:
        Number of repeated edges = 1569
------------------------------------------------------------

################################################################################

It looks like I should adjust the mean for lib2. For lib1, I am wondering what is causing roughly 46% of the pairs (1609082 of 3517598) to not satisfy the distance, which is basically 280 +/-224.
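To see where the unsatisfied pairs come from, here is a rough sketch that splits them by category using the counts copied from the two stats blocks above. The dictionary keys and the function name are my own labels, not SSPACE fields:

```python
# Counts copied from the READ PAIRS STATS sections of lib1 and lib2.
lib_stats = {
    "lib1": {
        "assembled": 3517598,
        "within_contig_distance": 5097,
        "within_contig_logic": 7824,
        "between_contig_distance": 1596161,
    },
    "lib2": {
        "assembled": 1560927,
        "within_contig_distance": 40359,
        "within_contig_logic": 427578,
        "between_contig_distance": 696082,
    },
}


def unsatisfied_fractions(stats):
    """Return each unsatisfied category as a fraction of assembled pairs."""
    total = stats["assembled"]
    return {key: count / total for key, count in stats.items() if key != "assembled"}


for name, stats in lib_stats.items():
    fractions = unsatisfied_fractions(stats)
    print(name, f"total unsatisfied: {sum(fractions.values()):.1%}")
    for key, fraction in fractions.items():
        print(f"    {key}: {fraction:.1%}")
```

For lib1, nearly all of the unsatisfied pairs fall in the "within a given contig pair" category, while the within-contig categories are well under 1% of assembled pairs.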

Any thoughts on how I could improve my scaffolding?
