  • SSPACE libraries

    Hello,

    I am currently trying to scaffold an assembly with SSPACE, and can already see some progress.

    Code:
    SUMMARY:
    ------------------------------------------------------------
            Inserted contig file;
                    Total number of contigs = 5308
                    Sum (bp) = 42486218
                            Total number of N's = 158184
                            Sum (bp) no N's = 42328034
                    Max contig size = 134487
                    Min contig size = 1000
                    Average contig size = 8004
                    N50 = 17056
    
            After scaffolding lib1:
                    Total number of scaffolds = 3392
                    Sum (bp) = 42503701
                            Total number of N's = 178406
                            Sum (bp) no N's = 42325295
                    Max scaffold size = 190774
                    Min scaffold size = 1000
                    Average scaffold size = 12530
                    N50 = 27904
    
            After scaffolding lib2:
                    Total number of scaffolds = 1820
                    Sum (bp) = 42986473
                            Total number of N's = 661945
                            Sum (bp) no N's = 42324528
                    Max scaffold size = 365239
                    Min scaffold size = 1000
                    Average scaffold size = 23618
                    N50 = 50457
    
    ------------------------------------------------------------
    The first library is just paired end, and the second library is mate pairs.

    This is my library file:
    Code:
    lib1 GDR-16_65bp_R1.fastq GDR-16_65bp_R2.fastq 280 0.8 FR
    lib2 MPNC_65bp_R1.fastq MPNC_65bp_R2.fastq 2411 0.5 FR
    I am a bit confused about the 5th column, which is the "deviation of the mean distance". From the example in the tutorial, a deviation of 0.75 on a 200 bp insert accepts distances of 150 to 250. I assume that is because 0.75*200 = 150? I am asking because in both of my libraries a large percentage of read pairs were reported as "calculated distances out-of-bounds".
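As a quick check on how the 5th column is applied, here is a minimal Python sketch. It assumes the accepted window is the insert size plus or minus (insert size * deviation), which is the reading that reproduces the "280 +/-224" and "2411 +/-1205.5" targets printed in the library stats in this post. The function name `distance_window` is made up for illustration and is not part of SSPACE.

```python
def distance_window(insert_size, deviation):
    """Return (low, high) accepted distances.

    Assumption: the window is insert_size +/- insert_size * deviation,
    inferred from the "+/-" values in the SSPACE log, not from its source code.
    """
    half_width = insert_size * deviation
    return insert_size - half_width, insert_size + half_width


# The two libraries from the library file in this post.
for name, insert_size, deviation in [("lib1", 280, 0.8), ("lib2", 2411, 0.5)]:
    low, high = distance_window(insert_size, deviation)
    print(f"{name}: {insert_size} +/- {insert_size * deviation:g} "
          f"-> accepted distances {low:g} to {high:g}")
```

Under the same assumption, a deviation of 0.75 on a 200 bp insert would give a half-width of 150 and therefore a window of 50 to 350 rather than 150 to 250, so the tutorial example may be worth rechecking against the log output.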

    I know in mate pairs you have a lot of paired end contamination, but the first library was paired end, and I had almost 50% of reads that did not satisfy the distance. I am pasting library stats:

    Lib1, paired end:
    Code:
    LIBRARY lib1 STATS:
    ################################################################################
    
    MAPPING READS TO CONTIGS:
    ------------------------------------------------------------
            Number of single reads found on contigs = 9149900
            Number of pairs used for pairing contigs / total pairs = 3517598 / 3646662
    ------------------------------------------------------------
    
    READ PAIRS STATS:
            Assembled pairs: 3517598 (7035196 sequences)
                    Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 280 +/-224): 1668422
                    Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 5097
                    Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 7824
                    ---
                    Satisfied in distance/logic within a given contig pair (pre-scaffold): 240094
                    Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 1596161
                    ---
            Total satisfied: 1908516        unsatisfied: 1609082
    
    
            Estimated insert size statistics (based on 1673519 pairs):
                    Mean insert size = 240
                    Median insert size = 230
    REPEATS:
            Number of repeated edges = 1665
    ------------------------------------------------------------
    
    ################################################################################
    Lib2, mate pair:
    Code:
    LIBRARY lib2 STATS:
    ################################################################################
    
    MAPPING READS TO CONTIGS:
    ------------------------------------------------------------
            Number of single reads found on contigs = 5924956
            Number of pairs used for pairing contigs / total pairs = 1560927 / 1708281
    ------------------------------------------------------------
    
    READ PAIRS STATS:
            Assembled pairs: 1560927 (3121854 sequences)
                    Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 2411 +/-1205.5): 129649
                    Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 40359
                    Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 427578
                    ---
                    Satisfied in distance/logic within a given contig pair (pre-scaffold): 267259
                    Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 696082
                    ---
            Total satisfied: 396908 unsatisfied: 1164019
    
    
            Estimated insert size statistics (based on 170008 pairs):
                    Mean insert size = 1951
                    Median insert size = 2237
    REPEATS:
            Number of repeated edges = 1569
    ------------------------------------------------------------
    
    ################################################################################
    It looks like I should adjust the mean for lib2, but for lib1 I am wondering what is causing almost 50% of the pairs to fail the distance check, which is basically 280 +/-224.

    Any thoughts on how I could improve my scaffolding?
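To narrow down where the unsatisfied pairs come from, the counts in the two READ PAIRS STATS blocks can be split into pairs that map within a single contig and pairs that span a contig pair. The sketch below only redoes the arithmetic on the numbers pasted above; the helper name `satisfied_fraction` and the dictionary layout are illustrative choices, not anything produced by SSPACE.

```python
def satisfied_fraction(satisfied, *unsatisfied):
    """Fraction of pairs that are satisfied, given one satisfied count
    and any number of unsatisfied counts for the same category."""
    total = satisfied + sum(unsatisfied)
    return satisfied / total


# Counts copied from the LIBRARY lib1 / lib2 STATS output in this post.
stats = {
    "lib1": {
        "within contigs": (1668422, 5097, 7824),   # satisfied, out-of-bounds, illogical
        "between contig pairs": (240094, 1596161), # satisfied, out-of-bounds
        "overall": (1908516, 1609082),             # total satisfied, total unsatisfied
    },
    "lib2": {
        "within contigs": (129649, 40359, 427578),
        "between contig pairs": (267259, 696082),
        "overall": (396908, 1164019),
    },
}

for lib, categories in stats.items():
    for label, counts in categories.items():
        print(f"{lib} {label}: {satisfied_fraction(*counts):.1%} satisfied")
```

For lib1 this gives roughly 99% satisfied within contigs but only about 13% satisfied between contig pairs, so nearly all of the lib1 failures are in the between-contig category. That localizes the problem but does not by itself explain why those calculated distances fall outside the window. For lib2, the large "illogical pairing" count within contigs stands out instead.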
