
  • PBJelly errors in setup, extraction, support stages

    I've been trying to construct a de novo assembly of a mammalian genome for some time now. Currently I have an incomplete genome assembled from Illumina data with ALLPATHS-LG, and I would like to use PBJelly to fill in the gaps using PacBio reads.

    I ran the test data successfully, but the pipeline doesn't seem to work on my real data. I'm seeing essentially no improvement in my assembly quality after running PBJelly on my PacBio reads. I'm getting a lot of errors, especially at the setup and mapping stages. About twenty percent of my scaffold references are giving me this message in setup:

    Code:
    2015-03-19 09:48:26,814 [DEBUG] Scaffold scaffold_40566|ref0053720 is empty
    I'm not seeing any other errors in setup, though. In extraction, I get this kind of output:

    Code:
    2015-03-24 11:25:12,545 [INFO] Parsing /scratch/02985/emg2497/mouse_genome_project/pbjelly_nojoblimit/pacbioreads/Pacbio_A05_1.1.mod.fastq
    2015-03-24 11:25:18,887 [INFO] Loaded 53626 Reads
    2015-03-24 11:25:21,197 [INFO] Parsed 12357 Reads
    2015-03-24 11:25:21,197 [INFO] Parsing /scratch/02985/emg2497/mouse_genome_project/pbjelly_nojoblimit/pacbioreads/Pacbio_A05_1.2.mod.fastq
    2015-03-24 11:25:24,073 [INFO] Loaded 48605 Reads
    2015-03-24 11:25:28,346 [INFO] Parsed 11056 Reads
    And so forth for the rest of my data. Again, it appears to be throwing out another 20% of the data. Support is where I start to see even more issues, with both of these flags coming up in large numbers:

    Code:
    2015-03-20 14:02:14,425 [DEBUG] Hit for m140207_170145_42153_c100619042550000001823119607181456_s1_p0/2576/2848_6155 has mapq 0 - below threshold 200
    2015-03-20 14:02:14,429 [DEBUG] Hit for m140207_170145_42153_c100619042550000001823119607181456_s1_p0/2782/17335_18304 has mapq 0 - below threshold 200
    
    2015-03-20 14:02:30,989 [DEBUG] gapSup
    2015-03-20 14:02:30,989 [DEBUG] - Strand on m140207_170145_42153_c100619042550000001823119607181456_s1_p0/16349/3190_8591
    2015-03-20 14:02:30,989 [DEBUG] RightDist 202 remainSeq -25
    2015-03-20 14:02:30,990 [DEBUG] LeftDist -4938 remainSeq -25
    2015-03-20 14:02:30,990 [DEBUG]
    2015-03-20 14:02:30,990 [DEBUG] gapSup
    2015-03-20 14:02:30,990 [DEBUG] - Strand on m140207_170145_42153_c100619042550000001823119607181456_s1_p0/16349/3190_8591
    2015-03-20 14:02:30,990 [DEBUG] RightDist -3599 remainSeq -25
    2015-03-20 14:02:30,990 [DEBUG] LeftDist -1217 remainSeq -25
    2015-03-20 14:02:30,990 [DEBUG] span support
    2015-03-20 14:02:30,990 [DEBUG]
    I've checked the reads with tools like FastQC and they don't seem to be noticeably lower quality than I would expect, so I'm finding this very confusing. I'm running PBJelly with all the defaults. Is there anything that might be confounding my analysis and producing these results? I'd be happy to post more log data if it would be helpful.

    Does anyone have any advice? Any insight at all would be very welcome.
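
    To put numbers on the log excerpts above, here is a minimal sketch that tallies them. The patterns are copied from the lines quoted in this post, so other PBJelly versions may word the messages differently, and the function names are hypothetical helpers, not part of PBJelly. The script makes no claim about what "Parsed" means inside PBJelly; it only counts.

```python
import re

# Patterns taken from the log lines quoted in this thread (assumption:
# other PBJelly versions may phrase these messages differently).
LOADED = re.compile(r"\[INFO\] Loaded (\d+) Reads")
PARSED = re.compile(r"\[INFO\] Parsed (\d+) Reads")
LOW_MAPQ = re.compile(r"has mapq \d+ - below threshold \d+")
EMPTY = re.compile(r"\[DEBUG\] Scaffold \S+ is empty")


def summarize_log(lines):
    """Tally read counts, low-mapq hits and empty scaffolds from log lines."""
    totals = {"loaded": 0, "parsed": 0, "low_mapq": 0, "empty_scaffolds": 0}
    for line in lines:
        loaded = LOADED.search(line)
        parsed = PARSED.search(line)
        if loaded:
            totals["loaded"] += int(loaded.group(1))
        elif parsed:
            totals["parsed"] += int(parsed.group(1))
        elif LOW_MAPQ.search(line):
            totals["low_mapq"] += 1
        elif EMPTY.search(line):
            totals["empty_scaffolds"] += 1
    return totals


def parsed_fraction(totals):
    """Return parsed / loaded, or 0.0 when nothing was loaded."""
    if totals["loaded"] == 0:
        return 0.0
    return totals["parsed"] / totals["loaded"]
```

    Run over the two files quoted above, this counts 23,413 parsed out of 102,231 loaded reads, or roughly 23%. Passing an open log file to `summarize_log` gives the totals for a whole run.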

  • #2
    Can you share your starting assembly statistics, and PacBio coverage level?


    • #3
      Sure! My starting assembly was done in ALLPATHS-LG using two Illumina libraries: a fragment library with 62x coverage and a mate-pair library with 43x coverage. All of my PacBio libraries together come to about 1x coverage.

      The starting assembly had scaffold N50s of 96,649 bp (with gaps) and 73,555 bp (without gaps). The contig N50 is 6,131 bp.
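
      For anyone wanting to reproduce figures like these, here is a minimal sketch of the usual N50 calculation. It assumes sequence lengths are already available as a list, and that contigs are obtained by splitting each scaffold at runs of N. The function names are hypothetical, and assemblers may report N50 with slightly different conventions (for example a minimum length cutoff).

```python
import re


def n50(lengths):
    """Return the length L such that sequences >= L hold half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def contig_lengths(scaffold_seq):
    """Split a scaffold at runs of N/n (gaps) and return the contig lengths."""
    return [len(piece) for piece in re.split(r"[Nn]+", scaffold_seq) if piece]
```

      Scaffold N50 "with gaps" uses the full scaffold lengths, "without gaps" uses the summed contig lengths per scaffold, and contig N50 uses the individual pieces returned by `contig_lengths`.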

      Any other metrics that might be useful?


      • #4
        You cannot close gaps with 1x coverage. I would recommend 5x at an absolute minimum, preferably closer to 10x, and it really helps if the data is size-selected.
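
        A rough way to check where a dataset stands against these thresholds is total read bases divided by genome size. The sketch below assumes plain four-line FASTQ records (no wrapped sequence lines) and a genome size estimate supplied by the user. The function names and the `min_length` cutoff are hypothetical illustrations, not PBJelly options.

```python
def fastq_read_lengths(handle):
    """Yield the sequence length of each record in a four-line FASTQ stream."""
    for index, line in enumerate(handle):
        if index % 4 == 1:  # second line of every record is the sequence
            yield len(line.strip())


def estimate_coverage(read_lengths, genome_size, min_length=0):
    """Return total bases in reads >= min_length divided by genome_size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    total_bases = sum(n for n in read_lengths if n >= min_length)
    return total_bases / genome_size
```

        Calling `estimate_coverage` a second time with a larger `min_length` shows how much of the coverage comes from long reads, which is one way to gauge the effect of size selection.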
