  • pacbio sequence error correction

    Hi all,

    I have some PacBio long-read data, about 10x coverage of a 120 Mb genome. I already have a reference genome, but it is incomplete and contains many gaps. What I am trying to do is error correct my PacBio reads and assemble the genome. Later on I will add more Illumina data to try to close the gaps.

    My question about the error correction is: can I use the incomplete reference genome to error correct my PacBio data? My plan is to convert the genome FASTA into the frg format required by pacBioToCA, and then feed my PacBio data and the genome frg data to the correction pipeline to get error-corrected reads. My concern is: will pacBioToCA accept relatively long genome scaffold sequences as the high-identity sequence for correcting my PacBio data?

    Suggestions and help are greatly appreciated.

    Stuart

  • #2
    I am not able to figure out how to use the incomplete reference genome for error correction. It looks like fastqToCA converts a FASTQ file to an frg file so that it can be used as the high-identity sequence for error correction. However, the incomplete genome assembly is a FASTA file, and no quality score file is available. How can I get around this?

    Many thanks!

    Stuart
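
One possible workaround for the missing quality scores, sketched here in Python, is to write each FASTA record out as a FASTQ record with a constant placeholder quality. This is only an illustration: the function names and the default Phred value of 40 are assumptions, the placeholder qualities carry no real accuracy information, and nothing in this thread confirms that pacBioToCA will accept long scaffold sequences as correction input (a later reply recommends the raw Illumina reads instead).

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            # Keep only the first word of the header as the record name.
            name = fields[0] if fields else "unnamed"
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fasta_to_fastq(fasta_handle, fastq_handle, phred=40):
    """Write FASTA records as FASTQ with one constant, made-up quality value.

    `phred` is a hypothetical placeholder, encoded as Phred+33 (Sanger).
    Returns the number of records written.
    """
    if not 0 <= phred <= 93:
        raise ValueError("phred must be between 0 and 93 for Phred+33")
    qual_char = chr(phred + 33)
    count = 0
    for name, seq in read_fasta(fasta_handle):
        fastq_handle.write(f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n")
        count += 1
    return count


if __name__ == "__main__":
    # Toy input standing in for the draft assembly FASTA.
    demo = io.StringIO(">scaffold1 draft\nACGTNNAC\n")
    out = io.StringIO()
    fasta_to_fastq(demo, out)
    print(out.getvalue(), end="")
```

For a real assembly, the two `io.StringIO` objects would be replaced by files opened with `open(...)`. Whether the resulting FASTQ is then usable as high-identity input is the open question of this post, not something the sketch settles.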



    • #3
      Perhaps use the PBJelly pipeline to fill the gaps? Also, with an appropriate pipeline (Quiver: https://github.com/PacificBiosciences/GenomicConsensus) you may not need error correction to call an accurate consensus.

      cheers,
      -mark



      • #4
        Thanks for the tips, Mark! It looks like it will take me a while to figure this out. However, it sounds interesting that I might not need to do error correction for PacBio data, given that it has a 15% error rate.

        Stuart



        • #5
          Some more tips: if you want to use pacBioToCA, the approach would be to use the raw Illumina data as input to the correction step, not the draft assembly. The advantage of going back to the raw data is you may be able to correct assembly errors. The disadvantage is it takes longer to run.

          If you want to keep the assembly as is, you can install SMRT Analysis and use AHA (a hybrid assembler) to scaffold it, provided your genome is smaller than about 200 Mb. For larger genomes, or to really focus on the gap-filling, you can use PBJelly.

          Finally, the "no error correction" suggestion refers to the new algorithm HGAP: http://www.pacbiodevnet.com/hgap. You'll need more PacBio coverage to go that route. The benefit is you may be able to close more gaps and get a final result that's potentially as accurate as Sanger finishing.



          • #6
            Thanks for your tips, jbingham! I am in the process of generating short-read Illumina data for the error correction. I don't think I have enough coverage to try the new algorithm, since my PacBio data only gives 3-4x coverage when I look into it more carefully. The vast majority of the reads are shorter than 500 bp or 1000 bp; the longest read is 13 kb. I will post my progress later.

            Thanks again to Winsettz and jbingham for helping out here!

            Stuart
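
The coverage figures quoted in this thread (a nominal 10x, then 3-4x on closer inspection) come down to total read bases divided by genome size. A minimal Python sketch of that arithmetic follows, assuming a plain four-line-per-record FASTQ file; the function names and the `min_length` cutoff are made up for illustration and are not part of any PacBio tool.

```python
import io


def fastq_read_lengths(handle):
    """Yield read lengths from a FASTQ handle (assumes 4 lines per record)."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the sequence is the second line of each record
            yield len(line.strip())


def coverage_summary(read_lengths, genome_size, min_length=0):
    """Summarise reads of at least `min_length` bases against a genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    kept = [n for n in read_lengths if n >= min_length]
    total = sum(kept)
    return {
        "reads": len(kept),
        "total_bases": total,
        "coverage": total / genome_size,  # fold coverage
        "longest": max(kept, default=0),
    }


if __name__ == "__main__":
    # Toy lengths only; real values would come from fastq_read_lengths().
    toy_lengths = [400, 900, 13000, 700]
    print(coverage_summary(toy_lengths, genome_size=1000))
    print(coverage_summary(toy_lengths, genome_size=1000, min_length=1000))
```

Running the summary twice, once on all reads and once with a length cutoff, shows how much of the nominal coverage comes from very short reads, which is the gap between the two estimates described above.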
