  • #16
    I don't know about cosmid data, but the tutorial provides you with sample data. It's useful if you're just trying to get familiar with the bas.h5 file type.

    Code:
    wget http://files.pacb.com/software/hgap/HGAp_BAS_H5_DATA/HGAp_BAS_H5_DATA/BAC/m120729_040044_42134_c100384402550000001523033010171256_s1_p0.bas.h5
    Cheers
    Nick
    Last edited by OstanNick; 07-23-2014, 08:20 AM.



    • #17
      Yes, I will use this sample data. BACs and cosmids are ultimately quite similar, so the method should be the same.

      Thanks,
      seb.



      • #18
        I just wanted to add that this is very old data (from an original RS machine, not an RS II). It may be better to go with something that isn't a BAC or cosmid but is current. The data is also missing a metadata.xml file, so it will be almost impossible to import into SMRT Portal.



        • #19
          Hi, it's been a while since the last post in this thread, but I have a question/comment that could help people who later want to filter PacBio libraries of a "metagenomic" nature, especially as the links above redirect to protocols using older versions of the programs. I have used the latest version of SMRT Analysis (2.3.0) to process my data (and assemble it using RS_HGAP_Assembly.2), and in addition to the .h5 reads, SMRT Portal has produced raw subreads in FASTA format. Is this a new output feature? Or would these not work for metagenomic filtering against a known "non-target" organism? Thanks!



          • #20
            Hi!

            Like "acgt", I also have to process and assemble metagenomic data with PacBio reads. I follow these publications (they may be useful to some):





            The problem is that when I BLAST the preassembled reads against the NCBI nt database, I find many reads whose best hit covers only part of the read length.

            Consequently, I would like to ask:
            1) How could I improve the sensitivity of the overlapping step (e.g. via blasr parameters) so that consensus generation produces fewer chimeric preassembled reads?
            2) Should I keep the same blasr values for the polishing step?

            Thanks in advance



            • #21
              Hi,
              Last question first.
              2. The blasr parameters for the preassembly and polishing are completely independent.

              I'm not sure I fully understand question 1. Given the default parameters (HGAP.3), the chance of correcting a randomly created biological chimera is very low. I can probably recommend some blasr parameters, but the recommendation will depend on what exactly is happening, so could you give a little more detail?

              Is the sample a non-amplified (not WGA or MDA) shotgun sample?
              Are you using HGAP.3 for the assembly?
              Are you sure that the preassembly is generating chimeric corrected reads? i.e. do you have a perfect reference for what is being sequenced?
              What do the final assembled reads look like? Are the chimeric reads used in the assembly?



              • #22
                Dear rhall,

                Thank you very much for your reply and willingness to help us out.

                Let me specify more: we are sequencing nematode DNA (roughly 31% GC content) extracted from the soil of infected plants (tomato; similar GC content); no amplification step.

                We use HGAP.3. We have checked part of our assembled reads with BLAST vs nrDB(nt) to get a picture of the key taxonomic groups: we find a lot of proteobacteria (GC content varying from 40-70%), as well as arthropods and several other organisms. For a good portion of the assembled reads used as BLAST input, the BLAST hits cover only a small part of the read, which may indicate that the parameters selected for the pre-assembly step could be improved to avoid chimeras.

                Unfortunately, we don’t have a perfect reference. In the final assembly, we see cases where only a small fraction of a long read hits, e.g., the tomato genome (which has been sequenced), so that’s why we want to make sure we are strict in the pre-assembly step. In fact, we’d ultimately like to use a combination of GC content, coverage and taxonomy information to select the reads for a partial nematode genome assembly. Restricting the reads at the pre-assembly stage, whose results we screen by BLAST, would be very helpful.
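A rough sketch of the screening described above, in Python. It assumes the BLAST results were written in tabular form with the query start, query end and query length available for each hit (in BLAST+ these are the `qstart`, `qend` and `qlen` format specifiers). The function names and the 0.8 coverage cutoff are made up for illustration; they are not part of HGAP or BLAST.

```python
def gc_content(seq):
    """Return the GC fraction of a sequence, ignoring N and other ambiguous bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def query_coverage(qstart, qend, qlen):
    """Fraction of the query covered by one BLAST HSP (1-based, inclusive)."""
    lo, hi = min(qstart, qend), max(qstart, qend)
    return (hi - lo + 1) / qlen


def flag_partial_hits(hits, min_cov=0.8):
    """Return {read_id: best_coverage} for reads whose longest HSP covers
    less than min_cov of the read.

    hits: iterable of (read_id, qstart, qend, qlen) tuples.
    min_cov: hypothetical cutoff, to be tuned for the data set.
    """
    best = {}
    for read_id, qstart, qend, qlen in hits:
        cov = query_coverage(qstart, qend, qlen)
        best[read_id] = max(cov, best.get(read_id, 0.0))
    return {read_id: cov for read_id, cov in best.items() if cov < min_cov}
```

Reads flagged this way are only candidates: a partial hit can equally mean that the database lacks a close relative, so it would still need to be checked against GC content and coverage before calling a read chimeric.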

                Perhaps you could suggest parameter settings that would avoid potential chimera creation in this metagenome sequencing situation.

                Thank you very much!



                • #23
                  Sorry, a few more questions. The HGAP.3 parameters are set such that artificial chimera generation is extremely unlikely. I have used HGAP.3 (default parameters) with multiple metagenomic samples and have never had this problem, so it's probably best to diagnose where the chimeras are coming from before trying different parameters by brute force.
                  Are you blasting the preassembled reads ('corrected.fasta') or the assembled reads ('polished_assembly.fasta')? If it is the preassembled reads, then it is possible to increase the coverage requirement for preassembly ('Minimum Coverage For Correction'). The default is 6; increasing it will reduce the preassembled yield, but it is already unlikely that 6 reads would support a chimeric read that has no biological root cause. Chimera generation in the OLC step would be affected by the overlap error rate in the CA spec file, but that is already extremely conservative.

                  One option is to take the set of reads that contain possible chimeras and resequence all the data against these reads to see if the chimeras have raw read support.
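Once the raw reads have been mapped back to a suspect read, the check for raw read support could look roughly like the Python sketch below. It assumes the alignments have already been reduced to (start, end) intervals on the suspect read (0-based, end-exclusive); how those intervals are extracted from the mapper output is not shown. The function names and the `flank`, `step` and `min_support` values are illustrative assumptions, not SMRT Analysis settings.

```python
def spanning_support(alignments, position, flank=50):
    """Count alignments that cover `position` with at least `flank` bases
    on both sides. alignments: iterable of (start, end) intervals."""
    return sum(
        1
        for start, end in alignments
        if start <= position - flank and end >= position + flank
    )


def unsupported_positions(alignments, read_length, step=100, flank=50, min_support=1):
    """Scan the read every `step` bases and return the positions that fewer
    than `min_support` raw reads span. A cluster of such positions marks a
    candidate chimeric junction."""
    alignments = list(alignments)
    return [
        pos
        for pos in range(flank, read_length - flank + 1, step)
        if spanning_support(alignments, pos, flank) < min_support
    ]
```

For example, with raw reads aligned at (0, 500) and (450, 1000) on a 1000 bp read, a scan with `step=25` reports position 475: the two alignments overlap there, but neither spans it with a 50 bp flank on both sides. A coarse step can miss a narrow gap like this, so the step should be small relative to the flank.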



                  • #24
                    Hi,
                    This is my first contact with PacBio. I have the filtered_subreads in FASTQ and FASTA format, and the subreads without the spike-in control in FASTA (later I'll get all the raw data). Apparently, the PacBio pipeline does not remove the control reads from the FASTQ files (so they told me). What's the best way to get subreads in FASTQ without the control reads? Thanks

                    *I don't work with SMRT Portal.



                    • #25
                      Who provided the filtered_subreads.fasta/q? It is possible to filter out the spike-in control when these are generated.
                      Do I understand correctly that you have subreads in FASTA format without the control? You could use this as a reference to extract the matching reads from the filtered_subreads.fastq file.
                      What are the files being used as input to? If it's assembly, simply leave them in; the control will assemble out.
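A minimal Python sketch of the extraction suggested above: collect the read IDs from the control-free FASTA, then keep only the FASTQ records with those IDs. It assumes the subread IDs are identical in both files and that the FASTQ uses plain four-line records (no wrapped sequence lines). The function names and file names are illustrative only.

```python
def fasta_ids(lines):
    """Collect read IDs (first word after '>') from FASTA header lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    }


def filter_fastq(lines, keep_ids):
    """Yield four-line FASTQ records whose ID is in keep_ids.

    Assumes unwrapped records: header, sequence, '+', qualities.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header = record[0][1:].split()
            if header and header[0] in keep_ids:
                yield "\n".join(record) + "\n"
            record = []


def filter_files(fasta_path, fastq_path, out_path):
    """Write the FASTQ records whose IDs appear in the FASTA to out_path."""
    with open(fasta_path) as fasta:
        keep = fasta_ids(fasta)
    with open(fastq_path) as fastq, open(out_path, "w") as out:
        out.writelines(filter_fastq(fastq, keep))


# Example call (hypothetical file names):
# filter_files("subreads_no_control.fasta", "filtered_subreads.fastq",
#              "filtered_subreads_no_control.fastq")
```

Only the ID set is held in memory; the FASTQ is streamed record by record, so this should cope with large subread files.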



                      • #26
                        Originally posted by rhall
                        Who provided the filtered_subreads.fasta/q?
                        The sequencing core facility where our sample was sent.

                        Originally posted by rhall
                        Do I understand correctly that you have subreads in FASTA format without the control? You could use this as a reference to extract the matching reads from the filtered_subreads.fastq file.
                        That's a good point to start with.

                        Originally posted by rhall
                        What are the files being used as input to? If it's assembly, simply leave them in; the control will assemble out.
                        Will it? I haven't done anything yet; the idea is to wait for the Illumina reads and make a hybrid assembly. In the meantime, I'll try to play with the subreads and make a draft assembly... The SMRT pipes/protocols are killing me!

                        Thanks for your reply!



                        • #27
                          It is highly unlikely that any of your real sequence is shared with the control, so I wouldn't worry about leaving it in. Just remember to filter it out of the assembled contigs before you submit to NCBI (I can point to at least one submission that didn't). How big is the assembly? An AMI is certainly an option for using SMRT Pipe with small data sets: https://github.com/PacificBioscience...MRT-Portal-AMI
                          Otherwise, try http://wgs-assembler.sourceforge.net...index.php/PBcR or https://github.com/PacificBioscience...lcon_manual.md to assemble the PacBio data.



                          • #28
                            Originally posted by rhall
                            It is highly unlikely that any of your real sequence is shared with the control, so I wouldn't worry about leaving it in. Just remember to filter it out of the assembled contigs before you submit to NCBI (I can point to at least one submission that didn't). How big is the assembly? An AMI is certainly an option for using SMRT Pipe with small data sets: https://github.com/PacificBioscience...MRT-Portal-AMI
                            Otherwise, try http://wgs-assembler.sourceforge.net...index.php/PBcR or https://github.com/PacificBioscience...lcon_manual.md to assemble the PacBio data.
                            The genome size is ca. 250 Mb (eukaryote), so I'm not sure SMRT Portal will support it. I have launched AMIs before, but I would rather work on our local computer cluster (just Linux). I'll give PBcR and FALCON a try.
                            Thanks



                            • #29
                              If your local cluster uses SGE (and your local admins are willing) then install SMRTportal locally. Certain steps of PacBio analysis are best done via portal (and you are going to get raw data). FYI: SMRTportal no longer supports hybrid assemblies so that part you will have to do outside of the portal as @rhall has already posted.



                              • #30
                                Originally posted by GenoMax
                                If your local cluster uses SGE (and your local admins are willing) then install SMRTportal locally. Certain steps of PacBio analysis are best done via portal (and you are going to get raw data). FYI: SMRTportal no longer supports hybrid assemblies so that part you will have to do outside of the portal as @rhall has already posted.
                                I thought it was going to be more complicated; thanks to the SMRT Analysis wiki and good tips from my system administrator, I was able to run SMRTportal locally. But is there a way to import filtered_subreads in FASTQ format for further analysis?
