  • disentangling target genome and endosymbiont at read level

    Hi!

    Recently I got my hands on data where a single lane of GAIIx was sequenced from genomic DNA: >6GB output, all high quality. However, a GC plot of the reads shows two distinct peaks (the larger at 37% -> target genome, the smaller at around 50%). Seeing this and knowing the source of the DNA, the second peak seems to come from an endosymbiont (or bacterial contamination). When I assemble with velvet (already tested cc=50 and large kmers) or Ray, I get a genome of around 2MB (far too small) with bad cegma results and none of the stuff that should be in there, although there are blast hits for the right organism. The question is: how do I separate the endosymbiont from the target, possibly at the read level?

    Any help highly appreciated.

  • #2
    I'd try splitting the reads by GC, and assembling the high GC and low GC pools separately.
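
A minimal sketch of what such a GC split could look like in plain Python. The function names and the 0.45 cutoff are assumptions for illustration only (the cutoff is picked to sit between the 37% and ~50% peaks mentioned above); reads are assumed to be already parsed into (name, sequence) tuples.

```python
def gc_fraction(seq):
    """Return the G+C fraction of a read, ignoring ambiguous bases such as N."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def split_by_gc(reads, cutoff=0.45):
    """Split (name, sequence) tuples into a low-GC and a high-GC pool.

    The cutoff is a hypothetical value; choose it from your own GC plot.
    """
    low, high = [], []
    for name, seq in reads:
        pool = high if gc_fraction(seq) >= cutoff else low
        pool.append((name, seq))
    return low, high
```

Each pool could then be written out and assembled on its own. Reads near the cutoff will inevitably end up in the wrong pool, so this is a rough first pass rather than a clean separation.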



    • #3
      my first thought, but

      Hi!

      That was my first thought as well, but then I lose all the high-GC reads from the target genome, i.e. they will be included in the other pool. GC content follows a curve, so the target genome must have regions with ~50% GC as well, whereas the possible endosymbiont must have reads of lower GC too. I could take all the reads from a certain level of GC upwards and hope to assemble a bacterium out of those, then remove all the reads that went into the bacterial genome from the complete set of reads and hope to end up with a more or less pure target genome, but is this sensible and feasible?!?
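
The subtraction step described here (drop every read that went into the bacterial assembly from the full read set) could be sketched roughly as follows. The function names are made up for illustration, and the code assumes you already have the IDs of the reads used by the bacterial assembly, e.g. from mapping reads back to its contigs. It also assumes mates are named with a trailing /1 or /2, so that both mates of a pair are removed together.

```python
def base_name(read_id):
    """Strip a trailing /1 or /2 mate suffix so both mates share one key."""
    if read_id.endswith(("/1", "/2")):
        return read_id[:-2]
    return read_id


def subtract_reads(all_reads, contaminant_ids):
    """Return the (name, sequence) tuples whose pair is not in contaminant_ids.

    A pair is dropped as soon as either mate is listed as contaminant.
    """
    drop = {base_name(read_id) for read_id in contaminant_ids}
    return [(name, seq) for name, seq in all_reads
            if base_name(name) not in drop]
```

Whether the remaining reads are "more or less pure" still depends on how complete the bacterial assembly was, so the result would need checking, for instance with a fresh GC plot.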



      • #4
        I am also having the same problem.

        What programs should I be using to separate reads based on GC content? A search of these forums only revealed replies such as "there are many programs that do this" but with no examples.

        Any help would be appreciated, cheers!



        • #5
          What is your favourite scripting language? e.g. BioPerl/Biopython/etc would all make it easy to write a quick script to filter FASTQ on GC content. Also, do you have paired end data - and if so presumably you might want to filter at the pair level? That makes things a little more complicated...
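
As a rough idea of what a quick pair-level filter might look like using only the Python standard library (BioPerl or Biopython would work just as well): the sketch below assumes two FASTQ files with four lines per record and mates in the same order, and it pools each pair by the combined GC of both mates. The function names and the 0.45 cutoff are hypothetical.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        header, seq, _plus, qual = (line.rstrip("\r\n") for line in record)
        yield header, seq, qual


def pair_gc(seq1, seq2):
    """Return the G+C fraction over both mates of a pair combined."""
    both = (seq1 + seq2).upper()
    return (both.count("G") + both.count("C")) / len(both) if both else 0.0


def split_pairs_by_gc(handle1, handle2, cutoff=0.45):
    """Assign whole pairs to a low-GC or high-GC pool so mates stay together.

    Assumes both files list mates in the same order; zip() stops silently
    at the shorter file, so check the record counts beforehand.
    """
    low, high = [], []
    for mate1, mate2 in zip(read_fastq(handle1), read_fastq(handle2)):
        pool = high if pair_gc(mate1[1], mate2[1]) >= cutoff else low
        pool.append((mate1, mate2))
    return low, high
```

Deciding on the combined GC of the pair is only one possible rule; requiring both mates to pass the cutoff would be stricter and would leave a third pool of discordant pairs to deal with.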



          • #6
            The other way to split is to align all the reads to your target organism stringently, and then run velvet on only the unmapped reads. Then, either figure out what your mystery contaminant is from the velvet contigs, or include the velvet contigs in your reference alongside your desired organism, so that the reads will align to those instead of being forced somewhere into your target organism's genome.
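
If the aligner writes SAM output, picking out the unmapped reads comes down to checking bit 0x4 of the FLAG column. A minimal sketch, assuming plain-text SAM lines; the function name is made up for illustration, and this only works if a reference for the target organism exists in the first place.

```python
def unmapped_read_names(sam_lines):
    """Return names of reads whose SAM FLAG has the 'unmapped' bit (0x4) set."""
    names = []
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 1 is the read name (QNAME), column 2 is the bitwise FLAG.
        if int(fields[1]) & 0x4:
            names.append(fields[0])
    return names
```

The resulting names could then be used to pull the matching records out of the original FASTQ files for the velvet run.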



            • #7
              Thanks maubp and swbarnes2. Indeed, I did end up editing the DynamicTrim perl script to include an option for GC trimming; this deals with paired-end data fine.

              swbarnes2, thanks for this tip. However, I'm unable to do this as I am assembling de novo. Makes things a bit tougher.



              • #8
                You could try running an assembly pipeline designed explicitly to deal with mixes of organisms. metAMOS seems to be one such option:



                It uses metagenome taxonomy analysis to figure out which organism each scaffold group comes from and creates a separate assembly fasta file for each organism. Looks like it's under very active development at the moment.



                • #9
                  There are a number of tools out there that attempt to cluster or classify reads or contigs by sequence-intrinsic properties (i.e. k-mers, protein domains). Check out TETRA, WebCarma, TACOA or PhyloPythia.
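
To give a feel for what a sequence-intrinsic k-mer signal is, here is a small sketch that turns a contig into a normalised tetranucleotide frequency vector and compares two such vectors. This is not how TETRA, TACOA or PhyloPythia are implemented (they use more elaborate statistics and models); the function names and the Euclidean distance are assumptions chosen purely for illustration.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Return normalised k-mer frequencies over all ACGT k-mers, in sorted order.

    k-mers containing ambiguous bases (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


def profile_distance(profile_a, profile_b):
    """Euclidean distance between two profiles of equal length."""
    return sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)) ** 0.5
```

Profiles like these are only informative on sequences much longer than a single short read, which is why such methods are usually applied to contigs.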



                  • #10
                    The authors of PhyloPythia have an interesting comparison of the nucleotide composition-based methods to a sequence identity/homology-based method (MEGAN) in the 50 pages of supplemental material for this 1.5 page paper:


                    I didn't notice whether they ran a nucleotide or amino acid blast search for MEGAN, but either way, it seems that using homology information gives pretty darn good results compared to the composition methods (among which PhyloPythiaS seems to be superior).
