  • Acquiring reads not incorporated into a Trinity assembly?

    Hi all,

    I am assembling mRNA-Seq data (Illumina HiSeq 2500, paired-end) with Trinity. I can't find an option that outputs the reads which were not used to create the final .fasta assembly (the 'orphan' reads). Is there a way to do this?

    I have considered looking at reads that did not align back to the assembly, but I'm aware this may not be a true representation of the reads that were not used in the assembly.

    Many thanks

  • #2
    Mapping is the approach normally used for this purpose. Kmer-based assemblers operate in kmer-space and do not (usually) associate kmers with the reads they came from. So, the information about which reads contributed to which contig is lost.

    BBMap has a special flag designed specifically for this purpose, "kfilter", which bans alignments that do not have at least k consecutive matching bases. For example, let's say you assemble some data using K=61. Then:

    bbmap.sh in=reads.fq ref=assembly.fa outm=mapped.sam outu=unmapped.fq kfilter=61 maxindel=100


    That will capture unmapped reads as unmapped.fq. And "kfilter=61" ensures that all of the mapped reads share a 61-mer with the contig they mapped to; in other words, that read WAS used to build that contig.

    In practice, I don't think kfilter usually makes a lot of difference except in very specific scenarios, but RNA-seq assembly may be one of them.
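To make the idea behind kfilter concrete, here is a minimal Python sketch of the underlying question: does a read share at least one exact k-mer with a contig, on either strand? It is an illustration of the concept only, not BBMap's implementation. The function names and the toy sequences are hypothetical.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = str.maketrans("ACGTN", "TGCAN")
    return seq.upper().translate(complement)[::-1]


def kmers(seq, k):
    """Set of all k-mers in seq (empty if seq is shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_kmer(read, contig, k):
    """True if the read, on either strand, shares an exact k-mer with the contig."""
    contig_kmers = kmers(contig, k)
    read_kmers = kmers(read, k) | kmers(revcomp(read), k)
    return not read_kmers.isdisjoint(contig_kmers)


if __name__ == "__main__":
    # Toy sequences, for illustration only.
    contig = "ACGTTGCAAGGCTTACCGATGGA"
    exact_read = contig[3:15]        # identical to part of the contig
    mismatch_read = "TTGCATGGCTTA"   # same region with one central mismatch
    for k in (5, 8):
        print(k, shares_kmer(exact_read, contig, k),
              shares_kmer(mismatch_read, contig, k))
```

A read with a mismatch in the middle can still align well, yet share no long exact k-mer with the contig. That is the situation in which a k-mer filter would change which reads count as mapped.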



    • #3
      In reply to Brian Bushnell (#2):

      Hi Brian,

      Thank you very much for your reply, I will look into doing this.

      One further complication: I generated my final assembly through a multiple k-mer approach. I created an assembly for every odd value of k from 19 to 31, merged them all, and then removed redundant contigs. To find out which reads were not used in this final assembly, is there an option in BBMap to apply multiple kfilters simultaneously?

      If not, would you recommend running BBMap on each single-k assembly individually and then clustering all the unmapped reads together? Any cluster containing unmapped reads from all 7 assemblies would then correspond to reads not incorporated into the final assembly.
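The per-assembly idea above amounts to a set intersection: a read is a candidate "orphan" only if it is unmapped against every single-k assembly. A minimal Python sketch follows, assuming each BBMap run wrote its unmapped reads to a separate, uncompressed FASTQ file with standard 4-line records. The function names and file names are hypothetical.

```python
def fastq_read_ids(path):
    """Collect read IDs (first token of each header) from a 4-line-record FASTQ file."""
    ids = set()
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            # Every fourth line, starting with the first, is a header line ("@id ...").
            if line_number % 4 == 0 and line.strip():
                ids.add(line[1:].split()[0])
    return ids


def unmapped_in_all(paths):
    """Read IDs that appear in every one of the given unmapped-read files."""
    id_sets = [fastq_read_ids(path) for path in paths]
    return set.intersection(*id_sets) if id_sets else set()


if __name__ == "__main__":
    # Hypothetical file names, one per single-k assembly (k = 19, 21, ..., 31).
    files = [f"unmapped_k{k}.fq" for k in range(19, 32, 2)]
    print(len(unmapped_in_all(files)), "reads unmapped against all assemblies")
```

For paired-end data the mate suffix, if present in the ID, is kept as part of the identifier, so the two mates are tracked separately in this sketch.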

      Thanks again,
      Ali



      • #4
        The point of multiple kmers is to compensate for locally low coverage. If a contig shares a 19-mer with a read, then that read was used in the 19-mer phase. If a read shares a 31-mer with a contig, then it shares a 19-mer as well, since every 31-mer contains 19-mers. So, just set kfilter=19.

        19 is really short for assembly, though. I don't have any direct experience with Trinity, but normally I find that's below the bottom of the range of useful kmer sizes. In SPAdes, for example, which also supports multiple kmer lengths, I normally sweep from K=25 to K=127.
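The argument for using the smallest k can be checked with a small Python sketch: any shared 31-mer contains shared 19-mers, but the reverse does not hold. This is a toy illustration with a made-up contig and hypothetical function names. It checks the forward strand only and says nothing about how BBMap evaluates kfilter internally.

```python
def kmers(seq, k):
    """Set of all k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_kmer(read, contig, k):
    """True if read and contig share at least one exact k-mer (forward strand only)."""
    return not kmers(read, k).isdisjoint(kmers(contig, k))


if __name__ == "__main__":
    # Made-up 41-base contig, for illustration only.
    contig = "ATGGCGTACCTGAAGTCCGATTACGGCATCAAGGTTCGACT"

    # A read taken verbatim from the contig shares a 31-mer, and therefore a 19-mer.
    long_match = contig[5:36]

    # Same as the contig except for one changed base in the middle:
    # the longest exact stretches are 20 bases, so only the smaller k passes.
    broken_match = contig[:20] + "A" + contig[21:]

    for name, read in (("long_match", long_match), ("broken_match", broken_match)):
        print(name, shares_kmer(read, contig, 31), shares_kmer(read, contig, 19))
```

In terms of the command given earlier in the thread, this corresponds to replacing kfilter=61 with kfilter=19 when mapping against the merged assembly.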



        • #5
          Trinity's normal kmer is 25. Sweeping in Trinity doesn't seem to make much difference. At least I do not recall anyone on the mailing list recommending a sweep.

          That said, I would just take your final contigs and map the reads to them using your minimum kfilter. That conservatively keeps any read that maps to your final assembly out of the unmapped output, allowing you to concentrate on reads that absolutely do not map. As Brian hinted, kfilter probably won't make much of a difference, so personally I would skip it.



          • #6
            Brian and Westermen,

            Thanks to you both for your comments. I'll use BBMap as recommended.

            With regard to the multiple-k approach, I've been using it as a way to optimise my assembly. My initial assemblies were quite fragmented, with an unrealistic number of loci. This is likely due in part to Trinity being unable to reliably distinguish allelic variation from isoform information (as indicated by a few gene examples I looked at in my dataset). On tests with subsets of the data, I've certainly found that a multiple-k assembly gives a much better assembly than any single k-mer alone. Conducting differential expression only on loci that can be well annotated should then resolve any misassembly issues. My next step is scaffolding through translation mapping to my closest genome/exome (tblastx) or proteome. Here is the paper I am taking the approach from (http://www.ncbi.nlm.nih.gov/pmc/arti...pid397977title), in case you are interested or have any further comments.
