  • Alirez
    Junior Member
    • Sep 2014
    • 6

    Acquiring reads not incorporated into a Trinity assembly?

    Hi all,

    I am assembling mRNA-Seq data (Illumina HiSeq 2500, paired-end) with Trinity. I can't find an option that outputs the reads which were not used to build the final .fasta assembly (the 'orphan' reads). Is there a way to do this?

    I have considered looking at reads that did not align back to the assembly, but I'm aware this may not be a true representation of the reads that were not used in the assembly.

    Many thanks
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    Mapping is the approach normally used for this purpose. Kmer-based assemblers operate in kmer-space and do not (usually) associate kmers with the reads they came from. So, the information about which read contributed to which contig is lost.

    BBMap has a special flag designed specifically for this purpose, "kfilter", which bans alignments that do not have at least k consecutive matching bases. For example, let's say you assemble some data using K=61. Then:

    bbmap.sh in=reads.fq ref=assembly.fa outm=mapped.sam outu=unmapped.fq kfilter=61 maxindel=100


    That will capture unmapped reads as unmapped.fq. And "kfilter=61" ensures that all of the mapped reads share a 61-mer with the contig they mapped to; in other words, that read WAS used to build that contig.

    In practice, I don't think kfilter usually makes a lot of difference except in very specific scenarios, but RNA-seq assembly may be one of them.
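To make the kfilter idea above concrete, here is a minimal Python sketch of the underlying question: does a read share at least one exact k-mer with a contig, on either strand? This is only an illustration of the concept, not how BBMap implements kfilter (BBMap applies the filter to alignments); the function names are made up for this example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (unknown bases become N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def kmers(seq, k):
    """Return the set of all k-mers in seq (empty if seq is shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_kmer(read, contig, k):
    """True if the read, or its reverse complement, shares an exact k-mer with the contig."""
    contig_kmers = kmers(contig, k)
    return any(
        not kmers(strand, k).isdisjoint(contig_kmers)
        for strand in (read, reverse_complement(read))
    )


# Toy example with a small k so the sequences stay readable.
contig = "ACGTTGCAAGGCTA"
print(shares_kmer("TTGCAAGG", contig, 8))  # exact substring of the contig
print(shares_kmer("TTGCTAGG", contig, 8))  # one mismatch breaks every 8-mer
```

A read that passes this kind of check at the assembly's k is at least consistent with having contributed k-mers to that contig, which is the property kfilter is meant to enforce on mapped reads.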

    • Alirez
      Junior Member
      • Sep 2014
      • 6

      #3
      Hi Brian,

      Thank you very much for your reply, I will look into doing this.

      One further complication is that I generated my final assembly through a multiple k-mer approach. I created assemblies with every odd value of k from 19 to 31, merged them all, and then removed redundant contigs. To find out which reads have not been used in this final assembly, is there an option in BBMap to use multiple kfilters simultaneously?

      If not, would you recommend running BBMap against each single-k assembly individually and then clustering all unmapped reads together? Any cluster containing unmapped reads from all 7 assemblies would then represent reads not incorporated into the final assembly.
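The per-assembly idea described above amounts to intersecting the sets of unmapped read names. A minimal Python sketch follows, assuming plain 4-line FASTQ records and read names that are identical across runs; the file names in the usage comment are hypothetical.

```python
def fastq_read_names(lines):
    """Collect read names from 4-line FASTQ records given as an iterable of lines."""
    names = set()
    for i, line in enumerate(lines):
        if i % 4 == 0:  # header line of each record: '@name optional description'
            tokens = line.strip()[1:].split()
            if tokens:
                names.add(tokens[0])
    return names


def unmapped_in_all(unmapped_sets):
    """Return the read names present in every set, i.e. unmapped against every assembly."""
    sets = list(unmapped_sets)
    return set.intersection(*sets) if sets else set()


# Usage sketch (hypothetical file names, one unmapped FASTQ per single-k assembly):
#
#   name_sets = []
#   for k in range(19, 32, 2):
#       with open(f"unmapped_k{k}.fq") as handle:
#           name_sets.append(fastq_read_names(handle))
#   never_mapped = unmapped_in_all(name_sets)
```

For paired-end data, the two mates may carry different suffixes or descriptions, so the names might need normalising before the intersection is meaningful.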

      Thanks again,
      Ali

      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        The point of multiple kmers is to compensate for locally low coverage. If a contig shares a 19-mer with a read, then that read was used in the 19-mer phase. If a read shares a 31-mer with a contig, then it shares a 19-mer with it as well. So, just set kfilter=19.

        19 is really short for assembly, though. I don't have any direct experience with Trinity, but normally I find that's below the bottom of the range of useful kmer sizes. In SPAdes, for example, which also supports multiple kmer lengths, I normally sweep from K=25 to K=127.
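The reasoning above can be checked directly: every 31-mer contains 31 - 19 + 1 = 13 windows of length 19, so any shared 31-mer implies shared 19-mers. A small Python sketch, using an arbitrary made-up sequence and ignoring reverse complements for brevity:

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(a, b, k):
    """Return the k-mers present in both sequences."""
    return kmers(a, k) & kmers(b, k)


# Arbitrary toy sequences: the read is an exact 35-base slice of the contig.
contig = "ACGTACGGTCAATGCCTAGGATCCGATTACAGGCATTCGAAGT"
read = contig[5:40]

shared_31 = shared_kmers(read, contig, 31)
shared_19 = shared_kmers(read, contig, 19)

# Every 19-mer inside a shared 31-mer is itself shared by the read and the contig.
implied = set()
for kmer in shared_31:
    implied |= kmers(kmer, 19)

print(implied <= shared_19)  # the subset relation holds for any pair of sequences
```

The converse does not hold: a read can share a 19-mer with a contig without sharing any 31-mer, which is why the smallest k is the one to use as the filter.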

        • westerman
          Rick Westerman
          • Jun 2008
          • 1104

          #5
          Trinity's default kmer size is 25. Sweeping in Trinity doesn't seem to make much difference. At least I do not recall anyone on the mailing list recommending a sweep.

          That said, I would just take your final contigs and map the reads to them using your minimum kfilter. That is a conservative approach: any read that maps to your final assembly is kept out of the unmapped output, so you can concentrate on reads that absolutely do not map. As Brian hinted, kfilter probably won't make much of a difference, so personally I would skip it.
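As a rough sketch of that suggestion, the helper below assembles a BBMap command line that maps reads to the final contigs, using the smallest k from the sweep as kfilter, or no kfilter at all. It uses only the flags and file names that appear earlier in this thread; the helper itself and its parameter names are hypothetical, and it only builds the argument list rather than running anything.

```python
def bbmap_unmapped_command(reads, assembly, k_values, use_kfilter=True):
    """Build a bbmap.sh argument list that writes unmapped reads to unmapped.fq."""
    cmd = [
        "bbmap.sh",
        f"in={reads}",
        f"ref={assembly}",
        "outm=mapped.sam",
        "outu=unmapped.fq",
        "maxindel=100",
    ]
    if use_kfilter:
        # A read sharing any longer kmer with a contig also shares the shortest one.
        cmd.append(f"kfilter={min(k_values)}")
    return cmd


# Odd k values from 19 to 31, as in the multiple-k assembly discussed above.
cmd = bbmap_unmapped_command("reads.fq", "assembly.fa", range(19, 32, 2))
print(" ".join(cmd))

# To actually run it (assumes bbmap.sh is on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing use_kfilter=False gives the plain mapping run, which makes it easy to compare how many reads end up unmapped with and without the filter.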

          • Alirez
            Junior Member
            • Sep 2014
            • 6

            #6
            Brian and Westerman,

            Thanks to both of you for your comments. I'll use BBMap as recommended.

            With regard to the multiple-k approach, I've been using it to optimise my assembly: my initial assemblies were quite fragmented, with an unrealistic number of loci. This is likely due in part to Trinity being unable to reliably distinguish allelic variation from isoform information (as indicated by a few example genes I looked at in my dataset). In tests on subsets of the data, I've consistently found that a multiple-k assembly is much better than any single-k assembly alone. Conducting differential expression only on loci that can be well annotated should then resolve any misassembly issues. My next step is scaffolding through translation mapping to my closest genome/exome (tblastx) or proteome. Here is the paper I am taking the approach from (http://www.ncbi.nlm.nih.gov/pmc/arti...pid397977title), in case you are interested or have any further comments.
