
  • Run Tadpole (BBtools) only on sub-sets of reads from input files

    Hey there,

    I have been using tadpole for error correction (and BBtools in general) and I am extremely happy with its results and performance. Much appreciated!

    I am looking for advice on something that seems relatively simple, but that I can't solve in a simple manner: is it possible to run the tadpole.sh family of commands on sub-sets of a given input file?

    Contrary to the problem one has when assembling genomes (millions of small partitions of a single large object), the fields of single-cell genomics and single-molecule sequencing present a different paradigm: one has thousands of smaller objects (cells or molecules) scattered around the reads, usually determined by the inclusion of unique molecular identifiers (UMI).

    It would be fantastic to be able to run the tadpole.sh suite of algorithms on this different type of problem. Two applications for which I have successfully used tadpole with this mindset are 1) the assembly of randomly fragmented mRNA libraries using UMIs as a handle, basically making virtual long reads out of Illumina experiments, and 2) the correction of clouds of reads from amplicons (again, held together by a shared UMI) in order to detect SNPs and/or indels.

    The challenge I face now is scalability. Splitting the data into thousands of small read sets, each written to its own fastq file and passed to tadpole, works in principle, but in practice it is a big challenge for even mid-sized data sets. The IO overhead of this approach results in week-long runtimes, whereas running tadpole on the whole fastq takes less than a minute (using 100 cores) ...

    From a naive point of view, it feels as if an option to sub-set input files (by regex or by a list of items, to name a couple) would solve this scalability problem, since iterating through a single file thousands of times sounds more efficient (and logistically simpler) than generating thousands of little mini jobs. I wonder if you have any suggestions on how to tackle this situation. Another idea is to cat fastq | grep UMI | tadpole ... but I am not sure if one can pipe input to tadpole.sh (my guess, based on the documentation, is no).
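
A plain grep on the UMI would return only the matching lines rather than whole four-line reads, so the pipe idea needs a record-aware filter. The following is a minimal sketch, not a tested BBTools recipe: it assumes the UMI is embedded in the read header and that the FASTQ is unwrapped (exactly four lines per record). The function name `subset_by_umi`, the UMI, and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Print complete 4-line FASTQ records whose header line contains a given UMI.
# Assumptions (not from the thread): the UMI is part of the read header and
# every record occupies exactly four lines.
# Usage: subset_by_umi UMI [FASTQ]   (reads stdin when FASTQ is omitted)
subset_by_umi() {
    local umi="$1" fastq="${2:--}"
    awk -v umi="$umi" '
        NR % 4 == 1 { keep = (index($0, umi) > 0) }  # the header decides for the whole record
        keep                                         # print every line of a kept record
    ' "$fastq"
}
```

The filtered stream could then be piped onward, for example `subset_by_umi ACGTACGT reads.fq | tadpole.sh in=stdin.fq out=contigs.fa`. Later posts in this thread report that stdin input works for contig building but yields empty output with mode=correct, so this pipe may only help in the assembly case.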

    Thanks again for the great work here!

  • #2
    You should be able to pipe fastq data into all BBtools programs. Use in=stdin.fq when you are doing that.



    • #3
      Great, this is exactly what I needed. I will test and update, since the bottleneck will now be to efficiently sub-set a large collection of reads. Some tabix should help. Let's see ...

      Thanks again and sorry if there were duplicated posts. I am not sure how the forum works. All previous attempts to post (and even this thread) never returned any sort of notification of their status.

      I will post the final solution for the record.



      • #4
        I think I might be hitting some sort of bug.

        When I pass a fastq file by its path as an argument, the whole thing runs fine. But when I `cat` the same file and pipe it to `tadpole.sh` using in=stdin.fq (or stdin.fastq), things seem to run fine up to a point, and then the output is empty.

        Here is an example of a working case

        Here is an example of the non-working case

        Perhaps I need to go about this in a different way that I am missing?

        This problem is present in a couple of setups I have access to:
        - For a guix supported version I am using BBMap version 38.90 (examples are from this one)
        - From a conda instance BBMap version 37.62

        Thanks for the help.



        • #5
          I may have bad news. All bbmap tools are supposed to be able to accept input from STDIN, but it appears that "tadpole.sh" may be an exception. This is something Brian (the author of BBMap) may know the answer to. Brian no longer participates in forums, so you could try emailing him directly and see if he responds.

          Something like
          Code:
          zcat file.fq.gz | reformat.sh -Xmx4g in=stdin.fq out=stdout.fa
          does work.



          • #6
            Oops, let's see what I can do. Thanks for the information, though. Any hint on how I can find his email address? So far none of the obvious places worked and I haven't heard back from him on Twitter.

            My hope is that there is an easy fix, since it does not fail outright: the reads are actually loaded and processed, but somewhere downstream in the analysis they are lost.

            Also, I just found out that when actually building contigs, inputs from stdin work perfectly fine! The problem only manifests when the option mode=correct is set.

            I will try to dig into the java code but it's really far from my comfort zone.

            Thanks again.



            • #7
              Brian's email address is in the inline help for bbmap programs. Just run `bbmap.sh` and look through the help.
              "The problem only manifests when the option mode=correct is set."
              It may be by design then. Error correction requires keeping a large amount of sequence data in memory. You could try assigning a large amount of RAM with the -Xmx option and see if that works.

              Another option is named pipes (FIFOs), but that depends on how much effort you are willing to invest.
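
The named-pipe idea could be sketched as below. This is only an illustration under stated assumptions, not a confirmed fix: the helper `run_via_fifo` and the file names are hypothetical, and nothing in this thread confirms that tadpole.sh accepts a FIFO with mode=correct. The FIFO is given a .fq suffix on the assumption that the consumer infers the format from the file extension.

```shell
#!/usr/bin/env bash
# Feed a stream to a command that expects a file path, via a named pipe (FIFO).
# Usage: run_via_fifo PRODUCER CONSUMER
#   PRODUCER: command string whose stdout is the FASTQ stream
#   CONSUMER: command string that receives the FIFO path as "$1"
run_via_fifo() {
    local producer="$1" consumer="$2"
    local dir fifo pid status=0
    dir=$(mktemp -d)
    fifo="$dir/reads.fq"                  # hypothetical name; .fq suffix is an assumption
    mkfifo "$fifo"
    bash -c "$producer" > "$fifo" &       # writer blocks until a reader opens the FIFO
    pid=$!
    bash -c "$consumer" _ "$fifo" || status=$?
    kill "$pid" 2>/dev/null || true       # stop the writer if the reader never opened the FIFO
    wait "$pid" 2>/dev/null || true
    rm -rf "$dir"
    return "$status"
}
```

A possible call would be `run_via_fifo 'zcat file.fq.gz' 'tadpole.sh in="$1" out=corrected.fq mode=correct'`. A FIFO can only be read once, so if the consumer needs more than one pass over its input, this approach would likely hit the same limitation as stdin.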
              Last edited by GenoMax; 09-18-2021, 06:17 AM.
