SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Two sets of .fast5 files from Nanopore igwill Oxford Nanopore 1 04-07-2019 04:07 PM
Tophat2: prepare unmapped.bam file for input into a tophat run on alternative genome danielsbrewer RNA Sequencing 10 05-20-2016 01:57 AM
makeblastdb is making multiple sets of files, how can I fix it? rbruenn Bioinformatics 1 10-15-2015 12:14 PM
ERROR MESSAGE: Input files reads and reference have incompatible contigs aituka Bioinformatics 0 06-06-2012 05:21 AM

Reply
 
Thread Tools
Old 09-16-2021, 09:34 AM   #1
Pedro Olivares
Junior Member
 
Location: Germany

Join Date: Sep 2021
Posts: 5
Question Run Tadpole (BBtools) only on sub-sets of reads from input files

Hey there,

I have been using tadpole for error correction (and BBtools in general) and I am extremely happy with its results and performance. Much appreciated!

I am looking for support/advice on what seems a relatively simple thing to do, but I can't seem to solve in a simple manner: is it possible to run the tadpole.sh family of commands on sub-sets of a given input file?

Contrary to the problem one has when assembling genomes (millions of small partitions of a single large object), the fields of single-cell genomics and single-molecule sequencing present a different paradigm: one has thousands of smaller objects (cells or molecules) scattered around the reads, usually determined by the inclusion of unique molecular identifiers (UMI).

It would be fantastic to be able to run the tadpole.sh suite of algorithms in this different type of problem. Two applications for which I have successfully used tadpole with this set of mind are 1) on the assembly randomly fragmented mRNA libraries using UMIs as a handle; basically making virtual long reads out of Illumina experiments and 2) the correction of clouds of reads from amplicons (again - held together by a shared UMI) in order to detect SNPs or/and indels.

The challenge I face now is that of scalability. Even though the generation of thousand of small sets of reads into individual fastq for passing to tadpole works well in principle, in practice is a big challenge for even mid-sizes data sets. The IO overhead for this use paradigm results into week-long runtimes. Where as running tadpole on the whole fastq takes less than a minute (using 100 cores) ...

From a naive point of view, it feels as if having the option to sub-set input files (regex or list of items to name a couple) would be a solution to this problem of scalability since iterating through a single file thousands of times sound more efficient (at logistically simple) than generating thousands of little mini jobs. I wonder if you have any suggestion on how to tackle this situation. Another idea is to cat fastq | grep UMI | tadpole ... but I am not sure if one can pass pipes to tadpole.sh (and my guess is a no based on the documentation)

Thanks again for the great work here!
Pedro Olivares is offline   Reply With Quote
Old 09-16-2021, 01:40 PM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,127
Default

You should be able to pass fastq to all BBtools. Use in=stdin.fq when you are doing that.
GenoMax is offline   Reply With Quote
Old 09-17-2021, 01:28 AM   #3
Pedro Olivares
Junior Member
 
Location: Germany

Join Date: Sep 2021
Posts: 5
Default

Great this is exactly what I needed. I will test and update since now the bottle neck will be to efficiently sub-set a large collection of reads. Some tabix should help. Let's see ...

Thanks again and sorry if there were duplicated posts. I am not sure how the forum works. All previous attempts to post (and even this thread) never returned any sort of notification of their status.

I will update a final solution for the record.
Pedro Olivares is offline   Reply With Quote
Old 09-17-2021, 04:08 AM   #4
Pedro Olivares
Junior Member
 
Location: Germany

Join Date: Sep 2021
Posts: 5
Default

I think I might be hitting into some sort of bug.

When I input a fastq using its path as an argument, the whole thing runs fine but when `cat` this same file and pipe it to `tadpole.sh` using the in=stdin.fq (or stdin.fastq) things seem to run fine up to one point but then the output is empty.

Here an example of a working case

Here an example of the non-working case

Perhaps I need to go about this in a different way that I am missing?

This problem is present in a couple of setups I have access to:
- For a guix supported version I am using BBMap version 38.90 (examples are from this one)
- From a conda instance BBMap version 37.62

Thanks for the help.
Pedro Olivares is offline   Reply With Quote
Old 09-17-2021, 11:58 AM   #5
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,127
Default

I may have bad news. All bbmap tools are supposed to be able to accept input from STDIN but it appears that "tadpole.sh" may be an exception. This is something Brian (author of BBMap may know the answer to). Brian no longer participates in forums so you could try emailing him directly and see if he responds.

Something like
Code:
zcat file.fq.gz | reformat.sh -Xmx4g in=stdin.fq out=stdout.fa
does work.
GenoMax is offline   Reply With Quote
Old 09-17-2021, 11:28 PM   #6
Pedro Olivares
Junior Member
 
Location: Germany

Join Date: Sep 2021
Posts: 5
Default

Oops, let's see what I can do. Thanks for the information, though. Any hint on how can I find his email address? So far none of the obvious places worked and I haven't heard back from him on Twitter.

My hope is that there should be an easy fix since it doesn't look like it doesn't work at all, the reads are actually loaded and processed but somehow downstream the analysis the are missed.

Also, I just found out that when actually building contigs, inputs from stdin work perfectly fine! The problem only manifests when the option mode=correct is set.

I will try to dig into the java code but it's really far from my comfort zone.

Thanks again.
Pedro Olivares is offline   Reply With Quote
Old 09-18-2021, 07:00 AM   #7
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,127
Default

Brian's email address is in the inline help for bbmap programs. Just run `bbmap.sh` and look through the help.
Quote:
The problem only manifests when the option mode=correct is set.
It may be by design then. Error correction requires keeping large amount of sequence in memory. You could try assigning a large amount of RAM for -Xmx option and see if that works.

Other option is named pipes/FIFO etc but depends on how much effort you are willing to invest.

Last edited by GenoMax; 09-18-2021 at 07:17 AM.
GenoMax is offline   Reply With Quote
Reply

Tags
bbtools, error correction, kmer, tadpole

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:11 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO