SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
short reads aligner for short reference database capricy Bioinformatics 5 11-04-2015 06:05 AM
How to align reads to other reads (not to reference genome) username111 Bioinformatics 2 08-18-2014 08:49 AM
align short query against 300M Illumina reads hohllp Bioinformatics 5 11-21-2011 01:40 PM
Looking for assembler to align very short reads andylai Bioinformatics 0 06-02-2011 09:58 PM
PubMed: PASS: a Program to Align Short Sequences. Newsbot! Bioinformatics 0 02-17-2009 06:00 AM

Reply
 
Thread Tools
Old 02-16-2017, 06:36 PM   #1
bloosnail
Member
 
Location: Pittsburgh

Join Date: Jul 2015
Posts: 17
Default Good way to align very, very short reads?

We have single-cell transcriptomic reads, where each read has a barcode associated with it which designates which cell it came from. From the default pipeline (a company called 10X Genomics), about 1/3 of reads' barcodes did not pass quality control and we are seeing if we can stretch this and save some of the data. So essentially, each read has a barcode of 14bp attached to it, and there is a whitelist of allowed barcodes (all also of 14bp). We are trying to use Bowtie2 to align the barcodes of reads that did not pass quality control (generally they all have one or more bases different from any whitelisted barcodes) back to the reference whitelist to see which barcodes in the whitelist are most similar to each read, and we may decide to keep more of the data if reads have whitelisted barcodes similar enough to their own.

Essentially, we are making mock .fastq and .fasta files from the reads' barcodes and whitelisted barcodes respectively, and running alignment with the whitelisted barcodes as the "reference genome". We are wondering if there is a better way/tool to do this, since I believe Bowtie2 was designed for longer length (50+ bp) reads (all barcodes are 14bp long). It looks like the first version of Bowtie is more geared toward shorter read length, but the manual still says it was intended for reads of length 25-50bp, and we are wondering if there is a tool designed for even shorter read lengths, or if someone has insights on whether Bowtie2/Bowtie is giving accurate results. I first tried writing a Perl script to do the check but it took a minute for only 30 reads (one sample has 30 million...). Thanks.

*Edit
So basically, we are doing normal alignment but with several differences:
-the read lengths and reference sequences are all very short (14bp)
-all data is exactly the same length, so there is no need for gapped alignment
-all reads are single-ended

Last edited by bloosnail; 02-16-2017 at 06:45 PM.
bloosnail is offline   Reply With Quote
Old 02-16-2017, 07:01 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

The best way to do this is typically with BBDuk or Seal, which operate on kmer-matching but allow a hamming distance. They're ideal for 14bp sequences, and are very fast. For example:

Quote:
bbduk.sh in=reads.fq ref=whitelist.fa outm=match.fq out=unmatched.fq k=14 mm=f
This will put all the reads in match or unmatched, requiring exact matches. If you want to allow mismatches, use "hdist"; for example, this will allow 3 mismatches:

Quote:
bbduk.sh in=reads.fq ref=whitelist.fa outm=match.fq out=unmatched.fq k=14 mm=f hdist=3
Alternatively, Seal will split the input file into one output file per reference sequence:

Quote:
seal.sh in=reads.fq ref=whitelist.fa pattern=match_%.fq out=unmatched.fq k=14 mm=f hdist=3
You can add the "ordered" flag if you want the reads to remain in input order. Note that it is also possible to do this via alignment (e.g. with bowtie1 or with BBMap), which would produce a sam file, but it may require tweaking default parameters. With BBMap for example it would be something like:

Quote:
bbmap.sh ref=whitelist.fa in=reads.fq out=mapped.sam k=7 vslow strictmaxindel=0
...but, I think Seal or BBDuk would be more convenient.

Last edited by Brian Bushnell; 02-16-2017 at 07:05 PM.
Brian Bushnell is offline   Reply With Quote
Old 02-17-2017, 12:34 PM   #3
bloosnail
Member
 
Location: Pittsburgh

Join Date: Jul 2015
Posts: 17
Default

Hi Brian,

Thank you for your response, this seems to be useful for our purposes. I am wondering if BBDuk has some functionality to output the actual sequence in the .fa that each read aligned to, not just whether a read has valid alignments or not. Our specific end goal is to find the subset of whitelisted reads that is most similar to each barcode.

Additionally, we are only trying to find the subset of whitelisted barcodes with the minimum Hamming distance for each barcode, not which barcodes have a reference within a certain Hamming distance away. So I think (if it is able to output the sequence of the reference .fa that a read is aligned it) we could run BBDuk iteratively over increasing Hamming distance from 1-14, then starting from Hamming distance of 1 up to 14, check if a read has passed yet or not, and exclude it from later iterations once it has passed.

So specifically, my main question is whether BBDuk can output the the actual reference sequence that a read aligned to, or if you have some recommendation on a tool that can do that along with the other criteria listed previously? Thanks.
bloosnail is offline   Reply With Quote
Old 02-17-2017, 05:59 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Actually, BBDuk will not go over a hamming distance of around 7 very easily. The exact limit depends on the kmer length, number of sequences, and amount of memory, but the time and memory increase exponentially with (14*3)^(hamming distance). But then, finding a 14-mer match with a hamming distance of 14 is not very interesting anyway Anyway, iteratively increasing hamming distance from 0 to some limit is probably the best way to go, unless there's a specific program that simply brute-force calculate all the hamming distances... I do have one like that for bar codes (where there's no computational cost for mismatches, unlike BBDuk or alignment algorithms) but it's not quite designed for this use. But if this is a common need for processing 10X data I'll plan to modify it to suit this purpose exactly (we don't use 10X data currently, though I am investigating it for the purpose of scaffolding).

Anyway - if you use Seal, you will get one file for each sequence in the reference fasta, containing all the reads matching that sequence, which sounds possibly like what you want. Alternatively, you could do this:

Code:
bbduk.sh in=reads.fq ref=whitelist.fa outm=matched0.fq out=unmatched0.fq k=14 mm=f hdist=0 rename fbm

bbduk.sh in=unmatched0.fq ref=whitelist.fa outm=matched1.fq out=unmatched1.fq k=14 mm=f hdist=1 rename fbm

bbduk.sh in=unmatched1.fq ref=whitelist.fa outm=matched2.fq out=unmatched2.fq k=14 mm=f hdist=2 rename fbm
...etc. That will rename the reads to indicate which sequence they matched (it will add it as a suffix so the original name is also retained). By default BBDuk will not let you go over hdist=3 unless you add the flag "-da". But, there are 2 ways to increase hamming distance in BBDuk, with "hdist" and "qhdist"; the first makes mutant copies of reference kmers, and the second makes mutant copies of query kmers. So to achieve hamming distance 6, it's most efficient to use "hdist=3 qhdist=3", since each is independently exponential.

Last edited by Brian Bushnell; 02-17-2017 at 06:01 PM.
Brian Bushnell is offline   Reply With Quote
Old 02-17-2017, 10:22 PM   #5
nucacidhunter
Jafar Jabbari
 
Location: Melbourne

Join Date: Jan 2013
Posts: 1,230
Default

Cell Ranger corrects one base mismatches if it confidently can be assigned to one of whitelist barcodes. 10x software might have a setting to relax match criteria.
nucacidhunter is offline   Reply With Quote
Old 02-19-2017, 01:09 PM   #6
bloosnail
Member
 
Location: Pittsburgh

Join Date: Jul 2015
Posts: 17
Default

@Brian, I am currently running BBDuk and it seems to be working well. Thank you for your suggestion. I am wondering if there is reasonable increase in time if more cores are used -- from the documentation, "This uses a fixed number of threads, currently 7, which cannot be changed (except by editing the “WAYS=7” line in BBDuk’s source code and recompiling)". Also, I think there is a slight typo in the manual in the installation section (http://jgi.doe.gov/data-and-tools/bb...llation-guide/), I think the command "$ tar -xvfz BBMap_(version)" should be "$ tar -xvzf BBMap_(version).tar.gz". Somewhat trivial, but could be confusing for someone without experience on the command line.

@nucacidhunter, I tried e-mailing 10X and they said they do not currently have built-in functionality to relax the settings, but there is a line in the source code where one can relax the cut-off for filtering. We also may try this, thank you for the suggestion.
bloosnail is offline   Reply With Quote
Old 02-21-2017, 04:17 PM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by bloosnail View Post
@Brian, I am currently running BBDuk and it seems to be working well. Thank you for your suggestion. I am wondering if there is reasonable increase in time if more cores are used -- from the documentation, "This uses a fixed number of threads, currently 7, which cannot be changed (except by editing the “WAYS=7” line in BBDuk’s source code and recompiling)".
To clarify, BBDuk scales fine to an arbitrary number of threads for processing the reads (using the "threads" flag, but by default it uses all available threads). It uses 7 threads only when processing the reference. In most cases processing the reference takes a trivial amount of time, but it can take a while with a high hamming distance.

Increasing WAYS will increase speed up to the number of cores you have (the number needs to be odd though). If WAYS is above your core count it will decrease speed instead. But it doesn't increase speed all that much (all of the arithmetic is replicated; only the memory random accesses are parallel) so 7 is usually fine. Once the reference is processed, WAYS has no effect on runtime for processing the reads. So it's only useful to adjust if the majority of your time is spent processing the reference.

Note that "hdist" increases the time needed for loading the reference, while "qhdist" increases the time needed to process the reads.

Quote:
Also, I think there is a slight typo in the manual in the installation section (http://jgi.doe.gov/data-and-tools/bb...llation-guide/), I think the command "$ tar -xvfz BBMap_(version)" should be "$ tar -xvzf BBMap_(version).tar.gz". Somewhat trivial, but could be confusing for someone without experience on the command line.
Thanks; I'll fix that!
Brian Bushnell is offline   Reply With Quote
Old 02-23-2017, 06:00 PM   #8
bloosnail
Member
 
Location: Pittsburgh

Join Date: Jul 2015
Posts: 17
Default

Hi Brian,

I am looking more closely at the output from BBDuk and I think there may be a problem. It appears that whenever a read has multiple matches, the name of the barcode that is aligned to is always the same. For example, here is a grep'd read with multiple matches on the output .fq and this is what is returned:

@NS500211:254:HGWNFBGX2:1:11101:11155:16312 barcode_number_536269=1
@NS500211:254:HGWNFBGX2:1:11101:11155:16312 barcode_number_536269=1
@NS500211:254:HGWNFBGX2:1:11101:11155:16312 barcode_number_536269=1
@NS500211:254:HGWNFBGX2:1:11101:11155:16312 barcode_number_536269=1

So what would be expected is numerous hits for the read ID, but ideally the barcode names would be different. It appears it is like this for every read with multiple matches. Would you know what the problem is? Here is the command I have been using (or some slight variation of it):

$ ../BBDuk/bbmap/bbduk.sh in=../no_CB_no_head.fq ref=
../bt2_index/barcodes.fa outm=matched1.fq out=unmatched1.fq k=14 mm=f -da hdist=1 rcomp=f rename fbm
bloosnail is offline   Reply With Quote
Old 02-23-2017, 10:39 PM   #9
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

BBDuk and Seal are very similar, but different in an important way. BBDuk associates each kmer with a single reference sequence - specifically, the first one it encounters in the reference fasta. Seal associates each kmer with every reference sequence containing that kmer. The syntax is very similar, and both support read renaming. For contaminant removal or adapter-trimming, BBDuk is ideal. But in a situation where it is important to distinguish between similar things - such as isoform expression quantification, contaminant detection, and barcode matching... it's important to use Seal.

Sorry, I realize now that I should have been more clear on this initially. If I was routinely using 10X data I'm sure I'd have a protocol in place, but this is new to me. With a low hdist setting and a high hamming distance between barcodes (which was my initial expectation) the results are identical, but once you increase hdist/qhdist sufficiently, Seal should be used instead of BBDuk, when it's important to know which reference sequence a query hit.

Note that Seal still does not distinguish between 2 mismatches for reference sequence A versus 3 mismatches for reference sequence B. If you set hdist=2, the query will get tagged as matching A only. If you set hdist=3, the query will get tagged as matching both A and B with no indication that it matches A better; that information is not stored. I designed it for longer sequences where ties could be broken by the number of shared kmers; but when the sequence length is equal to the kmer length, obviously, that won't work as a tie-breaker since it's either 1 or 0.

Anyway - I think Seal is a good tool for this purpose, but tools are always best when you know their limitations!

Last edited by Brian Bushnell; 02-23-2017 at 10:41 PM.
Brian Bushnell is offline   Reply With Quote
Old 02-24-2017, 11:54 AM   #10
bloosnail
Member
 
Location: Pittsburgh

Join Date: Jul 2015
Posts: 17
Default

Thank you for your thorough and quick response Brian. I have tried Seal out and it seems to be giving the correct output -- same number of reads but the multi-mapped reads have different IDs on them. Is it possible to still split up the hdist and qhdist to sum up to the desired Hamming distance to decrease exponential runtime? I tried running something like this:

../BBDuk/bbmap/seal.sh in
=unmatched3.fq ref=../bt2_index/barcodes.fa out=matched4.fq outu=unmatched4.fq k
=14 mm=f -da hdist=2 qhdist=2 rcomp=f rename fbm overwrite

for matches on Hamming distance of 4 and it gives a lot of NullPointerExceptions and crashes.
bloosnail is offline   Reply With Quote
Old 02-24-2017, 02:38 PM   #11
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by bloosnail View Post
Thank you for your thorough and quick response Brian. I have tried Seal out and it seems to be giving the correct output -- same number of reads but the multi-mapped reads have different IDs on them. Is it possible to still split up the hdist and qhdist to sum up to the desired Hamming distance to decrease exponential runtime? I tried running something like this:

../BBDuk/bbmap/seal.sh in
=unmatched3.fq ref=../bt2_index/barcodes.fa out=matched4.fq outu=unmatched4.fq k
=14 mm=f -da hdist=2 qhdist=2 rcomp=f rename fbm overwrite

for matches on Hamming distance of 4 and it gives a lot of NullPointerExceptions and crashes.
It should not crash, so please send me all of the screen output (it all gets written to stderr). Of course it would be great if you could send me the input data as well, if it's not confidential.
Brian Bushnell is offline   Reply With Quote
Old 02-24-2017, 04:11 PM   #12
bloosnail
Member
 
Location: Pittsburgh

Join Date: Jul 2015
Posts: 17
Default

Here is a link to the screen output from the command that I posted earlier: https://drive.google.com/open?id=0B_...0ZCQ3dQei1aUFU

Also, the number of input reads at the bottom should be 4,568,965 instead of 51. I'm not sure my supervisor would want me to post the data and it's also pretty big, but if the problem is not clear I can PM you a shortened version of it. Thanks for taking a look!
bloosnail is offline   Reply With Quote
Old 02-25-2017, 07:48 AM   #13
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I found and fixed the problem, and will upload the fixed version on Monday.
Brian Bushnell is offline   Reply With Quote
Old 02-28-2017, 05:53 PM   #14
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

This issue is now resolved as of BBMap v37.00!
Brian Bushnell is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:35 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO