Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • bloosnail
    Member
    • Jul 2015
    • 17

    Good way to align very, very short reads?

    We have single-cell transcriptomic reads, where each read has a barcode associated with it which designates which cell it came from. From the default pipeline (a company called 10X Genomics), about 1/3 of reads' barcodes did not pass quality control and we are seeing if we can stretch this and save some of the data. So essentially, each read has a barcode of 14bp attached to it, and there is a whitelist of allowed barcodes (all also of 14bp). We are trying to use Bowtie2 to align the barcodes of reads that did not pass quality control (generally they all have one or more bases different from any whitelisted barcodes) back to the reference whitelist to see which barcodes in the whitelist are most similar to each read, and we may decide to keep more of the data if reads have whitelisted barcodes similar enough to their own.

    Essentially, we are making mock .fastq and .fasta files from the reads' barcodes and whitelisted barcodes respectively, and running alignment with the whitelisted barcodes as the "reference genome". We are wondering if there is a better way/tool to do this, since I believe Bowtie2 was designed for longer length (50+ bp) reads (all barcodes are 14bp long). It looks like the first version of Bowtie is more geared toward shorter read length, but the manual still says it was intended for reads of length 25-50bp, and we are wondering if there is a tool designed for even shorter read lengths, or if someone has insights on whether Bowtie2/Bowtie is giving accurate results. I first tried writing a Perl script to do the check but it took a minute for only 30 reads (one sample has 30 million...). Thanks.

    *Edit
    So basically, we are doing normal alignment but with several differences:
    -the read lengths and reference sequences are all very short (14bp)
    -all data is exactly the same length, so there is no need for gapped alignment
    -all reads are single-ended
    Last edited by bloosnail; 02-16-2017, 06:45 PM.
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    The best way to do this is typically with BBDuk or Seal, which operate on kmer-matching but allow a hamming distance. They're ideal for 14bp sequences, and are very fast. For example:

    bbduk.sh in=reads.fq ref=whitelist.fa outm=match.fq out=unmatched.fq k=14 mm=f
    This will put all the reads in match or unmatched, requiring exact matches. If you want to allow mismatches, use "hdist"; for example, this will allow 3 mismatches:

    bbduk.sh in=reads.fq ref=whitelist.fa outm=match.fq out=unmatched.fq k=14 mm=f hdist=3
    Alternatively, Seal will split the input file into one output file per reference sequence:

    seal.sh in=reads.fq ref=whitelist.fa pattern=match_%.fq out=unmatched.fq k=14 mm=f hdist=3
    You can add the "ordered" flag if you want the reads to remain in input order. Note that it is also possible to do this via alignment (e.g. with bowtie1 or with BBMap), which would produce a sam file, but it may require tweaking default parameters. With BBMap for example it would be something like:

    bbmap.sh ref=whitelist.fa in=reads.fq out=mapped.sam k=7 vslow strictmaxindel=0
    ...but, I think Seal or BBDuk would be more convenient.
    Last edited by Brian Bushnell; 02-16-2017, 07:05 PM.

    Comment

    • bloosnail
      Member
      • Jul 2015
      • 17

      #3
      Hi Brian,

      Thank you for your response, this seems to be useful for our purposes. I am wondering if BBDuk has some functionality to output the actual sequence in the .fa that each read aligned to, not just whether a read has valid alignments or not. Our specific end goal is to find the subset of whitelisted reads that is most similar to each barcode.

      Additionally, we are only trying to find the subset of whitelisted barcodes with the minimum Hamming distance for each barcode, not which barcodes have a reference within a certain Hamming distance away. So I think (if it is able to output the sequence of the reference .fa that a read is aligned it) we could run BBDuk iteratively over increasing Hamming distance from 1-14, then starting from Hamming distance of 1 up to 14, check if a read has passed yet or not, and exclude it from later iterations once it has passed.

      So specifically, my main question is whether BBDuk can output the the actual reference sequence that a read aligned to, or if you have some recommendation on a tool that can do that along with the other criteria listed previously? Thanks.

      Comment

      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        Actually, BBDuk will not go over a hamming distance of around 7 very easily. The exact limit depends on the kmer length, number of sequences, and amount of memory, but the time and memory increase exponentially with (14*3)^(hamming distance). But then, finding a 14-mer match with a hamming distance of 14 is not very interesting anyway Anyway, iteratively increasing hamming distance from 0 to some limit is probably the best way to go, unless there's a specific program that simply brute-force calculate all the hamming distances... I do have one like that for bar codes (where there's no computational cost for mismatches, unlike BBDuk or alignment algorithms) but it's not quite designed for this use. But if this is a common need for processing 10X data I'll plan to modify it to suit this purpose exactly (we don't use 10X data currently, though I am investigating it for the purpose of scaffolding).

        Anyway - if you use Seal, you will get one file for each sequence in the reference fasta, containing all the reads matching that sequence, which sounds possibly like what you want. Alternatively, you could do this:

        Code:
        bbduk.sh in=reads.fq ref=whitelist.fa outm=matched0.fq out=unmatched0.fq k=14 mm=f hdist=0 rename fbm
        
        bbduk.sh in=unmatched0.fq ref=whitelist.fa outm=matched1.fq out=unmatched1.fq k=14 mm=f hdist=1 rename fbm
        
        bbduk.sh in=unmatched1.fq ref=whitelist.fa outm=matched2.fq out=unmatched2.fq k=14 mm=f hdist=2 rename fbm
        ...etc. That will rename the reads to indicate which sequence they matched (it will add it as a suffix so the original name is also retained). By default BBDuk will not let you go over hdist=3 unless you add the flag "-da". But, there are 2 ways to increase hamming distance in BBDuk, with "hdist" and "qhdist"; the first makes mutant copies of reference kmers, and the second makes mutant copies of query kmers. So to achieve hamming distance 6, it's most efficient to use "hdist=3 qhdist=3", since each is independently exponential.
        Last edited by Brian Bushnell; 02-17-2017, 06:01 PM.

        Comment

        • nucacidhunter
          Jafar Jabbari
          • Jan 2013
          • 1250

          #5
          Cell Ranger corrects one base mismatches if it confidently can be assigned to one of whitelist barcodes. 10x software might have a setting to relax match criteria.

          Comment

          • bloosnail
            Member
            • Jul 2015
            • 17

            #6
            @Brian, I am currently running BBDuk and it seems to be working well. Thank you for your suggestion. I am wondering if there is reasonable increase in time if more cores are used -- from the documentation, "This uses a fixed number of threads, currently 7, which cannot be changed (except by editing the “WAYS=7” line in BBDuk’s source code and recompiling)". Also, I think there is a slight typo in the manual in the installation section (http://jgi.doe.gov/data-and-tools/bb...llation-guide/), I think the command "$ tar -xvfz BBMap_(version)" should be "$ tar -xvzf BBMap_(version).tar.gz". Somewhat trivial, but could be confusing for someone without experience on the command line.

            @nucacidhunter, I tried e-mailing 10X and they said they do not currently have built-in functionality to relax the settings, but there is a line in the source code where one can relax the cut-off for filtering. We also may try this, thank you for the suggestion.

            Comment

            • Brian Bushnell
              Super Moderator
              • Jan 2014
              • 2709

              #7
              Originally posted by bloosnail View Post
              @Brian, I am currently running BBDuk and it seems to be working well. Thank you for your suggestion. I am wondering if there is reasonable increase in time if more cores are used -- from the documentation, "This uses a fixed number of threads, currently 7, which cannot be changed (except by editing the “WAYS=7” line in BBDuk’s source code and recompiling)".
              To clarify, BBDuk scales fine to an arbitrary number of threads for processing the reads (using the "threads" flag, but by default it uses all available threads). It uses 7 threads only when processing the reference. In most cases processing the reference takes a trivial amount of time, but it can take a while with a high hamming distance.

              Increasing WAYS will increase speed up to the number of cores you have (the number needs to be odd though). If WAYS is above your core count it will decrease speed instead. But it doesn't increase speed all that much (all of the arithmetic is replicated; only the memory random accesses are parallel) so 7 is usually fine. Once the reference is processed, WAYS has no effect on runtime for processing the reads. So it's only useful to adjust if the majority of your time is spent processing the reference.

              Note that "hdist" increases the time needed for loading the reference, while "qhdist" increases the time needed to process the reads.

              Also, I think there is a slight typo in the manual in the installation section (http://jgi.doe.gov/data-and-tools/bb...llation-guide/), I think the command "$ tar -xvfz BBMap_(version)" should be "$ tar -xvzf BBMap_(version).tar.gz". Somewhat trivial, but could be confusing for someone without experience on the command line.
              Thanks; I'll fix that!

              Comment

              • bloosnail
                Member
                • Jul 2015
                • 17

                #8
                Hi Brian,

                I am looking more closely at the output from BBDuk and I think there may be a problem. It appears that whenever a read has multiple matches, the name of the barcode that is aligned to is always the same. For example, here is a grep'd read with multiple matches on the output .fq and this is what is returned:

                @NS500211:254:HGWNFBGX2:1:11101:11155:16312 barcode_number_536269=1
                @NS500211:254:HGWNFBGX2:1:11101:11155:16312 barcode_number_536269=1
                @NS500211:254:HGWNFBGX2:1:11101:11155:16312 barcode_number_536269=1
                @NS500211:254:HGWNFBGX2:1:11101:11155:16312 barcode_number_536269=1

                So what would be expected is numerous hits for the read ID, but ideally the barcode names would be different. It appears it is like this for every read with multiple matches. Would you know what the problem is? Here is the command I have been using (or some slight variation of it):

                $ ../BBDuk/bbmap/bbduk.sh in=../no_CB_no_head.fq ref=
                ../bt2_index/barcodes.fa outm=matched1.fq out=unmatched1.fq k=14 mm=f -da hdist=1 rcomp=f rename fbm

                Comment

                • Brian Bushnell
                  Super Moderator
                  • Jan 2014
                  • 2709

                  #9
                  BBDuk and Seal are very similar, but different in an important way. BBDuk associates each kmer with a single reference sequence - specifically, the first one it encounters in the reference fasta. Seal associates each kmer with every reference sequence containing that kmer. The syntax is very similar, and both support read renaming. For contaminant removal or adapter-trimming, BBDuk is ideal. But in a situation where it is important to distinguish between similar things - such as isoform expression quantification, contaminant detection, and barcode matching... it's important to use Seal.

                  Sorry, I realize now that I should have been more clear on this initially. If I was routinely using 10X data I'm sure I'd have a protocol in place, but this is new to me. With a low hdist setting and a high hamming distance between barcodes (which was my initial expectation) the results are identical, but once you increase hdist/qhdist sufficiently, Seal should be used instead of BBDuk, when it's important to know which reference sequence a query hit.

                  Note that Seal still does not distinguish between 2 mismatches for reference sequence A versus 3 mismatches for reference sequence B. If you set hdist=2, the query will get tagged as matching A only. If you set hdist=3, the query will get tagged as matching both A and B with no indication that it matches A better; that information is not stored. I designed it for longer sequences where ties could be broken by the number of shared kmers; but when the sequence length is equal to the kmer length, obviously, that won't work as a tie-breaker since it's either 1 or 0.

                  Anyway - I think Seal is a good tool for this purpose, but tools are always best when you know their limitations!
                  Last edited by Brian Bushnell; 02-23-2017, 10:41 PM.

                  Comment

                  • bloosnail
                    Member
                    • Jul 2015
                    • 17

                    #10
                    Thank you for your thorough and quick response Brian. I have tried Seal out and it seems to be giving the correct output -- same number of reads but the multi-mapped reads have different IDs on them. Is it possible to still split up the hdist and qhdist to sum up to the desired Hamming distance to decrease exponential runtime? I tried running something like this:

                    ../BBDuk/bbmap/seal.sh in
                    =unmatched3.fq ref=../bt2_index/barcodes.fa out=matched4.fq outu=unmatched4.fq k
                    =14 mm=f -da hdist=2 qhdist=2 rcomp=f rename fbm overwrite

                    for matches on Hamming distance of 4 and it gives a lot of NullPointerExceptions and crashes.

                    Comment

                    • Brian Bushnell
                      Super Moderator
                      • Jan 2014
                      • 2709

                      #11
                      Originally posted by bloosnail View Post
                      Thank you for your thorough and quick response Brian. I have tried Seal out and it seems to be giving the correct output -- same number of reads but the multi-mapped reads have different IDs on them. Is it possible to still split up the hdist and qhdist to sum up to the desired Hamming distance to decrease exponential runtime? I tried running something like this:

                      ../BBDuk/bbmap/seal.sh in
                      =unmatched3.fq ref=../bt2_index/barcodes.fa out=matched4.fq outu=unmatched4.fq k
                      =14 mm=f -da hdist=2 qhdist=2 rcomp=f rename fbm overwrite

                      for matches on Hamming distance of 4 and it gives a lot of NullPointerExceptions and crashes.
                      It should not crash, so please send me all of the screen output (it all gets written to stderr). Of course it would be great if you could send me the input data as well, if it's not confidential.

                      Comment

                      • bloosnail
                        Member
                        • Jul 2015
                        • 17

                        #12
                        Here is a link to the screen output from the command that I posted earlier: https://drive.google.com/open?id=0B_...0ZCQ3dQei1aUFU

                        Also, the number of input reads at the bottom should be 4,568,965 instead of 51. I'm not sure my supervisor would want me to post the data and it's also pretty big, but if the problem is not clear I can PM you a shortened version of it. Thanks for taking a look!

                        Comment

                        • Brian Bushnell
                          Super Moderator
                          • Jan 2014
                          • 2709

                          #13
                          I found and fixed the problem, and will upload the fixed version on Monday.

                          Comment

                          • Brian Bushnell
                            Super Moderator
                            • Jan 2014
                            • 2709

                            #14
                            This issue is now resolved as of BBMap v37.00!

                            Comment

                            Latest Articles

                            Collapse

                            • SEQadmin2
                              Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                              by SEQadmin2


                              I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.


                              Here are nine questions we think about, in roughly the order they matter, before...
                              06-18-2026, 07:11 AM
                            • SEQadmin2
                              From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                              by SEQadmin2


                              Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                              The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                              ...
                              06-02-2026, 10:05 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by SEQadmin2, 06-17-2026, 06:09 AM
                            0 responses
                            26 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-09-2026, 11:58 AM
                            0 responses
                            44 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-05-2026, 10:09 AM
                            0 responses
                            48 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-04-2026, 08:59 AM
                            0 responses
                            49 views
                            0 reactions
                            Last Post SEQadmin2  
                            Working...