Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Good way to align very, very short reads?

    We have single-cell transcriptomic reads, where each read has a barcode associated with it which designates which cell it came from. From the default pipeline (a company called 10X Genomics), about 1/3 of reads' barcodes did not pass quality control and we are seeing if we can stretch this and save some of the data. So essentially, each read has a barcode of 14bp attached to it, and there is a whitelist of allowed barcodes (all also of 14bp). We are trying to use Bowtie2 to align the barcodes of reads that did not pass quality control (generally they all have one or more bases different from any whitelisted barcodes) back to the reference whitelist to see which barcodes in the whitelist are most similar to each read, and we may decide to keep more of the data if reads have whitelisted barcodes similar enough to their own.

    Essentially, we are making mock .fastq and .fasta files from the reads' barcodes and whitelisted barcodes respectively, and running alignment with the whitelisted barcodes as the "reference genome". We are wondering if there is a better way/tool to do this, since I believe Bowtie2 was designed for longer length (50+ bp) reads (all barcodes are 14bp long). It looks like the first version of Bowtie is more geared toward shorter read length, but the manual still says it was intended for reads of length 25-50bp, and we are wondering if there is a tool designed for even shorter read lengths, or if someone has insights on whether Bowtie2/Bowtie is giving accurate results. I first tried writing a Perl script to do the check but it took a minute for only 30 reads (one sample has 30 million...). Thanks.

    *Edit
    So basically, we are doing normal alignment but with several differences:
    -the read lengths and reference sequences are all very short (14bp)
    -all data is exactly the same length, so there is no need for gapped alignment
    -all reads are single-ended
    Last edited by bloosnail; 02-16-2017, 06:45 PM.

  • #2
    The best way to do this is typically with BBDuk or Seal, which operate on kmer-matching but allow a hamming distance. They're ideal for 14bp sequences, and are very fast. For example:

    bbduk.sh in=reads.fq ref=whitelist.fa outm=match.fq out=unmatched.fq k=14 mm=f
    This will put all the reads in match or unmatched, requiring exact matches. If you want to allow mismatches, use "hdist"; for example, this will allow 3 mismatches:

    bbduk.sh in=reads.fq ref=whitelist.fa outm=match.fq out=unmatched.fq k=14 mm=f hdist=3
    Alternatively, Seal will split the input file into one output file per reference sequence:

    seal.sh in=reads.fq ref=whitelist.fa pattern=match_%.fq out=unmatched.fq k=14 mm=f hdist=3
    You can add the "ordered" flag if you want the reads to remain in input order. Note that it is also possible to do this via alignment (e.g. with bowtie1 or with BBMap), which would produce a sam file, but it may require tweaking default parameters. With BBMap for example it would be something like:

    bbmap.sh ref=whitelist.fa in=reads.fq out=mapped.sam k=7 vslow strictmaxindel=0
    ...but, I think Seal or BBDuk would be more convenient.
    Last edited by Brian Bushnell; 02-16-2017, 07:05 PM.

    Comment


    • #3
      Hi Brian,

      Thank you for your response, this seems to be useful for our purposes. I am wondering if BBDuk has some functionality to output the actual sequence in the .fa that each read aligned to, not just whether a read has valid alignments or not. Our specific end goal is to find the subset of whitelisted reads that is most similar to each barcode.

      Additionally, we are only trying to find the subset of whitelisted barcodes with the minimum Hamming distance for each barcode, not which barcodes have a reference within a certain Hamming distance away. So I think (if it is able to output the sequence of the reference .fa that a read is aligned it) we could run BBDuk iteratively over increasing Hamming distance from 1-14, then starting from Hamming distance of 1 up to 14, check if a read has passed yet or not, and exclude it from later iterations once it has passed.

      So specifically, my main question is whether BBDuk can output the the actual reference sequence that a read aligned to, or if you have some recommendation on a tool that can do that along with the other criteria listed previously? Thanks.

      Comment


      • #4
        Actually, BBDuk will not go over a hamming distance of around 7 very easily. The exact limit depends on the kmer length, number of sequences, and amount of memory, but the time and memory increase exponentially with (14*3)^(hamming distance). But then, finding a 14-mer match with a hamming distance of 14 is not very interesting anyway Anyway, iteratively increasing hamming distance from 0 to some limit is probably the best way to go, unless there's a specific program that simply brute-force calculate all the hamming distances... I do have one like that for bar codes (where there's no computational cost for mismatches, unlike BBDuk or alignment algorithms) but it's not quite designed for this use. But if this is a common need for processing 10X data I'll plan to modify it to suit this purpose exactly (we don't use 10X data currently, though I am investigating it for the purpose of scaffolding).

        Anyway - if you use Seal, you will get one file for each sequence in the reference fasta, containing all the reads matching that sequence, which sounds possibly like what you want. Alternatively, you could do this:

        Code:
        bbduk.sh in=reads.fq ref=whitelist.fa outm=matched0.fq out=unmatched0.fq k=14 mm=f hdist=0 rename fbm
        
        bbduk.sh in=unmatched0.fq ref=whitelist.fa outm=matched1.fq out=unmatched1.fq k=14 mm=f hdist=1 rename fbm
        
        bbduk.sh in=unmatched1.fq ref=whitelist.fa outm=matched2.fq out=unmatched2.fq k=14 mm=f hdist=2 rename fbm
        ...etc. That will rename the reads to indicate which sequence they matched (it will add it as a suffix so the original name is also retained). By default BBDuk will not let you go over hdist=3 unless you add the flag "-da". But, there are 2 ways to increase hamming distance in BBDuk, with "hdist" and "qhdist"; the first makes mutant copies of reference kmers, and the second makes mutant copies of query kmers. So to achieve hamming distance 6, it's most efficient to use "hdist=3 qhdist=3", since each is independently exponential.
        Last edited by Brian Bushnell; 02-17-2017, 06:01 PM.

        Comment


        • #5
          Cell Ranger corrects one base mismatches if it confidently can be assigned to one of whitelist barcodes. 10x software might have a setting to relax match criteria.

          Comment


          • #6
            @Brian, I am currently running BBDuk and it seems to be working well. Thank you for your suggestion. I am wondering if there is reasonable increase in time if more cores are used -- from the documentation, "This uses a fixed number of threads, currently 7, which cannot be changed (except by editing the “WAYS=7” line in BBDuk’s source code and recompiling)". Also, I think there is a slight typo in the manual in the installation section (http://jgi.doe.gov/data-and-tools/bb...llation-guide/), I think the command "$ tar -xvfz BBMap_(version)" should be "$ tar -xvzf BBMap_(version).tar.gz". Somewhat trivial, but could be confusing for someone without experience on the command line.

            @nucacidhunter, I tried e-mailing 10X and they said they do not currently have built-in functionality to relax the settings, but there is a line in the source code where one can relax the cut-off for filtering. We also may try this, thank you for the suggestion.

            Comment


            • #7
              Originally posted by bloosnail View Post
              @Brian, I am currently running BBDuk and it seems to be working well. Thank you for your suggestion. I am wondering if there is reasonable increase in time if more cores are used -- from the documentation, "This uses a fixed number of threads, currently 7, which cannot be changed (except by editing the “WAYS=7” line in BBDuk’s source code and recompiling)".
              To clarify, BBDuk scales fine to an arbitrary number of threads for processing the reads (using the "threads" flag, but by default it uses all available threads). It uses 7 threads only when processing the reference. In most cases processing the reference takes a trivial amount of time, but it can take a while with a high hamming distance.

              Increasing WAYS will increase speed up to the number of cores you have (the number needs to be odd though). If WAYS is above your core count it will decrease speed instead. But it doesn't increase speed all that much (all of the arithmetic is replicated; only the memory random accesses are parallel) so 7 is usually fine. Once the reference is processed, WAYS has no effect on runtime for processing the reads. So it's only useful to adjust if the majority of your time is spent processing the reference.

              Note that "hdist" increases the time needed for loading the reference, while "qhdist" increases the time needed to process the reads.

              Also, I think there is a slight typo in the manual in the installation section (http://jgi.doe.gov/data-and-tools/bb...llation-guide/), I think the command "$ tar -xvfz BBMap_(version)" should be "$ tar -xvzf BBMap_(version).tar.gz". Somewhat trivial, but could be confusing for someone without experience on the command line.
              Thanks; I'll fix that!

              Comment


              • #8
                Hi Brian,

                I am looking more closely at the output from BBDuk and I think there may be a problem. It appears that whenever a read has multiple matches, the name of the barcode that is aligned to is always the same. For example, here is a grep'd read with multiple matches on the output .fq and this is what is returned:

                @NS500211:254:HGWNFBGX2:1:11101:11155:16312 barcode_number_536269=1
                @NS500211:254:HGWNFBGX2:1:11101:11155:16312 barcode_number_536269=1
                @NS500211:254:HGWNFBGX2:1:11101:11155:16312 barcode_number_536269=1
                @NS500211:254:HGWNFBGX2:1:11101:11155:16312 barcode_number_536269=1

                So what would be expected is numerous hits for the read ID, but ideally the barcode names would be different. It appears it is like this for every read with multiple matches. Would you know what the problem is? Here is the command I have been using (or some slight variation of it):

                $ ../BBDuk/bbmap/bbduk.sh in=../no_CB_no_head.fq ref=
                ../bt2_index/barcodes.fa outm=matched1.fq out=unmatched1.fq k=14 mm=f -da hdist=1 rcomp=f rename fbm

                Comment


                • #9
                  BBDuk and Seal are very similar, but different in an important way. BBDuk associates each kmer with a single reference sequence - specifically, the first one it encounters in the reference fasta. Seal associates each kmer with every reference sequence containing that kmer. The syntax is very similar, and both support read renaming. For contaminant removal or adapter-trimming, BBDuk is ideal. But in a situation where it is important to distinguish between similar things - such as isoform expression quantification, contaminant detection, and barcode matching... it's important to use Seal.

                  Sorry, I realize now that I should have been more clear on this initially. If I was routinely using 10X data I'm sure I'd have a protocol in place, but this is new to me. With a low hdist setting and a high hamming distance between barcodes (which was my initial expectation) the results are identical, but once you increase hdist/qhdist sufficiently, Seal should be used instead of BBDuk, when it's important to know which reference sequence a query hit.

                  Note that Seal still does not distinguish between 2 mismatches for reference sequence A versus 3 mismatches for reference sequence B. If you set hdist=2, the query will get tagged as matching A only. If you set hdist=3, the query will get tagged as matching both A and B with no indication that it matches A better; that information is not stored. I designed it for longer sequences where ties could be broken by the number of shared kmers; but when the sequence length is equal to the kmer length, obviously, that won't work as a tie-breaker since it's either 1 or 0.

                  Anyway - I think Seal is a good tool for this purpose, but tools are always best when you know their limitations!
                  Last edited by Brian Bushnell; 02-23-2017, 10:41 PM.

                  Comment


                  • #10
                    Thank you for your thorough and quick response Brian. I have tried Seal out and it seems to be giving the correct output -- same number of reads but the multi-mapped reads have different IDs on them. Is it possible to still split up the hdist and qhdist to sum up to the desired Hamming distance to decrease exponential runtime? I tried running something like this:

                    ../BBDuk/bbmap/seal.sh in
                    =unmatched3.fq ref=../bt2_index/barcodes.fa out=matched4.fq outu=unmatched4.fq k
                    =14 mm=f -da hdist=2 qhdist=2 rcomp=f rename fbm overwrite

                    for matches on Hamming distance of 4 and it gives a lot of NullPointerExceptions and crashes.

                    Comment


                    • #11
                      Originally posted by bloosnail View Post
                      Thank you for your thorough and quick response Brian. I have tried Seal out and it seems to be giving the correct output -- same number of reads but the multi-mapped reads have different IDs on them. Is it possible to still split up the hdist and qhdist to sum up to the desired Hamming distance to decrease exponential runtime? I tried running something like this:

                      ../BBDuk/bbmap/seal.sh in
                      =unmatched3.fq ref=../bt2_index/barcodes.fa out=matched4.fq outu=unmatched4.fq k
                      =14 mm=f -da hdist=2 qhdist=2 rcomp=f rename fbm overwrite

                      for matches on Hamming distance of 4 and it gives a lot of NullPointerExceptions and crashes.
                      It should not crash, so please send me all of the screen output (it all gets written to stderr). Of course it would be great if you could send me the input data as well, if it's not confidential.

                      Comment


                      • #12
                        Here is a link to the screen output from the command that I posted earlier: https://drive.google.com/open?id=0B_...0ZCQ3dQei1aUFU

                        Also, the number of input reads at the bottom should be 4,568,965 instead of 51. I'm not sure my supervisor would want me to post the data and it's also pretty big, but if the problem is not clear I can PM you a shortened version of it. Thanks for taking a look!

                        Comment


                        • #13
                          I found and fixed the problem, and will upload the fixed version on Monday.

                          Comment


                          • #14
                            This issue is now resolved as of BBMap v37.00!

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Current Approaches to Protein Sequencing
                              by seqadmin


                              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                              04-04-2024, 04:25 PM
                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 04-11-2024, 12:08 PM
                            0 responses
                            31 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 10:19 PM
                            0 responses
                            32 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 09:21 AM
                            0 responses
                            28 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-04-2024, 09:00 AM
                            0 responses
                            53 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X