  • generate fastq files for entire sample pool, remove human reads, then demultiplex

    I have 125 bp paired-end HiSeq data with dual Nextera indexes (each 8 bp). My data is from a single lane of the flowcell, and I have my lane's .bcl and .cpos files. Using these files, I then perform my demultiplexing using Picard.

    However, I would like to remove any human reads from the whole data set before I ever de-multiplex the samples. This way human reads would never be identified on a per-sample basis, only on a group wide basis and filtered out as early as possible from my downstream analysis pipeline.

    Can anyone suggest a way to do this? Is it feasible?

    It seems that if I could get Step 3 to work, this would do it.

    Step 1: Completely skip all barcode bases in the reads and generate my PE fastq files
    Step 2: Tag human reads from these de-identified fastq files using some tool
    Step 3: Somehow use Step 2's tags to remove human data from the raw .bcl and .cpos files
    Step 4: De-multiplex again using the barcode


    Also, any favorite tool for human read filtering from Hiseq data (and justification)?

    Any comparison between Picard for demux and Illumina's bcl2fastq or other demultiplexers? It seems to me that a demultiplexer that takes into account single indels would be nice.
    Last edited by sk8bro; 10-29-2015, 09:30 AM.

  • #2
    With the demultiplexed (or non-demux) read files use BBSplit to separate the human data using human genome reference.
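    A minimal sketch of that approach (file names and the human reference path are placeholders, not from this thread; BBSplit writes reads that map to no reference to the `outu` files):

    ```shell
    # Sketch: split non-demultiplexed paired reads against the human genome.
    # r1.fq/r2.fq and human.fa are placeholders for your own files.
    bbsplit.sh in1=r1.fq in2=r2.fq ref=human.fa \
        basename=mapped_%.fq \
        outu1=clean_1.fq outu2=clean_2.fq
    ```

    The `%` in `basename` is replaced with the reference name, so human-matching reads land in their own file while everything else goes to the clean outputs.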

    bcl2fastq will allow errors in barcodes (Picard tool doesn't? have not looked carefully). You would need access to entire run folder in order to run bcl2fastq.



    • #3
      BBsplit looks like a good tool for the human read filtering from the non-demux fastq files. This would give me clean_1.fq and clean_2.fq let's say.... But, I am still not sure how I would then demultiplex the clean fastq files. As far as I know, the barcode data wouldn't be stored in the clean fastq files.

      Is BBsplit (well really BBmap) much better than say bowtie2 for human read filtering? I have used bowtie2, so I would understand a comparison. It is nice that BBsplit will directly output fastq files, and if I wanted to align to other genomes besides human, then BBsplit seems great. I will download and test it out, but any anecdotal justification/comparison would be helpful!

      Yes, Picard has 4 options: max mismatches, min mismatch delta, max no calls, and read quality that I can set to my desired stringency. I haven't used bcl2fastq, but there must be similar parameters. The limitation of Picard is that it seems to really only look for substitution errors, and not indels. But, it is doing a fine job, just curious if something more sophisticated is out there. TagGD seemed cool to try, but I don't see how to use it on dual-indexed data.

      Thx!



      • #4
        Also, because I am running my samples through a core, and the rest of the lanes aren't my data, I don't have access to the whole run folder. Just my lane. I didn't know whether bcl2fastq would like that, but your comment suggests that it won't.

        Here is a link to TagGD paper:

        Multiplexing is of vital importance for utilizing the full potential of next generation sequencing technologies. We here report TagGD (DNA-based Tag Generator and Demultiplexor), a fully-customisable, fast and accurate software package that can generate thousands of barcodes satisfying user-defined constraints and can guarantee full demultiplexing accuracy. The barcodes are designed to minimise their interference with the experiment. Insertion, deletion and substitution events are considered when designing and demultiplexing barcodes. 20,000 barcodes of length 18 were designed in 5 minutes and 2 million barcoded Illumina HiSeq-like reads generated with an error rate of 2% were demultiplexed with full accuracy in 5 minutes. We believe that our software meets a central demand in the current high-throughput biology and can be utilised in any field with ample sample abundance. The software is available on GitHub (https://github.com/pelinakan/UBD.git).



        • #5
          Yes, bcl2fastq conversion requires the entire run folder.

          The rationale for your work flow is unclear. It seems like your desired output is demultiplexed FASTQs from which the human reads are removed. The only way to identify the human data is by alignment, which requires FASTQs as input. And there's no way to remove the human data from the BCLs, anyway (or to back-convert FASTQs to BCLs). So why don't you just perform standard bcl2fastq conversion/demultiplexing and follow GenoMax's suggestion to use BBSplit, which is designed to do precisely what you seem to want?

          The BBMap site (here) contains ROCs to compare BBMap to a number of aligners, including Bowtie 2.



          • #6
            Originally posted by sk8bro View Post
            Also, because I am running my samples through a core, and the rest of the lanes aren't my data, I don't have access to the whole run folder. Just my lane. I didn't know whether bcl2fastq would like that, but your comment suggests that it won't.
            Ask the core to do the fastq conversion. It would be trivial for them. bcl2fastq would allow mismatches in the "tag" itself. This potentially can be useful in recovering additional data. If your use case requires only perfect matches for the tag read then that is not important. I don't think picard will allow mismatches for the tag reads during demultiplexing (but then I have never used picard for this purpose).

            There is no installation/compilation with BBMap. You will have to make the indexes once but that is also painless and not very time consuming even for human genome.



            • #7
              Originally posted by sk8bro View Post
              Is BBsplit (well really BBmap) much better than say bowtie2 for human read filtering?
              BBSplit is incomparably better than anything else that I know of for partitioning short reads between multiple references (bear in mind, I am the author), because it maps to all of them simultaneously and picks the best per read or pair. However, it is designed specifically for that case - when you have all of the references of interest, or for iterative assemblies where you are simultaneously building multiple organisms and want to split the reads between them and reassemble.

              If you are sequencing an unknown organism, have no assembly, and simply want to remove human reads, I recommend this procedure. This will ensure the removal of the ~98.6% of reads that can with very high confidence be assigned to human, as opposed to non-animals (tested using 2x150bp reads). The remaining 1.4% is mostly very highly conserved, or very low complexity (e.g. ACACACACAC....) and cannot be confidently assigned to a specific organism.

              Of course, you can use an unmasked version of the human genome and higher sensitivity mapping (for example, BBMap's defaults, which are very sensitive), if you want to remove absolutely all human contamination. But I don't recommend this unless you are dealing with only bacteria and archaea, and nothing multicellular or from other kingdoms. If you are interested in eukaryotes... it's generally best to remove human reads as per the above protocol, then assemble, then remove human contigs, which are hopefully longer, more accurate, and more complex (by virtue of being longer) than reads, and thus will match the human genome even though their constituent reads did not.
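              For reference, the low-sensitivity removal settings described above look roughly like this (a sketch, not the exact command from the linked procedure; the index path and file names are assumptions, and the masked-genome index would need to be built beforehand with `bbmap.sh ref=...`):

              ```shell
              # Sketch of conservative human-read removal with BBMap, assuming a
              # pre-built index of a masked human genome at /path/to/hg19_masked.
              # The reduced-sensitivity flags keep only confident human hits.
              bbmap.sh minid=0.95 maxindel=3 bwr=0.16 bw=12 \
                  quickmatch fast minhits=2 \
                  path=/path/to/hg19_masked \
                  in1=reads_1.fq in2=reads_2.fq \
                  outu1=nonhuman_1.fq outu2=nonhuman_2.fq \
                  outm=human.fq
              ```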
              Last edited by Brian Bushnell; 10-29-2015, 08:52 PM.



              • #8
                GenoMax: Picard CAN allow for mismatches and allows me to look for unexpected barcodes and see the effects of parameters, etc. Default bcl2fastq is performed by my core facility already, of course, it is just not what I need right now, but I do like comparing against it. I have a hard time believing that bcl2fastq can't allow for higher stringency settings, but I do understand why it would be a pain for a core to use special settings for just my data.

                HESmith: If I can't back-convert Fastqs to bcls, then I agree that this is impossible. Why can't they be back-converted though? Do you mean there just isn't a tool to do this? Or that a tool couldn't be made? Is there no tool because there is no desire for this or some other technical reason? The alternative of removing human reads from the de-multiplexed fastq files is fine, but I do think there is a user-base in the clinical world who would be all about never identifying human reads on a per-sample basis.

                Brian: Thanks for all of your suggestions. I will give your tools a try! They do seem like a great suite to get familiar with, thank you. I love the masked human genome option. One question... I am using Trimmomatic currently to pre-process my reads (I trim adapters, trim the first few and last bases that are often lower quality, then do a sliding window trim using 4 bp and Q18 average). Would you recommend human removal before or after pre-processing? I see you have BBTrim also, which I will look into because I don't entirely understand the numbers I get out of Trimmomatic. For instance, using the head/tail trim of a few bases gives me more data retained than if I skip this step and use only the sliding window quality trim. Basically, it is seeming like I should replace all of my pipeline with BB tools .



                • #9
                  Why can't they be back-converted though? Do you mean there just isn't a tool to do this? Or that a tool couldn't be made? Is there no tool because there is no desire for this or some other technical reason? The alternative of removing human reads from the de-multiplexed fastq files is fine, but I do think there is a user-base in the clinical world who would be all about never identifying human reads on a per-sample basis.
                  I've been following the conversation, and I still don't understand the rationale.
                  No one has written such a tool, because no one sees the purpose of writing such a tool. Just make your life simple, and remove the human reads from the FASTQ files. It's very simple to write a script that will cycle through all the FASTQ files.
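                  Such a loop could look like this (a hypothetical sketch: the `_R1`/`_R2` naming convention and the reference file are assumptions about your file layout):

                  ```shell
                  # Loop over demultiplexed sample pairs and filter each one
                  # against the human reference with BBSplit.
                  for r1 in sample*_R1.fastq.gz; do
                      r2=${r1/_R1/_R2}        # matching read-2 file
                      base=${r1%%_R1*}        # sample name prefix
                      bbsplit.sh in1="$r1" in2="$r2" ref=human.fa \
                          basename="${base}_human_%.fq.gz" \
                          outu1="${base}_clean_R1.fq.gz" \
                          outu2="${base}_clean_R2.fq.gz"
                  done
                  ```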

                  I've used Xenome in the past to distinguish reads of human tumor xenografts in mice from the mouse reads.
                  I wouldn't particularly recommend the software, but it does have one interesting feature: it classifies the reads into three separate categories, mouse, human, and ambiguous.

                  You don't mention which species you are sequencing, but this is a point to take into consideration if the species are closely related.



                  • #10
                    Originally posted by sk8bro View Post
                    Brian: Thanks for all of your suggestions. I will give your tools a try! They do seem like a great suite to get familiar with, thank you. I love the masked human genome option. One question... I am using Trimmomatic currently to pre-process my reads (I trim adapters, trim the first few and last bases that are often lower quality, then do a sliding window trim using 4 bp and Q18 average). Would you recommend human removal before or after pre-processing?
                    Human removal should go after adapter-trimming and quality-trimming or quality-filtering, as adapter-sequence (for example) will cause mismatches and thus make the reads look less similar to human.

                    I see you have BBTrim also, which I will look into because I don't entirely understand the numbers I get out of Trimmomatic. For instance, using the head/tail trim of a few bases gives me more data retained than if I skip this step and use only the sliding window quality trim.
                    BBDuk actually does support window trimming, but I don't recommend it as it is not optimal. Rather, running BBDuk with "qtrim=rl trimq=10" will, for example, quality-trim to Q10 and leave the longest possible sequence of average quality of at least 10 that cannot be extended without adding regions with average quality below 10. Depending on your experiment, be wary of setting too high a quality-trim threshold; it will bias your analyses and can lead to inferior results. In many scenarios, such as mapping, no quality-trimming at all is needed unless you have low quality data. And forcibly trimming a fixed number of bases from each end is generally detrimental unless there is a specific problem you are trying to address.
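                    That trimming mode looks like this (input/output names are placeholders; adapter trimming against a reference file could be added in the same pass):

                    ```shell
                    # Quality-trim both ends (r = right, l = left) to Q10,
                    # using the flags described above.
                    bbduk.sh in1=r1.fq in2=r2.fq \
                        out1=trimmed_1.fq out2=trimmed_2.fq \
                        qtrim=rl trimq=10
                    ```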

                    Basically, it is seeming like I should replace all of my pipeline with BB tools .
                    I won't disagree...



                    • #11
                      Hi Brian or any bbmap user, I am wondering if you can explain the read accounting that bbmap is doing?

                      Basically, why doesn't 613590 - 273 - 374 - 844 = #unmapped? (612099 vs 611706)
                      Code:
                      Reads Used:             613590  (75291382 bases)
                      
                      mated pairs:              0.0890%             273         0.0865%              65100
                      unmapped:                99.6930%          611706        99.6948%           75061564
                      
                      Read 1 data:            pct reads       num reads       pct bases          num bases
                      mapped:                   0.1219%             374         0.1190%              44827
                      
                      Read 2 data:            pct reads       num reads       pct bases          num bases
                      mapped:                   0.2751%             844         0.2760%             103810
                      Thanks, liking BB suite so far
                      Last edited by GenoMax; 11-16-2015, 09:11 AM. Reason: added CODE tags to improve readability



                      • #12
                        Hmmm... that's a good question. But it should be "unmapped+mapped1+mapped2=used"; the mated pairs are a subset of the mapped reads already.

                        "Unmapped" is currently tracking whether the reads go to "outm" or "outu". If one read in a pair is mapped and the other is unmapped, both go to "outm" because pairs are always kept together. Whereas "mapped" is tracking the status of individual reads - whether they mapped or not. That's confusing; if some of the reads map unpaired (as in this case), then the counts don't add up. But, I added it at the request of people who were doing human filtering, because they wanted to know exactly how many reads survived that phase, which is the number you get from the "unmapped" line when you use the "printunmappedcount" flag.



                        • #13
                          Thanks, Brian

                          yes, I verified that wc -l of clean_1.fq (divided by 2) is equal to the unmapped read count.

                          So following what you said, which is clear, then 666 reads or 333 pairs are missing in action? I didn't specify an outm file. Could that change behavior? Would you be worried if you were me? Thanks,



                          • #14
                            Originally posted by sk8bro View Post
                            Thanks, Brian

                            yes, I verified that wc -l of clean_1.fq (divided by 2) is equal to the unmapped read count.

                            So following what you said, which is clear, then 666 reads or 333 pairs are missing in action? I didn't specify an outm file. Could that change behavior? Would you be worried if you were me? Thanks,
                            It doesn't matter whether or not you specify "outm" or "outu". "unmapped" just counts the number of reads that end up in "outu", or that would end up in outu if you did specify it.

                            There's no reason to be worried. Sorry that the counts don't quite add up when you're using paired reads, but that's because they track different things.



                            • #15
                              You wrote: "But it should be "unmapped+mapped1+mapped2=used"

                              I just wanted to confirm that it is okay that the numbers also don't fit this formula.

                              611706+374+844 = 612924 != 613590

                              Then, if I wanted to report the mapping rate, is this the correct formula?

                              mapped rate = (used - unmapped) / used

                              Is there a way to export to outm only if both reads of a pair map? The default, I think, is to export if either read of the pair maps?

                              Thank you for clarifying.
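                              For what it's worth, both checks can be run directly against the counts from the earlier output (a quick sketch; the rate formula is the one proposed above, not an official BBMap definition):

                              ```python
                              # Counts copied from the BBMap output earlier in the thread.
                              used = 613590       # "Reads Used"
                              unmapped = 611706   # "unmapped" (printunmappedcount)
                              mapped_r1 = 374     # Read 1 "mapped"
                              mapped_r2 = 844     # Read 2 "mapped"

                              # unmapped + mapped1 + mapped2 falls short of "used" by 666 reads:
                              total = unmapped + mapped_r1 + mapped_r2
                              print(total, used - total)        # 612924 666

                              # Mapping rate under the (used - unmapped) / used definition:
                              rate = (used - unmapped) / used
                              print(f"{100 * rate:.3f}%")       # 0.307%
                              ```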
