  • generate fastq files for entire sample pool, remove human reads, then demultiplex

    I have 125 bp paired-end HiSeq data with dual Nextera indexes (each 8 bp). My data is from a single lane of the flowcell, and I have my lane's .bcl and .cpos files. Using these files, I then perform my demultiplexing using Picard.

    However, I would like to remove any human reads from the whole data set before I ever de-multiplex the samples. This way human reads would never be identified on a per-sample basis, only on a group wide basis and filtered out as early as possible from my downstream analysis pipeline.

    Can anyone suggest a way to do this? Is it feasible?

    It seems that if I could get Step 3 to work, this would do it.

    Step 1: Completely skip all barcode bases in the reads and generate my PE fastq files
    Step 2: Tag human reads from these de-identified fastq files using some tool
    Step 3: Somehow use Step 2's tags to remove human data from the raw .bcl and .cpos files
    Step 4: De-multiplex again using the barcode


    Also, any favorite tool for human read filtering from Hiseq data (and justification)?

    Any comparison between Picard for demux and Illumina's bcl2fastq or other demultiplexers? It seems to me that a demultiplexer that takes into account single indels would be nice.
    Last edited by sk8bro; 10-29-2015, 09:30 AM.

  • #2
    With the demultiplexed (or non-demux) read files use BBSplit to separate the human data using human genome reference.
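    A minimal sketch of that approach (file names and the human reference path are placeholders, not from this thread; BBSplit writes reads that map to no reference to the `outu` files):

    ```shell
    # Sketch: split non-demultiplexed paired reads against the human genome.
    # r1.fq/r2.fq and human.fa are placeholders for your own files.
    bbsplit.sh in1=r1.fq in2=r2.fq ref=human.fa \
        basename=mapped_%.fq \
        outu1=clean_1.fq outu2=clean_2.fq
    ```

    The `%` in `basename` is replaced with the reference name, so human-matching reads land in their own file while everything else goes to the clean outputs.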

    bcl2fastq will allow errors in barcodes (Picard tool doesn't? have not looked carefully). You would need access to entire run folder in order to run bcl2fastq.



    • #3
      BBsplit looks like a good tool for the human read filtering from the non-demux fastq files. This would give me clean_1.fq and clean_2.fq let's say.... But, I am still not sure how I would then demultiplex the clean fastq files. As far as I know, the barcode data wouldn't be stored in the clean fastq files.

      Is BBsplit (well really BBmap) much better than say bowtie2 for human read filtering? I have used bowtie2, so I would understand a comparison. It is nice that BBsplit will directly output fastq files, and if I wanted to align to other genomes besides human, then BBsplit seems great. I will download and test it out, but any anecdotal justification/comparison would be helpful!

      Yes, Picard has 4 options: max mismatches, min mismatch delta, max no calls, and read quality that I can set to my desired stringency. I haven't used bcl2fastq, but there must be similar parameters. The limitation of Picard is that it seems to really only look for substitution errors, and not indels. But, it is doing a fine job, just curious if something more sophisticated is out there. TagGD seemed cool to try, but I don't see how to use it on dual-indexed data.

      Thx!



      • #4
        Also, because I am running my samples through a core, and the rest of the lanes aren't my data, I don't have access to the whole run folder. Just my lane. I didn't know whether bcl2fastq would like that, but your comment suggests that it won't.

        Here is a link to TagGD paper:

        Multiplexing is of vital importance for utilizing the full potential of next generation sequencing technologies. We here report TagGD (DNA-based Tag Generator and Demultiplexor), a fully-customisable, fast and accurate software package that can generate thousands of barcodes satisfying user-defined constraints and can guarantee full demultiplexing accuracy. The barcodes are designed to minimise their interference with the experiment. Insertion, deletion and substitution events are considered when designing and demultiplexing barcodes. 20,000 barcodes of length 18 were designed in 5 minutes and 2 million barcoded Illumina HiSeq-like reads generated with an error rate of 2% were demultiplexed with full accuracy in 5 minutes. We believe that our software meets a central demand in the current high-throughput biology and can be utilised in any field with ample sample abundance. The software is available on GitHub (https://github.com/pelinakan/UBD.git).



        • #5
          Yes, bcl2fastq conversion requires the entire run folder.

          The rationale for your work flow is unclear. It seems like your desired output is demultiplexed FASTQs from which the human reads are removed. The only way to identify the human data is by alignment, which requires FASTQs as input. And there's no way to remove the human data from the BCLs, anyway (or to back-convert FASTQs to BCLs). So why don't you just perform standard bcl2fastq conversion/demultiplexing and follow GenoMax's suggestion to use BBSplit, which is designed to do precisely what you seem to want?

          The BBMap site (here) contains ROCs to compare BBMap to a number of aligners, including Bowtie 2.



          • #6
            Originally posted by sk8bro View Post
            Also, because I am running my samples through a core, and the rest of the lanes aren't my data, I don't have access to the whole run folder. Just my lane. I didn't know whether bcl2fastq would like that, but your comment suggests that it won't.
            Ask the core to do the fastq conversion. It would be trivial for them. bcl2fastq would allow mismatches in the "tag" itself. This potentially can be useful in recovering additional data. If your use case requires only perfect matches for the tag read then that is not important. I don't think picard will allow mismatches for the tag reads during demultiplexing (but then I have never used picard for this purpose).

            There is no installation/compilation with BBMap. You will have to make the indexes once but that is also painless and not very time consuming even for human genome.



            • #7
              Originally posted by sk8bro View Post
              Is BBsplit (well really BBmap) much better than say bowtie2 for human read filtering?
              BBSplit is incomparably better than anything else that I know of for partitioning short reads between multiple references (bear in mind, I am the author), because it maps to all of them simultaneously and picks the best per read or pair. However, it is designed specifically for that case - when you have all of the references of interest, or for iterative assemblies where you are simultaneously building multiple organisms and want to split the reads between them and reassemble.

              If you are sequencing an unknown organism, have no assembly, and simply want to remove human reads, I recommend this procedure. This will ensure the removal of the ~98.6% of reads that can with very high confidence be assigned to human, as opposed to non-animals (tested using 2x150bp reads). The remaining 1.4% is mostly very highly conserved, or very low complexity (e.g. ACACACACAC....) and cannot be confidently assigned to a specific organism.

              Of course, you can use an unmasked version of the human genome and higher sensitivity mapping (for example, BBMap's defaults, which are very sensitive), if you want to remove absolutely all human contamination. But I don't recommend this unless you are dealing with only bacteria and archaea, and nothing multicellular or from other kingdoms. If you are interested in eukaryotes... it's generally best to remove human reads as per the above protocol, then assemble, then remove human contigs, which are hopefully longer, more accurate, and more complex (by virtue of being longer) than reads, and thus will match the human genome even though their constituent reads did not.
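              For reference, the low-sensitivity removal settings described above look roughly like this (a sketch, not the exact command from the linked procedure; the index path and file names are assumptions, and the masked-genome index would need to be built beforehand with `bbmap.sh ref=...`):

              ```shell
              # Sketch of conservative human-read removal with BBMap, assuming a
              # pre-built index of a masked human genome at /path/to/hg19_masked.
              # The reduced-sensitivity flags keep only confident human hits.
              bbmap.sh minid=0.95 maxindel=3 bwr=0.16 bw=12 \
                  quickmatch fast minhits=2 \
                  path=/path/to/hg19_masked \
                  in1=reads_1.fq in2=reads_2.fq \
                  outu1=nonhuman_1.fq outu2=nonhuman_2.fq \
                  outm=human.fq
              ```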
              Last edited by Brian Bushnell; 10-29-2015, 08:52 PM.



              • #8
                GenoMax: Picard CAN allow for mismatches and allows me to look for unexpected barcodes and see the effects of parameters, etc. Default bcl2fastq is performed by my core facility already, of course, it is just not what I need right now, but I do like comparing against it. I have a hard time believing that bcl2fastq can't allow for higher stringency settings, but I do understand why it would be a pain for a core to use special settings for just my data.

                HESmith: If I can't back-convert Fastqs to bcls, then I agree that this is impossible. Why can't they be back-converted though? Do you mean there just isn't a tool to do this? Or that a tool couldn't be made? Is there no tool because there is no desire for this or some other technical reason? The alternative of removing human reads from the de-multiplexed fastq files is fine, but I do think there is a user-base in the clinical world who would be all about never identifying human reads on a per-sample basis.

                Brian: Thanks for all of your suggestions. I will give your tools a try! They do seem like a great suite to get familiar with, thank you. I love the masked human genome option. One question... I am using Trimmomatic currently to pre-process my reads (I trim adapters, trim the first few and last bases that are often lower quality, then do a sliding window trim using 4 bp and Q18 average). Would you recommend human removal before or after pre-processing? I see you have BBTrim also, which I will look into because I don't entirely understand the numbers I get out of Trimmomatic. For instance, using the head/tail trim of a few bases gives me more data retained than if I skip this step and use only the sliding window quality trim. Basically, it is seeming like I should replace all of my pipeline with BB tools .



                • #9
                  Why can't they be back-converted though? Do you mean there just isn't a tool to do this? Or that a tool couldn't be made? Is there no tool because there is no desire for this or some other technical reason? The alternative of removing human reads from the de-multiplexed fastq files is fine, but I do think there is a user-base in the clinical world who would be all about never identifying human reads on a per-sample basis.
                  I've been following the conversation, and I still don't understand the rationale.
                  No one has written such a tool, because no one sees the purpose of writing such a tool. Just make your life simple, and remove the human reads from the FASTQ files. It's very simple to write a script that will cycle through all the FASTQ files.
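                  Such a loop could look like this (a hypothetical sketch: the `_R1`/`_R2` naming convention and the reference file are assumptions about your file layout):

                  ```shell
                  # Loop over demultiplexed sample pairs and filter each one
                  # against the human reference with BBSplit.
                  for r1 in sample*_R1.fastq.gz; do
                      r2=${r1/_R1/_R2}        # matching read-2 file
                      base=${r1%%_R1*}        # sample name prefix
                      bbsplit.sh in1="$r1" in2="$r2" ref=human.fa \
                          basename="${base}_human_%.fq.gz" \
                          outu1="${base}_clean_R1.fq.gz" \
                          outu2="${base}_clean_R2.fq.gz"
                  done
                  ```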

                  I've used Xenome in the past to distinguish reads of human tumor xenografts in mice from the mouse reads.
                  I wouldn't particularly recommend the software, but it does have one interesting feature: it classifies the reads into three separate categories, mouse, human, and ambiguous.

                  You don't mention which species you are sequencing, but this is a point to take into consideration if the species are closely related.



                  • #10
                    Originally posted by sk8bro View Post
                    Brian: Thanks for all of your suggestions. I will give your tools a try! They do seem like a great suite to get familiar with, thank you. I love the masked human genome option. One question... I am using Trimmomatic currently to pre-process my reads (I trim adapters, trim the first few and last bases that are often lower quality, then do a sliding window trim using 4 bp and Q18 average). Would you recommend human removal before or after pre-processing?
                    Human removal should go after adapter-trimming and quality-trimming or quality-filtering, as adapter-sequence (for example) will cause mismatches and thus make the reads look less similar to human.

                    I see you have BBTrim also, which I will look into because I don't entirely understand the numbers I get out of Trimmomatic. For instance, using the head/tail trim of a few bases gives me more data retained than if I skip this step and use only the sliding window quality trim.
                    BBDuk actually does support window trimming, but I don't recommend it as it is not optimal. Rather, running BBDuk with "qtrim=rl trimq=10" will, for example, quality-trim to Q10 and leave the longest possible sequence of average quality of at least 10 that cannot be extended without adding regions with average quality below 10. Depending on your experiment, be wary of setting too high a quality-trim threshold; it will bias your analyses and can lead to inferior results. In many scenarios, such as mapping, no quality-trimming at all is needed unless you have low quality data. And forcibly trimming a fixed number of bases from each end is generally detrimental unless there is a specific problem you are trying to address.
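                    That trimming mode looks like this (input/output names are placeholders; adapter trimming against a reference file could be added in the same pass):

                    ```shell
                    # Quality-trim both ends (r = right, l = left) to Q10,
                    # using the flags described above.
                    bbduk.sh in1=r1.fq in2=r2.fq \
                        out1=trimmed_1.fq out2=trimmed_2.fq \
                        qtrim=rl trimq=10
                    ```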

                    Basically, it is seeming like I should replace all of my pipeline with BB tools .
                    I won't disagree...



                    • #11
                      Hi Brian or any bbmap user, I am wondering if you can explain the read accounting that bbmap is doing?

                      Basically, why doesn't 613590 - 273 - 374 - 844 = #unmapped? (612099 vs 611706)
                      Code:
                      Reads Used:             613590  (75291382 bases)
                      
                      mated pairs:              0.0890%             273         0.0865%              65100
                      unmapped:                99.6930%          611706        99.6948%           75061564
                      
                      Read 1 data:            pct reads       num reads       pct bases          num bases
                      mapped:                   0.1219%             374         0.1190%              44827
                      
                      Read 2 data:            pct reads       num reads       pct bases          num bases
                      mapped:                   0.2751%             844         0.2760%             103810
                      Thanks, liking BB suite so far
                      Last edited by GenoMax; 11-16-2015, 09:11 AM. Reason: added CODE tags to improve readability



                      • #12
                        Hmmm... that's a good question. But it should be "unmapped+mapped1+mapped2=used"; the mated pairs are a subset of the mapped reads already.

                        "Unmapped" is currently tracking whether the reads go to "outm" or "outu". If one read in a pair is mapped and the other is unmapped, both go to "outm" because pairs are always kept together. Whereas "mapped" is tracking the status of individual reads - whether they mapped or not. That's confusing; if some of the reads map unpaired (as in this case), then the counts don't add up. But, I added it at the request of people who were doing human filtering, because they wanted to know exactly how many reads survived that phase, which is the number you get from the "unmapped" line when you use the "printunmappedcount" flag.



                        • #13
                          Thanks, Brian

                          yes, I verified that wc -l of clean_1.fq (divided by 2) is equal to the unmapped read count.

                          So following what you said, which is clear, then 666 reads or 333 pairs are missing in action? I didn't specify an outm file. Could that change behavior? Would you be worried if you were me? Thanks,



                          • #14
                            Originally posted by sk8bro View Post
                            Thanks, Brian

                            yes, I verified that wc -l of clean_1.fq (divided by 2) is equal to the unmapped read count.

                            So following what you said, which is clear, then 666 reads or 333 pairs are missing in action? I didn't specify an outm file. Could that change behavior? Would you be worried if you were me? Thanks,
                            It doesn't matter whether or not you specify "outm" or "outu". "unmapped" just counts the number of reads that end up in "outu", or that would end up in outu if you did specify it.

                            There's no reason to be worried. Sorry that the counts don't quite add up when you're using paired reads, but that's because they track different things.



                            • #15
                              You wrote: "But it should be "unmapped+mapped1+mapped2=used"

                              I just wanted to confirm that it is okay that the numbers also don't fit this formula.

                              611706+374+844 = 612924 != 613590

                              Then, if I wanted to report the mapping rate, is this the correct formula?

                              mapped rate = (used - unmapped) / used

                              Is there a way to export to outm only if both reads of a pair map? The default, I think, is to export if either read of the pair maps?

                              Thank you for clarifying.
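                              For what it's worth, both checks can be run directly against the counts from the earlier output (a quick sketch; the rate formula is the one proposed above, not an official BBMap definition):

                              ```python
                              # Counts copied from the BBMap output earlier in the thread.
                              used = 613590       # "Reads Used"
                              unmapped = 611706   # "unmapped" (printunmappedcount)
                              mapped_r1 = 374     # Read 1 "mapped"
                              mapped_r2 = 844     # Read 2 "mapped"

                              # unmapped + mapped1 + mapped2 falls short of "used" by 666 reads:
                              total = unmapped + mapped_r1 + mapped_r2
                              print(total, used - total)        # 612924 666

                              # Mapping rate under the (used - unmapped) / used definition:
                              rate = (used - unmapped) / used
                              print(f"{100 * rate:.3f}%")       # 0.307%
                              ```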
