SEQanswers

Old 10-29-2015, 09:27 AM   #1
sk8bro
Member
 
Location: boston

Join Date: Feb 2012
Posts: 30
Generate FASTQ files for entire sample pool, remove human reads, then demultiplex

I have 125 bp paired-end HiSeq data with dual Nextera indexes (each 8 bp). My data comes from a single lane of the flowcell, and I have that lane's .bcl and .cpos files. Using these files, I then demultiplex with Picard.

However, I would like to remove all human reads from the whole data set before I ever demultiplex the samples. That way, human reads would never be identified on a per-sample basis, only on a group-wide basis, and would be filtered out of my downstream analysis pipeline as early as possible.

Can anyone suggest a way to do this? Is it feasible?

It seems that if I could get Step 3 to work, this would do it.

Step 1: Completely skip all barcode bases in the reads and generate my PE fastq files
Step 2: Tag human reads from these de-identified fastq files using some tool
Step 3: Somehow use Step 2's tags to remove human data from the raw .bcl and .cpos files
Step 4: De-multiplex again using the barcode


Also, any favorite tool for human read filtering from Hiseq data (and justification)?

Any comparison between Picard for demux and Illumina's bcl2fastq or other demultiplexers? It seems to me that a demultiplexer that takes into account single indels would be nice.

Old 10-29-2015, 09:42 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,975

With the demultiplexed (or non-demultiplexed) read files, use BBSplit to separate out the human data, using the human genome as the reference.

bcl2fastq will allow errors in barcodes (I am not sure whether the Picard tool does; I have not looked carefully). You would need access to the entire run folder in order to run bcl2fastq.
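A minimal BBSplit command along these lines might look like the following sketch (file names and the hg38 reference are placeholders, not tested values; `%` in `basename` is replaced by the reference name and `#` by the read number):

```shell
# Map reads against the human reference; pairs where neither read maps
# (i.e. non-human data) are written to outu1/outu2.
bbsplit.sh in1=reads_1.fq.gz in2=reads_2.fq.gz \
    ref=hg38.fa \
    basename=human_%_#.fq.gz \
    outu1=clean_1.fq.gz outu2=clean_2.fq.gz
```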

Old 10-29-2015, 11:39 AM   #3
sk8bro

BBSplit looks like a good tool for filtering human reads from the non-demultiplexed FASTQ files. This would give me, say, clean_1.fq and clean_2.fq. But I am still not sure how I would then demultiplex the clean FASTQ files; as far as I know, the barcode data wouldn't be stored in them.

Is BBSplit (well, really BBMap) much better than, say, Bowtie2 for human read filtering? I have used Bowtie2, so I would understand a comparison against it. It is nice that BBSplit directly outputs FASTQ files, and if I wanted to align to other genomes besides human, BBSplit seems great. I will download and test it out, but any anecdotal justification/comparison would be helpful!

Yes, Picard has four options (max mismatches, min mismatch delta, max no-calls, and read quality) that I can set to my desired stringency. I haven't used bcl2fastq, but there must be similar parameters. The limitation of Picard is that it seems to look only for substitution errors, not indels. It is doing a fine job; I am just curious whether something more sophisticated is out there. TagGD seemed worth trying, but I don't see how to use it on dual-indexed data.

Thx!

Old 10-29-2015, 11:45 AM   #4
sk8bro

Also, because I am running my samples through a core facility and the other lanes aren't my data, I don't have access to the whole run folder, just my lane. I didn't know whether bcl2fastq would accept that, but your comment suggests it won't.

Here is a link to TagGD paper:

http://journals.plos.org/plosone/art...l.pone.0057521
Old 10-29-2015, 02:12 PM   #5
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 503

Yes, bcl2fastq conversion requires the entire run folder.

The rationale for your workflow is unclear. It seems like your desired output is demultiplexed FASTQs from which the human reads have been removed. The only way to identify the human data is by alignment, which requires FASTQs as input. And there's no way to remove the human data from the BCLs anyway (or to back-convert FASTQs to BCLs). So why not just perform standard bcl2fastq conversion/demultiplexing and follow GenoMax's suggestion to use BBSplit, which is designed to do precisely what you seem to want?

The BBMap site contains ROC curves comparing BBMap to a number of aligners, including Bowtie 2.

Old 10-29-2015, 04:32 PM   #6
GenoMax

Quote:
Originally Posted by sk8bro View Post
Also, because I am running my samples through a core, and the rest of the lanes aren't my data, I don't have access to the whole run folder. Just my lane. I didn't know whether bcl2fastq would like that, but your comment suggests that it won't.
Ask the core to do the FASTQ conversion; it would be trivial for them. bcl2fastq will allow mismatches in the "tag" itself, which can potentially be useful in recovering additional data. If your use case requires only perfect matches for the tag read, then that is not important. I don't think Picard will allow mismatches in the tag reads during demultiplexing (but then, I have never used Picard for this purpose).

There is no installation/compilation with BBMap. You will have to build the indexes once, but that is also painless and not very time-consuming, even for the human genome.
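For example (paths hypothetical; BBMap builds the index on first use and caches it under ./ref by default):

```shell
# One-time index build for the human genome; later runs reuse ./ref
bbmap.sh ref=hg38.fa

# Subsequent filtering runs then skip the build step entirely:
bbmap.sh in1=reads_1.fq in2=reads_2.fq outu=clean.fq outm=human.fq
```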
Old 10-29-2015, 08:46 PM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Quote:
Originally Posted by sk8bro View Post
Is BBsplit (well really BBmap) much better than say bowtie2 for human read filtering?
BBSplit is incomparably better than anything else that I know of for partitioning short reads between multiple references (bear in mind, I am the author), because it maps to all of them simultaneously and picks the best per read or pair. However, it is designed specifically for that case - when you have all of the references of interest, or for iterative assemblies where you are simultaneously building multiple organisms and want to split the reads between them and reassemble.

If you are sequencing an unknown organism, have no assembly, and simply want to remove human reads, I recommend this procedure. This will ensure the removal of the ~98.6% of reads that can with very high confidence be assigned to human, as opposed to non-animals (tested using 2x150bp reads). The remaining 1.4% is mostly very highly conserved, or very low complexity (e.g. ACACACACAC....) and cannot be confidently assigned to a specific organism.

Of course, you can use an unmasked version of the human genome and higher-sensitivity mapping (for example, BBMap's defaults, which are very sensitive) if you want to remove absolutely all human contamination. But I don't recommend this unless you are dealing only with bacteria and archaea, and nothing multicellular or from other kingdoms. If you are interested in eukaryotes, it's generally best to remove human reads as per the above protocol, then assemble, then remove human contigs, which are hopefully longer, more accurate, and more complex (by virtue of being longer) than reads, and thus will match the human genome even though their constituent reads did not.
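As a rough sketch, the low-sensitivity filtering pass described above might look something like this (the flag values are illustrative assumptions, and hg19_masked.fa stands in for the masked human reference):

```shell
# Fast, low-sensitivity mapping against the masked human genome;
# confidently human reads go to outm, everything else is kept in outu.
bbmap.sh in1=reads_1.fq in2=reads_2.fq ref=hg19_masked.fa \
    minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 \
    outu=clean.fq outm=human.fq
```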


Old 10-30-2015, 03:29 AM   #8
sk8bro

GenoMax: Picard CAN allow for mismatches, and it lets me look for unexpected barcodes, see the effects of parameters, etc. My core facility already performs default bcl2fastq conversion, of course; it is just not what I need right now, though I do like comparing against it. I have a hard time believing that bcl2fastq can't allow higher-stringency settings, but I do understand why it would be a pain for a core to use special settings for just my data.

HESmith: If I can't back-convert FASTQs to BCLs, then I agree this is impossible. Why can't they be back-converted, though? Do you mean there just isn't a tool to do this, or that a tool couldn't be made? Is there no tool because there is no demand, or for some other technical reason? The alternative of removing human reads from the demultiplexed FASTQ files is fine, but I do think there is a user base in the clinical world who would be all about never identifying human reads on a per-sample basis.

Brian: Thanks for all of your suggestions. I will give your tools a try! They do seem like a great suite to get familiar with, and I love the masked human genome option. One question: I am currently using Trimmomatic to pre-process my reads (I trim adapters, trim the first few and last bases that are often lower quality, then do a sliding-window trim using 4 bp and a Q18 average). Would you recommend human removal before or after pre-processing? I see you have BBTrim also, which I will look into, because I don't entirely understand the numbers I get out of Trimmomatic. For instance, using the head/tail trim of a few bases gives me more data retained than if I skip this step and use only the sliding-window quality trim. Basically, it seems like I should replace my whole pipeline with BBTools.
Old 10-30-2015, 04:11 AM   #9
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367

Quote:
Why can't they be back-converted though? Do you mean there just isn't a tool to do this? Or that a tool couldn't be made? Is there no tool because there is no desire for this or some other technical reason? The alternative of removing human reads from the de-multiplexed fastq files is fine, but I do think there is a user-base in the clinical world who would be all about never identifying human reads on a per-sample basis.
I've been following the conversation, and I still don't understand the rationale.
No one has written such a tool, because no one sees the purpose of writing such a tool. Just make your life simple, and remove the human reads from the FASTQ files. It's very simple to write a script that will cycle through all the FASTQ files.
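A loop along those lines, as a sketch (the <sample>_R1.fastq.gz / <sample>_R2.fastq.gz naming convention is an assumption, and the echo is left in so nothing runs until you remove it):

```shell
# Cycle over paired FASTQ files and filter each pair against the human reference.
for r1 in *_R1.fastq.gz; do
    sample=${r1%_R1.fastq.gz}    # strip the suffix to recover the sample name
    r2=${sample}_R2.fastq.gz
    echo bbsplit.sh in1="$r1" in2="$r2" ref=hg38.fa \
        outu1="${sample}_clean_R1.fastq.gz" outu2="${sample}_clean_R2.fastq.gz"
done
```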

I've used Xenome in the past to distinguish reads from xenografts of human tumors in mice from the mouse reads.
I wouldn't particularly recommend the software, but it does have one interesting feature: it classifies the reads into three separate categories, mouse, human, and ambiguous.

You don't mention which species you are sequencing, but this is a point to take into consideration if the species are closely related.

Old 10-30-2015, 10:06 AM   #10
Brian Bushnell

Quote:
Originally Posted by sk8bro View Post
Brian: Thanks for all of your suggestions. I will give your tools a try! They do seem like a great suite to get familiar with, thank you. I love the masked human genome option. One question... I am using Trimmomatic currently to pre-process my reads (I trim adapters, trim the first few and last bases that are often lower quality, then do a sliding window trim using 4 bp and Q18 average). Would you recommend human removal before or after pre-processing?
Human removal should go after adapter-trimming and quality-trimming or quality-filtering, as adapter-sequence (for example) will cause mismatches and thus make the reads look less similar to human.

Quote:
I see you have BBTrim also, which I will look into because I don't entirely understand the numbers I get out of Trimmomatic. For instance, using the head/tail trim of a few bases gives me more data retained than if I skip this step and use only the sliding window quality trim.
BBDuk actually does support window trimming, but I don't recommend it as it is not optimal. Rather, running BBDuk with "qtrim=rl trimq=10" will, for example, quality-trim to Q10 and leave the longest possible sequence of average quality of at least 10 that cannot be extended without adding regions with average quality below 10. Depending on your experiment, be wary of setting too high a quality-trim threshold; it will bias your analyses and can lead to inferior results. In many scenarios, such as mapping, no quality-trimming at all is needed unless you have low quality data. And forcibly trimming a fixed number of bases from each end is generally detrimental unless there is a specific problem you are trying to address.
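For reference, an adapter-trim plus Q10 quality-trim along the lines described above might look like this sketch (file names are placeholders; adapters.fa is the adapter reference that ships with BBTools):

```shell
# ktrim=r trims adapters off the right end; tpe/tbo use pair overlap to catch
# adapters that kmer matching alone would miss; qtrim=rl trimq=10 then
# quality-trims both ends to Q10.
bbduk.sh in1=raw_1.fq in2=raw_2.fq out1=trimmed_1.fq out2=trimmed_2.fq \
    ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo \
    qtrim=rl trimq=10
```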

Quote:
Basically, it is seeming like I should replace all of my pipeline with BB tools .
I won't disagree...

Old 11-16-2015, 08:08 AM   #11
sk8bro

Hi Brian (or any BBMap user), I am wondering if you can explain the read accounting that BBMap is doing.

Basically, why doesn't 613590 - 273 - 374 - 844 = #unmapped? (612099 vs. 611706)
Code:
Reads Used:             613590  (75291382 bases)

mated pairs:              0.0890%             273         0.0865%              65100
unmapped:                99.6930%          611706        99.6948%           75061564

Read 1 data:            pct reads       num reads       pct bases          num bases
mapped:                   0.1219%             374         0.1190%              44827

Read 2 data:            pct reads       num reads       pct bases          num bases
mapped:                   0.2751%             844         0.2760%             103810
Thanks; liking the BB suite so far.


Old 11-16-2015, 08:46 AM   #12
Brian Bushnell

Hmmm... that's a good question. But it should be "unmapped+mapped1+mapped2=used"; the mated pairs are a subset of the mapped reads already.

"Unmapped" is currently tracking whether the reads go to "outm" or "outu". If one read in a pair is mapped and the other is unmapped, both go to "outm" because pairs are always kept together. Whereas "mapped" is tracking the status of individual reads - whether they mapped or not. That's confusing; if some of the reads map unpaired (as in this case), then the counts don't add up. But, I added it at the request of people who were doing human filtering, because they wanted to know exactly how many reads survived that phase, which is the number you get from the "unmapped" line when you use the "printunmappedcount" flag.

Old 11-16-2015, 09:21 AM   #13
sk8bro

Thanks, Brian

Yes, I verified that wc -l of clean_1.fq, divided by 2, is equal to the unmapped read count.

So, following what you said (which is clear), 666 reads, or 333 pairs, are missing in action? I didn't specify an outm file; could that change the behavior? Would you be worried if you were me? Thanks.
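For what it's worth, the wc -l check is consistent with the reported count; a quick sketch of the arithmetic:

```shell
unmapped=611706                  # total unmapped reads, both ends of each pair
pairs=$((unmapped / 2))          # clean_1.fq holds read 1 of each surviving pair
lines=$((pairs * 4))             # FASTQ stores each read as 4 lines
echo $((lines / 2))              # wc -l divided by 2 recovers 611706
```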

Old 11-16-2015, 09:25 AM   #14
Brian Bushnell

Quote:
Originally Posted by sk8bro View Post
Thanks, Brian

yes, I verified that wc -l of clean_1.fq (divided by 2) is equal to the unmapped read count.

So following what you said, which is clear, then 666 reads or 333 pairs are missing in action? I didn't specify an outm file. Could that change behavior? Would you be worried if you were me? Thanks,
It doesn't matter whether or not you specify "outm" or "outu". "unmapped" just counts the number of reads that end up in "outu", or that would end up in outu if you did specify it.

There's no reason to be worried. Sorry that the counts don't quite add up when you're using paired reads, but that's because they track different things.

Old 11-16-2015, 09:37 AM   #15
sk8bro

You wrote: "But it should be 'unmapped+mapped1+mapped2=used'."

I just wanted to confirm that it is okay that the numbers don't fit this formula either:

611706 + 374 + 844 = 612924 != 613590

Then, if I wanted to report the mapping rate, is this the correct formula?

mapped = (used - unmapped) / used

Is there a way to export to outm only if both reads of a pair map? I think the default is to export if either read maps.

Thank you for clarifying.

Old 11-16-2015, 10:12 AM   #16
Brian Bushnell

Quote:
Originally Posted by sk8bro View Post
You wrote: "But it should be "unmapped+mapped1+mapped2=used"

I just wanted to confirm that it is okay that the numbers also don't fit this formula.

611706+374+844 = 612924 != 613590
Yes, I know. It only works out exactly for single-ended reads.

Quote:
Then, if I wanted to report the mapping rate, is this the correct formula?

mapped = used-unmapped/used
Depends on what you want to report, but yes, the number of reads that get sent to outm are "used-unmapped". The number of reads that actually mapped (meaning they match the reference) is mapped1+mapped2, which is slightly different (for paired reads).

Quote:
Is there a way to export to outm only if both pairs map? Default I think is to export if either pair maps?
Yes, you can use the "pairedonly" flag, which will consider a pair mapped only if both reads map in the correct orientation (same contig, opposite strands, within the maximum insert size).

Old 11-16-2015, 10:43 AM   #17
sk8bro

That all sounds great

Have you ever compared to SNAP in terms of human read removal?

Old 11-16-2015, 10:59 AM   #18
Brian Bushnell

SNAP is faster than BBMap, but much less sensitive. I would never use it for most mapping tasks, but for contamination removal it might be acceptable, because you want fairly low sensitivity anyway to avoid false positives. A fairer comparison in that case might be BBDuk vs. SNAP, as they both use long kmers and would probably have similar speed and sensitivity.

But no, I have not specifically looked at SNAP for contamination removal.

Old 11-17-2015, 09:13 AM   #19
sk8bro

Thank you for the explanation and your extensive suite of tools.

Old 11-19-2015, 07:46 AM   #20
sk8bro

Hi Brian, I was wondering about this... If BAM/BAI is what I want at the end of the day, do you still recommend FASTQ output and then conversion?

"BBSplit is recommended for fastq and fasta output, not for sam/bam output"

It seems that sam output with this option would give me what I'm after without any extra work:

bs=<file> Write a shell script to 'file' that will turn the sam output into a sorted, indexed bam file.
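If you'd rather not use the bs= helper script, the same result comes from the standard samtools steps (assuming samtools is installed; file names are placeholders):

```shell
# Coordinate-sort the SAM into a BAM, then index it
samtools sort -o mapped.sorted.bam mapped.sam
samtools index mapped.sorted.bam    # writes mapped.sorted.bam.bai
```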
Tags
demultiplexing, human reads