Old 12-10-2014, 01:36 PM   #1
Fernas
Member
 
Location: Middle East

Join Date: Apr 2013
Posts: 74
Default How to plot the Saturation curves to assess the sequencing depth?

Dear all,

I want to know if there is a tool or ready-script that I can use to plot the Saturation (Rarefaction) plot in order to assess the sequencing depth of the sequenced libraries that I have.
Old 12-10-2014, 03:14 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

I have a program for plotting library uniqueness as you go through the reads. The graphs look like this:


It works by pulling kmers from each input read, testing whether each kmer has been seen before, and then storing it in a table.

The bottom line, "first", tracks whether the first kmer of the read has been seen before (independent of whether it is read 1 or read 2).

The top line, "pair", indicates whether a combined kmer from both read 1 and read 2 has been seen before. The other lines are generally safe to ignore but they track other things, like read1- or read2-specific data, and random kmers versus the first kmer.

It plots a point every X reads (configurable, default 25000).

In noncumulative mode (default), a point indicates "for the last X reads, this percentage had never been seen before". In this mode, once the line hits zero, sequencing more is not useful.

In cumulative mode, a point indicates "for all reads, this percentage had never been seen before", but still only one point is plotted per X reads.

Sample command line:

bbcountunique.sh in=reads.fq out=histogram.txt

Note that the lines are not perfectly smooth; the little peaks are caused by high-error tiles. But it's still useful in that it allows assessment of a library that lacks a reference.
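If you want to recreate a plot like the attached one yourself, a minimal Python sketch along these lines should work (I'm assuming the output file is a whitespace-delimited table whose commented header line names columns such as count, first, and pair; check the header of your own histogram.txt and adjust the names if they differ):

```python
# Sketch: plot the output of bbcountunique.sh with matplotlib.
# Assumes histogram.txt is a whitespace-delimited table whose first (commented)
# line names the columns, e.g. "#count first rand ... pair ...".
import matplotlib.pyplot as plt

def read_histogram(path):
    with open(path) as fh:
        header = fh.readline().lstrip("#").split()
        rows = [line.split() for line in fh if line.strip()]
    return {name: [float(r[i]) for r in rows] for i, name in enumerate(header)}

cols = read_histogram("histogram.txt")
for name in ("first", "pair"):              # the two most informative curves
    if name in cols:
        plt.plot(cols["count"], cols[name], label=name)
plt.xlabel("reads processed")
plt.ylabel("percent unique")
plt.legend()
plt.savefig("uniqueness_plot.png")
```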
[Attached image: uniqueness.png — example library-uniqueness plot]

Old 12-10-2014, 05:55 PM   #3
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 308
Default

Thanks Brian ! Once more!
Old 12-10-2014, 08:10 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Smile

Quote:
Originally Posted by luc View Post
Thanks Brian ! Once more!
You're welcome
Old 12-12-2014, 07:40 AM   #5
Fernas
Member
 
Location: Middle East

Join Date: Apr 2013
Posts: 74
Default

Thanks very much Brian.

I downloaded BBMap, and when I tried to run bbcountunique.sh I got the following error: "Exception in thread "main" java.lang.UnsupportedClassVersionError: jgi/CalcUniqueness".

I downloaded a new version of the JRE, but I am wondering how to use it with the code. Do I need to modify the code or the PATH variable? Do I need to download another file?
Old 12-12-2014, 10:47 AM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by Fernas View Post
Thanks very much Brian.

I downloaded BBMap, and when I tried to run bbcountunique.sh I got the following error: "Exception in thread "main" java.lang.UnsupportedClassVersionError: jgi/CalcUniqueness".

I downloaded a new version of the JRE, but I am wondering how to use it with the code. Do I need to modify the code or the PATH variable? Do I need to download another file?
Assuming you downloaded and installed a Java 7 JRE, the program should just work. If you are still getting the same error, then the JRE was not installed correctly; you can resolve that by editing your PATH variable to remove the path to the old java executable (on our system, it is "/usr/common/usg/languages/java/jdk/oracle/1.7.0_51_x86_64/bin/") and adding the path to the new java executable. Alternatively, you can edit the shellscript "bbcountunique.sh". Line 80, 5 lines from the bottom, is:

local CMD="java $EA $z -cp $CP jgi.CalcUniqueness $@"

If your new java executable is at, say, "/usr/jdk/oracle/1.7.0_51_x86_64/bin/java", then you would just change that line to:

local CMD="/usr/jdk/oracle/1.7.0_51_x86_64/bin/java $EA $z -cp $CP jgi.CalcUniqueness $@"

If you can't figure out how to install the new JRE or don't have the proper permissions, then run "java -version" and copy and paste the output into this thread.
Old 12-13-2014, 01:23 AM   #7
Fernas
Member
 
Location: Middle East

Join Date: Apr 2013
Posts: 74
Default

It works now.
Thank you very much indeed, Brian!
Old 02-11-2015, 09:36 PM   #8
arash82
Member
 
Location: Sweden

Join Date: Oct 2014
Posts: 11
Default

Quote:
Originally Posted by Brian Bushnell View Post
In noncumulative mode (default), a point indicates "for the last X reads, this percentage had never been seen before". In this mode, once the line hits zero, sequencing more is not useful.

In cumulative mode, a point indicates "for all reads, this percentage had never been seen before", but still only one point is plotted per X reads.
First, thanks Brian for this tool... I am trying to use it on pilot data to determine how to multiplex my samples.

To my question: I am not entirely sure I understand how to interpret the results.

In default mode I get a curve that plateaus around 35% between 30-50M reads; it doesn't seem to move towards zero. I'd like to interpret this as meaning there is no point in sequencing more than 30M reads, but that wouldn't be consistent with your statement. It would appear that I keep getting ~30% new sequences more or less forever!?

Could you clarify, or am I doing something wrong? And how should you interpret the cumulative mode?

Thanks,
Arash

PS. I have three columns. Could you also clarify what the rand column means?

Old 02-12-2015, 06:44 PM   #9
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi Arash,

For each read, the first kmer is created and a kmer from a random location is created. Each of these kmers is looked up in a table to determine if it has been seen before. There is a separate table for first kmers and for random kmers; if you are using paired reads, there are also separate tables for read 1 and read 2. If the kmer has not been seen before, that read is considered "unique" for that metric and the kmer is stored. Otherwise the read is considered "non-unique". Every 25000 reads (by default) a row is printed showing the unique rate. In cumulative mode (which I personally never use!) the numbers in a row apply to all reads (so you can never reach zero!); in noncumulative mode, the number applies to only the last 25000 reads (so you will reach 0% uniqueness as soon as you get a batch of 25000 consecutive reads that have all been seen before).
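If it helps to see the idea in pseudocode, here is a minimal Python sketch of just the noncumulative "first kmer" metric (an illustration only, not the actual CalcUniqueness source):

```python
# Noncumulative "first kmer" uniqueness, simplified for illustration.
# The real tool also tracks random kmers, separate read-1/read-2 tables,
# and a combined read-1 + read-2 ("pair") kmer hash.
def first_kmer_uniqueness(reads, k=25, interval=25000):
    seen = set()
    unique = in_batch = total = 0
    for seq in reads:                     # seq = read sequence as a string
        if len(seq) < k:
            continue
        kmer = seq[:k]
        if kmer not in seen:              # never observed before -> "unique"
            unique += 1
            seen.add(kmer)
        in_batch += 1
        total += 1
        if in_batch == interval:          # one output point per `interval` reads
            yield total, 100.0 * unique / interval
            unique = in_batch = 0

# Each yielded row: (reads processed so far, % of the last `interval` reads
# whose first kmer had never been seen before).
```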

"First" column is the percent of reads in which the first kmer has never been seen.
"Rand" column is the percent of reads in which a specific randomly-selected kmer has never been seen.
"Pair" column uses a hash of a specific kmer in read 1 and read 2 that has a fixed position, chosen to have a minimal error rate. Meaning that it reflects the number of unique pairs that have been seen before.

I wrote this tool, and I like it, but I designed it largely to other people's specifications so some of the defaults are a bit odd in my opinion, like the "rand" columns - I typically ignore those!

If you run in noncumulative mode, which I recommend, then you will gain no benefit from additional sequencing once the "pair" column approaches zero (for paired reads) or once the "first" column approaches zero (for single-ended reads). With paired reads, "first" will approach zero way before "pair", and once that happens, you are no longer generating unique reads, just reads that you have seen before but with new insert sizes. In general, there is no reason to sequence further once "first" approaches zero in non-cumulative mode!

However, this tool relies on high data quality. If you have low quality data with substitution errors, or very short inserts such that adapters occur in the middle of the reads, the tool will overestimate uniqueness and never reach zero. For example - if 30% of your reads have an error in the first K bases (K is by default 25), then rather than asymptotically approaching 0% uniqueness, it will approach 30% uniqueness, because kmers with errors in them will usually never have been seen before with that specific error. Mapping-based approaches do not have this problem. So, in practice, this program is ideal for high quality data, but mapping is better for low-quality data. All the little spikes in the picture I posted above are due to a bunch of reads that, for whatever reason (like a bubble in the flow cell), had low quality; if the reads were all error-free, the line would be perfectly smooth.
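To put a rough number on how errors inflate uniqueness: with K=25, even a modest per-base error rate means a large fraction of first kmers contain at least one error (a quick illustration, assuming independent errors):

```python
# Expected fraction of kmers containing at least one error, assuming
# independent per-base errors at rate p over a kmer of length k.
def kmer_error_fraction(p, k=25):
    return 1.0 - (1.0 - p) ** k

print(kmer_error_fraction(0.01))    # ~0.22: Q20 data -> ~22% of first kmers carry an error
print(kmer_error_fraction(0.001))   # ~0.025: Q30 data -> ~2.5%
```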

In summary:

1) Don't use cumulative mode for determining how much to sequence; it's only for calculating the total number of unique reads in a dataset.
2) Ignore the rand column.
3) This tool only provides useful information from decent-quality data; for very low quality data (either a high error rate [under Q15], or very short insert sizes) you need to use mapping.
4) You don't need to sequence more once the "first" column approaches zero. How close it approaches depends on your budget and needs; at 50% uniqueness, with even coverage and 100bp reads, you would have around 100x average coverage.

In some situations, like RNA-seq, single-cell, or metagenomes, in which the sequences have an exponential coverage distribution, you will NEVER reach zero.

-Brian
Old 02-13-2015, 08:26 AM   #10
arash82
Member
 
Location: Sweden

Join Date: Oct 2014
Posts: 11
Default

Dear Brian,

Thanks for the extensive response and clarification on how the program works. Very much appreciated.

I kind of forgot to mention that I am using it on RNA-seq data from a HiSeq 2500. I currently don't have access to the mapped file, but I'll try it on them as soon as I can.

The thing is I am using the program (right now at least) just to determine if I am sequencing deep enough or if I can multiplex further. I don't need a perfect curve, just an estimate. I was thinking of maybe trimming and then running it, but I probably wouldn't gain much from that...

Thanks,
Arash

PS. I was also thinking that the spikes are nice in a way, as a quality indication. I have instances of much higher spikes in my data.
Old 02-13-2015, 10:28 AM   #11
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by arash82 View Post
I kind of forgot to mention that I am using it on RNA-seq data from a HiSeq 2500. I currently don't have access to the mapped file, but I'll try it on them as soon as I can.
Just to clarify, CalcUniqueness does not make any use of mapping information, but it's possible to do a similar analysis with mapping information instead of kmers and there are probably programs that do so.

Quote:
The thing is I am using the program (right now at least) just to determine if I am sequencing deep enough or if I can multiplex further. I don't need a perfect curve, just an estimate. I was thinking of maybe trimming and then running it, but I probably wouldn't gain much from that...
If you want to get an advantage from trimming, you'd have to do fixed-length trimming on the left (like, removing the first 5 bases). Quality-trimming the right end won't affect the graphs (other than the "rand" column) unless the reads end up shorter than a kmer, and variable-length trimming on the left end would wreck them because the kmers would no longer start in the same place for previously identical reads. Quality-filtering and adapter-trimming might help, though:

bbduk.sh in=reads.fq out=clean.fq maq=15 ktrim=r k=25 mink=11 hdist=1 tpe tbo ref=truseq.fa.gz minlen=40

Here the "maq=15" will throw away reads with average quality below 15 (in other words, an expected error rate of over 1/30 or so), and reads trimmed shorter than 40bp after adapter removal will also be discarded. These may not be optimal settings for actual RNA-seq analysis (since requiring a high average quality can bias quantification), but it should clean up the data a bit to allow generation of more accurate saturation curves.
Old 01-24-2017, 01:57 AM   #12
boulund
Member
 
Location: Sweden

Join Date: Jan 2017
Posts: 16
Default

Quote:
Originally Posted by Brian Bushnell View Post
In some situations, like RNA-seq, single-cell, or metagenomes, in which the sequences have an exponential coverage distribution, you will NEVER reach zero.
But could this approach still be used with e.g. metagenomics data to get some kind of feeling for whether the sequencing depth is deep enough? I guess what I'm really asking is whether you think it would be reasonable to still expect it to decrease (even if it doesn't reach zero, but instead bottoms out somewhere higher)?
Old 01-24-2017, 09:27 AM   #13
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

It depends on your goals. You can assemble and recover a lot from the higher-depth fraction of most samples. If you can assemble the genes that make up 90% of the DNA by mass in an environment, perhaps that's good enough to determine, for your purposes, what the community looks like and what it does.
Old 01-24-2017, 11:54 PM   #14
boulund
Member
 
Location: Sweden

Join Date: Jan 2017
Posts: 16
Default

Thanks for your really swift reply Brian!

Sorry, I'm not being very clear...
I'm really wondering whether bbcountunique is still useful somehow as a tool for quantifying the saturation of a metagenomic sample.
Old 01-25-2017, 10:15 AM   #15
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

We generally use that tool for determining how good a library preparation method was for an isolate of finite size. For a metagenome, by telling you what percent of the reads are unique as you continue to sequence, you can at least get an idea that... for every $1 I spend on additional sequence, $0.99 is spent on things I've already seen. But actually determining the total size of the metagenome from this kind of data is an open research area, and it's not clear to me if the "total size of a metagenome" is meaningful in the wild. So, I think the answer is that it's a little useful, but not a complete answer.
Old 04-25-2017, 03:59 PM   #16
TomHarrop
Member
 
Location: New Zealand

Join Date: Jul 2014
Posts: 20
Default

Hi Brian,

Thanks for this tool.

What is the perfect_prob column in the results? I can't see it in the docs. Is it the "probability of correctness" for k-mers (reads?) in that bin based on avg_quality?

Also, if your percentage uniqueness for read 1 is only approaching 60% after ~250 M reads, would you keep sequencing?

Cheers,

Tom
[Attached file: uniqueness_histogram.pdf — uniqueness plot for this library]
Old 04-25-2017, 05:24 PM   #17
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi Tom,

perfect_prob is the average probability of a read being error-free within that interval. It's related to the avg_quality, but calculated independently. Possibly, it would make more sense for me to do this just for the kmer being used to track uniqueness rather than the whole read, but it's easiest this way. The reason I provide it is because low-quality regions in the fastq file will show inflated uniqueness, when uniqueness is tracked using this method.
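As a back-of-the-envelope version of that calculation, the probability that a single read is error-free can be estimated from its quality string by multiplying per-base correctness probabilities (assuming independent errors; an illustration, not necessarily the tool's exact formula):

```python
# Probability a read is error-free, estimated from its Phred qualities,
# assuming independent per-base errors.
def p_error_free(quals):                  # quals = list of Phred scores, one per base
    p = 1.0
    for q in quals:
        p *= 1.0 - 10 ** (-q / 10)
    return p

print(p_error_free([30] * 150))           # ~0.86 for a uniformly Q30, 150 bp read
```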

It looks like you're down to about 70% uniqueness for each individual read, which would be at least ~100x coverage for 150-bp reads... that coverage estimate is weighted by the high-coverage genomes, though.

It's hard to say whether or not to sequence more based on this plot alone. You're obviously still generating more unique reads, but they might simply be giving more coverage to areas you can already assemble well. I think the best course is to assemble and see if you end up with a lot of short, low-coverage contigs (in addition to the high-coverage contigs that you will clearly generate)... in which case you do need to sequence more.
Old 04-25-2017, 06:50 PM   #18
TomHarrop
Member
 
Location: New Zealand

Join Date: Jul 2014
Posts: 20
Default

Great, thanks. How did you estimate the average coverage from the % uniqueness? I know I can do it more accurately with BBNorm (which reports a median read depth of 55x), but I'm curious how you did it from that plot. These are 100 bp reads, but it's about 50 Gb of sequencing from a genome we expect to be ~600 Mbp, so you are not far off if coverage were even.
Old 04-25-2017, 07:13 PM   #19
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Well, it's a very rough estimate, but...

If 70% of reads are unique, then assuming an even distribution, 30% of the start sites are taken, meaning there is one read starting for every 1/0.3 = 3.33 bases. For 150bp reads, that would indicate coverage of 150bp/3.33 = 45x. But since read 1 and read 2 are tracked independently, I doubled it to 90x. Then, since errors artificially inflate the uniqueness calculated with this method, and given the % perfect profile, I guessed that maybe I should increase it by ~10%, so I arrived at ~100x coverage, but possibly more if the reads were lower-quality than they seemed based on the mapq.
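In code form, that back-of-the-envelope estimate looks something like this (same assumptions as above: even coverage, read 1 and read 2 tracked independently, and a rough ~10% bump for error-inflated uniqueness):

```python
# Rough coverage estimate from the noncumulative "first" uniqueness fraction.
def rough_coverage(uniqueness, read_len, paired=True, error_inflation=1.10):
    taken = 1.0 - uniqueness        # fraction of start sites already seen
    cov = read_len * taken          # one read starts per (1/taken) bases
    if paired:
        cov *= 2                    # read 1 and read 2 counted separately
    return cov * error_inflation    # errors make reads look more unique than they are

print(rough_coverage(0.70, 150))    # ~99x, the ~100x figure above
print(rough_coverage(0.70, 100))    # ~66x for 100 bp reads
```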

But, those estimates were based on 150bp reads... for 100bp reads the estimate would have been 66x+, which is not too far off from 55x. I initially thought this was a metagenome because of the sharp decrease in uniqueness at the very beginning of the file, but perhaps you just have a highly repetitive genome, or lots of duplicate reads. Was this library PCR-amplified? And did you trim adapters and remove phiX (if you spiked it in) prior to running the program? Also, is this a Nextera library, or what other method did you use for fragmentation? It's unusual for a PCR-free isolate to have such a sharp decrease in uniqueness at the beginning; that indicates there is some sequence that is extremely abundant in the library. Notably, the drop is not present in the paired uniqueness, which is completely linear. I'm not entirely sure what this means.

At any rate, for an isolate, it looks like you've sequenced enough (for a diploid/haploid). Sometimes you can get a better assembly with more coverage, though, up to around 100x. And you certainly can't beat longer reads!

Old 04-25-2017, 07:41 PM   #20
TomHarrop
Member
 
Location: New Zealand

Join Date: Jul 2014
Posts: 20
Default

Thanks for the suggestions.

It's a TruSeq PCR-free library with an insert size around 470 bp according to the Bioanalyzer. I did remove adaptors and contaminants with BBDuk2 (the adapters.fa and phix174_ill files that ship with bbtools), but that only removed 0.05% of bases, so maybe I should look again at the adaptor sequences.

It's a diploid insect. I extracted the DNA from a single, whole individual so there may be some [gut] flora in there... or yes, a repetitive genome, but let's hope not.

PacBio is too expensive for this project and we can't get enough DNA, but I'm looking into getting a MinION for gap closing.