Old 04-04-2016, 01:39 AM   #1
interpreting preseq results


I am running preseq c_curve and lc_extrap on a few files using the -V option.
(Q: Does it make sense at all to run preseq on the FASTQ files, or would it be more accurate to run it on the mapped files?)

I was wondering, though, how to interpret the results I am getting.
For example, I have a FASTQ file with the summarized results shown below.
I am using the following script to run multiple files through preseq:
for file in *.fastq.gz; do
    base=$(basename "$file" .fastq.gz)
    # count how often the first 20 bases of each read occur (line 2 of every 4-line FASTQ record)
    zcat "${base}.fastq.gz" | awk '{if (NR%4==2) print substr($0,1,20);}' | sort | uniq -c | awk '{print $1,$2}' > "${base}.counts"
    preseq c_curve   -v -V -o "${base}.preseq.complexity" "${base}.counts" 2> "${base}.complexitySummary.text"
    preseq lc_extrap -v -V -o "${base}.preseq.yields"     "${base}.counts" 2> "${base}.yieldsSummary.text"
done
I get these values in the output files from the two commands (including my interpretations of the specific rows):
TOTAL READS     = 3130582 - how many reads I have in the library
COUNTS_SUM      = 3130582 - how many reads were counted in the run
DISTINCT READS  = 513863 - that many distinct reads were found
DISTINCT COUNTS = 197 - what does that mean?
MAX COUNT       = 1131097 - the sequence with the highest copy number
COUNTS OF 1     = 254836 - number of unique reads in the library
OBSERVED COUNTS (1131098) - what does that mean?

TOTAL READS     = 3130582 - same as above
DISTINCT READS  = 513863 - same as above
DISTINCT COUNTS = 197 - what does that mean?
MAX COUNT       = 1131097 - same as above
COUNTS OF 1     = 254836 - same as above
MAX TERMS       = 100 - what does that mean?
OBSERVED COUNTS (1131098) - what does that mean?
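
For what it is worth, most of these summary values can be recomputed directly from the .counts file, which is one way to check the interpretations above. The sketch below is a minimal awk version and not preseq's own code; the function name summarize_counts is made up for illustration, and it assumes column 1 of the file holds the per-sequence count as written by uniq -c. Under that assumption, DISTINCT COUNTS would be the number of different count values that occur in the file, and COUNTS OF 1 the number of sequences seen exactly once.

```shell
# Hypothetical helper (not part of preseq): recompute summary statistics
# from a "count sequence" file such as ${base}.counts.
# Assumption: column 1 is the number of times each distinct sequence was seen.
summarize_counts() {
    awk '
        {
            total += $1                 # sum of all counts
            distinct++                  # one row per distinct sequence
            seen[$1]++                  # histogram: how many sequences have this count
            if ($1 > max) max = $1      # largest count of any single sequence
        }
        END {
            n = 0
            for (c in seen) n++         # number of different count values observed
            printf "TOTAL READS     = %d\n", total
            printf "DISTINCT READS  = %d\n", distinct
            printf "DISTINCT COUNTS = %d\n", n
            printf "MAX COUNT       = %d\n", max
            printf "COUNTS OF 1     = %d\n", seen[1]
        }
    ' "$1"
}
```

Calling it as summarize_counts sample.counts (sample.counts being a placeholder name) prints the five values in the same layout as the summary above, so the two can be compared side by side.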
The results from the two runs are as follows (c_curve first, then lc_extrap):
total_reads	distinct_reads
0	0
1000000	294482
2000000	414438
3000000	503167

0	0	0	0
1000000.0	294958.5	210028.8	414231.3
2000000.0	414987.0	304029.5	566439.1
3000000.0	503926.0	362806.3	699936.5
4000000.0	583253.5	410283.3	829145.8
5000000.0	658204.7	453968.8	954324.2
9996000000.0	8300141.2	538673.3	127892632.4
9997000000.0	8300152.7	538664.3	127895121.7
9998000000.0	8300164.1	538655.3	127897610.5
9999000000.0	8300175.6	538646.3	127900099.0
Q: Do I understand correctly that my experiment has ~3.1M reads, of which ~255K are unique, and that if I sequence the same library deeper, to a depth of 9999M reads, I will have ~8.3M unique reads?
How should I interpret the two confidence-interval columns? (see image below)

Q: Is there a way to tell when a library is of such low quality / complexity that it is not worth investigating further?
The example I have given here is, in my opinion, not a very good library, as it has a lot of repeats (a single read accounts for about a third of the data). I know such experiments are probably never black and white, but a rule of thumb would be nice :-)
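
One rough way to put a number on "is it worth sequencing deeper" is the marginal yield: how many new distinct reads each additional read is expected to contribute between two consecutive rows of the yields table. The sketch below is only an illustration under my own assumptions; marginal_yield is a made-up function name, it takes column 1 as total reads and column 2 as expected distinct reads, and it is not an established cutoff or anything preseq reports itself.

```shell
# Hypothetical helper (not part of preseq): marginal yield between
# consecutive rows of a yields table.
# Assumption: column 1 = total reads, column 2 = expected distinct reads.
marginal_yield() {
    awk '
        $1 !~ /^[0-9.]+$/ { next }      # skip a header line, if there is one
        have_prev && $1 > prev_reads {
            # new distinct reads gained per additional read sequenced
            printf "%.0f\t%.4f\n", $1, ($2 - prev_distinct) / ($1 - prev_reads)
        }
        { prev_reads = $1; prev_distinct = $2; have_prev = 1 }
    ' "$1"
}
```

For the first rows of the table above this gives roughly 0.29 new distinct reads per read up to 1M reads and roughly 0.12 between 1M and 2M, so the value falling towards zero is what a saturating library would look like.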

Q: What should the curve look like for a "good" and for a "bad" library?
Below are the plots I get for this library (made quickly in Excel):
[image: c_curve and lc_plot]


Last edited by frymor; 04-04-2016 at 02:09 AM.