SEQanswers

Old 07-13-2011, 10:52 PM   #181
Clare S
Junior Member
 
Location: Melbourne, Australia

Join Date: Jan 2010
Posts: 5
Default

Hi Simon,

Thanks! I'll get our version upgraded and try it out. I appreciate the support!

Clare
Clare S is offline   Reply With Quote
Old 07-14-2011, 04:43 AM   #182
husamia
Member
 
Location: cinci

Join Date: Apr 2010
Posts: 66
Default

The program is very memory-efficient, which lets it run on almost any system. I do have one concern, though, for applications with large numbers of sequence reads, which obviously need a larger system to analyse. The "Duplicate Sequences" documentation says: "To cut down on the memory requirements for this module only sequences which occur in the first 200,000 sequences in each file are analysed, but this should be enough to get a good impression for the duplication levels in the whole file." Can you provide an option to analyse the entire file? Applications such as exome sequencing produce around 40 million sequence reads; I'd like to see more like 90% of my entire sequence set, and 200k is only around 0.5%. I know this may take longer to process, but it's worth it for me. Is it possible to allow processing the entire file, or 90% of the file depending on its size, rather than a hard cut-off value like 200k? Do you think this would be more representative of my entire set? I think this applies to the "Overrepresented sequences" and "Overrepresented Kmers" analyses as well.

Last edited by husamia; 07-14-2011 at 04:47 AM.
husamia is offline   Reply With Quote
Old 07-14-2011, 07:09 AM   #183
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by husamia View Post
Applications such as exome sequencing produce around 40 million sequence reads; I'd like to see more like 90% of my entire sequence set, and 200k is only around 0.5%. I know this may take longer to process, but it's worth it for me. Is it possible to allow processing the entire file, or 90% of the file depending on its size, rather than a hard cut-off value like 200k? Do you think this would be more representative of my entire set? I think this applies to the "Overrepresented sequences" and "Overrepresented Kmers" analyses as well.
You raise a few points which I'll try to cover:

Firstly, the sampled coverage only applies to the Overrepresented sequences and duplicate sequences modules. The Kmer analysis always uses everything.

For the other two modules, it's not that the program only analyses the first 200k sequences; it's that it analyses the first 200k *different* sequences it finds, and then tracks those to the end of the file. Any new copies of those same sequences seen later are counted into the analysis, but new sequences not seen within the first 200k different sequences are not tracked. The duplicate plot then extrapolates from these results to what it would likely have seen had it been able to analyse every sequence.
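In pseudocode terms, the sampling scheme just described amounts to something like the following sketch (a simplified model for illustration only, with a toy cap instead of 200k; it is not FastQC's actual code):

```python
from collections import Counter

def duplication_levels(reads, max_tracked=200_000):
    """Count copies of the first max_tracked *distinct* sequences seen."""
    counts = Counter()
    for read in reads:
        if read in counts:
            counts[read] += 1          # already tracked: keep counting copies
        elif len(counts) < max_tracked:
            counts[read] = 1           # a new sequence, still under the cap
        # else: sequences first seen after the cap are never tracked
    # Collapse to duplication levels: {1: n unique, 2: n duplicated, ...}
    return Counter(counts.values())

reads = ["AAA", "CCC", "AAA", "GGG", "TTT", "CCC", "AAA"]
print(duplication_levels(reads, max_tracked=3))
```

Here "TTT" first arrives after the cap of three distinct sequences is full, so it is never tracked, while the later copies of "AAA" and "CCC" are still counted - which is why the relative duplication levels stay fair even though not every sequence is tracked.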

For the duplicate sequence plot this gives a perfectly fair view of your data since you're only trying to measure the relative number of unique, duplicated, triplicated etc sequences so it's not necessary to see every sequence.

For the overrepresented sequence list it's theoretically possible that there are sequences which make up a significant proportion of the total yet don't appear at all within the first 200k. However in your example even if every sequence were different you'd have to be fairly unlucky not to see an overrepresented sequence in the first 200k - you'd have a 50% chance of seeing something which made up 0.5% of your data, and the odds get better for more enriched sequences.

The problem with not analysing more (or even all) of the sequences isn't the time to process, since all sequences are examined anyway; it's the amount of RAM this could consume. If every sequence in a file were different then you would have to store them all in memory to construct the plot. For a 40-million-read file this could equate to 10GB+ for these modules alone, which isn't going to be feasible for most people.

The one significant problem which could occur in these modules is that they assume sequences arrive in a random order. If, however, your file isn't random (a sorted BAM file, for example), then the first sequences you see may not be representative of the whole file. I can't see an easy way around this if I want to be able to analyse the whole dataset in a single pass with reasonable memory usage (any suggestions are welcome!).
simonandrews is offline   Reply With Quote
Old 07-14-2011, 08:26 AM   #184
Jeremy37
Member
 
Location: Montreal, Canada

Join Date: Feb 2011
Posts: 17
Default

If it's straightforward to implement, then it seems like it could be a useful option to specify the maximum number of distinct sequences to keep in memory, with the default being 200k. This would be similar to the MAX_RECORDS_IN_RAM option in Picard.

A few GB of RAM per sample is feasible for me, and I suspect a few others, so if this were available I might set the limit to a couple million rather than 200k, just to get as accurate results as possible. (Though I wouldn't set it to infinity...) It's not a feature I'm dying for, however, so I'm not too concerned either way.
Jeremy37 is offline   Reply With Quote
Old 08-06-2011, 06:12 AM   #185
Yilong Li
Member
 
Location: WTSI

Join Date: Dec 2010
Posts: 41
Default

Hi Simon,

Have you published FastQC anywhere yet? Any idea how I should cite it?
Yilong Li is offline   Reply With Quote
Old 08-08-2011, 11:07 PM   #186
chenyao
Member
 
Location: Beijing

Join Date: Jul 2011
Posts: 74
Default

Quote:
Originally Posted by simonandrews View Post
It looks like you're trying to use a file which isn't in FastQ format. I'd guess from the name of your file that it's a fastA file? FastQC is designed to work with file formats which include both sequence and quality data, which fastA doesn't have. You could, I suppose, make up a fake FastQ format from your fastA file if you really wanted to, but it would probably be better to find your raw sequence output and run that through the program, rather than trying to analyse your assembled contigs directly.

If you let us know what kind of data you have and what you're trying to find out we may be able to offer other suggestions.
I used a "csfasta" file of SOLiD data and got the same error. Doesn't FastQC support this format?

Any solutions?
chenyao is offline   Reply With Quote
Old 09-22-2011, 05:02 AM   #187
jeny
Member
 
Location: france

Join Date: Mar 2010
Posts: 16
Default

I downloaded the latest FastQC version to analyze fastq files generated by the CASAVA 1.8 pipeline.
But running it on PhiX data returned this message:
...
Approx 95% complete for PhiX_NoIndex_L008_R1.fastq.gz
Analysis complete for PhiX_NoIndex_L008_R1.fastq.gz
Failed to process file PhiX_NoIndex_L008_R1.fastq.gz
java.awt.HeadlessException
at java.awt.dnd.DropTarget.<init>(libgcj.so.90)
at java.awt.dnd.DropTarget.<init>(libgcj.so.90)
at javax.swing.JComponent.<init>(libgcj.so.90)
at javax.swing.JPanel.<init>(libgcj.so.90)
at javax.swing.JPanel.<init>(libgcj.so.90)
at uk.ac.bbsrc.babraham.FastQC.Graphs.QualityBoxPlot.<init>(QualityBoxPlot.java:52)
at uk.ac.bbsrc.babraham.FastQC.Modules.PerBaseQualityScores.makeReport(PerBaseQualityScores.java:194)
at uk.ac.bbsrc.babraham.FastQC.Report.HTMLReportArchive.<init>(HTMLReportArchive.java:93)
at uk.ac.bbsrc.babraham.FastQC.Analysis.OfflineRunner.analysisComplete(OfflineRunner.java:157)
at uk.ac.bbsrc.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:108)
at java.lang.Thread.run(libgcj.so.90)

Java is installed; the version is 1.5.0.
Command line was : fastqc --casava Sample_PhiX/PhiX_NoIndex_L008_R1_00*

I have no idea what went wrong.
Any help appreciated.
jeny is offline   Reply With Quote
Old 09-22-2011, 05:09 AM   #188
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by jeny View Post
java.awt.HeadlessException
We've seen this error with some older JREs (particularly some versions of GCJ) which have a bug where they fail to work with offscreen displays, even if the awt.headless flag is set to indicate that there is no locally connected display.

The simple fix is to install the latest Oracle JRE onto the machine and use that to run FastQC. As it's a bug in Java itself, there's nothing we can do within the FastQC code to fix it.

This assumes that you're using the fastqc wrapper to launch the program - if you've constructed your own command then try running through the wrapper which will set these kinds of flags for you.
simonandrews is offline   Reply With Quote
Old 09-22-2011, 05:28 AM   #189
jeny
Member
 
Location: france

Join Date: Mar 2010
Posts: 16
Default

OK, thanks for your quick reply.
I'll install the latest JRE.
Cheers
jeny is offline   Reply With Quote
Old 11-02-2011, 11:19 AM   #190
mnkyboy
Member
 
Location: Seattle, WA

Join Date: Mar 2009
Posts: 87
Default

I have a question about the contaminants list.

Is there a limit to the length of the sequence I can add to the file, or a length at which the tool may have a problem?
mnkyboy is offline   Reply With Quote
Old 11-03-2011, 12:20 AM   #191
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by mnkyboy View Post
I have a question about the contaminants list.

Is there a limit to the length of the sequence I can add to the file, or a length at which the tool may have a problem?
There isn't an intrinsic limit (other than the amount of RAM you have), but I don't think it would make sense to include sequences of more than a few hundred bases. The search algorithm is very simplistic, doing a pretty inefficient search for linear matches against the sequences in the file. Also, since the matches reported are a simple yes/no (with percentages, but without any positional information), I'm not sure how useful they'd be if you put in a really long sequence.

Having said all of this, we've never tried using long sequences, so have a go and let us know if it's actually useful, or if there are simple changes we could make which would make it more usable.
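To illustrate the kind of simple linear matching described above, a naive sketch (the adapter entries here are illustrative only, not FastQC's shipped contaminant list, and this is not FastQC's actual search code):

```python
def find_contaminants(read, contaminants):
    """Return names of contaminants occurring verbatim within the read."""
    return [name for name, seq in contaminants.items() if seq in read]

# Illustrative entries only - not FastQC's shipped contaminant list
contaminants = {
    "Example adapter A": "AGATCGGAAGAGC",
    "Example adapter B": "CTGTCTCTTATACACATCT",
}
print(find_contaminants("TTTAGATCGGAAGAGCTTTT", contaminants))  # ['Example adapter A']
```

An exact-substring match like this yields a yes/no per contaminant with no positional information, which is why entries much beyond a few hundred bases add little.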
simonandrews is offline   Reply With Quote
Old 11-11-2011, 08:18 AM   #192
marcaill
Junior Member
 
Location: paris

Join Date: Jul 2008
Posts: 6
Default

Hi Simon,

About duplicates: for a PE run, I obtain duplicate results for read 1 and read 2 separately, so it's effectively a single-read evaluation. Is it possible to get the real PE duplicates?
About the result: I understand that the proportion is calculated from unique reads. Would it be possible to obtain the duplicate proportion relative to the total reads? For exome sequencing, that's what interests us, but at the moment we need a (time-consuming) alignment to get that QC.

I'm still a new user of your really useful tool, so sorry if I've misunderstood some features...
marcaill is offline   Reply With Quote
Old 11-12-2011, 10:01 AM   #193
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default

Hi Simon,

First off, this is an excellent piece of software which I have only recently started to use. I was trying to analyze some of our old HiSeq data, which are all output as huge gzipped Fastq files (10-15 GB each). Because of the size, they take a long time to load in FastQC, and I can't easily split the files unless I unzip them first. Any chance you could add to FastQC a way to read only a certain number or percentage of reads? That way I could have a file with 100m reads but only read, say, 1m - which is sufficient for the analysis and much faster.

Thanks for the great program once again.
kga1978 is offline   Reply With Quote
Old 11-13-2011, 11:54 PM   #194
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by marcaill View Post
Hi Simon,
About duplicates: for a PE run, I obtain duplicate results for read 1 and read 2 separately, so it's effectively a single-read evaluation. Is it possible to get the real PE duplicates?
We've actually been thinking about how to handle paired end data better. This is particularly problematic for paired end BAM files where the results of the two ends are just merged together in a single report at the moment. What I guess we'd need would be a combined report where most of the graphs are duplicated, but things like the summary and duplicate plot could account for both ends. At the moment we're just limited on the amount of time we can put into this kind of development.

Quote:
Originally Posted by marcaill View Post
About the result: I understand that the proportion is calculated from unique reads. Would it be possible to obtain the duplicate proportion relative to the total reads? For exome sequencing, that's what interests us, but at the moment we need a (time-consuming) alignment to get that QC.
Doing an exhaustive duplicate calculation is problematic only because of the potential memory usage. We have people putting hundreds of millions of reads through the program, and in the worst case (every read is different) we'd need to hold all of those reads in memory (in a not hugely efficient data structure) to calculate duplicate levels. This would mean either reserving huge amounts of RAM at startup (making the program unusable on most desktop machines) or having the program fail part way through (ironically, only if you have really good data).

Thinking about what you asked, I'm sure it's possible to work out a way to extrapolate to real numbers from the data we have, but I don't think the conversion is entirely trivial. If anyone wants to help working out the correct way to convert from the proportion tracking we currently have to working back to real numbers then please contact me directly and I'd be happy to have a chat about this.
simonandrews is offline   Reply With Quote
Old 11-13-2011, 11:57 PM   #195
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by kga1978 View Post
Hi Simon,

First off, this is an excellent piece of software which I have only recently started to use. I was trying to analyze some of our old HiSeq data, which are all output as huge gzipped Fastq files (10-15 GB each). Because of the size, they take a long time to load in FastQC, and I can't easily split the files unless I unzip them first. Any chance you could add to FastQC a way to read only a certain number or percentage of reads? That way I could have a file with 100m reads but only read, say, 1m - which is sufficient for the analysis and much faster.

Thanks for the great program once again.
As long as you're not interested in getting correct summary information for the total number of sequences etc. then this should be simple enough to do, and I'll look at adding it as an option to the next release. The limit would have to be either an absolute number of sequences or a percentage of the file size (not of the number of sequences). I guess setting a sequence-number limit would be easiest, since the program would naturally stop if the file finished early.

Limiting the analysis in this way would also mean that the duplicate sequence plot wouldn't be correct: it would underestimate duplication levels, showing only the duplication you actually got within the subset of the file you analysed.
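Until such an option exists, a subset can be prepared outside FastQC. A rough sketch of that workaround (the function name and file paths are placeholders, not part of FastQC):

```python
import gzip
import itertools

def head_fastq(src_path, dest_path, max_reads=1_000_000):
    """Copy the first max_reads FASTQ records (4 lines each) to dest_path."""
    opener = gzip.open if src_path.endswith(".gz") else open
    with opener(src_path, "rt") as src, open(dest_path, "w") as dest:
        for line in itertools.islice(src, 4 * max_reads):
            dest.write(line)

# head_fastq("huge.fastq.gz", "subset.fastq")  # then: fastqc subset.fastq
```

Streaming through `islice` avoids ever unzipping the whole file to disk, which was the bottleneck described above.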
simonandrews is offline   Reply With Quote
Old 11-14-2011, 04:44 AM   #196
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default

That sounds great, thanks - getting an estimate from a subset of reads will be good enough for most of my analyses. I take out duplicates anyway (with prinseq), so losing that information is okay.
kga1978 is offline   Reply With Quote
Old 11-29-2011, 06:14 AM   #197
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default

Hi Simon,
I am trying to use FastQC as part of a pipeline, using /dev/stdin as my input (I have to unzip my files before passing them to FastQC). I redirect my report using '-o', but there doesn't appear to be any way to give the report a name. The problem is that I will be processing multiple files, so each report needs a unique name containing the sample name - and I haven't found a way to do that when using stdin. Any thoughts? My command is as follows:

gunzip *.gz -c | fastqc -f fastq /dev/stdin -o /Volumes/Storage_1/Sequencing_1/Reports/

Thanks very much
kga1978 is offline   Reply With Quote
Old 11-29-2011, 06:39 AM   #198
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by kga1978 View Post
Hi Simon,
I am trying to use FastQC as part of a pipeline, using /dev/stdin as my input (I have to unzip my files before passing them to FastQC). I redirect my report using '-o', but there doesn't appear to be any way to give the report a name. The problem is that I will be processing multiple files, so each report needs a unique name containing the sample name - and I haven't found a way to do that when using stdin. Any thoughts? My command is as follows:

gunzip *.gz -c | fastqc -f fastq /dev/stdin -o /Volumes/Storage_1/Sequencing_1/Reports/

Thanks very much
FastQC doesn't support reading from stdin in its current incarnation. If you're doing this to merge together the multiple files generated by the Illumina pipeline then you can use the --casava option and pass in all of the fastq.gz files; FastQC will merge them together appropriately and write out a combined analysis report for each lane.
simonandrews is offline   Reply With Quote
Old 11-29-2011, 06:57 AM   #199
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default

Hey Simon,

I have tried that, but the casava option doesn't appear to work correctly on my files. I get the following:

Code:
fastqc --casava Sample_O_215-1_225-2_225TGACCAreads0*.gz
File 'Sample_O_215-1_225-2_225TGACCAreads001.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads002.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads003.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads004.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads005.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads006.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads007.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads008.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads009.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads010.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads011.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads012.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads013.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads014.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads015.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads016.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads017.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads018.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads019.gz' didn't look like part of a CASAVA group
I have tried adding the files individually as well, but I got the same error. Any thoughts?
kga1978 is offline   Reply With Quote
Old 11-29-2011, 07:38 AM   #200
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by kga1978 View Post
File 'Sample_O_215-1_225-2_225TGACCAreads001.gz' didn't look like part of a CASAVA group
Those names don't look like the names generated by CASAVA. According to the docs I've got, the fastq file names should follow the pattern:

<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz

That's the pattern FastQC looks for. The end of your file names seems to have been changed, so FastQC isn't able to group them together. I deliberately stuck quite closely to the official spec, as I didn't want to end up merging files which shouldn't be merged. I assumed that no one would bother going through and changing the names of all of the individual files, but it looks like I was wrong :-)
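Expressed as a regular expression, the pattern above comes out roughly like this (a sketch of the quoted naming convention only; the matching code inside FastQC may differ in detail):

```python
import re

# Sketch of the CASAVA 1.8 file-name convention quoted above
CASAVA_RE = re.compile(
    r"^(?P<sample>.+)_"               # sample name
    r"(?P<barcode>[ACGT]+|NoIndex)_"  # barcode sequence
    r"L(?P<lane>\d{3})_"              # lane, zero-padded to 3 digits
    r"R(?P<read>\d+)_"                # read number
    r"(?P<set>\d{3})"                 # set number, zero-padded to 3 digits
    r"\.fastq\.gz$"
)

print(bool(CASAVA_RE.match("PhiX_NoIndex_L008_R1_001.fastq.gz")))  # True
print(bool(CASAVA_RE.match("Sample_O_225TGACCAreads001.gz")))      # False
```

The renamed files in the listing above fail because the `_L###_R#_###.fastq.gz` tail is missing, so there is nothing for the grouping logic to latch onto.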
simonandrews is offline   Reply With Quote