SEQanswers

Old 11-08-2014, 07:19 AM   #321
Marisa_Miller
Member
 
Location: St. Paul, MN

Join Date: Aug 2010
Posts: 34
Default

Quote:
Originally Posted by GenoMax View Post
Heptamers are too short for blast searches. You should search with (a few full length) read(s) to get results.

http://www.ncbi.nlm.nih.gov/blast/Why.shtml
I didn't realize that heptamers would be too short to blast. I have searched with a few full reads and they all come up as wheat or wheat wild relative mitochondrial or plastid sequence, which is exactly what they should be since the libraries were prepared from purified organellar DNA from wheat.
Old 11-08-2014, 12:11 PM   #322
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

Perfect. You can move on to the analysis (do run your sequences through bbduk or trimmomatic to ensure that there are no remnants of adapters etc).
Old 11-09-2014, 05:41 AM   #323
gab0
Member
 
Location: Talca, Chile

Join Date: Apr 2014
Posts: 11
Default

Quote:
Originally Posted by Marisa_Miller View Post
I didn't realize that heptamers would be too short to blast. I have searched with a few full reads and they all come up as wheat or wheat wild relative mitochondrial or plastid sequence, which is exactly what they should be since the libraries were prepared from purified organellar DNA from wheat.
Hi Marisa:

I faced a similar problem some time ago and, like you, I was puzzled by the weird Kmer content.

What I did was check by eye whether the Kmers overlapped, and then I searched with the consensus sequence, which turned out to match Illumina sequences. I don't have the exact information right now but, as I recall, they aligned with the sequences from "Process Controls for TruSeq Sample Preparation Kits". There's a PDF from Illumina on the web, illumina-customer-sequence-letter, describing these sequences.
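
The "overlap by eye" step can also be done with a few lines of code. This is only a minimal sketch of a greedy suffix/prefix merge, not what FastQC or any assembler does; the function name, the `min_overlap` value and the example Kmers are made up for illustration.

```python
def merge_overlapping_kmers(kmers, min_overlap=4):
    """Greedily chain sequences whose suffix/prefix overlap by >= min_overlap bases."""

    def overlap(a, b):
        # Length of the longest suffix of a that is also a prefix of b.
        for n in range(min(len(a), len(b)), 0, -1):
            if a.endswith(b[:n]):
                return n
        return 0

    contigs = list(dict.fromkeys(kmers))  # drop duplicates, keep input order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n < min_overlap:
            return contigs  # nothing left that overlaps well enough
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


# Illustrative heptamers only; paste in the Kmers flagged in your own report.
print(merge_overlapping_kmers(["AGATCGG", "GATCGGA", "ATCGGAA", "TCGGAAG"]))
```

The merged sequence(s) can then be used for a BLAST search or compared against the adapter and control sequences listed by Illumina.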

I found that these sequences were over-represented in the library, which is why FastQC was warning about Kmer content.

I finally removed those sequences from my fastq files and then performed the assembly. They *should not* affect my assembly, as they matched only with the adapters AND the identity % between them was very high. But just in case, I removed them.

Maybe that can help you solve this issue!

Cheers,

Gabriel
Old 11-11-2014, 03:09 AM   #324
Marisa_Miller
Member
 
Location: St. Paul, MN

Join Date: Aug 2010
Posts: 34
Default

Quote:
Originally Posted by gab0 View Post
Hi Marisa:

I faced a similar problem some time ago and, like you, I was puzzled by the weird Kmer content.

What I did was check by eye whether the Kmers overlapped, and then I searched with the consensus sequence, which turned out to match Illumina sequences. I don't have the exact information right now but, as I recall, they aligned with the sequences from "Process Controls for TruSeq Sample Preparation Kits". There's a PDF from Illumina on the web, illumina-customer-sequence-letter, describing these sequences.

I found that these sequences were over-represented in the library, which is why FastQC was warning about Kmer content.

I finally removed those sequences from my fastq files and then performed the assembly. They *should not* affect my assembly, as they matched only with the adapters AND the identity % between them was very high. But just in case, I removed them.

Maybe that can help you solve this issue!

Cheers,

Gabriel
Hi Gabriel,
Thanks for the info! I will do as you suggest and see if I can overlap those sequences to find out what they are.

Thanks!
Old 11-18-2014, 05:41 AM   #325
Haiopai
Junior Member
 
Location: Norway

Join Date: Jun 2014
Posts: 1
Default

Quote:
Originally Posted by simonandrews View Post
...
Does anyone here know if this format is something which is actually generated by an Illumina sequencer, or is it something an individual or maybe the ENA have done to the file? I can add a quick fix to just abandon the module if too many tiles are predicted, but if this is a format which might be more generally about then I should try to cope with this properly.
...
I have to come back to the bug regarding the tile module. I just ran an RNA-Seq simulation with the Flux Simulator. It outputs reads with identifiers that at first glance look like the ones from Illumina (numbers separated by ':') but have completely different meanings. http://sammeth.net/confluence/displa...ad+Identifiers

I simply renamed the reads afterwards using sed, but in general it would be good to have some kind of mechanism, as you proposed, that skips the tile module when there are too many tiles. The --limits option also seems good (as soon as it works), but I had a hard time figuring out why the error occurred. So automatic "overflow" detection would be good, maybe with a warning in the log file or error output.
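
To illustrate the kind of "overflow" check meant here, below is a rough sketch that counts distinct tile values in read headers and gives up once there are implausibly many. It is not FastQC's actual code: the field positions are my assumption about the two common Illumina header layouts, and the `max_tiles` cut-off of 500 is an arbitrary illustrative value.

```python
def extract_tile(header):
    """Return the tile field of an Illumina-style read ID, or None if it doesn't fit."""
    parts = header.lstrip("@").split()
    if not parts:
        return None
    fields = parts[0].split(":")
    if len(fields) >= 7:   # assumed layout: instrument:run:flowcell:lane:tile:x:y
        return fields[4]
    if len(fields) >= 5:   # assumed older layout: instrument:lane:tile:x:y
        return fields[2]
    return None


def tiles_look_plausible(headers, max_tiles=500):
    """Return False as soon as more than max_tiles distinct tile values are seen."""
    seen = set()
    for header in headers:
        tile = extract_tile(header)
        if tile is not None:
            seen.add(tile)
            if len(seen) > max_tiles:
                return False  # the "tile" field is probably not a tile at all
    return True
```

A per-tile analysis could call `tiles_look_plausible` on the headers it has read so far and skip itself, with a warning, when the result is False, instead of allocating memory for every "tile".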
Old 12-12-2014, 01:06 PM   #326
sahodges
Junior Member
 
Location: Santa Barbara

Join Date: Apr 2011
Posts: 2
Default Color Codes of Per Tile Sequence Quality

I had a run that clearly had a flow cell problem, in that the per-tile sequence quality degraded starting around cycle 35 for a cluster of tiles. In the attached plot there are 6 clusters, but these tiles are actually all contiguous (and on both sides of the flow cell).

What I'm trying to understand is how to interpret the colors, i.e., how poor is the quality for green/yellow/orange/red? When I look at the fastqc_data file that's generated, there's a "Mean" value for each tile and base, but I'm confused as to what this is a mean of. The numbers don't make sense to me as quality scores. For instance, for base 1 the mean scores range from -1.6 to 0.73852 (average is 0) for the 96 tiles, so this seems to be some sort of variance measure?

Thanks for any help in interpretation!
Attached Files
File Type: pdf Tile Sequence Quality.pdf (29.3 KB, 21 views)
Old 12-13-2014, 06:55 AM   #327
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by sahodges View Post
I had a run that clearly had a flow cell problem, in that the per-tile sequence quality degraded starting around cycle 35 for a cluster of tiles. In the attached plot there are 6 clusters, but these tiles are actually all contiguous (and on both sides of the flow cell).

What I'm trying to understand is how to interpret the colors, i.e., how poor is the quality for green/yellow/orange/red? When I look at the fastqc_data file that's generated, there's a "Mean" value for each tile and base, but I'm confused as to what this is a mean of. The numbers don't make sense to me as quality scores. For instance, for base 1 the mean scores range from -1.6 to 0.73852 (average is 0) for the 96 tiles, so this seems to be some sort of variance measure?

Thanks for any help in interpretation!
The tiles on the FastQC plot are sorted by number, so non-contiguous numbers might actually be adjacent on the physical flowcell - you'd need to look at the Illumina documentation for the flowcell version you used (there's more than one).

The numbers you get are Phred score differences from the mean Phred score for that cycle of sequencing. For example, if you had a chemistry cycle where the mean Phred score across all tiles was 30, and in tile 1234 the average Phred score was 25, then the per-tile measure for that tile would be -5.
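
As a worked version of that example, here is a small sketch of the calculation for a single cycle. It assumes the cycle mean is the plain (unweighted) average of the per-tile means, which may not be exactly how FastQC weights it; the function name and tile values are invented to reproduce the 30 vs 25 case.

```python
def per_tile_deviation(tile_means):
    """tile_means maps tile ID -> mean Phred score of that tile for one cycle."""
    # Assumption: the cycle mean is the unweighted average of the tile means.
    cycle_mean = sum(tile_means.values()) / len(tile_means)
    return {tile: mean - cycle_mean for tile, mean in tile_means.items()}


# Made-up numbers: the mean across the four tiles is 30, tile 1234 averages 25.
deviations = per_tile_deviation({"1234": 25.0, "1235": 31.0, "1236": 31.0, "1237": 33.0})
print(deviations["1234"])  # -5.0
```

Under that assumption the deviations for a cycle always average out to zero, which fits values for one base that scatter on both sides of 0.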

The colour scale goes from 0 down to 0 minus whatever error threshold you have set in your config file (limits.txt). The threshold is 10 by default, so the scale runs from 0 to -10.

Hope this helps.
Old 12-13-2014, 07:24 AM   #328
sahodges
Junior Member
 
Location: Santa Barbara

Join Date: Apr 2011
Posts: 2
Default

Yes that does help. Thanks very much! Really appreciate your fast response too!

I had figured out the tile location coding - the bad tiles are, in fact, all adjacent.
Old 12-21-2014, 08:18 PM   #329
tv195
Junior Member
 
Location: Australia

Join Date: Dec 2014
Posts: 2
Question FastQC Exception in thread "main" java.awt.HeadlessException

Hi,
A quick question before Christmas, I get the following error:

java -version

java version "1.6.0_33"
OpenJDK Runtime Environment (IcedTea6 1.13.5) (6b33-1.13.5-1ubuntu0.12.04)
OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)

./fastqc

Exception in thread "main" java.awt.HeadlessException
at java.awt.GraphicsEnvironment.checkHeadless(GraphicsEnvironment.java:173)
at java.awt.Window.<init>(Window.java:547)
at java.awt.Frame.<init>(Frame.java:419)
at java.awt.Frame.<init>(Frame.java:384)
at javax.swing.JFrame.<init>(JFrame.java:174)
at uk.ac.babraham.FastQC.FastQCApplication.<init>(FastQCApplication.java:71)
at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:324)

Thanks,
Chris
Old 12-22-2014, 03:26 AM   #330
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

@chris: Simon will no doubt swing by, but it appears that you are trying to run fastqc in interactive mode on a server where X11 is not set up or not present.

You can get around this by running fastqc non-interactively. Just specify the name of the sequence file(s) on the command line like this:

Code:
$ fastqc seq_file1 seq_file2
Old 12-22-2014, 03:40 PM   #331
tv195
Junior Member
 
Location: Australia

Join Date: Dec 2014
Posts: 2
Thumbs up non-interactive mode works

Hi, great thanks a lot! The non-interactive mode works fine and produces an .html file with the results. That's all I need.
Thanks,
Chris
Old 02-18-2015, 08:26 AM   #332
liz_is
Member
 
Location: London

Join Date: Nov 2012
Posts: 10
Default

Hi,

I haven't been following this thread in the meantime, but I previously had this issue with fastqc - the header format was causing it to think there were more tiles than was sensible.

Quote:
Does anyone here know if this format is something which is actually generated by an Illumina sequencer, or is it something an individual or maybe the ENA have done to the file?
Now I've come across the same issue with a completely different dataset, so I thought I'd let you know! It seems to be the same problem, as the devel version that fixed it last time also works on this data. The dataset is here: http://www.ncbi.nlm.nih.gov/geo/quer...acc=GSM1004802
Old 02-19-2015, 12:39 AM   #333
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by liz_is View Post
Hi,

I haven't been following this thread in the meantime, but I previously had this issue with fastqc - the header format was causing it to think there were more tiles than was sensible.

Now I've come across the same issue with a completely different dataset, so I thought I'd let you know! It seems to be the same problem, as the devel version that fixed it last time also works on this data. The dataset is here: http://www.ncbi.nlm.nih.gov/geo/quer...acc=GSM1004802
Thanks for reporting this. I think this is the same issue as before and is fixed in the development version. We're really close to being able to release an update to finally address this. There are two outstanding bugs which we want to close:
  1. Some headers are mis-recognised as tile identifiers and suck up all available memory
  2. A non thread-safe counter can cause the program to hang when processing multiple files at once (the processing is actually complete but the program doesn't recognise that)

Hopefully there will be a new release next week to sort these out.
Old 02-23-2015, 02:28 PM   #334
yoyoming1001
Member
 
Location: West Coast

Join Date: Aug 2010
Posts: 15
Default Anyone have any publicly available non-html output files?

I'm looking for any FastQC results output tables. Any help pointing me in the right direction would be really appreciated.
Old 02-23-2015, 05:16 PM   #335
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

Quote:
Originally Posted by yoyoming1001 View Post
I'm looking for any FastQC results output tables. Any help pointing me in the right direction would be really appreciated.
Like I said in the other thread, you can download some public data from NCBI SRA and create your own.
Old 02-23-2015, 05:19 PM   #336
gab0
Member
 
Location: Talca, Chile

Join Date: Apr 2014
Posts: 11
Default

Yes, you can download some fastq files and then run FastQC on them.
Old 02-24-2015, 12:15 AM   #337
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by yoyoming1001 View Post
I'm looking for any FastQC results output tables. Any help pointing me in the right direction would be really appreciated.
All of the example reports shown on the fastqc project page also have the accompanying text output available (just not linked from the html page), so you're welcome to download those.

www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.zip
www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.zip
www.bioinformatics.babraham.ac.uk/projects/fastqc/small_rna_fastqc.zip
www.bioinformatics.babraham.ac.uk/projects/fastqc/RNA-Seq_fastqc.zip
www.bioinformatics.babraham.ac.uk/projects/fastqc/RRBS_fastqc.zip
www.bioinformatics.babraham.ac.uk/projects/fastqc/pacbio_srr075104_fastqc.zip
www.bioinformatics.babraham.ac.uk/projects/fastqc/454_SRR073599_fastqc.zip
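
For anyone who wants those tables in a script, here is a minimal sketch of reading the text output out of one of the zip files. It assumes the zip contains a file named fastqc_data.txt in which each module starts with a `>>Name<TAB>status` line, has a `#`-prefixed column header and ends with `>>END_MODULE`; the function names are just for illustration.

```python
import zipfile


def parse_modules(text):
    """Split fastqc_data.txt content into {module name: {status, header, rows}}."""
    modules = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">>END_MODULE"):
            name = None
        elif line.startswith(">>"):
            name, _, status = line[2:].partition("\t")
            modules[name] = {"status": status, "header": [], "rows": []}
        elif name is not None and line.startswith("#"):
            # If a module has several '#' lines, the last one is kept as the header.
            modules[name]["header"] = line[1:].split("\t")
        elif name is not None and line:
            modules[name]["rows"].append(line.split("\t"))
    return modules


def read_modules(zip_path):
    """Parse the fastqc_data.txt found inside a FastQC report zip."""
    with zipfile.ZipFile(zip_path) as archive:
        member = next(n for n in archive.namelist() if n.endswith("fastqc_data.txt"))
        text = archive.read(member).decode("utf-8")
    return parse_modules(text)
```

After downloading one of the files above, `read_modules("bad_sequence_fastqc.zip")` would return a dictionary keyed by module name, with all values kept as strings.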
Old 03-25-2015, 01:57 AM   #338
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

FastQC v0.11.3 has just been released. It fixes a few annoying bugs which have been mentioned on here before, and adds some support for processing folders of nanopore sequencing reads in HDF5 format.

http://www.bioinformatics.babraham.a...ojects/fastqc/
Old 07-30-2015, 05:27 AM   #339
makost
Junior Member
 
Location: Cambridge

Join Date: Jun 2010
Posts: 5
Default

Dear Simon,
I am trying to run FastQC v0.11.2 on CentOS 6.6 and I get the following error message:

Exception in thread "main" java.lang.NoClassDefFoundError: org/itadaki/bzip2/BZip2InputStream
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:104)
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:62)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.processFile(OfflineRunner.java:122)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.<init>(OfflineRunner.java:95)
at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:308)
Caused by: java.lang.ClassNotFoundException: org.itadaki.bzip2.BZip2InputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

I tried different Java versions, but to no avail. Any idea what's wrong?

Thanks
Old 07-30-2015, 05:47 AM   #340
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by makost View Post
Dear Simon,
I am trying to run FastQC v0.11.2 on CentOS 6.6 and I get the following error message:

Exception in thread "main" java.lang.NoClassDefFoundError: org/itadaki/bzip2/BZip2InputStream
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:104)
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:62)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.processFile(OfflineRunner.java:122)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.<init>(OfflineRunner.java:95)
at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:308)
Caused by: java.lang.ClassNotFoundException: org.itadaki.bzip2.BZip2InputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

I tried different Java versions, but to no avail. Any idea what's wrong?

Thanks
That error suggests that one of the bundled jar files which ships with FastQC (jbzip2-0.9.jar) is missing from your installation.
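
One way to confirm this is to check whether any jar in the installation folder actually contains the class named in the error. The sketch below is a generic check, not part of FastQC: it relies only on jar files being zip archives that store `a.b.C` as `a/b/C.class`, and the function name and the directory you pass in are placeholders.

```python
import zipfile
from pathlib import Path


def jars_containing(class_name, search_dir):
    """Return the jar files under search_dir that contain the given Java class."""
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(search_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if entry in archive.namelist():
                    hits.append(jar)
        except zipfile.BadZipFile:
            continue  # skip truncated or corrupt jars
    return hits
```

Calling `jars_containing("org.itadaki.bzip2.BZip2InputStream", "/path/to/FastQC")` with the real installation path in place of the placeholder should list the jbzip2 jar; an empty list would mean the library really is missing or damaged.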

The quickest and easiest fix would be to download the latest version, extract the contents and try that (it will contain a fresh copy of the missing library). If that doesn't work, then email me directly and I can go through some further debugging with you.

Simon.