SEQanswers

Old 07-11-2012, 01:13 PM   #221
Patincle
Junior Member
 
Location: Cleveland ohio

Join Date: Apr 2011
Posts: 2
Default

Simon,
I am a newcomer to NGS and FastQC. I love your software.
My 10 FastQ files were generated on an Illumina HiScan. They are 100bp PE reads. In the report I get lots of green ticks, a scattering of gold and 1 consistent red (for every sample, R1 and R2): the duplicated sequences. Duplicates are off the charts in every case. What is going on? My target is small (exons for ~170 genes). This was a custom capture DNA project using Agilent SureSelect. Also, what are the units on the Y-axis in this report graph? And does this one bad mark doom all the samples in terms of usefulness?
patrick
Old 07-11-2012, 11:52 PM   #222
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

If you're capturing a very small region and sequencing this to huge depth then the warning about duplication is probably spurious since you might well be expecting that every sequence will be present multiple times. More details about how to interpret the duplicate plot, and when it's OK to ignore duplication can be found here.
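As a rough illustration of why heavy duplication is expected for a small, deeply sequenced target, here is a back-of-envelope sketch. It assumes reads start uniformly at random over a fixed number of possible start sites, which real capture data does not do, and it is not how FastQC computes its duplication plot. The function name and the example numbers are hypothetical.

```python
import math


def expected_unique_fraction(n_reads: int, n_start_sites: int) -> float:
    """Expected fraction of reads that are distinct, assuming each read
    starts at one of n_start_sites positions chosen uniformly at random
    (a simplifying assumption, not a model of real capture data)."""
    if n_reads <= 0 or n_start_sites <= 0:
        raise ValueError("both arguments must be positive")
    # Expected number of distinct start sites hit: M * (1 - exp(-N / M)).
    expected_distinct = -n_start_sites * math.expm1(-n_reads / n_start_sites)
    return expected_distinct / n_reads


if __name__ == "__main__":
    # Hypothetical numbers: 5 million reads over 500,000 possible start sites.
    fraction = expected_unique_fraction(5_000_000, 500_000)
    print(f"expected unique fraction: {fraction:.3f}")
```

Under these assumptions, once the number of reads is several times the number of possible start sites, most reads are necessarily duplicates of another read, whatever the library quality.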
Old 10-18-2012, 10:50 AM   #223
gokhulkrishnakilaru
Member
 
Location: Bethesda, Maryland

Join Date: Jul 2011
Posts: 39
Default

Quote:
Originally Posted by simonandrews
But it also had a bug in it :-)

This version should work on all systems (if they have perl installed), and will let you set both java arguments and pass in files as arguments. I may add it to the next release.

Code:
#!/usr/bin/perl
use warnings;
use strict;
use FindBin qw($Bin);


if ($ENV{CLASSPATH}) {
	$ENV{CLASSPATH} .= ":$Bin";
}
else {
	$ENV{CLASSPATH} = $Bin;
}

my @java_args = '-Xmx250m';
my @files;

foreach (@ARGV) {
  if (/^\-/) {
    push @java_args,$_;
  }
  else {
    push @files,$_;
  }
}


exec "java",@java_args, "uk.ac.bbsrc.babraham.FastQC.FastQCApplication", @files;

Hi,
This is my fastqc code, after placing the above content into it

Code:
#!/usr/bin/perl
use warnings;
use strict;
use FindBin qw($RealBin);
use Getopt::Long;

# Check to see if they've mistakenly downloaded the source distribution
# since several people have made this mistake

if (-e "$RealBin/uk/ac/babraham/FastQC/FastQCApplication.java") {
        die "This is the source distribution of FastQC.  You need to get the compiled version if you want to run the program\n";
}

my $delimiter = ':';

if ($^O =~ /Win/) {
        $delimiter = ';';
}

if ($ENV{CLASSPATH}) {
        $ENV{CLASSPATH} .= "$delimiter$RealBin$delimiter$RealBin/sam-1.32.jar$delimiter$RealBin/jbzip2-0.9.jar";
}
else {
        $ENV{CLASSPATH} = "$RealBin$delimiter$RealBin/sam-1.32.jar$delimiter$RealBin/jbzip2-0.9.jar";
}


my @java_args = '-Xmx250m';
my @files;


foreach (@ARGV) {
  if (/^\-/) {
    push @java_args,$_;
  }
  else {
    push @files,$_;
  }
}


exec "java",@java_args, "uk.ac.bbsrc.babraham.FastQC.FastQCApplication", @files;
I am hit with an error now.


Code:
FASTQ type: Sanger or Phred+33 (standard, --phred33-quals)
Total reads processed: 40743144
Quality score range: (2, 41)
Converting to Sanger FASTQ...
Conversion done!
Statement unlikely to be reached at /home/bin/fastqc line 47.
        (Maybe you meant system() when you said exec()?)
Unrecognized option: -Xt
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Any pointers would be of great help.
Old 10-19-2012, 04:17 AM   #224
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

I'm not exactly sure what you're trying to do with the code you posted. But in the context of the code you quoted I think all of the changes in there made it into the most recent FastQC release, so you should check the launcher distributed with the latest FastQC to see if it does what you need.
Old 11-25-2013, 05:58 AM   #225
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default Adapter sequences for new fastqc module

I've been working on a new analysis module for FastQC which will specifically plot out the occurrences of a small number of adapter sequences so you can easily tell what benefit you would derive from trimming your data. I've attached an example so you can see what it will look like.

At the moment I only search for 2 adapter sequences: the common start sequence of most Illumina libraries and the Illumina small RNA adapter. This covers all of the sequences we routinely see, but I suspect there are other sequences which are commonly seen in libraries and which would be removed by adapter trimmers. My sequences are below:

Illumina Universal Adapter AGATCGGAAGAG
Illumina Small RNA Adapter ATGGAATTCTCG

If you know of any others, could you please post them here, preferably with a link to a dataset which contains them so I can check the detection is working. You can also email them directly to me (simon.andrews@babraham.ac.uk) if you prefer.

Thanks.
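To make the idea of plotting adapter occurrences concrete, here is a minimal sketch that uses the two sequences listed above. It records the first position at which each adapter appears in a read and reports the cumulative percentage of reads containing it by each position. That counting scheme and the function name are assumptions for illustration, not a description of the module's actual implementation.

```python
from typing import Dict, Iterable, List

# The two 12-mers quoted in the post above.
ADAPTERS = {
    "Illumina Universal Adapter": "AGATCGGAAGAG",
    "Illumina Small RNA Adapter": "ATGGAATTCTCG",
}


def cumulative_adapter_percent(
    reads: Iterable[str], adapters: Dict[str, str] = ADAPTERS
) -> Dict[str, List[float]]:
    """For each adapter, return the percentage of reads in which the
    adapter has started at or before each read position."""
    read_list = list(reads)
    if not read_list:
        return {name: [] for name in adapters}
    max_len = max(len(read) for read in read_list)
    result = {}
    for name, adapter in adapters.items():
        # first_seen[i] = number of reads whose first match starts at i.
        first_seen = [0] * max_len
        for read in read_list:
            position = read.find(adapter)
            if position >= 0:
                first_seen[position] += 1
        running = 0
        percents = []
        for count in first_seen:
            running += count
            percents.append(100.0 * running / len(read_list))
        result[name] = percents
    return result
```

The last value in each list is the share of reads that contain the adapter anywhere, which is roughly the share that trimming would shorten. Exact matching is used here, so adapters with sequencing errors are missed.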
Attached image: adapter_content.jpg (51.3 KB)
Old 01-06-2014, 12:43 PM   #226
JQL
Member
 
Location: MO, USA

Join Date: Apr 2011
Posts: 83
Default fastQC citation

Hi,

Just want to follow up with a citation question. For fastQC, do you still want us to cite the website?

For trim_galore, is this the citation? Is it an actual publication with index?
http://journal.embnet.org/index.php/...ticle/view/200

Please advise.
Old 01-06-2014, 02:16 PM   #227
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

That would be a great new feature, Simon! There is a list of adapters at:
http://support.illumina.com/download...es_letter.ilmn

There are some datasets at the Illumina website you can use as test.

Especially the Nextera adapter would be nice to include.
Old 01-07-2014, 12:53 AM   #228
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by JQL
Hi,

Just want to follow up with a citation question. For fastQC, do you still want us to cite the website?

For trim_galore, is this the citation? Is it an actual publication with index?
http://journal.embnet.org/index.php/...ticle/view/200

Please advise.
Yes, there's still no paper for either FastQC or trim galore (although we're probably going to put one out to go with the next release which is pretty much ready to go). We recommend just citing the URL.
Old 01-07-2014, 12:58 AM   #229
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by boetsie
That would be a great new feature, Simon! There is a list of adapters at:
http://support.illumina.com/download...es_letter.ilmn

There are some datasets at the Illumina website you can use as test.

Especially the Nextera adapter would be nice to include.
Thanks for sending that. Is that really an official posting on Illumina's site? They've been so tight over the years about not officially releasing the sequences of their adapters (so we didn't use sequences supplied by Illumina with FastQC for example), and then they go and post them on their website (along with the warning that you shouldn't post these anywhere!).

I'll take a look through that list but I think all of the Nextera adapters use the same common core as the bulk of their adapters so would get caught by the sequence we're already using.

Another request - if anyone has any nice examples of datasets heavily contaminated with different adapters and would be willing to run a test version of FastQC on them then it would be nice to get some confirmation that we're catching the cases we're after with this new module.
Old 01-07-2014, 01:13 AM   #230
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

I've looked through the official Illumina list and I'm pretty sure the two sequences we have will catch everything on there. If there's anyone with Ion Torrent or SOLiD data which is adapter contaminated, I'd like to get those caught too, but we don't have any suitable data to test with.
Old 01-07-2014, 02:40 AM   #231
gprakhar
Member
 
Location: India

Join Date: Aug 2010
Posts: 78
Default Adapter Contamination Detection

Quote:
Originally Posted by simonandrews
Thanks for sending that. Is that really an official posting on Illumina's site? They've been so tight over the years about not officially releasing the sequences of their adapters (so we didn't use sequences supplied by Illumina with FastQC for example), and then they go and post them on their website (along with the warning that you shouldn't post these anywhere!).

I'll take a look through that list but I think all of the Nextera adapters use the same common core as the bulk of their adapters so would get caught by the sequence we're already using.

Another request - if anyone has any nice examples of datasets heavily contaminated with different adapters and would be willing to run a test version of FastQC on them then it would be nice to get some confirmation that we're catching the cases we're after with this new module.
Hello,

This blog post is also of relevance here.

I am assembling a bacterial genome.
Library details: Illumina MiSeq (from a commercial sequencing provider)
Paired end library:
150bp Read Length
450bp Fragment Length
Mate pair library:
250bp Read Length
300-1200bp (Average 700bp) Fragment Length

I used fastqc (with -k 10) on the mate pair data, both untrimmed and trimmed (using Trimmomatic with Nextera adapters).
The fastqc kmer-profiles plot for untrimmed data:
Untrimmed Read 1
Untrimmed Read 2

The fastqc kmer-profiles plot for trimmed data
(using Trimmomatic 0.32 with Nextera adapters only):
Trimmed Read 1
Trimmed Read 2

An interesting observation is that this problem is not there with the paired end data for the same sample. In my opinion this might be due to the shorter read length (150bp) in comparison to the mate pair data (250bp).

Hope this helps.

--
prakhar
Old 01-07-2014, 03:37 AM   #232
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

Ha, I just wanted to post that blog prakhar!

In addition, the datasets from Illumina's BaseSpace are said to be publicly available: https://basespace.illumina.com/home/index
Old 01-07-2014, 03:40 AM   #233
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Thanks - the blog was really useful and I've added in the Nextera transposase sequence as an extra check in the default set. I think the barcode Kmers in that blog are just read through effects from the same adapters so don't need to be considered separately.

I can improve this over time (and of course people can add their own sequences in manually) but I'd like to get as useful a default set as possible when the new version ships.
Old 01-27-2014, 08:06 AM   #234
nouse
Member
 
Location: Germany

Join Date: Sep 2013
Posts: 11
Default

Hi there!

Three quick questions....
1. What is the maximum amount of data fastqc can handle?
I am trying to analyze a huge concatenated sample of Illumina data, but it has been stuck at "Starting analysis" for a while now. There is enough RAM, and the server is idle except for fastqc. I also see a running java command in top.
2. Any recommendations for filtering out the bad sequences fastqc has identified? mothur has filter.seqs; maybe there is something similar for Illumina?
3. Do the whiskers in the boxplots represent 100%?

Thank you very much!
Old 01-27-2014, 09:08 AM   #235
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by nouse

Three quick questions....
1. What is the maximum amount of data fastqc can handle?
We've run it on data files with over 2 billion reads internally and it was OK. I seem to remember fixing a bug which affected some datasets with over 2^31 reads in them but these were corner cases.

Quote:
Originally Posted by nouse
I am trying to analyze a huge concatenated sample of Illumina data, but it has been stuck at "Starting analysis" for a while now. There is enough RAM, and the server is idle except for fastqc. I also see a running java command in top.
FastQC will only report on progress every 5% through the file so if it's a really big file it might take a while to get to 5%. If you can see the process running then you can check that it's taking both CPU and IO (using top and iotop). If it's doing both then it's probably OK. You could always use head to extract some reads from the top of the file and see if processing those works OK if you wanted to be sure it's likely to finish.
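The suggestion to test on reads taken from the top of the file can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and it assumes the usual 4-lines-per-record FASTQ layout (no wrapped sequence lines). For an uncompressed file it does the same job as `head -n` with 4 times the number of reads.

```python
import gzip
from itertools import islice


def write_first_records(src_path, dest_path, n_records):
    """Copy the first n_records FASTQ records (4 lines each) from
    src_path to dest_path. Reads gzip input if the name ends in .gz.
    Returns the number of complete records written."""
    opener = gzip.open if str(src_path).endswith(".gz") else open
    lines_written = 0
    with opener(src_path, "rt") as src, open(dest_path, "w") as dest:
        # islice stops after 4 * n_records lines without reading the rest.
        for line in islice(src, 4 * n_records):
            dest.write(line)
            lines_written += 1
    return lines_written // 4
```

Running FastQC on the small output file gives a quick check that the input parses and the run completes before committing to the full dataset. Reads at the top of a file are not a random sample, so the resulting report is a sanity check rather than a summary of the whole run.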


Quote:
Originally Posted by nouse
2. Any recommendations for filtering out the bad sequences fastqc has identified? mothur has filter.seqs; maybe there is something similar for Illumina?
That depends what you mean by bad sequences. We use trim_galore (which we wrote and is simply a wrapper around cutadapt) for adapter and quality trimming, which is generally all of the filtering we apply to our data, but it will depend on what you're going to do with the remainder as to how you want to filter it.


Quote:
Originally Posted by nouse
3. Do the whiskers in the boxplots represent 100%?
No, they're 10% - 90%. In big NGS datasets the real extremes are pretty uninformative so we look a bit further in.
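As an illustration of what such a boxplot summarises at a single read position, here is a small sketch that computes the 10th and 90th percentiles for the whiskers, plus the quartiles and median for the box. The nearest-rank percentile rule and the function names are assumptions made for this example; FastQC's own calculation may differ in detail.

```python
from typing import Dict, Sequence


def percentile(sorted_values: Sequence[int], percent: int) -> int:
    """Simple rank-based percentile on pre-sorted values, using integer
    arithmetic so the chosen index is exact."""
    if not sorted_values:
        raise ValueError("no values supplied")
    index = (percent * (len(sorted_values) - 1)) // 100
    return sorted_values[index]


def box_summary(qualities: Sequence[int]) -> Dict[str, int]:
    """Summarise the quality scores seen at one read position."""
    ordered = sorted(qualities)
    return {
        "whisker_low": percentile(ordered, 10),
        "box_low": percentile(ordered, 25),
        "median": percentile(ordered, 50),
        "box_high": percentile(ordered, 75),
        "whisker_high": percentile(ordered, 90),
    }
```

With whiskers at 10% and 90%, about a tenth of the values lie below the lower whisker and about a tenth above the upper one, so roughly 20% of the data at each position is not drawn at all.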
Old 01-28-2014, 07:34 AM   #236
nouse
Member
 
Location: Germany

Join Date: Sep 2013
Posts: 11
Default

Thanks for the quick answer.

My 460 million reads were processed overnight without trouble. I was just impatient.

I have paired end data, and it seems like some of the samples have problematic reverse reads (whiskers going down to phred<10 for some positions in some samples). This seems to affect my downstream processing, so I want to get rid of, say, anything that has stretches of low quality over n bases, and anything with more than n ambiguous base calls.
Mothur could do that fairly well with filter.seqs, but it's just too slow for my dataset, and I would also need to convert fastq to fasta. SILVA ngs is able to do that too, but it is a web service.
I checked trim galore and solexaqa, but in the end I don't want to trim, I want to reject completely. I am a little bit surprised that those reads could make it out of the HiSeq (I was told the denoising is done internally).

Just to get the boxplots right: the whiskers represent 75-90% and 10-25% respectively. So, outside the whiskers there are still 20% of the data and outliers, correct?
Old 01-28-2014, 07:49 AM   #237
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default

Trimmomatic can do a sliding window trimming, and can reject reads that are below a minimum quality or length threshold.
Old 01-28-2014, 07:50 AM   #238
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

You can do this sort of trimming with trim galore. Basically you specify to trim only based on quality and then reject anything with a final length which is shorter than the starting length. This would remove completely any reads which had any data removed.

Another thing to remember in Illumina data is that Illumina uses very low quality scores (I think it's a Phred score of 2, if I remember correctly) as a flag for calls it doesn't like rather than as a true error probability. This is why you'll often see whiskers on fastqc plots suddenly jump down to very low values. It's not really indicative of a sudden problem, just that some reads have crossed a threshold. There is an option to turn this off in the sequencing pipeline but I don't think anyone routinely uses it.
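For anyone who wants to reject reads outright rather than trim them, the criteria discussed in this exchange (a stretch of low-quality bases, or too many ambiguous calls) can be sketched as below. This is not what trim galore or Trimmomatic do internally. The function names and default thresholds are hypothetical, and Phred+33 encoding is assumed.

```python
def has_low_quality_run(quality_line: str, min_quality: int,
                        run_length: int, offset: int = 33) -> bool:
    """True if the quality string contains at least run_length
    consecutive scores below min_quality (Phred+33 assumed by default)."""
    run = 0
    for char in quality_line:
        if ord(char) - offset < min_quality:
            run += 1
            if run >= run_length:
                return True
        else:
            run = 0  # a good base breaks the low-quality stretch
    return False


def keep_read(sequence: str, quality_line: str, min_quality: int = 10,
              run_length: int = 5, max_ambiguous: int = 2) -> bool:
    """Decide whether to keep a read; the thresholds are illustrative only."""
    if sequence.upper().count("N") > max_ambiguous:
        return False
    return not has_low_quality_run(quality_line, min_quality, run_length)
```

Because a Phred score of 2 acts as a flag rather than a real error estimate, a run-based rule like this mostly ends up rejecting reads that carry a flagged tail. For paired end data, both mates would need to be dropped together to keep the two files in sync, which this sketch does not handle.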
Old 01-30-2014, 06:24 AM   #239
Susanna5
Junior Member
 
Location: Netherlands

Join Date: Apr 2013
Posts: 5
Default

Hello,

I really like this application, and have used it successfully on several files, but now I'm trying to compare it to a trimmed file, and the trimmed file gives this exception:

Exception in thread "Thread-4" java.lang.NullPointerException
at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:141)
at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:105)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:76)
at java.lang.Thread.run(Unknown Source)

Has anyone encountered this before or know of any possible solution?
I am using data from IlluminaBodyMap2, trimmed with Trimmomatic using these options: -phred33,
ILLUMINACLIP:/home/Trimmomatic-0.30/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:20
The fastqc application is the latest Windows version. (I just keep transferring to a Linux VM.)

Thanks to anyone who can help. The file refuses to complete due to the exception, but it reads 1362859 sequences before stopping.
Old 01-30-2014, 07:11 AM   #240
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Hi Susanna - that error suggests that your fastq file stopped in the middle of a fastq entry (which is 4 lines long), which in turn suggests that your file has been truncated. There will be a nicer error message in the next release, but it will still mean that you've lost some data during one of your transfers, and you'll need to go back to the original source to ensure that you have the rest of the file. It's a good idea to check that the file sizes match when you've downloaded a file and, if possible, to check the md5sums of the downloaded files so you know you have the same data.
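Both checks mentioned here can be sketched in a few lines of Python. The first function looks for a file that stops partway through a 4-line record, and the second computes an MD5 digest to compare against the one published by the data source. The function names are hypothetical, and the record check assumes uncompressed, unwrapped FASTQ.

```python
import hashlib


def check_fastq_complete(path) -> int:
    """Return the number of records, or raise ValueError if the file
    ends inside a record or a record is malformed."""
    records = 0
    with open(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                return records  # clean end of file
            sequence = handle.readline()
            separator = handle.readline()
            quality = handle.readline()
            if not (sequence and separator and quality):
                raise ValueError(
                    f"file ends inside a record after {records} complete records")
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"malformed record number {records + 1}")
            if len(sequence.rstrip("\r\n")) != len(quality.rstrip("\r\n")):
                raise ValueError(
                    f"sequence and quality lengths differ in record {records + 1}")
            records += 1


def md5sum(path, chunk_size: int = 1 << 20) -> str:
    """MD5 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A digest that differs from the published one means the copy is not identical to the source, even if the record check passes, for example when a transfer happens to stop exactly on a record boundary.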

Tags
fastq, quality, report
