SEQanswers

Old 03-24-2011, 06:55 AM   #1
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869
FastQ Screen: Does your library contain what you think it does?

I've just released the first version of a simple little program which allows you to screen a FastQ file against a panel of sequence databases so you can quickly see if your libraries contain the types of sequence you think they do, and if not then what sources of contamination might be in there.

I take no credit at all for the idea behind this which I saw in the CRI QC pipeline, and which looked so useful I wrote an implementation for our sequencing facility. We've been running our historical data through it and we've already found a few issues we didn't know we had up until now.

The code is pretty new, so please use with caution, and file bugs if you hit problems.

You can see example output and get the code from:

http://www.bioinformatics.bbsrc.ac.u.../fastq_screen/
Old 03-24-2011, 07:26 AM   #2
ttnguyen
Member
 
Location: Ireland

Join Date: Mar 2010
Posts: 41
Default

Sorry, I could not find the code.
Old 03-24-2011, 07:30 AM   #3
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869
Default

Quote:
Originally Posted by ttnguyen View Post
Sorry, I could not find the code.
Try pressing shift+refresh on our download page. Our downstream cache sometimes likes to hold on to old versions of some of our pages.
Old 04-08-2011, 08:38 AM   #4
idonaldson
Member
 
Location: Manchester, UK

Join Date: Oct 2009
Posts: 37
Colour space usage

Looks like a great tool. I am using colour space data, so when I use my colour space indexes I get the following message:

Error: -C was not specified when running bowtie, but index is in colorspace. If
your reads are in colorspace, please use the -C option. If your reads are not
in colorspace, please use a normal index (one built without specifying -C to
bowtie-build).

How do I specify this?

Apologies if this is simple, but it is Friday afternoon.

Ian
Old 04-08-2011, 08:43 AM   #5
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869
Default

Quote:
Originally Posted by idonaldson View Post
Looks like a great tool. I am using colour space data, so when I use my colour space indexes I get the following message:

Error: -C was not specified when running bowtie, but index is in colorspace.
The current version of the script doesn't include colorspace support - I'll put that into the next release.

A quick fix would be to edit the script and just add -C to the list of bowtie options to force colorspace mode on all analyses. Let me know if this doesn't work and I'll send out a proper fix.
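
To make the workaround concrete, here is a minimal Python sketch of how a wrapper might assemble a bowtie command line with `-C` added. It is an illustration only, not fastq_screen's own code (which is Perl); the function name `build_bowtie_command` and the index and file names are hypothetical. `-C` is the option named in the bowtie error message, and `-q` is bowtie's flag for FASTQ input.

```python
import shlex


def build_bowtie_command(index, reads, colorspace=False, extra_options=""):
    """Return the argument list for a single-end bowtie run.

    index and reads are placeholder names supplied by the caller.
    """
    cmd = ["bowtie", "-q"]          # -q: input reads are FASTQ
    if colorspace:
        cmd.append("-C")            # -C: reads and index are in colorspace
    cmd.extend(shlex.split(extra_options))  # any further bowtie options
    cmd.extend([index, reads])
    return cmd


# Hypothetical colorspace index and read file
print(" ".join(build_bowtie_command("mouse_cs_index", "reads.fastq", colorspace=True)))
# prints: bowtie -q -C mouse_cs_index reads.fastq
```

Forcing `-C` on every search only makes sense if all of the indexes in the screen were built in colorspace.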
Old 04-11-2011, 01:43 AM   #6
gaffa
Member
 
Location: Gothenburg/Uppsala, Sweden

Join Date: Oct 2010
Posts: 82
Default

This looks pretty nice. As I understand it the user supplies the libraries to screen against - I wonder how one would go about setting up a library of adaptors/vectors/contaminants. Are there any commonly used collections of such sequences?
Old 04-11-2011, 01:50 AM   #7
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869
Default

Quote:
Originally Posted by gaffa View Post
This looks pretty nice. As I understand it the user supplies the libraries to screen against - I wonder how one would go about setting up a library of adaptors/vectors/contaminants. Are there any commonly used collections of such sequences?
Yes, the sequences to screen against are left up to the user since different sets of sequences will be applicable for different facilities. If you look in the example config file shipped with the application then you can see in the comments the set of libraries we're using and where we got them from.

Basically I'm trying to cover the species which we're commonly working with in our institute plus vectors, adapters and other common sources of contamination (e.g. E. coli) which could come from any molecular biology lab.

Any suggestions for other sources to screen against would be welcome.
Old 04-11-2011, 02:27 AM   #8
idonaldson
Member
 
Location: Manchester, UK

Join Date: Oct 2009
Posts: 37
Default

Thanks - I got it working for colorspace by changing:
Quote:
my $path_to_bowtie = 'bowtie -C';

Last edited by idonaldson; 04-11-2011 at 02:28 AM. Reason: typo
Old 04-13-2011, 06:04 AM   #9
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 274
Default

Quote:
Originally Posted by gaffa View Post
This looks pretty nice. As I understand it the user supplies the libraries to screen against - I wonder how one would go about setting up a library of adaptors/vectors/contaminants. Are there any commonly used collections of such sequences?
It sounds like you are looking for UniVec to screen for contaminants. For the adaptor sequences you will want to contact your sequencing center about a project because custom adaptors and combinations of them are commonly used.
Old 05-18-2011, 09:19 AM   #10
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869
Default

I've just put an updated version of fastq screen up onto our website. This version adds a new mode of analysis where the screening results are reported as the percentage of sequences which map to only one of the screen libraries, and the percentage which could map to more than one. This then allows you to see if you're seeing unexpected hits which are specific to the wrong species, or if you just have low complexity sequence which could have mapped anywhere.
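
The unique versus multiple split described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code fastq_screen uses: the function name `summarise_hits`, the input layout (one set of library names per read) and the example reads are all assumptions made for this sketch.

```python
from collections import Counter


def summarise_hits(read_hits, libraries):
    """Summarise per-read hits into unique/multiple percentages per library.

    read_hits maps a read ID to the set of library names it mapped to
    (an empty set means the read hit nothing).
    """
    total = len(read_hits)
    unique = Counter()     # reads that hit this library and no other
    multiple = Counter()   # reads that hit this library and at least one other
    no_hit = 0
    for libs in read_hits.values():
        if not libs:
            no_hit += 1
        elif len(libs) == 1:
            unique[next(iter(libs))] += 1
        else:
            for lib in libs:
                multiple[lib] += 1

    def pct(n):
        return 100.0 * n / total if total else 0.0

    summary = {
        lib: {"unique_pct": pct(unique[lib]), "multiple_pct": pct(multiple[lib])}
        for lib in libraries
    }
    return summary, pct(no_hit)


# Hypothetical hits for four reads
hits = {
    "read1": {"Mouse"},
    "read2": {"Mouse", "Human"},
    "read3": set(),
    "read4": {"Human"},
}
summary, no_hit_pct = summarise_hits(hits, ["Human", "Mouse"])
for lib, row in summary.items():
    print(f"{lib}\tunique {row['unique_pct']:.2f}%\tmultiple {row['multiple_pct']:.2f}%")
print(f"No hits\t{no_hit_pct:.2f}%")
```

A high unique percentage for an unexpected species points at real contamination, whereas hits spread across the multiple column of every library are more consistent with low complexity sequence.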

The new release also fixes a few bugs and adds support for colorspace encoded reads.
Old 05-19-2011, 01:22 AM   #11
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869
Default

I've just put up v0.2.1 of fastq screen to fix a bug which affected v0.2 if you were running multilib searches on paired end data. In these cases the percentage hits reported were twice as high as the true value.

This bug didn't affect v0.1, nor did it affect searches on single end data, or searches not using the --multilib option.
Old 05-21-2011, 01:21 PM   #12
husamia
Member
 
Location: cinci

Join Date: Apr 2010
Posts: 66
Default

What if my reads are long, like 100 bp? Will it still work?
Old 05-22-2011, 03:14 AM   #13
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869
Default

Quote:
Originally Posted by husamia View Post
What if my reads are long, like 100 bp? Will it still work?
Yes, that should work. The screen uses bowtie behind the scenes, so any data you could search with bowtie will work. The only problem you might have with really long reads is that a significant proportion of your library might read through into the adapter. In that case you can pass bowtie's --trim3 option through fastq_screen's extra bowtie parameters option to limit how much of each read is used to determine the match.
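
For anyone unsure what trimming from the 3' end does to a read, here is a minimal Python sketch of the operation. It only mimics the effect on a single read and its quality string; bowtie does the trimming internally, and the function name `trim3` and the example read are made up for illustration.

```python
def trim3(sequence, quality, n):
    """Drop n bases from the 3' (right-hand) end of a read and its qualities."""
    if n < 0:
        raise ValueError("n must not be negative")
    keep = max(len(sequence) - n, 0)   # never keep a negative number of bases
    return sequence[:keep], quality[:keep]


# Hypothetical 10 base read; trimming 4 bases leaves the first 6
print(trim3("ACGTACGTAG", "IIIIIIIIII", 4))
# prints: ('ACGTAC', 'IIIIII')
```

The shorter the retained portion, the less adapter is included in the search, at the cost of more reads mapping to several places.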
Old 11-09-2012, 07:11 AM   #14
albireo
Member
 
Location: Europe

Join Date: Sep 2012
Posts: 39
Default

Hi, I'm new to the forum and very interested in this tool.

Simon, are there any sequence libraries (on top of those you recommend in your sample config files) one would want to search against when checking generic human Illumina ChIP-seq reads? Thanks a lot for your work.
Old 11-09-2012, 07:43 AM   #15
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869
Default

The choice of libraries really comes down to what other types of library are likely to be around in your facility, or other common sources of contamination. If there are a load of people doing drosophila work then I'd have a drosophila library.

The only common ones would be the vectors/adapters, PhiX and E. coli as everyone is likely to have those around somewhere.

If people have found other common sources of contamination then I'd be interested to hear which species they found. The only odd one we've had (that we figured out at least) was Acinetobacter, which we think came from the beads used for our ChIP (the OmpA protein on the beads comes from this organism).
Old 11-09-2012, 08:39 AM   #16
albireo
Member
 
Location: Europe

Join Date: Sep 2012
Posts: 39
Default

Thanks for the reply Simon. Could you also advise on how to feed the fastqc "contaminants.txt" data to the program?
Old 11-09-2012, 08:51 AM   #17
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869
Default

Quote:
Originally Posted by albireo View Post
Thanks for the reply Simon. Could you also advise on how to feed the fastqc "contaminants.txt" data to the program?
You'd need to convert it into a fasta file. The script below should do this:

Code:
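#!/usr/bin/perl
use warnings;
use strict;

# Convert the tab-delimited contaminant list (name<TAB>sequence)
# into a FASTA file which bowtie-build can index.
open (IN,'<','contaminant_list.txt') or die $!;
open (OUT,'>','contaminant_list.fa') or die $!;

while (<IN>) {
  next if (/^\#/);       # skip comment lines
  chomp;
  next unless ($_);      # skip blank lines
  my ($name,$seq) = split(/\t+/);
  next unless ($seq);    # skip lines with no sequence column
  $name =~ s/\s+/_/g;    # no whitespace in the FASTA ID
  print OUT ">$name\n$seq\n";
}
close IN;
close OUT or die $!;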
Once you have that you can index it with bowtie-build using something like:

bowtie-build -f contaminant_list.fa contaminants

You can then put the contaminants database into fastq_screen.
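
For anyone who would rather not run Perl, the same conversion can be sketched in Python. This is a rough equivalent of the script above rather than anything shipped with fastq_screen; the function name `contaminants_to_fasta` is made up, and it assumes the same layout (a name, one or more tabs, then the sequence, with `#` comment lines).

```python
import re


def contaminants_to_fasta(in_path, out_path):
    """Convert a tab-delimited contaminant list into FASTA.

    Returns the number of sequences written.
    """
    written = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith("#"):          # skip comment lines
                continue
            line = line.rstrip("\r\n")
            if not line:                      # skip blank lines
                continue
            fields = re.split(r"\t+", line)   # name and sequence are tab separated
            if len(fields) < 2 or not fields[1]:
                continue                      # no sequence column
            name = re.sub(r"\s+", "_", fields[0])  # no whitespace in the FASTA ID
            dst.write(f">{name}\n{fields[1]}\n")
            written += 1
    return written
```

Calling `contaminants_to_fasta("contaminant_list.txt", "contaminant_list.fa")` should produce the same file as the Perl version, ready for the bowtie-build step.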

Hope this helps
Old 11-09-2012, 09:10 AM   #18
albireo
Member
 
Location: Europe

Join Date: Sep 2012
Posts: 39
Default

Hello Simon, it works perfectly, thank you. It actually detected adapter contamination in some of my libraries.
Old 12-06-2012, 04:34 AM   #19
albireo
Member
 
Location: Europe

Join Date: Sep 2012
Posts: 39
Default

Hi,

I have a problem when running fastqscreen on mouse paired-end ChIPseq data. Basically for all of the four libraries I have, I'm getting more than 99% no hits in the final fastqscreen graph.

Code:
Mmus    99.96   0.02    0.02    0.00    0.00
The sequences I'm checking my libraries against are human, mouse, rat, fly, vectors, adapters. I downloaded the mouse mm9 fasta from UCSC and generated the bowtie index with bowtie 0.12.7. The same version of bowtie is used in the fastq_screen.conf file.

The reads are 51 bp paired-end and I call the program as follows:

Code:
fastq_screen --nohits --conf=fastq_screen.conf --paired <library>_2_sequence.fastq.gz <library>_1_sequence.fastq.gz
I also tried using the --bowtie="--trim5 10" option, as well as --trim3 but this didn't affect the 99% to 100% nohits results.

Separately, I had used bwa to align the reads against mm9, and the sequences did align. This is the output of samtools flagstat for one of the four BAM files:

Code:
78666176 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
76266600 + 0 mapped (96.95%:-nan%)
78666176 + 0 paired in sequencing
39333088 + 0 read1
39333088 + 0 read2
74908040 + 0 properly paired (95.22%:-nan%)
75455201 + 0 with itself and mate mapped
811399 + 0 singletons (1.03%:-nan%)
346117 + 0 with mate mapped to a different chr
130284 + 0 with mate mapped to a different chr (mapQ>=5)
Any idea on what I might be doing wrong? Apologies if I'm missing something really obvious.
Old 12-06-2012, 05:30 AM   #20
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869
Default

Quote:
Originally Posted by albireo View Post
Hi,
Any idea on what I might be doing wrong? Apologies if I'm missing something really obvious.
I can't immediately see why this would be going wrong from the data you've provided. If you run the screen against just the first of your paired reads do you find any hits from that? If you don't then there's probably something odd going on in the search. If you find hits from analysing each of the files as single end, but not when you pair them then that suggests that either something is going wrong in the pairing of sequences or that you have oddly separated pairs.
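
One quick way to rule out a pairing problem is to check that the two files list the same reads in the same order. The Python sketch below does that under some assumptions: four-line FASTQ records, and mates that share a name apart from a trailing `/1` or `/2` or anything after the first space. The helper names are hypothetical and this is not part of fastq_screen.

```python
import gzip
from itertools import zip_longest


def open_fastq(path):
    """Open a plain or gzipped FASTQ file for text reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def read_names(handle):
    """Yield the read name (header without '@') of each 4-line record."""
    for i, line in enumerate(handle):
        if i % 4 == 0:
            yield line.rstrip("\r\n")[1:]


def base_name(name):
    """Strip the mate marker: text after a space, then a trailing /1 or /2."""
    parts = name.split()
    name = parts[0] if parts else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def first_mismatch(path1, path2):
    """Return (record_number, name1, name2) for the first pair that does
    not match, or None if every record pairs up."""
    with open_fastq(path1) as f1, open_fastq(path2) as f2:
        pairs = zip_longest(read_names(f1), read_names(f2))
        for n, (a, b) in enumerate(pairs, start=1):
            if a is None or b is None or base_name(a) != base_name(b):
                return n, a, b
    return None
```

A result of `None` means the names agree throughout; anything else gives the first record at which the two files fall out of step, including the case where one file is shorter than the other.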

If you can put a subset of your sequences up somewhere where we can see them (just 100k or so would be plenty) then we could take a look and see what's happening with your data.
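
Pulling out a subset of that size is straightforward. The Python sketch below copies the first N reads of a FASTQ file, assuming four-line records; the function name `take_first_reads` and the file names in the usage note are placeholders. Note that it takes the start of the file rather than a random sample.

```python
import gzip
from itertools import islice


def take_first_reads(in_path, out_path, n_reads=100_000):
    """Copy the first n_reads records of a (possibly gzipped) FASTQ file.

    Returns the number of complete records written.
    """
    opener = gzip.open if in_path.endswith(".gz") else open
    written_lines = 0
    with opener(in_path, "rt") as src, open(out_path, "w") as dst:
        for line in islice(src, 4 * n_reads):   # 4 lines per FASTQ record
            dst.write(line)
            written_lines += 1
    return written_lines // 4
```

For paired data, running it with the same `n_reads` on both files (for example a read 1 file and its matching read 2 file) keeps the mates in step, provided the originals were in the same order.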
Tags
contamination, quality, screening, search

All times are GMT -8.

