#1
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
I've just released the first version of a simple little program which allows you to screen a FastQ file against a panel of sequence databases so you can quickly see if your libraries contain the types of sequence you think they do, and if not then what sources of contamination might be in there.
I take no credit at all for the idea behind this, which I saw in the CRI QC pipeline and which looked so useful that I wrote an implementation for our sequencing facility. We've been running our historical data through it and have already found a few issues we didn't know we had.
The code is pretty new, so please use it with caution, and file bugs if you hit problems. You can see example output and get the code from: http://www.bioinformatics.bbsrc.ac.u.../fastq_screen/

#2
Member
Location: Ireland Join Date: Mar 2010
Posts: 41
Sorry, I could not find the code.

#3
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871

#4
Member
Location: Manchester, UK Join Date: Oct 2009
Posts: 37
Looks like a great tool. I am using colour space data, so when I use my colour space indexes I get the following message:
Error: -C was not specified when running bowtie, but index is in colorspace. If your reads are in colorspace, please use the -C option. If your reads are not in colorspace, please use a normal index (one built without specifying -C to bowtie-build).
How do I specify this? Apologies if this is simple, but it is Friday afternoon.
Ian

#5
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
A quick fix would be to edit the script and just add -C to the list of bowtie options to force colorspace mode on all analyses. Let me know if this doesn't work and I'll send out a proper fix.

#6
Member
Location: Gothenburg/Uppsala, Sweden Join Date: Oct 2010
Posts: 82
This looks pretty nice. As I understand it, the user supplies the libraries to screen against. I wonder how one would go about setting up a library of adaptors/vectors/contaminants. Are there any commonly used collections of such sequences?

#7
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
Basically I'm trying to cover the species which we commonly work with in our institute, plus vectors, adapters and other common sources of contamination (e.g. E. coli) which could come from any molecular biology lab. Any suggestions for other sources to screen against would be welcome.

#8
Member
Location: Manchester, UK Join Date: Oct 2009
Posts: 37
Thanks - I got it working for color-space by changing:
Quote:
Last edited by idonaldson; 04-11-2011 at 02:28 AM. Reason: typo

#9
Senior Member
Location: Vancouver, BC Join Date: Mar 2010
Posts: 275

#10
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
I've just put an updated version of fastq screen up onto our website. This version adds a new mode of analysis where the screening results are reported as the percentage of sequences which map to only one of the screen libraries, and the percentage which could map to more than one. This then allows you to see if you're seeing unexpected hits which are specific to the wrong species, or if you just have low complexity sequence which could have mapped anywhere.
The new release also fixes a few bugs and adds support for colorspace encoded reads.
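
To make the unique versus multiple distinction concrete, here is a minimal Python sketch of how such a summary could be computed. It is only an illustration of the idea described above, not the fastq_screen code itself (which uses bowtie behind the scenes); the function name `summarise_screen`, the input layout and the library names are all hypothetical.

```python
from collections import Counter


def summarise_screen(read_hits, libraries):
    """Return ({library: (unique %, multiple %)}, no-hit %).

    read_hits maps each read ID to the set of library names it mapped to.
    This is an illustrative layout, not fastq_screen's internal format.
    """
    total = len(read_hits)
    if total == 0:
        raise ValueError("no reads to summarise")

    unique = Counter()    # reads hitting this library and no other
    multiple = Counter()  # reads hitting this library and at least one other
    no_hits = 0

    for libs in read_hits.values():
        if not libs:
            no_hits += 1
        elif len(libs) == 1:
            unique[next(iter(libs))] += 1
        else:
            multiple.update(libs)

    per_library = {
        lib: (100.0 * unique[lib] / total, 100.0 * multiple[lib] / total)
        for lib in libraries
    }
    return per_library, 100.0 * no_hits / total


# Hypothetical example: four reads screened against three libraries.
hits = {
    "read1": {"Human"},
    "read2": {"Human", "Mouse"},
    "read3": set(),
    "read4": {"Ecoli"},
}
per_library, no_hit_pct = summarise_screen(hits, ["Human", "Mouse", "Ecoli"])
for lib, (uniq, multi) in per_library.items():
    print(f"{lib}\tunique={uniq:.1f}%\tmultiple={multi:.1f}%")
print(f"No hits\t{no_hit_pct:.1f}%")
```

A high "unique" percentage for an unexpected library points to genuine contamination from that source, whereas hits that only ever show up in the "multiple" column are more likely to be low complexity sequence that could map anywhere.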

#11
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
I've just put up v0.2.1 of fastq screen to fix a bug which affected v0.2 if you were running multilib searches on paired end data. In these cases the percentage hits reported were twice as high as the true value.
This bug didn't affect v0.1, nor did it affect searches on single end data, or searches not using the --multilib option.

#12
Member
Location: cinci Join Date: Apr 2010
Posts: 66
What if my reads are long, like 100 bp? Will it still work?

#13
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
Yes, that should work. The screen uses bowtie behind the scenes, so any data you could search with bowtie will work. The only problem you might have with really long reads is that a significant proportion of your library might read through into adapter. In that case you can pass the --trim3 bowtie option in the fastq_screen extra bowtie parameters option to limit how much of your reads you use to determine the match.
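
As a rough picture of what 3' trimming does to a read, the Python sketch below removes a fixed number of bases from the right-hand end of a sequence and its quality string. It assumes that bowtie's --trim3 option trims that many bases from the 3' end of each read before alignment; bowtie does this internally, so the helper `trim3` here is purely a hypothetical illustration and is not something you would need to run yourself.

```python
def trim3(sequence, qualities, n):
    """Drop n bases from the 3' (right-hand) end of a read.

    The quality string is trimmed by the same amount so the two stay in step.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must be the same length")
    keep = max(len(sequence) - n, 0)  # never slice past the start of the read
    return sequence[:keep], qualities[:keep]


# Hypothetical 10 bp read with 4 bases trimmed from the 3' end.
seq, qual = trim3("ACGTACGTAC", "IIIIIIIIII", 4)
print(seq, qual)
```

Only the bases that remain after trimming are used to find a match, so adapter sequence at the far end of a long read no longer prevents the read from being assigned to a library.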

#14
Member
Location: Europe Join Date: Sep 2012
Posts: 39
Hi, I'm new to the forum and very interested in this tool.
Simon, are there any sequence libraries (on top of those you recommend in your sample config files) one would want to search against when checking generic human Illumina ChIP-seq reads? Thanks a lot for your work.

#15
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
The choice of libraries really comes down to what other types of library are likely to be around in your facility, or other common sources of contamination. If there are a load of people doing Drosophila work then I'd have a Drosophila library.
The only common ones would be the vectors/adapters, PhiX and E. coli, as everyone is likely to have those around somewhere. If people have found other common sources of contamination then I'd be interested to hear which species they found. The only odd one we've had (that we figured out at least) was Acinetobacter, which we think came from the beads used for our ChIP (the OmpA protein on the beads comes from this organism).

#16
Member
Location: Europe Join Date: Sep 2012
Posts: 39
Thanks for the reply, Simon. Could you also advise on how to feed the FastQC "contaminants.txt" data to the program?

#17
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
Code:
    #!/usr/bin/perl
    use warnings;
    use strict;

    # Convert the tab-delimited contaminant list into a FASTA file
    open (IN,'contaminant_list.txt') or die $!;
    open (OUT,'>','contaminant_list.fa') or die $!;

    while (<IN>) {
        next if (/^\#/);      # skip comment lines
        chomp;
        next unless ($_);     # skip blank lines
        my ($name,$seq) = split(/\t+/);
        next unless ($seq);   # skip lines without a sequence
        $name =~ s/\s+/_/g;   # replace whitespace in names with underscores
        print OUT ">$name\n$seq\n";
    }

    close OUT or die $!;
Code:
    bowtie-build -f contaminant_list.fa contaminants
You can then put the contaminants database into fastq_screen.
Hope this helps

#18
Member
Location: Europe Join Date: Sep 2012
Posts: 39
Hello Simon, it works perfectly, thank you. It actually detected adapter contamination in some of my libraries

#19
Member
Location: Europe Join Date: Sep 2012
Posts: 39
Hi,
I have a problem when running fastq_screen on mouse paired-end ChIP-seq data. Basically, for all four of the libraries I have, I'm getting more than 99% no hits in the final fastq_screen graph.
Code:
    Mmus 99.96 0.02 0.02 0.00 0.00
The reads are 51 bp paired-end and I call the program as follows:
Code:
    fastq_screen --nohits --conf=fastq_screen.conf --paired <library>_2_sequence.fastq.gz <library>_1_sequence.fastq.gz
Separately, I had used bwa to align the reads against mm9, and the sequences did align. This is the output of samtools flagstat for one of the four bam files:
Code:
    78666176 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    76266600 + 0 mapped (96.95%:-nan%)
    78666176 + 0 paired in sequencing
    39333088 + 0 read1
    39333088 + 0 read2
    74908040 + 0 properly paired (95.22%:-nan%)
    75455201 + 0 with itself and mate mapped
    811399 + 0 singletons (1.03%:-nan%)
    346117 + 0 with mate mapped to a different chr
    130284 + 0 with mate mapped to a different chr (mapQ>=5)

#20
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
If you can put a subset of your sequences up somewhere where we can see them (just 100k or so would be plenty) then we could take a look and see what's happening with your data.

Tags: contamination, quality, screening, search