SEQanswers > General

12-02-2016, 03:46 AM   #1
Junior Member
Location: Oxford UK

Join Date: Dec 2016
Posts: 1
Paired end fastq files contamination


I wasn't sure where to post this, but I was hoping to get some insight into a problem I am experiencing with my paired end fastq files. I should also say I am a molecular microbiologist by training, with very little hands-on training in bioinformatics and programming in general, so apologies if this is a stupid question!

I have had intermittent problems when downloading .fastq.gz files from an external sequencing provider (I have seen this when using both the FTP file server and the direct download links on the company's website).

The download itself appears fine (I checked the md5 values and they match) and I use 7zip (as recommended by the sequence provider) to extract the fastq files, all done on my local computer. I transfer the files onto a Unix machine to carry out bwa and samtools analysis to identify SNPs.

The problem only becomes apparent when I try to merge the paired end fastq data in BWA and the bwa sampe step fails. When I check the number of lines in each file (using wc -l), they do not match, which explains why the step fails. My colleague wrote a script to pull out where the mismatch between the paired files occurs. The first time we noticed this, we identified an insertion of about 10-20 sequencing reads from a completely different sequencing run. When we showed the headers of the reads to the sequencing provider, they traced them back to a previous sequencing project done by them, and said the problem had occurred at our end. However, this has happened at least twice now.

My question: is this “contamination” (sorry, that’s the microbiologist in me!) likely to occur just from unzipping the data? I haven’t done anything else to the data (e.g. quality trimming) and it appears to happen randomly: in the last batch of sequencing I received, 3 of the strains sequenced went through the bwa analysis absolutely fine, while others needed a 2nd or 3rd attempt at downloading before I got “clean” files. Is the problem more likely to be at the sequencing provider’s end? Either way, what can I do to stop this happening? From now on I will always check the number of lines in each file before I proceed with any analysis, but I would like to resolve the problem too.
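
As a rough illustration of the kind of check described above (walk through both files together and report the first record where the pairing breaks), here is a minimal Python sketch. It is not the colleague's script. The function names (`open_fastq`, `read_names`, `base_name`, `first_pair_mismatch`) are made up for this example, and it assumes every record is exactly four lines and that mates differ only by a trailing /1 or /2 or by the comment after the first space in the header.

```python
import gzip
from itertools import zip_longest


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_names(path):
    """Yield the header line of each record (assumes 4 lines per record)."""
    with open_fastq(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:
                yield line.rstrip("\n")


def base_name(header):
    """Reduce a header to the part both mates should share.

    Drops the leading '@', anything after the first whitespace,
    and a trailing '/1' or '/2'.
    """
    fields = header[1:].split()
    name = fields[0] if fields else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def first_pair_mismatch(path1, path2):
    """Return (record_number, header1, header2) for the first bad pair, or None.

    A header of None means that file ran out of records first.
    """
    pairs = zip_longest(read_names(path1), read_names(path2))
    for record_number, (h1, h2) in enumerate(pairs, start=1):
        if h1 is None or h2 is None or base_name(h1) != base_name(h2):
            return record_number, h1, h2
    return None
```

Calling `first_pair_mismatch("strain_R1.fastq.gz", "strain_R2.fastq.gz")` (hypothetical file names) on the compressed files avoids the unzipping step altogether, which also helps separate "bad before unzipping" from "bad after unzipping".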

HSmith
12-02-2016, 05:19 AM   #2
Senior Member
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,726

Let me say this at the beginning: this all sounds odd.

If you must use a Windows machine, then keep what you do with the data files there to a minimum. Just download the file and then move it to the server. If your server has a direct internet connection, then download the sequence files directly (using wget or curl on Unix). Nowadays all NGS tools understand compressed fastq files, so there is no need to uncompress them with 7-zip, but if you must, then use gunzip on the Unix end.

As for the insertion of unrelated data (however small), that should never happen, and there is no way it can happen on your end.

Last edited by GenoMax; 12-02-2016 at 10:17 AM.
GenoMax
12-02-2016, 10:05 AM   #3
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

I suggest you verify the pairing before doing any further processing. You can do that with the BBMap package like this:

reformat.sh in1=file1.fastq.gz in2=file2.fastq.gz vpair

If that shows a problem, the problem is absolutely occurring on their end. It's unlikely but theoretically possible that you caused the corruption during the unzipping process - say, if you were unzipping lots of things at once, and outputting some of them to the same file so they were overwriting each other - but if the gzips pass the gzip integrity test, then they were not corrupted during transmission.
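
For anyone who would rather script the integrity test, here is a minimal Python sketch of the same idea as running `gzip -t` on the file: decompress the whole stream and see whether it errors out. The function name `gzip_is_intact` and the chunk size are just choices for this example. Note the limit of the test: a file that passes matches what was originally compressed, so reads that were already in the file before compression will not be flagged.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses and passes its CRC check.

    A missing file is reported as an error rather than as "corrupt".
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Reading to the end forces the CRC and length check at end of stream.
            while handle.read(chunk_size):
                pass
    except FileNotFoundError:
        raise
    except (OSError, EOFError, zlib.error):
        # Bad header or CRC (OSError), truncated file (EOFError),
        # or damaged compressed data (zlib.error).
        return False
    return True
```

If this returns True for both downloaded files and the pairing check still fails on the compressed files, then unzipping cannot be the cause.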

All times are GMT -8.