  • Paired end fastq files contamination

    Hello,

    I wasn't sure where to post this, but I was hoping to get some insight into a problem I am experiencing with my paired-end fastq files. I should also say I am a molecular microbiologist by training with very little hands-on training in bioinformatics and programming in general, so apologies if this is a stupid question!

    I have had intermittent problems when downloading .fastq.gz files from an external sequencing provider (I have seen this with both the FTP file server and the direct download links on the company's website).

    The download itself appears fine (I checked the md5 values and they match) and I use 7zip (as recommended by the sequence provider) to extract the fastq files, all done on my local computer. I transfer the files onto a Unix machine to carry out bwa and samtools analysis to identify SNPs.

    The problem only becomes apparent when I try to merge the paired-end fastq data in BWA and the bwa sampe step fails. When I check the number of lines in each file (using wc -l), they do not match, which explains the failure. My colleague wrote a script to find where the mismatch between the paired files occurs. The first time we noticed this, we identified an insertion of about 10-20 sequencing reads from a completely different sequencing run. When we showed the headers of those reads to the sequencing provider, they traced them back to a previous sequencing project of theirs, and said the problem had occurred at our end. However, this has now happened at least twice.
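
    A script of the kind described above could look roughly like the following Python sketch. It is not the colleague's actual script. It assumes strict 4-line FASTQ records and that mates share the same read name, either up to the first space or up to an old-style /1 or /2 suffix. The function names (open_fastq, read_names, first_pair_mismatch) are made up for illustration.

```python
import gzip
from itertools import zip_longest


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_names(path):
    """Yield the read name from each record (assumes 4 lines per record)."""
    with open_fastq(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:  # header line of each record
                # Keep the part before the first space and drop a /1 or /2 suffix.
                fields = line.strip().split()
                name = fields[0] if fields else ""
                if name.endswith("/1") or name.endswith("/2"):
                    name = name[:-2]
                yield name


def first_pair_mismatch(path1, path2):
    """Return (record_index, name1, name2) for the first disagreement, or None.

    If one file runs out of records first, the missing name is reported as None.
    """
    pairs = zip_longest(read_names(path1), read_names(path2))
    for index, (name1, name2) in enumerate(pairs):
        if name1 != name2:
            return index, name1, name2
    return None
```

    The record index is zero-based, so the offending header sits on line 4 * index + 1 of each file.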

    My question: is this “contamination” (sorry, that’s the microbiologist in me!) likely to occur just from unzipping the data? I haven’t done anything else to the data (e.g. quality trimming), and it appears to happen randomly. In the last batch of sequencing I received, 3 of the strains sequenced went through the bwa analysis absolutely fine, while others needed a 2nd or 3rd attempt at downloading before I got “clean” files. Is the problem more likely to be at the sequencing provider's end? Either way, what can I do to stop this happening? From now on I will always check the number of lines in each file before I proceed with any analysis, but I would like to resolve the problem too.

    Thanks!

  • #2
    Let me say this at the beginning: this all sounds odd.

    If you must use a Windows machine, then keep what you do with the data files there to a minimum. Just download the files and then move them to the server. If your server has a direct internet connection, then download the sequence files directly (using wget or curl on Unix). Nowadays virtually all NGS tools understand compressed fastq files, so there is no need to uncompress them with 7-Zip, but if you must, then use gunzip on the Unix end.

    As for the insertion of unrelated data (however small), that should never happen, and there is no way it can happen on your end.
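
    To illustrate working on the compressed files directly, here is a minimal Python sketch that counts the records in each .fastq.gz of a pair without unzipping anything to disk. It assumes plain 4-line FASTQ records. The function names (count_fastq_records, same_record_count) are hypothetical.

```python
import gzip


def count_fastq_records(path):
    """Count records in a gzipped FASTQ file, assuming 4 lines per record."""
    lines = 0
    with gzip.open(path, "rt") as handle:  # decompresses on the fly
        for _ in handle:
            lines += 1
    if lines % 4 != 0:
        raise ValueError(f"{path}: {lines} lines is not a multiple of 4")
    return lines // 4


def same_record_count(path1, path2):
    """Return True when both files of a pair hold the same number of records."""
    return count_fastq_records(path1) == count_fastq_records(path2)
```

    A pair that fails this check will not survive a paired-end alignment step, so it is worth running before anything else.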
    Last edited by GenoMax; 12-02-2016, 11:17 AM.


    • #3
      I suggest you verify the pairing before doing any further processing. You can do that with the BBMap package like this:

      reformat.sh in1=file1.fastq.gz in2=file2.fastq.gz vpair


      If that shows a problem, the problem is absolutely occurring on their end. It's unlikely but theoretically possible that you caused the corruption during the unzipping process - say, if you were unzipping lots of things at once, and outputting some of them to the same file so they were overwriting each other - but if the gzips pass the gzip integrity test, then they were not corrupted during transmission.
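
      On the command line the integrity test is gzip -t. A rough Python equivalent, as a sketch only, reads the whole stream so that the gzip module verifies the stored CRC and length at the end. The function name gzip_is_intact and the chunk size are arbitrary choices for this example.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses and passes its CRC check.

    Note: a missing or unreadable file also returns False, because those
    errors are OSError subclasses as well.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end; the CRC is only checked once the stream is consumed.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True
```

      A file that passes this check but still has mismatched pairs was already wrong when it was compressed.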
