  • Help! lost data in fastq file

    I received some Illumina HiSeq results as fastq files from my facility, but the yield is much lower than expected.

    For example, the summary file for one sample shows that it contains 225566 +/- 0 clusters (PF), so I expected around 225566*32 ≈ 7 million reads. But the fastq file contains fewer than 0.5 million reads. One example read is

    @HWI-1KL117_0134:6:2104:1422:2023#GNCAAT/1
    NTTTTAATGAAAACACGGAAATTAAAAATTCTTGAAGGTGACATCCCTCCA

    Because the sample has the index barcode "GCCAAT", is it possible that the facility selected reads with the bad barcode "GNCAAT" instead of "GCCAAT" when demultiplexing? If so, is there any way to rescue my sample?

    Thanks in advance!
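
As a rough sanity check on the numbers in the post above, the read count can be taken straight from the fastq file and compared with the expected figure. This is only a minimal sketch: it assumes the standard four-lines-per-record layout, the function names `count_reads` and `expected_reads` are made up for illustration, and the multiplier 32 is simply the one used in the post.

```shell
# Hypothetical helper: count reads in a FASTQ file.
# Assumes every record spans exactly four lines.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Hypothetical helper: expected reads = PF clusters * multiplier.
# Both numbers are supplied by the caller.
expected_reads() {
    echo $(( $1 * $2 ))
}

# Example with the figures from the post:
#   expected_reads 225566 32
#   count_reads 003_s_6_sequence.txt
```

Comparing the two numbers shows how large the gap is before anyone starts looking at barcodes.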

  • #2
    Demultiplexing usually allows for one mismatch and sometimes indels, so it's not surprising that this read would have been considered as matching the barcode. Why don't you go through the whole fastq file and check for the barcodes that appear in the headers with something like

    grep '@' file.fastq | cut -d '#' -f2 | sort | uniq -c

    Maybe your summary was referring to the cluster density for the whole lane, and they had 12 multiplexed samples in it? That would fit approximately with 12*0.5M. 7 million reads for a lane is pretty low, though ...
    Last edited by kopi-o; 11-11-2011, 01:26 PM. Reason: typo
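
One caveat with the grep approach: '@' is also a legal quality character, so `grep '@'` can pick up quality lines as well as headers, and `cut` passes lines without a '#' through unchanged. Below is a sketch that looks only at header lines. It assumes four lines per record and headers of the form `...#BARCODE/1` as in the example read; the function name `count_barcodes` is hypothetical.

```shell
# Hypothetical helper: tally index barcodes using header lines only.
# Assumes 4-line FASTQ records and headers ending in "#BARCODE/1".
count_barcodes() {
    awk 'NR % 4 == 1 {
             split($0, parts, "#")      # parts[2] is "BARCODE/1"
             split(parts[2], tag, "/")  # tag[1] is the barcode itself
             count[tag[1]]++
         }
         END { for (b in count) print count[b], b }' "$1" |
        sort -k1,1nr -k2,2              # most frequent barcode first
}

# Usage: count_barcodes file.fastq
```

The output is one line per barcode, with the count first, so a lane with several samples shows up as several large groups.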


    • #3
      Thank you for your quick reply. I tried the command:

      grep '@' 003_s_6_sequence.txt | cut -d '#' -f2 | sort |*uniq -c

      But I get this message:

      -bash: *uniq: command not found

      Besides, the entire fastq file only contains reads with the barcode "GNCAAT". As for the summary files, I have three samples in one lane, and each sample has its own summary file after demultiplexing. The PF cluster numbers are similar. So there should be around 10 million PF reads for each, which is still low for HiSeq.


      • #4
        I'm sorry, there was a typo in my reply. The star '*' is not supposed to be there; the command is called uniq. I've removed it.

        Anyway, there is no need for you to run the command if you already know that the entire fastq file contains the same barcode. Then it would appear that the facility has failed to extract the exact-match barcode, or that the second cycle in the index read step went so seriously awry that no base calls could be made.
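
The second possibility can be checked from the headers themselves by counting no-calls ("N") at each index position. This is a rough sketch, again assuming four lines per record and headers of the form `...#BARCODE/1`; the function name `n_by_position` is made up for illustration.

```shell
# Hypothetical helper: for each barcode position, print
#   position  number_of_N  total_reads
# Assumes 4-line FASTQ records and headers ending in "#BARCODE/1".
n_by_position() {
    awk 'NR % 4 == 1 {
             split($0, parts, "#")
             split(parts[2], tag, "/")
             len = length(tag[1])
             if (len > maxlen) maxlen = len
             for (i = 1; i <= len; i++)
                 if (substr(tag[1], i, 1) == "N") n[i]++
             total++
         }
         END { for (i = 1; i <= maxlen; i++) print i, n[i] + 0, total }' "$1"
}

# Usage: n_by_position 003_s_6_sequence.txt
```

If one position shows N for every read while the others show none, that points at a single failed index cycle rather than at the sample.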


        • #5
          Also, just a crazy suggestion - why not just ask the facility? :-)


          • #6
            Now the command works, and all three of my samples from one lane have the same problem. The barcode of every read carries a mismatch "N" at the second position, e.g. CNATGT, ANAGTG, GNCAAT.

            I will ask the facility to rerun the demultiplexing step. I hope it's just a computer problem and I can get reads with perfectly matching barcodes.

            Thanks!
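
To see how many reads would be assigned to a sample under a given mismatch tolerance, the header barcodes can be compared with the expected barcode position by position. This is only a sketch of the idea and not how the facility's demultiplexer works. It assumes four lines per record and headers of the form `...#BARCODE/1`, and the function name `count_within_mismatches` is hypothetical.

```shell
# Hypothetical helper: count reads whose header barcode differs from the
# expected barcode in at most MAX positions (an "N" counts as a mismatch).
# Usage: count_within_mismatches FILE EXPECTED_BARCODE MAX
count_within_mismatches() {
    awk -v want="$2" -v max="$3" 'NR % 4 == 1 {
             split($0, parts, "#")
             split(parts[2], tag, "/")
             if (length(tag[1]) != length(want)) next   # skip odd lengths
             d = 0
             for (i = 1; i <= length(want); i++)
                 if (substr(tag[1], i, 1) != substr(want, i, 1)) d++
             if (d <= max) hits++
         }
         END { print hits + 0 }' "$1"
}

# Example with the barcode from the first post:
#   count_within_mismatches 003_s_6_sequence.txt GCCAAT 0
#   count_within_mismatches 003_s_6_sequence.txt GCCAAT 1
```

With an N at the second position of every barcode, the zero-mismatch count would be 0, while the one-mismatch count would include every read that is otherwise correct.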


            • #7
              Because the facility believes I gave them bad samples, although they had said there was no problem based on the Bioanalyzer result!

              Originally posted by kopi-o
              Also, just a crazy suggestion - why not just ask the facility? :-)
