  • fastq format questions

    Dear All,
    We are a small research group working on compressing the information in FASTQ files. We have read through the format specification (as described in http://nar.oxfordjournals.org/content/38/6/1767) and have the following questions:
    1. How significant is each field in a FASTQ file? We know that the sequence data is important. What about the quality data? Does it need to be reproduced exactly?
    2. Title line: What does the title line typically signify? Which parts of it should be retained and which can be dropped? We want to know what information in the title line matters most to those working with FASTQ files.
    3. Is the order of reads as stored in a FASTQ file important, or can the reads be reordered to make a new FASTQ file? Note that the new file would contain the same information, only reordered.

    Thank you all very much in advance for any inputs that we can get from you.
    Best Regards
    Ajit

  • #2
    Do you know this article?
    available at no cost under a non-open-source license by requesting from the web-site; Binary: available for direct download at no cost. For-Profit: Submit request for for-profit license from the web-site.


    Regarding your questions (from my point of view):
    1) Quality data is important too, both for quality-aware alignment software and especially for SNP calling (no one would spend that much disk space on unnecessary information).

    2) I am not sure what information the title line carries. The only thing that comes to mind is that, in paired-end sequencing, the first and the second mate need to have the same name. I am not sure about the rest of the title line.

    3) I don't think the order plays an important role, but I would appreciate any comments correcting me.



    • #3
      3. Generally speaking, the order of sequences in a file carries no significance. However, in next-generation sequencing it is common for the reads produced by a paired-end experiment to be stored in two separate FASTQ files, with the two reads of a pair found at the same position (the same record number) in the two files. In that situation the order is clearly crucial: reordering one file without reordering the other in exactly the same way breaks the pairing.
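To illustrate why order matters for paired-end files, here is a minimal sketch that walks two FASTQ streams in lockstep and checks that the reads at each position are mates. It assumes the common four-lines-per-record layout (the format itself also permits line-wrapped sequences) and that mate names differ only by a trailing `/1` or `/2` or by text after the first whitespace. The helper names `read_fastq`, `pair_key` and `paired_reads` are made up for this example.

```python
import io
from itertools import zip_longest

def read_fastq(handle):
    """Yield (title, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield title[1:], seq, qual  # drop the leading '@'

def pair_key(title):
    """Reduce a title to the part both mates are assumed to share."""
    parts = title.split()
    name = parts[0] if parts else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name

def paired_reads(handle1, handle2):
    """Yield (read1, read2) tuples, failing if the two files are out of step."""
    for rec1, rec2 in zip_longest(read_fastq(handle1), read_fastq(handle2)):
        if rec1 is None or rec2 is None:
            raise ValueError("files contain different numbers of reads")
        if pair_key(rec1[0]) != pair_key(rec2[0]):
            raise ValueError(f"mate mismatch: {rec1[0]!r} vs {rec2[0]!r}")
        yield rec1, rec2
```

A compressor that reorders reads would therefore have to apply the same permutation to both files, or store the pairs together, for this check to keep passing.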



      • #4
        Thank you, 'ulz_peter' and 'gaffa', for your answers. We appreciate your feedback. With respect to quality values, is there a particular threshold above which a quality value always means 'high quality'? If there is such a limit, which ASCII value corresponds to it?



        • #5
          Regarding the FASTQ read name: Illumina has its own way of naming the reads,
          see here: http://en.wikipedia.org/wiki/FASTQ_format
          I don't know of any general FASTQ naming convention, but overwriting the titles of Illumina data would cause a loss of information (as far as I know, the coordinates can be used for optical duplicate detection).
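As a rough illustration of what an Illumina title carries, the sketch below splits a title in the older Illumina style shown on the Wikipedia page linked above (instrument, lane, tile, x, y, optional index and mate number). The regular expression and the function name `parse_old_illumina_title` are assumptions made for this example; newer Illumina software writes a different layout that this pattern does not cover.

```python
import re

# Assumed older Illumina layout: instrument:lane:tile:x:y#index/mate
OLD_ILLUMINA = re.compile(
    r"^(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)"
    r"(?:#(?P<index>[^/]+))?(?:/(?P<mate>[12]))?$"
)

def parse_old_illumina_title(title):
    """Return the title fields as a dict, or None if the layout differs."""
    match = OLD_ILLUMINA.match(title.lstrip("@"))
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y"):
        fields[key] = int(fields[key])  # numeric fields, e.g. cluster coordinates
    return fields

fields = parse_old_illumina_title("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(fields["tile"], fields["x"], fields["y"])
```

The lane, tile and x/y fields are the coordinates mentioned above; a compressor could store them as separate numeric columns rather than discarding them.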

          Concerning the qualities: that depends on many factors, and I don't think there is a general threshold value. Moreover, there are different quality encoding standards, each using its own range of ASCII values.
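To make the encoding point concrete, here is a small sketch of how a quality character maps to a Phred score and an error probability. It assumes a Phred-scaled encoding with a known ASCII offset (33 for Sanger-style files, 64 for Illumina 1.3+ files); the older Solexa encoding uses a different score definition and is not handled. The function names are invented for this example, and Q30 is used only as an illustration, not as a universal "high quality" cutoff.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    return [ord(char) - offset for char in quality]

def error_probability(score):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)

def ascii_for_score(score, offset=33):
    """Return the character that encodes a given Phred score."""
    return chr(score + offset)

print(phred_scores("I?5"))       # scores under the offset-33 assumption
print(ascii_for_score(30, 33))   # character for Q30 with offset 33
print(ascii_for_score(30, 64))   # character for Q30 with offset 64
```

The same score is written as a different character under each offset, so any ASCII threshold is only meaningful once the encoding of the file is known.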



          • #6
            Random access

            Hi all, I have a question about random access to individual reads in a compressed FASTQ file. Is it important? If so, what are the possible applications that random access could be used for? I would appreciate any help in this regard. Thank you in advance.
