  • A strange size difference of fastq file

    Hi, I'm currently working on Illumina sequencing data in FASTQ format. I downloaded it from a publicly available database (TCGA) as a zipped file. After unzipping and trimming, the file is about 16G. Here is the interesting part: after I copied this file to another partition, the new copy was only 7.6G. The number of lines, the number of reads, and the read length distribution are the same in the two files, so I guess they have the same content and the new copy is not truncated.

    Moreover, when I run Tophat2/Cufflinks on the 16G copy, it takes much longer to finish and the result looks strange, but everything is quite normal with the 7.6G copy. This might not be a bioinformatics question, but it's quite interesting. What happened to the file? What might the additional size in the file be?

    Thanks a lot.

  • #2
    I can't tell... But one thing you can try to get some hints is:

    Code:
    cat -vet my_strange_reads.fq | less
    This will show you non-printable characters in the file (with these options, cat also marks each line end with "$" and shows tabs as "^I"). In a typical FASTQ file you shouldn't see anything other than the usual alphanumeric characters and some metacharacters in the read names.

    In practice, I would download the file again, just to rule out that something got corrupted in the process.
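
Paging through a 16G file by eye can take a while, so here is a minimal Python sketch of the same idea. It is not something posted in this thread, and the names `summarize` and `report` are made up for illustration. For each file it reports the exact byte count, a SHA-256 checksum, and a tally of every byte that is neither printable ASCII nor a newline (for example NUL bytes or carriage returns).

```python
import hashlib
from collections import Counter

# Bytes assumed to be "normal" in a plain-text FASTQ file:
# printable ASCII (0x20-0x7E) plus the newline character.
EXPECTED = bytes(range(0x20, 0x7F)) + b"\n"


def summarize(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex, Counter of unexpected byte values)."""
    digest = hashlib.sha256()
    unexpected = Counter()
    size = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            size += len(chunk)
            digest.update(chunk)
            # Delete every expected byte; whatever is left is suspicious.
            leftover = chunk.translate(None, EXPECTED)
            unexpected.update(leftover)
    return size, digest.hexdigest(), unexpected


def report(paths):
    """Print one summary line per file so two copies can be compared."""
    for path in paths:
        size, checksum, odd = summarize(path)
        print(path, size, checksum, dict(odd))
```

Calling `report` with the paths of the two copies would print one line per file. If the byte counts and checksums match, the copies hold the same data, which would suggest the difference lies in how the size is reported rather than in the content. If they differ, the tally shows which unexpected bytes the larger copy contains and how many of each.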
