  • A strange size difference of fastq file

    Hi, I'm currently working with Illumina sequencing data in FASTQ format. I downloaded it from a publicly available database (TCGA) as a zipped file. After unzipping and trimming, the file is about 16G. Here is the interesting part: after I copied this file to another partition, the new copy was only 7.6G. The number of lines, the number of reads, and the read length distribution are the same in the two files, so I assume the two files have the same content and the new copy is not truncated.

    Moreover, when I run Tophat2/Cufflinks on the 16G copy, it takes much longer to finish and the result looks strange, but everything looks normal with the 7.6G copy. This might not be a bioinformatics question, but it's quite interesting. What happened to the file? What could account for the additional size?

    Thanks a lot.

  • #2
    I can't tell... But one thing you can try to get some hints is:

    Code:
    cat -vet my_strange_reads.fq | less
    This will show you non-printable characters in the file. In a typical fastq file you shouldn't see anything beyond the usual alphanumeric characters and some metacharacters in the read names, apart from the end-of-line and tab markers that cat adds itself ($ and ^I).

    In practice, I would download the file again in case something got corrupted in the process.
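
A complementary check is to see whether the two copies differ in content or only in how much space they occupy on disk. The following is a minimal sketch, assuming a POSIX shell with the standard `wc`, `du` and `cmp` utilities; the function name `compare_copies` is made up for illustration and is not part of any tool mentioned in this thread.

```shell
# Hypothetical helper (not from the thread): compare two copies of a file.
compare_copies() {
    a=$1
    b=$2
    # Logical size in bytes, i.e. how many bytes a program reading the file sees.
    printf 'bytes:    %s  %s\n' "$(wc -c < "$a" | tr -d ' ')" "$(wc -c < "$b" | tr -d ' ')"
    # Space allocated on disk in KiB; this can differ from the logical size
    # and can vary from one filesystem to another.
    printf 'disk KiB: %s  %s\n' "$(du -k "$a" | cut -f1)" "$(du -k "$b" | cut -f1)"
    # Byte-for-byte comparison; cmp -s prints nothing and only sets the exit status.
    if cmp -s "$a" "$b"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

It could be called as `compare_copies original.fq copy.fq` (placeholder file names). If the last line says "different" even though the line and read counts match, the extra bytes are probably characters that the `cat -vet` command above would make visible. If it says "identical" and the byte counts agree, the size difference comes from how each partition reports or allocates space, not from the file content.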
