Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • compressing reference genome and indexing

    Hi all

    I report here a strange behavior I experience on both OSX10.6.8 and centos 5.7 with latest software installed (macport and yum respectively)

    If I download archived versions of the human reference genome from the Broad ftp (bundle 1.2)
    and run the command:
    Code:
    zgrep ">" human_b36_both.fasta.gz
    I get it all OK until the end and suddenly pages of binary garbage occur

    I thought this had to do with corruption of the archive and I did the following:
    expand the archive back to multifasta
    reformat the fasta content using bioperl (in and out as fasta)
    recompress with razip (samtools 1.18)
    repeat the zgrep command
    I get the garbage again!!

    Is this normal and due to some specificity of razip?

    ## below a series of commands and results
    Code:
    # sorry for the length
    $> razip -c human_b36_both.fa > human_b36_both.fa.gz
    $> zgrep ">" human_b36_both.fa.gz
    >1
    >2
    …
    >NT_113899
    >NT_113965
    >NT_113898
    >NC_007605
    ?f???,ixAp_?Òoё\G?C?Nm????R??[D?;?œ?)?pj5X??UL?`??mM?l%??ZºŐP?BI??W??d???HCoo??DS?ѷivfq(??X??U???w??? }?C"R???¿?.???????.\???,??7???bҳ*?k??F?b?l!??M??Ս??[???D?T?NfJ?8Ɉ?f?p??cGm
                                                 ?<?:vRv?Hd?ղ???C.??߉?ye???N?
                                                                             U?4CY???w??<??v!?@?o?w;??İ?xHD?b+????|????e9???D?
                          ??:??(fUY???m??jL??o?»B}??!;?X?c????` ?\Y4??)ß????<?/Þ?@@j!'?Y?B?2???"?$?	I???9_?5??=묦???H1?l??Q??|?{??6[????G?a;:&??gw?<e?u2???R,]?P?%?Vd')Y?_K?ae
                                                                          -Z ʂ@??g?YvF?By??q?'??m?Z&? (~5?ʈ???????3??8{?W?j?? 7_?L??-??r?kԊRb?׏?g?8?<?6??$???K??M?-
    ?$H?k?r??v%Jp˴lںxSJ
                       ?q????Khg?	db?>??b?q`E?RJ ?~lH?????%?m.???X?+??t?ߒ??%̽ @??ޫ?)`?it[?w??:?݃ݓY
                                                                                                 ?P3fg$j?????t?>??e?9?n?5?????y23?2WgT?f?*?=l??`ԊU?C??????P?TO
                                                          ??~?4dg?mq&z3??ZJ?qP-??j??r?*??????20?;vRe*?B??LD]
    #(… many many such pages of trailing binary garbage)
    
    # while the tail of the fasta file is clean
    $> tail human_b36_both.fa
    ATGGGGGGCCGCGCATTCCTGGAAAAAGTGGAGGGGGCGTGGCCTTCCCCCGCGGCCCCC
    CAGCCCCCCCGCACAGAGCGGCGCTACGGCGGGCGGGCGGCGGGGGGTCGGGGTCCGCGG
    GCTCCGGGGGCTGCGGGCGGTGGATGGCGGCGGACGTTCCGGGGATCGGGGGGGTCGGGG
    GGCGCCGCGCGGGCGCAGCCATGCGTGACCGTGATGAGGGGGCAGGGTCGCAGGGGGTGT
    GTCTGGTGGGGGCGGGAGCGGGGGGCGGCGCGGGAGCCTGCACGCCGTTGGAGGGTAGAA
    TGACAGGGGGCGGGGACAGAGAGGCGGTCGCGCCCCCGGCCGCGCCAGCCAAGCCCCCAA
    GGGGGGCGGGGAGCGGGCAATGGAGCGTGACGAAGGGCCCCAGGGCTGACCCCGGCAAAC
    GTGACCCGGGGCTCCGGGGTGACCCAGCCAAGCGTGACCAAGGGGCCCGTGGGTGACACA
    GGCAACCCTGACAAAGGCCCCCCAGGAAAGACCCCCGGGGGGCATCGGGGGGGGTGTTGG
    CGGGGGCATGGGGGGGTCGGATTTCGCCCTTATTGCCCTGTTT
    
    # indexing the archive works fine
    $>samtools faidx human_b36_both.fa.gz
    $>cat human_b36_both.fa.gz.fai
    1	247249719	3	60	61
    2	242951149	251370554	60	61
    ...
    NT_113898	1305230	3143346495	60	61
    NC_007605	171823	3144673490	60	61
    
    # extracting the last record also works and ends just like the tail above
    $>samtools faidx human_b36_both.fa.gz NC_007605
    >NC_007605
    AGAATTCGTCTTGCTCTATTCACCCTTACTTTTCTTCTTGCCCGTTCTCTTTCTTAGTAT
    GAATCCAGTATGCCTGCCTGTAATTGTTGCGCCCTACCTCTTTTGGCTGGCGGCTATTGC
    CGCCTCGTGTTTCACGGCCTCAGTTAGTACCGTTGTGACCGCCACCGGCTTGGCCCTCTC
    ACTTCTACTCTTGGCAGCAGTGGCCAGCTCATATGCCGCTGCACAAAGGAAACTGCTGAC
    …
    GGGGGGCGGGGAGCGGGCAATGGAGCGTGACGAAGGGCCCCAGGGCTGACCCCGGCAAAC
    GTGACCCGGGGCTCCGGGGTGACCCAGCCAAGCGTGACCAAGGGGCCCGTGGGTGACACA
    GGCAACCCTGACAAAGGCCCCCCAGGAAAGACCCCCGGGGGGCATCGGGGGGGGTGTTGG
    CGGGGGCATGGGGGGGTCGGATTTCGCCCTTATTGCCCTGTTT
    Last edited by splaisan; 01-29-2012, 05:50 AM.
    http://www.bits.vib.be/index.php

  • #2
    It appears that razip uses gzip compression (or other kinds of compression) but is not gzip. The output contains chunks of (gzipped) compressed data but the entire file is not a gzipped file. razip output cannot be gUNzipped (but can be uncompressed using razip). zgrep tries to UNgzip the file, succeeds party, but then runs into the uncompressed part and fails.

    Comment


    • #3
      Thanks Richard

      I wrongly assumed this was a bonafide archive. Never mind, I will keep a second copy compressed with bgzip to reduce storage as compared to the plain 3GB fasta and use one or the other depending on the needs.

      Great help, thanks
      Stephane

      Originally posted by Richard Finney View Post
      It appears that razip uses gzip compression (or other kinds of compression) but is not gzip. The output contains chunks of (gzipped) compressed data but the entire file is not a gzipped file. razip output cannot be gUNzipped (but can be uncompressed using razip). zgrep tries to UNgzip the file, succeeds party, but then runs into the uncompressed part and fails.
      http://www.bits.vib.be/index.php

      Comment


      • #4
        It is part of the razip problem and part of gzip. If you run "gzip -dc | grep", you get normal output. But if you run "gzip -dcf | grep", which is what zgrep is actually calling, you get those rubbish.

        Comment


        • #5
          thanks Heng

          This helps a lot and I will alias it for regular use!
          You just saved me several gigabites of disk space.
          Cool
          Stephane

          Originally posted by lh3 View Post
          It is part of the razip problem and part of gzip. If you run:
          Code:
          gzip -dc | grep
          , you get normal output. But if you run "gzip -dcf | grep", which is what zgrep is actually calling, you get those rubbish.
          http://www.bits.vib.be/index.php

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Current Approaches to Protein Sequencing
            by seqadmin


            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
            04-04-2024, 04:25 PM
          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 04-11-2024, 12:08 PM
          0 responses
          30 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 10:19 PM
          0 responses
          32 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 09:21 AM
          0 responses
          28 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-04-2024, 09:00 AM
          0 responses
          53 views
          0 likes
          Last Post seqadmin  
          Working...
          X