Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • XeroxHero69
    Member
    • Apr 2019
    • 12

    Accessing Genomic data in a large vcf.gz file

    Hello all, I am very new to bioinformatics; I'm a high school intern. I have a .vcf.gz file that is 332 GB. It contains the whole genomes of around 800 dogs and all I need to do is check for a mutation for each of them. But before I can do that, I need to figure out how to crack the file open. I have heard of ways to possibly access them without decompressing the .vcf.gz because someone told me its possible that the uncompressed file could be up to three terabytes.

    If anyone has any suggestions on how to proceed, I would be eternally grateful as I am nearing the end of my internship and need to figure this out.
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    I think "bcftools view" command should give you what you need. Check the help page here.

    Comment

    • JTH-Genome
      Junior Member
      • Apr 2019
      • 6

      #3
      Canine genomes

      I liked GenoMax's suggestion, so go with that if you haven't already. Also be sure to use long read (PacBio) based assembly for your analysis against a reference genome.
      If you are interested in canine genomes, here are some articles that may be of interest:


      A proposal to sequence the genome of the German Shepherd dog by the University of Wisconsin - Madison was selected as winner of the 2019 Plant and Animal SMRT Grant after garnering more than 3,000 votes in a close online video competition. We caught up with members of the Comparative Genetics Research Laboratory to find out more about the project.

      Scientists are doing deep dives into the genomes of a range of canine cousins along the evolutionary chain.

      #SeqLife

      Comment

      • XeroxHero69
        Member
        • Apr 2019
        • 12

        #4
        Originally posted by GenoMax View Post
        I think "bcftools view" command should give you what you need. Check the help page here.
        Thanks so much, I'll try this out

        Comment

        • XeroxHero69
          Member
          • Apr 2019
          • 12

          #5
          Originally posted by GenoMax View Post
          I think "bcftools view" command should give you what you need. Check the help page here.
          So I have bcftools installed. I want to check for a single-replacement mutation on a specific gene in each of the genomes in the file. Do you know of a way to do this using bcftools? I am new to bioinformatics and this is going way over my head. Thanks

          Comment

          • questor2010
            Junior Member
            • Mar 2010
            • 5

            #6
            Originally posted by GenoMax View Post
            I think "bcftools view" command should give you what you need. Check the help page here.
            Note that for very large vcf files, using -t/-T instead of -r/-R is going to be much faster. And use the --threads option, that will help as well.

            Comment

            • XeroxHero69
              Member
              • Apr 2019
              • 12

              #7
              Originally posted by questor2010 View Post
              Note that for very large vcf files, using -t/-T instead of -r/-R is going to be much faster. And use the --threads option, that will help as well.
              Thanks for your reply. Could you give me an example of the format of what a command that would use bcftools view to find a base at a position would look like?

              Comment

              • questor2010
                Junior Member
                • Mar 2010
                • 5

                #8
                Originally posted by XeroxHero69 View Post
                Thanks for your reply. Could you give me an example of the format of what a command that would use bcftools view to find a base at a position would look like?
                Sure - here's an example:

                bcftools view -f PASS --threads 8 -T target.bed -o gnomad.genomes.r2.1.sites.target.vcf.gz -O z gnomad.genomes.r2.1.sites.vcf.bgz

                In this case, I'm using a bed file, instead of a single region. It is pulling out the FILTER=PASS variants that intersect the bed file into a new compressed vcf file. The source vcf file in this case is 465GB. If you have a single variant, you could use -t 1:11022. It might be best to specify a short range (1:11015-11030) if you're looking at indels - variant callers represent indels in different ways and you want to be sure you properly intersect them.

                Comment

                • XeroxHero69
                  Member
                  • Apr 2019
                  • 12

                  #9
                  Originally posted by questor2010 View Post
                  Sure - here's an example:

                  bcftools view -f PASS --threads 8 -T target.bed -o gnomad.genomes.r2.1.sites.target.vcf.gz -O z gnomad.genomes.r2.1.sites.vcf.bgz

                  In this case, I'm using a bed file, instead of a single region. It is pulling out the FILTER=PASS variants that intersect the bed file into a new compressed vcf file. The source vcf file in this case is 465GB. If you have a single variant, you could use -t 1:11022. It might be best to specify a short range (1:11015-11030) if you're looking at indels - variant callers represent indels in different ways and you want to be sure you properly intersect them.
                  Thanks so much. I tried this command:

                  bcftools view -f PASS --threads 8 -r chr9:55252802-55252802 -o 722g.990.SNP.INDEL.chrAll.vcf.gz -O z 722g.990.SNP.INDEL.chrAll.vcf.gz

                  and it returned:
                  [W::hts_idx_load2] The index file is older than the data file: 722g.990.SNP.INDEL.chrAll.vcf.gz.tbi
                  [W::hts_idx_load2] The index file is older than the data file: 722g.990.SNP.INDEL.chrAll.vcf.gz.tbi
                  [W::bgzf_read_block] EOF marker is absent. The input is probably truncated

                  Do you think you could point out what I did wrong?
                  Edit: I think i see my error, i made it output the data to the same file and now I think i ruined that data file... its only 16.1 kb now
                  Last edited by XeroxHero69; 05-01-2019, 10:38 AM.

                  Comment

                  • questor2010
                    Junior Member
                    • Mar 2010
                    • 5

                    #10
                    Originally posted by XeroxHero69 View Post
                    Thanks so much. I tried this command:

                    bcftools view -f PASS --threads 8 -r chr9:55252802-55252802 -o 722g.990.SNP.INDEL.chrAll.vcf.gz -O z 722g.990.SNP.INDEL.chrAll.vcf.gz

                    and it returned:
                    [W::hts_idx_load2] The index file is older than the data file: 722g.990.SNP.INDEL.chrAll.vcf.gz.tbi
                    [W::hts_idx_load2] The index file is older than the data file: 722g.990.SNP.INDEL.chrAll.vcf.gz.tbi
                    [W::bgzf_read_block] EOF marker is absent. The input is probably truncated

                    Do you think you could point out what I did wrong?
                    Edit: I think i see my error, i made it output the data to the same file and now I think i ruined that data file... its only 16.1 kb now
                    Ouch. Yes - that's the problem. I hope you have another copy readily available. The index file warnings are often caused by file transfers. They transfer first for large files and then have an older timestamp. You can use "touch" to fix that, or reindex, but that takes a long time with large vcf files.

                    Comment

                    • XeroxHero69
                      Member
                      • Apr 2019
                      • 12

                      #11
                      Originally posted by questor2010 View Post
                      Ouch. Yes - that's the problem. I hope you have another copy readily available. The index file warnings are often caused by file transfers. They transfer first for large files and then have an older timestamp. You can use "touch" to fix that, or reindex, but that takes a long time with large vcf files.
                      So to make sure I do it right this time, what do I do to make the .vcf.bgz file?

                      Comment

                      • questor2010
                        Junior Member
                        • Mar 2010
                        • 5

                        #12
                        The -o switch specifies the output file name, the -O switch specifies the format:

                        -O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF

                        So -O z implies that the name should be basename.vcf.gz (or basename.vcf.bgz).

                        If you have a single variant you just want to examine, you could use -v and output to a standard vcf file (text).

                        Comment

                        • XeroxHero69
                          Member
                          • Apr 2019
                          • 12

                          #13
                          Originally posted by questor2010 View Post
                          The -o switch specifies the output file name, the -O switch specifies the format:

                          -O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF

                          So -O z implies that the name should be basename.vcf.gz (or basename.vcf.bgz).

                          If you have a single variant you just want to examine, you could use -v and output to a standard vcf file (text).
                          So do I need to create a new file for the output prior to running this command? how would the command look if I use -v without overwriting my file again. Sorry if these are simple questions. Thanks!

                          Comment

                          Latest Articles

                          Collapse

                          • SEQadmin2
                            From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                            by SEQadmin2


                            Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                            The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                            ...
                            Yesterday, 10:05 AM
                          • SEQadmin2
                            Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                            by SEQadmin2


                            With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                            Introduction

                            Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                            05-22-2026, 06:42 AM
                          • SEQadmin2
                            Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                            by SEQadmin2

                            Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                            Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                            05-06-2026, 09:04 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by SEQadmin2, Yesterday, 12:03 PM
                          0 responses
                          19 views
                          0 reactions
                          Last Post SEQadmin2  
                          Started by SEQadmin2, Yesterday, 11:40 AM
                          0 responses
                          14 views
                          0 reactions
                          Last Post SEQadmin2  
                          Started by SEQadmin2, 05-28-2026, 11:40 AM
                          0 responses
                          29 views
                          0 reactions
                          Last Post SEQadmin2  
                          Started by SEQadmin2, 05-26-2026, 10:12 AM
                          0 responses
                          31 views
                          0 reactions
                          Last Post SEQadmin2  
                          Working...