  • priesgo
    Member
    • Aug 2012
    • 22

    Select only one sample in 1000 Genomes whole genome VCF

    Hi there,

    First of all, I have to say how incredible the work being done by the 1KG project is. This post is not meant to criticize their work, just to ask for some insight into how best to use their dataset.

    So, I'm trying to obtain a whole genome VCF file for only one sample at a time and it is ridiculously difficult and tedious. The variation data in 1KG is provided as one VCF file per chromosome containing the genotypes in all samples analyzed. These files are big even compressed: from 11G for chromosome 1 to 1.8G for chromosome 22.

    The first approach provided by 1KG project is tabix + vcftools command line tools:
    Code:
    tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 1 | vcf-subset -c NA12842 | bgzip -c > NA12842.1KG.chr1.vcf.gz
    This does not work very well as it relies on the network, so my first thought was to download the ALL.chr1... file. OK, but it still takes a long time for tabix to read the file and make a subset, and you have to do this for 24 chromosomes and then merge the results.

    The second approach provided by the 1KG project is an online tool: the Data Slicer. It works all right, but only for small regions. As soon as you try to select a whole chromosome from a VCF file it does not seem to finish loading.


    So I downloaded all the VCF files and wrote a script to launch all the needed commands, but it takes days to output a whole genome VCF file for only one sample. It is not really a good solution.
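For reference, the merge step mentioned above (joining the per-chromosome outputs into one whole genome file) might look like the sketch below. It assumes `vcf-concat` from vcftools and `bgzip` are on the PATH; the helper name `merge_sample_chroms` and the `<sample>.chr<N>.vcf.gz` file naming are hypothetical, and Y or MT would need to be added to the list if you have them.

```shell
#!/usr/bin/env bash
# Sketch: join per-chromosome, single-sample VCFs into one whole genome file.
# merge_sample_chroms and the <sample>.chr<N>.vcf.gz naming are assumptions.
merge_sample_chroms() {
    local sample="$1" c
    set --                                   # reuse "$@" as the ordered file list
    for c in $(seq 1 22) X; do
        set -- "$@" "$sample.chr$c.vcf.gz"
    done
    # vcf-concat (part of vcftools) expects all inputs to have the same columns
    vcf-concat "$@" | bgzip -c > "$sample.genome.vcf.gz"
}

# Example (not run here):
#   merge_sample_chroms NA12814
```

The merge itself is cheap compared with the subsetting, since each input already holds a single sample.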


    Am I missing something? I guess when the data was published nobody considered the possibility of needing the whole dataset by sample. Has anybody else run into the same problem?



    Thanks,
    Pablo.
  • cwhelan
    Member
    • Nov 2010
    • 23

    #2
    Try using the -e and -a options for vcf-subset. I believe that without those it still writes a line for every locus that has a variant, even if that variant is not present in the sample that you are subsetting on, causing very large files to be written. Also I'm not sure that you need tabix if you're operating on the whole chromosome file. I just did something similar to what you are attempting, although I was only extracting indels, and it did not take me days. I did something like this:

    for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
    do
        vcf-subset -e -a -c "$SAMPLE" "ALL.chr${i}.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz" | bgzip -c > "ALL.chr${i}.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.$SAMPLE.vcf.gz"
    done


    • priesgo
      Member
      • Aug 2012
      • 22

      #3
      Hi,


      Thanks for your response!
      I was actually wrong about the code I posted: I am already running vcf-subset with -e and -a, and I'm no longer using tabix. So we are at the same point.

      I just launched the following command for chromosome 1 and sample NA12814:
      Code:
      vcf-subset -e -a -c NA12814 ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz | bgzip -c > NA12814/exome_variants/NA12814.chr1.vcf.gz
      1 hour 10 minutes and counting... Does it take this long in your case?


      Pablo.


      • gsgs
        Senior Member
        • Oct 2009
        • 139

        #4
        I can't handle files > 4G.

        Maybe try the 2123 genomes (except _x, _y, _m) in the /omni directory? It's still difficult, tedious and lengthy to get a full genome. So far I just have the 2123 sequences for chromosome 17: 64155 SNP positions only, 1st allele, 136MB.


        • priesgo
          Member
          • Aug 2012
          • 22

          #5
          Hi there,

          What is the 2123?


          • gsgs
            Senior Member
            • Oct 2009
            • 139

            #6
            They have 2123 individuals there. It's also described in the paper, supplemental material, page 60, chapter 10.5.

            It's a sub-project, as I understood: ~2M SNPs for 2123 individuals.
            Last edited by gsgs; 12-13-2012, 05:52 AM.


            • priesgo
              Member
              • Aug 2012
              • 22

              #7
              Thanks gsgs,
              It seems a good alternative, but in my case I need data from NGS.

              By the way, yesterday I launched a script like the one cwhelan proposed. It is now on chromosome 15 and has already been running for 21 hours and counting...


              Pablo.


              • gsgs
                Senior Member
                • Oct 2009
                • 139

                #8
                Searching for what NGS is...

                Wang, Y., Lu, J., Yu, J., Gibbs, R. & Yu, F. An integrative variant analysis pipeline for accurate genotype/haplotype inference in population NGS data. In revision (2012).

                Do they have what you want?

                NGS = next generation sequencing?
                Last edited by gsgs; 12-14-2012, 01:19 AM.


                • priesgo
                  Member
                  • Aug 2012
                  • 22

                  #9
                  Mmmmm, sorry, I didn't explain myself well.

                  Yes I was referring to Next Generation Sequencing. The thing is that the OMNI data you proposed is obtained via Illumina OMNI microarray genotyping, while the main data comes from NGS machines (Illumina, SOLiD or 454 in this case).
                  For my purpose I need the data obtained from NGS.

                  As for SNPTools, it seems a good tool for SNP calling, but it does not handle VCF files, so it does not really apply to this problem. In any case, I didn't know about it, so I'll have a look.


                  Regards,
                  Pablo.


                  • gsgs
                    Senior Member
                    • Oct 2009
                    • 139

                    #10
                    We get that giant (positions, sequences) matrix, but we need the transpose (sequences, positions).
                    Transposing a big matrix needs a lot of memory.

                    Better to split the matrix into smaller parts and transpose them separately?!


                    "most difficult" says wikipedia
                    Last edited by gsgs; 12-14-2012, 03:25 AM.
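Seen as the matrix described above, extracting one sample amounts to keeping the nine fixed VCF columns plus a single genotype column, which can be done in one streaming pass with no transpose. A minimal awk sketch follows. The function name `extract_sample_column` is hypothetical, and unlike `vcf-subset -e -a` it does not trim unused ALT alleles or update INFO fields such as AC/AN, so treat it as a rough first pass rather than a replacement.

```shell
#!/usr/bin/env bash
# Sketch: keep the 9 fixed VCF columns plus one sample column; VCF text on stdin.
# extract_sample_column is a hypothetical helper name, not part of vcftools.
extract_sample_column() {
    local sample="$1"
    awk -F'\t' -v OFS='\t' -v sample="$sample" '
        /^##/ { print; next }                 # meta-information lines pass through
        /^#CHROM/ {
            # locate the requested sample among the genotype columns (10 onward)
            for (i = 10; i <= NF; i++) if ($i == sample) col = i
            if (!col) { print "sample not found: " sample > "/dev/stderr"; exit 1 }
            print $1, $2, $3, $4, $5, $6, $7, $8, $9, $col
            next
        }
        {
            split($col, gt, ":")              # GT is the first FORMAT subfield
            # keep only sites where the sample carries a non-reference allele
            if (gt[1] ~ /[1-9]/)
                print $1, $2, $3, $4, $5, $6, $7, $8, $9, $col
        }
    '
}

# Example (not run here):
#   gunzip -c ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz |
#       extract_sample_column NA12814 | bgzip -c > NA12814.chr1.vcf.gz
```

Memory use stays constant because each line is handled on its own; the cost is still one full read and decompression of the all-samples file.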


                    • cwhelan
                      Member
                      • Nov 2010
                      • 23

                      #11
                      It's strange that it is taking so long for you; I don't think it took as long for me. I ran each file in parallel on different cores, which is one way to speed it up.
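Running the chromosomes in parallel, as described above, could look like the sketch below. It assumes `vcf-subset` and `bgzip` are on the PATH, that the input files follow the naming used earlier in this thread, and that `xargs` supports `-P` (the GNU and BSD versions do). The function name, the job-count argument and the output naming are illustrative only.

```shell
#!/usr/bin/env bash
# Sketch: one vcf-subset job per chromosome, several jobs running at a time.
# subset_all_chroms is a hypothetical helper; adjust file names to your release.
subset_all_chroms() {
    local sample="$1" jobs="${2:-4}"         # default to 4 concurrent jobs
    { seq 1 22; echo X; } |
        xargs -P "$jobs" -I{} sh -c '
            # $1 is the sample name, $2 the chromosome handed over by xargs
            src="ALL.chr$2.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz"
            vcf-subset -e -a -c "$1" "$src" | bgzip -c > "$1.chr$2.vcf.gz"
        ' sh "$sample" {}
}

# Example (not run here): 8 jobs at once for sample NA12814
#   subset_all_chroms NA12814 8
```

Each job is single-threaded, so the job count is best kept at or below the number of cores, and disk throughput may become the limit before the CPU does.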


                      • priesgo
                        Member
                        • Aug 2012
                        • 22

                        #12
                        gsgs, I am trying to process this data with existing tools such as vcftools, and I am avoiding getting into the guts of it. Anyway, this might be a way to optimize vcftools.


                        • albireo
                          Member
                          • Sep 2012
                          • 39

                          #13
                          Hi, I have the same problem as the OP. I understand how to slice the chromosome data using VCF tools, however I would be glad if you could explain how to stream the data directly to VCF tools without downloading the huge .vcf files first. If I understand correctly, tabix is useless here because I want data for the full chromosome, not for an interval.

                          Any ideas? Thanks in advance.
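One way to stream a whole chromosome without saving the large file first is sketched below. It assumes `curl`, `vcf-subset` and `bgzip` are installed and that the server allows a plain download of the file; `stream_subset` is a hypothetical helper name. The full file still crosses the network, so this saves disk space rather than time.

```shell
#!/usr/bin/env bash
# Sketch: download, decompress, subset and recompress in one pipeline,
# so the all-samples file is never written to disk.
stream_subset() {
    local url="$1" sample="$2" out="$3"
    curl -sS "$url" |                     # fetch the compressed VCF as a stream
        gunzip -c |                       # bgzip output is gzip-compatible
        vcf-subset -e -a -c "$sample" |   # reads VCF text from stdin when no file is given
        bgzip -c > "$out"
}

# Example (not run here), using the URL from the first post, which may have moved:
#   stream_subset \
#     ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz \
#     NA12842 NA12842.1KG.chr1.vcf.gz
```

If the connection drops partway, the output will be truncated, so it is worth checking the last positions in the result against the expected chromosome length.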
