  • Select only one sample in 1000 Genomes whole genome VCF

    Hi there,

    First of all, I have to say how incredible the work being done by the 1KG project is. This post is not meant to criticize that work, only to ask for some insight into how best to use the dataset.

    So, I'm trying to obtain a whole genome VCF file for only one sample at a time and it is ridiculously difficult and tedious. The variation data in 1KG is provided as one VCF file per chromosome containing the genotypes in all samples analyzed. These files are big even compressed: from 11G for chromosome 1 to 1.8G for chromosome 22.

    The first approach provided by 1KG project is tabix + vcftools command line tools:
    Code:
    tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz | vcf-subset -c NA12842 | bgzip -c > NA12842.1KG.chr1.vcf.gz
    This does not work very well as it relies on the network, so my first thought was to download the ALL.chr1... file. OK, but it still takes a long time for tabix to read the file and make a subset, and you have to do this for 24 chromosomes and then merge the results.

    The second approach provided by the 1KG project is an online tool: the Data Slicer. It works all right, but only for small regions. As soon as you try to select a whole chromosome from a VCF file it does not seem to finish loading.


    So, I downloaded all the VCF files and wrote a script to launch all the needed commands, but it takes days to output a whole genome VCF file for only one sample. It is not really a good solution.


    Am I missing something? I guess when the data was published nobody thought of the possibility of needing the whole data by sample. Has anybody else gone through the same problem?



    Thanx,
    Pablo.

  • #2
    Try using the -e and -a options for vcf-subset. I believe that without those it still writes a line for every locus that has a variant, even if that variant is not present in the sample that you are subsetting on, causing very large files to be written. Also I'm not sure that you need tabix if you're operating on the whole chromosome file. I just did something similar to what you are attempting, although I was only extracting indels, and it did not take me days. I did something like this:

    # Subset each per-chromosome file to the sample named in $SAMPLE
    for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
    do
        vcf-subset -e -a -c "$SAMPLE" "ALL.chr${i}.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz" | bgzip -c > "ALL.chr${i}.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.${SAMPLE}.vcf.gz"
    done
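
The loop above leaves one compressed file per chromosome, and the original post mentions having to merge them afterwards. A minimal sketch of that merge, assuming every per-chromosome file carries the same header for the single sample: keep the header of the first file and append only the records of the others. The function name `concat_vcfs` and the file names in the usage comment are hypothetical; vcftools also ships a `vcf-concat` utility intended for this kind of job.

```shell
# concat_vcfs FILE1.vcf.gz FILE2.vcf.gz ...
# Writes one uncompressed VCF to stdout: the full first file, then the
# non-header lines of every following file, in the order given.
concat_vcfs() {
    first=1
    for f in "$@"; do
        if [ "$first" -eq 1 ]; then
            gzip -dc "$f"                    # header and records of the first file
            first=0
        else
            gzip -dc "$f" | awk '!/^#/'      # records only; drop repeated header lines
        fi
    done
}

# Usage (hypothetical file names, assumes bgzip is installed):
#   concat_vcfs NA12814.chr1.vcf.gz NA12814.chr2.vcf.gz | bgzip -c > NA12814.genome.vcf.gz
```

Files must be passed in the chromosome order wanted in the output, since nothing is sorted here.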


    • #3
      Hi,


      Thanks for your response!
      The code I posted was actually wrong: I am already running vcf-subset with -e and -a, and I'm no longer using tabix. So we are at the same point.

      I just launched the following command for chromosome 1 and sample NA12814:
      Code:
      vcf-subset -e -a -c NA12814 ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz | bgzip -c > NA12814/exome_variants/NA12814.chr1.vcf.gz
      1 hour 10 minutes and counting... Is it taking this long in your case?


      Pablo.


      • #4
        I can't handle files > 4G.

        Maybe try the 2123 genomes (except _x, _y, _m) in the /omni directory? It's still difficult, tedious and lengthy to get a full genome. So far I just have the 2123 sequences for chromosome 17, 1st allele, at 64155 SNP positions only (136MB).


        • #5
          Hi there,

          What is the 2123?


          • #6
            They have 2123 individuals there. It's also described in the paper, supplemental material, page 60, chapter 10.5.

            It's a sub-project, as I understood: ~2M SNPs for 2123 individuals.
            Last edited by gsgs; 12-13-2012, 05:52 AM.


            • #7
              Thanks gsgs,
              It seems a good alternative, but in my case I need data from NGS.

              By the way, yesterday I launched a script like the one cwhelan proposed. It is now on chromosome 15 and has already been running for 21 hours and counting...


              Pablo.


              • #8
                Searching for what NGS is ...

                Wang, Y., Lu, J., Yu, J., Gibbs, R. & Yu, F. An integrative variant analysis pipeline for accurate genotype/haplotype inference in population NGS data. In revision (2012).

                Do they have what you want?

                NGS = next generation sequencing?
                Last edited by gsgs; 12-14-2012, 01:19 AM.


                • #9
                  Mmmmm, sorry, I didn't explain myself well.

                  Yes, I was referring to Next Generation Sequencing. The thing is that the OMNI data you proposed was obtained via Illumina OMNI microarray genotyping, while the main data comes from NGS machines (Illumina, SOLiD or 454 in this case).
                  For my purpose I need the data obtained from NGS.

                  About SNPTools, it seems a good tool for SNP calling, but it does not manage VCF files, so it does not really apply to this problem. In any case, I didn't know about it, so I'll have a look at it.


                  Regards,
                  Pablo.


                  • #10
                    We get that giant (positions, sequences) matrix, but we need the transpose (sequences, positions).
                    Transposing a big matrix needs a lot of memory.

                    Maybe it's better to split the matrix into smaller parts and transpose them separately?!


                    "most difficult" says wikipedia
                    Last edited by gsgs; 12-14-2012, 03:25 AM.
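
For a single sample, no transposition is strictly needed: each VCF record is one line, so the sample's column can be picked out while streaming, with memory use independent of file size. A minimal sketch with awk, assuming a tab-separated VCF on stdin; the function name `extract_sample` is hypothetical. It keeps only rows where the sample's genotype carries a non-reference allele (roughly what `-e` is described as doing earlier in the thread), but unlike vcf-subset it does not trim unused ALT alleles or touch the INFO column.

```shell
# extract_sample SAMPLE < all_samples.vcf > one_sample.vcf
extract_sample() {
    awk -v sample="$1" 'BEGIN { FS = OFS = "\t" }
        /^##/ { print; next }                   # meta-information lines pass through
        /^#CHROM/ {                             # column header: locate the sample
            for (i = 10; i <= NF; i++) if ($i == sample) col = i
            if (!col) { print "sample not found: " sample > "/dev/stderr"; exit 1 }
            print $1, $2, $3, $4, $5, $6, $7, $8, $9, $col
            next
        }
        {
            split($col, f, ":")                 # GT is the first FORMAT field when present
            if (f[1] ~ /[1-9]/)                 # any non-reference allele index in the genotype
                print $1, $2, $3, $4, $5, $6, $7, $8, $9, $col
        }'
}

# Usage (assumes bgzip is installed; input name taken from the thread):
#   gzip -dc ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz \
#       | extract_sample NA12814 | bgzip -c > NA12814.chr1.vcf.gz
```

Whether this is faster than vcf-subset on the real files is an assumption that would need to be measured.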


                    • #11
                      It's strange that it is taking so long for you; I don't think it took as long for me. I ran each file in parallel on different cores, which is one way to speed it up.
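
One way to run each file on a different core, sketched under assumptions: the helper below only prints one vcf-subset command per chromosome (file names follow the loop posted earlier in the thread), and the commands are then handed to `xargs -P`, which is widely available but not part of POSIX. The names `CHROMS` and `subset_commands` are hypothetical, and the job count of 4 is only an example.

```shell
# Chromosome files to process; adjust to the files actually downloaded.
CHROMS="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X"

# subset_commands SAMPLE
# Prints one shell command per chromosome; nothing is executed here.
subset_commands() {
    sample=$1
    for chrom in $CHROMS; do
        in="ALL.chr${chrom}.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz"
        out="ALL.chr${chrom}.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.${sample}.vcf.gz"
        printf 'vcf-subset -e -a -c %s %s | bgzip -c > %s\n' "$sample" "$in" "$out"
    done
}

# Usage (assumes vcf-subset, bgzip and an xargs that supports -P):
#   subset_commands NA12814 | xargs -P 4 -I CMD sh -c CMD
```

Printing the commands first also gives a dry run: the list can be inspected before anything is launched.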


                      • #12
                        gsgs, I am trying to process this data using existing tools such as vcftools, and I am avoiding getting into the guts of it. Anyway, your idea might apply to optimizing vcftools.


                        • #13
                          Hi, I have the same problem as the OP. I understand how to slice the chromosome data using VCF tools, however I would be glad if you could explain how to stream the data directly to VCF tools without downloading the huge .vcf files first. If I understand correctly, tabix is useless here because I want data for the full chromosome, not for an interval.

                          Any ideas? Thanks in advance.
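
A possible way to stream a whole chromosome without keeping the large .vcf.gz on disk, as a sketch only: download to stdout, decompress on the fly, and pipe straight into vcf-subset, which reads standard input when no file is given (as in the first post's pipeline). The names `FETCH` and `stream_vcf` are hypothetical, and curl is an assumed choice of downloader. The full compressed file still crosses the network, so this saves disk space, not transfer time.

```shell
# Download command; override FETCH to use another tool.
FETCH=${FETCH:-"curl -sL"}

# stream_vcf URL
# Writes the decompressed VCF found at URL to stdout without saving the .gz.
stream_vcf() {
    $FETCH "$1" | gzip -dc
}

# Usage (assumes curl, vcf-subset and bgzip are installed; URL from the first post):
#   stream_vcf ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz \
#       | vcf-subset -e -a -c NA12814 | bgzip -c > NA12814.chr1.vcf.gz
```

If the connection drops midway, the output is silently truncated, so checking the last position in the result against the source is advisable.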
