  • VCF file for the Mouse genome (mm9) used for GATK

    I need an mm9 dbSNP128 VCF file (v4 or above) to integrate into our whole-genome mouse sequencing pipeline using the GATK.
    Has anybody been lucky enough to generate this file? The Broad Institute only provides human dbSNP VCF files. The Sanger Institute does provide VCF3.3 files for the mouse strains they sequenced, but no VCF file is provided for mouse dbSNP128.

    NCBI/UCSC only has mouse dbSNP128 in text format, and it is not easy to convert it to a "workable" VCF file.

  • #2
    Hi, what do you mean by a workable VCF format? Have you tried that dbSNP128 file (from NCBI/UCSC) in GATK?



    • #3
      I did the following to get a mouse dbSNP128 VCF file:
      wget http://hgdownload.cse.ucsc.edu/golde.../snp128.txt.gz
      gunzip snp128.txt.gz
      vcfutils.pl ucscsnp2vcf snp128.txt >snp128.vcf

      Then I ran GATK:
      ##### ERROR MESSAGE: We saw a record with a start of chr1:21250573 after a record with a start of chr1:21250574, for input source: /mm9/snp128.vcf

      I sorted the mouse VCF file into snp128_sorted.vcf and re-ran GATK:
      ##### ERROR MESSAGE: The provided VCF file is malformed at line number 2: Unparsable vcf record with allele NLENGTHTOOLONG

      more snp128_sorted.vcf
      ##fileformat=VCFv4.0
      #CHROM POS ID REF ALT QUAL FILTER INFO
      chr1 3000248 rs32640266 G T 0 . molType=genomic;class=single;valid=by-frequency
      chr1 3000289 rs32137367 T G 0 . molType=genomic;class=single;valid=by-frequency
      chr1 3000353 rs31719101 C T 0 . molType=genomic;class=single;valid=by-frequency
      chr1 3000355 rs31443144 T C 0 . molType=genomic;class=single;valid=by-frequency
      chr1 3000424 rs32793820 TTTTTTTCTTGGGTTTCTGATATTCTTTAAAGGATTTATTGATTTCCTCCAATTTTTAATTTGCTTTTTTCTTGATTTCTTTAGGATATTTCTTTTTCATTTTCCTTT A,T 0 . molType=genomic;class=single;valid=by-frequency
      chr1 3001066 rs49746803 G T 0 . molType=genomic;class=single



      • #4
        For the first error: you will have to sort the VCF file in chromosome coordinate order for it to work with GATK.

        A simple Unix sort command should do the trick for you.
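
A plain lexicographic sort places chr10 before chr2, while GATK generally expects contigs in the same order as the reference. The following is a minimal Python sketch of a coordinate sort, assuming whitespace-separated records and a caller-supplied contig order. The function name `sort_vcf_lines`, the three-contig order, and the record IDs are hypothetical and only serve the illustration.

```python
def sort_vcf_lines(lines, contig_order):
    """Return header lines followed by records sorted by (contig rank, position)."""
    rank = {name: i for i, name in enumerate(contig_order)}
    header = [ln for ln in lines if ln.startswith("#")]
    records = [ln for ln in lines if ln.strip() and not ln.startswith("#")]

    def key(line):
        chrom, pos = line.split()[:2]
        # Contigs missing from contig_order sort last, by name.
        return (rank.get(chrom, len(rank)), chrom, int(pos))

    return header + sorted(records, key=key)


# Hypothetical short contig order; a real run would pass the reference's full order.
example_order = ["chrM", "chr1", "chr2"]
example = [
    "##fileformat=VCFv4.0",
    "#CHROM POS ID REF ALT QUAL FILTER INFO",
    "chr2 100 rsB A G 0 . .",          # made-up record
    "chr1 21250574 rsC C T 0 . .",     # made-up record
    "chr1 21250573 rsD G A 0 . .",     # made-up record
]
for line in sort_vcf_lines(example, example_order):
    print(line)
```

Positions are compared as integers, so chr1:21250573 lands before chr1:21250574 regardless of how the text would sort as strings.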

        However, I have encountered other issues, like the second error, while using dbSNP128 from UCSC with GATK, and I was hoping to build a VCF from the XML files supplied by NCBI. I will post here if I have any success that way.

        I *think* the snp128.txt file has variants other than small indels and SNPs, which GATK does not handle properly.
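
If that guess is right, one possible workaround is to drop records whose alleles are not plain base strings before handing the file to GATK. The sketch below is only an illustration under that assumption: it keeps a record when REF and every ALT allele consist solely of A, C, G, T or N. The helper names and the second example record are hypothetical, and silently discarding records may not be acceptable for every pipeline.

```python
import re

# An allele counts as "plain" if it is made only of A, C, G, T or N (any case).
PLAIN_ALLELE = re.compile(r"[ACGTN]+", re.IGNORECASE)


def has_plain_alleles(record):
    """True if REF and every comma-separated ALT allele is a plain base string."""
    fields = record.split()
    alleles = [fields[3]] + fields[4].split(",")
    return all(PLAIN_ALLELE.fullmatch(a) for a in alleles)


def drop_unparsable(lines):
    """Keep header lines and records with plain alleles; skip blank lines."""
    kept = []
    for ln in lines:
        if not ln.strip():
            continue
        if ln.startswith("#") or has_plain_alleles(ln):
            kept.append(ln)
    return kept


example = [
    "##fileformat=VCFv4.0",
    "#CHROM POS ID REF ALT QUAL FILTER INFO",
    "chr1 3000248 rs32640266 G T 0 . molType=genomic;class=single;valid=by-frequency",
    # Made-up record carrying the allele string from the error message.
    "chr1 3000300 rsHypothetical N NLENGTHTOOLONG 0 . molType=genomic",
]
for line in drop_unparsable(example):
    print(line)
```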
        Last edited by SeekAnswers; 05-07-2012, 10:50 AM.



        • #5
          Originally posted by gap
          I need an mm9 dbSNP128 VCF file (v4 or above) to integrate into our whole-genome mouse sequencing pipeline using the GATK.
          Has anybody been lucky enough to generate this file? The Broad Institute only provides human dbSNP VCF files. The Sanger Institute does provide VCF3.3 files for the mouse strains they sequenced, but no VCF file is provided for mouse dbSNP128.

          NCBI/UCSC only has mouse dbSNP128 in text format, and it is not easy to convert it to a "workable" VCF file.

          Maybe you can try this database:
          We are pleased to announce the release of a VCF version of dbSNP build 132 for the mouse assembly (UCSC/mm9). dbSNP build 132 is available at NCBI.

          This dbSNP_132 VCF version can be used in a GATK pipeline.
          Many thanks to dbSNP at NCBI for the data. This version was produced at WuXi Genome Center by Guan Wang and Qin Luo.
          mm9_karyosort = ['chrM', 'chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chrX', 'chrY', 'chr1_random', 'chr3_random', 'chr4_random', 'chr5_random', 'chr7_random', 'chr8_random', 'chr9_random', 'chr13_random', 'chr16_random', 'chr17_random', 'chrX_random', 'chrY_random', 'chrUn_random']
          Last edited by wanguan2000; 05-28-2012, 06:40 PM.



          • #6
            ^Great!

            And for working with the snp128.txt file from UCSC, I was able to convert it to a workable VCF by using GATK VariantsToVCF and then using GATK's liftOverVCF.pl to convert it to the mm10 reference.

            This seems to be working fine with GATK so far.



            • #7
              Can you tell me the command you used to convert the .txt file to .vcf with GATK? When I try this, I get an error that the text file isn't sorted correctly, but I'm not sure how to sort a plain text file.
