  • mini tutorial: build reference mouse genome and SNP for GATK

    GATK is a standard tool for calling SNPs; however, its authors do not provide reference genomes or reference SNPs for non-human organisms such as mouse. Here is my quick tutorial for building an mm10 mouse reference genome and a dbSNP reference SNP set from scratch. It's not automated. I'd appreciate any input to make this workflow more efficient.

    1. Build reference mm10 genome.
    1.1 Download the reference from http://ccb.jhu.edu/software/tophat/igenomes.shtml. Make sure you download the "Mus musculus UCSC MM10" reference.
    1.2 Untar the file and find the directory that contains the sequence for each individual chromosome. The directory looks like this: "Mus_musculus_UCSC_mm10\Mus_musculus\UCSC\mm10\Sequence\Chromosomes"
    Enter the directory.
    1.3 Remove the "chr" prefix from the chromosome headers:
    sed -i -- "s/chr//g" *.fa
    1.4 Combine the chromosomes into a full genome:
    cat chr1.fa chr2.fa...chrX.fa chrY.fa > mm10.fa #Make sure you combine the chromosomes in karyotypic order and do not include random or unmapped chromosomes.
    1.5 Index the genome and build the dictionary file:
    samtools faidx mm10.fa
    java -jar CreateSequenceDictionary.jar R=mm10.fa O=mm10.dict
    1.6 Create the BWA index:
    bwa index -a bwtsw mm10.fa
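
Steps 1.3 and 1.4 could be scripted roughly as follows. This is only a sketch: the function name build_genome is made up for illustration, and it assumes the per-chromosome files are named chr1.fa through chr19.fa plus chrX.fa and chrY.fa. It strips "chr" from header lines only (those starting with ">") while copying, so the source files are left untouched.

```shell
# Hypothetical helper (not part of the original post).
# Usage: build_genome CHROM_DIR OUT.fa
# Assumes CHROM_DIR holds chr1.fa ... chr19.fa, chrX.fa and chrY.fa.
build_genome() {
    bg_dir="$1"
    bg_out="$2"
    : > "$bg_out"    # create or truncate the output file
    # Karyotypic order: autosomes 1-19, then X, then Y.
    for bg_c in $(seq 1 19) X Y; do
        # Drop the "chr" prefix on FASTA header lines, then append.
        sed 's/^>chr/>/' "$bg_dir/chr${bg_c}.fa" >> "$bg_out" || return 1
    done
}
```

For example, running `build_genome . mm10.fa` inside the Chromosomes directory would write mm10.fa there, after which steps 1.5 and 1.6 apply unchanged.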

    2. Build reference mouse SNP
    2.1 Download the VCF files (reference mouse SNPs):
    wget ftp://ftp.ncbi.nih.gov/snp/organisms...f_chr_*.vcf.gz
    #Discard the Un, MT, and random chromosomes, then unzip
    #Remove the redundant header (delete the first 14 rows):
    sed -i "1,14d" chr2.vcf #do this for all files except chr1
    #Merge all VCF files
    cat chr1.vcf chr2.vcf... chrX.vcf chrY.vcf > dbsnp.vcf
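
The header trimming and merge could also be done in one pass. The sketch below uses a different technique from the fixed 14-row deletion: it relies on VCF header lines starting with "#" and drops every such line from all files after the first. The function name merge_vcfs is hypothetical, and the files are assumed to be already unzipped.

```shell
# Hypothetical helper (not part of the original post).
# Usage: merge_vcfs OUT.vcf FIRST.vcf [REST.vcf ...]
# Keeps the header of FIRST.vcf only; list the inputs in karyotypic order.
merge_vcfs() {
    mv_out="$1"; shift
    mv_first="$1"; shift
    # Copy the first file whole, header included.
    cat "$mv_first" > "$mv_out" || return 1
    for mv_f in "$@"; do
        # Delete header lines (those starting with "#"), append the records.
        sed '/^#/d' "$mv_f" >> "$mv_out" || return 1
    done
}
```

A call such as `merge_vcfs dbsnp.vcf chr1.vcf chr2.vcf` (continuing through chrX.vcf and chrY.vcf) does not depend on the header being exactly 14 lines long.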

    Now you can use BWA to align the raw reads first, and then use GATK to call the SNPs.

  • #2
    I am not sure why you are editing the chromosome names or merging multiple files. iGenomes already comes with a combined genome FASTA file (Sequence/WholeGenomeFasta) that is already indexed.

    • #3
      Originally posted by id0:
      I am not sure why you are editing the chromosome names or merging multiple files. iGenomes already comes with a combined genome FASTA file (Sequence/WholeGenomeFasta) that is already indexed.
      That genome is not sorted in karyotypic order, as its .fai index shows:
      chr10 130694993 7 50 51
      chr11 122082543 133308907 50 51
      chr12 120129022 257833108 50 51
      chr13 120421639 380364718 50 51
      chr14 124902244 503194797 50 51
      chr15 104043685 630595093 50 51
      chr16 98207768 736719659 50 51
      chr17 94987271 836891590 50 51
      chr18 90702639 933778614 50 51
      chr19 61431566 1026295313 50 51
      chr1 195471971 1088955517 50 51
      chr2 182113224 1288336934 50 51
      chr3 160039680 1474092429 50 51
      chr4 156508116 1637332909 50 51
      chr5 151834684 1796971194 50 51
      chr6 149736546 1951842578 50 51
      chr7 145441459 2104573861 50 51
      chr8 129401213 2252924156 50 51
      chr9 124595110 2384913400 50 51
      chrM 16299 2512000419 50 51
      chrX 171031299 2512017050 50 51
      chrY 91744698 2686468981 50 51
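
A quick way to test an index like the one above is to compare its first column against the expected order. This is a rough sketch, not something from the thread: the function name fai_order_ok is made up, it assumes "chr"-prefixed names, and it ignores chrM and any other contigs outside chr1 to chr19, chrX and chrY.

```shell
# Hypothetical check (not part of the original thread).
# Usage: fai_order_ok FILE.fai
# Succeeds only if chr1..chr19, chrX, chrY appear in karyotypic order.
fai_order_ok() {
    fo_expected=$(for fo_c in $(seq 1 19) X Y; do printf 'chr%s\n' "$fo_c"; done)
    # Take the contig name column and keep only the main chromosomes.
    fo_actual=$(awk '{print $1}' "$1" | grep -E '^chr([0-9]+|X|Y)$')
    [ "$fo_actual" = "$fo_expected" ]
}
```

Something like `fai_order_ok genome.fa.fai || echo "not in karyotypic order"` would flag the listing above, since chr10 comes before chr1 there.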
