
  • Converting an early genome assembly to current coordinates?

    Hi there,

    I have been involved in a project where I have aligned my sequence data to MGSCv3, the first build of the mouse genome, which consists of ~250,000 contigs. My project is testing whether I can use a sequencing technique to order and reconstitute these contigs into a more 'complete' genome.

    As such, I would like to see how accurately I have ordered my MGSCv3 data by comparing it to the actual locations of each of the 250k contigs in the latest build of the mouse genome (GRCm38 / mm10). I initially took a 'dirty' approach of just taking the first 100 nt of each contig and running bwa aln against the latest build, but I would like to get more accurate localizations.

    Initially I thought I could just find the current mm10 coordinates of the MGSCv3 accession numbers or gi numbers in NCBI, but I can't locate such a table.

    Then I thought I could use LiftOver to find the coordinates, but the assembly versions don't go back far enough in UCSC (they only support liftOvers from mm7 onward). Then I tried BLAT or BLAST, but the online versions couldn't handle the number of records I want to analyze, and I couldn't find a good way to implement a local installation to do this.

    Finally, I've been looking at NCBI Remap, but again the web-based version cannot handle the number of records, and I can't find a way to run it locally. Also, the identifiers Remap uses for MGSCv3 are different from the identifiers I have. In the build downloaded from NCBI, each FASTA record header is in the format

    "gi|20564479|emb|CAAA01000001.1|,9601"

    while remap wants the location in the format

    "chrMmUn_WIFeb01_42457:1 -9600"

    I was wondering if this community has any ideas on how to convert bulk records from a very early reference assembly to a later version? Or if there are any repositories that would contain this information?

    Any advice would be greatly appreciated!
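    For the identifier mismatch described above, a minimal Python sketch of pulling the fields out of a header shaped like the example in this post. The regular expression is an assumption based on that single example, the meaning of the number after the comma is not confirmed here (so it is kept as a plain `trailing_number`), and `parse_header` and `region_string` are hypothetical helper names. Mapping an accession to a name such as "chrMmUn_WIFeb01_42457" would still need a lookup table that this sketch does not provide.

```python
import re
from typing import NamedTuple, Optional

# Pattern for headers shaped like the single example in the post (an assumption).
HEADER_RE = re.compile(r"^>?gi\|(\d+)\|([A-Za-z]+)\|([^|]+)\|\s*,?\s*(\d+)?")


class ContigHeader(NamedTuple):
    gi: str
    db: str
    accession: str
    trailing_number: Optional[int]  # meaning assumed, not confirmed by the source


def parse_header(header: str) -> ContigHeader:
    """Split a 'gi|...|emb|...|,N' style header into its fields."""
    match = HEADER_RE.match(header.strip().strip('"'))
    if match is None:
        raise ValueError(f"unrecognized header: {header!r}")
    gi, db, accession, number = match.groups()
    return ContigHeader(gi, db, accession, int(number) if number else None)


def region_string(name: str, end: int) -> str:
    """Format a 1-based, inclusive 'name:1-end' region string."""
    return f"{name}:1-{end}"


if __name__ == "__main__":
    parsed = parse_header("gi|20564479|emb|CAAA01000001.1|,9601")
    print(parsed)
    # The sequence name passed here is a placeholder; the real name must come
    # from whatever accession-to-name table is available.
    print(region_string(parsed.accession, 9600))
```

    The two helpers are kept separate so the name lookup can be slotted in between them once a suitable table is found.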

  • #2
    LiftOver files for the older genome builds for mouse are available via UCSC archives: http://genome-archive.cse.ucsc.edu/downloads.html
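    If a suitable chain file turns up in that archive, the command-line liftOver tool takes a BED file of intervals as input. A rough Python sketch of writing one whole-contig interval per contig and assembling the usual positional call (`liftOver oldFile map.chain newFile unMapped`) follows. The contig names, lengths, and every file name are placeholders, `bed_line` and `liftover_command` are hypothetical helpers, and the contig names must match whatever naming the chain file uses, which may differ from the NCBI FASTA headers.

```python
from typing import Dict, List


def bed_line(name: str, length: int) -> str:
    """One BED4 line covering a whole contig (BED is 0-based, half-open)."""
    return f"{name}\t0\t{length}\t{name}"


def write_bed(contig_lengths: Dict[str, int], path: str) -> None:
    """Write one whole-contig interval per entry to a BED file."""
    with open(path, "w") as handle:
        for name, length in contig_lengths.items():
            handle.write(bed_line(name, length) + "\n")


def liftover_command(bed: str, chain: str, out: str, unmapped: str) -> List[str]:
    """Argument list for the standard positional liftOver invocation."""
    return ["liftOver", bed, chain, out, unmapped]


if __name__ == "__main__":
    # Placeholder contig name and length, for illustration only.
    write_bed({"contigA": 9600}, "contigs.bed")
    # "old_to_new.over.chain" is a placeholder for a chain file from the archive.
    cmd = liftover_command(
        "contigs.bed", "old_to_new.over.chain", "lifted.bed", "unmapped.bed"
    )
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

    Lifting whole contigs as single intervals can fail where a contig is split or rearranged in the newer build, so the unmapped output is worth checking.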
