Hi there,
I have been involved in a project where I have aligned my sequence data to MGSCv3, the first build of the mouse genome, which consists of ~250,000 contigs. My project is testing whether I can use a sequencing technique to order and reconstitute these contigs into a more 'complete' genome.
As such, I would like to see how accurately I have ordered my MGSCv3 data by comparing it to the actual locations of each of the 250k contigs in the latest build of the mouse genome (GRCm38 / mm10). I initially took a quick-and-dirty approach of just taking the first 100 nt of each contig and aligning them to the latest build with bwa aln, but I would like to get more accurate localizations.
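For reference, the quick first pass described above (taking the leading 100 nt of every contig) can be sketched roughly as follows. This is a minimal illustration only, assuming a plain multi-line FASTA as input; the function name `leading_fragments` is made up for this example, and the alignment step itself is not shown.

```python
from typing import Iterable, Iterator, Tuple


def leading_fragments(lines: Iterable[str], n: int = 100) -> Iterator[Tuple[str, str]]:
    """Yield (header, first n bases) for each record in a FASTA stream.

    Hypothetical helper for illustration; assumes standard '>' header lines.
    """
    header = None
    chunks = []
    collected = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Emit the previous record before starting a new one.
            if header is not None:
                yield header, "".join(chunks)[:n]
            header, chunks, collected = line[1:], [], 0
        elif header is not None and collected < n:
            # Stop accumulating once n bases are in hand, so long
            # contigs are never held fully in memory.
            chunks.append(line)
            collected += len(line)
    if header is not None:
        yield header, "".join(chunks)[:n]
```

Contigs shorter than `n` are returned whole. The resulting fragments could be written back out as FASTA and passed to the aligner.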
Initially I thought I could just find the current mm10 coordinates of the MGSCv3 accession numbers or gi numbers in NCBI, but I can't locate such a table.
Then I thought I could use LiftOver to find the coordinates, but the assembly versions don't go back far enough in UCSC (they only support liftOvers from mm7 onward). Then I tried BLAT or BLAST, but the online versions couldn't handle the number of records I want to analyze, and I couldn't find a good way to implement a local installation to do this.
Finally, I've been looking at NCBI Remap, but again the web-based version cannot handle the number of records, and I can't find a way to run it locally. Also, the identifiers Remap uses for MGSCv3 are different from the identifiers I have. In the build I downloaded from NCBI, each FASTA record is labelled in the format
"gi|20564479|emb|CAAA01000001.1|,9601"
while remap wants the location in the format
"chrMmUn_WIFeb01_42457:1 -9600"
I was wondering if this community has any ideas on how to convert bulk records from a very early reference assembly to a later version? Or if there are any repositories that would contain this information?
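To make the identifier mismatch above concrete, here is a rough sketch of how the NCBI-style headers could be split into their parts and turned into a `name:start-end` string. Several things here are assumptions: that the number after the comma is the contig length, that every header follows the `gi|...|emb|...|` pattern of the example, and the helper names `parse_header` and `to_region`. It does not solve the real problem, which is that I have no table linking accessions such as CAAA01000001.1 to the `chrMmUn_WIFeb01_...` names.

```python
import re
from typing import NamedTuple


class ContigHeader(NamedTuple):
    gi: str
    accession: str
    trailing_number: int  # assumed (not confirmed) to be the contig length


# Pattern modelled only on the single example header in this post.
HEADER_RE = re.compile(r"^>?gi\|(\d+)\|emb\|([^|]+)\|\s*,?\s*(\d+)\s*$")


def parse_header(header: str) -> ContigHeader:
    """Split a header like 'gi|20564479|emb|CAAA01000001.1|,9601' into fields."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {header!r}")
    return ContigHeader(match.group(1), match.group(2), int(match.group(3)))


def to_region(name: str, start: int, end: int) -> str:
    """Format a 1-based interval as 'name:start-end'."""
    return f"{name}:{start}-{end}"


if __name__ == "__main__":
    h = parse_header("gi|20564479|emb|CAAA01000001.1|,9601")
    # Uses the accession as the sequence name, because the mapping to
    # the chrMmUn_* names is exactly the piece that is missing.
    print(to_region(h.accession, 1, h.trailing_number))
```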
Any advice would be greatly appreciated!