  • Retrieve genomic coordinates (locus start, locus end) from blastn hits?

    Hi everyone, I'm trying to carry out a reciprocal best hits analysis with blastn (a two-step process) to find orthologs of novel human miRNAs predicted by a program called miRDeep2 (I know this is probably not the best way). I'm working on Linux with BLAST 2.2.30+. I'm running blastn against the pre-formatted, partially non-redundant NCBI nucleotide database (the huge one, which I downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt*), but limiting the search with -seqidlist and a file containing the accession numbers of all non-human mammalian genes, contigs, chunks of DNA, or whatever else is in the database. I got the accession list for mammals like this:

    esearch -db nucleotide -query "Mammalia[organism]" | efetch -db nucleotide -format acc > mammalia.acc

    Then I did the same for Homo sapiens and subtracted that list from the mammalian one, to get an accession list for mammals excluding humans:

    esearch -db nucleotide -query "Homo sapiens[organism]" | efetch -db nucleotide -format acc > hsa.acc

    The command line for running blastn looks something like this (each miRNA FASTA file contains a single sequence plus its identifier, the sequences are short, between 50 and 90 nt, and the mammalian accession file has around 35M lines):

    blastn -db nt -seqidlist mammalian_minus_hsa.acc -query mymiRNA.fa -out mymiRNA_blast.out -evalue 1e-10 -word_size 20 -outfmt "6 qacc sseqid sseq sstart send sscinames evalue"
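    The post does not show how mammalian_minus_hsa.acc was produced from mammalia.acc and hsa.acc. A minimal sketch of that subtraction, assuming both files hold one accession per line (the function names are hypothetical, not part of any NCBI tool):

```python
def subtract_accessions(all_ids, exclude_ids):
    """Yield every accession in all_ids that does not appear in exclude_ids.

    Both arguments are iterables of lines (for example open file handles).
    Only the exclusion list is held in memory; the larger list is streamed.
    """
    excluded = {line.strip() for line in exclude_ids}
    for line in all_ids:
        accession = line.strip()
        if accession and accession not in excluded:
            yield accession


def write_filtered_list(mammal_path, human_path, out_path):
    """Write the mammals-minus-human accession list and return its size."""
    count = 0
    with open(human_path) as human, open(mammal_path) as mammal, \
            open(out_path, "w") as out:
        for accession in subtract_accessions(mammal, human):
            out.write(accession + "\n")
            count += 1
    return count


if __name__ == "__main__":
    demo = subtract_accessions(["AC197214.8\n", "NM_000001.1\n"], ["NM_000001.1\n"])
    print(list(demo))
```

    With the file names from the post, the call would be write_filtered_list("mammalia.acc", "hsa.acc", "mammalian_minus_hsa.acc"). Only the human list is loaded into a set, which matters when the mammalian file has tens of millions of lines.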

    In the end I get some hits for each of my novel miRNAs. Each hit has an ID, and BLAST also reports the start and end positions of the match in both the query and the subject sequence. But this is not what I need: I would like the genomic location of the match in each subject sequence (locus start and locus end of the hit), so that I know where these sequences are located in the other mammals. I was thinking I could get the genomic coordinates of the sequence behind each hit's accession number, and then add or subtract the start and end positions of the match within it. However, I still haven't found a way of getting these genomic coordinates from an ID, so I would appreciate some suggestions.

    I also need these locations because ultimately I want to take each of my novel human miRNAs and, for each best mammalian hit (the best one per organism), blast it back against the human genome and check whether the best hit now falls at the same position as my novel human miRNA. If it does, I can consider them orthologs, and for each novel miRNA I would end up with a FASTA file of all its orthologs for further analysis.
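    The final check described here (does the back-hit land on the known locus?) comes down to an interval overlap test. The sketch below is only an illustration: it assumes query IDs of the form chromosome_start-end followed by a strand character, which is inferred from IDs such as 8_124616757-124616839- and not confirmed in the thread. The names Locus, parse_query_id and is_reciprocal are hypothetical.

```python
import re
from typing import NamedTuple


class Locus(NamedTuple):
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str


# Assumed ID layout: <chrom>_<start>-<end><strand>, e.g. "8_124616757-124616839-"
_ID_PATTERN = re.compile(
    r"^(?P<chrom>[^_]+)_(?P<start>\d+)-(?P<end>\d+)(?P<strand>[+-])$"
)


def parse_query_id(query_id):
    """Turn an ID in the assumed layout into a Locus."""
    match = _ID_PATTERN.match(query_id)
    if match is None:
        raise ValueError(f"unexpected query ID format: {query_id!r}")
    return Locus(match["chrom"], int(match["start"]),
                 int(match["end"]), match["strand"])


def is_reciprocal(novel, back_hit, min_overlap=1):
    """True if the back-hit shares at least min_overlap bases with the locus."""
    if novel.chrom != back_hit.chrom:
        return False
    shared = min(novel.end, back_hit.end) - max(novel.start, back_hit.start) + 1
    return shared >= min_overlap


if __name__ == "__main__":
    novel = parse_query_id("8_124616757-124616839-")
    back_hit = Locus("8", 124616700, 124616800, "-")
    print(is_reciprocal(novel, back_hit))
```

    Chromosome names have to be normalised first if one side says 8 and the other chr8, and whether the strand should also match is a decision this sketch leaves open.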

    I hope it is more or less clear what I intend to do.

    Thank you in advance,

  • #2
    I am not 100% clear on what additional data you are trying to pull out from the blastn output.

    Could you post a sample of the blastn output, and maybe a rough schematic of what you wish it to look like after processing?


    • #3
      This is an example of what my blastn output looks like:

      8_124616757-124616839- gi|146149349|gb|AC197214.8| GCCTCCTTAGCGTAGTAGGTAGCACGTCAGTCTCATAATCTGAAGATTTCAACAACTGAGTGCCTCATTGCTCAAGGAGTGAA 180078 179996 Macaca mulatta 5e-30
      8_124616757-124616839- gi|54908|emb|X04525.1| GCCTCCTTAGCGCAGTAGGTAGCGCGTCAGTCTCATAATCTGAAG 1 45 Mus musculus 3e-12
      8_124616757-124616839- gi|51571999|gb|AC119854.7| GCCTCCTTAGCGCAGTAGGCAGCGCGTCAGTCTCATAATCTGAAGAT 74096 74142 Mus musculus 1e-11
      8_124616757-124616839- gi|37651859|gb|AC101490.8| GCCTCCTTAGCGCAGTAGGCAGCGCGTCAGTCTCATAATCTGAAGAT 223791 223837 Mus musculus 1e-11


      As I understand it, each row corresponds to a hit, so a match was found in each of the accessions shown. Each row gives the matching DNA sequence, followed by the start and end positions of the match within the piece of DNA represented by that accession number. The problem is that I blasted against all accessions for all mammals in the nucleotide database, so an accession may correspond to a gene, a contig or a whole chromosome submitted to NCBI, and sometimes there is no information about the genomic location of the sequence. I just wanted to know, for each hit, where exactly in that mammal's genome it lies in terms of genomic coordinates, but I don't think BLAST keeps track of this information, so there is no direct way of getting what I want from the BLAST output. Anyway, I ended up retrieving whatever information was available for each hit via its accession (esearch | efetch), and since some of them, as I said, contain no information about chromosomal location, I gave up on that.
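      For reference, rows like the ones above can be parsed and reduced to one best hit per organism with a few lines of Python. This is a sketch under stated assumptions: the real -outfmt 6 file is tab-separated (the spaces in the pasted sample come from the forum), the columns follow the order given in the blastn command (qacc sseqid sseq sstart send sscinames evalue), and "best" is taken to mean lowest e-value. Hit, parse_hit and best_hit_per_organism are hypothetical names.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    accession: str
    sseq: str
    sstart: int
    send: int
    organism: str
    evalue: float


def accession_from_sseqid(sseqid):
    """'gi|146149349|gb|AC197214.8|' -> 'AC197214.8'; other forms are returned as-is."""
    parts = sseqid.split("|")
    if len(parts) >= 4 and parts[0] == "gi":
        return parts[3]
    return sseqid


def parse_hit(line):
    """Parse one tab-separated row in the column order used in the post."""
    query, sseqid, sseq, sstart, send, organism, evalue = line.rstrip("\n").split("\t")
    return Hit(query, accession_from_sseqid(sseqid), sseq,
               int(sstart), int(send), organism, float(evalue))


def best_hit_per_organism(hits):
    """Keep the lowest e-value hit per organism; the first hit wins a tie."""
    best = {}
    for hit in hits:
        current = best.get(hit.organism)
        if current is None or hit.evalue < current.evalue:
            best[hit.organism] = hit
    return best


if __name__ == "__main__":
    row = "q1\tgi|54908|emb|X04525.1|\tGCCTCC\t1\t45\tMus musculus\t3e-12"
    print(best_hit_per_organism([parse_hit(row)]))
```

      Note that in the first sample row sstart (180078) is larger than send (179996), which is how BLAST reports a match on the minus strand of the subject.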
      Last edited by sombrajo; 03-10-2018, 04:14 AM.


      • #4
        My 5 cents:

        1. It is better to have dedicated, filtered BLAST database(s) than to work with the complete nt/nr database, especially if you are searching against a small subset of sequences: filter the input nt.fasta file to match your criteria (like genus/species name) and run formatdb/makeblastdb on it.

        2. If you want to be able to extract the coordinates of a hit locus in the genome of your choice, then you either have to blast your sequences against a selected version of the complete genome sequence (like whole human or mouse chromosomes), or pre-process the input FASTA file used to create the database and add the chromosome ID and start/stop to each of the FASTA IDs in the BLAST db.
        EX:

        >[NCBI fasta header] chr=chr1 start=10000000 stop=10020000

        Then this would give you the global coordinates of the subject in your genome of choice.

        PS: If blast+ does not like = signs in the FASTA header, then use : or similar to separate the variable from the value.
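        A minimal sketch of how the annotated header could be turned into global coordinates for a hit. It assumes that start is the 1-based chromosome position of the first base of the subject sequence, that the subject lies on the forward strand of the chromosome, and that BLAST's sstart/send are 1-based positions within the subject. The function names are hypothetical; both = and : separators are accepted.

```python
import re

# Matches chr=chr1, start=10000000, stop=10020000 (or the same with ':').
_TAG = re.compile(r"\b(chr|start|stop)[=:](\S+)")


def parse_locus_tags(header):
    """Extract (chrom, start, stop) from an annotated FASTA header."""
    tags = dict(_TAG.findall(header))
    missing = {"chr", "start", "stop"} - tags.keys()
    if missing:
        raise ValueError(f"header lacks tags: {sorted(missing)}")
    return tags["chr"], int(tags["start"]), int(tags["stop"])


def to_genomic(header, sstart, send):
    """Convert subject-local hit coordinates to chromosome coordinates.

    Returns (chrom, start, end, strand) with start <= end; the strand is '-'
    when BLAST reported sstart > send.
    """
    chrom, locus_start, locus_stop = parse_locus_tags(header)
    strand = "+" if sstart <= send else "-"
    low, high = sorted((sstart, send))
    genomic_start = locus_start + low - 1
    genomic_end = locus_start + high - 1
    if genomic_end > locus_stop:
        raise ValueError("hit extends past the annotated stop position")
    return chrom, genomic_start, genomic_end, strand


if __name__ == "__main__":
    header = ">seq1 chr=chr1 start=10000000 stop=10020000"
    print(to_genomic(header, 101, 150))
```

        One caveat: text after the first space in a FASTA header normally ends up in the subject title rather than the subject ID, so the tags would have to be read from the stitle column of the tabular output, or be embedded in the ID itself.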
