  • Get chromosome number from BLAST results

    Hi everyone,

    I am a beginner in bioinformatics, could anyone tell me how to extract the chromosome number of each BLAST hit for a bunch of query sequences? I looked at the "hit table" but only found the start and end loci of each hit, and I knew there's chr info in standard BLAST report (txt format), it's like below, hard to manipulate on a large scale:

    Query=
    Length=59


    Score E
    Sequences producing significant alignments: (Bits) Value

    ref|NT_039716.7| Mus musculus strain C57BL/6J chromosome X ge... 93.5 2e-17
    ref|NW_001035178.1| Mus musculus strain mixed chromosome X ge... 93.5 2e-17

    ALIGNMENTS
    >ref|NT_039716.7| Mus musculus strain C57BL/6J chromosome X genomic contig, MGSCv37
    C57BL/6J
    Length=15097629

    Score = 93.5 bits (50), Expect = 2e-17
    Identities = 57/60 (95%), Gaps = 1/60 (2%)
    Strand=Plus/Plus

    Query 1 TACC-CTGTAGGGTTGATAAGCTTATGTTCACTATAACAATTAACACATTTGCCATTGAC 59
    |||| ||||| | |||||||||||||||||||||||||||||||||||||||||||||||
    Sbjct 1486960 TACCTCTGTAAGATTGATAAGCTTATGTTCACTATAACAATTAACACATTTGCCATTGAC 1487019


    >ref|NW_001035178.1| Mus musculus strain mixed chromosome X genomic scaffold, alternate
    assembly Mm_Celera 232000009784844, whole genome shotgun
    sequence
    Length=8237495

    Features flanking this part of subject sequence:
    323432 bp at 5' side: uncharacterized protein LOC211208
    337262 bp at 3' side: uncharacterized protein LOC73934

    Score = 93.5 bits (50), Expect = 2e-17
    Identities = 57/60 (95%), Gaps = 1/60 (2%)
    Strand=Plus/Plus

    Query 1 TACC-CTGTAGGGTTGATAAGCTTATGTTCACTATAACAATTAACACATTTGCCATTGAC 59
    |||| ||||| | |||||||||||||||||||||||||||||||||||||||||||||||
    Sbjct 577313 TACCTCTGTAAGATTGATAAGCTTATGTTCACTATAACAATTAACACATTTGCCATTGAC 577372

    I am wondering whether the ASN file would contain such information (it does not seem to be human-readable). If not, the only way I can think of is to extract the chromosome number from the standard report with Perl or grep (another problem is that I don't know how to write Perl scripts)... Thanks a lot!

  • #2
    BLAST knows nothing about chromosomes, so if they appear in a BLAST report it is because your BLAST database happened to be built from a FASTA file with chromosome info in its deflines (the line above each sequence that begins with ">").
    I think your choices are going to be either:
    extract it from the report
    (which can be messy, as there are many conventions for writing deflines)

    or find/make BLAST databases that are already split per chromosome, so that any hit is
    to the chromosome you are blasting against.
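
For the first option (extracting from the report), a minimal Python sketch follows. It assumes the deflines spell the chromosome as "chromosome X" or "chr7", as in the report above; the pattern and the function name `chromosome_from_defline` are illustrative assumptions and will likely need adjusting for databases that write deflines differently.

```python
import re

# Assumed defline conventions: "chromosome X", "chromosome 12", "chr7", "chr_Y".
# Other databases may label chromosomes differently, so treat this as a start.
CHROM_RE = re.compile(r"\b(?:chromosome|chr)[ _]?([0-9]{1,2}|X|Y|MT)\b",
                      re.IGNORECASE)


def chromosome_from_defline(defline):
    """Return the chromosome label found in a defline, or None if absent."""
    match = CHROM_RE.search(defline)
    return match.group(1).upper() if match else None


if __name__ == "__main__":
    # The two deflines from the report in the first post.
    deflines = [
        ">ref|NT_039716.7| Mus musculus strain C57BL/6J chromosome X genomic contig, MGSCv37",
        ">ref|NW_001035178.1| Mus musculus strain mixed chromosome X genomic scaffold, alternate",
    ]
    for defline in deflines:
        accession = defline.split("|")[1]
        print(accession, chromosome_from_defline(defline))
```

Returning None for deflines without a recognisable label makes it easy to list the hits that need a closer look by hand.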



    • #3
      Tomc, thank you. Making such a database seems too challenging for me....



      • #4
        Update.
        I just got the answer from a helpful NCBI staff member:
        Choose NCBI genome (chromosome) as the database for web megablast and specify the organism (here I used the mouse genome). You then get a lot of NC_ accessions (plus NT_, NW_, etc.) in the hit table. An NC_ accession is a complete chromosome: NC_000067 stands for chr1, NC_000085 for chr19, and NC_000086/NC_000087 for X/Y (I've left off the current .version number for these accessions). You can also get the coordinates of the alignment on the corresponding chromosome from columns 9 and 10. For the hit table format, click here (http://www.ornl.gov/sci/techresource...me/blast.shtml).
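
The hit-table approach above can be sketched in Python as follows. This assumes a tab-separated hit table with the subject ID in column 2 and the subject start/end in columns 9 and 10. The accession map is partly an assumption: only NC_000067 (chr1), NC_000085 (chr19) and NC_000086/NC_000087 (X/Y) come from this post, and numbering chr2 to chr18 consecutively in between should be checked against NCBI. The function name `hits_with_chromosome` is made up for illustration.

```python
import re

# Mouse accession-to-chromosome map. Stated in the post: NC_000067 = chr1,
# NC_000085 = chr19, NC_000086 = X, NC_000087 = Y. Filling in chr2..chr18
# consecutively is an ASSUMPTION; verify it before relying on it.
MOUSE_CHROMS = {f"NC_{66 + n:06d}": str(n) for n in range(1, 20)}
MOUSE_CHROMS.update({"NC_000086": "X", "NC_000087": "Y"})

# Captures the accession without its ".version" suffix.
ACCESSION_RE = re.compile(r"(NC_\d+)")


def hits_with_chromosome(lines):
    """Yield (query, chromosome, subject_start, subject_end) per NC_ hit."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # not a full hit-table row
        match = ACCESSION_RE.search(fields[1])
        if match is None:
            continue  # NT_/NW_ contigs and scaffolds are skipped
        chrom = MOUSE_CHROMS.get(match.group(1))
        if chrom is not None:
            # Columns 9 and 10 (1-based) are the subject start and end.
            yield fields[0], chrom, int(fields[8]), int(fields[9])


if __name__ == "__main__":
    # Made-up row for illustration only, not real BLAST output.
    row = "query1\tref|NC_000086.7|\t95.00\t60\t2\t1\t1\t59\t1000\t1059\t2e-17\t93.5"
    for hit in hits_with_chromosome([row]):
        print(hit)
```

For a real run, pass an open file handle of the downloaded hit table instead of the one-row list.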
        Last edited by logicthief; 04-11-2012, 05:02 PM.



        • #5
          Use the Bio::SearchIO module of BioPerl.



          • #6
            Originally posted by Growlywolf:
            use the module Bio::SearchIO of Bioperl

            http://bioperl.open-bio.org/wiki/HOWTO:SearchIO
            Thanks, growlywolf. It seems very powerful (although not very straightforward for my purpose); I will try it later.
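
For readers who do not write Perl, here is a plain-Python stand-in for the query-by-query, hit-by-hit walk that a report parser provides. It is not the Bio::SearchIO API, only a rough sketch that assumes the text layout shown in the first post: a "Query=" line per query, and each hit's defline starting with ">" and ending just before its "Length=" line. The function name `iter_hits` is made up, and wrapped query names are not handled.

```python
def iter_hits(lines):
    """Yield (query, defline) for each '>' record in a text BLAST report."""
    query = None
    parts = None  # pieces of the defline being assembled, or None
    for raw in lines:
        line = raw.strip()
        if line.startswith("Query="):
            query = line[len("Query="):].strip()
        elif line.startswith(">"):
            parts = [line[1:]]  # start of a new hit defline
        elif parts is not None:
            if line.startswith("Length="):
                # The subject "Length=" line closes the (possibly wrapped) defline.
                yield query, " ".join(parts)
                parts = None
            elif line:
                parts.append(line)  # continuation of a wrapped defline


if __name__ == "__main__":
    # Excerpt of the report from the first post; "q1" is a placeholder name.
    report = """\
Query= q1
Length=59

>ref|NT_039716.7| Mus musculus strain C57BL/6J chromosome X genomic contig, MGSCv37
C57BL/6J
Length=15097629

>ref|NW_001035178.1| Mus musculus strain mixed chromosome X genomic scaffold, alternate
assembly Mm_Celera 232000009784844, whole genome shotgun
sequence
Length=8237495
"""
    for query, defline in iter_hits(report.splitlines()):
        print(query, "|", defline)
```

Joining the wrapped lines first means the full defline is available for whatever chromosome pattern suits the database.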
