Hi everyone,
I am a beginner in bioinformatics. Could anyone tell me how to extract the chromosome number of each BLAST hit for a batch of query sequences? I looked at the "hit table" but found only the start and end positions of each hit. I know the chromosome information is in the standard BLAST report (text format), shown below, but that format is hard to process on a large scale:
Query=
Length=59
Score E
Sequences producing significant alignments: (Bits) Value
ref|NT_039716.7| Mus musculus strain C57BL/6J chromosome X ge... 93.5 2e-17
ref|NW_001035178.1| Mus musculus strain mixed chromosome X ge... 93.5 2e-17
ALIGNMENTS
>ref|NT_039716.7| Mus musculus strain C57BL/6J chromosome X genomic contig, MGSCv37
C57BL/6J
Length=15097629
Score = 93.5 bits (50), Expect = 2e-17
Identities = 57/60 (95%), Gaps = 1/60 (2%)
Strand=Plus/Plus
Query 1 TACC-CTGTAGGGTTGATAAGCTTATGTTCACTATAACAATTAACACATTTGCCATTGAC 59
|||| ||||| | |||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 1486960 TACCTCTGTAAGATTGATAAGCTTATGTTCACTATAACAATTAACACATTTGCCATTGAC 1487019
>ref|NW_001035178.1| Mus musculus strain mixed chromosome X genomic scaffold, alternate
assembly Mm_Celera 232000009784844, whole genome shotgun
sequence
Length=8237495
Features flanking this part of subject sequence:
323432 bp at 5' side: uncharacterized protein LOC211208
337262 bp at 3' side: uncharacterized protein LOC73934
Score = 93.5 bits (50), Expect = 2e-17
Identities = 57/60 (95%), Gaps = 1/60 (2%)
Strand=Plus/Plus
Query 1 TACC-CTGTAGGGTTGATAAGCTTATGTTCACTATAACAATTAACACATTTGCCATTGAC 59
|||| ||||| | |||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 577313 TACCTCTGTAAGATTGATAAGCTTATGTTCACTATAACAATTAACACATTTGCCATTGAC 577372
I am wondering whether the ASN file contains this information (it does not seem to be human-readable). If not, the only way I can think of is to extract the chromosome number from the standard report with Perl or grep (another problem is that I don't know how to write Perl scripts). Thanks a lot!
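To make the idea of pulling the chromosome out of the standard report concrete, here is a minimal sketch in Python rather than Perl. It assumes the report looks like the excerpt above: each subject starts with a ">" line, its description may wrap over several lines, and the description ends at the "Length=" line. The function name `extract_chromosomes` and the pattern `CHROM_RE` are made up for illustration, and the sketch only finds a chromosome when the description literally contains the word "chromosome" followed by a label.

```python
import re

# Assumption: the chromosome label is the word right after "chromosome"
# in the subject description, as in "... chromosome X genomic contig".
CHROM_RE = re.compile(r"chromosome\s+(\w+)")


def extract_chromosomes(report_text):
    """Return a list of (query, subject_id, chromosome) tuples.

    chromosome is None when the description does not mention one.
    """
    results = []
    query = ""
    defline = None  # description lines of the subject currently being read
    for line in report_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Query="):
            # Start of a new query block; alignment rows begin with
            # "Query" followed by a space, so they do not match here.
            query = stripped[len("Query="):].strip()
        elif stripped.startswith(">"):
            # Start of a subject description
            defline = [stripped[1:].strip()]
        elif defline is not None:
            if stripped.startswith("Length="):
                # "Length=" closes the (possibly wrapped) description
                description = " ".join(defline)
                parts = description.split()
                subject_id = parts[0] if parts else ""
                match = CHROM_RE.search(description)
                chrom = match.group(1) if match else None
                results.append((query, subject_id, chrom))
                defline = None
            else:
                defline.append(stripped)
    return results
```

With the report saved to a file, something like `extract_chromosomes(open("report.txt").read())` would give one tuple per subject sequence; the file name is only an example. A subject with several alignments is reported once, because the chromosome is read from the description and not from the individual alignments.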