  • How should I parse this blast output?

    Hi.

    So I have RNA-seq data from a bacterium. I did a de novo assembly of the data to create a set of contigs. I then made a BLAST database from the contigs, and another BLAST database from the reference genome features as downloaded from NCBI. I blasted the contigs against the features database, and then I blasted the features against the contigs database. I produced tab format and XML format files, in addition to the standard BLAST output.

    What I want is to get to a point where I can say, "This feature in the reference genome is the same as this region in this contig."

    Is there a simple way to parse this data such that I can find the following information:
    • Contig ID
    • Reference Feature ID
    • Contig hit position
    • Feature hit position


    I want this for all hits with 99 or 100% identity, since this is the same genome. I also only want the best hit when multiple hits exist.


    This is my first time working this deeply with BLAST data and so I'm not familiar with what kind of tools are out there or the best approach to finding what I'm looking for with some level of confidence. I've been reading the BioPython manual and if nothing exists that can do this I expect I will have to write a parser to do it, but I still need to figure out how I can identify what I'm looking for. I appreciate any help anyone can offer in regard to what tools or scripts I might use to do this and how I can identify what I'm looking for. Thanks very much in advance.



    Here are some segments of my output in case it helps (this is tab format; I have the other formats as well):

    Contig blast against features.
    TR|845|c4_g7_i2| NC_004459.3_gene_3 99.76 1665 4 0 4431 6095 1665 1 0.0 3053
    TR|845|c4_g7_i2| NC_004459.3_gene_3035 99.27 1509 11 0 703 2211 1509 1 0.0 2726
    TR|845|c4_g7_i2| NC_004459.3_gene_1 99.10 1002 9 0 2252 3253 1002 1 0.0 1801
    TR|845|c4_g7_i2| NC_004459.3_gene_2 98.98 981 10 0 3382 4362 981 1 0.0 1757
    TR|845|c4_g7_i2| NC_004459.3_gene_3034 100.00 211 0 0 1 211 1338 1548 4e-108 390
    TR|845|c4_g3_i1| NC_004459.3_gene_6 100.00 413 0 0 1 413 2536 2124 0.0 763
    TR|845|c4_g2_i1| NC_004459.3_gene_6 100.00 355 0 0 1 355 2106 1752 0.0 656
    TR|845|c4_g8_i1| NC_004459.3_gene_6 100.00 404 0 0 1 404 1482 1079 0.0 747
    TR|845|c4_g5_i1| NC_004459.3_gene_13 100.00 1332 0 0 6364 7695 1 1332 0.0 2460
    TR|845|c4_g5_i1| NC_004459.3_gene_6 100.00 1082 0 0 1 1082 1082 1 0.0 1999
    Features blast against contigs.
    lcl|NC_004459.3_gene_1 tr|845|c4_g7_i2 99.10 1002 9 0 1 1002 2745 1744 0.0 1801
    lcl|NC_004459.3_gene_1 tr|845|c4_g7_i2 99.10 1002 9 0 1 1002 3253 2252 0.0 1801
    lcl|NC_004459.3_gene_2 tr|845|c4_g7_i2 98.98 981 10 0 1 981 3854 2874 0.0 1757
    lcl|NC_004459.3_gene_2 tr|845|c4_g7_i2 98.98 981 10 0 1 981 4362 3382 0.0 1757
    lcl|NC_004459.3_gene_3 tr|845|c4_g7_i2 99.76 1665 4 0 1 1665 6095 4431 0.0 3053
    lcl|NC_004459.3_gene_3 tr|845|c4_g7_i2 98.80 753 9 0 913 1665 4675 3923 0.0 1341
    lcl|NC_004459.3_gene_6 tr|845|c4_g7_i2 100.00 1082 0 0 1 1082 1082 1 0.0 1999
    lcl|NC_004459.3_gene_6 tr|845|c4_g7_i2 100.00 413 0 0 2124 2536 413 1 0.0 763
    lcl|NC_004459.3_gene_6 tr|845|c4_g7_i2 100.00 404 0 0 1079 1482 404 1 0.0 747
    lcl|NC_004459.3_gene_6 tr|845|c4_g7_i2 100.00 355 0 0 1752 2106 355 1 0.0 656
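
    For tab format output like the above, a minimal Python sketch of the filtering described in the question could look like this. It assumes the file uses the 12 default tabular columns (query ID, subject ID, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score), which matches the sample rows. The names `parse_tabular` and `best_hits` are made up for illustration, and "best hit" is taken to mean the highest bit score per reference feature, which is only one reasonable reading.

    ```python
    from collections import namedtuple

    # Assumed column layout: the 12 default tabular BLAST columns.
    FIELDS = ("qseqid sseqid pident length mismatch gapopen "
              "qstart qend sstart send evalue bitscore").split()
    Hit = namedtuple("Hit", FIELDS)


    def parse_tabular(lines):
        """Yield one Hit per data line; blank lines and '#' comments are skipped."""
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            cols = line.split()  # works for tab- or space-separated columns
            if len(cols) != len(FIELDS):
                raise ValueError("expected 12 columns, got %d: %r" % (len(cols), line))
            yield Hit(cols[0], cols[1], float(cols[2]),
                      *map(int, cols[3:10]),
                      float(cols[10]), float(cols[11]))


    def best_hits(hits, min_identity=99.0, key=lambda h: h.sseqid):
        """Keep the highest-bitscore hit per key among hits passing the identity cutoff.

        With the default key (subject ID) and a contigs-vs-features search,
        this gives one best contig region per reference feature.
        """
        best = {}
        for hit in hits:
            if hit.pident < min_identity:
                continue
            k = key(hit)
            if k not in best or hit.bitscore > best[k].bitscore:
                best[k] = hit
        return best


    def report(best):
        """Return (contig ID, feature ID, contig start-end, feature start-end) rows."""
        return [(h.qseqid, h.sseqid, (h.qstart, h.qend), (h.sstart, h.send))
                for h in best.values()]
    ```

    Usage would be along the lines of `report(best_hits(parse_tabular(open("contigs_vs_features.tab"))))`, where the file name is a placeholder. For the features-vs-contigs file the query and subject roles are swapped, so `key=lambda h: h.qseqid` would group by feature instead.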

  • #2
    Here is my solution...

    So I ended up writing a script that identifies reciprocal best hits that are also uniquely mapped and meet some basic quality criteria, such as an alignment length >200. In case anyone in the future needs this kind of tool, here is a link.

    blastContigSelector.py.
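
    For readers who cannot reach the script, the general idea of a reciprocal best hit check can be sketched as below. This is not the linked script, only an illustration under stated assumptions: each hit is reduced to a hypothetical `(query, subject, length, bitscore)` tuple, "best" means the highest bit score per query, the length cutoff is the >200 mentioned above, and the uniqueness check is left out. The `normalize` helper is also an assumption, based on the sample output earlier in the thread, where the same sequence shows up as `TR|845|c4_g7_i2|` in one search and `tr|845|c4_g7_i2` in the other, and features carry an `lcl|` prefix in only one direction.

    ```python
    def normalize(seq_id):
        """Assumed ID cleanup: drop a leading 'lcl|', a trailing '|', and case."""
        seq_id = seq_id.strip()
        if seq_id.lower().startswith("lcl|"):
            seq_id = seq_id[4:]
        return seq_id.rstrip("|").lower()


    def top_hit_per_query(hits, min_length=200):
        """Map each query to the subject of its highest-bitscore hit.

        hits: iterable of (query, subject, length, bitscore) tuples.
        Hits with alignment length <= min_length are ignored.
        """
        best = {}
        for query, subject, length, bitscore in hits:
            if length <= min_length:
                continue
            q, s = normalize(query), normalize(subject)
            if q not in best or bitscore > best[q][1]:
                best[q] = (s, bitscore)
        return {q: s for q, (s, _) in best.items()}


    def reciprocal_best_hits(contigs_vs_features, features_vs_contigs, min_length=200):
        """Return sorted (contig, feature) pairs that are each other's top hit."""
        forward = top_hit_per_query(contigs_vs_features, min_length)
        reverse = top_hit_per_query(features_vs_contigs, min_length)
        return sorted((contig, feature)
                      for contig, feature in forward.items()
                      if reverse.get(feature) == contig)
    ```

    Note that the returned IDs are in normalized (lowercase) form. One design consequence worth keeping in mind: because only the single top hit per query is kept, a contig that spans several genes will pair with at most one of them, so this is stricter than the per-feature best hit asked for in the original question.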
