  • parsing the gff file

    Hi all,

    I have a gff file like:

    gn|nvit|C3905550 assmcg CDS 68 646 . - . asmbl_401
    gn|nvit|C3905550 assmcg exon 68 646 . - . asmbl_401
    gn|nvit|C3918365 assmcg CDS 42 252 . + . asmbl_443
    gn|nvit|C3918365 assmcg CDS 522 705 . + . asmbl_443
    gn|nvit|C3918365 assmcg exon 522 705 . + . asmbl_443
    gn|nvit|C3930535 assmcg exon 64 888 . - . asmbl_465
    gn|nvit|C3930535 assmcg three_prime_utr 64 393 . - . asmbl_465
    gn|nvit|C3930535 assmcg CDS 394 699 . - . asmbl_465
    gn|nvit|C3930535 assmcg five_prime_utr 700 888 . - . asmbl_465
    gn|nvit|C3935122 assmcg exon 4 567 . + . asmbl_476
    gn|nvit|C3938828 assmcg CDS 293 745 . + . asmbl_481
    gn|nvit|C3938828 assmcg exon 293 745 . + . asmbl_481
    gn|nvit|C3942486 assmcg CDS 244 942 . - . asmbl_489
    gn|nvit|C3942486 assmcg exon 244 942 . - . asmbl_489
    gn|nvit|C3950921 assmcg exon 40 80 . + . asmbl_506
    gn|nvit|C3950921 assmcg three_prime_utr 40 80 . + . asmbl_506
    gn|nvit|C3950921 assmcg exon 172 253 . + . asmbl_506
    gn|nvit|C3950921 assmcg five_prime_utr 172 190 . + . asmbl_506


    I want to create a list of all the transcript IDs that have both three_prime_utr and five_prime_utr coordinates, like:
    asmbl_465
    asmbl_506

    I used cat final1.gff | perl -ne 's/.*\t(\S+_prime_utr)\t.*transcript_id \"(\S+)\".*/$2\t$1/; print;' | sort -u | perl -ne 'split; print "$_[0]\n" if ($g eq $_[0]); $g = $_[0];' > myutr_list.txt
    but it did not work for me.

    Thanks!
    Last edited by Shishir; 09-06-2013, 03:48 AM.

  • #2
    Hi- Try this one:

    Code:
    grep -E 'five_prime_utr|three_prime_utr' final1.gff \
        | cut -f 3,9 \
        | sort -k2,2 -k 1,1 -u \
        | cut -f 2 \
        | uniq -c \
        | awk '{if($1 == 2) print $2}'
    1st line: Get the lines with either UTR feature
    2nd: Keep the columns with the feature type (column 3) and the ID (column 9)
    3rd: Get unique lines (now each ID has one line if it has a 3' UTR or a 5' UTR, two lines if it has both)
    4th: Keep only the ID column
    5th: Count how many times each ID is found
    6th: If it is found twice, it must have both UTRs, so print it

    See if it works...

    Dario
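
The same idea can also be sketched as a single awk pass, which skips the sort/uniq counting. This is only an illustration under two assumptions: the file is tab-delimited, and column 9 holds nothing but the ID (as in the sample lines above). The function name `both_utrs` is made up for this example.

```shell
# Print every ID (column 9) that has both a five_prime_utr and a
# three_prime_utr feature (column 3).
# Assumptions: tab-delimited GFF; column 9 contains only the ID.
both_utrs() {
    awk -F'\t' '
        $3 == "five_prime_utr"  { five[$9] = 1 }   # remember IDs with a 5-prime UTR
        $3 == "three_prime_utr" { three[$9] = 1 }  # remember IDs with a 3-prime UTR
        END {
            # report only the IDs recorded in both arrays
            for (id in five)
                if (id in three) print id
        }
    ' "$1" | sort   # awk array order is unspecified, so sort for stable output
}
```

It would be called as `both_utrs final1.gff > myutr_list.txt`. If column 9 carries extra attributes in the real file, the ID would have to be extracted from it first.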

    • #3
      Many thanks! It worked for me.
