This seems to be a trivial question. Does every transcript have one 5'UTR and one 3'UTR? I think the answer is yes, but I stumbled upon some annotation files that have more 5'UTRs than 3'UTRs, and this really confuses me.
Here is how I encountered the problem. I'm trying to write a Perl script to parse the mouse refSeq file and give me promoters, introns, coding sequences, 5'UTRs and 3'UTRs. I have posted my code below. If you look at the code, I basically get one 5'UTR and one 3'UTR for each entry in the refSeq file. That's how I got an equal number of 5'UTRs and 3'UTRs.
I made the older files a while ago and don't remember how I generated them, but the old 5'UTR file has more entries than the old 3'UTR file, and that's what made me start to wonder why this is the case.
I would really appreciate it if you could explain whether my thinking or my code is wrong. Any comments on how to improve this code would also be greatly appreciated! I find my coding skills are improving very slowly.
Code:
#!/usr/bin/perl -w
use strict;

my $usage = "
This script takes a refGene file in the default format and outputs several genomic feature files.
Usage: perl genomic_feature.pl <refGene> <promoter> <cds> <intron> <utr5> <utr3> <wholegene>
";
die $usage unless @ARGV == 7;

my ($input, $prom, $cds, $intron, $utr5, $utr3, $wholegene) = @ARGV;
open (IN,  $input)        || die "cannot open $input: $!";
open (PM,  ">$prom")      || die "cannot open $prom: $!";
open (CD,  ">$cds")       || die "cannot open $cds: $!";
open (IR,  ">$intron")    || die "cannot open $intron: $!";
open (UT5, ">$utr5")      || die "cannot open $utr5: $!";
open (UT3, ">$utr3")      || die "cannot open $utr3: $!";
open (WG,  ">$wholegene") || die "cannot open $wholegene: $!";

<IN>;    # discard the first line (header)
while (<IN>) {
    chomp;
    my @array    = split /\t/;
    my $refid    = $array[1];
    my $chr      = $array[2];
    my $strand   = $array[3];
    my $txstart  = $array[4];
    my $txend    = $array[5];
    my $cdsstart = $array[6];
    my $cdsend   = $array[7];
    my $exstart  = $array[9];
    my $exend    = $array[10];
    my $gensym   = $array[12];
    my ($promstart, $promend, $prombound);
    my ($utr5start, $utr5end, $utr3start, $utr3end);

    if ($strand eq '+') {
        $promstart = $txstart - 2000;
        $promend   = $txstart;
        $prombound = $promstart;
        $utr5start = $txstart;
        $utr5end   = $cdsstart;
        $utr3start = $cdsend;
        $utr3end   = $txend;
    }
    else {
        $promstart = $txend;
        $promend   = $txend + 2000;
        $prombound = $promend;
        $utr5start = $cdsend;
        $utr5end   = $txend;
        $utr3start = $txstart;
        $utr3end   = $cdsstart;
    }

    ## print whole gene bed6 format
    my $geneinfo = $refid.'_wholegene_'.$chr.'_'.$strand;
    print WG "$chr\t$txstart\t$txend\t$geneinfo\t0\t$strand\n";

    # print promoters
    my $prominfo = $refid.'_up_2000_'.$chr.'_'.$prombound.'_'.$strand;
    print PM "$chr\t$promstart\t$promend\t$prominfo\t1\t$strand\n";

    # print 5'UTR
    my $utr5info = $refid.'_utr5_'.$chr.'_'.$utr5start.'_'.$strand;
    print UT5 "$chr\t$utr5start\t$utr5end\t$utr5info\t2\t$strand\n";

    # print 3'UTR
    my $utr3info = $refid.'_utr3_'.$chr.'_'.$utr3start.'_'.$strand;
    print UT3 "$chr\t$utr3start\t$utr3end\t$utr3info\t5\t$strand\n";

    # print coding sequences
    my @exonst = split(/,/, $exstart);
    my @exoned = split(/,/, $exend);

    # if there is only one coding sequence
    my $cdsinfo1st = $refid.'_cds_1_'.$chr.'_'.$cdsstart.'_'.$strand;
    print CD "$chr\t$cdsstart\t$exoned[0]\t$cdsinfo1st\t3\t$strand\n";    # score was garbled as "3\4"

    my $cdsinfolast   = $refid.'_cds_'.scalar(@exonst).'_'.$chr.'_'.$exonst[-1].'_'.$strand;
    my $introninfo1st = $refid.'_intron_1_'.$chr.'_'.$exoned[0].'_'.$strand;

    # if there are 2 coding sequences
    if (scalar(@exonst) == 2) {
        print CD "$chr\t$exonst[-1]\t$cdsend\t$cdsinfolast\t3\t$strand\n";
        print IR "$chr\t$exoned[0]\t$exonst[1]\t$introninfo1st\t4\t$strand\n";
    }
    # if there are more than 2 coding sequences
    elsif (scalar(@exonst) > 2) {
        print IR "$chr\t$exoned[0]\t$exonst[1]\t$introninfo1st\t4\t$strand\n";
        for (my $i = 2; $i <= $#exonst; $i++) {
            my $cdsinfo = $refid.'_cds_'.$i.'_'.$chr.'_'.$exonst[$i-1].'_'.$strand;
            print CD "$chr\t$exonst[$i-1]\t$exoned[$i-1]\t$cdsinfo\t3\t$strand\n";
            my $introninfo = $refid.'_intron_'.$i.'_'.$chr.'_'.$exoned[$i-1].'_'.$strand;
            print IR "$chr\t$exoned[$i-1]\t$exonst[$i]\t$introninfo\t4\t$strand\n";
        }
        print CD "$chr\t$exonst[-1]\t$cdsend\t$cdsinfolast\t3\t$strand\n";
    }
}

close IN;
close PM;
close CD;
close IR;
close UT5;
close UT3;
close WG;
-
Originally posted by Bukowski:
No, different transcription start sites will lead to different 5'UTRs etc. But I admire your line of thought that made you assume the annotators of genes were incorrect.
Last edited by HESmith; 03-02-2013, 08:30 AM.
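To make the point about transcription start sites concrete, here is a minimal Perl sketch that counts distinct 5' and 3' UTR intervals across transcripts. It is only an illustration: the three records are made up, the sub name `count_distinct_utrs` and the hash field names are hypothetical, and it assumes refGene-style coordinates in which non-coding entries are recorded with cdsStart equal to cdsEnd.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count distinct 5' and 3' UTR intervals in a list of transcripts.
# Each transcript is a hash ref with chr, strand, txStart, txEnd,
# cdsStart and cdsEnd (field names chosen for this sketch).
sub count_distinct_utrs {
    my @transcripts = @_;
    my (%utr5, %utr3);
    for my $t (@transcripts) {
        # Assumption: non-coding entries have cdsStart == cdsEnd and no UTRs.
        next if $t->{cdsStart} == $t->{cdsEnd};
        my @left  = ($t->{txStart}, $t->{cdsStart});    # genomic left of the CDS
        my @right = ($t->{cdsEnd},  $t->{txEnd});       # genomic right of the CDS
        # On the minus strand the genomic right-hand side is the 5' end.
        my ($five, $three) = $t->{strand} eq '+' ? (\@left, \@right)
                                                 : (\@right, \@left);
        # Ignore zero-length UTRs (CDS reaches the transcript boundary).
        if ($five->[0] < $five->[1]) {
            $utr5{ join("\t", $t->{chr}, @$five, $t->{strand}) } = 1;
        }
        if ($three->[0] < $three->[1]) {
            $utr3{ join("\t", $t->{chr}, @$three, $t->{strand}) } = 1;
        }
    }
    return (scalar(keys %utr5), scalar(keys %utr3));
}

# Made-up example: two isoforms share a 3' end but start at different
# sites, and a third entry is non-coding.
my @demo = (
    { chr => 'chr1', strand => '+', txStart => 100, txEnd => 900, cdsStart => 200, cdsEnd => 800 },
    { chr => 'chr1', strand => '+', txStart => 150, txEnd => 900, cdsStart => 200, cdsEnd => 800 },
    { chr => 'chr1', strand => '+', txStart => 100, txEnd => 900, cdsStart => 900, cdsEnd => 900 },
);
my ($n5, $n3) = count_distinct_utrs(@demo);
print "distinct 5'UTRs: $n5, distinct 3'UTRs: $n3\n";
```

With these made-up records the two coding isoforms give two distinct 5'UTRs but only one distinct 3'UTR, so a file that was de-duplicated by coordinates could plausibly end up with more 5'UTR entries than 3'UTR entries.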
-
Having also tried this recently, I found a few problems with defining a set of 5' and 3' UTRs (I was actually only looking at 3' UTRs).
One problem is that, in theory, the coding sequence of a transcript should terminate in the final exon of the transcript; otherwise the transcript will be removed by nonsense-mediated decay. However, there are a number of annotated transcripts where this is not the case. I chose to remove these from my analysis, but if you keep them you need to either handle spliced 3' UTRs or cover the whole region, including the non-coding intron.
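As a rough illustration of both options, the Perl sketch below clips each exon against the region past the end of the CDS to get spliced 3' UTR blocks, and separately checks whether the CDS ends in the final exon so such transcripts can be filtered out. The sub names `utr3_blocks` and `cds_ends_in_last_exon` and the example coordinates are hypothetical. It assumes refGene-style half-open coordinates, exons sorted by genomic position, and cdsStart equal to cdsEnd for non-coding entries.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max min);

# Return the 3' UTR as a list of [start, end] blocks, one per exon
# (or exon part) that lies past the end of the CDS.
sub utr3_blocks {
    my ($strand, $cds_start, $cds_end, $exon_starts, $exon_ends) = @_;
    return () if $cds_start == $cds_end;    # assumed marker for non-coding
    my @blocks;
    for my $i (0 .. $#$exon_starts) {
        my ($s, $e) = ($exon_starts->[$i], $exon_ends->[$i]);
        # Keep only the part of the exon downstream of the stop codon:
        # genomic right of cdsEnd on '+', genomic left of cdsStart on '-'.
        if ($strand eq '+') { $s = max($s, $cds_end) }
        else                { $e = min($e, $cds_start) }
        push @blocks, [$s, $e] if $s < $e;
    }
    return @blocks;
}

# True if the last coding base falls in the final exon of the transcript.
sub cds_ends_in_last_exon {
    my ($strand, $cds_start, $cds_end, $exon_starts, $exon_ends) = @_;
    return $strand eq '+' ? $cds_end   > $exon_starts->[-1]
                          : $cds_start < $exon_ends->[0];
}

# Made-up transcript whose CDS ends in the second of three exons.
my @starts = (100, 500, 900);
my @ends   = (300, 700, 1000);
my @blocks = utr3_blocks('+', 150, 600, \@starts, \@ends);
print "3'UTR blocks: ", join(',', map { $_->[0] . '-' . $_->[1] } @blocks), "\n";
my $ok    = cds_ends_in_last_exon('+', 150, 600, \@starts, \@ends);
my $label = $ok ? 'CDS ends in last exon' : 'CDS ends before last exon';
print "$label\n";
```

Writing one line per block keeps the intron out of the UTR. Writing a single interval from the first block's start to the last block's end would correspond to the "cover the whole region" option.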
The other potential source of error is alternative termination sites in transcripts. Annotated transcripts are generally listed with only one termination site, and from our observations in human and mouse it tends to be the longest commonly used site. If you look at real RNA-Seq data, though, you can often see evidence for multiple alternative termination sites, where different proportions of the transcripts terminate at two or more positions. Depending on what you want to do with the set of UTRs you generate, this might be something to consider.