
**HESmith** · 03-01-2013, 11:48 AM

Ever hear of alternative splicing?

**Bukowski** · 03-01-2013, 11:54 AM

No, different transcription start sites will lead to different 5'UTRs etc. But I admire the line of thought that made you assume the annotators of genes were incorrect.

**gene_x** · 03-01-2013, 12:26 PM

Here is how I encountered the problem. I'm trying to write a Perl script to parse a mouse RefSeq file and give me promoters, introns, coding sequences, 5'UTRs and 3'UTRs. I've posted my code below. If you look at the code, I basically get one 5'UTR and one 3'UTR for each entry in the RefSeq file. That's how I got an equal number of 5'UTRs and 3'UTRs.

I made an older set of these files a while ago and I don't remember how I generated them, but the old 5'UTR file has more entries than the old 3'UTR file, and that's what made me wonder why this is the case.

I'd really appreciate it if you could explain whether my thinking or my code is wrong. Any comments on how to improve this code would also be greatly appreciated! I find my coding skills are improving very slowly.

Code:

#!/usr/bin/perl -w
use strict;

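my $usage ="
This script takes a refGene file in the default format and outputs several
genomic feature files.

Usage: perl genomic_feature.pl <refGene> <promoters> <cds> <introns> <utr5> <utr3> <wholegene>
";
# the script needs exactly seven arguments (one input, six outputs)
die $usage unless @ARGV == 7;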

my ($input,$prom,$cds,$intron,$utr5,$utr3,$wholegene) = @ARGV;

open (IN, $input) || die "cannot open $input";
open (PM, ">$prom") || die "cannot open $prom";
open (CD, ">$cds") || die "cannot open $cds";
open (IR, ">$intron") || die "cannot open $intron";
open (UT5, ">$utr5") || die "cannot open $utr5";
open (UT3, ">$utr3") || die "cannot open $utr3";
open (WG, ">$wholegene") || die "cannot open $wholegene";

<IN>;	# discard the first line (assumed to be a header row)
while (<IN>){
	chomp;
	my @array = split/\t/;
	my $refid=$array[1];
	my $chr=$array[2];
	my $strand=$array[3];
	my $txstart=$array[4];
	my $txend=$array[5];
	my $cdsstart=$array[6];
	my $cdsend=$array[7];
	my $exstart=$array[9];
	my $exend=$array[10];
	my $gensym=$array[12];
	my $promstart;
	my $promend;
	my $prombound;
	my $utr5start;
	my $utr5end;
	my $utr3start;
	my $utr3end;
	if ($strand eq '+'){
		$promstart =$txstart - 2000;
		$promend = $txstart;
		$prombound = $promstart;
		$utr5start = $txstart;
		$utr5end = $cdsstart;
		$utr3start = $cdsend;
		$utr3end = $txend;
	}
	else{
		$promstart =$txend;
		$promend =$txend+2000;
		$prombound = $promend;
		$utr5start = $cdsend;
		$utr5end = $txend;
		$utr3start = $txstart;
		$utr3end = $cdsstart;
	}
	
	## print whole gene bed6 format
	my $geneinfo = $refid.'_wholegene_'.$chr.'_'.$strand;
	print WG "$chr\t$txstart\t$txend\t$geneinfo\t0\t$strand\n";
	# print promoters
	my $prominfo = $refid.'_up_2000_'.$chr.'_'.$prombound.'_'.$strand;
	print PM "$chr\t$promstart\t$promend\t$prominfo\t1\t$strand\n";
	# print 5'UTR
	my $utr5info= $refid.'_utr5_'.$chr.'_'.$utr5start.'_'.$strand;
	print UT5 "$chr\t$utr5start\t$utr5end\t$utr5info\t2\t$strand\n";
	# print 3'UTR
	my $utr3info= $refid.'_utr3_'.$chr.'_'.$utr3start.'_'.$strand;
	print UT3 "$chr\t$utr3start\t$utr3end\t$utr3info\t5\t$strand\n";
	
	# print coding sequences
	my @exonst = split(/,/,$exstart);
	my @exoned = split(/,/,$exend);
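	# first coding block; note this assumes the CDS starts in the first exon
	# and ends in the last one. With a single exon the block ends at the CDS end.
	my $cdsinfo1st = $refid.'_cds_1_'.$chr.'_'.$cdsstart.'_'.$strand;
	my $cds1end = scalar(@exonst) == 1 ? $cdsend : $exoned[0];
	print CD "$chr\t$cdsstart\t$cds1end\t$cdsinfo1st\t3\t$strand\n";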
	my $cdsinfolast = $refid.'_cds_'.scalar(@exonst).'_'.$chr.'_'.$exonst[-1].'_'.$strand;
	
	my $introninfo1st = $refid.'_intron_1_'.$chr.'_'.$exoned[0].'_'.$strand;
	# if there are 2 coding sequences
	if (scalar (@exonst) ==2 ){		
		print CD "$chr\t$exonst[-1]\t$cdsend\t$cdsinfolast\t3\t$strand\n";
		print IR "$chr\t$exoned[0]\t$exonst[1]\t$introninfo1st\t4\t$strand\n"; 
	}
	# if there are more than 2 coding sequences
	elsif (scalar (@exonst) >2 ){
		print IR "$chr\t$exoned[0]\t$exonst[1]\t$introninfo1st\t4\t$strand\n"; 
		for (my $i=2;$i <=$#exonst; $i++){
			my $cdsinfo = $refid.'_cds_'.$i.'_'.$chr.'_'.$exonst[$i-1].'_'.$strand;
			print CD "$chr\t$exonst[$i-1]\t$exoned[$i-1]\t$cdsinfo\t3\t$strand\n";
			my $introninfo = $refid.'_intron_'.$i.'_'.$chr.'_'.$exoned[$i-1].'_'.$strand;
			print IR "$chr\t$exoned[$i-1]\t$exonst[$i]\t$introninfo\t4\t$strand\n"; 
		}
		print CD "$chr\t$exonst[-1]\t$cdsend\t$cdsinfolast\t3\t$strand\n";
	}
}

close IN;
close PM;
close CD;
close IR;
close UT5;
close UT3;
close WG;

**HESmith** · 03-02-2013, 08:25 AM

Originally posted by Bukowski:

No, different transcription start sites will lead to different 5'UTRs etc. But I admire the line of thought that made you assume the annotators of genes were incorrect.

Alternative splicing of exon(s) at the 5' end of the transcript that contain non-coding sequence would change the 5'UTR (in addition to alternative start sites), no? The reply was more of a hint than an answer, to guide the poster's thinking about the solution.
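To make the splicing point concrete, here is a minimal Perl sketch, not taken from the script above. It clips each exon to the window between the transcript start and the CDS start, so the 5'UTR comes out as a list of exonic blocks rather than one span. The helper name `exonic_blocks` and the coordinates are made up for illustration, and it assumes refGene-style 0-based, half-open intervals on the plus strand.

```perl
use strict;
use warnings;

# Hypothetical helper: return the exonic pieces of a transcript that fall
# inside the half-open window [$from, $to), as [start, end] pairs.
sub exonic_blocks {
    my ($starts, $ends, $from, $to) = @_;
    my @blocks;
    for my $i (0 .. $#$starts) {
        # clip the exon to the window
        my $s = $starts->[$i] > $from ? $starts->[$i] : $from;
        my $e = $ends->[$i]   < $to   ? $ends->[$i]   : $to;
        push @blocks, [$s, $e] if $s < $e;
    }
    return @blocks;
}

# Two made-up plus-strand transcripts with the same txStart (100) and
# cdsStart (500). Transcript B skips the non-coding second exon of A.
my @a = exonic_blocks([100, 300, 450], [200, 350, 600], 100, 500);
my @b = exonic_blocks([100, 450],      [200, 600],      100, 500);

print "A: ", join(",", map { "$_->[0]-$_->[1]" } @a), "\n";
print "B: ", join(",", map { "$_->[0]-$_->[1]" } @b), "\n";
```

Both transcripts would get the identical single interval 100-500 if the UTR were taken as txStart to cdsStart, yet their spliced 5'UTRs differ. For minus-strand entries the same helper could be called with the window from cdsEnd to txEnd.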

**simonandrews** · 03-03-2013, 12:32 AM

Having also tried this recently, I found there are a few problems with defining a set of 5' and 3' UTRs (I was actually only looking at 3').

One problem was that in theory the coding sequence for a transcript should terminate in the final exon of the transcript, otherwise the transcript will be removed by nonsense-mediated decay. However, there are a number of annotated transcripts where this is not the case. I chose to remove these from my analysis, but if you keep them you either need to handle spliced 3' UTRs or cover the whole region, including the non-coding intron.
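A rough sketch of that filter in Perl, for anyone who wants to flag such transcripts. This is an illustration under assumptions, not the code I actually used: the helper name `cds_ends_in_last_exon` is hypothetical, the coordinates are made up, and it assumes refGene-style columns (0-based starts, exclusive ends, exons listed in ascending genomic order) for a coding transcript.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the CDS terminates in the transcript's
# final exon, 0 otherwise.
sub cds_ends_in_last_exon {
    my ($strand, $cds_start, $cds_end, $ex_starts, $ex_ends) = @_;
    if ($strand eq '+') {
        # Final exon is the last one listed; translation stops at cdsEnd.
        return $cds_end > $ex_starts->[-1] ? 1 : 0;
    }
    # Minus strand: final exon is the first one listed; translation stops
    # at cdsStart.
    return $cds_start < $ex_ends->[0] ? 1 : 0;
}

# Made-up three-exon transcript.
my @starts = (100, 300, 450);
my @ends   = (200, 350, 600);

for my $t (['+', 150, 580], ['+', 150, 340], ['-', 150, 580], ['-', 320, 580]) {
    my ($strand, $cs, $ce) = @$t;
    my $ok = cds_ends_in_last_exon($strand, $cs, $ce, \@starts, \@ends);
    print "$strand CDS $cs-$ce: ", ($ok ? "keep" : "spliced 3'UTR"), "\n";
}
```

Transcripts that fail the check are the ones whose 3' UTR spans an intron, so they either get dropped or need their UTR written out as several exonic blocks.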

The other potential source for error is alternative termination sites in transcripts. Annotated transcripts are generally listed with only one termination site and from our observations in human/mouse it tends to be the longest commonly used site. If you look at real RNA-Seq data though you can often see evidence for multiple alternate termination sites, where different proportions of the transcripts can terminate at 2 or more positions. Depending on what you want to do with the set of UTRs you generate this might be something to consider.

5'UTR and 3'UTR

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News