
  • 5'UTR and 3'UTR

    This seems to be a trivial question. Does every transcript have one 5'UTR and one 3'UTR? I think the answer is yes, but I stumbled upon some annotation files that have more 5'UTRs than 3'UTRs, and this really confuses me.

  • #2
    Ever hear of alternative splicing?



    • #3
      No, different transcription start sites will lead to different 5'UTRs etc. But I admire your line of thought that made you assume the annotators of genes were incorrect



      • #4
        Here is how I encountered the problem. I'm trying to write a Perl script to parse a mouse refSeq file and give me promoters, introns, coding sequences, 5'UTRs and 3'UTRs. I posted my code here. If you look at the code, I basically get one 5'UTR and one 3'UTR for each entry in the refSeq file. That's how I got equal numbers of 5'UTRs and 3'UTRs.

        I made these files a while ago and I don't remember how I got them, but the old 5'UTR file has more entries than the 3'UTR file, and that's how I started to wonder why this is the case.

        I would really appreciate it if you could explain whether my thinking/code is wrong. Any comments on how to improve this code will also be greatly appreciated! I find my coding skills are improving very slowly.

        Code:
        #!/usr/bin/perl -w
        use strict;
        
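        my $usage ="
        This script takes a refGene file in the default format and outputs several
        genomic feature files in BED6 format.
        
        Usage: perl genomic_feature.pl <refGene> <promoter> <cds> <intron> <utr5> <utr3> <wholegene>
        ";
        # all seven arguments are required: one input file and six output files
        die $usage unless @ARGV == 7;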
        
        my ($input,$prom,$cds,$intron,$utr5,$utr3,$wholegene) = @ARGV;
        
        open (IN, $input) || die "cannot open $input";
        open (PM, ">$prom") || die "cannot open $prom";
        open (CD, ">$cds") || die "cannot open $cds";
        open (IR, ">$intron") || die "cannot open $intron";
        open (UT5, ">$utr5") || die "cannot open $utr5";
        open (UT3, ">$utr3") || die "cannot open $utr3";
        open (WG, ">$wholegene") || die "cannot open $wholegene";
        
        <IN>;
        while (<IN>){
        	chomp;
        	my @array = split/\t/;
        	my $refid=$array[1];
        	my $chr=$array[2];
        	my $strand=$array[3];
        	my $txstart=$array[4];
        	my $txend=$array[5];
        	my $cdsstart=$array[6];
        	my $cdsend=$array[7];
        	my $exstart=$array[9];
        	my $exend=$array[10];
        	my $gensym=$array[12];
        	my $promstart;
        	my $promend;
        	my $prombound;
        	my $utr5start;
        	my $utr5end;
        	my $utr3start;
        	my $utr3end;
        	if ($strand eq '+'){
        		$promstart =$txstart - 2000;
        		$promend = $txstart;
        		$prombound = $promstart;
        		$utr5start = $txstart;
        		$utr5end = $cdsstart;
        		$utr3start = $cdsend;
        		$utr3end = $txend;
        	}
        	else{
        		$promstart =$txend;
        		$promend =$txend+2000;
        		$prombound = $promend;
        		$utr5start = $cdsend;
        		$utr5end = $txend;
        		$utr3start = $txstart;
        		$utr3end = $cdsstart;
        	}
        	
        	## print whole gene bed6 format
        	my $geneinfo = $refid.'_wholegene_'.$chr.'_'.$strand;
        	print WG "$chr\t$txstart\t$txend\t$geneinfo\t0\t$strand\n";
        	# print promoters
        	my $prominfo = $refid.'_up_2000_'.$chr.'_'.$prombound.'_'.$strand;
        	print PM "$chr\t$promstart\t$promend\t$prominfo\t1\t$strand\n";
        	# print 5'UTR
        	my $utr5info= $refid.'_utr5_'.$chr.'_'.$utr5start.'_'.$strand;
        	print UT5 "$chr\t$utr5start\t$utr5end\t$utr5info\t2\t$strand\n";
        	# print 3'UTR
        	my $utr3info= $refid.'_utr3_'.$chr.'_'.$utr3start.'_'.$strand;
        	print UT3 "$chr\t$utr3start\t$utr3end\t$utr3info\t5\t$strand\n";
        	
        	# print coding sequences
        	my @exonst = split(/,/,$exstart);
        	my @exoned = split(/,/,$exend);
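        	# first coding segment; a single-exon transcript ends at the CDS end,
        	# otherwise the segment ends at the end of the first exon
        	my $cdsinfo1st = $refid.'_cds_1_'.$chr.'_'.$cdsstart.'_'.$strand;
        	my $cds1end = (scalar(@exonst) == 1) ? $cdsend : $exoned[0];
        	print CD "$chr\t$cdsstart\t$cds1end\t$cdsinfo1st\t3\t$strand\n";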
        	my $cdsinfolast = $refid.'_cds_'.scalar(@exonst).'_'.$chr.'_'.$exonst[-1].'_'.$strand;
        	
        	my $introninfo1st = $refid.'_intron_1_'.$chr.'_'.$exoned[0].'_'.$strand;
        	# if there are 2 coding sequences
        	if (scalar (@exonst) ==2 ){		
        		print CD "$chr\t$exonst[-1]\t$cdsend\t$cdsinfolast\t3\t$strand\n";
        		print IR "$chr\t$exoned[0]\t$exonst[1]\t$introninfo1st\t4\t$strand\n"; 
        	}
        	# if there are more than 2 coding sequences
        	elsif (scalar (@exonst) >2 ){
        		print IR "$chr\t$exoned[0]\t$exonst[1]\t$introninfo1st\t4\t$strand\n"; 
        		for (my $i=2;$i <=$#exonst; $i++){
        			my $cdsinfo = $refid.'_cds_'.$i.'_'.$chr.'_'.$exonst[$i-1].'_'.$strand;
        			print CD "$chr\t$exonst[$i-1]\t$exoned[$i-1]\t$cdsinfo\t3\t$strand\n";
        			my $introninfo = $refid.'_intron_'.$i.'_'.$chr.'_'.$exoned[$i-1].'_'.$strand;
        			print IR "$chr\t$exoned[$i-1]\t$exonst[$i]\t$introninfo\t4\t$strand\n"; 
        		}
        		print CD "$chr\t$exonst[-1]\t$cdsend\t$cdsinfolast\t3\t$strand\n";
        	}
        }
        
        close IN;
        close PM;
        close CD;
        close IR;
        close UT5;
        close UT3;
        close WG;



        • #5
          Originally posted by Bukowski
          No, different transcription start sites will lead to different 5'UTRs etc. But I admire your line of thought that made you assume the annotators of genes were incorrect
          Alternative splicing of exon(s) at the 5' end of the transcript that contain non-coding sequence would change the 5'UTR, no (in addition to alternative start sites)? The reply was more of a hint than an answer, to guide the poster's thinking about the solution.
          Last edited by HESmith; 03-02-2013, 08:30 AM.
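
          One possible way an annotation file ends up with more 5'UTRs than 3'UTRs, in line with the replies above: if identical intervals are collapsed, transcripts that differ only in their start site contribute several 5'UTRs but a single shared 3'UTR. The sketch below is in Python rather than the thread's Perl, purely for brevity. The function name `distinct_utrs` and the two example transcripts are made up for illustration, and the intervals are the simple unspliced tx/CDS spans used in the script above.

```python
def distinct_utrs(transcripts):
    """Collect unique 5' and 3' UTR intervals from (chrom, strand, txStart,
    txEnd, cdsStart, cdsEnd) tuples, ignoring exon structure."""
    utr5, utr3 = set(), set()
    for chrom, strand, tx_start, tx_end, cds_start, cds_end in transcripts:
        if cds_start >= cds_end:
            # Assumption: an empty CDS marks a non-coding transcript; skip it.
            continue
        if strand == "+":
            five, three = (tx_start, cds_start), (cds_end, tx_end)
        else:
            five, three = (cds_end, tx_end), (tx_start, cds_start)
        # Zero-length UTRs (CDS touching the transcript end) are dropped.
        if five[0] < five[1]:
            utr5.add((chrom, strand) + five)
        if three[0] < three[1]:
            utr3.add((chrom, strand) + three)
    return utr5, utr3


# Two hypothetical isoforms: different start sites, same CDS and 3' end.
example = [
    ("chr1", "+", 1000, 5000, 1200, 4500),
    ("chr1", "+", 1100, 5000, 1200, 4500),
]
five, three = distinct_utrs(example)
print(len(five), len(three))  # prints "2 1"
```

          Whether the old files were actually built this way is unknown; deduplication and dropped zero-length UTRs are only two candidate explanations for unequal counts.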

          Comment


          • #6
            Having also tried this recently, I found a few problems with defining a set of 5' and 3' UTRs (I was actually only looking at 3').

            One problem was that, in theory, the coding sequence for a transcript should terminate in the final exon of the transcript; otherwise the transcript will be removed by nonsense-mediated decay. However, there are a number of annotated transcripts where this is not the case. I chose to remove these from my analysis, but if you keep them you either need to handle spliced 3' UTRs or cover the whole region, including the non-coding intron.

            The other potential source of error is alternative termination sites in transcripts. Annotated transcripts are generally listed with only one termination site, and from our observations in human/mouse it tends to be the longest commonly used site. If you look at real RNA-Seq data, though, you can often see evidence for multiple alternative termination sites, where different proportions of the transcripts terminate at two or more positions. Depending on what you want to do with the set of UTRs you generate, this might be something to consider.
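
            The spliced UTRs mentioned above can be handled by clipping each exon against the CDS boundaries instead of taking one interval from the transcript end to the CDS end. The following is a minimal sketch in Python (not the thread's Perl), assuming refGene-style 0-based, half-open coordinates and comma-separated exon lists. The names `utr_segments` and `parse_coords` are hypothetical helpers, and treating cdsStart == cdsEnd as "non-coding" is an assumption about the input file.

```python
def parse_coords(field):
    """Turn a refGene-style list such as '100,300,500,' into integers."""
    return [int(x) for x in field.split(",") if x]


def utr_segments(strand, cds_start, cds_end, exon_starts, exon_ends):
    """Return (utr5, utr3) as lists of (start, end) exon pieces.

    Exon parts left of the CDS and right of the CDS are collected
    separately, so introns inside a UTR are never reported as UTR.
    """
    if cds_start >= cds_end:
        # Assumption: an empty CDS marks a non-coding transcript.
        return [], []
    left, right = [], []
    for start, end in zip(exon_starts, exon_ends):
        if start < cds_start:
            left.append((start, min(end, cds_start)))
        if end > cds_end:
            right.append((max(start, cds_end), end))
    # On the minus strand the genomic right side is the 5' end.
    return (left, right) if strand == "+" else (right, left)


# Hypothetical transcript whose CDS stops in the second of three exons.
starts = parse_coords("100,300,500,")
ends = parse_coords("200,400,600,")
print(utr_segments("+", 150, 350, starts, ends))
# prints ([(100, 150)], [(350, 400), (500, 600)])
```

            With this layout the 3'UTR comes back as two pieces, so the intron between them is not counted as UTR. Alternative termination sites are not addressed here, since they are not present in the annotation itself.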
