  • 5'UTR and 3'UTR

    This seems to be a trivial question. Does every transcript have one 5'UTR and one 3'UTR? I think the answer is yes, but I stumbled upon some annotation files that have more 5'UTRs than 3'UTRs, and this makes me really confused.

  • #2
    Ever hear of alternative splicing?



    • #3
      No, different transcription start sites will lead to different 5'UTRs etc. But I admire your line of thought that made you assume the annotators of genes were incorrect



      • #4
        Here is how I encountered the problem. I'm trying to write a Perl script to parse a mouse refSeq file and give me promoters, introns, coding sequences, 5'UTRs and 3'UTRs. I posted my code here. If you look at the code, I basically get one 5'UTR and one 3'UTR for each entry in the refSeq file. That's how I got equal numbers of 5'UTRs and 3'UTRs.

        I made the old files a while ago and I don't remember how I got them, but the old 5'UTR file has more entries than the 3'UTR file, and that's how I started to wonder why this is the case.

        I would really appreciate it if you could explain whether my thinking/code is wrong. Any comments on how to improve this code would also be greatly appreciated! I find my coding skills are improving very slowly.

        Code:
        #!/usr/bin/perl -w
        use strict;
        
        my $usage ="
        This script takes a refGene file in the default format and outputs several
        genomic feature files in BED6 format.
        
        Usage: perl genomic_feature.pl <refGene> <promoter> <cds> <intron> <utr5> <utr3> <wholegene>
        ";
        # all seven file names are required
        die $usage unless @ARGV == 7;
        
        my ($input,$prom,$cds,$intron,$utr5,$utr3,$wholegene) = @ARGV;
        
        open (IN, $input) || die "cannot open $input";
        open (PM, ">$prom") || die "cannot open $prom";
        open (CD, ">$cds") || die "cannot open $cds";
        open (IR, ">$intron") || die "cannot open $intron";
        open (UT5, ">$utr5") || die "cannot open $utr5";
        open (UT3, ">$utr3") || die "cannot open $utr3";
        open (WG, ">$wholegene") || die "cannot open $wholegene";
        
        <IN>;	# read and discard the first line (assumed to be a header)
        while (<IN>){
        	chomp;
        	my @array = split/\t/;
        	my $refid=$array[1];
        	my $chr=$array[2];
        	my $strand=$array[3];
        	my $txstart=$array[4];
        	my $txend=$array[5];
        	my $cdsstart=$array[6];
        	my $cdsend=$array[7];
        	my $exstart=$array[9];
        	my $exend=$array[10];
        	my $gensym=$array[12];
        	my $promstart;
        	my $promend;
        	my $prombound;
        	my $utr5start;
        	my $utr5end;
        	my $utr3start;
        	my $utr3end;
        	if ($strand eq '+'){
        		$promstart =$txstart - 2000;
        		$promend = $txstart;
        		$prombound = $promstart;
        		$utr5start = $txstart;
        		$utr5end = $cdsstart;
        		$utr3start = $cdsend;
        		$utr3end = $txend;
        	}
        	else{
        		$promstart =$txend;
        		$promend =$txend+2000;
        		$prombound = $promend;
        		$utr5start = $cdsend;
        		$utr5end = $txend;
        		$utr3start = $txstart;
        		$utr3end = $cdsstart;
        	}
        	
        	## print whole gene bed6 format
        	my $geneinfo = $refid.'_wholegene_'.$chr.'_'.$strand;
        	print WG "$chr\t$txstart\t$txend\t$geneinfo\t0\t$strand\n";
        	# print promoters
        	my $prominfo = $refid.'_up_2000_'.$chr.'_'.$prombound.'_'.$strand;
        	print PM "$chr\t$promstart\t$promend\t$prominfo\t1\t$strand\n";
        	# print 5'UTR
        	my $utr5info= $refid.'_utr5_'.$chr.'_'.$utr5start.'_'.$strand;
        	print UT5 "$chr\t$utr5start\t$utr5end\t$utr5info\t2\t$strand\n";
        	# print 3'UTR
        	my $utr3info= $refid.'_utr3_'.$chr.'_'.$utr3start.'_'.$strand;
        	print UT3 "$chr\t$utr3start\t$utr3end\t$utr3info\t5\t$strand\n";
        	
        	# print coding sequences
        	my @exonst = split(/,/,$exstart);
        	my @exoned = split(/,/,$exend);
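        	# first coding block: it ends at cdsEnd if the transcript has a single exon,
        	# otherwise at the end of the first exon
        	my $firstcdsend = (scalar(@exonst) == 1) ? $cdsend : $exoned[0];
        	my $cdsinfo1st = $refid.'_cds_1_'.$chr.'_'.$cdsstart.'_'.$strand;
        	print CD "$chr\t$cdsstart\t$firstcdsend\t$cdsinfo1st\t3\t$strand\n";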
        	my $cdsinfolast = $refid.'_cds_'.scalar(@exonst).'_'.$chr.'_'.$exonst[-1].'_'.$strand;
        	
        	my $introninfo1st = $refid.'_intron_1_'.$chr.'_'.$exoned[0].'_'.$strand;
        	# if there are 2 coding sequences
        	if (scalar (@exonst) ==2 ){		
        		print CD "$chr\t$exonst[-1]\t$cdsend\t$cdsinfolast\t3\t$strand\n";
        		print IR "$chr\t$exoned[0]\t$exonst[1]\t$introninfo1st\t4\t$strand\n"; 
        	}
        	# if there are more than 2 coding sequences
        	elsif (scalar (@exonst) >2 ){
        		print IR "$chr\t$exoned[0]\t$exonst[1]\t$introninfo1st\t4\t$strand\n"; 
        		for (my $i=2;$i <=$#exonst; $i++){
        			my $cdsinfo = $refid.'_cds_'.$i.'_'.$chr.'_'.$exonst[$i-1].'_'.$strand;
        			print CD "$chr\t$exonst[$i-1]\t$exoned[$i-1]\t$cdsinfo\t3\t$strand\n";
        			my $introninfo = $refid.'_intron_'.$i.'_'.$chr.'_'.$exoned[$i-1].'_'.$strand;
        			print IR "$chr\t$exoned[$i-1]\t$exonst[$i]\t$introninfo\t4\t$strand\n"; 
        		}
        		print CD "$chr\t$exonst[-1]\t$cdsend\t$cdsinfolast\t3\t$strand\n";
        	}
        }
        
        close IN;
        close PM;
        close CD;
        close IR;
        close UT5;
        close UT3;
        close WG;
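
One possible source of the count mismatch, offered as a sketch rather than a diagnosis: the script above writes a 5'UTR line and a 3'UTR line for every row, even when the interval is empty (for example when cdsStart equals txStart). If the older files were made by a tool that skipped empty intervals, the two counts could differ. The example below assumes that non-coding rows are encoded with cdsStart equal to cdsEnd, which is worth checking against your own copy of the file. The function name utr_spans and the three rows are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the genomic spans of the 5'UTR and 3'UTR for one refGene-style row.
# A span is undef when that UTR is empty or the transcript has no CDS.
sub utr_spans {
    my ($strand, $txstart, $txend, $cdsstart, $cdsend) = @_;
    # Assumption: non-coding rows are encoded with cdsStart == cdsEnd.
    return (undef, undef) if $cdsstart == $cdsend;
    my $left  = $txstart < $cdsstart ? [$txstart, $cdsstart] : undef;
    my $right = $cdsend  < $txend    ? [$cdsend,  $txend]    : undef;
    # On the minus strand the right-hand span is the 5'UTR.
    return $strand eq '+' ? ($left, $right) : ($right, $left);
}

# Made-up rows: strand, txStart, txEnd, cdsStart, cdsEnd.
my @rows = (
    ['+', 100, 900, 150, 800],   # both UTRs present
    ['-', 100, 900, 100, 800],   # minus strand, empty 3'UTR
    ['+', 100, 900, 900, 900],   # treated as non-coding
);

my ($n5, $n3) = (0, 0);
for my $row (@rows) {
    my ($utr5, $utr3) = utr_spans(@$row);
    $n5++ if defined $utr5;
    $n3++ if defined $utr3;
}
print "5'UTRs: $n5, 3'UTRs: $n3\n";   # prints: 5'UTRs: 2, 3'UTRs: 1
```

With these three rows the 5'UTR count is larger than the 3'UTR count, even though each row is a single transcript.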



        • #5
          Originally posted by Bukowski:
          No, different transcription start sites will lead to different 5'UTRs etc. But I admire your line of thought that made you assume the annotators of genes were incorrect
          Alternative splicing of exon(s) at the 5' end of the transcript that contain non-coding sequence would change the 5'UTR, no? (in addition to alternative start sites) The reply was more of a hint than an answer, to guide the poster's thinking about the solution.
          Last edited by HESmith; 03-02-2013, 08:30 AM.
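
To make the splicing point concrete: the script in #4 reports the 5'UTR as the single span from txStart to cdsStart, which includes any introns between non-coding exons. A minimal sketch of the alternative is to clip each exon to that span and keep the pieces. The helper name clip_exons and the coordinates are hypothetical, and a plus-strand transcript with 0-based, half-open coordinates is assumed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Clip each exon to the window [$from, $to) and keep the non-empty pieces.
sub clip_exons {
    my ($starts, $ends, $from, $to) = @_;
    my @blocks;
    for my $i (0 .. $#$starts) {
        my $s = $starts->[$i] > $from ? $starts->[$i] : $from;
        my $e = $ends->[$i]   < $to   ? $ends->[$i]   : $to;
        push @blocks, [$s, $e] if $s < $e;
    }
    return @blocks;
}

# Hypothetical plus-strand transcript whose first exon is entirely non-coding.
my @exon_starts = (1000, 2000, 3000);
my @exon_ends   = (1100, 2200, 3500);
my ($txstart, $cdsstart) = (1000, 2050);

my @utr5 = clip_exons(\@exon_starts, \@exon_ends, $txstart, $cdsstart);
print join(',', map { "$_->[0]-$_->[1]" } @utr5), "\n";   # prints: 1000-1100,2000-2050
```

Here the 5'UTR comes out as two blocks, so a transcript that skips or swaps the first exon would get a different 5'UTR even with the same start codon. For a minus-strand transcript the same helper could be called with cdsEnd and txEnd as the window.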



          • #6
            Having also tried this recently, I found a few problems with defining a set of 5' and 3' UTRs (I was actually only looking at 3').

            One problem was that in theory the coding sequence for a transcript should terminate in the final exon of the transcript, otherwise it will be removed by nonsense mediated decay - however there are a number of annotated transcripts where this is not the case. I chose to remove these from my analysis but if you keep them you either need to handle spliced 3' UTRs or cover the whole region, including the non-coding intron.
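
A rough sketch of how such transcripts could be flagged, assuming refGene-style rows with 0-based, half-open coordinates and exons listed in genomic order. The function name cds_ends_in_final_exon and the coordinates are made up for illustration, and non-coding rows would need to be filtered out beforehand.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 when the 3' end of the CDS lies inside the transcript's final exon,
# 0 otherwise. Exon arrays are in genomic (left to right) order.
sub cds_ends_in_final_exon {
    my ($strand, $cdsstart, $cdsend, $starts, $ends) = @_;
    if ($strand eq '+') {
        # The final exon is the rightmost one; the CDS must reach into it.
        return $cdsend > $starts->[-1] ? 1 : 0;
    }
    # On the minus strand the final exon is the leftmost one.
    return $cdsstart < $ends->[0] ? 1 : 0;
}

my @starts = (1000, 2000, 3000);
my @ends   = (1100, 2200, 3500);

print cds_ends_in_final_exon('+', 1050, 3200, \@starts, \@ends), "\n";   # 1
print cds_ends_in_final_exon('+', 1050, 2100, \@starts, \@ends), "\n";   # 0
print cds_ends_in_final_exon('-', 1050, 3200, \@starts, \@ends), "\n";   # 1
print cds_ends_in_final_exon('-', 2100, 3200, \@starts, \@ends), "\n";   # 0
```

Rows that return 0 are the ones that would need either a spliced 3'UTR or a decision to drop them.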

            The other potential source for error is alternative termination sites in transcripts. Annotated transcripts are generally listed with only one termination site and from our observations in human/mouse it tends to be the longest commonly used site. If you look at real RNA-Seq data though you can often see evidence for multiple alternate termination sites, where different proportions of the transcripts can terminate at 2 or more positions. Depending on what you want to do with the set of UTRs you generate this might be something to consider.
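
As a first check on the annotation side, one could count how many distinct 3' ends each gene already has among its annotated transcripts. This is only a sketch with made-up rows and gene names; it says nothing about unannotated sites seen in RNA-Seq data, which would need read-level evidence.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up rows: gene symbol, strand, txStart, txEnd.
my @rows = (
    ['GeneA', '+', 100, 900],
    ['GeneA', '+', 100, 1200],
    ['GeneA', '+', 150, 900],
    ['GeneB', '-', 500, 2000],
    ['GeneB', '-', 500, 1800],
);

# The transcript's 3' end is txEnd on the plus strand and txStart on the minus strand.
my %ends_by_gene;
for my $row (@rows) {
    my ($gene, $strand, $txstart, $txend) = @$row;
    my $end3 = $strand eq '+' ? $txend : $txstart;
    $ends_by_gene{$gene}{$end3} = 1;
}

# Print gene, number of distinct 3' ends, and the sorted positions.
for my $gene (sort keys %ends_by_gene) {
    my @ends = sort { $a <=> $b } keys %{ $ends_by_gene{$gene} };
    printf "%s\t%d\t%s\n", $gene, scalar @ends, join(',', @ends);
}
```

With these rows GeneA has two annotated 3' ends (900 and 1200) and GeneB has one (500), since both GeneB transcripts are on the minus strand and share the same txStart.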
