![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Difficulty aligning to reference with AVA | cog | Bioinformatics | 8 | 07-07-2019 07:16 AM |
cufflinks : analysis comparison with and without a gtf reference file | sohnic | Bioinformatics | 3 | 07-07-2019 06:40 AM |
GTF reference files that work with TopHat/Cufflinks | marcora | Bioinformatics | 23 | 01-15-2014 12:10 AM |
Cufflinks' computation of FPKM for --GTF and --GTF-guide estimation | burt | Bioinformatics | 0 | 08-24-2011 12:59 AM |
Difficulty mapping reads with non-reference allele? | krobison | Genomic Resequencing | 3 | 10-09-2009 11:48 AM |
![]() |
|
Thread Tools |
![]() |
#21 |
Member
Location: USA Join Date: May 2010
Posts: 15
|
![]()
Hi, Jun, Here is the script. Any feedback is welcome. Best, ldong
#!/gpfs/runtime/bin/perl use Spreadsheet::WriteExcel; use strict; my %gene; my %alter; my $file = $ARGV[0]; if ($file eq '') { die "Usage: find_gene_and_alternative_splicing_expression_from_cuffcompare_result.pl <\'tmap\' file with path from cuffcompare>\n"; } open I, $file or die "can not open file $file."; while (<I>) { my @p = split /\t/; if ($p[2] eq '=') { $gene{$p[0]} = $gene{$p[0]} + $p[6]; $alter{$p[1]} = $alter{$p[1]} + $p[6]; } } close I; # Create a new workbook called simple.xls and add a worksheet. my $workbook = Spreadsheet::WriteExcel->new($ENV{'LANE'}.'_expression_of_known_gene_&_alternative_splicing.xls'); my $worksheet = $workbook->add_worksheet("AceView Genes"); my $count = 1; $worksheet->write(0, 0, "gene"); $worksheet->write(0, 1, "expression"); for my $k (sort keys %gene) { $worksheet->write_url($count, 0, "http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?exdb=AceView&db=mouse&submit=Go&term=$k", $k); $worksheet->write($count, 1, $gene{$k}); $count++; } $worksheet = $workbook->add_worksheet("AceView Alternative Splicing"); $count = 1; $worksheet->write(0, 0, "alternative splicing variants"); $worksheet->write(0, 1, "expression"); for my $k (sort keys %alter) { $worksheet->write_url($count, 0, "http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?exdb=AceView&db=mouse&submit=Go&term=$k", $k); $worksheet->write($count, 1, $alter{$k}); $count++; } |
![]() |
![]() |
![]() |
#22 |
Member
Location: Dublin Join Date: Mar 2010
Posts: 19
|
![]()
Hi, ldong,
This is great. Thx a million. I will look into it. Cheers, Jun |
![]() |
![]() |
![]() |
#23 |
Member
Location: Dublin Join Date: Mar 2010
Posts: 19
|
![]()
I just read the script, it is very neat and clear. But I notice that you only count the assembled transcripts with "=" mark (match). How do you deal with transcripts with "c" mark (contained).
|
![]() |
![]() |
![]() |
#24 | |
Member
Location: Boston Join Date: Sep 2010
Posts: 36
|
![]()
I have the same problem as you guys here: Cufflinks works well without a reference GTF file, while with reference, it gives 0 FPKM values. And this is not always for all genes, only some genes are like that. I am really confused.
Here are my input files: htt.sam (output from Tophat, for example purpose, I only contain reads in HTT genes regions, where there are >3000 reads. See the screenshot of these aligned reads in UCSC) htt.gtf (GTF for HTT genes, extracted from ensemble gtf file) Here is my code (cufflinks version: 0.9.3.Linux_x86_64): Code:
cufflinks -G htt.gtf htt.sam Code:
$ /home/dongx/bin/cufflinks-0.9.3.Linux_x86_64/cufflinks -G htt.gtf htt.sam /home/dongx/bin/cufflinks-0.9.3.Linux_x86_64/cufflinks: /usr/lib64/libz.so.1: no version information available (required by /home/dongx/bin/cufflinks-0.9.3.Linux_x86_64/cufflinks) [bam_header_read] EOF marker is absent. File htt.sam doesn't appear to be a valid BAM file, trying SAM... [17:33:52] Inspecting reads and determining fragment length distribution. > Processed 1 loci. [*************************] 100% > Map Properties: > Total Map Mass: 3542.44 > Read Type: 40bp single-end > Fragment Length Distribution: Gaussian (default) > Estimated Mean: 204.01 > Estimated Std Dev: 74.81 [17:33:52] Estimating transcript abundances. > Processed 1 loci. [*************************] 100% Code:
$ cat genes.expr gene_id bundle_id chr left right FPKM FPKM_conf_lo FPKM_conf_hi status ENSG00000197386 3 chr4 3076406 3245676 0 0 0 OK Code:
[dongx@hpcc01 ~]$ cat transcripts.expr trans_id bundle_id chr left right FPKM FMI frac FPKM_conf_lo FPKM_conf_hi coverage length effective_length status ENST00000355072 3 chr4 3076406 3245676 0 0 0 0 0 0 13531 13531 OK ENST00000506137 3 chr4 3117054 3123439 0 0 0 0 0 0 784 784 OK ENST00000512909 3 chr4 3123125 3125193 0 0 0 0 0 0 613 613 OK ENST00000510626 3 chr4 3130064 3245667 0 0 0 0 0 0 14491 14491 OK ENST00000509618 3 chr4 3162087 3174954 0 0 0 0 0 0 432 432 OK ENST00000513639 3 chr4 3180072 3182400 0 0 0 0 0 0 261 261 OK ENST00000513326 3 chr4 3180072 3182514 0 0 0 0 0 0 375 375 OK ENST00000509043 3 chr4 3180072 3182521 0 0 0 0 0 0 382 382 OK ENST00000502820 3 chr4 3204731 3208306 0 0 0 0 0 0 290 290 OK ENST00000509751 3 chr4 3213637 3214782 0 0 0 0 0 0 725 725 OK ENST00000512068 3 chr4 3230370 3231649 0 0 0 0 0 0 403 403 OK ENST00000513806 3 chr4 3230438 3237123 0 0 0 0 0 0 436 436 OK ENST00000508321 3 chr4 3237149 3237906 0 0 0 0 0 0 388 388 OK Anyone have clue for this? You can test the above code in your machine. Pls let me know if you got different results. THANKS!! Quote:
|
|
![]() |
![]() |
![]() |
#25 | |
Member
Location: USA Join Date: May 2010
Posts: 15
|
![]() Quote:
Sorry I didn't see your post earlier. I treat c and j as novel transcripts. His is my script to find them. Feedbacks are welcome: #!/usr/bin/perl use Spreadsheet::WriteExcel; use Text::ParseWords 'shellwords'; use strict; my %exp; my %geneo; #known gene my %genen; #novel gene my %trano; #old transcript my %rel; my %gene; my $file = $ARGV[0]; if (not -f $file) { print "Usage: find_novel_alternative_splicing_expression_from_cuffcompare_result.pl <\'tmap\' file with path from cuffcompare> <threshold for the novel transcript>\n"; die "File not exist: $file"; } if ($ARGV[1] eq '') { print "Usage: find_novel_alternative_splicing_expression_from_cuffcompare_result.pl <\'tmap\' file with path from cuffcompare> <threshold for the novel transcript>\n"; die "You should provide a threshold."; } print "looking for novel transcipts...\n"; #open cufflinks output file and query the novel transcitps. 'j' means novel, 'c' means contain open I, $file; while (<I>) { my @p = split /\t/; if (($p[2] eq 'j' or $p[2] eq 'c') and $p[6] > $ARGV[1]) { #expressoion level $exp{$p[4]} = $p[6]; #old gene name $geneo{$p[4]} = $p[0]; #old transctipt name $trano{$p[4]} = $p[1]; #new gene name $genen{$p[4]} = $p[3]; #relationship $rel{$p[4]}= $p[2]; #old gene name $gene{$p[0]} =1; #new gene name $gene{$p[3]} =1; } } close I; #open the gtf to query the regions in gtf file which are related to the nevol transcipts open I, "/gpfs/runtime/bioinfo/aceviewdb/aceview_annotation_mm9_with_chrID.gtf" or die "can not open gtf file"; print "looking for the related regions in gtf file...\n"; my (%start, %end, %chr); while (<I>) { my ($seqid,$source,$method,$start,$end,$score,$strand,$phase,$tags) = split "\t"; $tags =~ /gene_id\s\"(\S+)?\"/; if($method eq "exon" and defined $gene{$1}) { $start = int($start); $end = int($end); if(not defined $start{$1} or $start{$1} > $start) { $start{$1} = $start; $chr{$1} = $seqid; } if(not defined $end{$1} or $end{$1} < $end) { $end{$1} = $end; } } } close I; #find the file name for transcirpt.gtf $file =~ s/\..+/.gtf/; if(not -f $file) { die "cound not find file $file"; } #query the gtf file for the regions which are related to the novel transcripts, and print out the regions into the new file open I, $file or die "can not open file $file"; open O, ">".$ENV{'LANE'}."_novel_scripts_threshold_".$ARGV[1].".gtf"; while (my $line = <I>) { my ($seqid,$source,$method,$start,$end,$score,$strand,$phase,$tags) = split "\t", $line; $tags =~ /gene_id\s\"(\S+)?\"/; #get all the entries for the novel transcripts if(defined $gene{$1}) { print O $line; } } close I; close O; #find the sam file name $file =~ s/transcripts.gtf/accepted_hits.sam/; if(not -f $file) { die "cound not find file $file"; } open I, $file or die "can not open file $file"; print "looking for related regions in sam file...\n"; #calculate the region we are interested my %range; for my $g (keys %chr) { for my $i($start{$g}..$end{$g}) { $range{$chr{$g}}{$i} = 1; } } $file = $ENV{'LANE'}."_novel_transcripts_related_region_threshold_".$ARGV[1]; #query the sam file for the regions which related to the novel transcipts and print out into a new file open O, ">$file.sam"; while (my $line = <I>) { my ($a,$b,$refid,$start,$d,$e,$f,$end) = split "\t", $line; #this will be removed #$refid = "chr".$refid; if(defined $range{$refid}{int($start)} or defined $range{$refid}{int($end)}) { print O $line; } } close I; close O; #convert the sam file into bam file for UCSC genome browser my $cmd = "samtools view -b -S -T /gpfs/runtime/bioinfo/bowtie/indexes/mm9_all_folded_with_chrID.fa -o $file.bam $file.sam"; #print "$cmd\n"; system($cmd); #sort and index the bam file system("samtools sort $file.bam $file.sorted"); system("samtools index $file.sorted.bam"); #this is part of sam file #ILLUMINA-2F52BD_0002:5:1:2235:9480#0 177 chr1 3047832 255 60M = 173820455 0 TCCTATCTCCACCATGGGCTGCAGCAAGGAGTACCAACCTGGTGCCAATTTAAAATAAAA @D:?=9@>6C-BDDBCCC?CAA5AAA?CBACC55CF=FFFFFFFD:?FFFFFFFDFFFF F NM:i:2 NH:i:1 #ILLUMINA-2F52BD_0002:5:1:1243:4783#0 73 chr1 4122380 0 60M * 0 0 AAGCGAAACCCCGTCCCTATTAAGGAGAATAATCCTTCGCCTAGGACGTGTCACTCCCTG AEFEFBE@BEFFDBEDDFFEFFFFFEEBEBAD??E5EEE?DADFDEDGFGGDFGDFFFDF NM:i:2 NH:i:37 CC:Z:= CP:i:5794552 print "print out result into exel file...\n"; # Create a new workbook called simple.xls and add a worksheet. my $workbook = Spreadsheet::WriteExcel->new($ENV{'LANE'}.'_expression_novel_transcipts.xls'); my $worksheet = $workbook->add_worksheet("Alternative Splicing"); my $count = 1; $worksheet->write(0, 0, "novel gene name"); #alternative splicing variants"); $worksheet->write(0, 1, "novel transcipt"); $worksheet->write(0, 2, "expression"); $worksheet->write(0, 3, "related gene"); $worksheet->write(0, 4, "related transcript"); $worksheet->write(0, 5, "relationship"); $worksheet->write(0, 6, "aceview link"); for my $k (sort keys %exp) { my $range = $chr{$geneo{$k}}.":".($start{$geneo{$k}}-150)."-".($end{$geneo{$k}}+150); $worksheet->write_url($count, 6, "http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Mouse&db=mm9&Submit=submit&position=".$range, $range); $worksheet->write($count, 0, $genen{$k}); $worksheet->write($count, 1, $k); $worksheet->write($count, 2, $exp{$k}); $worksheet->write($count, 3, $geneo{$k}); $worksheet->write($count, 4, $trano{$k}); if ($rel{$k} eq "j") { $worksheet->write($count, 5, "novel"); } else { $worksheet->write($count, 5, "contain"); } $count++; } print "all done\n"; Last edited by ldong; 12-04-2010 at 05:01 AM. |
|
![]() |
![]() |
![]() |
#26 |
Member
Location: Denmark Join Date: Mar 2010
Posts: 10
|
![]()
Hi sterding
I have observed the exact same thing and described it in http://seqanswers.com/forums/showthr...9933#post29933 I have reproduced your strange results using your files locally. If anyone can shed some light on this, it would be great. |
![]() |
![]() |
![]() |
#27 | |
Member
Location: Boston Join Date: Sep 2010
Posts: 36
|
![]() Quote:
I think my problem was due to the wrong Tophat version I was using (v. 1.1.2). It's fixed after V.1.1.3. Check the release news, pls. Regards, Sterding |
|
![]() |
![]() |
![]() |
#28 |
Member
Location: Denmark Join Date: Mar 2010
Posts: 10
|
![]()
Hi Sterding,
That is strange. I am using Tophat version 1.1.4 (beta, Linux x86_64 binary) Bowtie version: 0.12.7.0 Samtools version: 0.1.8.0 And I reproduced your zero values. -Vang42 |
![]() |
![]() |
![]() |
#29 |
Member
Location: Boston Join Date: Sep 2010
Posts: 36
|
![]()
The sam file I attached in my previous post was produced by Tophat V1.1.2, which was wrong! I am re-running the mapping using new version of Tophat (v.1.1.4), hoping it's fine. I will let you know.
|
![]() |
![]() |
![]() |
#30 |
Compendia Bio
Location: Ann Arbor Join Date: Oct 2010
Posts: 35
|
![]()
Thanks for the scripts, ldong. The Spreadsheet::WriteExcel perl module creates an excel spread sheet which is 40mb in size and I am not able to open it... Excel reports that it is corrupted.
Others people using the same module had the problem and they were suggested to use Spreadsheet::WriteExcel::Big module. I have tried using it but I still have the same problem. Any suggestions? More info is here: http://bit.ly/dZh1iS - Dinesh |
![]() |
![]() |
![]() |
#31 |
Member
Location: USA Join Date: May 2010
Posts: 15
|
![]()
Hi, Dinesh,
Thanks for your feedback. If the big file size is the issue, we can try to create smaller Excel file. Anyway, we just want to get a small number of candidates to give a closer look. To make the Excel file smaller, you can either provide a bigger threshold when you run the script, or put a check point in the script and increase the threshold automatically. Another way to do it is to sort all the transcripts and report the top x (less than the max row of Excel). What do you think? Best, ldong |
![]() |
![]() |
![]() |
#32 |
Member
Location: Bhopal Join Date: Jul 2019
Posts: 19
|
![]()
I think I've made sense of what the issue was...
I have utilized a UCSC mm9 reference gathering against an EMSEMBL GTF, and that clearly was the issue. A few traits of the two obviously doesn't coordinate (didn't look in to see precisely why). In the event that rather, I utilize a UCSC GTF for mm9 reference, or on the other hand, utilize the EMSEMBL m_musculus_ncbi37 reference for EMSEMBL GTF, it appears to work. |
![]() |
![]() |
![]() |
Thread Tools | |
|
|