SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
problems with running Cufflinks using Gencode annotation gtf arrchi Bioinformatics 1 02-02-2015 01:50 PM
Annotation with refseq of transcripts rozitaa Bioinformatics 3 07-01-2013 05:49 AM
How to use Ensembl or Gencode annotation for cufflinks? metheuse RNA Sequencing 0 05-02-2013 05:39 AM
refseq vs gencode jarwulf Bioinformatics 1 03-06-2013 08:50 AM

Reply
 
Thread Tools
Old 10-07-2015, 07:26 AM   #1
graceqy
Member
 
Location: New York City

Join Date: Mar 2012
Posts: 14
Default Annotation difference between refSeq and Gencode

Hi all,

I am trying to set up an RNAseq work flow:

1. Generated genome files for STAR using .fna files from NCBI ftp and gtf files from Gencode;

2. Aligned fq using STAR, convert sam to bam and sorted bam.

3. Then I used the sorted bam files to test cufflinks and compared different gtf files for the -G option. The cufflinks output somehow all have different positions for the same genes:

refSeq:
gene_id gene_short_name locus
PDIA3 - chr15:44038589-44064804
CD276 - chr15:73976621-74006859
PROM2 - chr2:95940200-95957055

gencode:
gene_id gene_short_name locus
ENSG00000167004.12 PDIA3 chr15:43746391-43773279
ENSG00000103855.17 CD276 chr15:73683965-73714518
ENSG00000155066.15 PROM2 chr2:95274452-95291308

And the FPKM as a result are very different in the two output.

What am I missing here and how to fix it, please? If the two gtf are inherently different in regard to gene loci, which one should I trust, pls?

Best,
Grace
graceqy is offline   Reply With Quote
Old 10-07-2015, 08:25 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

As you have discovered the hard way it is extremely important to make sure that you are using a consistent genome build/patch level for your analysis (I assume that is what is being reflected in the co-ordinate differences above).

If you want to avoid these types of issues you could download sequence/annotation/index bundles (you will need to roll your own indexes if you want to use STAR but at least the sequence/annotation would be consistent) from iGenomes.

In terms of salvaging the analysis, check to see if there are corresponding annotation files available at NCBI where you got the sequence files.
GenoMax is offline   Reply With Quote
Old 10-07-2015, 08:37 AM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

Example PDIA3:
RefSeq co-ordinates are from Hg19/GRCh37.p19
Gencode are from GRCh38.p2

So if your sequence was from GRCh37/Hg19 then get the corresponding annotation file.
Attached Images
File Type: png NCBI1.PNG (16.0 KB, 6 views)

Last edited by GenoMax; 10-07-2015 at 08:49 AM.
GenoMax is offline   Reply With Quote
Old 10-07-2015, 10:10 AM   #4
graceqy
Member
 
Location: New York City

Join Date: Mar 2012
Posts: 14
Default

Thanks for the responses.

I used GCA_000001405.15_GRCh38_no_alt_analysis_set.fna to build the genome for STAR. Does it mean gencode is the right gtf to use here?

Is it right that if I want to use RefSeq annotation, I could just download hg19 reference sequences from iGenome?

Also the cufflinks output with refseq or gencode gtf are very different, less than 30K genes with refseq and about 60K genes with gencode. Is there any explanation on it?
graceqy is offline   Reply With Quote
Old 10-07-2015, 11:22 AM   #5
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

If you used the GRCh38 fasta then gencode should be the right gtf file to use.

If you want to re-do the alignments then you could go the iGenomes route and save yourself some trouble.

Since you are sampling different areas of the genome with the two GTF files (co-ordinate differences) the cufflinks outputs is different (though 2x is a big change). How are you handling multi-mappers? Perhaps there is a repeat region in one but not the other.
GenoMax is offline   Reply With Quote
Old 10-07-2015, 01:07 PM   #6
graceqy
Member
 
Location: New York City

Join Date: Mar 2012
Posts: 14
Default

Thanks.

By 30K vs 60K difference I meant the row numbers in the cufflinks output with the two different gtf files. The row numbers and genes are fixed for each regardless of the input bam files. I checked and found that gencode gtf returns a lot of rows of Y_RNA or 5s_rRNA. Is there a way to only return mRNA annotation with gencode/GRCh38 gtf, pls?
graceqy is offline   Reply With Quote
Old 10-08-2015, 03:39 AM   #7
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

You can filter the rows you do not want/need from the GTF file using grep. Look into the -v option.
GenoMax is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:24 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO