SEQanswers (
-   Bioinformatics (
-   -   Annotation difference between refSeq and Gencode (

graceqy 10-07-2015 08:26 AM

Annotation difference between refSeq and Gencode
Hi all,

I am trying to set up an RNAseq work flow:

1. Generated genome files for STAR using .fna files from NCBI ftp and gtf files from Gencode;

2. Aligned fq using STAR, convert sam to bam and sorted bam.

3. Then I used the sorted bam files to test cufflinks and compared different gtf files for the -G option. The cufflinks output somehow all have different positions for the same genes:

gene_id gene_short_name locus
PDIA3 - chr15:44038589-44064804
CD276 - chr15:73976621-74006859
PROM2 - chr2:95940200-95957055

gene_id gene_short_name locus
ENSG00000167004.12 PDIA3 chr15:43746391-43773279
ENSG00000103855.17 CD276 chr15:73683965-73714518
ENSG00000155066.15 PROM2 chr2:95274452-95291308

And the FPKM as a result are very different in the two output.

What am I missing here and how to fix it, please? If the two gtf are inherently different in regard to gene loci, which one should I trust, pls?


GenoMax 10-07-2015 09:25 AM

As you have discovered the hard way it is extremely important to make sure that you are using a consistent genome build/patch level for your analysis (I assume that is what is being reflected in the co-ordinate differences above).

If you want to avoid these types of issues you could download sequence/annotation/index bundles (you will need to roll your own indexes if you want to use STAR but at least the sequence/annotation would be consistent) from iGenomes.

In terms of salvaging the analysis, check to see if there are corresponding annotation files available at NCBI where you got the sequence files.

GenoMax 10-07-2015 09:37 AM

1 Attachment(s)
Example PDIA3:
RefSeq co-ordinates are from Hg19/GRCh37.p19
Gencode are from GRCh38.p2

So if your sequence was from GRCh37/Hg19 then get the corresponding annotation file.

graceqy 10-07-2015 11:10 AM

Thanks for the responses.

I used GCA_000001405.15_GRCh38_no_alt_analysis_set.fna to build the genome for STAR. Does it mean gencode is the right gtf to use here?

Is it right that if I want to use RefSeq annotation, I could just download hg19 reference sequences from iGenome?

Also the cufflinks output with refseq or gencode gtf are very different, less than 30K genes with refseq and about 60K genes with gencode. Is there any explanation on it?

GenoMax 10-07-2015 12:22 PM

If you used the GRCh38 fasta then gencode should be the right gtf file to use.

If you want to re-do the alignments then you could go the iGenomes route and save yourself some trouble.

Since you are sampling different areas of the genome with the two GTF files (co-ordinate differences) the cufflinks outputs is different (though 2x is a big change). How are you handling multi-mappers? Perhaps there is a repeat region in one but not the other.

graceqy 10-07-2015 02:07 PM


By 30K vs 60K difference I meant the row numbers in the cufflinks output with the two different gtf files. The row numbers and genes are fixed for each regardless of the input bam files. I checked and found that gencode gtf returns a lot of rows of Y_RNA or 5s_rRNA. Is there a way to only return mRNA annotation with gencode/GRCh38 gtf, pls?

GenoMax 10-08-2015 04:39 AM

You can filter the rows you do not want/need from the GTF file using grep. Look into the -v option.

All times are GMT -8. The time now is 12:09 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.