
**GenoMax** · 10-07-2015, 08:25 AM

As you have discovered the hard way, it is extremely important to use a consistent genome build/patch level throughout your analysis (I assume that is what the co-ordinate differences above reflect).

If you want to avoid these types of issues you could download sequence/annotation/index bundles from iGenomes (you will need to roll your own indexes if you want to use STAR, but at least the sequence and annotation would be consistent).

In terms of salvaging the analysis, check to see if there are corresponding annotation files available at NCBI where you got the sequence files.

**GenoMax** · 10-07-2015, 08:37 AM

Example PDIA3:
RefSeq co-ordinates are from Hg19/GRCh37.p19
Gencode are from GRCh38.p2

So if your sequence was from GRCh37/Hg19 then get the corresponding annotation file.

Attached Files

NCBI1.PNG (16.0 KB, 47 views)

**graceqy** · 10-07-2015, 10:10 AM

Thanks for the responses.

I used GCA_000001405.15_GRCh38_no_alt_analysis_set.fna to build the genome for STAR. Does it mean gencode is the right gtf to use here?

Is it right that if I want to use the RefSeq annotation, I could just download the hg19 reference sequences from iGenomes?

Also, the cufflinks outputs with the refseq and gencode gtf files are very different: fewer than 30K genes with refseq and about 60K genes with gencode. Is there any explanation for this?

**GenoMax** · 10-07-2015, 11:22 AM

If you used the GRCh38 fasta then gencode should be the right gtf file to use.

If you want to re-do the alignments then you could go the iGenomes route and save yourself some trouble.

Since you are sampling different areas of the genome with the two GTF files (co-ordinate differences), the cufflinks outputs are different (though 2x is a big change). How are you handling multi-mappers? Perhaps there is a repeat region in one but not the other.

**graceqy** · 10-07-2015, 01:07 PM

Thanks.

By 30K vs 60K difference I meant the row numbers in the cufflinks output with the two different gtf files. The row numbers and genes are fixed for each regardless of the input bam files. I checked and found that gencode gtf returns a lot of rows of Y_RNA or 5s_rRNA. Is there a way to only return mRNA annotation with gencode/GRCh38 gtf, pls?
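One way to see where the extra rows come from is to tally the `gene` records in the GTF by biotype. The following is only a rough sketch: it assumes the annotation stores the biotype in a `gene_type` attribute (as GENCODE GTFs do) or a `gene_biotype` attribute, and the function name `count_gene_types` is made up for illustration.

```python
import re
from collections import Counter

# Assumption: the biotype is stored as gene_type "..." (GENCODE style)
# or gene_biotype "..." in the 9th (attributes) column.
GENE_TYPE_RE = re.compile(r'gene_(?:bio)?type "([^"]+)"')


def count_gene_types(gtf_lines):
    """Count 'gene' features per biotype in an iterable of GTF lines."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated columns; column 3 is the feature.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = GENE_TYPE_RE.search(fields[8])
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Usage (the file name is a placeholder):
# with open("annotation.gtf") as handle:
#     for gene_type, n in count_gene_types(handle).most_common():
#         print(n, gene_type, sep="\t")
```

If most of the difference sits in small non-coding RNA biotypes, that would account for the row counts without any change in the alignments.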

**GenoMax** · 10-08-2015, 03:39 AM

You can filter the rows you do not want/need from the GTF file using grep. Look into the -v option.
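Where `grep -v` drops the lines that match a pattern, the same kind of filtering can be sketched in Python by keeping only records of the biotypes you want. This is a minimal sketch, not a tested pipeline step: it assumes a GENCODE-style GTF in which every record carries a `gene_type "..."` attribute, and the helper name `keep_gene_types` and the file names in the usage comment are hypothetical.

```python
import re

# Assumption: GENCODE-style attribute, e.g. gene_type "protein_coding";
GENE_TYPE_RE = re.compile(r'gene_type "([^"]+)"')


def keep_gene_types(gtf_lines, wanted=("protein_coding",)):
    """Yield header lines plus records whose gene_type is in `wanted`."""
    for line in gtf_lines:
        if line.startswith("#"):  # pass header/comment lines through
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:  # not a well-formed GTF record, drop it
            continue
        match = GENE_TYPE_RE.search(fields[8])
        if match and match.group(1) in wanted:
            yield line


# Usage (file names are placeholders):
# with open("gencode.gtf") as src, open("protein_coding.gtf", "w") as dst:
#     dst.writelines(keep_gene_types(src))
```

Note that filtering on `gene_type` keeps every transcript of a protein-coding gene, including transcripts whose own `transcript_type` is something else. Filter on `transcript_type` instead if only coding transcripts are wanted.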

Annotation difference between refSeq and Gencode
