Haplotypes in Ensembl cDNA confuse pseudo-alignment

kevinrue

Member

Join Date: Jan 2013

Posts: 25
- Share
- Tweet
#1

Haplotypes in Ensembl cDNA confuse pseudo-alignment

11-29-2016, 03:40 AM

Dear all,

Before I dive into details, my question is whether it is a sound idea to envisage:
an Ensembl "cdna" Fasta file that, similarly to the Ensembl FTP files "...dna...primary_assembly.fa", includes only transcripts excluding haplotypes and patches.

a method to collapse expression levels of genes present in multiple haplotypes in the currently available Ensembl FTP full cDNA Fasta files "...cdna.all.fa" into a single gene expression level

I describe here my situation using the Kallisto pseudo-aligner and Ensembl Homo_sapiens.GRCh38.cdna.all.fa, but this may well apply for all the tools out there using transcriptome fasta files.

After using the above cDNA Fasta file with Kallisto, I have realised that it includes transcripts encoded in haplotypes (e.g. CHR_HSCHR8_4_CTG7), and that those transcripts have distinct gene identifiers.

The presence of haplotypes clearly seems to confuse Kallisto for genes that have transcripts encoded in those haplotype regions. I illustrate my point with the TNF gene, that has one transcript on chr 6, while seven haplotypes exist that contain the TNF gene sequence:

GENENAME GENEID SEQNAME GENESEQSTART GENESEQEND
1 TNF ENSG00000204490 CHR_HSCHR6_MHC_COX_CTG1 31562973 31565742
2 TNF ENSG00000206439 CHR_HSCHR6_MHC_QBL_CTG1 31565793 31568564
3 TNF ENSG00000223952 CHR_HSCHR6_MHC_MCF_CTG1 31651872 31654641
4 TNF ENSG00000228321 CHR_HSCHR6_MHC_MANN_CTG1 31615015 31617784
5 TNF ENSG00000228849 CHR_HSCHR6_MHC_DBB_CTG1 31557707 31560476
6 TNF ENSG00000228978 CHR_HSCHR6_MHC_APD_CTG1 31643520 31645322
7 TNF ENSG00000230108 CHR_HSCHR6_MHC_SSTO_CTG1 31566312 31569081
8 TNF ENSG00000232810 6 31575567 31578336

Now, I show below Kallisto-estimated counts for 4 single cells (a subset for example here) from a single human donor (max. two haplotypes expected) using Kallisto.
It reveals identical gene expression levels for seven of those gene identifiers (which therefore most likely share identical pseudo-mappable sequence). Notably, when the remaining haplotype is detected, lower expression is estimated in the other haplotypes (I just checked and validated this observation in many other samples not shown here).

201202 201204 201206 201207
TNF_ENSG00000204490 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000206439 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000223952 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000228321 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000228849 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000228978 0.00000 0.000 69.053237 0.00000
TNF_ENSG00000230108 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000232810 93.46909 2305.451 2.403247 91.27554

Now, my main worry is that the presence of TNF in multiple haplotypes forces Kallisto to assign 1/7th of the "true" expression level to each of the gene identifiers, minus the expression level assigned to the last haplotype.
First there is the question of underestimating the expression level of the "true" gene, considering that the TPM for TNF are split into seven haplotypes.

Then there is the issue of testing multiple times the same hypothesis, considering that seven of the haplotypes seem to carry the same gene information here.

I also wonder about gene-set enrichment analyses, whether all those haplotypes are annotated to Gene Ontology categories for example, and how they may artificially enrich the categories by giving TNF an implicit weight of 7.

Out of curiosity, I have plotted the colSum of the log10-transformed Kallisto-estimated counts above for all samples in the data set. In other words, I manually collapsed all haplotypes counts for TNF in a single value. There are two infection conditions (D and L), and one uninfected (Mock) condition. I also highlighted the samples were the "lone haplotype" was detected:
.

And to come back to my initial question:
Should a cdna Fasta file be made available, that contains only transcripts present in the "primary assembly" ? In this case, I I suppose that "true" expression would be assigned to the "main" gene identifier, at the expense of losing haplotype-specific expression, as illustrated by the lone TNF haplotype above.

Given that after isoform quantitation is usually collapsed to gene-level after quantitation, should another level of collapsing consider haplotypes? If so, how? Although they share gene symbol, they have distinct gene identifiers. However, one cannot simply use the gene symbol as gene feature identifiers, as some gene symbols are annotated to true paralogs.

I am aware of the gffread command available here that could be used to generate a subsetted cDNA Fasta file based on a GFF file, which itself could be restricted to exclude haplotypes and patches. But if we'd be going that way, rather than leaving the filtering open to interpretation by users, I thought that having a reference file provided by Ensembl would be preferable.

Thanks in advance for feedback.

Kévin

Last edited by kevinrue; 11-29-2016, 03:49 AM. Reason: typo
Tags: gene expression, haplotypes, kallisto, single cell analysis, tpm
kevinrue

Member

Join Date: Jan 2013

Posts: 25
- Share
- Tweet
#2

03-25-2017, 03:56 AM

Warning now present in Ensembl release 87 - Dec 2016 © EMBL-EBI

In http://www.ensembl.org/info/genome/g.../assembly.html :

All genome assemblies in Ensembl are haploid and for most species there is only a single path through the genome. Currently, human and mouse are the only genome assemblies where there is more than one path through the genome. The human genome assembly (GRCh38) and the mouse genome assembly (GRCm38) are maintained by the Genome Reference Consortium (GRC). In addition to the primary assembly, the GRCh38 major assembly release included 261 alternate loci including 8 haplotypes on the MHC region of chromosome 6.
Comment

Previous template Next

Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM
Strategies for Sequencing Challenging Samples

by seqadmin

Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
- Channel: Articles
03-22-2024, 06:39 AM

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Haplotypes in Ensembl cDNA confuse pseudo-alignment

Comment

Latest Articles

ad_right_rmr

News