Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
Truseq cDNA libraries show sharp peak of cDNA BUFtoBOS Sample Prep / Library Generation 2 09-23-2014 08:00 AM
Pseudo-molecule Shlomo Blum Bioinformatics 2 10-30-2013 06:29 AM
VCF to pseudo BAM dagarfield Bioinformatics 3 10-22-2012 08:45 AM
FASTA to pseudo-FASTQ file? terencetate Bioinformatics 5 11-25-2010 02:00 PM
Question and Confuse about Complete Genomics Patrick The Pipeline 5 10-13-2009 07:19 AM

Thread Tools
Old 11-29-2016, 02:40 AM   #1
Location: London

Join Date: Jan 2013
Posts: 25
Post Haplotypes in Ensembl cDNA confuse pseudo-alignment

Dear all,

Before I dive into details, my question is whether it is a sound idea to envisage:
  • an Ensembl "cdna" Fasta file that, similarly to the Ensembl FTP files "...dna...primary_assembly.fa", includes only transcripts excluding haplotypes and patches.
  • a method to collapse expression levels of genes present in multiple haplotypes in the currently available Ensembl FTP full cDNA Fasta files "...cdna.all.fa" into a single gene expression level

I describe here my situation using the Kallisto pseudo-aligner and Ensembl Homo_sapiens.GRCh38.cdna.all.fa, but this may well apply for all the tools out there using transcriptome fasta files.

After using the above cDNA Fasta file with Kallisto, I have realised that it includes transcripts encoded in haplotypes (e.g. CHR_HSCHR8_4_CTG7), and that those transcripts have distinct gene identifiers.

The presence of haplotypes clearly seems to confuse Kallisto for genes that have transcripts encoded in those haplotype regions. I illustrate my point with the TNF gene, that has one transcript on chr 6, while seven haplotypes exist that contain the TNF gene sequence:

1 TNF ENSG00000204490 CHR_HSCHR6_MHC_COX_CTG1 31562973 31565742
2 TNF ENSG00000206439 CHR_HSCHR6_MHC_QBL_CTG1 31565793 31568564
3 TNF ENSG00000223952 CHR_HSCHR6_MHC_MCF_CTG1 31651872 31654641
4 TNF ENSG00000228321 CHR_HSCHR6_MHC_MANN_CTG1 31615015 31617784
5 TNF ENSG00000228849 CHR_HSCHR6_MHC_DBB_CTG1 31557707 31560476
6 TNF ENSG00000228978 CHR_HSCHR6_MHC_APD_CTG1 31643520 31645322
7 TNF ENSG00000230108 CHR_HSCHR6_MHC_SSTO_CTG1 31566312 31569081
8 TNF ENSG00000232810 6 31575567 31578336

Now, I show below Kallisto-estimated counts for 4 single cells (a subset for example here) from a single human donor (max. two haplotypes expected) using Kallisto.
It reveals identical gene expression levels for seven of those gene identifiers (which therefore most likely share identical pseudo-mappable sequence). Notably, when the remaining haplotype is detected, lower expression is estimated in the other haplotypes (I just checked and validated this observation in many other samples not shown here).

201202 201204 201206 201207
TNF_ENSG00000204490 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000206439 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000223952 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000228321 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000228849 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000228978 0.00000 0.000 69.053237 0.00000
TNF_ENSG00000230108 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000232810 93.46909 2305.451 2.403247 91.27554

Now, my main worry is that the presence of TNF in multiple haplotypes forces Kallisto to assign 1/7th of the "true" expression level to each of the gene identifiers, minus the expression level assigned to the last haplotype.
  • First there is the question of underestimating the expression level of the "true" gene, considering that the TPM for TNF are split into seven haplotypes.
  • Then there is the issue of testing multiple times the same hypothesis, considering that seven of the haplotypes seem to carry the same gene information here.
  • I also wonder about gene-set enrichment analyses, whether all those haplotypes are annotated to Gene Ontology categories for example, and how they may artificially enrich the categories by giving TNF an implicit weight of 7.

Out of curiosity, I have plotted the colSum of the log10-transformed Kallisto-estimated counts above for all samples in the data set. In other words, I manually collapsed all haplotypes counts for TNF in a single value. There are two infection conditions (D and L), and one uninfected (Mock) condition. I also highlighted the samples were the "lone haplotype" was detected:

And to come back to my initial question:
  • Should a cdna Fasta file be made available, that contains only transcripts present in the "primary assembly" ? In this case, I I suppose that "true" expression would be assigned to the "main" gene identifier, at the expense of losing haplotype-specific expression, as illustrated by the lone TNF haplotype above.
  • Given that after isoform quantitation is usually collapsed to gene-level after quantitation, should another level of collapsing consider haplotypes? If so, how? Although they share gene symbol, they have distinct gene identifiers. However, one cannot simply use the gene symbol as gene feature identifiers, as some gene symbols are annotated to true paralogs.

I am aware of the gffread command available here that could be used to generate a subsetted cDNA Fasta file based on a GFF file, which itself could be restricted to exclude haplotypes and patches. But if we'd be going that way, rather than leaving the filtering open to interpretation by users, I thought that having a reference file provided by Ensembl would be preferable.

Thanks in advance for feedback.


Last edited by kevinrue; 11-29-2016 at 02:49 AM. Reason: typo
kevinrue is offline   Reply With Quote
Old 03-25-2017, 03:56 AM   #2
Location: London

Join Date: Jan 2013
Posts: 25
Default Warning now present in Ensembl release 87 - Dec 2016 © EMBL-EBI

In :

All genome assemblies in Ensembl are haploid and for most species there is only a single path through the genome. Currently, human and mouse are the only genome assemblies where there is more than one path through the genome. The human genome assembly (GRCh38) and the mouse genome assembly (GRCm38) are maintained by the Genome Reference Consortium (GRC). In addition to the primary assembly, the GRCh38 major assembly release included 261 alternate loci including 8 haplotypes on the MHC region of chromosome 6.
kevinrue is offline   Reply With Quote

gene expression, haplotypes, kallisto, single cell analysis, tpm

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

All times are GMT -8. The time now is 09:23 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO