Tuxedo Pipeline Issue - Multiple Gene Hits per Transcript, High FPKM

chimb

Junior Member

Join Date: Aug 2014

Posts: 2
#1


08-06-2014, 06:51 PM

Hello all!

- I've been analyzing some RNA-seq data using the Tuxedo pipeline and have been getting some peculiar results, which are especially noticeable in the tables of significant genes (and their differential expression data) I've attached.

- Some biological background: the experiment is looking at bacteria-bacteria interaction effects between Streptococcus sanguinis (Ss) and Porphyromonas gingivalis (Pg). There are numerous conditions and comparisons that were made using Cuffdiff, but the data I've attached is based on the comparison between the conditions:

Wild-type Ss (SK36) grown in isolation (sample_1)
--vs--
Wild-type Ss cultured with wild-type Pg (sample_2)

In this case, the cuffdiff run utilizes the Ss read alignments and uses the merged transcriptome of Ss across both conditions.

- For Sk--Sk_Pg_sig_genes.txt, I ran the data through the whole Tuxedo pipeline using Trapnell et al.'s protocol from Nature Protocols: Tophat, Cufflinks, Cuffmerge, Cuffdiff, cummeRbund, all with default commands/options. In cummeRbund, I used the getSig(), getGenes(), diffData() and featureNames() functions to build a table of the significantly differentially expressed genes (alpha=0.05), their differential expression data and their short names. Two peculiar things:

- Some transcripts report hits to multiple genes each (many gene_short_name values per transcript).

- FPKM values (value_1 and value_2) are extremely high for some transcripts, about 3089410 for one of them, which can't be possible.
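A quick way to list the rows showing either symptom is sketched below. It assumes a tab-delimited table with columns named gene_id, gene_short_name, value_1 and value_2, and comma-separated short names; the attached files may be laid out differently. The sample rows, the gene names and the 1e5 cutoff are made up for illustration (only the 3089410 figure comes from the post).

```python
import csv
import io

# Hypothetical excerpt of a merged table; column names and values are
# illustrative assumptions, not taken from the attached files.
SAMPLE_TABLE = """gene_id\tgene_short_name\tvalue_1\tvalue_2
XLOC_000001\tgeneA\t120.5\t340.2
XLOC_000002\tgeneB,geneC,geneD\t95.0\t88.1
XLOC_000003\tgeneE\t3089410\t2900000
"""

FPKM_THRESHOLD = 1e5  # arbitrary cutoff chosen for illustration


def flag_rows(handle, threshold=FPKM_THRESHOLD):
    """Return (multi_gene_ids, high_fpkm_ids) for a tab-delimited table."""
    multi_gene, high_fpkm = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Count the non-empty, comma-separated short names on this row.
        names = [n for n in row["gene_short_name"].split(",") if n.strip()]
        if len(names) > 1:
            multi_gene.append(row["gene_id"])
        # Flag the row if either condition's FPKM exceeds the cutoff.
        if max(float(row["value_1"]), float(row["value_2"])) > threshold:
            high_fpkm.append(row["gene_id"])
    return multi_gene, high_fpkm


if __name__ == "__main__":
    multi, high = flag_rows(io.StringIO(SAMPLE_TABLE))
    print("multiple gene names:", multi)
    print("FPKM above threshold:", high)
```

To run it on a real file, pass an open file handle instead of the StringIO object and adjust the column names to match the table.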

- My PI and I suspected that tophat may be finding splice junctions that do not exist (I did not include "--no-novel-juncs" in my initial tophat runs). This would link together disparate stretches of DNA as a single transcript and garner multiple gene hits. Alternatively, many genes may overlap across the same stretches of DNA in different reading frames (though I'd imagine cufflinks would account for that?).

- I tried running the whole pipeline again, but skipped the tophat step (which includes read fragmentation and splice junction discovery). I ran bowtie2 alone for the bare-read alignments, converted the output SAM to BAM, sorted it and fed it through cufflinks and the rest of the pipeline as normal. The result is (using the same extraction methods in cummeRbund): sig_genes_Sk-Sk_Pg_bt2.txt
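For anyone wanting to reproduce the bowtie2-only route, the steps can be sketched roughly as follows. The script only builds and prints the command lines; it does not run them. It assumes single-end reads and samtools 1.x syntax (the 0.1.x releases current in 2014 took an output prefix for sort instead of -o), and every file, index and directory name is a hypothetical placeholder.

```python
import shlex


def bowtie2_cufflinks_commands(index, reads, prefix, outdir):
    """Build the align -> BAM -> sort -> assemble command lines.

    All arguments are placeholder names; nothing is executed here.
    """
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        # Unspliced alignment of single-end reads with bowtie2.
        ["bowtie2", "-x", index, "-U", reads, "-S", sam],
        # Convert SAM to BAM.
        ["samtools", "view", "-b", "-o", bam, sam],
        # Coordinate-sort the BAM, which cufflinks expects.
        ["samtools", "sort", "-o", sorted_bam, bam],
        # Assemble transcripts from the sorted alignments.
        ["cufflinks", "-o", outdir, sorted_bam],
    ]


if __name__ == "__main__":
    commands = bowtie2_cufflinks_commands(
        "ss_sk36_index", "sk36_alone.fastq", "sk36_alone", "cufflinks_sk36_alone"
    )
    for cmd in commands:
        print(shlex.join(cmd))
```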

~ I'm still getting multiple gene hits per transcript, and still getting extremely high FPKM values.
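As a sanity check on the FPKM magnitudes, the textbook definition (fragments per kilobase of transcript per million mapped fragments) can be worked through by hand. This is only the plain formula; Cufflinks applies its own effective-length and normalization adjustments, so its numbers will differ. The inputs below are invented to show the arithmetic: a short feature that soaks up a large share of the library lands in the millions. Whether that is what is happening in this dataset cannot be told from the tables alone.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase per million mapped fragments."""
    if length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


if __name__ == "__main__":
    # Made-up numbers: a 100 bp feature receiving 300,000 of 1,000,000 fragments.
    print(fpkm(300_000, 100, 1_000_000))
    # Made-up numbers: a 1,500 bp gene receiving 300 of 1,000,000 fragments.
    print(fpkm(300, 1_500, 1_000_000))
```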

**************************************************

- Have any of you experienced the same sort of problems? What might be causing this? Any suggestions for alternate methods for alignment, transcript construction or visualization? ... I realize the Tuxedo pipeline was designed with eukaryotic systems in mind so I'm not sure if it is, in whole or in part, unsuitable for prokaryotes.

Any input would be greatly appreciated!

Thanks!
Tags: cufflinks, cummerbund, rna-seq advice, tophat, tuxedo
mikep

Member

Join Date: Feb 2011

Posts: 45
#2

08-07-2014, 12:04 AM

Did you try tophat with --no-novel-juncs, and telling cufflinks to look just for known transcripts (-G)? Is alternative splicing something you actually need to worry about? Maybe htseq-count and edgeR/DESeq2 is a better route to follow.
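The annotation-guided options mentioned above might look roughly like this. Note that --no-novel-juncs is a TopHat option (used together with -G or -j), while in Cufflinks -G restricts quantification to the supplied annotation. The script only prints the command lines. All file and directory names are hypothetical, single-end reads are assumed, and for a bacterial GFF htseq-count would probably also need -t and -i set to match the annotation's feature type and ID attribute.

```python
import shlex


def reference_only_commands(index, reads, annotation, outdir):
    """Build command lines that stick to known annotation (nothing is run)."""
    hits = f"{outdir}/accepted_hits.bam"  # TopHat's default output BAM name
    return {
        # Align using only junctions present in the annotation.
        "tophat": ["tophat", "-G", annotation, "--no-novel-juncs",
                   "-o", outdir, index, reads],
        # Quantify known transcripts only, without assembling new ones.
        "cufflinks": ["cufflinks", "-G", annotation,
                      "-o", f"{outdir}/cufflinks", hits],
        # Count-based alternative; -r pos because the BAM is coordinate-sorted.
        "htseq-count": ["htseq-count", "-f", "bam", "-r", "pos",
                        hits, annotation],
    }


if __name__ == "__main__":
    cmds = reference_only_commands(
        "ss_sk36_index", "sk36_alone.fastq", "ss_sk36.gtf", "tophat_sk36_alone"
    )
    for name, cmd in cmds.items():
        print(f"{name}: {shlex.join(cmd)}")
```

htseq-count writes its counts to standard output, so in practice the last command would be redirected to a file before loading the counts into edgeR or DESeq2.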