SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
XLOC identifiers from cufflinks/cuffmerge/cuffdiff AdamB RNA Sequencing 4 01-24-2016 08:12 PM
XLOC gene id JMo RNA Sequencing 22 04-02-2015 11:35 AM
XLOC as gene ID? JQL Bioinformatics 1 03-15-2015 11:18 PM
cuffmerge output issue Bioinformaticsnewb Bioinformatics 0 03-01-2013 08:02 AM
Cuffmerge Issue SHeaph Bioinformatics 0 01-10-2013 06:05 AM

Reply
 
Thread Tools
Old 01-24-2016, 08:40 AM   #1
gtduarte
Junior Member
 
Location: Brazil

Join Date: Jan 2016
Posts: 5
Default Gene id changed to XLOC... cuffmerge issue?

Dear community,

I am trying to run differential expression analysis with TopHat and Cufflinks, and alternative splicing analysis with MISO (for which I need TopHat outputs).

However, I find that the gene_ids are being substituted by "XLOC" values after I run tophat-cufflinks-cuffmerge-cuffdiff. I am using the recent release of A.thaliana (TAIR10.30) and providing all reference annotations (.fa .gtf .gff3) where required.

I found some discussion related to it (e.g. http://seqanswers.com/forums/showthread.php?t=19079), but the solutions proposed just solve plotting issues and depends on R (which I don't have experience with) and in fact don't find the cause of the problem. If for instance the problem is occurring on the initial steps, I will not be able to do further analyses because it is required perfect matching of the ids, that is why I'm trying to discover what is causing the issue.

As far as I understood, XLOC is attributed when a matching id is not found, for instance, for denoting new putative transcripts.

The following error is given when I run cuffmerge, although the analysis run until the end and the "merged.gtf" output can be used for running cuffdiff:

You are using Cufflinks v2.2.1, which is the most recent release.
Command line:
cufflinks -o test/ -F 0.05 -g /home/hugo/MUR/Ensembl/At/A_thaliana.TAIR10.30.gtf -q --overhang-tolerance 200 --library-type=transfrags -A 0.0 --min-frags-per-transfrag 0 --no-5-extend -p 1 test/tmp/mergeSam_filehaqEY0
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).

File test/tmp/mergeSam_filehaqEY0 doesn't appear to be a valid BAM file, trying SAM...
[09:52:24] Loading reference annotation.
[09:52:29] Inspecting reads and determining fragment length distribution.
Processed 30490 loci.
> Map Properties:
> Normalized Map Mass: 83582.00
> Raw Map Mass: 83582.00
> Fragment Length Distribution: Truncated Gaussian (default)
> Default Mean: 200
> Default Std Dev: 80
[09:52:30] Assembling transcripts and estimating abundances.
Processed 30490 loci.


Here is an example of the output merged.gtf file:

1 Cufflinks exon 3631 3913 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "NAC001"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";

Therefore, I checked the inputs for cuffmerge. However I didn't notice any issues:

For tophat accepted_hits.bam:

$ samtools view accepted_hits.bam | head
wtaba1-SRR1909022.2481187 163 1 3645 50 70M = 3996 421 ACCAAACCAGAGAAAACAAATACATAATCGGAGAAATACAGATTACAGAGAGCGAGAGAGATCGACGGCG CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJIJIJJIJJJJJJJJJJIJJJJJHHHHFFFDDDDD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0MD:Z:70 YT:Z:UU NH:i:1


For cufflinks transcripts.gtf:

1 Cufflinks transcript 3631 5899 1000 + . gene_id "gene:AT1G01010"; transcript_id "transcript:AT1G01010.1"; FPKM "5.1108000582"; frac "1.000000"; conf_lo "4.325017"; conf_hi "5.896583"; cov "14.756394";

I also tried creating a new .gtf file from the .gff3 instead of using the one available in Ensembl (by using the command "$ gffread A_thaliana.TAIR10.30.gff3 -T -o test.gtf"), but with no success.

So I believe that the problem is with cuffmerge, but I can't figure out how to solve it. I could for instance use a cross reference function in Excel to exchange values between "gene_id" and "nearest_ref" for each line of the merged.gtf file. However, I believe it could be a problem for larger files or for the case in which the XLOC was attributed to a new transcript that was discovered.

I am using the following commands for the analyses:

Tophat:

$ tophat -N 3 --read-edit-dist 4 --read-realign-edit-dist 0 -a 6 --microexon-search -r 150 --mate-std-dev 200 -i 8 -I 10000 --min-segment-intron 8 --max-segment-intron 10000 --b2-very-sensitive ~/path_to/bowtie_index/A_thaliana.TAIR10.30 myfile1_1_paired.fastq.trim myfile1_2_paired.fastq.trim

Samtools:

$ samtools sort accepted_hits.bam myfile_sorted

Cufflinks:

$ cufflinks -o cuff -G ~/path_to/A_thaliana.TAIR10.30.gff3 -b ~/path_to/A_thaliana.TAIR10.30.fa myfile_sorted.bam

Cuffmerge:

$ cuffmerge -g ~/path_to/ A_thaliana.TAIR10.30.gtf -s ~/path_to/ A_thaliana.TAIR10.30.fa assemblies.txt

Thanks in advance for any help.
gtduarte is offline   Reply With Quote
Old 01-24-2016, 08:07 PM   #2
danwiththeplan
Member
 
Location: Auckland

Join Date: Sep 2011
Posts: 72
Default

XLOC numbers are assigned to all genes, not just ones that aren't in GTF files you supply. So some genes will end up with a gene_name, but all genes end up with a gene_id (=XLOC number). I think the issue is that the downstream programs are looking for the gene_id field and not the gene_name field.

Incidentally, as I am interpreting it, using the -G switch (as opposed to the -g switch) means that you'll only ever detect/quantitate/analyse genes that are in your GTF file (while using -g means that cufflinks will create a new gene from scratch if there is sufficient read support, even if it's not in the GTF file you supply).

So, with the code you used, all output genes should have both a gene_id (XLOC number) and a gene_name (from the GTF)
danwiththeplan is offline   Reply With Quote
Old 01-25-2016, 12:32 PM   #3
gtduarte
Junior Member
 
Location: Brazil

Join Date: Jan 2016
Posts: 5
Default

Hello dan, thanks for replying. I tried what you suggested, running cufflinks with the -g option instead of -G, but unfortunately it didn't work:

$ cufflinks -o cuff_g -g ~/path_to/A_thaliana.TAIR10.30.gtf -b ~/path_to/A_thaliana.TAIR10.30.fa myfile_sorted.bam

Indeed the resulting transcripts.gtf was a bit different from the previous one, for instance:

-> with -G switch:

1 Cufflinks transcript 11649 13714 1000 - . gene_id "gene:AT1G01030"; transcript_id "transcript:AT1G01030.1"; FPKM "0.4354775887"; frac "1.000000"; conf_lo "0.217739"; conf_hi "0.653216"; cov "1.257353";

-> with -g switch:

1 Cufflinks transcript 11649 13714 1000 - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; FPKM "0.4330886861"; frac "1.000000"; conf_lo "0.216544"; conf_hi "0.649633"; cov "1.250649"; full_read_support "yes";

However, those bam_errors continue to appear when I run cuffmerge, as before, just as the XLOC values as my gene ids:

Example of the merged.gtf:

1 Cufflinks exon 3631 3913 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000002"; exon_number "1"; gene_name "NAC001"; oId "transcript:AT1G01010.1"; nearest_ref "transcript:AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";

Nevertheless I run cuffdiff, and there were the XLOCs:

From gene_exp.diff:

test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change) test_stat p_value q_value significant
XLOC_000001 XLOC_000001 NAC001 1:3630-5899 wtmock1 wtaba1 OK 2.82389 5.42847 0.94286 0.938326 0.3329 0.999039 no


Just in case, I checked tophat accepted_hits.bam headers, but apparently it seems fine:

$ samtools view -H wtaba1_sorted.bam

@HD VN:1.0 SO:coordinate
@SQ SN:1 LN:30427671
@SQ SN:2 LN:19698289
@SQ SN:3 LN:23459830
@SQ SN:4 LN:18585056
@SQ SN:5 LN:26975502
@SQ SN:Mt LN:366924
@SQ SN:Pt LN:154478
@PG ID:TopHat VN:2.1.0 CL:/usr/bin/tophat -N 3 --read-edit-dist 4 --read-realign-edit-dist 0 -a 6 --microexon-search -r 150 --mate-std-dev 200 -i 8 -I 10000 --min-segment-intron 8 --max-segment-intron 10000 --b2-very-sensitive /path_to/bowtie_index/A_thaliana.TAIR10.30 myfile1_1_paired.fastq.trim myfile1_2_paired.fastq.trim


Do you have any other clue?

Many thanks again!
gtduarte is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:08 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO