SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   RNA Sequencing (http://seqanswers.com/forums/forumdisplay.php?f=26)
-   -   Gene id changed to XLOC... cuffmerge issue? (http://seqanswers.com/forums/showthread.php?t=65778)

gtduarte 01-24-2016 07:40 AM

Gene id changed to XLOC... cuffmerge issue?
 
Dear community,

I am trying to run differential expression analysis with TopHat and Cufflinks, and alternative splicing analysis with MISO (for which I need TopHat outputs).

However, I find that the gene_ids are being substituted by "XLOC" values after I run tophat-cufflinks-cuffmerge-cuffdiff. I am using the recent release of A.thaliana (TAIR10.30) and providing all reference annotations (.fa .gtf .gff3) where required.

I found some discussion related to it (e.g. http://seqanswers.com/forums/showthread.php?t=19079), but the solutions proposed just solve plotting issues and depends on R (which I don't have experience with) and in fact don't find the cause of the problem. If for instance the problem is occurring on the initial steps, I will not be able to do further analyses because it is required perfect matching of the ids, that is why I'm trying to discover what is causing the issue.

As far as I understood, XLOC is attributed when a matching id is not found, for instance, for denoting new putative transcripts.

The following error is given when I run cuffmerge, although the analysis run until the end and the "merged.gtf" output can be used for running cuffdiff:

You are using Cufflinks v2.2.1, which is the most recent release.
Command line:
cufflinks -o test/ -F 0.05 -g /home/hugo/MUR/Ensembl/At/A_thaliana.TAIR10.30.gtf -q --overhang-tolerance 200 --library-type=transfrags -A 0.0 --min-frags-per-transfrag 0 --no-5-extend -p 1 test/tmp/mergeSam_filehaqEY0
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).

File test/tmp/mergeSam_filehaqEY0 doesn't appear to be a valid BAM file, trying SAM...
[09:52:24] Loading reference annotation.
[09:52:29] Inspecting reads and determining fragment length distribution.
Processed 30490 loci.
> Map Properties:
> Normalized Map Mass: 83582.00
> Raw Map Mass: 83582.00
> Fragment Length Distribution: Truncated Gaussian (default)
> Default Mean: 200
> Default Std Dev: 80
[09:52:30] Assembling transcripts and estimating abundances.
Processed 30490 loci.


Here is an example of the output merged.gtf file:

1 Cufflinks exon 3631 3913 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "NAC001"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";

Therefore, I checked the inputs for cuffmerge. However I didn't notice any issues:

For tophat accepted_hits.bam:

$ samtools view accepted_hits.bam | head
wtaba1-SRR1909022.2481187 163 1 3645 50 70M = 3996 421 ACCAAACCAGAGAAAACAAATACATAATCGGAGAAATACAGATTACAGAGAGCGAGAGAGATCGACGGCG CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJIJIJJIJJJJJJJJJJIJJJJJHHHHFFFDDDDD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0MD:Z:70 YT:Z:UU NH:i:1


For cufflinks transcripts.gtf:

1 Cufflinks transcript 3631 5899 1000 + . gene_id "gene:AT1G01010"; transcript_id "transcript:AT1G01010.1"; FPKM "5.1108000582"; frac "1.000000"; conf_lo "4.325017"; conf_hi "5.896583"; cov "14.756394";

I also tried creating a new .gtf file from the .gff3 instead of using the one available in Ensembl (by using the command "$ gffread A_thaliana.TAIR10.30.gff3 -T -o test.gtf"), but with no success.

So I believe that the problem is with cuffmerge, but I can't figure out how to solve it. I could for instance use a cross reference function in Excel to exchange values between "gene_id" and "nearest_ref" for each line of the merged.gtf file. However, I believe it could be a problem for larger files or for the case in which the XLOC was attributed to a new transcript that was discovered.

I am using the following commands for the analyses:

Tophat:

$ tophat -N 3 --read-edit-dist 4 --read-realign-edit-dist 0 -a 6 --microexon-search -r 150 --mate-std-dev 200 -i 8 -I 10000 --min-segment-intron 8 --max-segment-intron 10000 --b2-very-sensitive ~/path_to/bowtie_index/A_thaliana.TAIR10.30 myfile1_1_paired.fastq.trim myfile1_2_paired.fastq.trim

Samtools:

$ samtools sort accepted_hits.bam myfile_sorted

Cufflinks:

$ cufflinks -o cuff -G ~/path_to/A_thaliana.TAIR10.30.gff3 -b ~/path_to/A_thaliana.TAIR10.30.fa myfile_sorted.bam

Cuffmerge:

$ cuffmerge -g ~/path_to/ A_thaliana.TAIR10.30.gtf -s ~/path_to/ A_thaliana.TAIR10.30.fa assemblies.txt

Thanks in advance for any help.

danwiththeplan 01-24-2016 07:07 PM

XLOC numbers are assigned to all genes, not just ones that aren't in GTF files you supply. So some genes will end up with a gene_name, but all genes end up with a gene_id (=XLOC number). I think the issue is that the downstream programs are looking for the gene_id field and not the gene_name field.

Incidentally, as I am interpreting it, using the -G switch (as opposed to the -g switch) means that you'll only ever detect/quantitate/analyse genes that are in your GTF file (while using -g means that cufflinks will create a new gene from scratch if there is sufficient read support, even if it's not in the GTF file you supply).

So, with the code you used, all output genes should have both a gene_id (XLOC number) and a gene_name (from the GTF)

gtduarte 01-25-2016 11:32 AM

Hello dan, thanks for replying. I tried what you suggested, running cufflinks with the -g option instead of -G, but unfortunately it didn't work:

$ cufflinks -o cuff_g -g ~/path_to/A_thaliana.TAIR10.30.gtf -b ~/path_to/A_thaliana.TAIR10.30.fa myfile_sorted.bam

Indeed the resulting transcripts.gtf was a bit different from the previous one, for instance:

-> with -G switch:

1 Cufflinks transcript 11649 13714 1000 - . gene_id "gene:AT1G01030"; transcript_id "transcript:AT1G01030.1"; FPKM "0.4354775887"; frac "1.000000"; conf_lo "0.217739"; conf_hi "0.653216"; cov "1.257353";

-> with -g switch:

1 Cufflinks transcript 11649 13714 1000 - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; FPKM "0.4330886861"; frac "1.000000"; conf_lo "0.216544"; conf_hi "0.649633"; cov "1.250649"; full_read_support "yes";

However, those bam_errors continue to appear when I run cuffmerge, as before, just as the XLOC values as my gene ids:

Example of the merged.gtf:

1 Cufflinks exon 3631 3913 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000002"; exon_number "1"; gene_name "NAC001"; oId "transcript:AT1G01010.1"; nearest_ref "transcript:AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";

Nevertheless I run cuffdiff, and there were the XLOCs:

From gene_exp.diff:

test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change) test_stat p_value q_value significant
XLOC_000001 XLOC_000001 NAC001 1:3630-5899 wtmock1 wtaba1 OK 2.82389 5.42847 0.94286 0.938326 0.3329 0.999039 no


Just in case, I checked tophat accepted_hits.bam headers, but apparently it seems fine:

$ samtools view -H wtaba1_sorted.bam

@HD VN:1.0 SO:coordinate
@SQ SN:1 LN:30427671
@SQ SN:2 LN:19698289
@SQ SN:3 LN:23459830
@SQ SN:4 LN:18585056
@SQ SN:5 LN:26975502
@SQ SN:Mt LN:366924
@SQ SN:Pt LN:154478
@PG ID:TopHat VN:2.1.0 CL:/usr/bin/tophat -N 3 --read-edit-dist 4 --read-realign-edit-dist 0 -a 6 --microexon-search -r 150 --mate-std-dev 200 -i 8 -I 10000 --min-segment-intron 8 --max-segment-intron 10000 --b2-very-sensitive /path_to/bowtie_index/A_thaliana.TAIR10.30 myfile1_1_paired.fastq.trim myfile1_2_paired.fastq.trim


Do you have any other clue?

Many thanks again!


All times are GMT -8. The time now is 06:28 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.