Dear community,
I am trying to run differential expression analysis with TopHat and Cufflinks, and alternative splicing analysis with MISO (for which I need TopHat outputs).
However, I find that the gene_ids are being substituted by "XLOC" values after I run tophat-cufflinks-cuffmerge-cuffdiff. I am using the recent release of A.thaliana (TAIR10.30) and providing all reference annotations (.fa .gtf .gff3) where required.
I found some discussion related to it (e.g. http://seqanswers.com/forums/showthread.php?t=19079), but the solutions proposed just solve plotting issues and depends on R (which I don't have experience with) and in fact don't find the cause of the problem. If for instance the problem is occurring on the initial steps, I will not be able to do further analyses because it is required perfect matching of the ids, that is why I'm trying to discover what is causing the issue.
As far as I understood, XLOC is attributed when a matching id is not found, for instance, for denoting new putative transcripts.
The following error is given when I run cuffmerge, although the analysis run until the end and the "merged.gtf" output can be used for running cuffdiff:
You are using Cufflinks v2.2.1, which is the most recent release.
Command line:
cufflinks -o test/ -F 0.05 -g /home/hugo/MUR/Ensembl/At/A_thaliana.TAIR10.30.gtf -q --overhang-tolerance 200 --library-type=transfrags -A 0.0 --min-frags-per-transfrag 0 --no-5-extend -p 1 test/tmp/mergeSam_filehaqEY0
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
File test/tmp/mergeSam_filehaqEY0 doesn't appear to be a valid BAM file, trying SAM...
[09:52:24] Loading reference annotation.
[09:52:29] Inspecting reads and determining fragment length distribution.
Processed 30490 loci.
> Map Properties:
> Normalized Map Mass: 83582.00
> Raw Map Mass: 83582.00
> Fragment Length Distribution: Truncated Gaussian (default)
> Default Mean: 200
> Default Std Dev: 80
[09:52:30] Assembling transcripts and estimating abundances.
Processed 30490 loci.
Here is an example of the output merged.gtf file:
1 Cufflinks exon 3631 3913 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "NAC001"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";
Therefore, I checked the inputs for cuffmerge. However I didn't notice any issues:
For tophat accepted_hits.bam:
$ samtools view accepted_hits.bam | head
wtaba1-SRR1909022.2481187 163 1 3645 50 70M = 3996 421 ACCAAACCAGAGAAAACAAATACATAATCGGAGAAATACAGATTACAGAGAGCGAGAGAGATCGACGGCG CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJIJIJJIJJJJJJJJJJIJJJJJHHHHFFFDDDDD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0MD:Z:70 YT:Z:UU NH:i:1
For cufflinks transcripts.gtf:
1 Cufflinks transcript 3631 5899 1000 + . gene_id "gene:AT1G01010"; transcript_id "transcript:AT1G01010.1"; FPKM "5.1108000582"; frac "1.000000"; conf_lo "4.325017"; conf_hi "5.896583"; cov "14.756394";
I also tried creating a new .gtf file from the .gff3 instead of using the one available in Ensembl (by using the command "$ gffread A_thaliana.TAIR10.30.gff3 -T -o test.gtf"), but with no success.
So I believe that the problem is with cuffmerge, but I can't figure out how to solve it. I could for instance use a cross reference function in Excel to exchange values between "gene_id" and "nearest_ref" for each line of the merged.gtf file. However, I believe it could be a problem for larger files or for the case in which the XLOC was attributed to a new transcript that was discovered.
I am using the following commands for the analyses:
Tophat:
$ tophat -N 3 --read-edit-dist 4 --read-realign-edit-dist 0 -a 6 --microexon-search -r 150 --mate-std-dev 200 -i 8 -I 10000 --min-segment-intron 8 --max-segment-intron 10000 --b2-very-sensitive ~/path_to/bowtie_index/A_thaliana.TAIR10.30 myfile1_1_paired.fastq.trim myfile1_2_paired.fastq.trim
Samtools:
$ samtools sort accepted_hits.bam myfile_sorted
Cufflinks:
$ cufflinks -o cuff -G ~/path_to/A_thaliana.TAIR10.30.gff3 -b ~/path_to/A_thaliana.TAIR10.30.fa myfile_sorted.bam
Cuffmerge:
$ cuffmerge -g ~/path_to/ A_thaliana.TAIR10.30.gtf -s ~/path_to/ A_thaliana.TAIR10.30.fa assemblies.txt
Thanks in advance for any help.
I am trying to run differential expression analysis with TopHat and Cufflinks, and alternative splicing analysis with MISO (for which I need TopHat outputs).
However, I find that the gene_ids are being substituted by "XLOC" values after I run tophat-cufflinks-cuffmerge-cuffdiff. I am using the recent release of A.thaliana (TAIR10.30) and providing all reference annotations (.fa .gtf .gff3) where required.
I found some discussion related to it (e.g. http://seqanswers.com/forums/showthread.php?t=19079), but the solutions proposed just solve plotting issues and depends on R (which I don't have experience with) and in fact don't find the cause of the problem. If for instance the problem is occurring on the initial steps, I will not be able to do further analyses because it is required perfect matching of the ids, that is why I'm trying to discover what is causing the issue.
As far as I understood, XLOC is attributed when a matching id is not found, for instance, for denoting new putative transcripts.
The following error is given when I run cuffmerge, although the analysis run until the end and the "merged.gtf" output can be used for running cuffdiff:
You are using Cufflinks v2.2.1, which is the most recent release.
Command line:
cufflinks -o test/ -F 0.05 -g /home/hugo/MUR/Ensembl/At/A_thaliana.TAIR10.30.gtf -q --overhang-tolerance 200 --library-type=transfrags -A 0.0 --min-frags-per-transfrag 0 --no-5-extend -p 1 test/tmp/mergeSam_filehaqEY0
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
File test/tmp/mergeSam_filehaqEY0 doesn't appear to be a valid BAM file, trying SAM...
[09:52:24] Loading reference annotation.
[09:52:29] Inspecting reads and determining fragment length distribution.
Processed 30490 loci.
> Map Properties:
> Normalized Map Mass: 83582.00
> Raw Map Mass: 83582.00
> Fragment Length Distribution: Truncated Gaussian (default)
> Default Mean: 200
> Default Std Dev: 80
[09:52:30] Assembling transcripts and estimating abundances.
Processed 30490 loci.
Here is an example of the output merged.gtf file:
1 Cufflinks exon 3631 3913 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "NAC001"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";
Therefore, I checked the inputs for cuffmerge. However I didn't notice any issues:
For tophat accepted_hits.bam:
$ samtools view accepted_hits.bam | head
wtaba1-SRR1909022.2481187 163 1 3645 50 70M = 3996 421 ACCAAACCAGAGAAAACAAATACATAATCGGAGAAATACAGATTACAGAGAGCGAGAGAGATCGACGGCG CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJIJIJJIJJJJJJJJJJIJJJJJHHHHFFFDDDDD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0MD:Z:70 YT:Z:UU NH:i:1
For cufflinks transcripts.gtf:
1 Cufflinks transcript 3631 5899 1000 + . gene_id "gene:AT1G01010"; transcript_id "transcript:AT1G01010.1"; FPKM "5.1108000582"; frac "1.000000"; conf_lo "4.325017"; conf_hi "5.896583"; cov "14.756394";
I also tried creating a new .gtf file from the .gff3 instead of using the one available in Ensembl (by using the command "$ gffread A_thaliana.TAIR10.30.gff3 -T -o test.gtf"), but with no success.
So I believe that the problem is with cuffmerge, but I can't figure out how to solve it. I could for instance use a cross reference function in Excel to exchange values between "gene_id" and "nearest_ref" for each line of the merged.gtf file. However, I believe it could be a problem for larger files or for the case in which the XLOC was attributed to a new transcript that was discovered.
I am using the following commands for the analyses:
Tophat:
$ tophat -N 3 --read-edit-dist 4 --read-realign-edit-dist 0 -a 6 --microexon-search -r 150 --mate-std-dev 200 -i 8 -I 10000 --min-segment-intron 8 --max-segment-intron 10000 --b2-very-sensitive ~/path_to/bowtie_index/A_thaliana.TAIR10.30 myfile1_1_paired.fastq.trim myfile1_2_paired.fastq.trim
Samtools:
$ samtools sort accepted_hits.bam myfile_sorted
Cufflinks:
$ cufflinks -o cuff -G ~/path_to/A_thaliana.TAIR10.30.gff3 -b ~/path_to/A_thaliana.TAIR10.30.fa myfile_sorted.bam
Cuffmerge:
$ cuffmerge -g ~/path_to/ A_thaliana.TAIR10.30.gtf -s ~/path_to/ A_thaliana.TAIR10.30.fa assemblies.txt
Thanks in advance for any help.
Comment