Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • gtduarte
    Junior Member
    • Jan 2016
    • 5

    Gene id changed to XLOC... cuffmerge issue?

    Dear community,

    I am trying to run differential expression analysis with TopHat and Cufflinks, and alternative splicing analysis with MISO (for which I need TopHat outputs).

    However, I find that the gene_ids are being substituted by "XLOC" values after I run tophat-cufflinks-cuffmerge-cuffdiff. I am using the recent release of A.thaliana (TAIR10.30) and providing all reference annotations (.fa .gtf .gff3) where required.

    I found some discussion related to it (e.g. http://seqanswers.com/forums/showthread.php?t=19079), but the solutions proposed just solve plotting issues and depends on R (which I don't have experience with) and in fact don't find the cause of the problem. If for instance the problem is occurring on the initial steps, I will not be able to do further analyses because it is required perfect matching of the ids, that is why I'm trying to discover what is causing the issue.

    As far as I understood, XLOC is attributed when a matching id is not found, for instance, for denoting new putative transcripts.

    The following error is given when I run cuffmerge, although the analysis run until the end and the "merged.gtf" output can be used for running cuffdiff:

    You are using Cufflinks v2.2.1, which is the most recent release.
    Command line:
    cufflinks -o test/ -F 0.05 -g /home/hugo/MUR/Ensembl/At/A_thaliana.TAIR10.30.gtf -q --overhang-tolerance 200 --library-type=transfrags -A 0.0 --min-frags-per-transfrag 0 --no-5-extend -p 1 test/tmp/mergeSam_filehaqEY0
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    [bam_header_read] invalid BAM binary header (this is not a BAM file).

    File test/tmp/mergeSam_filehaqEY0 doesn't appear to be a valid BAM file, trying SAM...
    [09:52:24] Loading reference annotation.
    [09:52:29] Inspecting reads and determining fragment length distribution.
    Processed 30490 loci.
    > Map Properties:
    > Normalized Map Mass: 83582.00
    > Raw Map Mass: 83582.00
    > Fragment Length Distribution: Truncated Gaussian (default)
    > Default Mean: 200
    > Default Std Dev: 80
    [09:52:30] Assembling transcripts and estimating abundances.
    Processed 30490 loci.


    Here is an example of the output merged.gtf file:

    1 Cufflinks exon 3631 3913 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "NAC001"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";

    Therefore, I checked the inputs for cuffmerge. However I didn't notice any issues:

    For tophat accepted_hits.bam:

    $ samtools view accepted_hits.bam | head
    wtaba1-SRR1909022.2481187 163 1 3645 50 70M = 3996 421 ACCAAACCAGAGAAAACAAATACATAATCGGAGAAATACAGATTACAGAGAGCGAGAGAGATCGACGGCG CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJIJIJJIJJJJJJJJJJIJJJJJHHHHFFFDDDDD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0MD:Z:70 YT:Z:UU NH:i:1


    For cufflinks transcripts.gtf:

    1 Cufflinks transcript 3631 5899 1000 + . gene_id "gene:AT1G01010"; transcript_id "transcript:AT1G01010.1"; FPKM "5.1108000582"; frac "1.000000"; conf_lo "4.325017"; conf_hi "5.896583"; cov "14.756394";

    I also tried creating a new .gtf file from the .gff3 instead of using the one available in Ensembl (by using the command "$ gffread A_thaliana.TAIR10.30.gff3 -T -o test.gtf"), but with no success.

    So I believe that the problem is with cuffmerge, but I can't figure out how to solve it. I could for instance use a cross reference function in Excel to exchange values between "gene_id" and "nearest_ref" for each line of the merged.gtf file. However, I believe it could be a problem for larger files or for the case in which the XLOC was attributed to a new transcript that was discovered.

    I am using the following commands for the analyses:

    Tophat:

    $ tophat -N 3 --read-edit-dist 4 --read-realign-edit-dist 0 -a 6 --microexon-search -r 150 --mate-std-dev 200 -i 8 -I 10000 --min-segment-intron 8 --max-segment-intron 10000 --b2-very-sensitive ~/path_to/bowtie_index/A_thaliana.TAIR10.30 myfile1_1_paired.fastq.trim myfile1_2_paired.fastq.trim

    Samtools:

    $ samtools sort accepted_hits.bam myfile_sorted

    Cufflinks:

    $ cufflinks -o cuff -G ~/path_to/A_thaliana.TAIR10.30.gff3 -b ~/path_to/A_thaliana.TAIR10.30.fa myfile_sorted.bam

    Cuffmerge:

    $ cuffmerge -g ~/path_to/ A_thaliana.TAIR10.30.gtf -s ~/path_to/ A_thaliana.TAIR10.30.fa assemblies.txt

    Thanks in advance for any help.
  • danwiththeplan
    Member
    • Sep 2011
    • 72

    #2
    XLOC numbers are assigned to all genes, not just ones that aren't in GTF files you supply. So some genes will end up with a gene_name, but all genes end up with a gene_id (=XLOC number). I think the issue is that the downstream programs are looking for the gene_id field and not the gene_name field.

    Incidentally, as I am interpreting it, using the -G switch (as opposed to the -g switch) means that you'll only ever detect/quantitate/analyse genes that are in your GTF file (while using -g means that cufflinks will create a new gene from scratch if there is sufficient read support, even if it's not in the GTF file you supply).

    So, with the code you used, all output genes should have both a gene_id (XLOC number) and a gene_name (from the GTF)

    Comment

    • gtduarte
      Junior Member
      • Jan 2016
      • 5

      #3
      Hello dan, thanks for replying. I tried what you suggested, running cufflinks with the -g option instead of -G, but unfortunately it didn't work:

      $ cufflinks -o cuff_g -g ~/path_to/A_thaliana.TAIR10.30.gtf -b ~/path_to/A_thaliana.TAIR10.30.fa myfile_sorted.bam

      Indeed the resulting transcripts.gtf was a bit different from the previous one, for instance:

      -> with -G switch:

      1 Cufflinks transcript 11649 13714 1000 - . gene_id "gene:AT1G01030"; transcript_id "transcript:AT1G01030.1"; FPKM "0.4354775887"; frac "1.000000"; conf_lo "0.217739"; conf_hi "0.653216"; cov "1.257353";

      -> with -g switch:

      1 Cufflinks transcript 11649 13714 1000 - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; FPKM "0.4330886861"; frac "1.000000"; conf_lo "0.216544"; conf_hi "0.649633"; cov "1.250649"; full_read_support "yes";

      However, those bam_errors continue to appear when I run cuffmerge, as before, just as the XLOC values as my gene ids:

      Example of the merged.gtf:

      1 Cufflinks exon 3631 3913 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000002"; exon_number "1"; gene_name "NAC001"; oId "transcript:AT1G01010.1"; nearest_ref "transcript:AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";

      Nevertheless I run cuffdiff, and there were the XLOCs:

      From gene_exp.diff:

      test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change) test_stat p_value q_value significant
      XLOC_000001 XLOC_000001 NAC001 1:3630-5899 wtmock1 wtaba1 OK 2.82389 5.42847 0.94286 0.938326 0.3329 0.999039 no


      Just in case, I checked tophat accepted_hits.bam headers, but apparently it seems fine:

      $ samtools view -H wtaba1_sorted.bam

      @HD VN:1.0 SO:coordinate
      @SQ SN:1 LN:30427671
      @SQ SN:2 LN:19698289
      @SQ SN:3 LN:23459830
      @SQ SN:4 LN:18585056
      @SQ SN:5 LN:26975502
      @SQ SN:Mt LN:366924
      @SQ SN:Pt LN:154478
      @PG ID:TopHat VN:2.1.0 CL:/usr/bin/tophat -N 3 --read-edit-dist 4 --read-realign-edit-dist 0 -a 6 --microexon-search -r 150 --mate-std-dev 200 -i 8 -I 10000 --min-segment-intron 8 --max-segment-intron 10000 --b2-very-sensitive /path_to/bowtie_index/A_thaliana.TAIR10.30 myfile1_1_paired.fastq.trim myfile1_2_paired.fastq.trim


      Do you have any other clue?

      Many thanks again!

      Comment

      Latest Articles

      Collapse

      • GATTACAT
        Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
        by GATTACAT
        Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
        Today, 11:43 AM
      • SEQadmin2
        Nine Things a Sample Prep Scientist Thinks About Before Sequencing
        by SEQadmin2


        I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

        Here are nine questions we think about, in roughly the order they matter, before...
        06-18-2026, 07:11 AM
      • SEQadmin2
        From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
        by SEQadmin2


        Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


        The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
        ...
        06-02-2026, 10:05 AM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by SEQadmin2, Yesterday, 05:37 AM
      0 responses
      7 views
      0 reactions
      Last Post SEQadmin2  
      Started by SEQadmin2, 06-26-2026, 11:10 AM
      0 responses
      17 views
      0 reactions
      Last Post SEQadmin2  
      Started by SEQadmin2, 06-17-2026, 06:09 AM
      0 responses
      52 views
      0 reactions
      Last Post SEQadmin2  
      Started by SEQadmin2, 06-09-2026, 11:58 AM
      0 responses
      110 views
      0 reactions
      Last Post SEQadmin2  
      Working...