Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Gene id changed to XLOC... cuffmerge issue?

    Dear community,

    I am trying to run differential expression analysis with TopHat and Cufflinks, and alternative splicing analysis with MISO (for which I need TopHat outputs).

    However, I find that the gene_ids are being substituted by "XLOC" values after I run tophat-cufflinks-cuffmerge-cuffdiff. I am using the recent release of A.thaliana (TAIR10.30) and providing all reference annotations (.fa .gtf .gff3) where required.

    I found some discussion related to it (e.g. http://seqanswers.com/forums/showthread.php?t=19079), but the solutions proposed just solve plotting issues and depends on R (which I don't have experience with) and in fact don't find the cause of the problem. If for instance the problem is occurring on the initial steps, I will not be able to do further analyses because it is required perfect matching of the ids, that is why I'm trying to discover what is causing the issue.

    As far as I understood, XLOC is attributed when a matching id is not found, for instance, for denoting new putative transcripts.

    The following error is given when I run cuffmerge, although the analysis run until the end and the "merged.gtf" output can be used for running cuffdiff:

    You are using Cufflinks v2.2.1, which is the most recent release.
    Command line:
    cufflinks -o test/ -F 0.05 -g /home/hugo/MUR/Ensembl/At/A_thaliana.TAIR10.30.gtf -q --overhang-tolerance 200 --library-type=transfrags -A 0.0 --min-frags-per-transfrag 0 --no-5-extend -p 1 test/tmp/mergeSam_filehaqEY0
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    [bam_header_read] invalid BAM binary header (this is not a BAM file).

    File test/tmp/mergeSam_filehaqEY0 doesn't appear to be a valid BAM file, trying SAM...
    [09:52:24] Loading reference annotation.
    [09:52:29] Inspecting reads and determining fragment length distribution.
    Processed 30490 loci.
    > Map Properties:
    > Normalized Map Mass: 83582.00
    > Raw Map Mass: 83582.00
    > Fragment Length Distribution: Truncated Gaussian (default)
    > Default Mean: 200
    > Default Std Dev: 80
    [09:52:30] Assembling transcripts and estimating abundances.
    Processed 30490 loci.


    Here is an example of the output merged.gtf file:

    1 Cufflinks exon 3631 3913 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "NAC001"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";

    Therefore, I checked the inputs for cuffmerge. However I didn't notice any issues:

    For tophat accepted_hits.bam:

    $ samtools view accepted_hits.bam | head
    wtaba1-SRR1909022.2481187 163 1 3645 50 70M = 3996 421 ACCAAACCAGAGAAAACAAATACATAATCGGAGAAATACAGATTACAGAGAGCGAGAGAGATCGACGGCG CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJIJIJJIJJJJJJJJJJIJJJJJHHHHFFFDDDDD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0MD:Z:70 YT:Z:UU NH:i:1


    For cufflinks transcripts.gtf:

    1 Cufflinks transcript 3631 5899 1000 + . gene_id "gene:AT1G01010"; transcript_id "transcript:AT1G01010.1"; FPKM "5.1108000582"; frac "1.000000"; conf_lo "4.325017"; conf_hi "5.896583"; cov "14.756394";

    I also tried creating a new .gtf file from the .gff3 instead of using the one available in Ensembl (by using the command "$ gffread A_thaliana.TAIR10.30.gff3 -T -o test.gtf"), but with no success.

    So I believe that the problem is with cuffmerge, but I can't figure out how to solve it. I could for instance use a cross reference function in Excel to exchange values between "gene_id" and "nearest_ref" for each line of the merged.gtf file. However, I believe it could be a problem for larger files or for the case in which the XLOC was attributed to a new transcript that was discovered.

    I am using the following commands for the analyses:

    Tophat:

    $ tophat -N 3 --read-edit-dist 4 --read-realign-edit-dist 0 -a 6 --microexon-search -r 150 --mate-std-dev 200 -i 8 -I 10000 --min-segment-intron 8 --max-segment-intron 10000 --b2-very-sensitive ~/path_to/bowtie_index/A_thaliana.TAIR10.30 myfile1_1_paired.fastq.trim myfile1_2_paired.fastq.trim

    Samtools:

    $ samtools sort accepted_hits.bam myfile_sorted

    Cufflinks:

    $ cufflinks -o cuff -G ~/path_to/A_thaliana.TAIR10.30.gff3 -b ~/path_to/A_thaliana.TAIR10.30.fa myfile_sorted.bam

    Cuffmerge:

    $ cuffmerge -g ~/path_to/ A_thaliana.TAIR10.30.gtf -s ~/path_to/ A_thaliana.TAIR10.30.fa assemblies.txt

    Thanks in advance for any help.

  • #2
    XLOC numbers are assigned to all genes, not just ones that aren't in GTF files you supply. So some genes will end up with a gene_name, but all genes end up with a gene_id (=XLOC number). I think the issue is that the downstream programs are looking for the gene_id field and not the gene_name field.

    Incidentally, as I am interpreting it, using the -G switch (as opposed to the -g switch) means that you'll only ever detect/quantitate/analyse genes that are in your GTF file (while using -g means that cufflinks will create a new gene from scratch if there is sufficient read support, even if it's not in the GTF file you supply).

    So, with the code you used, all output genes should have both a gene_id (XLOC number) and a gene_name (from the GTF)

    Comment


    • #3
      Hello dan, thanks for replying. I tried what you suggested, running cufflinks with the -g option instead of -G, but unfortunately it didn't work:

      $ cufflinks -o cuff_g -g ~/path_to/A_thaliana.TAIR10.30.gtf -b ~/path_to/A_thaliana.TAIR10.30.fa myfile_sorted.bam

      Indeed the resulting transcripts.gtf was a bit different from the previous one, for instance:

      -> with -G switch:

      1 Cufflinks transcript 11649 13714 1000 - . gene_id "gene:AT1G01030"; transcript_id "transcript:AT1G01030.1"; FPKM "0.4354775887"; frac "1.000000"; conf_lo "0.217739"; conf_hi "0.653216"; cov "1.257353";

      -> with -g switch:

      1 Cufflinks transcript 11649 13714 1000 - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; FPKM "0.4330886861"; frac "1.000000"; conf_lo "0.216544"; conf_hi "0.649633"; cov "1.250649"; full_read_support "yes";

      However, those bam_errors continue to appear when I run cuffmerge, as before, just as the XLOC values as my gene ids:

      Example of the merged.gtf:

      1 Cufflinks exon 3631 3913 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000002"; exon_number "1"; gene_name "NAC001"; oId "transcript:AT1G01010.1"; nearest_ref "transcript:AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";

      Nevertheless I run cuffdiff, and there were the XLOCs:

      From gene_exp.diff:

      test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change) test_stat p_value q_value significant
      XLOC_000001 XLOC_000001 NAC001 1:3630-5899 wtmock1 wtaba1 OK 2.82389 5.42847 0.94286 0.938326 0.3329 0.999039 no


      Just in case, I checked tophat accepted_hits.bam headers, but apparently it seems fine:

      $ samtools view -H wtaba1_sorted.bam

      @HD VN:1.0 SO:coordinate
      @SQ SN:1 LN:30427671
      @SQ SN:2 LN:19698289
      @SQ SN:3 LN:23459830
      @SQ SN:4 LN:18585056
      @SQ SN:5 LN:26975502
      @SQ SN:Mt LN:366924
      @SQ SN:Pt LN:154478
      @PG ID:TopHat VN:2.1.0 CL:/usr/bin/tophat -N 3 --read-edit-dist 4 --read-realign-edit-dist 0 -a 6 --microexon-search -r 150 --mate-std-dev 200 -i 8 -I 10000 --min-segment-intron 8 --max-segment-intron 10000 --b2-very-sensitive /path_to/bowtie_index/A_thaliana.TAIR10.30 myfile1_1_paired.fastq.trim myfile1_2_paired.fastq.trim


      Do you have any other clue?

      Many thanks again!

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Strategies for Sequencing Challenging Samples
        by seqadmin


        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
        03-22-2024, 06:39 AM
      • seqadmin
        Techniques and Challenges in Conservation Genomics
        by seqadmin



        The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

        Avian Conservation
        Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
        03-08-2024, 10:41 AM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, Yesterday, 06:37 PM
      0 responses
      7 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, Yesterday, 06:07 PM
      0 responses
      7 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 03-22-2024, 10:03 AM
      0 responses
      49 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 03-21-2024, 07:32 AM
      0 responses
      66 views
      0 likes
      Last Post seqadmin  
      Working...
      X