  • GTF input tophat and cufflinks

    Hi everyone,

    I have two questions about the GTF file that can be used as a reference in both TopHat and cuffdiff. A typical GTF file, downloaded from UCSC for instance, looks something like this:

    chr12 refGene exon 12262139 12262238 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "1"; exon_id "NM_001146119.1"; gene_name "Fam49a$
    chr12 refGene exon 12304181 12304322 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "2"; exon_id "NM_001146119.2"; gene_name "Fam49a$
    chr12 refGene exon 12340679 12340758 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "3"; exon_id "NM_001146119.3"; gene_name "Fam49a$
    chr12 refGene CDS 12340689 12340758 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "3"; exon_id "NM_001146119.3"; gene_name "Fam49a$
    chr12 refGene exon 12358045 12358166 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "4"; exon_id "NM_001146119.4"; gene_name "Fam49a$
    chr12 refGene CDS 12358045 12358166 . + 2 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "4"; exon_id "NM_001146119.4"; gene_name "Fam49a$
    chr12 refGene exon 12359213 12359318 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "5"; exon_id "NM_001146119.5"; gene_name "Fam49a$
    chr12 refGene CDS 12359213 12359318 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "5"; exon_id "NM_001146119.5"; gene_name "Fam49a$
    chr12 refGene exon 12361435 12361571 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "6"; exon_id "NM_001146119.6"; gene_name "Fam49a$
    chr12 refGene CDS 12361435 12361571 . + 2 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "6"; exon_id "NM_001146119.6"; gene_name "Fam49a$
    chr12 refGene exon 12362015 12362092 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "7"; exon_id "NM_001146119.7"; gene_name "Fam49a$
    chr12 refGene CDS 12362015 12362092 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "7"; exon_id "NM_001146119.7"; gene_name "Fam49a$
    chr12 refGene exon 12362252 12362368 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "8"; exon_id "NM_001146119.8"; gene_name "Fam49a$
    chr12 refGene CDS 12362252 12362368 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "8"; exon_id "NM_001146119.8"; gene_name "Fam49a$
    chr12 refGene exon 12362461 12362540 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "9"; exon_id "NM_001146119.9"; gene_name "Fam49a$
    chr12 refGene CDS 12362461 12362540 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "9"; exon_id "NM_001146119.9"; gene_name "Fam49a$
    chr12 refGene exon 12364720 12364846 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "10"; exon_id "NM_001146119.10"; gene_name "Fam4$
    chr12 refGene CDS 12364720 12364846 . + 1 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "10"; exon_id "NM_001146119.10"; gene_name "Fam4$
    chr12 refGene exon 12369894 12369964 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "11"; exon_id "NM_001146119.11"; gene_name "Fam4$
    chr12 refGene CDS 12369894 12369964 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "11"; exon_id "NM_001146119.11"; gene_name "Fam4$
    chr12 refGene exon 12372747 12376361 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "12"; exon_id "NM_001146119.12"; gene_name "Fam4$
    chr12 refGene CDS 12372747 12372807 . + 1 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "12"; exon_id "NM_001146119.12"; gene_name "Fam4$
    chr12 refGene start_codon 12340689 12340691 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "1"; exon_id "NM_001146119.1"; gene_name$
    chr12 refGene stop_codon 12372808 12372810 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "1"; exon_id "NM_001146119.1"; gene_name$
    chr7 refGene exon 24902986 24903128 . + . gene_id "Arhgef1"; transcript_id "NM_008488"; exon_number "1"; exon_id "NM_008488.1"; gene_name "Arhgef1";
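For reference, here is a minimal sketch of how one such record breaks down into its nine columns and its attribute list. It assumes the columns are tab-separated in the actual file (the listing above shows them with spaces) and uses the final line of the listing, which is shown in full. The function name `parse_gtf_line` is made up for illustration.

```python
import re

# Attribute pairs in the ninth column look like: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')

def parse_gtf_line(line):
    """Split one GTF record into its nine columns plus an attribute dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqname, source, feature, start, end, score, strand, frame, attrs = cols
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,          # exon, CDS, start_codon, stop_codon, ...
        "start": int(start),         # 1-based, inclusive
        "end": int(end),             # 1-based, inclusive
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": dict(ATTR_RE.findall(attrs)),
    }

# Last record of the listing, re-typed with tabs between the columns (assumption).
record = parse_gtf_line(
    'chr7\trefGene\texon\t24902986\t24903128\t.\t+\t.\t'
    'gene_id "Arhgef1"; transcript_id "NM_008488"; exon_number "1"; '
    'exon_id "NM_008488.1"; gene_name "Arhgef1";'
)
print(record["feature"], record["attributes"]["transcript_id"])
```

The feature type lives in the third column, so exon and CDS records for the same transcript are separate lines that share a transcript_id.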

    Question 1:
    The GTF file includes exons, coding sequences (CDS), miRNAs, and start and stop codons. The coding sequence and the exons are often identical or nearly so, which is why I kept only the exons in my reference file. Is this the wisest thing to do? What will TopHat and cufflinks do if I keep the additional information? Will they be able to use these annotations (exon, CDS, and so on), or will they just try to map the reads to each individual line in the reference file without distinguishing between the different "types" (exon versus CDS versus microRNA)? If the latter is the case, does that mean the number of reads for exons will be halved, since half are now mapped to the CDS?

    Question 2:
    How can cufflinks perform CDS-level transcription difference tests, splicing tests, promoter preference tests and relative CDS output tests? Where do you provide the inputs so that it knows where these are?
    My output when using only exons looks like this:
    Performed 12350 isoform-level transcription difference tests
    Performed 0 tss-level transcription difference tests
    Performed 10502 gene-level transcription difference tests
    Performed 0 CDS-level transcription difference tests
    Performed 0 splicing tests
    Performed 0 promoter preference tests
    Performing 0 relative CDS output tests

  • #2
    (1) Don't edit your GTF file, keep those things in. Tophat et al. will still work fine (tophat should produce the same results, in fact).

    (2) If you leave the CDS and other fields in and follow the normal cufflinks workflow, you'll generate a merged GTF file with p_id and tss_id fields that cufflinks can use.

    • #3
      Originally posted by dpryan:
      (1) Don't edit your GTF file, keep those things in. Tophat et al. will still work fine (tophat should produce the same results, in fact).
      So TopHat/cuffdiff do indeed use those lines containing the information about it being an exon or coding sequence?

      Originally posted by dpryan:
      (2) If you leave the CDS and other fields in and follow the normal cufflinks workflow, you'll generate a merged GTF file with p_id and tss_id fields that cufflinks can use.
      I'm using the workflow without gene discovery. Does that matter at all for your answer?

      • #4
        The GTF file with the tss_id and p_id fields can either be generated with cuffmerge (on the transcripts.gtf files from your samples) or cuffcompare (on the original GTF file). In the rare instances where I've used cufflinks, I've always used the cuffmerge route. I'm not familiar enough with the inner workings of cufflinks to state whether not doing novel gene detection really changes the transcripts.gtf file (again, I've never tried that route), so I can't offer any insight there.
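The zero counts for tss-level and CDS-level tests in the first post are what one would expect from a GTF that lacks these attributes. A quick way to check a file before running cuffdiff is sketched below. This is an illustration only: the function name is made up, the columns are assumed to be tab-separated, and the `TSS1` and `P1` values in the second sample record are hypothetical.

```python
import re
from collections import Counter

# Matches every attribute key in the ninth column: key "value";
KEY_RE = re.compile(r'(\S+) "[^"]*";')

def count_grouping_attributes(gtf_lines, keys=("tss_id", "p_id")):
    """Count records, and how many of them carry each grouping attribute."""
    counts = Counter()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip malformed lines rather than guess
        counts["records"] += 1
        present = set(KEY_RE.findall(cols[8]))
        for key in keys:
            if key in present:
                counts[key] += 1
    return counts

sample = [
    # Plain UCSC-style record: no tss_id or p_id.
    'chr7\trefGene\texon\t24902986\t24903128\t.\t+\t.\t'
    'gene_id "Arhgef1"; transcript_id "NM_008488"; exon_number "1";',
    # Same record with hypothetical tss_id and p_id values added.
    'chr7\trefGene\texon\t24902986\t24903128\t.\t+\t.\t'
    'gene_id "Arhgef1"; transcript_id "NM_008488"; exon_number "1"; '
    'tss_id "TSS1"; p_id "P1";',
]
counts = count_grouping_attributes(sample)
print(dict(counts))
```

With a real file, pass an open file handle instead of the list; if the `tss_id` and `p_id` counts come back as zero, the GTF has not been through cuffmerge or cuffcompare.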

        • #5
          Hi all,
          I have a few questions about cufflinks for RNA-seq analysis.
          First: I understand cufflinks is designed for RNA-seq and not specifically for miRNA-seq.
          I am working with miRNA data from an Illumina GA and have reads about 36 bases long.
          My problem is that the transcripts.gtf generated as output (while running cuffcompare on two samples) gives me loci in which four distinct clusters have been merged into one CUFF.48.1 location. However, when I visualize this in IGV, I see that the region has four clusters of reads with varied depths, separated by regions about as long as a read (~36 bases).
          Why are these merged into one cluster?
          Is cufflinks treating my sequences as mRNA sequences?
          What criteria or parameters are used to merge two clusters into one CUFF ID in cuffcompare? For example, what is the minimum or maximum distance needed to identify them as one cluster?
          Also, in the output file generated by cuffcompare I have "-" for the first file and a CUFF value for the second file I am comparing. However, in IGV I see the same four clusters present in both files. Why is this happening?

          Any tips to help me understand this would be greatly appreciated. I looked in the cufflinks manual but could not find much information.

          Are there any differential analysis tools available specifically for miRNA-seq?

          Thanks,
          Geneart.
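To make the distance question concrete, the sketch below shows generic gap-based merging of read clusters. It is not the algorithm used by cufflinks or cuffcompare, `max_gap` is a hypothetical parameter, and the coordinates are invented to mimic four 36-base clusters separated by 36-base gaps.

```python
def merge_clusters(intervals, max_gap):
    """Merge (start, end) intervals (1-based, inclusive) whose gap is <= max_gap bases."""
    merged = []
    for start, end in sorted(intervals):
        # Number of uncovered bases between the previous interval and this one.
        if merged and start - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical coordinates: four 36 bp clusters, each separated by a 36 bp gap.
clusters = [(100, 135), (172, 207), (244, 279), (316, 351)]

loose = merge_clusters(clusters, max_gap=50)   # gap tolerance larger than 36 bp
strict = merge_clusters(clusters, max_gap=10)  # gap tolerance smaller than 36 bp
print(loose)
print(strict)
```

Under any rule of this kind, whether the four clusters are reported as one locus or four depends entirely on whether the tolerated gap exceeds the spacing between them.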

          • #6
            You might want to create a new topic

            • #7
              Why tss_id and p_id? How to add them?

              The use of tss_id and p_id by cuffdiff is explained as follows:
              tss_group_exp.diff: Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id.
              cds_exp.diff: Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id, independent of tss_id.

              You can get GTF files that already contain p_id and tss_id from iGenomes for some organisms.

              If you want or need to use a GTF from another source, this advice is offered:
              Note: If an arbitrary GTF/GFF3 file is used as input (instead of the .combined.gtf file produced by Cuffcompare), these attributes will not be present, but Cuffcompare can still be used to obtain these attributes with a command like this:

              cuffcompare -s /path/to/genome_seqs.fa -CG -r annotation.gtf annotation.gtf

              The resulting cuffcmp.combined.gtf file created by this command will have the tss_id and p_id attributes added to each record and this file can be used as input for cuffdiff.


              However, as the comment in my alternate approach to the problem states:
              ## NOTE: cuffdiff's documented way of adding these attributes is to
              ## create a .combined.gtf file using `cuffcompare`, but this method
              ## unfortunately (unnecessarily!) resets the gene_id and
              ## transcript_id to newly generated unique values.
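One way to keep the original IDs is to copy only the tss_id and p_id values from the cuffcompare output back onto the original annotation. The sketch below illustrates that idea; it is not the alternate approach referred to above, whose code is not shown in this thread. It assumes that each record in the .combined.gtf file carries an `oId` attribute holding the original transcript_id, and the `XLOC`, `TCONS`, `TSS1` and `P1` values in the example are hypothetical. All function names are made up.

```python
import re

ATTR_RE = re.compile(r'(\S+) "([^"]*)";')

def attrs_of(line):
    """Return the attribute dict of one tab-separated, nine-column GTF record."""
    return dict(ATTR_RE.findall(line.rstrip("\n").split("\t")[8]))

def build_id_map(combined_lines):
    """Map original transcript_id (taken from oId, an assumption) to its tss_id/p_id."""
    id_map = {}
    for line in combined_lines:
        attrs = attrs_of(line)
        if "oId" not in attrs:
            continue
        extra = {k: attrs[k] for k in ("tss_id", "p_id") if k in attrs}
        id_map.setdefault(attrs["oId"], {}).update(extra)
    return id_map

def annotate(original_lines, id_map):
    """Yield original records with tss_id/p_id appended; gene_id and transcript_id are untouched."""
    for line in original_lines:
        line = line.rstrip("\n")
        extra = id_map.get(attrs_of(line).get("transcript_id"), {})
        yield line + "".join(f' {k} "{v}";' for k, v in extra.items())

original = [
    'chr7\trefGene\texon\t24902986\t24903128\t.\t+\t.\t'
    'gene_id "Arhgef1"; transcript_id "NM_008488"; exon_number "1";',
]
combined = [  # hypothetical cuffcompare output record
    'chr7\tCuffcompare\texon\t24902986\t24903128\t.\t+\t.\t'
    'gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; '
    'oId "NM_008488"; tss_id "TSS1"; p_id "P1";',
]
result = list(annotate(original, build_id_map(combined)))
print(result[0])
```

Transcripts with no match in the map pass through unchanged, so it is worth counting how many records actually gained the attributes before handing the file to cuffdiff.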
