  • #16
    I am afraid I don't understand what you are looking at there. Are 'u' and '.' the only class codes you got in the .tracking file? Because that would rather suggest that the chromosome names do not match at all between the reference annotation and the input sample files (the chromosome IDs in the first column should follow the same naming convention, e.g. it shouldn't be 'chr1' in the GTF with the transfrags and '1' in the reference annotation file).
    However, in that case I don't get this statement:
    "All of the other categories are consistent across the board."
    What other categories? In the attached table picture I only see the 'u' and '.' codes listed, plus a summary 'Total transfrags' row, which makes me think there are no other categories / class codes. Sorry for being a little confused here about your questions.

    Also, I suppose that in order to compare the class code distribution in different samples, you were in fact looking at the .tmap files, not the .tracking file as you said. Is that right? Because you mentioned you ran cuffcompare with 4 GTF files as input: was that one cuffcompare run with all 4 files at once, or was it 4 different runs?
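A quick way to rule out the chromosome naming mismatch mentioned above is to compare the sequence names in the first column of the reference GTF and of a sample GTF. The following is only a minimal sketch: the function names are made up for illustration, and it assumes plain tab-delimited GTF lines with `#` comment lines.

```python
def seqnames(gtf_lines):
    """Collect the sequence (chromosome) names from column 1 of GTF lines."""
    names = set()
    for line in gtf_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        names.add(line.split("\t", 1)[0])
    return names


def compare_seqnames(ref_lines, sample_lines):
    """Return (shared, only_in_reference, only_in_sample) name sets."""
    ref = seqnames(ref_lines)
    sample = seqnames(sample_lines)
    return ref & sample, ref - sample, sample - ref


# Tiny made-up example: the reference uses '1', the sample uses 'chr1'.
ref_demo = ['1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";']
sample_demo = ['chr1\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "CUFF.1"; transcript_id "CUFF.1.1";']

shared, ref_only, sample_only = compare_seqnames(ref_demo, sample_demo)
print("shared:", sorted(shared))
print("reference only:", sorted(ref_only))
print("sample only:", sorted(sample_only))
```

With real data, open file handles can be passed in place of the lists (for example the reference.gtf and sample1.gtf files from the run line in this thread). An empty shared set would point to a naming mismatch.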
    Last edited by gpertea; 06-19-2010, 07:03 AM.



    • #17
      Sorry about the confusion earlier.

      I see more class codes than just 'u' and '.'. To limit confusion, I had previously posted a table with only the relevant numbers, and that clearly backfired.

      I have attached my entire table (with the number of transfrags in each category) to this posting. Hopefully this makes things a little clearer.

      As for my comment that "all of the other categories are consistent across the board": as you can see from my attached table, the samples have very similar numbers and percentages of transfrags in every category other than '.' and 'u'.

      I ran cuffcompare once with all 4 GTF files as input, in addition to using a reference.

      My run line was the following:
      Code:
      cuffcompare -r reference.gtf ./sample1.gtf ./sample2.gtf (etc.)
      I calculated all of these numbers from the tracking file because, if I understood the manual correctly, the tracking file matches transcripts up between samples and lists each transcript structure that is present in one or more input GTF files, so all transcripts should be present in this file. Or am I interpreting this incorrectly?

      I hope this posting makes my earlier message much easier to understand. Thanks

      Also, in case this helps to diagnose the problem, here is the pipeline that I used to process my samples:
      Ga2 -> tophat -> cufflinks (w/ accepted_hits.sam) -> cuffcompare (as stated earlier, with the GTF files from cufflinks)
      Attached Files



      • #18
        "I calculated all of these numbers using the tracking file because, if I understood the manual correctly, the tracking file matches transcripts up between samples and lists each transcript structure that is present in one or more input GTF files"
        That's correct, though I think things get a little fuzzy when single-exon transfrags are considered, because in that case there is no "structure" to look at, and transfrags may get merged into a single line in that file if they simply overlap each other very well (though not perfectly).

        However, I think trying to get such per-sample stats from the tracking file is not a good idea because of the ambiguity of the '.' class code, which simply has no meaning when applied to an individual transfrag in a sample. Instead, you should use the .tmap files, which are generated for each of the input files and provide an independent transfrag classification for each sample.

        As you probably saw in the manual, the '.' code is used in the .tracking file whenever transfrags found to be "structurally equivalent" across samples (and thus likely to come from the same transcript) do not have the same classification code when considered individually (i.e. as shown in the .tmap files). For example, say a transfrag t1 in sample 1 has code 'u' when compared to the reference transcripts, and it has an "equivalent structure" (but see the caveat above for single-exon transfrags) with a transfrag t2 found in sample 2. Now say t2 is classified as 'p' because it extends a bit closer to a known transcript. This combination will be shown as '.' in the tracking file, and it doesn't make sense to classify the transfrags in both samples as having the '.' code in a table like yours. By the looks of it, samples 1 and 2 may have had a lot of these "equivalent" transfrags with mixed individual codes (one of them being 'u') that got reported in the tracking file as '.'.
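To make the rule concrete, here is a minimal sketch of how mixed per-sample codes would collapse to '.' as described in this post. It only illustrates the description above and is not the actual cuffcompare code; the function name and the use of None for "transfrag absent from this sample" are assumptions.

```python
def tracking_code(per_sample_codes):
    """Combine per-sample class codes for one group of equivalent transfrags.

    Samples in which the transfrag is absent are passed as None and ignored.
    """
    present = {code for code in per_sample_codes if code is not None}
    if not present:
        raise ValueError("at least one sample must contain the transfrag")
    if len(present) == 1:
        return present.pop()  # every sample agrees on the class code
    return "."  # mixed codes: no single classification applies


# t1 is 'u' in sample 1 and t2 is 'p' in sample 2, as in the example above.
print(tracking_code(["u", "p", None, None]))
# The same group classified as 'u' in every sample where it occurs.
print(tracking_code(["u", "u", None, "u"]))
```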

        In all fairness, this still looks like a strange distribution, so it is also possible that there is some inconsistency in the initialization of the class codes for the .tracking file, such that some transfrags with the 'u' class code (which is the default code) end up being reported as '.' in some cases. I'll check my code to see if and how that could happen. But again, if you want a meaningful distribution of transfrag categories in each individual sample, I would advise using the .tmap files instead of the .tracking file, because the '.' category doesn't tell you anything about the actual classification of transfrags in a single sample (in your case, it looks like this category actually "stole" almost all the 'u' transfrags in samples 1 and 2).
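A per-sample class code table can be built by counting the class_code column of each .tmap file. The sketch below is only an illustration: it assumes a tab-delimited file with a header row that contains a `class_code` column (as the `cut -f3` output later in this thread suggests), and the other column names in the demo data are made up.

```python
import csv
import io
from collections import Counter


def class_code_counts(tmap_handle, column="class_code"):
    """Count transfrags per class code in one tab-delimited .tmap file."""
    reader = csv.DictReader(tmap_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row[column] for row in reader)


# Made-up miniature .tmap content, for illustration only.
demo = io.StringIO(
    "ref_gene_id\tref_id\tclass_code\tcuff_gene_id\n"
    "g1\tt1\t=\tCUFF.1\n"
    "-\t-\tu\tCUFF.2\n"
    "-\t-\tu\tCUFF.3\n"
)
counts = class_code_counts(demo)
for code, n in sorted(counts.items()):
    print(code, n)
```

Running the function once per .tmap file (one per input GTF) gives the independent per-sample distributions, with no '.' category involved.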



        • #19
          Thank you so much! That explanation will definitely help me in my analysis.

          Also, recreating my table using the individual .tmap files allowed me to see that the number of transfrags in each class was consistent across all samples.



          • #20
            Hi gpertea, I have a question about the format of the .tracking file. The cufflinks manual says there are 6 fields for each sample transcript, as follows:
            qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<conf_lo>|<conf_hi>

            However, when I run cufflinks 0.8.2, I get 8 fields for each transcript, for example:
            q1:CUFF.54652|CUFF.54652.1|100|10.306160|3.018604|17.593716|1.929078|141

            Could you tell me what the two extra fields are? Thanks a lot!



            • #21
              Indeed, the manual hasn't yet been updated to reflect the fact that two extra fields were added there, so the format is now like this:
              Code:
              qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<conf_lo>|<conf_hi>|<cov>|<len>
              ..where the added fields are:
              • <cov>: the estimated average depth of read coverage across the transfrag
              • <len>: the length of the transfrag
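For anyone parsing these entries, here is a minimal sketch that splits one sample field into the 8 values listed above. The class and function names are made up for illustration, and it assumes exactly one transfrag per entry in the 8-field layout. The example string is the one posted in #20.

```python
from typing import NamedTuple


class TrackingEntry(NamedTuple):
    sample: str
    gene_id: str
    transcript_id: str
    fmi: int
    fpkm: float
    conf_lo: float
    conf_hi: float
    cov: float
    length: int


def parse_tracking_entry(entry):
    """Parse one 'qJ:...' sample field from a .tracking line (8-field layout)."""
    sample, _, rest = entry.partition(":")
    fields = rest.split("|")
    if len(fields) != 8:
        raise ValueError(f"expected 8 '|'-separated fields, got {len(fields)}")
    gene_id, transcript_id, fmi, fpkm, lo, hi, cov, length = fields
    return TrackingEntry(sample, gene_id, transcript_id, int(fmi),
                         float(fpkm), float(lo), float(hi), float(cov), int(length))


entry = parse_tracking_entry(
    "q1:CUFF.54652|CUFF.54652.1|100|10.306160|3.018604|17.593716|1.929078|141"
)
print(entry.transcript_id, entry.fpkm, entry.cov, entry.length)
```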



              • #22
                gpertea, thank you for your prompt reply!



                • #23
                  visualization of Cuffcompare class codes

                  Hi,
                  Some of the class code descriptions are a little difficult to interpret.

                  I have made a visualization of the transfrags that fall into each class, based on my interpretation, and attached it to this post.

                  Am I interpreting the descriptions correctly? Thanks in advance.
                  Attached Files



                  • #24
                    Originally posted by gpertea
                    Indeed, the manual hasn't yet been updated to reflect the fact that two extra fields were added there, so the format is now like this:
                    Code:
                    qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<conf_lo>|<conf_hi>|<cov>|<len>
                    ..where the added fields are:
                    • <cov>: the estimated average depth of read coverage across the transfrag
                    • <len>: the length of the transfrag
                    Hi gpertea, how is coverage calculated in this file? Can you please tell me the formula used? Regarding "estimated average depth of coverage across the transcript": how do you determine which transcript is meant, or is this local to the transfrag? I am very confused. Also, is a transfrag the same as the "fragments" in FPKM? How can I determine how many million reads were mapped for each experiment, and how are multireads handled in this case?

                    Finally, for single-end reads, what makes a fragment? I understand the definition of fragments for paired-end reads.

                    One last thing: sometimes cufflinks will show disconnected parts of a transcript even though there are reads across the entire gene. Why is this so? Could it be because coverage is too low in other parts of the transcript?

                    Thanks so much, and I really hope that you will reply to my queries because my data is not making sense to me.



                    • #25
                      (=) Is "perfect match" or "Complete match of intron chain"

                      Hi guys,

                      We're trying hard to understand the definition of "intron chain" in the class code description for "=".

                      In your tables, does "Complete match of intron chain" mean "perfect match of a transcript"?

                      Thanks in advance (should I study more English? :P )



                      • #26
                        I think that "intron chain" essentially means ignoring the 5' end of the first exon and the 3' end of the last exon. In other words, all the introns are recovered, which does not necessarily mean that the transcript ends are recovered too.

                        Originally posted by brdido
                        Hi guys,

                        We're trying hard to understand the definition of "intron chain" in the class code description for "=".

                        In your tables, does "Complete match of intron chain" mean "perfect match of a transcript"?

                        Thanks in advance (should I study more English? :P )



                        • #27
                          wenhuang is correct. The intron coordinates must all match, which means that all the internal exons also match, and only the start coordinate of the first exon and the end coordinate of the last exon are allowed to differ from those of the reference transcript.
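The intron chain rule described here can be sketched as follows. This is only an illustration of the comparison, not cuffcompare's implementation: the function names are made up, exons are assumed to be 1-based inclusive (start, end) pairs as in GTF, and chromosome and strand are assumed to match already.

```python
def intron_chain(exons):
    """Return the introns implied by a list of (start, end) exon intervals."""
    exons = sorted(exons)
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


def same_intron_chain(transfrag_exons, reference_exons):
    """True if both multi-exon transcripts have identical intron coordinates."""
    a = intron_chain(transfrag_exons)
    b = intron_chain(reference_exons)
    return bool(a) and a == b  # single-exon transcripts have no intron chain


reference = [(100, 200), (300, 400), (500, 600)]
transfrag = [(150, 200), (300, 400), (500, 580)]  # different ends, same introns
shifted = [(100, 210), (300, 400), (500, 600)]    # first intron starts later

print(same_intron_chain(transfrag, reference))
print(same_intron_chain(shifted, reference))
```

Only the start of the first exon and the end of the last exon differ between `transfrag` and `reference`, so their intron chains are identical. In `shifted`, an internal exon boundary moves, so the chains differ.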



                          • #28
                            OK! I think I got it!
                            Thanks!



                            • #29
                              Hi all

                              In a transcriptomics analysis with cufflinks and cuffcompare, I want to filter out noise, i.e. eliminate "suspicious" transfrags. I think the class code could be a good parameter for this selection. Which class codes are best for filtering transfrags?


                              Thanks!!!



                              • #30
                                cuffcompare

                                Hi gpertea,

                                I have a couple of questions regarding cufflinks/cuffcompare:

                                1. I got strange results when I compared two runs: cufflinks with annotation followed by cuffcompare, and cufflinks without annotation followed by cuffcompare. Here are the results:

                                # With annotation:
                                [upendra_35@vm142-17 Denovo_stuff]$ cut -f3 cufflinks_out/cuffcompare_out.transcripts.gtf.tmap |sort|uniq -c
                                41003 =
                                7 c
                                1 class_code

                                # Without annotation:
                                [upendra_35@vm142-17 Denovo_stuff]$ cut -f3 cufflinks_out_no_annot/cuffcompare_out_no_annot.transcripts.gtf.tmap |sort|uniq -c
                                11935 =
                                6397 c
                                1 class_code
                                5014 e
                                562 i
                                16519 j
                                7226 o
                                1844 p
                                51 s
                                8169 u
                                624 x

                                Why do we get essentially only one class (almost all '=') from cufflinks with annotation? I have already checked the annotation file and the transcripts.gtf file, and the chromosome names match. I believe the results from cufflinks without annotation might be the correct ones. Right?

                                2. The result above is based on 3 lanes of Illumina data. Do you think we can increase the percentages of the interesting classes ('o' and 'u') if we include more data?

                                Thanks in advance.

