Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cufflinks and gene expression

    Hi,

    I'm new to RNASeq but I've successfully managed to run through a tophat/cufflinks workflow, however I'm getting some results which I find slightly difficult to believe as well as some general oddness.

    I was hoping to run it past people who have more experience with cufflinks and might be able to spot any mistakes I'm making.

    Our intention is to compare our treatment (gene knockdown using siRNA - s_6_sequence.txt) vs. control (s_7_sequence.txt).

    First, what I've done:

    Aligned single ended reads with tophat

    tophat --solexa1.3-quals -p 14 -o s_6_sequence hg18 s_6_sequence.txt
    tophat --solexa1.3-quals -p 14 -o s_7_sequence hg18 s_7_sequence.txt
    Downloaded Ensembl Homo_sapiens NCBI36 release 51 GTF file and changed chromosomes to match hg18 format

    From: ftp://ftp.ensembl.org/pub/release-51...BI36.51.gtf.gz

    cat Homo_sapiens.NCBI36.51.gtf | sed "s/\(.*\)/chr\1/" | sed "s/chrMT/chrM/" > Homo_sapiens.NCBI36.51.fixed.gtf
    Run Cufflinks

    cd s_6_sequence
    cufflinks -p 14 accepted_hits.sam
    cd ..
    cd s_7_sequence
    cufflinks -p 14 accepted_hits.sam
    Cuffcompare

    cuffcompare -r Homo_sapiens.NCBI36.51.fixed.gtf -R -o summarystats.txt s_6_sequence/transcripts.gtf s_7_sequence/transcripts.gtf
    Cuffdiff

    cuffdiff -p 14 summarystats.combined.gtf s_6_sequence/accepted_hits.sam s_7_sequence/accepted_hits.sam
    I wanted to be able to view the data myself so I created bigWig and bigBed files that I could use on the UCSC genome browser from the coverage.wig and junctions.bed files created for each of the lanes.

    For some reason the coverage.wig files occasionally have an entry at the end of a chromosome where the start and end positions are identical (which would imply a length of 0) which I take to be a bug so my bigWig and bigBed files are generated as below:

    cat coverage.wig | awk 'FNR > 1' | awk '{if ($2==$3){$3 = $3 +1}; print $1 "\t" $2 "\t" $3 "\t" $4}' > coverage.bedGraph
    UCSC/bedGraphToBigWig coverage.bedGraph chrmSizes.hg18 coverage.bigWig
    rm coverage.bedGraph
    cat junctions.bed | awk 'FNR > 1' | sort -k1,1 -k2,2n > junctions.bedTrim
    /UCSC/bedToBigBed junctions.bedTrim chrmSizes.hg18 junctions.bigBed
    rm junctions.bedTrim
    Using:

    cufflinks v0.8.2
    TopHat v1.0.13
    bowtie version 0.12.5


    So, I believe this is roughly a correct workflow for this data. Now, for my questions.

    Locus Annotation

    Looking at 0_1_gene_expr.diff, it seems that quite a few loci aren't annotated correctly even though they are sitting right on top of a gene and several are wrongly annotated.

    One example is a locus found at chr16:2017016-2029027. This is annotated as NTHL1 however it should be SLC9A3R2 which is at that position.



    NTHL1 is however just next door at chr16:2,029,817-2,037,868 so I thought perhaps SLC9A3R2 is just missing from the annotation and it just picked the closest one, however this is not the case as SLC9A3R2 is there and in the correct place.

    $> grep SLC9A3R2 Homo_sapiens.NCBI36.51.fixed.gtf
    chr16 protein_coding exon 2016924 2017220 . + . gene_id "ENSG00000065054"; transcript_id "ENST00000191922"; exon_number "1"; gene_name "SLC9A3R2"; transcript_name "SLC9A3R2-201";
    chr16 protein_coding CDS 2017008 2017220 . + 0 gene_id "ENSG00000065054"; transcript_id "ENST00000191922"; exon_number "1"; gene_name "SLC9A3R2"; transcript_name "SLC9A3R2-201"; protein_id "ENSP00000191922";
    chr16 protein_coding start_codon 2017008 2017010 . + 0 gene_id "ENSG00000065054"; transcript_id "ENST00000191922"; exon_number "1"; gene_name "SLC9A3R2"; transcript_name "SLC9A3R2-201";
    chr16 protein_coding exon 2019584 2019784 . + . gene_id "ENSG00000065054"; transcript_id "ENST00000191922"; exon_number "2"; gene_name "SLC9A3R2"; transcript_name "SLC9A3R2-201";
    chr16 protein_coding CDS 2019584 2019784 . + 0 gene_id "ENSG00000065054"; transcript_id "ENST00000191922"; exon_number "2"; gene_name "SLC9A3R2"; transcript_name "SLC9A3R2-201"; protein_id "ENSP00000191922";
    ... etc
    There are also a large number of loci (~2700) that don't have any annotated transcript from the reference at all despite many of them being right on top of a single gene.

    Is this just what I should expect or is there something wrong with my GTF annotation or parameters?



    Weird significance

    Some loci seem to be classed as significantly different despite having a very small number of reads in them.

    i.e. the locus:

    locus status value_1 value_2 log(fold_change) test_stat p_value significant
    chr1:100503947-100504182 OK 0.728869 0 0 0 0 yes
    Looks like this:



    which by eye I'd have called as a NOTEST. However there are many genes with loads of coverage which are being called as NOTEST while this isn't.

    gene locus status value_1 value_2 log(fold_change) test_stat p_value significant
    RPL7 chr8:74367477-74368779 NOTEST 208.194 206.278 -0.00924444 -0.00573799 0.995422 no


    Why do I get loci with very few reads being tested but loci with large numbers of reads as NOTEST?


    Weird expression values

    I've got some really really really high fold change differences between genes.

    Take for example

    gene locus status value_1 value_2 log(fold_change) test_stat p_value significant
    CCDC36 chr3:49108372-49117225 OK 2.33E-008 5.91318 19.3523 2546.34 0 yes


    A ln(fold change) of 19 ! Surely that'd be a regular fold change of about 178 million? Am I misunderstanding that?

    Obviously coverage isn't directly equatable with FPKM, but still, I'm hard pressed to believe there is such a huge fold change on that gene as cufflinks reports.

    Why do I get huge fold changes for loci? Presumably it occurs because one of the samples is being assigned a FPKM of 7.55E-009 when in reality it should be much higher than this, why would this be the case?

    Summary / TLDR

    I'm obviously doing something a bit suspect and getting weird results for both annotation, fold change and significance. It probably stems from one or more error I've made somewhere along the way and was hoping someone could provide a second set of eyes to spot it. Any help would be appreciated.
    Last edited by McBryan; 06-08-2010, 02:24 AM. Reason: Just realised I'd copied the line for the weird expression values from the wrong spreadsheet. Updated.

  • #2
    Originally posted by McBryan View Post

    Locus Annotation

    Looking at 0_1_gene_expr.diff, it seems that quite a few loci aren't annotated correctly even though they are sitting right on top of a gene and several are wrongly annotated.

    One example is a locus found at chr16:2017016-2029027. This is annotated as NTHL1 however it should be SLC9A3R2 which is at that position.



    NTHL1 is however just next door at chr16:2,029,817-2,037,868 so I thought perhaps SLC9A3R2 is just missing from the annotation and it just picked the closest one, however this is not the case as SLC9A3R2 is there and in the correct place.
    Would you please email me the files you fed to Cuffcompare? We'll need to look at this one ourselves.

    There are also a large number of loci (~2700) that don't have any annotated transcript from the reference at all despite many of them being right on top of a single gene.
    Can you explain this in a bit more detail? Perhaps with an excerpt from the tracking file?


    Weird significance

    Some loci seem to be classed as significantly different despite having a very small number of reads in them.
    This is a known bug that has been fixed in our development builds, and will be resolved in 0.8.3. We introduced the bug in 0.8.2, IIRC.



    Weird expression values

    I've got some really really really high fold change differences between genes.

    Take for example





    A ln(fold change) of 20.4781 ! Surely that'd be a regular fold change of about 700 million? Am I misunderstanding that?

    Obviously coverage isn't directly equatable with FPKM, but still, I'm hard pressed to believe there is such a huge fold change on that gene as cufflinks reports.

    Why do I get huge fold changes for loci? Presumably it occurs because one of the samples is being assigned a FPKM of 7.55E-009 when in reality it should be much higher than this, why would this be the case?
    What GTF file did you feed to cuffdiff? Is this the combined GTF from cuffcompare, or is it the reference? Can you please email me both the GTF and the SAM alignments from this locus in both samples? We'll take a closer look.

    Comment


    • #3
      Originally posted by Cole Trapnell View Post
      Would you please email me the files you fed to Cuffcompare? We'll need to look at this one ourselves.
      cuffcompare -r Homo_sapiens.NCBI36.51.fixed.gtf -R -o summarystats.txt s_6_sequence/transcripts.gtf s_7_sequence/transcripts.gtf

      with these files


      Originally posted by Cole Trapnell View Post
      Can you explain this in a bit more detail? Perhaps with an excerpt from the tracking file?

      One of the un-annotated loci is:

      test_id gene locus status value_1 value_2 log(fold_change) test_stat p_value significant
      XLOC_060829 - chrX:135407087-135422161 OK 1.37306 0.639791 -0.763656 -128.114 0 yes
      which looks like this:



      From the tracking file for that XLOC:

      TCONS_00113902 XLOC_060829 - u q1:CUFF.228858|CUFF.228858.1|100|8.199148|5.205243|11.193054|11.432044|- -
      From the summarystats.loci file for that XLOC:

      XLOC_060829 chrX[-]135409980-135410636 - CUFF.228858.1
      It seems to have been assigned a class code of 'u' (intergenic transcript) while it heavily overlaps with a reference transcript. It's also odd it only seems to have a CUFF id from one sample but not the other.

      I would have imagined both samples would have assembled a transfrag for this region.

      Looking up the s_6_sequence/transcripts.refmap and s_7_sequence/transcripts.refmap files and searching for ENST00000218364 in either file reveals no hits at all. Nor does searching for HTATSF1. Which is a bit odd.

      It's in the reference annotation:

      $> cat Homo_sapiens.NCBI36.51.fixed.gtf | grep ENST00000218364
      chrX protein_coding exon 135407337 135407695 . + . gene_id "ENSG00000102241"; transcript_id "ENST00000218364"; exon_number "1"; gene_name "HTATSF1"; transcript_name "HTATSF1-001";
      chrX protein_coding CDS 135407510 135407695 . + 0 gene_id "ENSG00000102241"; transcript_id "ENST00000218364"; exon_number "1"; gene_name "HTATSF1"; transcript_name "HTATSF1-001"; protein_id "ENSP00000218364";
      chrX protein_coding start_codon 135407510 135407512 . + 0 gene_id "ENSG00000102241"; transcript_id "ENST00000218364"; exon_number "1"; gene_name "HTATSF1"; transcript_name "HTATSF1-001";
      chrX protein_coding exon 135409423 135409603 . + . gene_id "ENSG00000102241"; transcript_id "ENST00000218364"; exon_number "2"; gene_name "HTATSF1"; transcript_name "HTATSF1-001";
      ... etc
      So, I've checked another one:

      test_id gene locus status value_1 value_2 log(fold_change) test_stat p_value significant
      XLOC_000027 - chr18:9536822-9604544 OK 0.862998 0.0533053 -2.78438 -707.483 0 yes


      Again that seems to be a similar story in the tracking file, class 'u' and only cuff id's from one sample.

      TCONS_00000074 XLOC_000027 - u q1:CUFF.97282|CUFF.97282.1|100|0.852185|0.847895|0.856475|1.184949|- -
      summarystats.loci:

      XLOC_000027 chr18[+]9574714-9578237 - CUFF.97282.1 -
      This one does however have listings in the transcripts.refmap files:

      s_6_sequence:
      PPP4R1 ENST00000400555 = CUFF.97280|CUFF.97280.1
      s_7_sequence:
      PPP4R1 ENST00000400555 = CUFF.98756|CUFF.98756.2
      However, those aren't the CUFF id's we had for the XLOC we are looking at. Sooo, I went back to the gene_exp.diff file and searched for PPP4R1 there and, lo and behold found it.

      test_id gene locus status value_1 value_2 log(fold_change) test_stat p_value significant
      XLOC_000165 PPP4R1 chr18:9536822-9604544 OK 19.4349 21.174 0.0857056 1.16386 0.24448 no
      There are some in the ~2700 that are genuinely un-annotatable as the ensembl gtf either doesn't have a gene there (although occasionally ucsc does) or where there are a few overlapping genes in that area which might make annotation difficult. The ones I've listed were obviously cherry picked as ones where there was only a single gene in the locus, with good coverage and where the reads align directly with the exons in the area.


      So, it just occurred to me to sort the gene_exp.diff by chromosome and location and I found this:

      XLOC_042941 ,SLC9A3R2 chr16:2017016-2029027 OK 53.9192 65.1133 0.188642 2.1697 0.0300296 no
      XLOC_043323 NTHL1 chr16:2017016-2029027 OK 0.000487772 1.57884 8.08235 580.701 0 yes
      That's the badly annotated gene from earlier. Next to a correctly (mostly if we ignore that spurious comma) annotated locus as well.

      There is also a:

      XLOC_043324 NTHL1 chr16:2029838-2037837 NOTEST 8.56769 7.57285 -0.123429 -0.161635 0.871593 no
      Which is in the right place.

      So, I looked for CCDC36 - the gene with the huge expression from earlier.

      It has these:

      XLOC_048941 CCDC36 chr3:49108372-49117225 OK 2.8332 3.19538 0.120299 18.6285 0 yes
      XLOC_048942 CCDC36 chr3:49108372-49117225 OK 7.51669 6.85147 -0.0926624 -8.4574 0 yes
      XLOC_048943 CCDC36 chr3:49108372-49117225 OK 2.33E-008 5.91318 19.3523 2546.34 0 yes
      XLOC_049430 QARS chr3:49108372-49117225 OK 34.9382 39.5405 0.123745 1.91556 0.0554212 no
      XLOC_048944 CCDC36 chr3:49120517-49129937 NOTEST 1.15265 0.749238 -0.430759 -18.8474 0 no
      XLOC_048945 CCDC36 chr3:49120517-49129937 NOTEST 0.900771 1.22835 0.310178 18.803 0 no
      XLOC_049431 USP19 chr3:49120517-49129937 NOTEST 16.4928 18.7536 0.128463 0.252713 0.80049 no
      Aha! Interesting. It was my understanding that I should have a single row per gene with the sum of the different transcripts. Lots of locus which are exactly the same but assigned to different XLOC. Since one of them seems to have been assigned an almost zero value in one sample but not in the other this causes a ridiculous fold change. But if we were to naively add up the FPKM's from each row then it's a much more sensible 1.22 fold change difference for the region chr3:49108372-49117225 and 1.12 fold change for the region chr3:49120517-49129937.

      So, my current theory would be that the reason I'm getting weird expression values is *also* because of the annotation and that when one of the XLOCs ends up with a very low value in one sample but anything that is not a very low value in the other sample we get a huge fold change.

      Originally posted by Cole Trapnell View Post
      This is a known bug that has been fixed in our development builds, and will be resolved in 0.8.3. We introduced the bug in 0.8.2, IIRC.
      Excellent, one less thing to worry about.

      Originally posted by Cole Trapnell View Post
      What GTF file did you feed to cuffdiff? Is this the combined GTF from cuffcompare, or is it the reference? Can you please email me both the GTF and the SAM alignments from this locus in both samples? We'll take a closer look.
      My cuffdiff command was:

      cuffdiff -p 14 summarystats.combined.gtf s_6_sequence/accepted_hits.sam s_7_sequence/accepted_hits.sam

      which accepts the combined.gtf file from cuffcompare along with the accepted_hits.sam files from s_6_sequence and s_7_sequence.

      I have created accepted_hits.sam files for the region chr3:49108296-49130013 (which represents the two regions above CCDC36 was in plus some padding, i.e. chr3:49108372-49129937 + 76bp either side).

      AWK used to do that:

      cat accepted_hits.sam | awk '$3=="chr3" && $4 >= 49108296 && $4 <= 49130013' > accepted_hits_ccdc36.sam
      Files hosted here

      Comment


      • #4
        I am wondering why you didn't run tophat with the corresponding GFF file. Have you tried this to see if reads then get assigned to the correctly annotated transcript.

        Comment


        • #5
          Originally posted by thinkRNA View Post
          I am wondering why you didn't run tophat with the corresponding GFF file. Have you tried this to see if reads then get assigned to the correctly annotated transcript.
          Funny you mention that. That was actually my initial approach where I first noticed the discrepancies I mentioned above.

          I had first tried using the script from http://seqanswers.com/forums/showthr...highlight=GFF3 and running:

          perl gtf_to_gff.pl Homo_sapiens.NCBI36.51.fixed.gtf > Homo_sapiens.NCBI36.51.gff
          followed by:

          tophat --solexa1.3-quals -p 14 -G Homo_sapiens.NCBI36.51.gff -o s_6_sequence hg18 s_6_sequence.txt
          tophat --solexa1.3-quals -p 14 -G Homo_sapiens.NCBI36.51.gff -o s_7_sequence hg18 s_7_sequence.txt
          I later reran it with the details I posted earlier to try and remove as many external factors as I could before posting - so in summary it doesn't stop it happening as that was what I was doing when I first noticed it.

          My understanding of the process, however, was that the GFF files merely provided additional reference junctions which bowtie could map the reads against rather than relying only on the ones that tophat detects itself. Providing the reads were ending up in the right place (which I believe they are upon inspection of the coverage.wig files) I would have imagined that this would not have significantly affected cufflink's ability to make transfrags and annotate them.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Current Approaches to Protein Sequencing
            by seqadmin


            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
            04-04-2024, 04:25 PM
          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 04-11-2024, 12:08 PM
          0 responses
          30 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 10:19 PM
          0 responses
          32 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 09:21 AM
          0 responses
          28 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-04-2024, 09:00 AM
          0 responses
          53 views
          0 likes
          Last Post seqadmin  
          Working...
          X