Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • strand information and cuffdiff failures - can't tell what is wrong

    I am having a problem with Cuffdiff where I do not get any DE genes and I have no FPKM values in the tracking files. I am working with Arabidopsis wild type treated vs untreated, the sequencing was done with 100bp Illumina reads with a strand specific library. I have been testing with single technical replicate libraries (biological replicate split across three lanes) to decrease computation time, but each tech rep has ~10 million reads, which should be enough to give me a non-zero fpkm...

    My pipeline is below; [...] notes a [B]valid[\B] path to each file, foo and bar represent the file names.
    I use xargs to execute everything in parallel when appropriate, if that matters to anyone.

    Tophat2:
    tophat2 -p5 -g1 --library-type fr-firststrand -o [...]/foo AraTAIR10 CleanReads/bar.fastq.gz'

    Cufflinks:
    cufflinks -p5 --library-type fr-firststrand -G [...]/Arabidopsis_thaliana.TAIR10.28.gtf -o [...]/Cufflinks_gtf/foo Tophat_aln/bar.bam'

    cuffmerge:
    cuffmerge -p 10 -g [...]/Arabidopsis_thaliana.TAIR10.28.gtf --library-type fr-firststrand --ref-sequence [...]/Arabidopsis_thaliana.TAIR10.28.dna.toplevel.fa -o [...]/CuffMerge_out [...]/assemblies.txt

    CuffDiff:
    cuffdiff -p 7 -o Diff_ensembl_gtf --frag-bias-correct [...]/Arabidopsis_thaliana.TAIR10.28.dna.toplevel.fa -L CR,CRT [...]/Arabidopsis_thaliana.TAIR10.28.gtf [...]/CR1_5s.bam,/[...]/CR2_5s.bam [...]/CRT1_5s.bam,/[...]/CRT2_5s.bam

    where AraTAIR10 is the TAIR10 release from ensembl, assemblies.txt contains the full paths to each transcripts.gtf generated by cufflinks (one per line, $ ls */transcripts.gtf > assemblies.txt). for the bam files CR is a base name, omission of a T is for untreated and inclusion of a T is for treated. The 's' indicates that it was sorted by samtools.

    1) When mapping with tophat2 I get a 95% mapping rate when I specify --library-type=fr-firststrand. When I specify --library-type=fr-secondstrand I get a 71% mapping. [B]Shouldn't I only have 5% given 95% on the first-strand?[\B]. If I load my data into IGB to look at the read mapping, I see nearly all of the reads mapping to the opposing strand of where the gene lies (please see attached image - I hope I attached it correctly).

    2) Looking at -just- the CR1_5 sample.
    How many reads were in the library? [grep -c "^@HWI" /scratch/smcintur/CleanReads/CR1_L005_Clean.fastq] - 10,480,194 reads, How many mapped? [srun -p c14 -n 1 --mem=5G samtools flagstat CR1_5s.bam] 7,517,667, about 75% as reported by tophat2 in alignment_summary.txt.
    Did cufflinks worked (I think) - note the none-zero FPKMs
    $head genes.fpkm_tracking
    tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status
    AT1G01010 - - AT1G01010 NAC001 - 1:3630-5899 - - 24.6557 22.0522 27.2592 OK
    AT1G01030 - - AT1G01030 NGA3 - 1:11648-13714 - - 1.41468 0.833249 1.99611 OK
    AT1G01020 - - AT1G01020 ARV1 - 1:5927-8737 - - 14.3169 11.7451 16.8886 OK
    AT1G01073 - - AT1G01073 - - 1:44676-44787 - - 0 0 0 OK
    AT1G01070 - - AT1G01070 - - 1:38751-40944 - - 14.5392 12.1482 16.9302 OK
    AT1G01060 - - AT1G01060 LHY - 1:33378-37871 - - 30.7603 28.327 33.1936 OK
    AT1G01080 - - AT1G01080 - - 1:45295-47019 - - 4.40288 3.09571 5.71005 OK
    AT1G01040 - - AT1G01040 DCL1 - 1:23145-31227 - - 15.3272 14.2023 16.4521 OK
    AT1G01050 - - AT1G01050 PPA1 - 1:31169-33153 - - 104.366 95.7783 112.955 OK

    looks good in so far as I can tell.

    cuffmerge ran without any reported error, and produced a merged.gtf file. the CR1_5 library has 33,558 lines, and the merged.gtf has 230,044 lines.
    [B]$head merged.gtf[\B]
    1 Cufflinks exon 3631 3913 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "NAC001"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";
    1 Cufflinks exon 3996 4276 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "NAC001"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";
    1 Cufflinks exon 4486 4605 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "3"; gene_name "NAC001"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";
    1 Cufflinks exon 4706 5095 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "4"; gene_name "NAC001"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";
    1 Cufflinks exon 5174 5326 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "5"; gene_name "NAC001"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";
    1 Cufflinks exon 5439 5899 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "6"; gene_name "NAC001"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";

    This looks like a valid description of NAC001 looks correct, 6 exons on the positive strand.

    When I run cuffdiff (code above)
    and look at the fpkms: $cut -f10 genes.fpkm_tracking |sort|uniq
    0
    CR_FPKM
    -nan

    and if I look for the testing status of each genes I see: $cut -f7 gene_exp.diff |sort|uniq
    NOTEST
    status

    This is the result for using the merged.gtf or the a fore mentioned gtf file.
    [B]So[\B] cufflinks can count my transcripts and get FPKMs, cuffmerge doesn't get angry, and cuffdiff shows that it couldn't count any of the data points. What is going on here
    Attached Files

Latest Articles

Collapse

  • seqadmin
    Strategies for Sequencing Challenging Samples
    by seqadmin


    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
    03-22-2024, 06:39 AM
  • seqadmin
    Techniques and Challenges in Conservation Genomics
    by seqadmin



    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

    Avian Conservation
    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
    03-08-2024, 10:41 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Yesterday, 06:37 PM
0 responses
8 views
0 likes
Last Post seqadmin  
Started by seqadmin, Yesterday, 06:07 PM
0 responses
8 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-22-2024, 10:03 AM
0 responses
49 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-21-2024, 07:32 AM
0 responses
66 views
0 likes
Last Post seqadmin  
Working...
X