SEQanswers

Go Back   SEQanswers > Sequencing Technologies/Companies > Illumina/Solexa



Similar Threads
Thread Thread Starter Forum Replies Last Post
My reads are mapping to the wrong strand? rdsqc22 RNA Sequencing 5 07-23-2014 02:43 PM
Wrong exon number information from tophat-fusion angerusso RNA Sequencing 0 02-07-2014 10:46 AM
Bowtie Mapping Wrong Strand Dario1984 Bioinformatics 1 02-01-2013 04:05 AM
cufflinks transcript on wrong strand ornam Bioinformatics 2 12-13-2010 12:33 PM
Bowtie, BWA, strand information Zimbobo Bioinformatics 6 04-24-2010 08:30 AM

Reply
 
Thread Tools
Old 10-23-2015, 07:20 AM   #1
smcinturf
Junior Member
 
Location: Columbia, Mo

Join Date: May 2012
Posts: 5
Default strand information and cuffdiff failures - can't tell what is wrong

I am having a problem with Cuffdiff where I do not get any DE genes and I have no FPKM values in the tracking files. I am working with Arabidopsis wild type treated vs untreated, the sequencing was done with 100bp Illumina reads with a strand specific library. I have been testing with single technical replicate libraries (biological replicate split across three lanes) to decrease computation time, but each tech rep has ~10 million reads, which should be enough to give me a non-zero fpkm...

My pipeline is below; [...] notes a [B]valid[\B] path to each file, foo and bar represent the file names.
I use xargs to execute everything in parallel when appropriate, if that matters to anyone.

Tophat2:
tophat2 -p5 -g1 --library-type fr-firststrand -o [...]/foo AraTAIR10 CleanReads/bar.fastq.gz'

Cufflinks:
cufflinks -p5 --library-type fr-firststrand -G [...]/Arabidopsis_thaliana.TAIR10.28.gtf -o [...]/Cufflinks_gtf/foo Tophat_aln/bar.bam'

cuffmerge:
cuffmerge -p 10 -g [...]/Arabidopsis_thaliana.TAIR10.28.gtf --library-type fr-firststrand --ref-sequence [...]/Arabidopsis_thaliana.TAIR10.28.dna.toplevel.fa -o [...]/CuffMerge_out [...]/assemblies.txt

CuffDiff:
cuffdiff -p 7 -o Diff_ensembl_gtf --frag-bias-correct [...]/Arabidopsis_thaliana.TAIR10.28.dna.toplevel.fa -L CR,CRT [...]/Arabidopsis_thaliana.TAIR10.28.gtf [...]/CR1_5s.bam,/[...]/CR2_5s.bam [...]/CRT1_5s.bam,/[...]/CRT2_5s.bam

where AraTAIR10 is the TAIR10 release from ensembl, assemblies.txt contains the full paths to each transcripts.gtf generated by cufflinks (one per line, $ ls */transcripts.gtf > assemblies.txt). for the bam files CR is a base name, omission of a T is for untreated and inclusion of a T is for treated. The 's' indicates that it was sorted by samtools.

1) When mapping with tophat2 I get a 95% mapping rate when I specify --library-type=fr-firststrand. When I specify --library-type=fr-secondstrand I get a 71% mapping. [B]Shouldn't I only have 5% given 95% on the first-strand?[\B]. If I load my data into IGB to look at the read mapping, I see nearly all of the reads mapping to the opposing strand of where the gene lies (please see attached image - I hope I attached it correctly).

2) Looking at -just- the CR1_5 sample.
How many reads were in the library? [grep -c "^@HWI" /scratch/smcintur/CleanReads/CR1_L005_Clean.fastq] - 10,480,194 reads, How many mapped? [srun -p c14 -n 1 --mem=5G samtools flagstat CR1_5s.bam] 7,517,667, about 75% as reported by tophat2 in alignment_summary.txt.
Did cufflinks worked (I think) - note the none-zero FPKMs
$head genes.fpkm_tracking
tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status
AT1G01010 - - AT1G01010 NAC001 - 1:3630-5899 - - 24.6557 22.0522 27.2592 OK
AT1G01030 - - AT1G01030 NGA3 - 1:11648-13714 - - 1.41468 0.833249 1.99611 OK
AT1G01020 - - AT1G01020 ARV1 - 1:5927-8737 - - 14.3169 11.7451 16.8886 OK
AT1G01073 - - AT1G01073 - - 1:44676-44787 - - 0 0 0 OK
AT1G01070 - - AT1G01070 - - 1:38751-40944 - - 14.5392 12.1482 16.9302 OK
AT1G01060 - - AT1G01060 LHY - 1:33378-37871 - - 30.7603 28.327 33.1936 OK
AT1G01080 - - AT1G01080 - - 1:45295-47019 - - 4.40288 3.09571 5.71005 OK
AT1G01040 - - AT1G01040 DCL1 - 1:23145-31227 - - 15.3272 14.2023 16.4521 OK
AT1G01050 - - AT1G01050 PPA1 - 1:31169-33153 - - 104.366 95.7783 112.955 OK

looks good in so far as I can tell.

cuffmerge ran without any reported error, and produced a merged.gtf file. the CR1_5 library has 33,558 lines, and the merged.gtf has 230,044 lines.
[B]$head merged.gtf[\B]
1 Cufflinks exon 3631 3913 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "NAC001"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";
1 Cufflinks exon 3996 4276 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "NAC001"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";
1 Cufflinks exon 4486 4605 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "3"; gene_name "NAC001"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";
1 Cufflinks exon 4706 5095 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "4"; gene_name "NAC001"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";
1 Cufflinks exon 5174 5326 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "5"; gene_name "NAC001"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";
1 Cufflinks exon 5439 5899 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "6"; gene_name "NAC001"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; p_id "P1";

This looks like a valid description of NAC001 looks correct, 6 exons on the positive strand.

When I run cuffdiff (code above)
and look at the fpkms: $cut -f10 genes.fpkm_tracking |sort|uniq
0
CR_FPKM
-nan

and if I look for the testing status of each genes I see: $cut -f7 gene_exp.diff |sort|uniq
NOTEST
status

This is the result for using the merged.gtf or the a fore mentioned gtf file.
[B]So[\B] cufflinks can count my transcripts and get FPKMs, cuffmerge doesn't get angry, and cuffdiff shows that it couldn't count any of the data points. What is going on here
Attached Images
File Type: png AT1G01010_mapping_CR1_5.PNG (15.8 KB, 4 views)
smcinturf is offline   Reply With Quote
Reply

Tags
cuffdiff

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:32 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO