SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
How to choose Tophat/Cufflinks Library Type for SOLiD data? mart555 Bioinformatics 3 10-13-2011 10:35 AM
Bfast jobs for analyzing AB's SOLiD data vs Illumina data genome_anawk1 Bioinformatics 1 08-24-2011 09:05 AM
SHRiMP and Cufflinks for strand-specific SOLiD bacteria RNA Seq questions pjorth Bioinformatics 0 05-17-2011 12:45 PM
Cufflinks library-type for SOLiD strand specific single data gengen Bioinformatics 1 10-19-2010 08:41 AM
ZOOM released (supporting both Illumina data and ABI SOLiD data) spirit Bioinformatics 2 08-21-2008 06:48 AM

Reply
 
Thread Tools
Old 01-06-2010, 05:14 PM   #1
lgoff
Member
 
Location: Cambridge, MA

Join Date: Feb 2008
Posts: 82
Default Using Cufflinks with AB SOLiD Data?

Has anyone tried to use cufflinks to assemble isoforms from SOLiD RNA-Seq data? Now that Bowtie supports colorspace reads, I am trying to take this output and process the alignments through cufflinks with little success.

I understand that TopHat adds the required XA:i:[+-] tag to the alignments, which I am able to add due to this being a strand-specific library. Whether or not I add this tag myself, when I run cufflinks, no output is reported (other than headers) into the output files.

Anyone had this issue or dealing with transcript assembly in SOLiD? Any help is appreciated...
lgoff is offline   Reply With Quote
Old 03-08-2010, 10:09 PM   #2
Haneko
Member
 
Location: Singapore

Join Date: Jan 2010
Posts: 36
Default

I'm actually trying to use the BioScope output for RNA-seq for cufflinks now.

In my sam file, it doesn't contain the "XS:A:" flag that is said to be a must to have from the documentation i read for the splice reads. When I ran cufflinks, I was expecting an error, but nothing popped out. Does this mean the flag is not required?
Haneko is offline   Reply With Quote
Old 03-09-2010, 04:53 AM   #3
damiankao
Member
 
Location: UK

Join Date: Jan 2010
Posts: 49
Default

I am using Bioscope mapping output .bam files as input into cufflinks. You have to first convert to .sam file, clean it up, and added the strand information by parsing the bitwise flag.

I was able to run this cleaned up version of .sam file through cufflinks with pretty good results. The only problem I am having is that most of the output is not showing any strand information.

I think cufflink is only using strand information for spliced reads and ignoring unspliced read strand? So all the genes assembled with spliced read has strand information, but others don't?
damiankao is offline   Reply With Quote
Old 03-09-2010, 04:10 PM   #4
Haneko
Member
 
Location: Singapore

Join Date: Jan 2010
Posts: 36
Default

Hi damiankao. For SOLiD data, do you use the full SAM file including all unmapped, unique mapped and multimapped reads? Or do you only use the set that contains the SOLiD-defined unique alignments (best score, and score >4 away from second best)?

EDIT:

I forgot to mention. In the manual, it states that cufflinks is unable to support other operations such as clipping. SOLiD bioscope whole transcriptome analysis includes hard clipping (H) in the CIGAR string for the output. May I ask how you dealt with that?

Thanks!

Last edited by Haneko; 03-09-2010 at 11:03 PM.
Haneko is offline   Reply With Quote
Old 03-09-2010, 11:42 PM   #5
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 199
Default

Quote:
Originally Posted by lgoff View Post
Has anyone tried to use cufflinks to assemble isoforms from SOLiD RNA-Seq data? Now that Bowtie supports colorspace reads, I am trying to take this output and process the alignments through cufflinks with little success.

I understand that TopHat adds the required XA:i:[+-] tag to the alignments, which I am able to add due to this being a strand-specific library. Whether or not I add this tag myself, when I run cufflinks, no output is reported (other than headers) into the output files.

Anyone had this issue or dealing with transcript assembly in SOLiD? Any help is appreciated...
Bowtie does not yet report gapped alignments; this is future work.
Won't this affect you if you apply it to RNA-seq data?
KevinLam is offline   Reply With Quote
Old 03-10-2010, 01:18 AM   #6
damiankao
Member
 
Location: UK

Join Date: Jan 2010
Posts: 49
Default

I didn't change the CIGAR string at all. I assumed that cufflinks will just ignore anything that's not M or N. The basepair sequence and quality score correspond only to the Ms in the CIGAR string.

I am having memory issues with cufflinks right now though. I did several sucessful runs with about 100 million mapped reads (~30gig .sam file). But I am getting allocating new memory error now with ~200million mapped reads (~74gig .sam file).
damiankao is offline   Reply With Quote
Old 03-16-2010, 10:58 PM   #7
Haneko
Member
 
Location: Singapore

Join Date: Jan 2010
Posts: 36
Default

I'm using BioScope SAM output for cufflinks and getting a strange message when running:

Counting hits in map
CIGAR op has zero length
CIGAR op has zero length
..

There are few such lines ('CIGAR op has zero length'), and I'm not sure what they mean. Can someone please help?
Haneko is offline   Reply With Quote
Old 03-16-2010, 11:51 PM   #8
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by Haneko View Post
I'm using BioScope SAM output for cufflinks and getting a strange message when running:

Counting hits in map
CIGAR op has zero length
CIGAR op has zero length
..

There are few such lines ('CIGAR op has zero length'), and I'm not sure what they mean. Can someone please help?
It may mean the SAM entry is incorrect. Could you post the line under question?
nilshomer is offline   Reply With Quote
Old 03-17-2010, 01:04 AM   #9
Haneko
Member
 
Location: Singapore

Join Date: Jan 2010
Posts: 36
Default

Unfortunately, I can't pinpoint which line it is referring to because my input file is rather big. Here is a larger chunk of the output:

Counting hits in map
CIGAR op has zero length
CIGAR op has zero length
Total map density: 1335893.196411
Processing bundle [ chrX:2712030-2712060 ] with 2 non-redundant alignments
Filtering bundle introns, avg bundle doc = 1.166667, thresh = 0.058333
Intron filtering pass finished
Filtering forward strand
Initial filter pass complete
Updated avg bundle doc = nan
threshold is = nan
Filtering reverse strand
Initial filter pass complete
Updated avg bundle doc = 1.166667
threshold is = 0.175000
Saw reverse strand only
No introns in bundle, collapsing all hits to single transcript
Calculating abundances
Calculating intial MLE
Tossing likely garbage isoforms
Revising MLE
Importance sampling posterior distribution
1 isoforms with 1 abundances
Considering isoform with FMI 1.000000
Processing bundle [ chrX:2767648-2767776 ] with 3 non-redundant alignments
Filtering bundle introns, avg bundle doc = 1.489583, thresh = 0.074479
....

Is there a way to find the line giving the error using these information?
Haneko is offline   Reply With Quote
Old 03-17-2010, 01:32 AM   #10
damiankao
Member
 
Location: UK

Join Date: Jan 2010
Posts: 49
Default

Bioscope sam output includes unmapped reads. Lines that have no CIGAR string and no chromosome/contig ID are the umapped reads. I have no idea why BIoscope decided to include them, but you have to filter them out.

I ran about 250 million Bioscope mapped reads few days ago on our university's maths server. Cufflinks needed about 60-70 gigs of memory for that amount of reads. But at least it ran and gave me results. Now I just have to wade through 800,000 features that it predicted and filter out all the crap.

Last edited by damiankao; 03-17-2010 at 01:35 AM.
damiankao is offline   Reply With Quote
Old 03-17-2010, 02:10 AM   #11
Haneko
Member
 
Location: Singapore

Join Date: Jan 2010
Posts: 36
Default

I've already removed all the non-mappable reads from the SAM file when it threw me the error. =(
Haneko is offline   Reply With Quote
Old 03-17-2010, 02:17 AM   #12
damiankao
Member
 
Location: UK

Join Date: Jan 2010
Posts: 49
Default

Are you sure you have no alignments with empty CIGAR string? They usually are at the end of the sam file.
damiankao is offline   Reply With Quote
Old 03-17-2010, 02:20 AM   #13
Haneko
Member
 
Location: Singapore

Join Date: Jan 2010
Posts: 36
Default

Yes, I'm quite certain. I reran cufflinks using the reads that map to chrX, and it still gave me the error.
Haneko is offline   Reply With Quote
Old 03-17-2010, 02:27 AM   #14
damiankao
Member
 
Location: UK

Join Date: Jan 2010
Posts: 49
Default

This doesn't seem likely to me, but do you have any alignments where the CIGAR string contains no 'M'?
damiankao is offline   Reply With Quote
Old 03-17-2010, 10:15 PM   #15
Xi Wang
Senior Member
 
Location: MDC, Berlin, Germany

Join Date: Oct 2009
Posts: 317
Default

Quote:
Originally Posted by Haneko View Post
Unfortunately, I can't pinpoint which line it is referring to because my input file is rather big. Here is a larger chunk of the output:

Counting hits in map
CIGAR op has zero length
CIGAR op has zero length
Total map density: 1335893.196411
Processing bundle [ chrX:2712030-2712060 ] with 2 non-redundant alignments
Filtering bundle introns, avg bundle doc = 1.166667, thresh = 0.058333
Intron filtering pass finished
Filtering forward strand
Initial filter pass complete
Updated avg bundle doc = nan
threshold is = nan
Filtering reverse strand
Initial filter pass complete
Updated avg bundle doc = 1.166667
threshold is = 0.175000
Saw reverse strand only
No introns in bundle, collapsing all hits to single transcript
Calculating abundances
Calculating intial MLE
Tossing likely garbage isoforms
Revising MLE
Importance sampling posterior distribution
1 isoforms with 1 abundances
Considering isoform with FMI 1.000000
Processing bundle [ chrX:2767648-2767776 ] with 3 non-redundant alignments
Filtering bundle introns, avg bundle doc = 1.489583, thresh = 0.074479
....

Is there a way to find the line giving the error using these information?
You can use this script to find the empty CIGAR reads, although it will be a little bit slow.

Code:
sort -k6,6 <your_file.sam> | more
__________________
Xi Wang
Xi Wang is offline   Reply With Quote
Old 03-18-2010, 05:17 PM   #16
Haneko
Member
 
Location: Singapore

Join Date: Jan 2010
Posts: 36
Default

I'm getting the following using your code:

1206_912_423 16 chrX 148852770 255 10H10M101N30M * 0 0 CTCCCGTAGCCTTGATGGTCTGCTGCTTCCGTCTGTCACT ,GA%%:IIIIIIIIIIIIIIIIIIIIIIIIIIII
IIIIII CS:Z:T32112112213020231231221013210203231320221310310031 XJ:Z:K CQ:Z:<<::9@9=:?==;:=>>>=:>9>695;;773:885&%*80,/&7&())6( XL:Z:39,39 XU:Z:3,1 IH:i:2 HI
:i:2 MD:Z:40 XS:A:-
922_1240_1515 16 chrX 119563391 255 10H10M1029N30M * 0 0 TGATCATGATCATTTGTCTGCAATGGTTTTGCCAGCATCT "C?H?'';?&&A?"""IIIIIIIIIIIIIIIIII
IIIIII CS:Z:T32231321031000101301312213103133211123213222112001 XJ:Z:K CQ:Z::>>:>:?<>;==::<9=;>>9><:&4,6&.2*',45+9()50)'&*&2 XL:Z:39 XU:Z:4 IH:i:1 HI:i:1 MD:Z:40 XS
:A:-
1297_662_654 0 chrX 153279920 255 10H10M102N26M4H * 0 0 CTTCGGTGTGCCACTGAAGATCCTGGTGTCGCCATG 1IIEIIIIC?III&&III&&4?I:;BDI=+.;I=3% CS
:Z:T20331231203202301111301111202132021011123301313032 XJ:Z:K CQ:Z:@@96564=5/919428;7>&:78=&:585&+*66%7,98&&)38&.%8,+ XL:Z:30,35 XU:Z:2,2 IH:i:2 HI:i:2 MD:Z:36 XS
:A:+
1289_854_1683 16 chrX 153666617 255 10H10M1046N30M * 0 0 TGCCACTCGCCATTCCTGCAGCTCAGGGGAAGGGATCAAT '<A;5<IB9;@IDH((IIHIIIIGGIIIIIIIII
IIIIII CS:Z:T33012320020200021223213122203103322110313223332222 XJ:Z:K CQ:Z:AA;A>;9>>?;?6:3:.:;4872:7(=,98)3'<7&0,6')1'5/1.)4/ XL:Z:39 XU:Z:1 IH:i:1 HI:i:1 MD:Z:40 XS
:A:-
1409_132_757 16 chrX 153666617 255 10H10M1046N30M * 0 0 TGCCACTCTACATTCCTGCAGCTCAGGGGAAGGGATCAAT "9:<?;G%###"IF##GFIIIAGIECIIIIIIII
IIIIII CS:Z:T33012320020200021223213121203213222110311113332022 XJ:Z:K CQ:Z:?><@;9<>?8:>5<31553/<7526#619&#/%71+5(3'$&%:4&&-44 XL:Z:39 XU:Z:4 IH:i:1 HI:i:1 MD:Z:8TA30
XS:A:-
1125_1188_1449 16 chrX 53458535 255 10H10M110N30M * 0 0 GAAGAACCTCCTACAATGACACGGGCAAAGGTACGGTCCT &-<I<?##E@)/<>"""/:?IIIIIIIIIIIIII
IIIIII CS:Z:T32021031310200130031112113113102231022021112101031 XJ:Z:K CQ:Z:;=:<=?A:?>@<=8=5:==<.2)'/*7(5/)8.#:&75(&*6#)9$8$#8 XL:Z:39 XU:Z:4 IH:i:1 HI:i:1 MD:Z:40 XS
:A:-
.
.
.
Haneko is offline   Reply With Quote
Old 03-18-2010, 09:17 PM   #17
Xi Wang
Senior Member
 
Location: MDC, Berlin, Germany

Join Date: Oct 2009
Posts: 317
Default

It seems that this problem is caused by Cufflinks can't deal with the hard clip (H).
Besides, as your CIGAR strings are all right, the error message "CIGAR op has zero length" is not quite distinguishable.
__________________
Xi Wang

Last edited by Xi Wang; 03-18-2010 at 09:20 PM.
Xi Wang is offline   Reply With Quote
Old 03-19-2010, 02:01 AM   #18
damiankao
Member
 
Location: UK

Join Date: Jan 2010
Posts: 49
Default

My sam files have H in the CIGAR string and it worked fine.

Is what you copied and pasted into your post exactly as it looks in the sam file? I noticed there are some newline breaks in the attribute column that doesn't look like a natural linebreak imposed by the forum.
damiankao is offline   Reply With Quote
Old 03-22-2010, 07:10 PM   #19
Haneko
Member
 
Location: Singapore

Join Date: Jan 2010
Posts: 36
Default

no, actually. it's because i directly copied it from the window of my ssh. So there aren't any line breaks there. sorry for the confusion.
Haneko is offline   Reply With Quote
Old 04-26-2010, 08:04 AM   #20
xguo
Member
 
Location: Maryland

Join Date: Jul 2008
Posts: 48
Default

Quote:
Originally Posted by damiankao View Post
I am using Bioscope mapping output .bam files as input into cufflinks. You have to first convert to .sam file, clean it up, and added the strand information by parsing the bitwise flag.

I was able to run this cleaned up version of .sam file through cufflinks with pretty good results. The only problem I am having is that most of the output is not showing any strand information.

I think cufflink is only using strand information for spliced reads and ignoring unspliced read strand? So all the genes assembled with spliced read has strand information, but others don't?
Bioscope output includes two separate records in SAM file if a read aligns to a splice junction. I suppose what cufflink expects for spliced reads is one SAM record per junction read with CIGAR string ##M###N##M. Thus, reads mapped to splice junction in Bioscope output are not treated as spliced reads by cufflink and strand information is not used at all. As cufflink author Cole stated, "You should do your best to feed Cufflinks spliced alignments that are stranded with the XS". It may be necessary to merge two junction read records in Bioscope output into one with CIGAR string "##M###N##M".

Any thought on this issue?
xguo is offline   Reply With Quote
Reply

Tags
bowtie, cufflinks, solid, tophat, transcript assembly

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:02 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO