SEQanswers

07-06-2012, 01:52 AM   #1
Azazel
Member
 
Location: Japan

Join Date: Oct 2010
Posts: 52
Most straightforward way to prepare input files for PhyloCSF (rat rn4)?

Core question: I did a de novo transcriptome assembly with cufflinks for rat based on RNA-Seq data. A couple of thousand transcripts do not overlap annotated genes, and I would like to divide these into putatively coding and putatively non-coding RNAs using PhyloCSF.

I find it difficult to prepare the input files for PhyloCSF. What would be a straightforward way to do this?

What I already tried: as far as I understand, the input needs to be a multiple alignment of the orthologous loci, and the rat sequence needs to be ungapped.

I would like to avoid doing my own multi-genome alignment if at all possible, so I searched UCSC. At http://hgdownload.cse.ucsc.edu/golde...n4/multiz9way/ I found that they already provide a multi-genome alignment of the rat genome against:
  • mouse (Feb 2006, mm8)
  • human (Mar 2006, hg18)
  • dog (May 2005, canFam2)
  • cow (Mar 2005, bosTau2)
  • opossum (Jan 2006, monDom4)
  • chicken (Feb 2004, galGal2)
  • frog (Oct 2004, xenTro1)
  • zebrafish (May 2005, danRer3)
PhyloCSF does not offer this phylogeny, and according to https://github.com/mlin/PhyloCSF/wiki/ it is not directly possible to supply my own phylogeny. However, PhyloCSF does support the 29 mammals phylogeny.

So if I want to go with this approach, I would need to:
  • remove zebrafish, frog, chicken, opossum (how?)
  • make it so that the rat part is ungapped (is this even possible? how?)
  • extract the part of the remaining multi-alignment that corresponds to the transcript I want to test (how?)
Or, maybe this approach is too convoluted anyway? Any help or suggestions for a better strategy would be much appreciated!

P.S.: I am using rn4 coordinates.
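
The first two steps in the list above (dropping species and making the rat row ungapped) can be sketched in Python. This is only an illustration under assumptions: the alignment block is taken to be already extracted and held as a dict of species name to aligned sequence, the helper names `prepare_alignment` and `to_fasta` and the toy sequences are made up, and the species labels are the UCSC assembly names, which probably need renaming to whatever names the chosen PhyloCSF phylogeny expects. Extracting the blocks for a transcript from the MAF files is not covered here.

```python
def prepare_alignment(block, reference, drop):
    """Drop unwanted species and remove columns where the reference is gapped.

    block: dict mapping species name -> aligned sequence (all the same length).
    reference: species whose row must end up ungapped (here "rn4").
    drop: iterable of species names to remove; names not present are ignored.
    """
    drop = set(drop)
    kept = {name: seq for name, seq in block.items() if name not in drop}
    ref_seq = kept[reference]
    # Keep only the alignment columns in which the reference has a base.
    columns = [i for i, base in enumerate(ref_seq) if base != "-"]
    return {name: "".join(seq[i] for i in columns) for name, seq in kept.items()}


def to_fasta(alignment, reference):
    """Format the alignment as multi-FASTA text with the reference row first."""
    names = [reference] + [n for n in alignment if n != reference]
    return "".join(f">{n}\n{alignment[n]}\n" for n in names)


# Toy block (hypothetical sequences), labelled with UCSC assembly names.
block = {
    "rn4": "ATG-CCTGA",
    "mm8": "ATGACCTGA",
    "hg18": "ATG-CTTGA",
    "danRer3": "AT--CCTGA",
}
result = prepare_alignment(
    block, "rn4", ["danRer3", "xenTro1", "galGal2", "monDom4"]
)
print(to_fasta(result, "rn4"))
```

Removing the columns where rat has a gap discards insertions in the other species relative to rat, which is what "ungapped reference" amounts to here.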
07-11-2012, 01:59 PM   #2
liguow
Member
 
Location: Houston

Join Date: Apr 2009
Posts: 12

The most straightforward way is not to use PhyloCSF. Instead, use CPAT: you only need the mRNA sequences or genome coordinates (which you already have if you have already rebuilt the transcriptome).
http://code.google.com/p/cpat/

It's accurate, efficient and convenient.
10-23-2012, 05:45 AM   #3
sqcrft
Member
 
Location: boston

Join Date: May 2012
Posts: 29
Is it published?

The software is not published yet, right?
Is there a paper?

10-23-2012, 07:06 AM   #4
liguow
Member
 
Location: Houston

Join Date: Apr 2009
Posts: 12

Quote:
Originally Posted by sqcrft
the software is not published yet, right?
is there a paper?
Not yet. The manuscript is under review now.
10-23-2012, 09:16 AM   #5
sqcrft
Member
 
Location: boston

Join Date: May 2012
Posts: 29

Thanks for your fast reply.

Is there a direct way to connect CPAT with cufflinks/cuffmerge/cuffcompare?

As far as I can see, the 12-column BED format required by your software differs from the GTF gene annotation file I downloaded from UCSC. What are the meanings of columns 7-12?

It also differs from the output of cufflinks/cuffcompare/cuffmerge.

Is there a way to analyze the output of the cufflinks suite directly in CPAT? It would significantly improve the usability of your software.

10-23-2012, 09:25 AM   #6
sqcrft
Member
 
Location: boston

Join Date: May 2012
Posts: 29
My fault

Your BED format is the standard BED format.
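
For reference, the twelve columns of the standard BED format as documented by UCSC can be read roughly like this. It is a minimal Python sketch: the helper names `parse_bed12` and `exons` and the example line are made up for illustration, and nothing here is taken from CPAT itself.

```python
from typing import List, NamedTuple, Tuple


class Bed12(NamedTuple):
    chrom: str               # 1: chromosome or scaffold name
    chrom_start: int         # 2: feature start, 0-based
    chrom_end: int           # 3: feature end, exclusive
    name: str                # 4: feature (transcript) name
    score: str               # 5: score, 0-1000
    strand: str              # 6: "+", "-" or "."
    thick_start: int         # 7: start of the thick-drawn part (typically the CDS)
    thick_end: int           # 8: end of the thick-drawn part
    item_rgb: str            # 9: display colour, e.g. "0" or "255,0,0"
    block_count: int         # 10: number of blocks (exons)
    block_sizes: List[int]   # 11: block lengths
    block_starts: List[int]  # 12: block starts relative to chrom_start


def _int_list(field: str) -> List[int]:
    """Parse a comma-separated list that may end with a trailing comma."""
    return [int(x) for x in field.split(",") if x]


def parse_bed12(line: str) -> Bed12:
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("expected 12 tab-separated columns")
    return Bed12(f[0], int(f[1]), int(f[2]), f[3], f[4], f[5],
                 int(f[6]), int(f[7]), f[8], int(f[9]),
                 _int_list(f[10]), _int_list(f[11]))


def exons(rec: Bed12) -> List[Tuple[int, int]]:
    """Absolute (start, end) coordinates of each block, 0-based half-open."""
    return [(rec.chrom_start + s, rec.chrom_start + s + size)
            for s, size in zip(rec.block_starts, rec.block_sizes)]


# Hypothetical two-exon transcript.
line = "chr1\t1000\t5000\tTCONS_00000001\t0\t+\t1000\t5000\t0\t2\t500,700,\t0,3300,"
rec = parse_bed12(line)
print(exons(rec))
```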

Now my only question is how to combine the cufflinks suite with your software to identify novel coding and non-coding transcripts.

It seems that I still have to write some code to convert the cufflinks output to the format required by CPAT. That does not seem too difficult, though.
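
A rough idea of what such a conversion could look like in Python, as a sketch only. It assumes the GTF has `exon` records carrying a `transcript_id` attribute and that every transcript sits on a single chromosome and strand. The function name `gtf_to_bed12` and the example records are made up. Since no CDS is known for these transcripts, thickStart and thickEnd are both set to the transcript start as a placeholder; that choice is an assumption, not something CPAT is known to require.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_to_bed12(gtf_lines):
    """Collapse GTF exon records into one BED12 line per transcript."""
    exons = defaultdict(list)   # transcript_id -> list of (start, end), 0-based half-open
    info = {}                   # transcript_id -> (chrom, strand)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        tid = match.group(1)
        # GTF is 1-based inclusive; BED is 0-based half-open.
        exons[tid].append((int(fields[3]) - 1, int(fields[4])))
        info[tid] = (fields[0], fields[6])

    bed = []
    for tid, blocks in exons.items():
        blocks.sort()
        chrom, strand = info[tid]
        strand = strand if strand in ("+", "-") else "."
        start, end = blocks[0][0], blocks[-1][1]
        sizes = ",".join(str(e - s) for s, e in blocks) + ","
        starts = ",".join(str(s - start) for s, _ in blocks) + ","
        # thickStart == thickEnd == start: placeholder, no CDS annotated.
        row = [chrom, start, end, tid, 0, strand, start, start, 0,
               len(blocks), sizes, starts]
        bed.append("\t".join(str(x) for x in row))
    return bed


# Hypothetical records in the style of a cufflinks transcripts.gtf.
attrs = 'gene_id "XLOC_000001"; transcript_id "TCONS_00000001";'
gtf = [
    f"chr1\tCufflinks\ttranscript\t1001\t5000\t1000\t+\t.\t{attrs}\n",
    f'chr1\tCufflinks\texon\t4301\t5000\t1000\t+\t.\t{attrs} exon_number "2";\n',
    f'chr1\tCufflinks\texon\t1001\t1500\t1000\t+\t.\t{attrs} exon_number "1";\n',
]
bed_lines = gtf_to_bed12(gtf)
print("\n".join(bed_lines))
```

Sorting the exons by start before computing block sizes and offsets keeps the blocks in ascending order even when the GTF lists them out of order.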


10-24-2012, 11:47 AM   #7
sqcrft
Member
 
Location: boston

Join Date: May 2012
Posts: 29

There is a tool that can convert the cufflinks output GTF to the BED format that CPAT takes as input.
https://lists.soe.ucsc.edu/pipermail...il/025696.html

If you could integrate it into your software, it would be much more user-friendly, especially for beginners like me.



Tags
non-coding, phylocsf, phylogeny
