SEQanswers

Old 08-09-2013, 07:47 AM   #1
tjeffe01
Junior Member
 
Location: Canada

Join Date: Aug 2013
Posts: 4
Default New to Next Gen not sure where to go from here

Hello all,

My supervisor and I recently jumped on the next-gen bandwagon. Another professor in our department purchased lanes on an RNAseq plate and couldn't fill the entire thing, so we took 4 spots. We are working on glyphosate-resistant giant ragweed, for which there is no reference genome and little to no genetic data available.

We have a few questions:
1) What do you recommend we do now?
2) In addition to the whole-transcriptome data, we are also interested specifically in the expression of two or three genes (Catalase, SOD1 Cu/Zn, and EPSPS); however, we don't have giant ragweed sequences for those genes. Can we search our data for those genes and their expression levels, or is that just out of the question?
3) This data will not be in my MSc and I won't be doing much work with it beyond posting here; but I would like to include a short section in my thesis on the data we collect here. Is it possible to get average read lengths, fold number and other quality statistics about the data to include in my thesis, or does that not make any sense?

Thanks everyone,
Taylor
Old 08-09-2013, 07:56 AM   #2
vivek_
PhD Student
 
Location: Denmark

Join Date: Jul 2012
Posts: 164
Default

http://trinityrnaseq.sourceforge.net/

Looks like a good place to start for transcriptome assembly.
Old 08-09-2013, 08:35 AM   #3
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

1) As vivek alluded, if you're thinking of doing some RNAseq, then you'll need to do assembly. Knowing absolutely nothing about this, I would think an interesting experiment would be to compare glyphosate resistant vs. glyphosate sensitive ragweed, but without a reference genome/transcriptome you might need more sequencing capacity to do that easily in one shot (hopefully someone who does assembly can chime in). BTW, if a related plant has been sequenced, you might have luck aligning to that (I haven't a clue how related the various plants are, I work on mice and humans!). This is particularly true for transcriptome alignments, since there's selective pressure on that.
2) After assembly, you'll have to blast the various contigs to try to figure out what they are. Presumably you'll pick up the genes that you're most interested in. There's no great way to search the raw alignments for just a few genes; you'll likely end up getting a lot of false-positive matches.
3) Just a listing of some various quality metrics probably wouldn't be interesting enough for inclusion in a thesis, at least without knowing the exact thesis topic.
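
For point 2, once the contigs have been BLASTed, picking the top hit per contig is mostly bookkeeping. A minimal sketch, assuming BLAST was run with tabular output (-outfmt 6, the standard 12 columns) and that "highest bit score under an e-value cutoff" is an acceptable definition of best hit. The contig and subject names in the example are made up for illustration.

```python
import csv
import io


def best_hits(tabular_text, max_evalue=1e-5):
    """Return {contig_id: (subject_id, percent_identity, evalue, bitscore)}.

    Expects BLAST tabular output (-outfmt 6): qseqid sseqid pident length
    mismatch gapopen qstart qend sstart send evalue bitscore.
    Keeps the highest-bitscore hit per query at or below max_evalue.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip blank, truncated, or comment lines
        query, subject = row[0], row[1]
        pident, evalue, bitscore = float(row[2]), float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # too weak to trust as an annotation
        if query not in best or bitscore > best[query][3]:
            best[query] = (subject, pident, evalue, bitscore)
    return best


# Hypothetical hits: two for one contig, one weak hit for another.
example = (
    "contig_1\tEPSPS_sunflower\t91.2\t450\t40\t0\t1\t450\t10\t459\t1e-150\t520\n"
    "contig_1\tEPSPS_maize\t80.0\t400\t80\t0\t1\t400\t10\t409\t1e-90\t300\n"
    "contig_2\tunknown_protein\t35.0\t60\t39\t0\t1\t60\t5\t64\t0.5\t25\n"
)
hits = best_hits(example)
print(hits)
```

The e-value cutoff is an arbitrary default here; a contig with no entry in the result simply had no hit that passed it.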
Old 08-09-2013, 09:18 AM   #4
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

4 lanes should give some useful data. Use Trinity to assemble the transcriptomes. Use blast and perhaps something like blast2go for annotation.

I can't give out details but I am 100% sure that you will find potential differential Catalase and Cu-SOD1 activity. I agree with dpryan that listing numbers in your thesis would not be interesting. However, such numbers can be obtained if you really want them.
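
As an illustration of how such numbers can be obtained, here is a minimal sketch that counts reads and computes the mean read length and mean base quality from a FASTQ file. It assumes plain 4-line FASTQ records and Phred+33 quality encoding; the function name and the tiny test file are made up. In practice a dedicated QC tool such as FastQC reports these and much more.

```python
import gzip
import os
import tempfile


def fastq_stats(path):
    """Return read count, mean read length, and mean Phred quality.

    Assumes 4-line FASTQ records and Phred+33 quality encoding.
    """
    opener = gzip.open if path.endswith(".gz") else open
    reads = bases = qual_sum = 0
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            line = line.rstrip("\r\n")
            if i % 4 == 1:      # sequence line
                reads += 1
                bases += len(line)
            elif i % 4 == 3:    # quality line
                qual_sum += sum(ord(c) - 33 for c in line)
    return {
        "reads": reads,
        "mean_length": bases / reads if reads else 0.0,
        "mean_quality": qual_sum / bases if bases else 0.0,
    }


# Tiny made-up file: 'I' encodes Phred 40, '5' encodes Phred 20.
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\n555555\n")
stats = fastq_stats(tmp.name)
os.remove(tmp.name)
print(stats)
```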

Good luck with the analysis. Expect a steep learning curve but, it is to be hoped, a high payback.
Old 08-09-2013, 09:25 AM   #5
JackieBadger
Senior Member
 
Location: Halifax, Nova Scotia

Join Date: Mar 2009
Posts: 381
Default

You can also use MIRA and Newbler for transcriptome assembly.
Old 08-09-2013, 09:33 AM   #6
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

4 lanes is a lot these days. You can expect to get about 800 Million reads from that, and you really only need maybe 30-50 Million per replicate, and 3 replicates per sample. So you can easily stick in about 12-20 total samples in those 4 lanes.

So, you should have some sort of experimental design. Don't just throw some random stuff in each lane. You'll oversequence them and end up regretting it later. Now, you're probably not going to find many people able to help with the experimental design. But you should think about whether there is some sort of developmental time course, drug/condition treatment and control, or different tissues from an adult that would actually give some interesting comparisons. Once you have a few conditions/tissues/time-points picked out, you need to have 3 (or more) replicates for any sort of meaningful statistical analysis.

Now, if you can't come up with more than 2-3 different samples to sequence, you may want to consider genome sequencing. But how big is your genome, or do you have any idea? Because if you want maybe 9 samples for RNA-seq, that only leaves you with about 400M reads for the genome. With 2x100bp reads, you're really only going to have useful depth of sequencing for 2Gbp size genomes or smaller. And even then, it's going to be pretty fragmented due to the lack of mate-pair reads (though you could do 300bp and 800bp libraries now). But if you plan to continue working on this species, it may be useful to get the genome sequencing effort started, adding things like mate-pair libraries at a later time. This is something your advisor should be heavily involved in deciding, since most genome sequencing projects outlive a single graduate student (especially if you're already in year 3 or 4).
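
The depth estimate above is simple arithmetic: total sequenced bases divided by genome size. A rough sketch, assuming the 400M figure counts individual 100 bp reads (if it counts read pairs, double the result); the 18 Gbp case is included only to show how quickly depth drops for a much larger genome.

```python
def fold_coverage(n_reads, read_length_bp, target_size_bp):
    """Expected average depth: total sequenced bases / target size."""
    return n_reads * read_length_bp / target_size_bp


reads_left = 400e6   # reads remaining for the genome, per the post
read_len = 100       # each read of a 2x100 bp pair counted separately

cov_2g = fold_coverage(reads_left, read_len, 2e9)
cov_18g = fold_coverage(reads_left, read_len, 18e9)
print(f"2 Gbp genome: {cov_2g:.1f}x")
print(f"18 Gbp genome: {cov_18g:.1f}x")
```

Under these assumptions a 2 Gbp genome gets about 20x average depth and an 18 Gbp genome a little over 2x.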

Now, for RNAseq analysis without a genome, I highly recommend Trinity (linked above); it makes assembly, orthology assignment and expression analysis all very user friendly (for command line stuff).
Old 08-09-2013, 10:10 AM   #7
tjeffe01
Junior Member
 
Location: Canada

Join Date: Aug 2013
Posts: 4
Default

Thanks for your replies everyone.

I feel I should clarify my third point. I don't necessarily want to include the metrics about the data as a part of my thesis. I guess I would like to be able to include a paragraph and part of a slide in my defense about where the research is going beyond the work I've done so far. Being able to cite some metrics about the data sounds a little more scientific than "We ran RNAseq and got back a lot of data."

I also should clarify that we didn't buy four lanes. Our colleague bought a lane of 12 to himself and we bought 4 spots on that lane. We provided the sequencing facility with RNA from 4 plants representing 4 different states: Resistant sprayed (after 2 hours) and unsprayed to look for differential expression and susceptible sprayed (after 2 hours) and unsprayed to eliminate differences that are just a normal response to glyphosate.

The closest plant with a sequenced and aligned genome is sunflower, which is much too distant to be of use.

Based on the C-value, our genome size is about 1.8 x 10^10 bp.

Like I said, this isn't part of my project. I did the RNA extraction and the paperwork, but that is where my responsibility ends in my opinion. Looks like I need to make the recommendation to my supervisor that if he really wants to work with this data he needs to get a genome sequence first. Otherwise he could use Trinity, but he'll probably need to hire a new grad student or postdoc to do it.

Thanks for all of the answers everyone.
Old 08-09-2013, 10:16 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,975
Default

Quote:
Originally Posted by tjeffe01 View Post

I also should clarify that we didn't buy four lanes. Our colleague bought a lane of 12 to himself and we bought 4 spots on that lane. We provided the sequencing facility with RNA from 4 plants representing 4 different states: Resistant sprayed (after 2 hours) and unsprayed to look for differential expression and susceptible sprayed (after 2 hours) and unsprayed to eliminate differences that are just a normal response to glyphosate.
If this is a single lane of sequencing with 12 samples (if that is what you mean by lane of 12) then that would not be a lot of data.
Old 08-09-2013, 11:01 AM   #9
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

Quote:
Originally Posted by GenoMax View Post
If this is a single lane of sequencing with 12 samples (if that is what you mean by lane of 12) then that would not be a lot of data.
Indeed. In fact, it's probably far too little. At best, you're looking at about 12M reads per sample. That's just not enough sequencing depth. A single lane really shouldn't be split more than 6 ways, or equivalent (i.e. 12 samples spread over 2 lanes). And the fact that there isn't a reference genome makes it even harder: to do any meaningful analysis, genes/transcripts first need to be assembled, which requires much higher coverage than pure DE analysis.

So, it's probably a good thing this is just a "future direction" for the OP's thesis.
Old 08-09-2013, 11:02 AM   #10
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

Quote:
Originally Posted by tjeffe01 View Post
Using c-value our genome size is about 1.8 x 10^10 bp.
So 18Gbp? That genome isn't getting sequenced anytime soon.
Old 08-09-2013, 11:42 AM   #11
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Quote:
Originally Posted by Wallysb01 View Post
So 18Gbp? That genome isn't getting sequenced anytime soon.
Most plants aren't.

The bright side of working with the uncharacterized part of Life is that experiments don't have to stick to rigorous statistical principles. Which, in your case, is a good thing since you have neither biological nor technical replicates. Instead you can treat this as a "fishing expedition".

@Genomax. I agree that there isn't a lot of data but they should be able to get enough even with 1/3 of a lane. For RNAseq we shoot for at least 30M reads per sample. A one-lane 8-sample experiment (similar to tjeffe01's) that recently came through our center yielded a total of 450M reads. So assuming that tjeffe01's sequencing center can balance across those 12 samples, he will get over 30M reads per sample. From that Trinity will be able to provide a nice assembly. Not to human/mouse standards but for us plant & animal guys ... well, we just take what we can.


I should emphasize what Wallysb01 said. Trinity does the assembly and, via its Trinotate package, the annotation and expression analysis. I am a bit behind the times by still using Blast and Blast2Go for my annotation but Trinity is becoming a one-stop solution.
Old 08-09-2013, 04:15 PM   #12
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

Quote:
Originally Posted by westerman View Post
A recent one-lane 8-sample experiment (similar to tjeffe01's) that recently came through our center yielded a total of 450M reads. So assuming that tjeffe01's sequencing center can balance across those 12 samples then he will get over 30M reads per sample. From that Trinity will be able to provide a nice assembly. Not to human/mouse standards but for us plant & animal guys ... well, we just take what we can.
Do you mean 450M reads as in 225M PE reads? I don't think counting a read on the same fragment twice is the right thing to do here, if that's indeed what you're doing.

But I guess we should ask tjeffe01, how many PE reads did you get for each sample? Or is it not completed yet?
Old 08-09-2013, 04:40 PM   #13
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

Since it is a roundup-resistant weed, you have obvious candidate genes. Part of the problem is that you don't have biological replicates, if I read it right, at least not for expression.

What you do have is biological replicates in sequence....

The first thing I would try to do is mine this data for any sequence variants in candidate genes...especially EPSPS. Imagine if you find variants in EPSPS in the resistant variety that are not in the non-resistant variety. That would be a very obvious candidate for resistance.

You could try to make a de novo transcriptome assembly. I would actually combine the reads from samples to make the assembly, or at least combine the reads from resistant varieties and combine the reads from non-resistant varieties. Then realign your reads back to this reference transcriptome to get differential expression.
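
A sketch of what the pooled assembly might look like in practice. It only builds the command line and does not run anything. The flag names are the ones used by recent Trinity releases as far as I know (older releases spelled the memory option differently), so check them against the installed version; the sample and file names are hypothetical.

```python
import shlex


def trinity_command(left_files, right_files, out_dir,
                    max_memory="50G", cpu=8):
    """Build (but do not run) a Trinity command for a pooled assembly.

    left_files / right_files are same-ordered lists of paired FASTQ files;
    Trinity takes several files per side as a comma-separated list.
    """
    if len(left_files) != len(right_files):
        raise ValueError("need one right-read file per left-read file")
    return [
        "Trinity",
        "--seqType", "fq",
        "--left", ",".join(left_files),
        "--right", ",".join(right_files),
        "--max_memory", max_memory,
        "--CPU", str(cpu),
        "--output", out_dir,
    ]


# Hypothetical names for the four samples described in the thread.
samples = ["R_sprayed", "R_unsprayed", "S_sprayed", "S_unsprayed"]
cmd = trinity_command(
    [f"{s}_1.fastq.gz" for s in samples],
    [f"{s}_2.fastq.gz" for s in samples],
    "trinity_pooled",
)
print(shlex.join(cmd))
```

To assemble resistant and susceptible plants separately for comparison, call the same function once per group with two samples each.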

Last edited by chadn737; 08-09-2013 at 04:42 PM.
Old 08-09-2013, 04:45 PM   #14
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

Quote:
Originally Posted by chadn737 View Post
I would actually combine the reads from samples to make the assembly, or at least combine the reads from resistant varieties and combine the reads from non-resistant varieties. Then realign your reads back to this reference transcriptome to get differential expression.
This, definitely this. I'd suggest assembling them all together, given your fairly limited sequencing depth. But I'd do both myself, and compare.
Old 08-09-2013, 05:03 PM   #15
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

I understand the pitfalls of this suggestion, so nobody rip me to pieces.

I suggest it only because there are VERY obvious candidate genes in this experimental design. For those not familiar with how roundup works, the target enzyme is EPSPS and roundup-resistant crops carry a resistant EPSPS gene (no offense to those who know all this already, I just don't want to be attacked for suggesting this).

Find some sequences of candidate genes, whether from sunflower or other organisms. You may even be able to find the sequence of EPSPS from ragweed in a database somewhere. Then just align your reads against this small reference. Obviously, this can lead to a lot of misalignment, but it would give a very quick look at any reads aligning to candidate genes. I would suggest this only as an initial quick-and-dirty look at your data while you are running a de novo assembly or something, not as an approach to getting your data published.
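
To make the idea concrete, here is a toy sketch that pulls out reads sharing exact k-mers with a candidate gene sequence on either strand. This is a stand-in for a real aligner, not a replacement: exact k-mer matching will probably miss reads from a divergent ortholog, and the thresholds, function names, and sequences below are all made up for illustration.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (N stays N)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def kmers(seq, k):
    """Set of all k-length substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen_reads(reads, reference, k=15, min_shared=3):
    """Return reads sharing >= min_shared k-mers with either reference strand."""
    ref_kmers = kmers(reference, k) | kmers(revcomp(reference), k)
    return [r for r in reads if len(kmers(r, k) & ref_kmers) >= min_shared]


# Made-up 40 bp "candidate gene" and three reads: one from the forward
# strand, one from the reverse strand, one unrelated.
reference = "ATGGCTCAAGTTAGCAGAATCTGCAATGGTGTGCAGAACC"
read_fwd = reference[5:30]
read_rev = revcomp(reference[10:35])
read_other = "T" * 25
matched = screen_reads([read_fwd, read_rev, read_other], reference)
print(len(matched), "of 3 reads matched")
```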

What do people think?

Last edited by chadn737; 08-09-2013 at 05:08 PM.
Old 08-12-2013, 09:01 AM   #16
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Quote:
Originally Posted by Wallysb01 View Post
Do you mean 450M reads as in 225M PE reads? I don't think counting a read on the same fragment twice is the right thing to do here, if that's indeed what you're doing.
That is what I meant -- 450M total reads, 225M PE reads. For 12 samples per lane that would give about 37M total reads or about 3700 Mbases per sample.

Considering assembly only and assuming that the samples are not overwhelmed with rRNA or other highly expressed transcripts, there are about 14800 Mbases to work with if all 4 samples are merged together to create the assembly (which is the only way I would do it). With the transcriptome being, what, around 100M base pairs, we have 148x coverage. Even with rRNA and the highly expressed genes using up a lot of the bases, the coverage should be high enough for a rough assembly. Perhaps not high enough to tease out very low expression transcripts but still good enough to get a handle on the transcriptome.
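
The same estimate as a small calculation, for anyone who wants to plug in their own numbers. The 100 bp read length is inferred from the figures above (3700 Mbases from about 37M reads), and usable_fraction is a hypothetical knob for the share of bases not eaten up by rRNA and very highly expressed transcripts.

```python
def pooled_assembly_coverage(total_reads_lane, samples_per_lane,
                             samples_pooled, read_length_bp,
                             transcriptome_bp, usable_fraction=1.0):
    """Average coverage of the transcriptome from pooled samples."""
    reads_per_sample = total_reads_lane / samples_per_lane
    pooled_bases = reads_per_sample * read_length_bp * samples_pooled
    return usable_fraction * pooled_bases / transcriptome_bp


# 450M total reads per lane, 12 samples per lane, 4 samples pooled,
# 100 bp reads, ~100 Mbp transcriptome (all taken from the estimate above).
ideal = pooled_assembly_coverage(450e6, 12, 4, 100, 100e6)
half_lost = pooled_assembly_coverage(450e6, 12, 4, 100, 100e6,
                                     usable_fraction=0.5)
print(f"all bases usable: {ideal:.0f}x")
print(f"half the bases usable: {half_lost:.0f}x")
```

Without rounding the per-sample read count this comes out at about 150x rather than 148x, and about 75x if half the bases are lost.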

On a side note, I am finding this thread to be interesting.
Old 08-12-2013, 09:36 AM   #17
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,975
Default

Quote:
Originally Posted by westerman View Post
That is what I meant -- 450M total reads, 225M PE reads.
That is illumina marketing speak

I prefer to think of it as 225M unique clusters. The libraries being referred to must be extra good quality libraries, something that is not guaranteed (there is beginner's luck, if applicable here).

We have given tjeffe01 a ton of advice but Taylor is not going to do much with the data as indicated in the original post. Hopefully these suggestions will be passed on to the person who would be doing the data analysis.

Taylor: Do report back in this thread as to how the sequencing turned out. Good luck.
Old 08-15-2013, 06:55 AM   #18
tjeffe01
Junior Member
 
Location: Canada

Join Date: Aug 2013
Posts: 4
Default Thanks for all the info everybody

Wow, this forum is amazing, thanks for answering everyone.

A couple updates:
- After talking to our colleague who actually arranged the sequencing, it turns out we purchased 4 lanes and he purchased 12 in total. I apologize if I don't use the technical jargon properly. My background is in molecular genetics and biochemistry, and while I understand how RNAseq and other next-gen sequencing technology works, I'm not very familiar with the technical side of the process or with the computer science side.

- In terms of candidate genes, we have sequenced the EPSPS gene and found no polymorphisms between resistant and susceptible plants, and we haven't found any evidence of gene duplication or overexpression. Glyphosate resistance tends to be due either to EPSPS mutation or to some other mechanism unrelated to EPSPS (for example, changes in translocation patterns or sequestration in the vacuole). Having ruled out EPSPS mutation as a mechanism, we are attempting to use a reverse genetic approach to find anything that may be linked to our resistance trait.

- I've forwarded all of this information on to my supervisor and he has decided that he either needs to sit on this data for a few years until the technology advances enough to make it easier, or he needs to hire a bioinformatician.

- Unfortunately, sequencing the genome is not really a possibility. Funding for plant genomes is cropping up (excuse the pun) but is heavily focused on crop plants. There just isn't funding available for sequencing the genome of a plant species that only affects a small portion of the U.S. and Canada. In addition, there's no guarantee that sequencing the genome would lead to a new control mechanism for resistant plants. Farmers and other researchers would rather just blast the plants with more Roundup and other herbicides.

Thanks everyone, I'll update you as new data becomes available,
Taylor
Old 09-08-2013, 09:59 AM   #19
Melissa
Senior Member
 
Location: Switzerland

Join Date: Aug 2008
Posts: 116
Default

3) This data will not be in my MSc and I won't be doing much work with it beyond posting here; but I would like to include a short section in my thesis on the data we collect here. Is it possible to get average read lengths, fold number and other quality statistics about the data to include in my thesis, or does that not make any sense?
>>> I would advise against including the data in your thesis. It might not fit very well in your thesis. More importantly, it might invite questions from your examiners. Or worse, confuse them.

Another piece of advice: forget everything you just learned from this thread until you have finished writing your thesis.