SEQanswers

Old 12-18-2014, 11:02 AM   #1
michaellim
Member
 
Location: England

Join Date: Dec 2014
Posts: 28
Which reference genome to use?

Dear everyone,

I am doing RNA sequencing on a bacterium, but I am unsure which reference genome to use for my RNA-seq data. Currently, there are two options:

1. A complete and annotated reference genome of a bacterium from a different sequence type.

2. A newly published genome of the same sequence type as my bacterium, but the genome is split into several contigs.

I do not know how much the sequence types differ, or how many genes are specific to bacteria of my sequence type and absent from the complete reference genome (option 1). They are all the same bacterial species, though.

Which would be more appropriate? Would appreciate some advice.

Thank you.
Old 12-18-2014, 11:05 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

For most purposes, you should use the most closely-related genome, even if it is not a single-contig assembly. I'm not sure what you mean by "sequence type" though.
Old 12-18-2014, 12:09 PM   #3
AntonioRFranco
Member
 
Location: Cordoba, Spain

Join Date: Feb 2013
Posts: 21
Default

You can do a whole-genome comparison with programs such as Mauve or ACT. There are tutorials around explaining how to use them.
Old 12-18-2014, 02:21 PM   #4
michaellim
Member
 
Location: England

Join Date: Dec 2014
Posts: 28
Default

Quote:
Originally Posted by Brian Bushnell View Post
For most purposes, you should use the most closely-related genome, even if it is not a single-contig assembly. I'm not sure what you mean by "sequence type" though.
Hi Brian,

Take E. coli as an example: although it is ONE species, there are various versions of it, i.e. sequence types (STs). For example, the human-adapted E. coli that causes problematic infections around the world is ST131. There might be mutations or genes specific to each sequence type.

I'm totally new to sequencing. When they are in several contigs, does it mean that there are gaps between the sequences, hence the authors deposited the sequences in contigs rather than a circular 4Mb chromosome?

Many thanks for the advice.
Old 12-18-2014, 02:22 PM   #5
michaellim
Member
 
Location: England

Join Date: Dec 2014
Posts: 28
Default

Quote:
Originally Posted by AntonioRFranco View Post
You can do a whole-genome comparison with programs such as Mauve or ACT. There are tutorials around explaining how to use them.
Hi Antonio,

Do you mean compare the two options first? What if there's a difference between the two genomes? What do you suggest I do then?

Many thanks.
Old 12-18-2014, 02:49 PM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,059
Default

Quote:
Originally Posted by michaellim View Post

I'm totally new to sequencing. When they are in several contigs, does it mean that there are gaps between the sequences, hence the authors deposited the sequences in contigs rather than a circular 4Mb chromosome?

Many thanks for the advice.
That is a likely explanation. If submitters are not completely sure that the contigs go together (there could be multiple plasmids in some bacteria and the separate pieces may be real) they would be left in that state.
Old 12-18-2014, 02:50 PM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by michaellim View Post
Hi Brian,

Take E. coli as an example: although it is ONE species, there are various versions of it, i.e. sequence types (STs). For example, the human-adapted E. coli that causes problematic infections around the world is ST131. There might be mutations or genes specific to each sequence type.

I'm totally new to sequencing. When they are in several contigs, does it mean that there are gaps between the sequences, hence the authors deposited the sequences in contigs rather than a circular 4Mb chromosome?

Many thanks for the advice.
It's difficult to get single-contig assemblies (unless you use PacBio data). Multiple contigs typically mean that the coverage was too low in places to assemble correctly, or there were long repeats that confused the assembler. When we assemble a microbe from Illumina data, we might get 50 contigs or more. Probably 99%+ of the genome is there, but typically the order and orientation of the contigs are not known. There are not necessarily gaps, but there may be.

As for "ST", I've just never heard that terminology before; people I work with normally refer to those as "strains". And yes, I think it's still best to use the genome that is most closely related to your organism unless the assembly is really bad (hundreds of small contigs).

Edit - also, as GenoMax pointed out, plasmids will cause correct multi-contig assemblies.
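
To get a feel for whether a contig-level assembly is "really bad" in the sense described above, one option is to count its contigs and compute the N50. The following is only a minimal sketch: the function names and the tiny in-memory FASTA are made up for illustration, and a real assembly would be read from a file instead.

```python
from io import StringIO


def read_fasta_lengths(handle):
    """Return {contig_name: length} for a (multi-)FASTA file handle."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The contig name is the first word after '>'.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            return length
    return 0


# Hypothetical toy assembly with three contigs (10, 6 and 4 bases).
demo = StringIO(">contig1\nACGTACGTAC\n>contig2\nACGTAC\n>contig3\nACGT\n")
contig_lengths = read_fasta_lengths(demo)
print("contigs:", len(contig_lengths))
print("total bases:", sum(contig_lengths.values()))
print("N50:", n50(contig_lengths.values()))
```

A few dozen contigs with a large N50 would fit the "50 contigs or more" case; hundreds of contigs with a small N50 would be the case to be wary of.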
Old 12-18-2014, 02:58 PM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,059
Default

Quote:
Originally Posted by michaellim View Post
Hi Brian,

Take E. coli as an example: although it is ONE species, there are various versions of it, i.e. sequence types (STs). For example, the human-adapted E. coli that causes problematic infections around the world is ST131. There might be mutations or genes specific to each sequence type.
If the overall organization of the genomes is similar, then whole-genome comparison can be informative. Mauve is designed for these kinds of comparisons, which can help locate genome-level rearrangements. Comparing multiple E. coli strains would be appropriate, as in this example from Yersinia: http://asap.genetics.wisc.edu/softwa...creenshots.php
Old 12-18-2014, 04:46 PM   #9
michaellim
Member
 
Location: England

Join Date: Dec 2014
Posts: 28
Default

Quote:
Originally Posted by Brian Bushnell View Post
It's difficult to get single-contig assemblies (unless you use PacBio data). Multiple contigs typically mean that the coverage was too low in places to assemble correctly, or there were long repeats that confused the assembler. When we assemble a microbe from Illumina data, we might get 50 contigs or more. Probably 99%+ of the genome is there, but typically the order and orientation of the contigs are not known. There are not necessarily gaps, but there may be.

As for "ST", I've just never heard that terminology before; people I work with normally refer to those as "strains". And yes, I think it's still best to use the genome that is most closely related to your organism unless the assembly is really bad (hundreds of small contigs).

Edit - also, as GenoMax pointed out, plasmids will cause correct multi-contig assemblies.
Hi Brian,

So if I were to use the multiple contigs as my reference when aligning my RNA-seq data, how should I do this? Do I need to combine all the contigs first (and if so, how)?

And for alignment, which tool is best for bacterial RNA-seq: TopHat, BWA, or Bowtie? I heard TopHat is used a lot in eukaryotic RNA-seq because it looks for splice junctions.

Thank you very much.
Old 12-18-2014, 04:50 PM   #10
michaellim
Member
 
Location: England

Join Date: Dec 2014
Posts: 28
Default

Quote:
Originally Posted by GenoMax View Post
If the overall organization of the genomes is similar, then whole-genome comparison can be informative. Mauve is designed for these kinds of comparisons, which can help locate genome-level rearrangements. Comparing multiple E. coli strains would be appropriate, as in this example from Yersinia: http://asap.genetics.wisc.edu/softwa...creenshots.php
Hi GenoMax,

Thanks for the info. Could you please advise how I should compare the published complete genome with the other published genome, which is in contigs? Do I need to merge the contigs before using Mauve (and if so, how)?

Many thanks.
Old 12-18-2014, 04:55 PM   #11
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by michaellim View Post
Hi Brian,

So if I were to use the multiple contigs as my reference when aligning my RNA-seq data, how should I do this? Do I need to combine all the contigs first (and if so, how)?
All aligners are designed to handle references with multiple contigs; you don't need to combine anything (nor should you). You just need to index it.
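
One way to see that the contigs stay separate after indexing is to look at the header of the SAM file an aligner produces: each reference sequence gets its own @SQ line with a name (SN) and length (LN). The sketch below is only an illustration; the function name, contig names, and lengths are invented, and a real SAM file would be read from disk.

```python
def reference_sequences(sam_lines):
    """Collect (name, length) pairs from the @SQ header lines of SAM text."""
    refs = []
    for line in sam_lines:
        if not line.startswith("@"):
            break  # the header ends at the first alignment record
        if line.startswith("@SQ"):
            # Header fields look like TAG:VALUE and are tab-separated.
            fields = dict(
                field.split(":", 1)
                for field in line.rstrip("\n").split("\t")[1:]
            )
            refs.append((fields["SN"], int(fields["LN"])))
    return refs


# Hypothetical header for a reference made of two contigs and one plasmid.
demo_sam = [
    "@HD\tVN:1.6\tSO:unsorted\n",
    "@SQ\tSN:contig_1\tLN:1500000\n",
    "@SQ\tSN:contig_2\tLN:820000\n",
    "@SQ\tSN:plasmid_A\tLN:95000\n",
    "read1\t0\tcontig_2\t101\t60\t50M\t*\t0\t0\t*\t*\n",
]

for name, length in reference_sequences(demo_sam):
    print(name, length)
```

Each alignment record then names the contig it hit in its third column, so nothing needs to be concatenated beforehand.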

Quote:
And for alignment, which tool is best for bacterial RNA-seq: TopHat, BWA, or Bowtie? I heard TopHat is used a lot in eukaryotic RNA-seq because it looks for splice junctions.

Thank you very much.
Well, since you ask me, I will recommend BBMap, which also handles RNA-seq data but is faster and more sensitive than TopHat. But bacteria generally lack introns; when they are present, they are very short and occur in only a handful of genes. So it's not strictly necessary to use a splice-aware aligner for bacterial RNA-seq, though I would still recommend it.
Old 12-18-2014, 05:49 PM   #12
Sergioo
Member
 
Location: Japan

Join Date: Oct 2013
Posts: 29
Default

Quote:
Originally Posted by michaellim View Post
Dear everyone,

A complete and annotated reference genome of a bacteria from a different sequence type.

Which would be more appropriate? Would appreciate some advice.

Thank you.
What do you mean exactly by sequence type? Maybe those assigned from MLST typing?
Old 12-19-2014, 07:44 AM   #13
michaellim
Member
 
Location: England

Join Date: Dec 2014
Posts: 28
Default

Quote:
Originally Posted by Sergioo View Post
What do you mean exactly by sequence type? Maybe those assigned from MLST typing?
Hi Sergioo,

Yes, MLST. For example, E. coli ST11 is different from ST131. However, we aren't certain whether there are any genes specific to ST131 that cannot be found in other E. coli sequence types.

So, if ST11 has a completed genome but ST131 is in contigs, and my current RNA-seq data is from ST131, should I use ST131 (multiple contigs) as the reference, or the completed genome of the less closely related ST11? That was my question. Hope that makes it clearer.

Thank you.
Old 12-19-2014, 07:44 AM   #14
michaellim
Member
 
Location: England

Join Date: Dec 2014
Posts: 28
Default

Quote:
Originally Posted by Brian Bushnell View Post
All aligners are designed to handle references with multiple contigs; you don't need to combine anything (nor should you). You just need to index it.



Well, since you ask me, I will recommend BBMap, which also handles RNA-seq data but is faster and more sensitive than TopHat. But bacteria generally lack introns; when they are present, they are very short and occur in only a handful of genes. So it's not strictly necessary to use a splice-aware aligner for bacterial RNA-seq, though I would still recommend it.
Thanks Brian for the info.

I will give it a go first and see what happens.
Old 12-19-2014, 12:33 PM   #15
piet
Member
 
Location: planet earth

Join Date: Aug 2014
Posts: 21
Default

Quote:
Originally Posted by michaellim View Post
For example, E. coli ST11 is different from ST131. However, we aren't certain whether there are any genes specific to ST131 that cannot be found in other E. coli sequence types.

So, if ST11 has a completed genome but ST131 is in contigs, and my current RNA-seq data is from ST131, should I use ST131 (multiple contigs) as the reference, or the completed genome of the less closely related ST11? That was my question. Hope that makes it clearer.
Multilocus sequence typing (MLST) is a method frequently used to characterize bacterial genomes. MLST schemes have been published for most pathogenic bacteria. For the species Escherichia coli (including Shigella) there are even three competing schemes. In the scheme maintained at University College Cork, sequence type 11 (ST11) refers to isolates typically found in cattle (serovar O157:H7), while strains belonging to ST131 are uropathogenic, meaning they are associated with infections of the urinary tract in humans. The chromosome of E. coli encodes more than 4000 proteins. Maybe half of them belong to the accessory genome, which means they are found only in some strains or clonal groups.

If you want to map your reads from RNA sequencing, I would recommend using a genome from the same or a very closely related sequence type. Otherwise you will miss several genes from the accessory genome. For E. coli ST131 there are several genomes available in GenBank, even fully finished ones (AP009378.1 and plasmid AP009379.1, CP002797.2). Sequences for ST131 isolates KTE173, KTE49, KTE162, KTE6, KTE211, KTE175, KTE178, KTE216, KTE148, KTE139 are available as WGS contigs.

I would recommend trying several reference genomes. A mapping run usually takes only a few minutes on a desktop PC.
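
When trying several references, a simple way to compare them is the fraction of reads that map to each. The sketch below assumes the mapper wrote standard SAM output, where bit 0x4 of the FLAG column marks an unmapped read and bits 0x100 and 0x800 mark secondary and supplementary records. The function names, reference labels, and toy records are hypothetical.

```python
def mapping_rate(sam_lines):
    """Fraction of primary read records that are mapped, from SAM text lines."""
    mapped = 0
    total = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        flag = int(line.split("\t")[1])
        if flag & 0x100 or flag & 0x800:
            continue  # skip secondary and supplementary alignments
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
    return mapped / total if total else 0.0


def fake_record(name, flag):
    """Build a minimal SAM record for the demo; only the FLAG matters here."""
    return f"{name}\t{flag}\t*\t0\t0\t*\t*\t0\t0\t*\t*\n"


# Toy results for the same four reads mapped against two hypothetical references.
runs = {
    "ST11_complete": [fake_record(f"r{i}", f) for i, f in enumerate([0, 16, 4, 4])],
    "ST131_contigs": [fake_record(f"r{i}", f) for i, f in enumerate([0, 16, 0, 4])],
}
rates = {ref: mapping_rate(lines) for ref, lines in runs.items()}
best = max(rates, key=rates.get)
for ref, rate in rates.items():
    print(f"{ref}: {rate:.0%} mapped")
print("highest mapping rate:", best)
```

With paired-end data each mate is a separate record, so the rate is per record rather than per fragment.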
--
piet
Old 12-19-2014, 12:37 PM   #16
michaellim
Member
 
Location: England

Join Date: Dec 2014
Posts: 28
Default

Quote:
Originally Posted by piet View Post
Multilocus sequence typing (MLST) is a method frequently used to characterize bacterial genomes. MLST schemes have been published for most pathogenic bacteria. For the species Escherichia coli (including Shigella) there are even three competing schemes. In the scheme maintained at University College Cork, sequence type 11 (ST11) refers to isolates typically found in cattle (serovar O157:H7), while strains belonging to ST131 are uropathogenic, meaning they are associated with infections of the urinary tract in humans. The chromosome of E. coli encodes more than 4000 proteins. Maybe half of them belong to the accessory genome, which means they are found only in some strains or clonal groups.

If you want to map your reads from RNA sequencing, I would recommend using a genome from the same or a very closely related sequence type. Otherwise you will miss several genes from the accessory genome. For E. coli ST131 there are several genomes available in GenBank, even fully finished ones (AP009378.1 and plasmid AP009379.1, CP002797.2). Sequences for ST131 isolates KTE173, KTE49, KTE162, KTE6, KTE211, KTE175, KTE178, KTE216, KTE148, KTE139 are available as WGS contigs.

I would recommend trying several reference genomes. A mapping run usually takes only a few minutes on a desktop PC.
--
piet
Hi Piet,

Many thanks for the clarification. I will try different genomes then, if it doesn't take too long. May I ask which alignment/mapping software you use? Is there any particular reason for that choice?

Cheers.
Old 12-19-2014, 01:27 PM   #17
piet
Member
 
Location: planet earth

Join Date: Aug 2014
Posts: 21
Default

Quote:
Originally Posted by michaellim View Post
May I ask which alignment/mapping software you use?
I use 'bwa mem', but my use case is processing DNA sequencing data. It is very fast and reliable with default settings. Nevertheless, bwa and similar mappers should also be suitable for bacterial RNA sequencing, since bacteria do not splice their messenger RNA.

In the beginning it took me quite a while to figure out how to write shell scripts to start bwa runs in a comfortable way and to handle the resulting SAM files. You will definitely need to learn some kind of shell or script programming if you want to go that route.
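
As a rough idea of what such a wrapper script involves, here is a minimal Python sketch that only builds and prints the two bwa command lines (indexing the reference, then mapping). The file names and the helper function are placeholders, not anything from a real project; `bwa index` and `bwa mem` with `-t` for threads are the standard subcommands, and `bwa mem` writes SAM to standard output.

```python
import shlex


def bwa_commands(reference_fasta, reads_fastq, threads=4):
    """Return the argument lists for indexing a reference and mapping reads."""
    # The reference can be a multi-contig FASTA; it is indexed as it is.
    index_cmd = ["bwa", "index", reference_fasta]
    mem_cmd = ["bwa", "mem", "-t", str(threads), reference_fasta, reads_fastq]
    return index_cmd, mem_cmd


# Placeholder file names for illustration only.
index_cmd, mem_cmd = bwa_commands("ST131_contigs.fasta", "sample1.fastq")
print(shlex.join(index_cmd))
print(shlex.join(mem_cmd) + " > sample1.sam")

# To actually run them (requires bwa to be installed and on the PATH):
#   import subprocess
#   subprocess.run(index_cmd, check=True)
#   with open("sample1.sam", "w") as out:
#       subprocess.run(mem_cmd, stdout=out, check=True)
```

Keeping the commands as lists rather than one long string avoids quoting problems when file names contain spaces.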

Why don't you do a DNA sequencing run of your particular isolate before you go into RNA sequencing?
--
piet

Last edited by piet; 12-19-2014 at 01:57 PM.
Old 12-20-2014, 04:50 AM   #18
michaellim
Member
 
Location: England

Join Date: Dec 2014
Posts: 28
Default

Quote:
Originally Posted by piet View Post
I use 'bwa mem', but my use case is processing DNA sequencing data. It is very fast and reliable with default settings. Nevertheless, bwa and similar mappers should also be suitable for bacterial RNA sequencing, since bacteria do not splice their messenger RNA.

In the beginning it took me quite a while to figure out how to write shell scripts to start bwa runs in a comfortable way and to handle the resulting SAM files. You will definitely need to learn some kind of shell or script programming if you want to go that route.

Why don't you do a DNA sequencing run of your particular isolate before you go into RNA sequencing?
--
piet

Hi Piet,

I see. I have close to no coding/programming knowledge, so maybe BWA is not suitable for me. But I will check out the website for more info about it.

I did consider DNA sequencing the genome of my sequence type strain, but the lab has limited funds.

Thank you very much.
Old 12-20-2014, 04:52 AM   #19
michaellim
Member
 
Location: England

Join Date: Dec 2014
Posts: 28
Default

Dear All,

May I also ask: my RNA-seq libraries were about 260 bp in size, prepared according to Illumina's protocol. For the FASTQ files I currently have, do I need to remove the adapter (index) sequences before mapping to the reference genome?

Many thanks.
Old 12-20-2014, 04:54 AM   #20
michaellim
Member
 
Location: England

Join Date: Dec 2014
Posts: 28
Default

Quote:
Originally Posted by GenoMax View Post
That is a likely explanation. If submitters are not completely sure that the contigs go together (there could be multiple plasmids in some bacteria and the separate pieces may be real) they would be left in that state.
Hi Sergio,

May I check with you whether I need to trim the adapter sequence from my RNA-seq FASTQ file? My library fragments were about 260 bp each.

Any suggestions on how I should do this? Do I just set the software to trim from base 1 to base X, or do I need to give the individual adapter sequences to the trimmer? I've noticed quite a few trimmers online. There is a built-in one in Galaxy too.

Many thanks.
All times are GMT -8.