Which reference genome to use?

michaellim replied

12-20-2014, 05:52 AM
Dear All,

May I also ask: my RNA-seq libraries were about 260 bp in size according to Illumina's preparation protocol. For the FASTQ files that I currently have, do I need to remove the adapter (index) sequences before mapping to the reference genome?

Many thanks.
Leave a comment:
michaellim replied

12-20-2014, 05:50 AM
Originally posted by piet View Post

I use 'bwa mem', but my use case is processing DNA sequencing data. It is very fast and reliable with default settings. Nevertheless, bwa and similar mappers should also be suitable for bacterial RNA sequencing, since bacteria do not splice their messenger RNA.

In the beginning it took me quite a while to figure out how to write shell scripts that start bwa runs conveniently and handle the resulting SAM files. You will definitely need to learn some kind of shell or script programming if you want to go that route.

Why don't you do a DNA sequencing run of your particular isolate before you go into RNA sequencing?
--
piet

Hi Piet,

I see. I have close to no coding/programming knowledge, so maybe BWA is not suitable for me. But I will check out the website for more info about it.

I did consider sequencing the genomic DNA of my sequence type strain, but the lab has limited funds.

Thank you very much.
Leave a comment:
piet replied

12-19-2014, 02:27 PM
Originally posted by michaellim View Post

May I know what kind of alignment/mapping software do you use?

I use 'bwa mem', but my use case is processing DNA sequencing data. It is very fast and reliable with default settings. Nevertheless, bwa and similar mappers should also be suitable for bacterial RNA sequencing, since bacteria do not splice their messenger RNA.

In the beginning it took me quite a while to figure out how to write shell scripts that start bwa runs conveniently and handle the resulting SAM files. You will definitely need to learn some kind of shell or script programming if you want to go that route.
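As a rough illustration of such a wrapper, here is a minimal Python sketch rather than a shell script. It assumes bwa is installed and on the PATH; the file names are hypothetical placeholders and the thread count is an arbitrary choice. It only relies on `bwa index`, `bwa mem` and the `-t` option.

```python
import shlex
import subprocess
from pathlib import Path


def bwa_commands(reference, reads, threads=4):
    """Return the two command lines: index the reference, then map with 'bwa mem'."""
    index_cmd = ["bwa", "index", str(reference)]
    mem_cmd = ["bwa", "mem", "-t", str(threads), str(reference)]
    mem_cmd += [str(r) for r in reads]  # one FASTQ (single-end) or two (paired-end)
    return index_cmd, mem_cmd


def run_bwa(reference, reads, sam_out, threads=4):
    """Run both steps. 'bwa mem' writes SAM to stdout, so redirect it to a file."""
    index_cmd, mem_cmd = bwa_commands(reference, reads, threads)
    subprocess.run(index_cmd, check=True)
    with open(sam_out, "w") as sam:
        subprocess.run(mem_cmd, stdout=sam, check=True)
    return Path(sam_out)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only; this just prints the commands.
    for cmd in bwa_commands("ST131_contigs.fasta",
                            ["sample_R1.fastq", "sample_R2.fastq"]):
        print(shlex.join(cmd))
```

Printing the commands first is a cheap way to check them before calling `run_bwa`, which actually starts the mapping.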

Why don't you do a DNA sequencing run of your particular isolate before you go into RNA sequencing?
--
piet

Last edited by piet; 12-19-2014, 02:57 PM.
Leave a comment:
michaellim replied

12-19-2014, 01:37 PM
Originally posted by piet View Post

Multilocus sequence typing (MLST) is a method frequently used to characterize bacterial genomes. MLST schemes have been published for most pathogenic bacteria. For the species Escherichia coli (including Shigella) there are even three competing schemes. In the scheme maintained at Cork University, sequence type 11 (ST11) refers to isolates typically found in cattle (serovar O157:H7), while strains belonging to ST131 are uropathogenic, which means they are associated with infections of the urinary tract in humans. The chromosome of E. coli encodes more than 4000 proteins. Maybe half of them belong to the accessory genome, which means they are only found in some strains or clonal groups.

If you want to map your reads from RNA sequencing, I would recommend using a genome from the same or a very closely related sequence type. Otherwise you will miss several genes from the accessory genome. For E. coli ST131 there are several genomes available in GenBank, even fully finished ones (AP009378.1 and plasmid AP009379.1, CP002797.2). Sequences for ST131 isolates KTE173, KTE49, KTE162, KTE6, KTE211, KTE175, KTE178, KTE216, KTE148, KTE139 are available as WGS contigs.

I would recommend trying several reference genomes. A mapping run usually takes only a few minutes on a desktop PC.
--
piet

Hi Piet,

Many thanks for the clarification. I will give it a try with different genomes then, if it doesn't take too long. May I know what kind of alignment/mapping software you use? Are there any particular reasons for that choice?

Cheers.
Leave a comment:
piet replied

12-19-2014, 01:33 PM
Originally posted by michaellim View Post

For example, E. coli ST11 will be different from ST131. However, we aren't certain whether there are any genes specific to ST131 that cannot be found in other E. coli sequence types.

So, if ST11 has a completed genome but ST131 is in contigs, and my current RNA-seq data is from ST131, should I use ST131 (multiple contigs) as the reference, or the completed genome of ST11, which is less closely related? That was my question. Hope that makes it clearer.

Multilocus sequence typing (MLST) is a method frequently used to characterize bacterial genomes. MLST schemes have been published for most pathogenic bacteria. For the species Escherichia coli (including Shigella) there are even three competing schemes. In the scheme maintained at Cork University, sequence type 11 (ST11) refers to isolates typically found in cattle (serovar O157:H7), while strains belonging to ST131 are uropathogenic, which means they are associated with infections of the urinary tract in humans. The chromosome of E. coli encodes more than 4000 proteins. Maybe half of them belong to the accessory genome, which means they are only found in some strains or clonal groups.

If you want to map your reads from RNA sequencing, I would recommend using a genome from the same or a very closely related sequence type. Otherwise you will miss several genes from the accessory genome. For E. coli ST131 there are several genomes available in GenBank, even fully finished ones (AP009378.1 and plasmid AP009379.1, CP002797.2). Sequences for ST131 isolates KTE173, KTE49, KTE162, KTE6, KTE211, KTE175, KTE178, KTE216, KTE148, KTE139 are available as WGS contigs.

I would recommend trying several reference genomes. A mapping run usually takes only a few minutes on a desktop PC.
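One simple way to compare candidate references is the fraction of reads that map to each. The following is a minimal Python sketch, assuming you already have one SAM file per reference; the function names and the idea of ranking by mapping rate are illustrative assumptions, not an established protocol. It reads the standard SAM FLAG column (0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary).

```python
def mapping_rate(sam_lines):
    """Fraction of primary read records that are mapped (0.0 if there are none)."""
    total = mapped = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:  # skip secondary/supplementary records
            continue
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
    return mapped / total if total else 0.0


def rank_references(sam_paths):
    """Given {reference name: SAM path}, return (name, rate) pairs, best first."""
    rates = {}
    for name, path in sam_paths.items():
        with open(path) as handle:
            rates[name] = mapping_rate(handle)
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

For paired-end data each mate is counted as its own record. A clearly lower rate on one reference would hint that reads from accessory genes have nowhere to map.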
--
piet
Leave a comment:
michaellim replied

12-19-2014, 08:44 AM
Originally posted by Brian Bushnell View Post

All aligners are designed to handle references with multiple contigs; you don't need to combine anything (nor should you). You just need to index it.

Well, since you ask me, I will recommend BBMap, which also handles RNA-seq data but is faster and more sensitive than TopHat. But bacteria generally lack introns - when they are present, they are very short and only in a handful of genes. So it's not strictly necessary to use a splice-aware aligner for bacterial RNA-seq, though I would still recommend it.

Thanks Brian for the info.

I will give it a go first and see what happens.
Leave a comment:
michaellim replied

12-19-2014, 08:44 AM
Originally posted by Sergioo View Post

What do you mean exactly by sequence type? Maybe those assigned from MLST typing?

Hi Sergioo,

Yes, MLST. For example, E. coli ST11 will be different from ST131. However, we aren't certain whether there are any genes specific to ST131 that cannot be found in other E. coli sequence types.

So, if ST11 has a completed genome but ST131 is in contigs, and my current RNA-seq data is from ST131, should I use ST131 (multiple contigs) as the reference, or the completed genome of ST11, which is less closely related? That was my question. Hope that makes it clearer.

Thank you.
Leave a comment:
Sergioo replied

12-18-2014, 06:49 PM
Originally posted by michaellim View Post

Dear everyone,

A complete and annotated reference genome of a bacterium from a different sequence type.

Which would be more appropriate? Would appreciate some advice.

Thank you.

What do you mean exactly by sequence type? Maybe those assigned from MLST typing?
Leave a comment:
Brian Bushnell replied

12-18-2014, 05:55 PM
Originally posted by michaellim View Post

Hi Brian,

So if I were to use the multiple contigs as my reference when aligning my RNA-seq data, how should I do this? Do I need to combine all the contigs first (and if so, how)?

All aligners are designed to handle references with multiple contigs; you don't need to combine anything (nor should you). You just need to index it.

And for the alignment, which aligner is best for bacterial RNA-seq: TopHat, BWA or Bowtie? I heard TopHat is used a lot in eukaryotic RNA-seq as it looks for splice junctions.

Thank you very much.

Well, since you ask me, I will recommend BBMap, which also handles RNA-seq data but is faster and more sensitive than TopHat. But bacteria generally lack introns - when they are present, they are very short and only in a handful of genes. So it's not strictly necessary to use a splice-aware aligner for bacterial RNA-seq, though I would still recommend it.
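To show why the contigs do not need to be combined: in the SAM output, each contig of the reference gets its own `@SQ` header line, and every aligned read names the contig it hit in the third column (RNAME). The Python sketch below illustrates this under the assumption that you have a SAM file from any aligner; the function names are made up for illustration.

```python
from collections import Counter


def contigs_in_header(sam_lines):
    """Return {contig name: length} taken from the @SQ header lines."""
    contigs = {}
    for line in sam_lines:
        if not line.startswith("@SQ"):
            continue
        # Header fields look like 'SN:contig_1' and 'LN:51234'.
        tags = dict(field.split(":", 1)
                    for field in line.rstrip("\n").split("\t")[1:])
        contigs[tags["SN"]] = int(tags["LN"])
    return contigs


def reads_per_contig(sam_lines):
    """Count mapped primary alignments per contig, using the RNAME column."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        flag = int(fields[1])
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue  # unmapped, secondary or supplementary record
        counts[fields[2]] += 1
    return counts
```

If you pass an open file to both functions, reopen it (or read the lines into a list first), since the first pass consumes the file.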
Leave a comment:
michaellim replied

12-18-2014, 05:50 PM
Originally posted by GenoMax View Post

If the overall organization of the genomes is similar, then whole-genome comparison can be informative. Mauve is designed for doing these kinds of comparisons, which can help locate genome-level rearrangements. Comparing multiple E. coli strains would be appropriate, as in this example from Yersinia: http://asap.genetics.wisc.edu/softwa...creenshots.php

Hi GenoMax,

Thanks for the info. Could you please advise how I can compare the published complete genome with the other published genome that is in contigs? Do I need to merge the contigs before using Mauve (and if so, how)?

Many thanks.
Leave a comment:
michaellim replied

12-18-2014, 05:46 PM
Originally posted by Brian Bushnell View Post

It's difficult to get single-contig assemblies (unless you use PacBio data). Multiple contigs typically mean that the coverage was too low in places to assemble correctly, or there were long repeats that confused the assembler. When we assemble a microbe from Illumina data, we might get 50 contigs or more. Probably 99%+ of the genome is there, but typically the order and orientation of the contigs is not known. There are not necessarily gaps, but there may be.

As for "ST", I've just never heard that terminology before; people I work with normally refer to those as "strains". And yes, I think it's still best to use the genome that is most closely related to your organism unless the assembly is really bad (hundreds of small contigs).

Edit - also, as GenoMax pointed out, plasmids will result in assemblies that are correctly multi-contig.

Hi Brian,

So if I were to use the multiple contigs as my reference when aligning my RNA-seq data, how should I do this? Do I need to combine all the contigs first (and if so, how)?

And for the alignment, which aligner is best for bacterial RNA-seq: TopHat, BWA or Bowtie? I heard TopHat is used a lot in eukaryotic RNA-seq as it looks for splice junctions.

Thank you very much.
Leave a comment:
GenoMax replied

12-18-2014, 03:58 PM
Originally posted by michaellim View Post

Hi Brian,

For example, E. coli is ONE species, but there are various versions of it, i.e. sequence types (ST). For example, the human-adapted E. coli that causes problematic infections around the world is ST131. Between the different sequence types, there might be mutations/genes specific to each of them.

If the overall organization of the genomes is similar, then whole-genome comparison can be informative. Mauve is designed for doing these kinds of comparisons, which can help locate genome-level rearrangements. Comparing multiple E. coli strains would be appropriate, as in this example from Yersinia: http://asap.genetics.wisc.edu/softwa...creenshots.php
Leave a comment:
Brian Bushnell replied

12-18-2014, 03:50 PM
Originally posted by michaellim View Post

Hi Brian,

For example, E. coli is ONE species, but there are various versions of it, i.e. sequence types (ST). For example, the human-adapted E. coli that causes problematic infections around the world is ST131. Between the different sequence types, there might be mutations/genes specific to each of them.

I'm totally new to sequencing. When they are in several contigs, does it mean that there are gaps between the sequences, which is why the authors deposited the sequences as contigs rather than as a circular 4 Mb chromosome?

Many thanks for the advice.

It's difficult to get single-contig assemblies (unless you use PacBio data). Multiple contigs typically mean that the coverage was too low in places to assemble correctly, or there were long repeats that confused the assembler. When we assemble a microbe from Illumina data, we might get 50 contigs or more. Probably 99%+ of the genome is there, but typically the order and orientation of the contigs is not known. There are not necessarily gaps, but there may be.

As for "ST", I've just never heard that terminology before; people I work with normally refer to those as "strains". And yes, I think it's still best to use the genome that is most closely related to your organism unless the assembly is really bad (hundreds of small contigs).

Edit - also, as GenoMax pointed out, plasmids will result in assemblies that are correctly multi-contig.
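To judge whether a deposited assembly is "really bad", a quick summary of the contigs helps: how many there are, their total length, and the N50. Below is a minimal Python sketch that assumes a plain multi-FASTA file of contigs; the function names are illustrative, and what counts as acceptable is left to you.

```python
def contig_lengths(fasta_lines):
    """Return the sequence length of each record in a (multi-)FASTA file."""
    lengths = []
    current = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):  # a header line starts a new contig
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def summarize(fasta_lines):
    """Return (number of contigs, total bases, N50) for an assembly."""
    lengths = contig_lengths(fasta_lines)
    return len(lengths), sum(lengths), n50(lengths)
```

A few dozen contigs with a large N50 is the usual Illumina situation described above; hundreds of contigs with a small N50 would be the case to be wary of.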
Leave a comment:
GenoMax replied

12-18-2014, 03:49 PM
Originally posted by michaellim View Post

I'm totally new to sequencing. When they are in several contigs, does it mean that there are gaps between the sequences, which is why the authors deposited the sequences as contigs rather than as a circular 4 Mb chromosome?

Many thanks for the advice.

That is a likely explanation. If submitters are not completely sure that the contigs go together (there could be multiple plasmids in some bacteria and the separate pieces may be real) they would be left in that state.
Leave a comment:
michaellim replied

12-18-2014, 03:22 PM
Originally posted by AntonioRFranco View Post

You can do a whole-genome comparison with programs such as Mauve or ACT. There are tutorials around explaining how to use them.

Hi Antonio,

Do you mean compare the two options first? What if there's a difference between the two genomes? What do you suggest I do then?

Many thanks.
Leave a comment:
