  • Which reference genome to use?

    Dear everyone,

    I am doing RNA sequencing on a bacterium, but I am unsure which reference genome to use for my RNA-seq data. Currently, there are two options:

    1. A complete and annotated reference genome of a bacterium from a different sequence type.

    2. A newly published genome of the same sequence type as my bacterium, but the genome is split into several contigs.

    I do not know how much the sequence types differ, or how many genes are specific to my sequence type and absent from the complete reference genome (option 1). They are all the same bacterial species, though.

    Which would be more appropriate? Would appreciate some advice.

    Thank you.

  • #2
    For most purposes, you should use the most closely-related genome, even if it is not a single-contig assembly. I'm not sure what you mean by "sequence type" though.

    • #3
      You can do a whole-genome comparison with programs such as Mauve or ACT. There are tutorials around explaining how to use them.

      • #4
        Originally posted by Brian Bushnell View Post
        For most purposes, you should use the most closely-related genome, even if it is not a single-contig assembly. I'm not sure what you mean by "sequence type" though.
        Hi Brian,

        For example, E. coli is ONE species, but there are various versions of it, i.e. sequence types (STs). For instance, the human-adapted E. coli that causes problematic infections around the world is ST131. Between the different sequence types, there might be mutations/genes specific to each of them.

        I'm totally new to sequencing. When a genome is in several contigs, does it mean that there are gaps between the sequences, which is why the authors deposited contigs rather than a circular 4 Mb chromosome?

        Many thanks for the advice.

        • #5
          Originally posted by AntonioRFranco View Post
          You can do a whole-genome comparison with programs such as Mauve or ACT. There are tutorials around explaining how to use them.
          Hi Antonio,

          Do you mean compare the two options first? What if there's a difference between the two genomes? What do you suggest I do then?

          Many thanks.

          • #6
            Originally posted by michaellim View Post

            I'm totally new to sequencing. When a genome is in several contigs, does it mean that there are gaps between the sequences, which is why the authors deposited contigs rather than a circular 4 Mb chromosome?

            Many thanks for the advice.
            That is a likely explanation. If submitters are not completely sure that the contigs go together (some bacteria carry multiple plasmids, so the separate pieces may be real), the contigs would be left in that state.

            • #7
              Originally posted by michaellim View Post
              Hi Brian,

              For example, E. coli is ONE species, but there are various versions of it, i.e. sequence types (STs). For instance, the human-adapted E. coli that causes problematic infections around the world is ST131. Between the different sequence types, there might be mutations/genes specific to each of them.

              I'm totally new to sequencing. When a genome is in several contigs, does it mean that there are gaps between the sequences, which is why the authors deposited contigs rather than a circular 4 Mb chromosome?

              Many thanks for the advice.
              It's difficult to get single-contig assemblies (unless you use PacBio data). Multiple contigs typically mean that the coverage was too low in places to assemble correctly, or there were long repeats that confused the assembler. When we assemble a microbe from Illumina data, we might get 50 contigs or more. Probably 99%+ of the genome is there, but typically the order and orientation of the contigs are not known. There are not necessarily gaps, but there may be.

              As for "ST", I've just never heard that terminology before; people I work with normally refer to those as "strains". And yes, I think it's still best to use the genome that is most closely related to your organism unless the assembly is really bad (hundreds of small contigs).

              Edit - also, as GenoMax pointed out, plasmids will cause correct multi-contig assemblies.
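
To judge whether a contig-level assembly is "really bad" in the sense described above (hundreds of small contigs), it can help to look at a few summary numbers first. The following is only a minimal sketch in plain Python; the function names and the toy sequences are made up for illustration, and N50 is used here as one common contiguity measure, not something the post itself prescribes.

```python
from io import StringIO


def contig_lengths(fasta_handle):
    """Return the length of every sequence in a FASTA stream, in file order."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(fasta_handle):
    """Collect contig count, total size, N50 and longest contig for one assembly."""
    lengths = contig_lengths(fasta_handle)
    return {
        "contigs": len(lengths),
        "total_bases": sum(lengths),
        "n50": n50(lengths),
        "longest": max(lengths, default=0),
    }


# Toy three-contig "assembly" (hypothetical data, far smaller than a real genome).
toy_assembly = StringIO(">c1\nACGTACGTAC\n>c2\nACGTAC\n>c3\nACGT\n")
print(summarize(toy_assembly))
```

For a real assembly, pass an open file instead of the `StringIO` object, for example `summarize(open("contigs.fasta"))` with your own file name. A few dozen contigs with a large N50 is the situation described above as perfectly usable.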

              • #8
                Originally posted by michaellim View Post
                Hi Brian,

                For example, E. coli is ONE species, but there are various versions of it, i.e. sequence types (STs). For instance, the human-adapted E. coli that causes problematic infections around the world is ST131. Between the different sequence types, there might be mutations/genes specific to each of them.
                If the overall organization of the genomes is similar, then whole-genome comparison can be informative. Mauve is designed for doing these kinds of comparisons, which can help locate genome-level rearrangements. Comparing multiple E. coli strains would be appropriate, as in this example from Yersinia: http://asap.genetics.wisc.edu/softwa...creenshots.php

                • #9
                  Originally posted by Brian Bushnell View Post
                  It's difficult to get single-contig assemblies (unless you use PacBio data). Multiple contigs typically mean that the coverage was too low in places to assemble correctly, or there were long repeats that confused the assembler. When we assemble a microbe from Illumina data, we might get 50 contigs or more. Probably 99%+ of the genome is there, but typically the order and orientation of the contigs are not known. There are not necessarily gaps, but there may be.

                  As for "ST", I've just never heard that terminology before; people I work with normally refer to those as "strains". And yes, I think it's still best to use the genome that is most closely related to your organism unless the assembly is really bad (hundreds of small contigs).

                  Edit - also, as GenoMax pointed out, plasmids will cause correct multi-contig assemblies.
                  Hi Brian,

                  So if I use the multi-contig genome as my reference when aligning my RNA-seq data, how should I do this? Do I need to combine all the contigs first (and if so, how)?

                  And which aligner is best for bacterial RNA-seq: Tophat, BWA or Bowtie? I heard Tophat is used a lot in eukaryotic RNA-seq as it looks for splice junctions.

                  Thank you very much.

                  • #10
                    Originally posted by GenoMax View Post
                    If the overall organization of the genomes is similar, then whole-genome comparison can be informative. Mauve is designed for doing these kinds of comparisons, which can help locate genome-level rearrangements. Comparing multiple E. coli strains would be appropriate, as in this example from Yersinia: http://asap.genetics.wisc.edu/softwa...creenshots.php
                    Hi GenoMax,

                    Thanks for the info. Could you please advise how I should compare the published complete genome with the other published genome that is in contigs? Do I need to merge the contigs before using Mauve (and if so, how)?

                    Many thanks.

                    • #11
                      Originally posted by michaellim View Post
                      Hi Brian,

                      So if I use the multi-contig genome as my reference when aligning my RNA-seq data, how should I do this? Do I need to combine all the contigs first (and if so, how)?
                      All aligners are designed to handle references with multiple contigs; you don't need to combine anything (nor should you). You just need to index it.

                      And which aligner is best for bacterial RNA-seq: Tophat, BWA or Bowtie? I heard Tophat is used a lot in eukaryotic RNA-seq as it looks for splice junctions.

                      Thank you very much.
                      Well, since you ask me, I will recommend BBMap, which also handles RNA-seq data but is faster and more sensitive than Tophat. Bacteria generally lack introns, though; when they are present, they are very short and only in a handful of genes. So it's not strictly necessary to use a splice-aware aligner for bacterial RNA-seq, though I would still recommend it.
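
As a rough illustration of "you just need to index it", the sketch below builds a single BBMap command that takes the multi-contig FASTA unchanged as the reference. It assumes BBMap's `ref=`, `in=` and `out=` parameters and that `bbmap.sh` is on the PATH; the helper name and all file names are hypothetical, so check the BBMap documentation for the options that suit your data.

```python
import shlex


def bbmap_command(reference_fasta, reads_fastq, output_sam):
    """Build a BBMap command line for mapping reads to a reference FASTA.

    The reference may contain any number of contigs; it is passed as-is,
    with no merging or concatenation step beforehand.
    """
    return [
        "bbmap.sh",
        f"ref={reference_fasta}",  # reference to index (multi-contig FASTA is fine)
        f"in={reads_fastq}",       # RNA-seq reads to map
        f"out={output_sam}",       # alignments written here
    ]


# Hypothetical file names, for illustration only.
cmd = bbmap_command("st131_contigs.fasta", "rnaseq_reads.fastq", "mapped.sam")
print(shlex.join(cmd))

# To actually run the mapping (requires BBMap to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Each contig keeps its own name in the alignment output, which is one reason not to glue the contigs together: a read mapped to an artificial junction between two contigs would be meaningless.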

                      • #12
                        Originally posted by michaellim View Post
                        Dear everyone,

                        A complete and annotated reference genome of a bacteria from a different sequence type.

                        Which would be more appropriate? Would appreciate some advice.

                        Thank you.
                        What do you mean exactly by sequence type? Maybe those assigned from MLST typing?

                        • #13
                          Originally posted by Sergioo View Post
                          What do you mean exactly by sequence type? Maybe those assigned from MLST typing?
                          Hi Sergioo,

                          Yes, MLST. For example, E. coli ST11 will be different from ST131. However, we aren't certain whether there are any genes specific to ST131 that cannot be found in other E. coli sequence types.

                          So, if ST11 has a completed genome but ST131 is in contigs, and my current RNA-seq data is from ST131, should I use ST131 (multiple contigs) as the reference, or the completed genome of ST11, which is less closely related? That was my question. Hope that makes it clearer.

                          Thank you.

                          • #14
                            Originally posted by Brian Bushnell View Post
                            All aligners are designed to handle references with multiple contigs; you don't need to combine anything (nor should you). You just need to index it.



                            Well, since you ask me, I will recommend BBMap, which also handles RNA-seq data but is faster and more sensitive than Tophat. Bacteria generally lack introns, though; when they are present, they are very short and only in a handful of genes. So it's not strictly necessary to use a splice-aware aligner for bacterial RNA-seq, though I would still recommend it.
                            Thanks Brian for the info.

                            I will give it a go first and see what happens.

                            • #15
                              Originally posted by michaellim View Post
                              For example, E. coli ST11 will be different from ST131. However, we aren't certain whether there are any genes specific to ST131 that cannot be found in other E. coli sequence types.

                              So, if ST11 has a completed genome but ST131 is in contigs, and my current RNA-seq data is from ST131, should I use ST131 (multiple contigs) as the reference, or the completed genome of ST11, which is less closely related? That was my question. Hope that makes it clearer.
                              Multilocus sequence typing (MLST) is a method frequently used to characterize bacterial genomes. MLST schemes have been published for most pathogenic bacteria. For the species Escherichia coli (including Shigella) there are even three competing schemes. In the scheme maintained at Cork University, sequence type 11 (ST11) refers to isolates typically found in cattle (serovar O157:H7), while strains belonging to ST131 are uropathogenic, which means they are associated with infections of the urinary tract in humans. The chromosome of E. coli encodes more than 4000 proteins. Maybe half of them belong to the accessory genome, which means they are only found in some strains or clonal groups.

                              If you want to map your reads from RNA sequencing, I would recommend using a genome from the same or a very closely related sequence type. Otherwise you will miss several genes from the accessory genome. For E. coli ST131 there are several genomes available in GenBank, even fully finished ones (AP009378.1 and plasmid AP009379.1, CP002797.2). Sequences for ST131 isolates KTE173, KTE49, KTE162, KTE6, KTE211, KTE175, KTE178, KTE216, KTE148, KTE139 are available as WGS contigs.

                              I would recommend trying several reference genomes. A mapping run usually takes only a few minutes on a desktop PC.
                              --
                              piet
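
One way to act on the suggestion to try several reference genomes is to map the same reads against each candidate and compare the fraction of reads that align. The sketch below is an assumption-laden illustration, not part of the original advice: it reads plain-text SAM output, uses the standard SAM FLAG bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary), and runs on tiny made-up alignments with hypothetical reference labels.

```python
from io import StringIO


def mapping_rate(sam_handle):
    """Fraction of primary alignment records in a SAM stream that are mapped.

    Header lines are skipped. Secondary and supplementary records are ignored
    so that each read (or each mate of a pair) is counted once.
    """
    mapped = 0
    total = 0
    for line in sam_handle:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if not flag & 0x4:
            mapped += 1
    return mapped / total if total else 0.0


# Toy SAM content for two hypothetical references (made-up records).
sam_same_st = (
    "@SQ\tSN:contig_1\tLN:1000\n"
    "r1\t0\tcontig_1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r2\t16\tcontig_1\t50\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
)
sam_other_st = (
    "@SQ\tSN:chromosome\tLN:1000\n"
    "r1\t0\tchromosome\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
)

rates = {
    "same_ST_contigs": mapping_rate(StringIO(sam_same_st)),
    "other_ST_complete": mapping_rate(StringIO(sam_other_st)),
}
best = max(rates, key=rates.get)
print(rates, "-> highest mapping rate:", best)
```

A higher mapping rate alone does not prove a reference is better annotated, but a clear drop against the more distant genome would be consistent with reads from accessory genes having nowhere to align.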
