  • #16
    Originally posted by piet
    Multi-locus sequence typing (MLST) is a method frequently used to characterize bacterial genomes. MLST schemes have been published for most pathogenic bacteria. For the species Escherichia coli (including Shigella) there are even three competing schemes. In the scheme maintained at Cork University, sequence type 11 (ST11) refers to isolates typically found in cattle (serovar O157:H7), while strains belonging to ST131 are uropathogenic, which means they are associated with infections of the urinary tract in humans. The chromosome of E. coli encodes more than 4000 proteins. Perhaps half of them belong to the accessory genome, which means they are found only in some strains or clonal groups.

    If you want to map your reads from RNA sequencing, I would recommend using a genome from the same or a very closely related sequence type. Otherwise you will miss several genes from the accessory genome. For E. coli ST131 there are several genomes available in GenBank, even fully finished ones (AP009378.1 and plasmid AP009379.1, CP002797.2). Sequences for ST131 isolates KTE173, KTE49, KTE162, KTE6, KTE211, KTE175, KTE178, KTE216, KTE148, KTE139 are available as WGS contigs.

    I would recommend trying several reference genomes. A mapping run usually takes only a few minutes on a desktop PC.
    --
    piet
    Hi Piet,

    Many thanks for the clarification. I will give it a try with different genomes then, if it doesn't take too long. May I know what alignment/mapping software you use? Is there any particular reason for that choice?

    Cheers.


    • #17
      Originally posted by michaellim
      May I know what alignment/mapping software you use?
      I use 'bwa mem', but my use case is processing DNA sequencing data. It is very fast and reliable with default settings. Nevertheless, bwa and similar mappers should also be suitable for bacterial RNA sequencing, since bacteria do not splice their messenger RNA.

      In the beginning it took me quite a while to figure out how to write shell scripts that start bwa runs in a comfortable way and handle the resulting SAM files. You will definitely need to learn some kind of shell or script programming if you want to go that route.

      Why don't you do a DNA sequencing run of your particular isolate before you go into RNA sequencing?
      --
      piet
      Last edited by piet; 12-19-2014, 02:57 PM.
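
For readers who want to see what such a wrapper script could look like, here is a minimal sketch in Python rather than shell. It only assumes that the `bwa` executable is on the PATH. The function names and file names are made up for illustration, and it handles single-end reads only.

```python
import subprocess
from pathlib import Path


def bwa_index_cmd(reference: str) -> list:
    """Command that builds the BWA index files next to the reference FASTA."""
    return ["bwa", "index", reference]


def bwa_mem_cmd(reference: str, reads: str, threads: int = 1) -> list:
    """Command that maps single-end reads; bwa mem writes SAM to stdout."""
    return ["bwa", "mem", "-t", str(threads), reference, reads]


def map_reads(reference: str, reads: str, sam_out: str, threads: int = 1) -> None:
    """Index the reference if that has not been done yet, then map the reads."""
    # bwa index writes several files; <reference>.bwt is one of them.
    if not Path(reference + ".bwt").exists():
        subprocess.run(bwa_index_cmd(reference), check=True)
    # Redirect the SAM output of bwa mem into a file.
    with open(sam_out, "w") as sam:
        subprocess.run(bwa_mem_cmd(reference, reads, threads), stdout=sam, check=True)


# Hypothetical usage (file names are placeholders):
# map_reads("reference.fasta", "sample1.fastq", "sample1.sam", threads=4)
```

Keeping the command construction in small functions makes it easy to loop over several reference genomes or samples later on.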


      • #18
        Originally posted by piet
        I use 'bwa mem', but my use case is processing DNA sequencing data. It is very fast and reliable with default settings. Nevertheless, bwa and similar mappers should also be suitable for bacterial RNA sequencing, since bacteria do not splice their messenger RNA.

        In the beginning it took me quite a while to figure out how to write shell scripts that start bwa runs in a comfortable way and handle the resulting SAM files. You will definitely need to learn some kind of shell or script programming if you want to go that route.

        Why don't you do a DNA sequencing run of your particular isolate before you go into RNA sequencing?
        --
        piet

        Hi Piet,

        I see. I have close to no coding/programming knowledge, so maybe BWA is not suitable for me. But I will check out the website for more info about it.

        I did consider DNA sequencing the genome of my sequence type strain, but the lab has limited funds.

        Thank you very much.


        • #19
          Dear All,

          May I also ask: my RNA-seq libraries were about 260 bp in size according to Illumina's preparation protocol. For the FASTQ files which I currently have, do I need to remove the adapter (index) sequences before mapping to the reference genome?

          Many thanks.


          • #20
            Originally posted by GenoMax
            That is a likely explanation. If submitters are not completely sure that the contigs go together (there could be multiple plasmids in some bacteria and the separate pieces may be real) they would be left in that state.
            Hi Sergio,

            May I check with you whether I need to trim the adapter sequence from my RNA-seq FASTQ files? My libraries were about 260 bp each.

            Any suggestions on how I should do this? Do I just set the software to trim from base 1 to base X, or do I need to give the individual adapter sequence to the trimmer? I've noticed quite a few different trimmers online. There is a built-in one in Galaxy too.

            Many thanks.


            • #21
              It is always a good idea to check for and trim adapter sequences, if present. Many aligners will soft-clip them, but if you are planning to do any assembly you want to start with clean reads. BTW, adapters and indexes are not the same thing. With Illumina technology, index sequences are never part of the main read, so they do not need to be trimmed (unless you are using custom inline indexes).

              BBDuk is easy to use (on Windows/Mac/*nix), and so is Trimmomatic. You could do this in Galaxy, but at some point you will need to move to the command line (e.g. if you decide to use Mauve).
              Last edited by GenoMax; 12-20-2014, 06:04 AM.
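
To illustrate what adapter trimming means in practice, here is a deliberately simplified sketch. It uses exact matching only, so it is not how BBDuk or Trimmomatic work internally (those tolerate mismatches and use other strategies). The sequence AGATCGGAAGAGC is assumed here as the common start of the TruSeq adapters; check which kit was actually used before relying on it.

```python
# Assumption for illustration: common 5' start of Illumina TruSeq adapters.
TRUSEQ_PREFIX = "AGATCGGAAGAGC"


def trim_adapter(read: str, adapter: str = TRUSEQ_PREFIX, min_overlap: int = 5) -> str:
    """Remove a 3' adapter from a read using exact matching only."""
    # Case 1: the complete adapter occurs somewhere inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends inside the adapter, so only a prefix of the
    # adapter is visible at the 3' end. Try the longest overlap first.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[: len(read) - overlap]
    # No adapter found: return the read unchanged.
    return read
```

Adapter sequence only shows up when the insert is shorter than the read length, so for many reads this function simply returns the input unchanged.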


              • #22
                Originally posted by michaellim
                Hi Sergioo,

                Yes, MLST. For example, E. coli ST11 will be different from ST131. However, we aren't certain whether there are any genes specific to ST131 that cannot be found in other E. coli sequence types.

                So, if ST11 has a completed genome but ST131 is only available as contigs, and my current RNA-seq data is from ST131, should I use ST131 (multiple contigs) as the reference, or the completed genome of ST11, which is less closely related? That was my question. Hope that makes it clearer.

                Thank you.
                By now, you've got many suggestions from more experienced readers. You are lucky because you've just got to sit down and think of which option to use.

                I am not familiar with RNA-seq projects, but if it were whole-genome sequencing, I would first go for an assembly (even de novo) using a complete genome (not the one in multiple contigs). The complete genome, even if not closely related, will allow you to order your contigs and resolve misassemblies. Note that you cannot rely on a draft genome sequence, since its biggest drawback is the lack of order among its contigs.

                Now, once you've got your draft sequences ordered, you are free to compare it with what you think is more related (for example sequences from the same ST as your isolate).
                Hope it helps.


                • #23
                  Originally posted by GenoMax
                  It is always a good idea to check for and trim adapter sequences, if present. Many aligners will soft-clip them, but if you are planning to do any assembly you want to start with clean reads. BTW, adapters and indexes are not the same thing. With Illumina technology, index sequences are never part of the main read, so they do not need to be trimmed (unless you are using custom inline indexes).

                  BBDuk is easy to use (on Windows/Mac/*nix), and so is Trimmomatic. You could do this in Galaxy, but at some point you will need to move to the command line (e.g. if you decide to use Mauve).
                  Hi Genomax,

                  Sorry, it's my first time doing RNA-seq and dealing with sequencing data. I used a MiSeq for the sequencing (the flow cell run was done by the sequencing lab; I prepared everything up to the denatured libraries). From the Illumina library prep manual, I (mis)understood 'adapters' to be the same as 'index/indices' (unique 6-nucleotide sequences used to label each RNA sample).

                  Could you please explain how they are different? Will the sequence still be in the sequencing FASTQ file?

                  By the way, looking at the Per Base Sequence Quality plots, for all of my samples the lower end of the yellow box drops below a quality score of 20 after base 150 (all sequences are 200 bases). Does this mean I need to trim the adapters and also everything after base 150?

                  I was reading some blogs, and there are arguments about whether or not it is important to trim before mapping. It's rather confusing to me.

                  Thank you.


                  • #24
                    Originally posted by Sergioo
                    By now, you've got many suggestions from more experienced readers. You are lucky because you've just got to sit down and think of which option to use.

                    I am not familiar with RNA-seq projects, but if it were whole-genome sequencing, I would first go for an assembly (even de novo) using a complete genome (not the one in multiple contigs). The complete genome, even if not closely related, will allow you to order your contigs and resolve misassemblies. Note that you cannot rely on a draft genome sequence, since its biggest drawback is the lack of order among its contigs.

                    Now, once you've got your draft sequences ordered, you are free to compare it with what you think is more related (for example sequences from the same ST as your isolate).
                    Hope it helps.
                    Hi Sergio,

                    Yes, I'm truly very grateful for all the responses given. I'm slowly understanding more about the software options and their uses and the initial mapping analyses. I currently have no idea, as I've not done this before and there is no one in the department who does this kind of work, so I couldn't get any advice internally.

                    By the way, when you are doing mapping, for example when you have 1 chromosome sequence and 5 plasmid sequences on NCBI, how do you do the mapping? I was looking at Galaxy and you can only choose one reference genome for any single mapping task.

                    Thank you.


                    • #25
                      Originally posted by Brian Bushnell
                      All aligners are designed to handle references with multiple contigs; you don't need to combine anything (nor should you). You just need to index it.

                      Well, since you ask me, I will recommend BBMap, which also handles RNA-seq data but is faster and more sensitive than TopHat. But bacteria generally lack introns - when they are present, they are very short and only in a handful of genes. So it's not strictly necessary to use a splice-aware aligner for bacterial RNA-seq, though I would still recommend it.
                      Hi Brian,

                      Can I get some further clarification from you too? I was looking at some genomes in NCBI and they are deposited as Chromosome and multiple plasmids.

                      In this case, when I'm mapping, am I supposed to combine all the sequences (chromosome and plasmid) in NCBI? Or do I index them as you've mentioned? Sorry, I have no prior knowledge at all on DNA sequencing/RNA sequencing.

                      I was trying to map the RNA seq data in Galaxy, but I can only choose one reference at a time.

                      Thank you.


                      • #26
                        Originally posted by michaellim
                        Could you please explain how they are different? Will the sequence still be in the sequencing FASTQ file?
                        Watch this short video from Illumina that explains how their sequencing technology works (it addresses adapters/indexes): https://www.youtube.com/watch?v=HMyCqWhwB8E The index read sequence will not be part of the actual read. It will be included in the FASTQ read header (http://en.wikipedia.org/wiki/FASTQ_f...ce_identifiers; skip to the CASAVA 1.8 format headers).

                        By the way, looking at the Per Base Sequence Quality plots, for all of my samples the lower end of the yellow box drops below a quality score of 20 after base 150 (all sequences are 200 bases). Does this mean I need to trim the adapters and also everything after base 150?

                        Thank you.
                        Post the FastQC plots for your sample(s) if you need specific comments, but in general, if you had inserts that were shorter than your read length, then you are going to have adapters in your sequences. If your data was processed on the MiSeq by MiSeq Reporter, then the adapters may have already been removed (ask the facility if you are not sure).

                        BBDuk is especially good at documenting statistics about how many reads had adapters/were trimmed. Make sure you use the correct adapter reference files (Nextera, TruSeq, etc.; they are included in the BBMap download, in the resources directory).
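
To check whether the index really sits in the header rather than in the read, one can pull it out of the first line of a FASTQ record. The sketch below assumes the CASAVA 1.8 header layout as described on the Wikipedia page mentioned above; the function name is made up, and some pipelines may write a sample number instead of the index bases in the last field.

```python
def parse_casava18_header(header: str) -> dict:
    """Split a CASAVA 1.8 style FASTQ header line into named fields."""
    if not header.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    # The header has two parts separated by a single space.
    location, _, description = header[1:].strip().partition(" ")
    instrument, run, flowcell, lane, tile, x_pos, y_pos = location.split(":")
    read, is_filtered, control, index = description.split(":")
    return {
        "instrument": instrument,
        "run": run,
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x_pos),
        "y": int(y_pos),
        "read": int(read),           # 1 or 2 for paired-end data
        "is_filtered": is_filtered,  # "Y" if the read failed the filter
        "control": control,
        "index": index,              # index (barcode) sequence of the sample
    }


# Example header in the documented layout (values are illustrative only):
example = "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
fields = parse_casava18_header(example)
```

Here `fields["index"]` holds the six index bases, while the sequence line of the record contains only the insert (plus adapter, if the insert was shorter than the read).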


                        • #27
                          Originally posted by GenoMax
                          Watch this short video from Illumina that explains how their sequencing technology works (it addresses adapters/indexes): https://www.youtube.com/watch?v=HMyCqWhwB8E The index read sequence will not be part of the actual read. It will be included in the FASTQ read header (http://en.wikipedia.org/wiki/FASTQ_f...ce_identifiers; skip to the CASAVA 1.8 format headers).

                          Post the FastQC plots for your sample(s) if you need specific comments, but in general, if you had inserts that were shorter than your read length, then you are going to have adapters in your sequences. If your data was processed on the MiSeq by MiSeq Reporter, then the adapters may have already been removed (ask the facility if you are not sure).

                          BBDuk is especially good at documenting statistics about how many reads had adapters/were trimmed. Make sure you use the correct adapter reference files (Nextera, TruSeq, etc.; they are included in the BBMap download, in the resources directory).
                          Hi Genomax,

                          Thanks for the explanation. I've attached four plots for you to comment on. Honestly, I don't have much of an idea about them. All I was told was that as long as the read is above 20 on the Y-axis, it's good to use. Bases below 20 may have been called wrongly and may need to be trimmed before mapping.

                          By the way, I did a trial mapping of one of the RNA-seq Groom'ed files against a reference chromosome sequence, but when I try to view the BAM file in the Integrative Genomics Viewer, the reference chromosome is not in the drop-down list. When I tried to upload my own FASTA file downloaded from NCBI, there was no gene annotation in it. Do you know how I should upload the annotation? I tried reading the IGV website; it says to download the GFF file from NCBI, but I don't see any place on "http://www.ncbi.nlm.nih.gov/nuccore/407479587" to download the GFF file from the "Display Settings" (top left of the screen).

                          Could you kindly advise?

                          Thank you.
                          Attached Files


                          • #28
                            Originally posted by michaellim
                            download the GFF file from NCBI, but I don't see any place on "http://www.ncbi.nlm.nih.gov/nuccore/407479587" to download the GFF file from the "Display Settings" (top left of the screen).
                            The primary format NCBI has used for ages is the GenBank flat file format. Download the entry in 'GenBank' format and then use a tool like 'seqret' from the EMBOSS package to convert the GenBank flat file into GFF.

                            Or you may use the TogoWS web service to download the entry in GFF format directly:
                            wget http://togows.org/entry/nucleotide/407479587.gff

                            Please note, that GFF is not a strict format but rather a framework to invent your own format. Column 9 of the GFF file comprises several tags. The names of these tags are more or less arbitrary. The tag names assigned by TogoWS may or may not meet the requirements of your sequence viewer. Furthermore, column 1 of the GFF file holds the name of the sequence. The name used in column 1 of the GFF file must be EXACTLY the same as the name used in the corresponding FASTA file. In a FASTA file the name of the sequence is the first word of the description line (all the characters before the first space).
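
The naming rule above is easy to check before loading anything into a viewer. The following is only a sketch with made-up function names: it takes the first word of each FASTA description line and compares it with column 1 of the GFF file. It does not handle an embedded `##FASTA` section inside a GFF3 file.

```python
def fasta_names(fasta_text: str) -> set:
    """Sequence names: the first word of each '>' description line."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                names.add(words[0])
    return names


def gff_seqids(gff_text: str) -> set:
    """Values found in column 1 of a tab-separated GFF file."""
    seqids = set()
    for line in gff_text.splitlines():
        # Skip blank lines, comments and '##' directives.
        if not line.strip() or line.startswith("#"):
            continue
        seqids.add(line.split("\t")[0])
    return seqids


def unmatched_seqids(fasta_text: str, gff_text: str) -> set:
    """GFF sequence names that have no identically named FASTA record."""
    return gff_seqids(gff_text) - fasta_names(fasta_text)
```

If `unmatched_seqids(...)` returns anything other than an empty set, the viewer will most likely show the sequence without its annotation, and either the FASTA header or GFF column 1 needs to be renamed.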

                            Please also note that GI=407479587 is an isolate from the German HUSEC outbreak in 2011, which is sequence type 678 and differs from the uropathogenic ST131 you asked about before.

                            With regard to your questions about read trimming and the inclusion of plasmids, I would recommend that you initially start with just a single chromosomal sequence and without any read trimming. You should be able to map 60 to 80 percent of your reads that way. Your goal for the next few weeks should be to familiarize yourself with all these tools and to establish a basic workflow. Once you have such a workflow, you can try to improve the number of reads mapped by either adding plasmid sequences to your set of reference sequences or by doing some read trimming.
                            --
                            piet
                            Last edited by piet; 12-21-2014, 09:14 AM.
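
To see whether a mapping run lands in the 60 to 80 percent range mentioned above, the mapped fraction can be counted directly from the SAM file. This is a rough sketch with a made-up function name; it relies only on the FLAG column of the SAM format (0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary) and counts each primary record once.

```python
def mapped_percent(sam_lines) -> float:
    """Percentage of primary SAM records that are mapped."""
    total = 0
    mapped = 0
    for line in sam_lines:
        # Header lines start with '@'; skip them and any blank lines.
        if line.startswith("@") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Ignore secondary and supplementary alignments so that a read
        # with several alignment records is only counted once.
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if not flag & 0x4:
            mapped += 1
    return 100.0 * mapped / total if total else 0.0


# Hypothetical usage (file name is a placeholder):
# with open("sample1.sam") as sam:
#     print(round(mapped_percent(sam), 1))
```

Running this once per reference genome gives a simple number for comparing how well each candidate reference fits the reads.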


                            • #29
                              @michaellim: What is your ultimate aim with this RNAseq study? Are you looking to do differential expression or just checking to see what is expressed under some specific condition(s)?

                              For the immediate issue of not being able to see annotations, you can use the GFF file from Piet's example and see if that works with IGV. You should compare the pre- and post-trimming FastQC plots to see if there is an improvement in stats. The plots you have posted don't look bad, but it is difficult to say if you have adapter contamination unless you try the trimming. No FASTQ grooming in Galaxy should be necessary with MiSeq data; it is already in Sanger FASTQ format.

                              Once you go away from "model" organisms, tools such as Galaxy start becoming limiting (as you have already discovered). Depending on your overall goals, it may be beneficial to start learning how to do these analyses on the command line. If this is a small part of whatever you are trying to do, then enlisting the help of a friend or local bioinformatics support folks may be the easiest thing to do, so you can get a set of hypotheses to test at the bench and move on.
                              Last edited by GenoMax; 12-21-2014, 08:29 AM.
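
As a small illustration of what "Sanger FASTQ format" means for the quality line: each character encodes a Phred score as its ASCII code minus 33. The sketch below decodes the scores and cuts bases off the 3' end while they are below a threshold (20 is used because that value came up earlier in the thread). It is a naive illustration with made-up function names, not the algorithm used by BBDuk or Trimmomatic.

```python
def phred33(quality: str) -> list:
    """Decode a Sanger (Phred+33) quality string into integer scores."""
    return [ord(char) - 33 for char in quality]


def trim_low_quality_tail(seq: str, qual: str, threshold: int = 20):
    """Drop bases from the 3' end while their score is below threshold."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality must have the same length")
    scores = phred33(qual)
    end = len(scores)
    # Walk backwards from the 3' end until a base passes the threshold.
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return seq[:end], qual[:end]
```

For example, the quality character `I` decodes to 40, `5` to 20 and `!` to 0, so a read ending in `#!` would lose its last two bases with the default threshold.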


                              • #30
                                Originally posted by piet
                                The primary format NCBI has used for ages is the GenBank flat file format. Download the entry in 'GenBank' format and then use a tool like 'seqret' from the EMBOSS package to convert the GenBank flat file into GFF.

                                Or you may use the TogoWS web service to download the entry in GFF format directly:
                                wget http://togows.org/entry/nucleotide/407479587.gff

                                Please note, that GFF is not a strict format but rather a framework to invent your own format. Column 9 of the GFF file comprises several tags. The names of these tags are more or less arbitrary. The tag names assigned by TogoWS may or may not meet the requirements of your sequence viewer. Furthermore, column 1 of the GFF file holds the name of the sequence. The name used in column 1 of the GFF file must be EXACTLY the same as the name used in the corresponding FASTA file. In a FASTA file the name of the sequence is the first word of the description line (all the characters before the first space).

                                Please also note that GI=407479587 is an isolate from the German HUSEC outbreak in 2011, which is sequence type 678 and differs from the uropathogenic ST131 you asked about before.

                                With regard to your questions about read trimming and the inclusion of plasmids, I would recommend that you initially start with just a single chromosomal sequence and without any read trimming. You should be able to map 60 to 80 percent of your reads that way. Your goal for the next few weeks should be to familiarize yourself with all these tools and to establish a basic workflow. Once you have such a workflow, you can try to improve the number of reads mapped by either adding plasmid sequences to your set of reference sequences or by doing some read trimming.
                                --
                                piet
                                Many thanks Piet, appreciate the info for the GFF file.

                                Thanks for letting me know, I was unaware of the difference in sequence type. I will give it a go with just the chromosome first. But how do I add the plasmid sequences too?
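
One possible way to do this, as a sketch: put the chromosome and plasmid records into a single multi-FASTA file and use that one file as the reference. This fits the earlier remark that aligners handle references with multiple sequences. Each record keeps its own header line, so nothing is joined into one long sequence. The function name and file names below are hypothetical.

```python
from pathlib import Path


def combine_fasta(inputs: list, output: str) -> int:
    """Write all records from several FASTA files into one multi-FASTA file.

    Returns the number of records written and refuses duplicate names,
    because every reference sequence needs a unique name.
    """
    seen = set()
    records = 0
    with open(output, "w") as out:
        for path in inputs:
            for line in Path(path).read_text().splitlines():
                if line.startswith(">"):
                    words = line[1:].split()
                    name = words[0] if words else ""
                    if name in seen:
                        raise ValueError("duplicate sequence name: " + name)
                    seen.add(name)
                    records += 1
                # Skip blank lines so that records follow each other directly.
                if line.strip():
                    out.write(line + "\n")
    return records


# Hypothetical usage (file names are placeholders):
# combine_fasta(["chromosome.fasta", "plasmid1.fasta"], "reference_all.fasta")
```

The combined file can then be uploaded to Galaxy as a single custom reference, or indexed on the command line like any other FASTA file.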
