**michaellim** · 12-19-2014, 01:37 PM

Originally posted by piet

Multi-locus sequence typing (MLST) is a method frequently used to characterize bacterial genomes. MLST schemes have been published for most pathogenic bacteria. For the species Escherichia coli (including Shigella) there are even three competing schemes. In the scheme maintained at Cork University, sequence type 11 (ST11) refers to isolates typically found in cattle (serovar O157:H7), while strains belonging to ST131 are uropathogenic, which means they are associated with infections of the urinary tract in humans. The chromosome of E. coli encodes more than 4000 proteins. Maybe half of them belong to the accessory genome, which means they are only found in some strains or clonal groups.

If you want to map your reads from RNA sequencing, I would recommend using a genome from the same or a very closely related sequence type. Otherwise you will miss several genes from the accessory genome. For E. coli ST131 there are several genomes available in GenBank, even fully finished ones (AP009378.1 and plasmid AP009379.1, CP002797.2). Sequences for ST131 isolates KTE173, KTE49, KTE162, KTE6, KTE211, KTE175, KTE178, KTE216, KTE148, KTE139 are available as WGS contigs.

I would recommend trying several reference genomes. A mapping run usually takes only a few minutes on a desktop PC.
--
piet

Hi Piet,

Many thanks for the clarification. I will try different genomes then, if it doesn't take too long. May I know what kind of alignment/mapping software you use? Are there any particular reasons for that choice?

Cheers.

**piet** · 12-19-2014, 02:27 PM

Originally posted by michaellim

May I know what kind of alignment/mapping software you use?

I use 'bwa mem', but my use case is processing of DNA sequencing data. It is very fast and reliable with default settings. Nevertheless, bwa and similar mappers should also be suited to bacterial RNA sequencing, since bacteria do not splice their messenger RNA.

In the beginning it took me quite a while to work out how to write shell scripts to start bwa runs in a comfortable way and to handle the resulting SAM files. You will definitely need to learn some kind of shell or script programming if you want to go that route.
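A minimal sketch of what such a wrapper script might look like, written in Python rather than shell. It only builds and prints the two basic commands (`bwa index` to index the reference, `bwa mem` to map reads, with SAM output redirected to a file); the file names are hypothetical placeholders, and nothing is executed.

```python
import shlex


def bwa_commands(reference, reads, out_sam):
    """Return the shell commands to index a reference and map reads to it."""
    ref = shlex.quote(reference)
    # The index only has to be built once per reference FASTA file.
    index_cmd = f"bwa index {ref}"
    # bwa mem writes SAM to standard output, so it is redirected to a file.
    map_cmd = f"bwa mem {ref} {shlex.quote(reads)} > {shlex.quote(out_sam)}"
    return [index_cmd, map_cmd]


# Hypothetical file names, for illustration only.
for cmd in bwa_commands("ST131_chromosome.fasta", "sample1.fastq", "sample1.sam"):
    print(cmd)
```

Printing the commands first and pasting them into a terminal is a gentle way to start; once they work, the same list could be handed to `subprocess.run(cmd, shell=True, check=True)`.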

Why don't you do a DNA sequencing run of your particular isolate before you go into RNA sequencing?
--
piet

**michaellim** · 12-20-2014, 05:50 AM

Originally posted by piet

I use 'bwa mem', but my use case is processing of DNA sequencing data. It is very fast and reliable with default settings. Nevertheless, bwa and similar mappers should also be suited to bacterial RNA sequencing, since bacteria do not splice their messenger RNA.

In the beginning it took me quite a while to work out how to write shell scripts to start bwa runs in a comfortable way and to handle the resulting SAM files. You will definitely need to learn some kind of shell or script programming if you want to go that route.

Why don't you do a DNA sequencing run of your particular isolate before you go into RNA sequencing?
--
piet

Hi Piet,

I see. I have close to no coding/programming knowledge, so maybe BWA is not suitable for me. But I will check out the website for more info about it.

I did consider DNA sequencing the genome of my sequence type strain, but the lab has limited funds.

Thank you very much.

**michaellim** · 12-20-2014, 05:52 AM

Dear All,

May I also ask: my RNA-seq libraries were about 260 bp in size according to Illumina's preparation protocol. For the FASTQ files I currently have, do I need to remove the adapter (index) sequences before mapping to the reference genome?

Many thanks.

**michaellim** · 12-20-2014, 05:54 AM

Originally posted by GenoMax

That is a likely explanation. If submitters are not completely sure that the contigs go together (there could be multiple plasmids in some bacteria and the separate pieces may be real) they would be left in that state.

Hi Sergio,

May I check with you whether I need to trim the adapter sequence from my RNA-seq FASTQ file? My libraries were about 260 bp each.

Any suggestions on how I should do this? Do I just set the software to trim from base 1 to base X, or do I need to give the individual adapter sequences to the trimmer? I've noticed quite a few trimmers online. There is a built-in one in Galaxy too.

Many thanks.

**GenoMax** · 12-20-2014, 06:01 AM

It is always a good idea to check for and trim adapter sequences, if present. Many aligners will soft-clip them, but if you are planning to do any assembly you want to start with clean reads. BTW, adapters and indexes are not the same thing. With Illumina technology, index sequences are never a part of the main read, so they do not need to be trimmed (unless you are using custom inline indexes).

BBDuk is easy to use (on Windows/Mac/*nix), and so is Trimmomatic. You could do this in Galaxy, but at some point you will need to move to the command line (e.g. if you decide to use Mauve).

**Sergioo** · 12-20-2014, 07:54 AM

Originally posted by michaellim

Hi Sergioo,

Yes, MLST. For example, E. coli ST11 will be different from ST131. However, we aren't certain whether there are any genes specific to ST131 that cannot be found in other E. coli sequence types.

So, if ST11 has a completed genome but ST131 is in contigs, and my current RNA-seq data is from ST131, should I use ST131 (multiple contigs) as the reference, or the completed genome of ST11, which is less closely related? That was my question. Hope that makes it clearer.

Thank you.

By now, you've got many suggestions from more experienced readers. You are lucky because you've just got to sit down and think of which option to use.

I am not familiar with RNA-seq projects, but if it were whole-genome sequencing, I would first go for an assembly (even de novo) using a complete genome (not the one in multiple contigs). The complete genome, even if not closely related, will allow you to order your contigs and resolve misassemblies. Note that you cannot rely on a draft genome sequence, since its biggest drawback is that its contigs are unordered.

Now, once you've got your draft sequences ordered, you are free to compare them with whatever you think is more closely related (for example, sequences from the same ST as your isolate).
Hope it helps.

**michaellim** · 12-20-2014, 11:11 AM

Originally posted by GenoMax

It is always a good idea to check for and trim adapter sequences, if present. Many aligners will soft-clip them, but if you are planning to do any assembly you want to start with clean reads. BTW, adapters and indexes are not the same thing. With Illumina technology, index sequences are never a part of the main read, so they do not need to be trimmed (unless you are using custom inline indexes).

BBDuk is easy to use (on Windows/Mac/*nix), and so is Trimmomatic. You could do this in Galaxy, but at some point you will need to move to the command line (e.g. if you decide to use Mauve).

Hi Genomax,

Sorry, it's my first time doing RNA-seq and dealing with sequencing data. I used a MiSeq for the sequencing (the flow cell run was done by the sequencing lab; I prepared everything up to the denatured libraries). From the Illumina library prep manual, I (mis)understood 'adapters' to be the same as 'index/indices' (unique 6-nucleotide sequences used to label each RNA sample).

Could you please explain how they are different? Will the sequence still be in the sequencing FASTQ file?

By the way, looking at the Per Base Sequence Quality, for all of my samples, the lower end of the yellow box goes below the 20 Quality Score after base-150 (all sequences are 200 bases). Does this mean I need to trim the adapters and also everything after base-150?

I was reading some blogs, and there are arguments about whether or not it is important to trim before mapping. It's rather confusing to me.

Thank you.

**michaellim** · 12-20-2014, 11:15 AM

Originally posted by Sergioo

By now, you've got many suggestions from more experienced readers. You are lucky because you've just got to sit down and think of which option to use.

I am not familiar with RNA-seq projects, but if it were whole-genome sequencing, I would first go for an assembly (even de novo) using a complete genome (not the one in multiple contigs). The complete genome, even if not closely related, will allow you to order your contigs and resolve misassemblies. Note that you cannot rely on a draft genome sequence, since its biggest drawback is that its contigs are unordered.

Now, once you've got your draft sequences ordered, you are free to compare them with whatever you think is more closely related (for example, sequences from the same ST as your isolate).
Hope it helps.

Hi Sergio,

Yes, I'm truly very grateful for all the responses given. I'm slowly understanding more about the software options and their uses, and about the initial mapping analyses. I currently have no idea, as I've not done this before and there is no one in the department who does this kind of work, so I couldn't get any advice internally.

By the way, when you are doing mapping and you have, for example, 1 chromosome sequence and 5 plasmid sequences on NCBI, how do you do the mapping? I was looking at Galaxy, and you can only choose one reference genome for any single mapping task.

Thank you.

**michaellim** · 12-20-2014, 11:29 AM

Originally posted by Brian Bushnell

All aligners are designed to handle references with multiple contigs; you don't need to combine anything (nor should you). You just need to index it.

Well since you ask me, I will recommend BBMap, which also handles RNA-seq data, but is faster and more sensitive than Tophat. But bacteria generally lack introns - when they are present, they are very short and only in a handful of genes. So it's not strictly necessary to use a splice-aware aligner for bacterial RNA-seq, though I would still recommend it.

Hi Brian,

Can I get some further clarification from you too? I was looking at some genomes in NCBI and they are deposited as Chromosome and multiple plasmids.

In this case, when I'm mapping, am I supposed to combine all the sequences (chromosome and plasmid) in NCBI? Or do I index them as you've mentioned? Sorry, I have no prior knowledge at all on DNA sequencing/RNA sequencing.

I was trying to map the RNA seq data in Galaxy, but I can only choose one reference at a time.

Thank you.

**GenoMax** · 12-20-2014, 01:47 PM

Originally posted by michaellim

Could you please explain how they are different? Will the sequence still be in the sequencing FASTQ file?

Watch this short video from Illumina that explains how their sequencing technology works (it addresses adapters/indexes): https://www.youtube.com/watch?v=HMyCqWhwB8E Index read sequence will not be part of the actual read. It will be included in the Fastq read header (http://en.wikipedia.org/wiki/FASTQ_f...ce_identifiers skip to CASAVA 1.8 format headers).
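To make the header layout concrete, here is a small sketch that splits a CASAVA 1.8 style FASTQ header into its fields. The example header is the generic one commonly used to illustrate the format, not data from this thread, and the last field is assumed to hold the index sequence (depending on how the run was processed, it may hold a sample number instead).

```python
def parse_casava18_header(header):
    """Split a CASAVA 1.8 FASTQ header line into named fields."""
    if not header.startswith("@"):
        raise ValueError("a FASTQ header line must start with '@'")
    # The part before the space locates the cluster; the part after it
    # describes the read (read number, filter flag, control bits, index).
    location, _, description = header[1:].strip().partition(" ")
    instrument, run, flowcell, lane, tile, x_pos, y_pos = location.split(":")
    read, is_filtered, control, index = description.split(":")
    return {
        "instrument": instrument,
        "run": int(run),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x_pos),
        "y": int(y_pos),
        "read": int(read),
        "is_filtered": is_filtered == "Y",
        "control": int(control),
        "index": index,
    }


fields = parse_casava18_header("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(fields["index"])
```

The index shows up only in this header line; the sequence line of the record contains the insert (plus any adapter read-through), which is why indexes do not need trimming.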

By the way, looking at the Per Base Sequence Quality, for all of my samples, the lower end of the yellow box goes below the 20 Quality Score after base-150 (all sequences are 200 bases). Does this mean I need to trim the adapters and also everything after base-150?

Thank you.

Post the FastQC plots for your sample(s) if you need specific comments, but in general, if you had inserts that were shorter than your read length, then you are going to have adapters in your sequences. If your data was processed on the MiSeq by MiSeq Reporter, then the adapters may have already been removed (ask the facility if you are not sure).

BBDuk is especially good at documenting statistics about how many reads had adapters or were trimmed. Make sure you use the correct adapter reference files (Nextera, TruSeq, etc.; they are included in the BBMap download, in the reference directory).

**michaellim** · 12-20-2014, 05:26 PM

Originally posted by GenoMax

Watch this short video from Illumina that explains how their sequencing technology works (it addresses adapters/indexes): https://www.youtube.com/watch?v=HMyCqWhwB8E Index read sequence will not be part of the actual read. It will be included in the Fastq read header (http://en.wikipedia.org/wiki/FASTQ_f...ce_identifiers skip to CASAVA 1.8 format headers).

Post the FastQC plots for your sample(s) if you need specific comments, but in general, if you had inserts that were shorter than your read length, then you are going to have adapters in your sequences. If your data was processed on the MiSeq by MiSeq Reporter, then the adapters may have already been removed (ask the facility if you are not sure).

BBDuk is especially good at documenting statistics about how many reads had adapters or were trimmed. Make sure you use the correct adapter reference files (Nextera, TruSeq, etc.; they are included in the BBMap download, in the reference directory).

Hi Genomax,

Thanks for the explanation. I've attached four plots for you to comment on. Honestly, I don't have much of an idea about them. All I was told was that as long as the read is above 20 on the Y-axis, it's good to use. Bases below 20 may have been called wrongly and may need to be trimmed before mapping.
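For orientation, a quality score Q corresponds to an estimated error probability of 10^(-Q/10), so Q20 means roughly one wrong call in 100 bases. The sketch below decodes a quality line (assuming the Sanger/Phred+33 encoding) and trims a read from the 3' end while the quality stays below a threshold. This is a deliberately naive illustration, not the algorithm used by BBDuk or Trimmomatic, and the read shown is made up.

```python
def phred33_scores(quality_line):
    """Decode a Sanger (Phred+33) quality line into integer scores."""
    return [ord(ch) - 33 for ch in quality_line]


def error_probability(q):
    """Estimated probability that a base with quality q was called wrongly."""
    return 10 ** (-q / 10)


def trim_low_quality_tail(seq, qual, threshold=20):
    """Drop bases from the 3' end while their quality is below threshold."""
    scores = phred33_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return seq[:end], qual[:end]


# Made-up read: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
print(trim_low_quality_tail("ACGTACGT", "IIIII5##"))
print(error_probability(20))
```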

By the way, I did a trial mapping of one of the groomed RNA-seq files against a reference chromosome sequence, but when I try to view the BAM file in the Integrative Genomics Viewer (IGV), the reference chromosome is not in the drop-down list. When I loaded my own FASTA file downloaded from NCBI, there was no gene annotation in it. Do you know how I should load the annotation? I tried reading the IGV website; it says to download the GFF file from NCBI, but on the entry page (http://www.ncbi.nlm.nih.gov/nuccore/407479587) I don't see any place to download the GFF file from the "Display Settings" (top left of the screen).

Could you kindly advise?

Thank you.

Attached Files

FASTQC plots.png (127.9 KB, 29 views)

**piet** · 12-21-2014, 05:44 AM

Originally posted by michaellim

download the GFF file from NCBI, but on the entry page (http://www.ncbi.nlm.nih.gov/nuccore/407479587) I don't see any place to download the GFF file from the "Display Settings" (top left of the screen).

The primary format NCBI has used for ages is the GenBank flat file format. Download the entry in 'GenBank' format and then use a tool like 'seqret' from the EMBOSS package to convert the GenBank flat file into GFF.

Or you may use the TogoWS web service to download the entry in GFF format directly:
wget http://togows.org/entry/nucleotide/407479587.gff

Please note that GFF is not a strict format but rather a framework for inventing your own format. Column 9 of the GFF file comprises several tags. The names of these tags are more or less arbitrary. The tag names assigned by TogoWS may or may not meet the requirements of your sequence viewer. Furthermore, column 1 of the GFF file holds the name of the sequence. The name used in column 1 of the GFF file must be EXACTLY the same as the name used in the corresponding FASTA file. In a FASTA file, the name of the sequence is the first word of the description line (all the characters before the first space).
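The name-matching rule can be checked mechanically before loading anything into a viewer. This is a minimal sketch under the assumptions stated above (FASTA name = first word after '>', GFF name = column 1, tab-separated); the sequence names in the example are invented.

```python
def fasta_names(fasta_lines):
    """Collect sequence names: the first word of each '>' description line."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                names.add(words[0])
    return names


def gff_seqids(gff_lines):
    """Collect the sequence names used in column 1 of a GFF file."""
    seqids = set()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded FASTA section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        seqids.add(line.split("\t")[0])
    return seqids


def unmatched_seqids(fasta_lines, gff_lines):
    """Return GFF sequence names that have no identical name in the FASTA."""
    return gff_seqids(gff_lines) - fasta_names(fasta_lines)


# Invented example: the FASTA name carries a prefix the GFF does not have.
fasta = [">gi|0000|example_chr hypothetical E. coli chromosome", "ACGTACGT"]
gff = ["##gff-version 3", "example_chr\tsource\tgene\t1\t8\t.\t+\t.\tID=gene1"]
print(unmatched_seqids(fasta, gff))
```

An empty result means every annotated sequence name has an exact counterpart in the FASTA file; anything printed is a name the viewer will probably fail to place.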

Please also note that GI=407479587 is an isolate from the German HUSEC outbreak in 2011, which is sequence type 678 and differs from the uropathogenic ST131 you asked about before.

With regard to your questions about read trimming and about inclusion of plasmids, I would recommend that you initially start with just a single chromosomal sequence and without any read trimming. You should be able to map 60 to 80 percent of your reads that way. Your goal for the next weeks should be to familiarize yourself with all these tools and to establish a basic workflow. Once you have such a workflow, you can try to improve the number of reads mapped by either adding plasmid sequences to your set of reference sequences or by doing some read trimming.
--
piet

**GenoMax** · 12-21-2014, 08:26 AM

@michaellim: What is your ultimate aim with this RNAseq study? Are you looking to do differential expression or just checking to see what is expressed under some specific condition(s)?

For the immediate issue of not being able to see annotations, you can use the GFF file from Piet's example and see if that works with IGV. You should compare the pre- and post-trimming FastQC plots to see if there is an improvement in stats. The plots you have posted don't look bad, but it is difficult to say if you have adapter contamination unless you try the trimming. No FASTQ grooming in Galaxy should be necessary with MiSeq data. It is already in Sanger FASTQ format.

Once you move away from "model" organisms, tools such as Galaxy start becoming limiting (as you have already discovered). Depending on your overall goals, it may be beneficial to start learning how to do these analyses on the command line. If this is a small part of whatever you are trying to do, then enlisting the help of a friend or local bioinformatics support folks may be the easiest thing to do, so you can get a set of hypotheses to test at the bench and move on.

**michaellim** · 12-22-2014, 07:13 AM

Originally posted by piet

The primary format NCBI has used for ages is the GenBank flat file format. Download the entry in 'GenBank' format and then use a tool like 'seqret' from the EMBOSS package to convert the GenBank flat file into GFF.

Or you may use the TogoWS web service to download the entry in GFF format directly:
wget http://togows.org/entry/nucleotide/407479587.gff

Please note that GFF is not a strict format but rather a framework for inventing your own format. Column 9 of the GFF file comprises several tags. The names of these tags are more or less arbitrary. The tag names assigned by TogoWS may or may not meet the requirements of your sequence viewer. Furthermore, column 1 of the GFF file holds the name of the sequence. The name used in column 1 of the GFF file must be EXACTLY the same as the name used in the corresponding FASTA file. In a FASTA file, the name of the sequence is the first word of the description line (all the characters before the first space).

Please also note that GI=407479587 is an isolate from the German HUSEC outbreak in 2011, which is sequence type 678 and differs from the uropathogenic ST131 you asked about before.

With regard to your questions about read trimming and about inclusion of plasmids, I would recommend that you initially start with just a single chromosomal sequence and without any read trimming. You should be able to map 60 to 80 percent of your reads that way. Your goal for the next weeks should be to familiarize yourself with all these tools and to establish a basic workflow. Once you have such a workflow, you can try to improve the number of reads mapped by either adding plasmid sequences to your set of reference sequences or by doing some read trimming.
--
piet

Many thanks Piet, appreciate the info for the GFF file.

Thanks for letting me know, I was unaware of the difference in sequence type. I will give it a go with just the chromosome first. But how do I add the plasmid sequences too?
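One common approach, consistent with the earlier advice that aligners handle references with multiple sequences, is to put the chromosome and plasmid records into a single multi-record FASTA file (each sequence keeps its own '>' header; the sequences themselves are not joined) and index that one file. A minimal sketch, with hypothetical file names in the usage comment:

```python
def combine_fasta(fasta_texts):
    """Join several FASTA texts into one multi-record FASTA text.

    Each record keeps its own header line; duplicate sequence names are
    rejected because mappers need them to be unique.
    """
    names = []
    out_lines = []
    for text in fasta_texts:
        for line in text.splitlines():
            if not line.strip():
                continue  # drop blank lines between records
            if line.startswith(">"):
                words = line[1:].split()
                name = words[0] if words else ""
                if name in names:
                    raise ValueError(f"duplicate sequence name: {name!r}")
                names.append(name)
            out_lines.append(line)
    return "\n".join(out_lines) + "\n", names


# Hypothetical usage with files on disk:
#   texts = [open(p).read() for p in ("chromosome.fasta", "plasmid1.fasta")]
#   combined, names = combine_fasta(texts)
#   open("reference_all.fasta", "w").write(combined)
combined, names = combine_fasta([">chrom example\nACGT\nACGT\n", ">plasmid1 example\nGGCC"])
print(names)
```

The combined file can then be uploaded to Galaxy as a single reference, or indexed on the command line, and reads will be assigned to whichever sequence they match best.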
