SEQanswers

Old 06-25-2014, 08:13 AM   #1
uloeber
Member
 
Location: Germany

Join Date: Mar 2013
Posts: 31
Question Illumina paired end assembly; in silico normalization

Dear all,
I want to assemble a fungal genome of approx. 35 Mbp. My data are paired-end reads of 101 bp in length with two different insert sizes (300 bp and 600 bp).
I decided to use Velvet for the assembly and searched for the best k-mer with VelvetOptimiser, using a small subset of my NGS data and taking N50 and longest contig as the criteria for the best assembly.
Of course I have already read that assembly statistics stagnate once a specific level of coverage is reached, since one cannot assemble more than the "real" genome in the end. BUT: what I observed is that the statistics drop when the coverage is increased (using the full data set).
Inspecting the results of the full-set and subset assemblies shows that long contigs from the subset assembly (median coverage 25x) are split into smaller contigs in the full-set assembly (median coverage 250x).
The genome was "over-sequenced" to begin with, so the calculated expected coverage is about 1000x.
Do you have any suggestions why that happens? And do you think in silico normalization might get rid of it? Using DigiNorm, for example, I would lose the pairing information and even the quality information, if I'm right, because the output is always FASTA?
I would be very happy if we could start a discussion about this. Has anyone of you observed this phenomenon before?
Old 06-25-2014, 09:54 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

uloeber,

You can either subsample or normalize; sometimes one gives a better assembly than the other. BBNorm will not lose pairing or quality information.

That package also contains Reformat, a tool that can do subsampling:

reformat.sh in=reads.fq.gz out=sampled.fq.gz samplerate=0.04

...will subsample to 4%.

With BBNorm, to normalize to 35x coverage:

bbnorm.sh in=reads.fq.gz out=normalized.fq.gz target=35 -Xmx29g

You may or may not need to set the -Xmx flag, depending on your environment. If you do, then set it to about 85% of the machine's physical memory.
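A back-of-the-envelope way to apply that 85% rule (the 32 GB figure below is purely illustrative, not from this thread):

```shell
# Illustrative only: derive an -Xmx value as ~85% of physical RAM,
# assuming a hypothetical machine with 32 GB (expressed in kB):
mem_kb=$(( 32 * 1024 * 1024 ))
xmx_g=$(( mem_kb * 85 / 100 / 1024 / 1024 ))
echo "-Xmx${xmx_g}g"   # prints -Xmx27g (integer truncation of 27.2)
```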

In my experience, normalization normally yields a better assembly than subsampling, particularly if you have variable coverage. The disadvantages of normalization compared to subsampling are that it is slower, uses more memory, and destroys information about repeat content, but I consider these disadvantages unimportant if the resultant assembly is better.

Also, I find that Velvet, with default parameters, yields the best assembly at around 35-40x coverage.
Old 06-26-2014, 12:39 AM   #3
uloeber
Member
 
Location: Germany

Join Date: Mar 2013
Posts: 31
Default

Thank you for the reply! I'll try again with BBNorm.
Old 07-05-2014, 11:35 AM   #4
sdmoore
Member
 
Location: Florida

Join Date: Jun 2014
Posts: 10
Default BBnorm target depth setting

Hi Thread/Brian,

I ran BBnorm on a pre-BBduk'd pair of reads. I plan to use the normalized reads for assemblers (Velvet or A5, well, A4 in this case).

./bbnorm.sh in1=R1_bbduk20.fq in2=R2_bbduk20.fq out1=R1_bbduk20_norm35.fq out2=R2_bbduck20_norm35.fq target=35

The opening dialog states that the target depth is 140 on pass 1, then 35 on pass 2.

Why the two passes?

What are the "error reads/pairs/types"?

Do these normalizations typically help a BBmap (or other) run as well?

Thanks.
Old 07-06-2014, 09:20 PM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

sdmoore,

There are two passes because, if no precautions are taken, normalization can selectively enrich the dataset for reads with errors, since reads with errors appear to have rare kmers. To combat this, the first pass normalizes down to a higher-than-requested level; specifically, reads that appear error-free are normalized to the high level (140 in this case), and reads that appear to contain errors are normalized to the low level (35 in this case).

So, after the first pass all reads will still have minimum depth 35 (so nothing was lost), but the new dataset will be selectively enriched for error-free reads. The second pass normalizes the remaining reads to the final target regardless of whether errors are suspected.

It's not really possible to do this in a single pass because if (for example) half your reads contain errors, and error-free reads are randomly discarded at the target rate but error-containing reads are discarded at a higher rate, you will ultimately achieve only half of the desired final coverage.
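To see the arithmetic with toy numbers (mine, not from the thread):

```shell
# Toy arithmetic only, not BBNorm's actual algorithm: 70x input depth,
# 35x target, half the reads error-free. A single pass that keeps
# error-free reads at rate target/depth, while discarding nearly all
# error reads, undershoots the target by about half:
depth=70; target=35
errfree_depth=$(( depth / 2 ))                  # 35x of the input is error-free
kept_cov=$(( errfree_depth * target / depth ))  # coverage surviving the pass
echo "${kept_cov}x"                             # prints 17x, about half of 35x
```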

You can set "passes=1" when running normalization, and look at the kmer frequency histogram afterward with "histout=x.txt". The histogram from a run with 2-pass normalization will have far fewer error kmers, which is obvious from a quick visual examination of the graph.
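A comparison run might look like this (a sketch using only the flags described above; the file names are placeholders):

```shell
# Sketch: single-pass vs. default two-pass normalization, writing the
# post-normalization kmer histogram of each for comparison.
bbnorm.sh in=reads.fq out=norm_1pass.fq target=35 passes=1 histout=khist_1pass.txt
bbnorm.sh in=reads.fq out=norm_2pass.fq target=35 histout=khist_2pass.txt
# khist_2pass.txt should show far fewer low-count (error) kmers.
```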

Normalization is not useful for mapping in most circumstances. The only advantage it would convey is a decrease in runtime by reducing the size of the input, which is useful if you are using a huge reference (like nt) or slow algorithm (like BLAST), but I don't use it for that. Error-correction (done by the same program as normalization, with ecc.sh instead of bbnorm.sh) may be useful before mapping, though - it will increase mapping rates, though always with a possibility that the ultimate output may be slightly altered. So, I would not error-correct data before mapping, either, except when using a mapping algorithm that is very intolerant of errors.
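If you do want error correction alone before mapping, the invocation is a sketch along these lines (same in=/out= style as the other commands in this thread; file names are placeholders):

```shell
# Sketch: error-correct reads without normalizing, via the ecc.sh
# wrapper mentioned above, before handing them to a mapper.
ecc.sh in=reads.fq out=corrected.fq
```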

I designed BBNorm as a preprocessing step for assembly. Normalizing and/or error-correcting data with highly uneven coverage - single cell amplified data and metagenomes (and possibly transcriptomes, though I have not tried it) - can yield a much better assembly, particularly with assemblers that are designed for isolates with a fairly flat distribution, but even on assemblers designed for single-cell or metagenomic assembly. Also, normalization can allow assembly of datasets that would otherwise run out of memory or take too long. Depending on the circumstance, it often yields better assemblies with isolates too, but not always.

I'm sure there are other places where BBNorm can be useful, but I don't recommend it as a general-purpose preprocessing step prior to any analysis - just for cases where you can achieve better results by reducing the data volume, flattening the coverage, or reducing the error rate.

Oh... and as for the error reads/pairs/types, that shows a summary of the input data quality, and fractions of reads or pairs that appear to contain errors. The "types" indicate which heuristic classified the read as appearing to contain an error; that's really there for testing and I may remove it.
Old 07-07-2014, 06:39 AM   #6
sdmoore
Member
 
Location: Florida

Join Date: Jun 2014
Posts: 10
Default BBduk goose.

Thanks Brian,

That's much clearer and makes sense.

I noticed that FastQC complained more about kmers after the normalization, and I can see why that could happen.
Old 07-10-2014, 03:37 AM   #7
uloeber
Member
 
Location: Germany

Join Date: Mar 2013
Posts: 31
Default

Hi Brian,
I used the following command for normalization of my oversequenced genome:
Code:
 ./bbnorm.sh in=species_L004_paired.fastq out=normalized_L004_80x.fastq target=80
Maybe this is the same question as mentioned before, but I would expect that errors, represented by low-frequency kmers, carry more weight after normalization. Is that the case?
By the way, even after the first run of your tool my assembly statistics improved. You did a great job! Thank you!
Also, did I miss giving the pairing argument?
Old 07-10-2014, 09:20 AM   #8
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

uloeber,

1) BBNorm was designed to prioritize discarding of reads that appear to contain errors, so typically, there are fewer low-frequency kmers after normalization. When you normalize, you can set "hist=khist_input.txt" and "histout=khist_output.txt". The first file will get the frequency histogram of kmers before normalization, and the second one after normalization, so you can see how the process changed the distribution.

2) I'm glad it was helpful!

3) If the input is interleaved, the program will autodetect that, as long as the reads follow the standard Illumina naming patterns, so it should be fine. I will update it to print a message indicating whether it is processing the data as paired or not. You can force a file to be interpreted as interleaved with the "interleaved=t" flag.
Old 07-12-2014, 07:55 AM   #9
uloeber
Member
 
Location: Germany

Join Date: Mar 2013
Posts: 31
Default

Hi Brian,
I just do not know if it was successful with respect to the pairing... I got the following exception using DigiNorm:
Code:
** ERROR: Error: Improperly interleaved pairs
That's why I was wondering if maybe there are some problems in the interleaved mode. Would your software give a hint if it is not possible to run in interleaved mode?
Can you also tell me what the best practice is for running two or more iterative steps with BBNorm?
Old 07-12-2014, 04:07 PM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

uloeber,

You can run Reformat to verify that interleaving is valid, like this:

reformat.sh in=reads.fq out=null vint

It will print a message indicating whether or not the interleaving is correct, based on the read names. So you can run it on the file from before normalization, and the file after, to check. Another program can fix files with broken interleaving:

bbsplitpairs.sh in=reads.fq out=fixed.fq fint

This is very fast and requires very little memory, but will only work on files that were interleaved, then some of the reads were thrown away. If the reads are arbitrarily disordered, you can run this:

repair.sh in=reads.fq out=fixed.fq repair

This requires a lot of memory, though, as it may potentially store up to half of the reads in memory.

If BBNorm broke pairing, it is much better to redo normalization with the "int=t" flag, forcing reads to be treated as pairs, than to try to fix the corrupted output. With "int=t" it will completely ignore the read names and will not give any error if they don't match, since the names could have been changed or could be in an unrecognized format. If you run with "int=f", the input will similarly be forced to single-ended mode. And if you run without the flag, it will autodetect, but unfortunately it doesn't currently tell you whether the data was processed as single-ended or not; you can run this program:

testformat.sh in=reads.fq

...which will print whether or not a file is detected as interleaved (and various other things). If testformat.sh says a file contains single-ended ASCII-33 fastq reads, then all of my programs will treat that file as such (with the exception of repair.sh and bbsplitpairs.sh with the 'fint' flag, because those are designed for corrupted files).

I'm not really sure what you mean by running iterative steps with BBNorm - it runs two passes by default, transparently, as that yields the best assemblies in my testing. You can run multiple iterations yourself if you want (using the output as input), but there's no reason to do that unless the coverage comes out above your target. I have not found that multiple iterations of error-correction are particularly useful, either; one seems to be fine. So if you use this command line:

bbnorm.sh in=reads.fq out=normalized.fq int=t ecc=t hist=khist.txt histout=khist2.txt ow

...then it will do three passes:

1a) Kmer counting of input kmers for frequency histogram
1b) Error correction
1c) Normalization biased against reads with errors

2) Unbiased normalization

3) Kmer counting for output frequency histogram

...which should give you optimal results. The "ow" flag is optional and tells it to overwrite output files if they already exist; the "ecc" flag is optional and usually makes things better, but not always; and of course the "int" flag is optional but I always include it if I know whether or not the reads are paired, since autodetection is reliant on reads having Illumina standard names.

I hope that helps!