
**Torst** · 09-25-2012, 06:39 PM

khmer does not currently handle paired reads; I believe the developers are working on that.

You can still assemble your reads with one of the following strategies:

1. Diginorm the pairs, but assemble as SE reads
2. Diginorm only Read1 from each pair, and assemble as SE reads
3. Don't diginorm, just assemble as PE

**kmkocot** · 12-04-2012, 05:43 PM

Hi dnusol,

If I understand your question correctly, it is possible to digitally normalize paired-end transcriptome data and then assemble it with an assembler that takes advantage of the paired-end information (e.g., Trinity), provided your data are formatted correctly. If you're working with data from Casava 1.8+ (I think all HiSeq and MiSeq data), you will have to modify your headers to end in /1 or /2, replicating the older way of encoding which member of a pair a read is (see http://en.wikipedia.org/wiki/FASTQ_format).

These are the sed commands I use on my MiSeq data to fix this:
sed -i '/^@M00/ s/\ .\+/\/1/g' *_R1.fastq
sed -i '/^@M00/ s/\ .\+/\/2/g' *_R2.fastq
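
The same header rewrite can be sketched in Python for a single header line. This is only an illustration, not part of khmer or Trinity: the function name casava_to_legacy and the example header are made up, and it assumes a Casava 1.8+ header in which the read ID and the comment field are separated by a space.

```python
def casava_to_legacy(header: str, mate: int) -> str:
    """Replace everything from the first space onward with /1 or /2.

    Mirrors the effect of the sed substitution s/ .+/\/1/ on one header line.
    """
    if not header.startswith("@"):
        raise ValueError("not a FASTQ header line: %r" % header)
    if mate not in (1, 2):
        raise ValueError("mate must be 1 or 2")
    # Keep only the read ID (the text before the first space).
    read_id = header.rstrip("\n").split(" ", 1)[0]
    return f"{read_id}/{mate}"


# Hypothetical MiSeq-style header, for illustration only.
example = "@M00123:45:000000000-A1B2C:1:1101:15589:1332 1:N:0:1"
print(casava_to_legacy(example, 1))
```

Working on one header at a time leaves it to the caller to decide which lines are headers, for example by record position rather than by pattern.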

Then you will need to run commands along the lines of the following:
shuffleSequences_fastq.pl example_R1.fastq example_R2.fastq example_shuffled.fastq
normalize-by-median.py -C 30 -k 20 -N 4 -x 2.5e9 example_shuffled.fastq
strip-and-split-for-assembly.py example_shuffled.fastq.keep
split_pe.py example_shuffled.fastq.keep.pe
mv example_shuffled.fastq.keep.pe.1 example_norm_R1.fasta
mv example_shuffled.fastq.keep.pe.2 example_norm_R2.fasta
Trinity.pl --seqType fa --left example_norm_R1.fasta --right example_norm_R2.fasta --JM 20G --inchworm_cpu 8 --CPU 8 --output assembly
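
For readers unfamiliar with the first step, the interleaving ("shuffling") of R1 and R2 can be sketched in Python as follows. This is not the shuffleSequences_fastq.pl code, just a minimal equivalent under the assumption that both files are four-line FASTQ with reads in the same order; the function names are hypothetical.

```python
from itertools import islice, zip_longest


def read_records(handle):
    """Yield FASTQ records as lists of four lines (assumes no wrapped lines)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(r1_lines, r2_lines):
    """Yield lines alternating one record from R1 with its mate from R2."""
    r1 = read_records(iter(r1_lines))
    r2 = read_records(iter(r2_lines))
    for rec1, rec2 in zip_longest(r1, r2):
        if rec1 is None or rec2 is None:
            raise ValueError("R1 and R2 contain different numbers of records")
        yield from rec1
        yield from rec2
```

With real files, the two inputs would be open file handles and the result could be written out with writelines() on an output handle.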

Please let me know if this does/n't work for you.

See this page for more info:

https://wiki.hpcc.msu.edu/display/Bioinfo/Basic+Digital+Normalization

Thanks,
Kevin

**aharkess** · 12-04-2012, 06:13 PM

Trinity's version of in silico read normalization handles paired end normalization - give it a look, for sure.

**Torst** · 12-04-2012, 08:42 PM

khmer vs trinity

Originally posted by aharkess

Trinity's version of in silico read normalization handles paired end normalization - give it a look, for sure.

Titus Brown has written a blog post comparing KHMER to Trinity's digital normalization:

What does Trinity's In Silico normalization do?

http://ivory.idyll.org/blog/trinity-in-silico-normalize.html

This post can be referenced and cited at the following DOI: http://dx.doi.org/10.6084/m9.figshare.98198. For a few months, the Trinity list was...

**danwiththeplan** · 01-15-2014, 07:25 PM

FASTQ files can have @ in the quality scores

FYI, correct me if I'm wrong, but it's possible that FASTQ files have a @ symbol in the quality scores as well as at the start of the header line.

FASTQ format - Wikipedia

http://en.wikipedia.org/wiki/FASTQ_format

Which is a dumb thing about the FASTQ format, but whatever, we're stuck with it.

So, any approach that uses sed to search for the @ at the start of the header line is potentially unsafe.

Originally posted by kmkocot

These are the sed commands I use on my MiSeq data to fix this:
sed -i '/^@M00/ s/\ .\+/\/1/g' *_R1.fastq
sed -i '/^@M00/ s/\ .\+/\/2/g' *_R2.fastq

So, this might cause problems in the event that you have the text @M00 in the quality scores, which is possible given the squillions of reads we deal with these days.

My totally non-guaranteed approach:

sed '1~4 s/$/ \/1/g' your_fastq_file.fastq > your_new_fastq_file.fastq (for left reads), or
sed '1~4 s/$/ \/2/g' your_fastq_file.fastq > your_new_fastq_file.fastq (for right reads).

This simply adds ' /1' (i.e. a space, a slash and a 1) to the end of every 4th line, starting with the first line. If your file is in four-line FASTQ format this should work (works for me anyway). Note that the 1~4 address is a GNU sed extension. You can use the sed -i option to edit the file in place rather than redirecting to a new file if you want.
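
The position-based idea can also be sketched in Python. This is an illustration only, assuming strict four-line FASTQ records; the function name tag_headers and the toy record are hypothetical. Because it selects lines by position, a quality line that happens to start with @ is left alone.

```python
def tag_headers(lines, mate):
    """Append ' /1' or ' /2' to every 4th line, starting with the first."""
    for index, line in enumerate(lines):
        if index % 4 == 0:
            # Header line of a four-line record.
            yield line.rstrip("\n") + f" /{mate}\n"
        else:
            # Sequence, separator and quality lines pass through unchanged.
            yield line


# Toy record whose quality string begins with '@'.
record = ["@read1 1:N:0:1\n", "ACGTA\n", "+\n", "@IIII\n"]
for line in tag_headers(record, 1):
    print(line, end="")
```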

**arthurmelo** · 04-09-2014, 03:59 AM

Hi everybody,
To change the focus of the discussion: what is the actual meaning of the -C (cutoff) option in the normalize-by-median.py script? The digital normalization authors suggested -C 20, but I am not sure whether only reads with coverage below 20 are trimmed.

Thanks,
Arthur

**rkizen** · 04-09-2014, 04:23 AM

I am not sure I understand your question correctly, but -C (cutoff) sets the threshold for the median k-mer coverage of a read.

Diginorm goes through the file one read at a time. For each read it looks up how many times each of the read's k-mers has been seen so far; that count is the k-mer coverage. If the read's median k-mer coverage has reached the cutoff, the entire read is discarded (which is not the same thing as trimming). Otherwise the read is kept and its k-mers are added to the counts.

In case there is some confusion about what median k-mer coverage is, it works like this:
If your reads are 100 bp and your k-mer size is 20 (a 20-mer), then each read contains 81 20-mers (100 - 20 + 1).
Look up how many times each 20-mer has occurred and rank the counts from most to least.
Take the median count from this ranked list.
If this number has reached the cutoff, the read is discarded. If it is below the cutoff, the read is kept.

This is a way to reduce the number of redundant reads while accommodating some sequencing error.
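
The procedure can be sketched in a few lines of Python. This is not khmer's implementation: khmer uses a memory-efficient probabilistic counting structure, while this sketch assumes an exact in-memory Counter and ignores reverse complements. To my understanding, counts are updated only from reads that are kept, which is what the sketch does. The function names and the toy data are hypothetical.

```python
import random
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq (len(seq) - k + 1 of them)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diginorm_sketch(reads, k=20, cutoff=20):
    """Keep a read only if its median k-mer count is still below the cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        med = median(counts[km] for km in read_kmers)
        if med < cutoff:
            counts.update(read_kmers)  # only kept reads add to the counts
            kept.append(read)
    return kept


# Toy data: 50 copies of one random 100 bp read, normalized to a cutoff of 5.
random.seed(1)
template = "".join(random.choice("ACGT") for _ in range(100))
kept = diginorm_sketch([template] * 50, k=20, cutoff=5)
print(len(kmers(template, 20)), len(kept))
```

Because the median is used rather than the mean, a few k-mers with unusual counts (for example, ones containing a sequencing error) do not change the decision for the whole read.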

If you need more details, this article explains what it is doing exactly:

http://arxiv.org/pdf/1203.4802v2.pdf

**arthurmelo** · 04-09-2014, 04:52 AM

Thank you so much rkizen.

**kmcarr** · 04-09-2014, 06:27 AM

Originally posted by danwiththeplan

FYI, correct me if I'm wrong, but it's possible that FASTQ files have a @ symbol in the quality scores as well as at the start of the header line.

FASTQ format - Wikipedia

http://en.wikipedia.org/wiki/FASTQ_format

Which is a dumb thing about the FASTQ format, but whatever, we're stuck with it.

So, any approach that uses sed to search for the @ at the start of the header line is potentially unsafe.

Originally Posted by kmkocot
These are the sed commands I use on my MiSeq data to fix this:
sed -i '/^@M00/ s/\ .\+/\/1/g' *_R1.fastq
sed -i '/^@M00/ s/\ .\+/\/2/g' *_R2.fastq

So, this might cause problems in the event that you have the text @M00 in the quality scores, which is possible given the squillions of reads we deal with these days.

I know this post is a little old but I just wanted to point out that using /^@M00/ is a safe way to identify header lines (at least for this particular run). The M00 (that is M<zero><zero>) represents the beginning of the machine name that produced these reads. None of the standard FASTQ quality formats will have 'M' and '0' (zero) together. Phred+33 may contain '0' but not 'M'; Phred+64 may contain 'M' but not '0'.

Our Illumina sequencers are configured to write instrument names in the FastQ files as "HWI-xxxxxxx". I always use /^@HWI/ to grep header lines in these FastQ files. This is safe for all newer (Phred+33 output) Illumina data as 'W' will never appear in a quality line.
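
The character-set argument can be checked with a short Python sketch. It assumes quality scores in the range Q0 to Q41 for both encodings, which is a typical Illumina range and not a limit of the FASTQ format itself; the function name is hypothetical.

```python
def quality_chars(offset, min_q=0, max_q=41):
    """Characters used to encode quality scores min_q..max_q at a given offset."""
    return {chr(q + offset) for q in range(min_q, max_q + 1)}


phred33 = quality_chars(33)  # assumed range Q0-Q41
phred64 = quality_chars(64)  # assumed range Q0-Q41

# Report, for each character of interest, whether it can occur in a quality line.
for ch in "@M0W":
    print(ch, "Phred+33:", ch in phred33, "Phred+64:", ch in phred64)
```

Under these assumptions, no single encoding contains both M and 0, and W falls outside the Phred+33 set.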

**roliwilhelm** · 04-30-2014, 02:27 PM

Just to update this thread:

NOW, Khmer DOES preserve paired ends. However, the user must recover the paired reads after digital normalization since some of the pairs will be orphaned during the process. This page gives a tutorial and, while it is based on an older version of khmer, I used it this week with success:

2. Running digital normalization — khmer-protocols 0.8.3 documentation

http://khmer-protocols.readthedocs.org/en/v0.8.3/metagenomics/2-diginorm.html
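
To illustrate what recovering the pairs means, here is a minimal Python sketch that separates intact pairs from orphans. It is not khmer's own code (the linked tutorial uses khmer's scripts for this step). It assumes the normalized reads are still in their original interleaved order and that names end in /1 or /2; the function name and the toy reads are hypothetical.

```python
def split_pairs(records):
    """Split (name, sequence) records into intact pairs and orphans.

    A pair is a /1 read immediately followed by the /2 read with the same ID.
    """
    pairs, orphans = [], []
    i = 0
    while i < len(records):
        current = records[i]
        nxt = records[i + 1] if i + 1 < len(records) else None
        if (nxt is not None
                and current[0].endswith("/1")
                and nxt[0].endswith("/2")
                and current[0][:-2] == nxt[0][:-2]):
            pairs.append((current, nxt))
            i += 2
        else:
            orphans.append(current)  # mate was removed during normalization
            i += 1
    return pairs, orphans


# Toy output of a normalization run in which r2/1 and r3/2 were discarded.
reads = [("r1/1", "ACGT"), ("r1/2", "TTGA"), ("r2/2", "GGCA"),
         ("r3/1", "CCAT"), ("r4/1", "AGAG"), ("r4/2", "TCTC")]
pairs, orphans = split_pairs(reads)
print(len(pairs), [name for name, _ in orphans])
```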

**Brian Bushnell** · 04-30-2014, 02:35 PM

FYI, BBNorm never orphans pairs during normalization, and is substantially faster and easier to use than Khmer, which I found rather confusing.

The command to normalize to 50x coverage would be like this:

bbnorm.sh -Xmx29g in=read#.fq out=norm#.fq target=50

...and you can add the flag "ecc=t" if you want error-correction as well. The -Xmx flag should be set to about 85% of the system's memory, or you can leave it off and see how well the shellscript autodetects memory. The command assumes that both files have the same name except for where the "#" symbol is, where one has a 1 and the other has a 2.
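
As a small illustration of the two conventions described above (the # placeholder and the roughly 85% memory setting), here is a Python sketch. The helper names are hypothetical and are not part of BBNorm; the sketch only builds strings and does not run anything.

```python
def expand_pair_pattern(pattern):
    """Expand a name containing one '#' into the two paired file names."""
    if pattern.count("#") != 1:
        raise ValueError("pattern must contain exactly one '#'")
    return pattern.replace("#", "1"), pattern.replace("#", "2")


def xmx_flag(total_memory_gb, fraction=0.85):
    """Build an -Xmx value from a fraction of total memory, rounded down to whole GB."""
    return f"-Xmx{int(total_memory_gb * fraction)}g"


print(expand_pair_pattern("read#.fq"))
print(xmx_flag(32))
```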

**yueluo** · 05-01-2014, 12:44 AM

Trinity’s In silico Read Normalization can handle paired reads.


DigiNorm on Paired-end samples
