08-27-2014, 11:37 AM
Brian Bushnell
Introducing Reformat, a fast read format converter

Reformat is a member of the BBMap/BBTools package. It is a multipurpose tool designed for converting reads or other nucleotide data between different formats. It supports, and can inter-convert:

scarf (an old Illumina format)
bam (if samtools is installed)
ascii-33 (sanger)
ascii-64 (old Illumina)
paired files
interleaved files

It is multithreaded and can process data at over 500 megabytes per second. It can also accept streams from standard in and write to standard out, so it can easily be dropped into the middle of a pipeline for format conversion. Reformat autodetects formats based on file extensions and content, making it very easy to use. The autodetection can be overridden, which gives flexibility for people who don't like to follow naming conventions, or for out-of-spec fastq files with quality values like -17 or 120.

The program has been gradually expanded, and can now perform various other functions. None of these will break pairing, if the input is paired.

Quality trimming (either or both ends)
Quality filtering
Fixed-length trimming
Generation of histograms (base composition, quality, etc)
Subsampling (to a fraction of input reads, or an exact number of reads or bases)
Changing fasta line-wrapping length
Reverse-complementing (all reads or only read 2)
Adding /1 and /2 suffix to read names
GC-content filtering
Testing for corrupted interleaved files

Reformat is compatible with any platform that supports Java 1.7 or higher. It also has a bash shellscript for simpler invocation. Typical usage examples:

Reformat fastq into fasta: in=x.fq out=y.fa

Interleave paired reads: in1=x1.fq in2=x2.fq out=y.fq

Note - you can actually use a shortcut if paired read files have the same name with a 1 and a 2. This is equivalent to the above command: in=x#.fq out=y.fq

De-interleave reads: in=x.fq out1=y1.fq out2=y2.fq
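
To make it clearer what interleaving and de-interleaving actually do to the read order, here is a minimal Python sketch. It is not Reformat's code; the `Record` tuple and both function names are made up for illustration, and it assumes the reads are already parsed into 4-field fastq records.

```python
from typing import List, Sequence, Tuple

# One fastq record as (header, bases, separator, qualities).
# This layout is an assumption for the sketch, not Reformat's internal type.
Record = Tuple[str, str, str, str]


def interleave(reads1: Sequence[Record], reads2: Sequence[Record]) -> List[Record]:
    """Alternate read 1 and read 2 of each pair: r1, r2, r1, r2, ..."""
    if len(reads1) != len(reads2):
        raise ValueError("paired inputs must contain the same number of reads")
    merged: List[Record] = []
    for r1, r2 in zip(reads1, reads2):
        merged.append(r1)
        merged.append(r2)
    return merged


def deinterleave(records: Sequence[Record]) -> Tuple[List[Record], List[Record]]:
    """Split an interleaved list back into (read 1 list, read 2 list)."""
    if len(records) % 2 != 0:
        raise ValueError("interleaved input must contain an even number of reads")
    return list(records[0::2]), list(records[1::2])
```

De-interleaving the output of `interleave` gives back the two original lists, which is why pairing survives a round trip.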

Verify that interleaving appears correct, assuming Illumina naming conventions: in=x.fq vint
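
For anyone wondering what a name-based interleaving check could look like, here is a rough Python sketch. It is only my illustration of the idea and may not match the rules Reformat applies; the function name is hypothetical, and it assumes the two common Illumina header styles (a "/1" and "/2" suffix, or a space followed by "1:" and "2:").

```python
def names_look_paired(name1: str, name2: str) -> bool:
    """Return True if two read names look like read 1 and read 2 of one pair.

    Assumed conventions (not taken from Reformat's source):
      old style:   "<id>/1"      and "<id>/2"
      newer style: "<id> 1:..."  and "<id> 2:..."
    """
    # Old style: identical names apart from the /1 and /2 suffix.
    if name1.endswith("/1") and name2.endswith("/2"):
        return name1[:-2] == name2[:-2]
    # Newer style: identical id before the first space, then 1: and 2:.
    id1, _, comment1 = name1.partition(" ")
    id2, _, comment2 = name2.partition(" ")
    return id1 == id2 and comment1.startswith("1:") and comment2.startswith("2:")
```

A file would pass this kind of check if every consecutive pair of headers satisfies it; a single dropped read shifts all later pairs and makes the check fail.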

Convert ASCII-33 to ASCII-64: in=x.fq out=y.fq qin=33 qout=64
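
In case the qin/qout idea is unfamiliar: each quality character encodes a score as the character code minus an offset, so converting between encodings just means re-basing every character. A small Python sketch of that arithmetic follows; the function name is mine, and it does no range checking, so treat it as an illustration rather than how Reformat handles out-of-spec values.

```python
def convert_quality_offset(qualities: str, qin: int = 33, qout: int = 64) -> str:
    """Re-encode a fastq quality string from offset qin to offset qout."""
    # Decode each character to its numeric score, then encode with the new offset.
    scores = [ord(ch) - qin for ch in qualities]
    return "".join(chr(score + qout) for score in scores)
```

For example, `convert_quality_offset("!5I")` decodes scores 0, 20 and 40 and re-encodes them as "@Th".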

Quality-trim paired reads to Q10 on the left and right ends and discard reads shorter than 50bp after trimming: in1=x1.fq in2=x2.fq out1=y1.fq out2=y2.fq outsingle=singletons.fq qtrim=rl trimq=10 minlength=50
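
To illustrate what trimming both ends and then applying a minimum length means, here is a deliberately simple Python sketch: it strips bases from each end while their quality is below the threshold. This is a plain end-clipping rule that I am using as a stand-in; Reformat's actual trimming algorithm may well differ, and the function name and the ASCII-33 assumption are mine.

```python
from typing import Optional, Tuple


def quality_trim(bases: str, qualities: str, trimq: int = 10,
                 minlength: int = 50, offset: int = 33) -> Optional[Tuple[str, str]]:
    """Clip low-quality bases from both ends; return None if the read gets too short."""
    scores = [ord(ch) - offset for ch in qualities]
    left, right = 0, len(scores)
    # Advance from the left while bases are below the threshold.
    while left < right and scores[left] < trimq:
        left += 1
    # Retreat from the right while bases are below the threshold.
    while right > left and scores[right - 1] < trimq:
        right -= 1
    if right - left < minlength:
        return None  # read would be discarded
    return bases[left:right], qualities[left:right]
```

With paired input, a pair where only one read survives is where an output like outsingle comes in: the surviving mate no longer has a partner.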

Subsample 10% of the first 20000 pairs in an interleaved file: in=x.fq out=y.fq reads=20000 samplerate=0.1 int=t
(in this case "int=t" overrides interleaving autodetection, to ensure reads are treated as pairs)
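
The combination of a read limit and a sample rate can be pictured like this in Python: look only at the first N pairs, and keep each one with the given probability. This is a sketch of the general idea, not Reformat's implementation; the function name and the `seed` parameter are assumptions I added so the example is reproducible.

```python
import random
from itertools import islice
from typing import Iterable, List, TypeVar

Pair = TypeVar("Pair")


def subsample_pairs(pairs: Iterable[Pair], limit: int,
                    samplerate: float, seed: int = 0) -> List[Pair]:
    """Keep each of the first `limit` pairs with probability `samplerate`."""
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same pairs
    # Each pair is kept or dropped as a unit, so mates never get separated.
    return [pair for pair in islice(pairs, limit) if rng.random() < samplerate]
```

Because the decision is made per pair and not per read, the output of this kind of sampling is still properly paired. Note that a per-pair probability gives roughly 10% of the pairs, not an exact count.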

Pipe in a gzipped sam file and pipe out fasta: in=stdin.sam.gz out=stdout.fa

Reverse-complement reads: in=x.fq out=y.fq rcomp
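
For reference, reverse-complementing a read means complementing every base and reversing the order, and the quality string has to be reversed along with it so each score stays attached to its base. A short Python sketch, with a function name of my own choosing and handling only A, C, G, T and N:

```python
from typing import Tuple

# Complement table for upper- and lower-case bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(bases: str, qualities: str) -> Tuple[str, str]:
    """Return the reverse-complemented bases and the reversed quality string."""
    return bases.translate(_COMPLEMENT)[::-1], qualities[::-1]
```

For example, `reverse_complement("AACG", "ABCD")` gives `("CGTT", "DCBA")`.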

For reformatting a file with very long sequences, Reformat will need more memory; just add the additional flag "-Xmx2g". For example, to change the line-wrapping length on the human genome (which has individual sequences over 200Mbp long) to 70 characters: -Xmx2g in=HG19.fa.gz out=HG19_wrapped.fa.gz fastawrap=70
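
Line-wrapping itself is simple: the sequence is cut into fixed-width chunks, with a shorter last line if the length is not a multiple of the width. Here is a Python sketch of that step, using a made-up function name; it holds the whole sequence in memory, which also hints at why a 200Mbp chromosome needs more memory than a short read.

```python
from typing import List


def wrap_fasta_sequence(sequence: str, width: int = 70) -> List[str]:
    """Split one sequence into lines of at most `width` characters."""
    if width <= 0:
        raise ValueError("width must be positive")
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]
```

A 150bp sequence wrapped at 70 comes out as lines of 70, 70 and 10 characters.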

For additional functions, please run the shellscript with no arguments, or just read it with a text editor. If you have any questions, please post them in this thread.

For people using a non-bash terminal, you may need to type "bash" in front of the shellscript name instead of just the shellscript name.
For users of Windows or other platforms that do not support bash shellscripts, replace the shellscript name with "java -ea -Xmx200m -cp /path/to/bbmap/current/ jgi.ReformatReads"
For example:
java -ea -Xmx200m -cp C:\bbmap\current\ jgi.ReformatReads in=x.fq out=y.fa

Reformat can be downloaded with BBTools here: