SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Introducing BBMerge: A paired-end read merger Brian Bushnell Bioinformatics 112 10-14-2017 01:54 PM
Converter for vcf to bed format ketan_bnf Bioinformatics 4 09-03-2013 05:43 AM
Need Sequence Format Converter byou678 Bioinformatics 5 10-23-2012 01:17 PM
BOAT aligner output format converter? rahul.m.dhodapkar Bioinformatics 0 06-30-2010 07:28 AM
MAQ .map alignment format converter fadista Bioinformatics 0 10-24-2008 06:27 AM

Reply
 
Thread Tools
Old 08-27-2014, 11:37 AM   #1
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default Introducing Reformat, a fast read format converter

Reformat is a member of the BBMap/BBTools package. It is a multipurpose tool designed for converting reads or other nucleotide data between different formats. It supports, and can inter-convert:

fastq
fasta
fasta+qual
sam
scarf (an old Illumina format)
bam (if samtools is installed)
gzip
zip
ascii-33 (sanger)
ascii-64 (old Illumina)
paired files
interleaved files

It is multithreaded and can process data at over 500 megabytes per second, and can accept streams from standard in and write to standard out, allowing it to be easily dropped into the middle of a pipeline for format conversion. Reformat autodetects formats based on file extensions and content, making it very easy to use; and the autodetection can be overridden, allowing flexibility for people who don't like to follow naming conventions, or out-of-spec fastq files with qualities values like -17 or 120.

The program has been gradually expanded, and can now perform various other functions. None of these will break pairing, if the input is paired.

Quality trimming (either or both ends)
Quality filtering
Fixed-length trimming
Generation of histograms (base composition, quality, etc)
Subsampling (to a fraction of input reads, or an exact number of reads or bases)
Changing fasta line-wrapping length
Reverse-complementing (all reads or only read 2)
Adding /1 and /2 suffix to read names
GC-content filtering
Length-filtering
Testing for corrupted interleaved files

Reformat is compatible with any platform that supports Java 1.7 or higher. It also has a bash shellscript for simpler invocation. Typical usage examples:

Reformat fastq into fasta:
reformat.sh in=x.fq out=y.fa

Interleave paired reads:
reformat.sh in1=x1.fq in2=x2.fq out=y.fq

Note - you can actually use a shortcut if paired read files have the same name with a 1 and a 2. This is equivalent to the above command:
reformat.sh in=x#.fq out=y.fq

De-interleave reads:
reformat.sh in=x.fq out1=y1.fq out2=y2.fq

Verify that interleaving appears correct, assuming Illumina namimg conventions:
reformat.sh in=x.fq vint

Convert ASCII-33 to ASCII-64:
reformat.sh in=x.fq out=y.fq qin=33 qout=64

Quality-trim paired reads to Q10 on the left and right ends and discard reads shorter than 50bp after trimming:
reformat.sh in1=x1.fq in2=x2.fq out1=y1.fq out2=y2.fq outsingle=singletons.fq qtrim=rl trimq=10 minlength=50

Subsample 10% of the first 20000 pairs in an interleaved file:
reformat.sh in=x.fq out=y.fq reads=20000 samplerate=0.1 int=t
(in this case "int=t" overrides interleaving autodetection, to ensure reads are treated as pairs)

Pipe in a gzipped sam file and pipe out fasta:
reformat.sh in=stdin.sam.gz out=stdout.fa

Reverse-complement reads:
reformat.sh in=x.fq out=y.fq rcomp

For reformatting a file with very long sequences, Reformat will need more memory; just add the additional flag "-Xmx2g". For example, to change the line-wrapping length on the human genome (which has individual sequences over 200Mbp long) to 70 characters:
reformat.sh -Xmx2g in=HG19.fa.gz out=HG19_wrapped.fa.gz fastawrap=70

For additional functions, please run the shellscript with no arguments, or just read it with a text editor. If you have any questions, please post them in this thread.

For people using a non-bash terminal, you may need to type "bash reformat.sh" instead of just "reformat.sh".
For users of Windows or other platforms that do not support bash shellscripts, replace "reformat.sh" with "java -ea -Xmx200m /path/to/bbmap/current/ jgi.ReformatReads"
for example,
java -ea -Xmx200m C:\bbmap\current\ jgi.ReformatReads in=x.fq out=y.fa

Reformat can be downloaded with BBTools here:
https://sourceforge.net/projects/bbmap/
Brian Bushnell is offline   Reply With Quote
Old 08-28-2014, 12:55 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,476
Default

Looks nice. Out of curiousity, can it handle chimeric (i.e., non-linear) alignments when converting from BAM to fastq? That's been a real weak point of a lot of other tools.
dpryan is offline   Reply With Quote
Old 08-28-2014, 09:35 AM   #3
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

No, the conversion from sam/bam is stateless and 'dumb' - each line in the sam file will generate a FASTQ read, so secondary and chimeric alignments would cause problems.
Brian Bushnell is offline   Reply With Quote
Old 08-28-2014, 12:10 PM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,476
Default

Consider this the first feature request then
dpryan is offline   Reply With Quote
Old 10-12-2014, 09:33 PM   #5
Nanu
Member
 
Location: New Delhi

Join Date: Sep 2014
Posts: 30
Default

When I use the command reformat.sh in bbtools package I am getting the following error::
java -ea -Xmx200m -cp /home/himanshu/Downloads/me2/bbmap/current/ jgi.ReformatReads -in=reads.fna qfin=reads.qual out=reads.fasta
Executing jgi.ReformatReads [-in=reads.fna, qfin=reads.qual, out=reads.fasta]

Input is being processed as unpaired
Exception in thread "Thread-1" java.lang.AssertionError
at stream.FastaQualReadInputStream3.makeRead(FastaQualReadInputStream3.java:257)
at stream.FastaQualReadInputStream3.toReadList(FastaQualReadInputStream3.java:147)
at stream.FastaQualReadInputStream3.toReads(FastaQualReadInputStream3.java:113)
at stream.FastaQualReadInputStream3.fillBuffer(FastaQualReadInputStream3.java:97)
at stream.FastaQualReadInputStream3.hasMore(FastaQualReadInputStream3.java:56)
at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:745)
at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:737)

Please help me
Nanu is offline   Reply With Quote
Old 10-13-2014, 08:38 AM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

This was caused by the bases and qualities having different lengths. Either the fna and qual file do not go together, or their order is different, or one of the files is misformatted.

"reads.fna" should already be in fasta format, though, which is what you specified as the output format. If you want fastq, the output filename needs to end with "fastq".

I suggest you post the first sequence in the reads.fna and reads.qual so we can see what's going on.
Brian Bushnell is offline   Reply With Quote
Old 01-27-2015, 09:43 AM   #7
Marisa_Miller
Member
 
Location: St. Paul, MN

Join Date: Aug 2010
Posts: 34
Default

Hi Brian,

Does the subsampling tool randomly subsample reads? The program I am using now (Geneious) only takes the first specified % of reads and does not randomize.

Thank you,
Marisa
Marisa_Miller is offline   Reply With Quote
Old 01-27-2015, 10:14 AM   #8
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by Marisa_Miller View Post
Hi Brian,

Does the subsampling tool randomly subsample reads? The program I am using now (Geneious) only takes the first specified % of reads and does not randomize.

Thank you,
Marisa
Yes, it randomly subsamples. You can set the RNG seed with "sampleseed=NUMBER" if you want deterministic random sampling; by default, the seed is random.

Reformat can also give the first X reads with the "reads=X" flag, or combine the reads=X and samplerate=Y to subsample a fraction of the first X reads, etc.
Brian Bushnell is offline   Reply With Quote
Old 02-24-2015, 11:35 AM   #9
jpummil
Member
 
Location: Fayetteville, AR

Join Date: Apr 2014
Posts: 82
Default

Hey Brian!

So, to subsample a set of PE reads to reduce overall file size (creating quick running data set for a workshop), this would suffice?

reformat.sh in1=x1.fq in2=x2.fq out1=y1.fq out2=y2.fq reads=-1 samplerate=0.1 int=f

It would: Keep parings intact, give me 1/10 the data overall, ensure no interleaving (though I expect assigning the pairings at the beginning would do this as well)

I already did a quick quality trimming with:
reformat.sh in1=x1.fastq in2=x2.fastq out1=y1.fastq out2=y2.fastq outsingle=singletons.fq qtrim
=rl trimq=10 minlength=50

Edit: Seems to have worked!

-rw-rw-r-- 1 jpummil jpummil 2.0G Feb 24 13:06 L001_R1_001_Qt.fastq
-rw-rw-r-- 1 jpummil jpummil 2.0G Feb 24 13:06 L001_R2_001_Qt.fastq

-rw-rw-r-- 1 jpummil jpummil 199M Feb 24 13:43 L001_R1_001_Sub.fastq
-rw-rw-r-- 1 jpummil jpummil 199M Feb 24 13:43 L001_R2_001_Sub.fastq

And it ran just fine in SPAdes.

Last edited by jpummil; 02-24-2015 at 11:59 AM. Reason: New info...
jpummil is offline   Reply With Quote
Old 02-24-2015, 12:31 PM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Yep, that's the correct approach. The "reads=-1" flag is not necessary (that's the default), and "int=f" also gets forced automatically when you have dual input files.
Brian Bushnell is offline   Reply With Quote
Old 03-17-2015, 04:24 PM   #11
rodf
Junior Member
 
Location: MN

Join Date: Mar 2015
Posts: 2
Default addslash=t problem

Hi Brian,

I tried to use reformat to add /1 and /2 to paired read names and a space was added between the name and the slash and this does not work for the assembler.

i.e.
want name/1 but get name /1

how to fix this?

Thanks,
Rod
rodf is offline   Reply With Quote
Old 03-17-2015, 04:36 PM   #12
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi Rod,

There is currently no way to fix that. I specifically made it that way to replicate the name structure of real Illumina reads. For paired reads, many tools or formats (such as sam) require both to have exactly matching names, excluding anything after the first whitespace (such as a 1 or 2). Most aligners, therefore, trim everything after the first whitespace, though with BBMap you can disable this with the "keepnames" flag.

What are you doing that requires no space? I could add an option for that, but I'm interested in why.
Brian Bushnell is offline   Reply With Quote
Old 03-17-2015, 04:59 PM   #13
rodf
Junior Member
 
Location: MN

Join Date: Mar 2015
Posts: 2
Default

Hi Brian,

I'm using the Mira assembler to assemble Illumina MiSeq reads I got from NCBI/SRA. I use "fastq-dump --split-files -F xxxx.sra" to extract the reads and get two files. Each set of paired end reads have exactly the same name, and mira needs the /1 and /2 added onto the reads and gives an error if it detects reads with the same name.

Thanks for your quick reply,
Rod
rodf is offline   Reply With Quote
Old 03-23-2015, 01:03 PM   #14
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by rodf View Post
Hi Brian,

I'm using the Mira assembler to assemble Illumina MiSeq reads I got from NCBI/SRA. I use "fastq-dump --split-files -F xxxx.sra" to extract the reads and get two files. Each set of paired end reads have exactly the same name, and mira needs the /1 and /2 added onto the reads and gives an error if it detects reads with the same name.

Thanks for your quick reply,
Rod
Rod,

You can now use the flags "addslash=t slashspace=f" together to accomplish that.

As an unrelated note, reformat now supports the "stoptag" flag, so it can process a sam file and add the stop coordinate of the read to the optional tags, prefixed by "YS:i:".
Brian Bushnell is offline   Reply With Quote
Old 06-11-2016, 01:48 PM   #15
darthsequencer
Member
 
Location: california

Join Date: Feb 2012
Posts: 29
Default Percent identity filter?

Hi is there a way to filter sam/bam filters by percent identity?
darthsequencer is offline   Reply With Quote
Old 06-11-2016, 02:04 PM   #16
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Not currently... I can add that, though. I'll make a note to do that. BBMap has an "idfilter" flag, though.
Brian Bushnell is offline   Reply With Quote
Old 06-12-2016, 02:42 PM   #17
darthsequencer
Member
 
Location: california

Join Date: Feb 2012
Posts: 29
Default

Great - I've been using idfilter but it's pretty consuming to have to rerun the mapping (I'm dealing with ~10 lanes of data now).
darthsequencer is offline   Reply With Quote
Old 06-14-2016, 05:33 PM   #18
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

I just uploaded a new version of BBTools - 36.11 - that supports idfilter (and subfilter, editfilter, etc) in Reformat. Bear in mind that reads mapped using old-style cigar strings ('M' symbol instead of 'X' and '=') must also have MD tags. For newer cigar strings MD tags are not necessary. Unmapped reads will not be affected by this filter (they will pass the filter), so if you want to get rid of them you also need to set "mappedonly=t".
Brian Bushnell is offline   Reply With Quote
Old 06-15-2016, 01:51 PM   #19
darthsequencer
Member
 
Location: california

Join Date: Feb 2012
Posts: 29
Default

Talk about a quick turnaround! Thanks a bunch BB.
darthsequencer is offline   Reply With Quote
Reply

Tags
ascii33, ascii64, bbduk, bbmap, bbtools, fasta, fastq, interleavei33, quality trim, reformat, scarf, subsample

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:57 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO