SEQanswers

Old 02-11-2019, 06:43 AM   #1
PinkTips
Member
 
Location: Athens, GA

Join Date: Feb 2019
Posts: 10
BBSplit assertion error: invalid fasta file

Good morning, BBMappers!

I have been trying to run BBSplit (on my university's computing cluster) to remove host sequences from metatranscriptome data of a gut community.

This is the command I am using:
Code:
/home/hd55218/BBSplit/bbmap/bbsplit.sh in=/home/hd55218/BBSplit/QualTrimmed_bran11.fasta ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta basename=out_%.fasta outu=/home/hd55218/BBSplit/cleaned_bran11.fasta
The error message returned after running on the cluster is:
Code:
Exception in thread "main" java.lang.AssertionError: Invalid input file: '/home/hd55218/BBSplit/QualTrimmed_bran11.fasta'
        at align2.AbstractMapper.preparse0(AbstractMapper.java:821)
        at align2.AbstractMapper.<init>(AbstractMapper.java:53)
        at align2.BBMap.<init>(BBMap.java:43)
        at align2.BBMap.main(BBMap.java:31)
        at align2.BBSplitter.main(BBSplitter.java:47)
The first four sequences in my FASTA file appear as:
Code:
>NB502039:96:HGLYGBGX3:1:11101:19340:2795 1:N:0:AGTTCC
GTCCTCTTCCGGGGTCTGGGTGCCAAGGCCCATCGCCTGCAGACCTTCGTTCAGCGGGGTGTACACGGGGCCTTCGAATGCGCCATCGATGACCACGGTCGTCTTGTCATACTCGTTGCCGAAGTTCGCCATTTCGATCTGCAGCGGCTCCAGATCCAGCGTGGTGTAGTCGATGTCCACACGGCTGGGGGGGGGCACGCCGCCGGTGACGAGCCTGTAGGTCTGGCACTCCCC
>NB502039:96:HGLYGBGX3:1:11101:23904:2797 1:N:0:AGTTCC
CCGCCTTCAACGCCAAGAGCGCGAATTATGCGTATAGATGCACTTCTAAGCATCATGAGTTCTCTATCAGAAAGTGTTTGCGCAGGAGCTGCAACTATACTGTCACCTGTATGAACACCAACAGGGTCAAGGTTTTCCATCGAACAAATCGTAATGCAGTTATCTGCG
>NB502039:96:HGLYGBGX3:1:11101:16907:2810 1:N:0:AGTTCC
GGCACCGAACGCCTTGGCAGCCAAAGCCATAGCCGGCACGAACTGACGGTCGCCGACCGTCTTGCCGCCGCCCGCTCCGGGACGCTGCACCGAGTGGGTACAGTCCATTATCACGCGTGGCGTTATCTGCTTCATATCGGGAATATTGCGGAAATCAACCACCAAGTTATTGTACCCGAAGCTGTTGCCTCGCTCTATCAACCACACGTTTTCGTTACCGCTCTCGCGCACTTTCTGCACGG
>NB502039:96:HGLYGBGX3:1:11101:20216:2823 1:N:0:AGTTCC
TAAAGGCAAATGGCTCTATCATGAAATCCTGGAGCCGGGCGTGTTGGTGCATGTTTCTGAGAGCGGTGCCAAAGTATGGACCGTTCGCTGTGGTTCCCCCCGTCTGGTAACGGTCAATTATGTTCGCG
This FASTA file was converted from a FASTQ file using:
Code:
paste - - - - < Qualtrimmed_bran11.fastq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > Qualtrimmed_bran11.fasta
I am stumped as to why my FASTA format is invalid, so any thoughts/help would be greatly appreciated! Thanks!
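
One thing worth ruling out with an "Invalid input file" message is the path itself. The conversion command above writes Qualtrimmed_bran11.fasta, while the bbsplit.sh command reads QualTrimmed_bran11.fasta, and on a case-sensitive filesystem those are two different names. Whether that is only a typo in the post is not clear from the thread. A minimal pre-flight sketch follows; check_input is a hypothetical helper and not part of BBTools, and nothing here is claimed about what BBSplit itself checks.

```shell
# check_input: hypothetical helper (not part of BBTools).
# Reports whether a path exists, is readable, and is non-empty.
# If the path is missing, it lists files in the same directory whose
# names differ from the requested one only by letter case.
check_input() {
    ci_file=$1
    if [ ! -e "$ci_file" ]; then
        echo "missing: $ci_file"
        ci_dir=$(dirname -- "$ci_file")
        ci_base=$(basename -- "$ci_file")
        ls -1 -- "$ci_dir" 2>/dev/null \
            | grep -ixF -- "$ci_base" \
            | sed 's/^/  similar name: /'
        return 1
    fi
    if [ ! -r "$ci_file" ]; then
        echo "not readable: $ci_file"
        return 1
    fi
    if [ ! -s "$ci_file" ]; then
        echo "empty: $ci_file"
        return 1
    fi
    echo "ok: $ci_file"
}

# Usage (path taken from the post):
#   check_input /home/hd55218/BBSplit/QualTrimmed_bran11.fasta
```

Running it on the exact string passed to in= (copied from the job script, not retyped) is the point of the check.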

Last edited by GenoMax; 03-21-2019 at 11:36 AM.
Old 02-11-2019, 11:26 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,909

Do you get an error right away or does the program run for some time?

Are these bbmerged reads? I wonder if you should try the binning with the original fastq data. Is that a possibility?
Old 02-11-2019, 11:37 AM   #3
PinkTips
Member
 
Location: Athens, GA

Join Date: Feb 2019
Posts: 10

From what I can tell, the error shows up right away. I only get an email when my job is finished on the cluster, but the log file shows the error appearing right away (right after the reference files are merged).

Yes, these reads were merged with bbmerge.

I was under the impression that bbsplit wanted the reads as FASTA files, but I will try with the FASTQ files!

Thank you, and I will let you know how it goes!
Old 02-11-2019, 02:20 PM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,909

BBSplit will take fastq reads and bin them. Let me know how that works.

You can convert the merged fastq reads afterwards with
Code:
reformat.sh in=merged.fq.gz out=merged.fa
No paste/cut/sed needed :-)
Old 02-12-2019, 06:58 AM   #5
PinkTips
Member
 
Location: Athens, GA

Join Date: Feb 2019
Posts: 10

Unfortunately, I get the same assertion error as before when I use my FASTQ file.

The first few sequences of the FASTQ file:
Code:
@NB502039:96:HGLYGBGX3:1:11101:19340:2795 1:N:0:AGTTCC
GTCCTCTTCCGGGGTCTGGGTGCCAAGGCCCATCGCCTGCAGACCTTCGTTCAGCGGGGTGTACACGGGGCCTTCGAATGCGCCATCGATGACCACGGTCGTCTTGTCATACTCGTTGCCGAAGTTCGCCATTTCGATCTGCAGCGGCTCCAGATCCAGCGTGGTGTAGTCGATGTCCACACGGCTGGGGGGGGGCACGCCGCCGGTGACGAGCCTGTAGGTCTGGCACTCCCC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEAEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEJJJJHJHJHJJJJJJJJJJJDJGFJJJJJJHJJJJJJJJHJ?JHJJJJJJJHJJ7JHJJHJJJJJHJEEEEEEEEEEE/EEAE/EEAA/E/E/EEEEEEEEE/AEEEE/EEEEEAEE/EEEEEAEEEE/EAEEAAEEE/AE/EEEAAAAAA
@NB502039:96:HGLYGBGX3:1:11101:23904:2797 1:N:0:AGTTCC
CCGCCTTCAACGCCAAGAGCGCGAATTATGCGTATAGATGCACTTCTAAGCATCATGAGTTCTCTATCAGAAAGTGTTTGCGCAGGAGCTGCAACTATACTGTCACCTGTATGAACACCAACAGGGTCAAGGTTTTCCATCGAACAAATCGTAATGCAGTTATCTGCG
+
AAAAAEEEEEEEEEEEEEEJJJJJJJHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJEEEEEEEEEEEEAAAAA
@NB502039:96:HGLYGBGX3:1:11101:16907:2810 1:N:0:AGTTCC
GGCACCGAACGCCTTGGCAGCCAAAGCCATAGCCGGCACGAACTGACGGTCGCCGACCGTCTTGCCGCCGCCCGCTCCGGGACGCTGCACCGAGTGGGTACAGTCCATTATCACGCGTGGCGTTATCTGCTTCATATCGGGAATATTGCGGAAATCAACCACCAAGTTATTGTACCCGAAGCTGTTGCCTCGCTCTATCAACCACACGTTTTCGTTACCGCTCTCGCGCACTTTCTGCACGG
+
AAAAAEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEAEEEJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJHJJJJJJJJJJJJJJJJJHJJJJJJJJEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA
@NB502039:96:HGLYGBGX3:1:11101:20216:2823 1:N:0:AGTTCC
TAAAGGCAAATGGCTCTATCATGAAATCCTGGAGCCGGGCGTGTTGGTGCATGTTTCTGAGAGCGGTGCCAAAGTATGGACCGTTCGCTGTGGTTCCCCCCGTCTGGTAACGGTCAATTATGTTCGCG
+
DFDDJJH77HJHJHJHJHJHJJJHJHJHJJJJJJH77JHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJ

Last edited by GenoMax; 03-21-2019 at 11:51 AM.
Old 02-12-2019, 09:09 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,909

Can you validate your fastq files to make sure there are no errors in the file?

Use validateFiles from Kent Utilities (UCSC). Linux version linked. After downloading, add execute permissions (chmod a+x validateFiles) before running. List as many files as you need:

Code:
validateFiles -type=fastq file1.gz file2.gz
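
If validateFiles is not at hand, a rough structural check can be done with awk alone. This is only a sketch under the assumption of strict four-line FASTQ records (no wrapped sequences); fastq_check is a hypothetical name, and passing this check says nothing about whether a particular tool will accept the file.

```shell
# fastq_check: hypothetical helper, reads plain or gzipped FASTQ.
# Checks, for each 4-line record: header starts with '@', separator
# starts with '+', and sequence and quality strings have equal length.
# Stops at the first problem and returns a non-zero status.
fastq_check() {
    case $1 in
        *.gz) gzip -dc -- "$1" ;;
        *)    cat -- "$1" ;;
    esac | awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" {
            print "line " NR ": header does not start with @"; bad = 1; exit
        }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" {
            print "line " NR ": separator does not start with +"; bad = 1; exit
        }
        NR % 4 == 0 && length($0) != seqlen {
            print "line " NR ": quality length differs from sequence length"; bad = 1; exit
        }
        END {
            if (!bad && NR == 0)     { print "no records found"; bad = 1 }
            if (!bad && NR % 4 != 0) { print "truncated record at end of file"; bad = 1 }
            if (!bad) print "ok: " NR / 4 " records"
            exit bad
        }'
}

# Usage:
#   fastq_check QualTrimmed_bran11.fastq
```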
Old 02-12-2019, 10:31 AM   #7
PinkTips
Member
 
Location: Athens, GA

Join Date: Feb 2019
Posts: 10

I used
Code:
validateFiles -type=fastq QualTrimmed_bran11.fastq
and the output was
Quote:
Error count 0
When I used
Code:
validateFiles -type=fastq QualTrimmed_bran11.fasta
the output was
Quote:
Error [file=QualTrimmed_bran11.fastq, line=1]: sequence name first char invalid (got '@', wanted '>') [@NB502039:96:HGLYGBGX3:1:11101:19340:2795 1:N:0:AGTTCC]
Aborting .. found 1 error
So I figured it was working properly.

(I am using v38.22)

Last edited by PinkTips; 02-12-2019 at 10:31 AM. Reason: forgot to add
Old 02-12-2019, 11:40 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,909

I am wondering if the error you are seeing is a red herring. How much memory are you allocating to this job on the cluster? BBSplit can need a lot of memory, depending on the size of the reference genomes.

You should explicitly add the "-Xmx20g" flag (20 GB, just an example) to your bbsplit command. Make sure you request the same amount of memory on the cluster side.

On a side note:

Code:
validateFiles -type=fastq QualTrimmed_bran11.fasta
generated an error since you need to change the type to fasta to match. So try

Code:
validateFiles -type=fasta QualTrimmed_bran11.fasta
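
On the memory point above, one way to keep the two numbers in step is to derive the -Xmx value from the scheduler allocation instead of typing both by hand. The sketch below is an assumption-laden illustration: the scheduler is not named in this thread, so the allocation is passed in as a plain number of megabytes, xmx_flag is a hypothetical helper, and the default of 85% is an arbitrary margin left for JVM memory outside the heap. The JVM's -Xmx option takes a number followed by a single-letter unit such as m or g.

```shell
# xmx_flag: hypothetical helper.
# $1 = memory granted to the job, in MB
# $2 = percentage of that to give to the Java heap (default 85, an
#      arbitrary safety margin, not a BBTools recommendation)
xmx_flag() {
    xf_alloc_mb=$1
    xf_pct=${2:-85}
    echo "-Xmx$(( xf_alloc_mb * xf_pct / 100 ))m"
}

# Usage, for a job that was granted 20 GB (20480 MB):
#   bbsplit.sh "$(xmx_flag 20480)" in=reads.fastq ref=a.fasta,b.fasta ...
```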
Old 02-12-2019, 12:01 PM   #9
PinkTips
Member
 
Location: Athens, GA

Join Date: Feb 2019
Posts: 10

I was only allotting 2GB from the cluster side, likely not enough for BBSplit to do its thing!
I've tried again with "-Xmx200gb" and will let you know how it goes!

Thanks - I never would have gotten that from java's error message!
Old 03-21-2019, 10:56 AM   #10
PinkTips
Member
 
Location: Athens, GA

Join Date: Feb 2019
Posts: 10

Hi, I'm back again -- with the same assertion error at the same step.

I allotted 100 GB (on both the BBSplit side and the cluster side) and still get the same assertion error as before. If it's helpful, this is the output from the cluster after my job is run:
Quote:
java -Djava.library.path=/home/hd55218/BBSplit/bbmap/jni/ -ea -Xmx48g -cp /home/hd55218/BBSplit/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_plasmid.fasta basename=out_%.fasta outu=/home/hd55218/BBSplit/cleaned_bran11.fastq -Xmx48g
Executing align2.BBSplitter [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq, ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_plasmid.fasta, basename=out_%.fasta, outu=/home/hd55218/BBSplit/cleaned_bran11.fastq, -Xmx48g]

Converted arguments to [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq, basename=out_%.fasta, outu=/home/hd55218/BBSplit/cleaned_bran11.fastq, ref_p.americana_genome=/home/hd55218/BBSplit/p.americana_genome.fasta, ref_Blattabacterium_genome=/home/hd55218/BBSplit/Blattabacterium_genome.fasta, ref_Blattabacterium_plasmid=/home/hd55218/BBSplit/Blattabacterium_plasmid.fasta]
Creating merged reference file ref/genome/1/merged_ref_3113916972846229527.fa.gz
Ref merge time: 140.410 seconds.
Exception in thread "main" java.lang.AssertionError: Invalid input file: '/home/hd55218/BBSplit/QualTrimmed_bran11.fastq'
at align2.AbstractMapper.preparse0(AbstractMapper.java:821)
at align2.AbstractMapper.<init>(AbstractMapper.java:53)
at align2.BBMap.<init>(BBMap.java:43)
at align2.BBMap.main(BBMap.java:31)
at align2.BBSplitter.main(BBSplitter.java:47)
Below is the bbsplit command I used:
Quote:
/home/hd55218/BBSplit/bbmap/bbsplit.sh in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta basename=out_%.fasta outu=/home/hd55218/BBSplit/cleaned_bran11.fastq -Xmx100g
My reference genomes are 3.43GB, 646KB, and 4KB.

Thanks for helping me work through this!
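
A detail in the two quotes above: the java line in the log carries -Xmx48g and three reference files, while the bbsplit command shown passes -Xmx100g and two reference files, so the log and the command may not come from the same run. A small sketch for pulling the heap setting out of a job log so it can be compared with what was requested; xmx_in_log is a hypothetical helper and the log file name in the usage line is a placeholder.

```shell
# xmx_in_log: hypothetical helper.
# Prints each distinct -Xmx setting found in a log file, one per line.
# Relies on grep -o, which GNU and BSD grep both provide.
xmx_in_log() {
    grep -oE -- '-Xmx[0-9]+[kKmMgG]' "$1" | sort -u
}

# Usage (job.log is a placeholder name):
#   xmx_in_log job.log
```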
Old 03-21-2019, 11:07 AM   #11
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,909

Now we have a different error.

Quote:
Invalid input file: '/home/hd55218/BBSplit/QualTrimmed_bran11.fastq'
Is that file actually in fastq format and is it in that location?

Post the output from
Code:
head -4 /home/hd55218/BBSplit/QualTrimmed_bran11.fastq
Old 03-21-2019, 11:15 AM   #12
PinkTips
Member
 
Location: Athens, GA

Join Date: Feb 2019
Posts: 10

The output from
Quote:
"head -4 /home/hd55218/BBSplit/Qualtrimmed_bran11.fastq"
is:
Code:
@NB502039:96:HGLYGBGX3:1:11101:19340:2795 1:N:0:AGTTCC
GTCCTCTTCCGGGGTCTGGGTGCCAAGGCCCATCGCCTGCAGACCTTCGTTCAGCGGGGTGTACACGGGGCCTTCGAATGCGCCATCGATGACCACGGTCGTCTTGTCATACTCGTTGCCGAAGTTCGCCATTTCGATCTGCAGCGGCTCCAGATCCAGCGTGGTGTAGTCGATGTCCACACGGCTGGGGGGGGGCACGCCGCCGGTGACGAGCCTGTAGGTCTGGCACTCCCC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEAEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEJJJJHJHJHJJJJJJJJJJJDJGFJJJJJJHJJJJJJJJHJ?JHJJJJJJJHJJ7JHJJHJJJJJHJEEEEEEEEEEE/EEAE/EEAA/E/E/EEEEEEEEE/AEEEE/EEEEEAEE/EEEEEAEEEE/EAEEAAEEE/AE/EEEAAAAAA

Last edited by GenoMax; 03-21-2019 at 11:23 AM.
Old 03-21-2019, 11:27 AM   #13
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,909

I don't think you can split to a fasta format file directly. Can you try the following?

Code:
/home/hd55218/BBSplit/bbmap/bbsplit.sh -Xmx100g threads=2 in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta basename=out_%.fastq outu=/home/hd55218/BBSplit/cleaned_bran11.fastq
Reads that do not match to the two genomes will end up in "cleaned_bran11.fastq" file. Just making sure that is what you want.
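
To see which files the basename= pattern should produce, the sketch below expands the % placeholder once per reference. It assumes that % is replaced by the reference file name without its directory and final extension, which is what the "Converted arguments" line in the log earlier in this thread suggests (ref_p.americana_genome=...). expected_outputs is a hypothetical helper, not a BBTools command.

```shell
# expected_outputs: hypothetical helper.
# $1 = basename pattern containing a % placeholder, e.g. out_%.fastq
# $2 = comma-separated list of reference files, as given to ref=
# Prints one expected output file name per reference.
# Assumption: % stands for the reference file name minus directory and
# final extension.
expected_outputs() {
    eo_pattern=$1
    printf '%s\n' "$2" | tr ',' '\n' | while IFS= read -r eo_ref; do
        eo_name=$(basename -- "$eo_ref")
        eo_name=${eo_name%.*}
        case $eo_pattern in
            *%*) printf '%s%s%s\n' "${eo_pattern%%\%*}" "$eo_name" "${eo_pattern#*\%}" ;;
            *)   printf '%s\n' "$eo_pattern" ;;
        esac
    done
}

# Usage:
#   expected_outputs 'out_%.fastq' \
#     /home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta
```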
Old 03-27-2019, 12:41 PM   #14
PinkTips
Member
 
Location: Athens, GA

Join Date: Feb 2019
Posts: 10

Yes, I want reads that do not match the references to go into "cleaned".

I tried changing the basename parameter's extension to fastq, but the invalid input file error remains.
Old 03-28-2019, 03:42 AM   #15
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,909

Is the assertion error about the "fasta" files or the "fastq" data? It is possible that something is wrong with the fasta files that you are using. You would want to check those with the validateFiles tool you used before.

If the error is about the fastq data, then at this point I would suggest going back to the very original data (not quality trimmed or otherwise processed) to see if that works with bbsplit. All BBTools accept gzipped files, so there is no need to uncompress them.

Last edited by GenoMax; 03-28-2019 at 03:51 AM.
Old 03-29-2019, 01:00 PM   #16
PinkTips
Member
 
Location: Athens, GA

Join Date: Feb 2019
Posts: 10

When using non-quality-trimmed fastq for BBSplit, it worked just fine!

My quality-trimming step uses BBDuk -- this command:
Code:
./bbmap/bbduk.sh in=nonrRNA_bran11.fastq out=QualTrimmed_bran11.fastq qtrim=r trimq=10 overwrite=true
Thoughts on how I should revise this to prevent an invalid file format error when moving to BBSplit?

I really appreciate your helpful (and quick!) replies!
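
Since the untrimmed file works and the trimmed one does not, comparing basic read-length statistics of the two files is a cheap next step. The sketch below assumes uncompressed, strict four-line FASTQ; fastq_length_summary is a hypothetical helper. Whether very short or empty reads are what BBSplit objected to is not established anywhere in this thread, so treat the output only as a hint.

```shell
# fastq_length_summary: hypothetical helper.
# Prints the number of records, the shortest and longest read length,
# and how many reads have an empty sequence line.
fastq_length_summary() {
    awk 'NR % 4 == 2 {
             n = length($0)
             if (count == 0 || n < min) min = n
             if (n > max) max = n
             if (n == 0) empty++
             count++
         }
         END {
             printf "records=%d min=%d max=%d empty=%d\n", count, min, max, empty
         }' "$1"
}

# Usage:
#   fastq_length_summary nonrRNA_bran11.fastq
#   fastq_length_summary QualTrimmed_bran11.fastq
```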
Old 03-29-2019, 03:07 PM   #17
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,909

That command looks fine. Perhaps the file you got was corrupted as a one-time event.
Old 04-04-2019, 05:51 AM   #18
PinkTips
Member
 
Location: Athens, GA

Join Date: Feb 2019
Posts: 10

Thank you for your help!

I have just one more question regarding the output from BBSplit.

Quote:
-------------- Results ---------------

Genome: 1
Key Length: 13
Max Indel: 20
Minimum Score Ratio: 0.56
Mapping Mode: normal
Reads Used: 1829806 (416340827 bases)

Mapping: 1630.950 seconds.
Reads/sec: 1121.93
kBases/sec: 255.28


Read 1 data:           pct reads   num reads   pct bases   num bases

mapped:                 13.1904%      241358    13.0507%    54335231
unambiguous:             9.5835%      175359     9.5952%    39948829
ambiguous:               3.6069%       65999     3.4554%    14386402
low-Q discards:          0.0000%           0     0.0000%           0

perfect best site:       1.9436%       35564     1.8453%     7682929
semiperfect site:        1.9468%       35623     1.8485%     7695929

Match Rate:                   NA          NA    96.0824%    52522434
Error Rate:             75.3143%      204784     3.8796%     2120757
Sub Rate:               73.8546%      200815     2.3134%     1264595
Del Rate:               14.3653%       39060     0.6014%      328726
Ins Rate:               16.4704%       44784     0.9649%      527436
N Rate:                  3.9234%       10668     0.0380%       20766

Total time: 1715.910 seconds.
Since I had three reference sequences, is the "Genome: 1" referring to the combined reference of all three that I listed under the "ref=" parameter?

Code:
ref=p.americana_genome.fasta,Blattabacterium_genome.fasta,Blattabacterium_plasmid.fasta
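
The log quoted earlier in this thread shows the references being combined into a single merged file under ref/genome/1, so it seems plausible that this summary describes mapping against that merged reference as a whole. That is an inference from the log, not a documented statement. For a per-reference breakdown, one rough option is to count the reads in each file written through basename=. The sketch assumes uncompressed four-line FASTQ output; count_reads is a hypothetical helper and the file names in the usage line follow the assumed % expansion.

```shell
# count_reads: hypothetical helper.
# For each FASTQ file given, prints "<file><TAB><number of reads>",
# taking the read count as line count divided by four.
count_reads() {
    for cr_file in "$@"; do
        cr_lines=$(wc -l < "$cr_file")
        printf '%s\t%d\n' "$cr_file" "$(( cr_lines / 4 ))"
    done
}

# Usage (file names assume % expands to the reference name):
#   count_reads out_p.americana_genome.fastq out_Blattabacterium_genome.fastq
```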

Tags
bbsplit, invalid file format

All times are GMT -8.