SEQanswers

Old 02-11-2019, 06:43 AM   #1
PinkTips
Junior Member
 
Location: Athens, GA

Join Date: Feb 2019
Posts: 7
BBSplit assertion error: invalid fasta file

Good morning, BBMappers!

I have been trying to run BBSplit (on my university's computing cluster) to remove host sequences from metatranscriptome data of a gut community.

This is the command I am using:
Code:
/home/hd55218/BBSplit/bbmap/bbsplit.sh in=/home/hd55218/BBSplit/QualTrimmed_bran11.fasta ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta basename=out_%.fasta outu=/home/hd55218/BBSplit/cleaned_bran11.fasta
The error message returned after running on the cluster is:
Code:
Exception in thread "main" java.lang.AssertionError: Invalid input file: '/home/hd55218/BBSplit/QualTrimmed_bran11.fasta'
        at align2.AbstractMapper.preparse0(AbstractMapper.java:821)
        at align2.AbstractMapper.<init>(AbstractMapper.java:53)
        at align2.BBMap.<init>(BBMap.java:43)
        at align2.BBMap.main(BBMap.java:31)
        at align2.BBSplitter.main(BBSplitter.java:47)
The first four sequences in my FASTA file appear as:
Code:
>NB502039:96:HGLYGBGX3:1:11101:19340:2795 1:N:0:AGTTCC
GTCCTCTTCCGGGGTCTGGGTGCCAAGGCCCATCGCCTGCAGACCTTCGTTCAGCGGGGTGTACACGGGGCCTTCGAATGCGCCATCGATGACCACGGTCGTCTTGTCATACTCGTTGCCGAAGTTCGCCATTTCGATCTGCAGCGGCTCCAGATCCAGCGTGGTGTAGTCGATGTCCACACGGCTGGGGGGGGGCACGCCGCCGGTGACGAGCCTGTAGGTCTGGCACTCCCC
>NB502039:96:HGLYGBGX3:1:11101:23904:2797 1:N:0:AGTTCC
CCGCCTTCAACGCCAAGAGCGCGAATTATGCGTATAGATGCACTTCTAAGCATCATGAGTTCTCTATCAGAAAGTGTTTGCGCAGGAGCTGCAACTATACTGTCACCTGTATGAACACCAACAGGGTCAAGGTTTTCCATCGAACAAATCGTAATGCAGTTATCTGCG
>NB502039:96:HGLYGBGX3:1:11101:16907:2810 1:N:0:AGTTCC
GGCACCGAACGCCTTGGCAGCCAAAGCCATAGCCGGCACGAACTGACGGTCGCCGACCGTCTTGCCGCCGCCCGCTCCGGGACGCTGCACCGAGTGGGTACAGTCCATTATCACGCGTGGCGTTATCTGCTTCATATCGGGAATATTGCGGAAATCAACCACCAAGTTATTGTACCCGAAGCTGTTGCCTCGCTCTATCAACCACACGTTTTCGTTACCGCTCTCGCGCACTTTCTGCACGG
>NB502039:96:HGLYGBGX3:1:11101:20216:2823 1:N:0:AGTTCC
TAAAGGCAAATGGCTCTATCATGAAATCCTGGAGCCGGGCGTGTTGGTGCATGTTTCTGAGAGCGGTGCCAAAGTATGGACCGTTCGCTGTGGTTCCCCCCGTCTGGTAACGGTCAATTATGTTCGCG
This FASTA file was converted from a FASTQ file using:
Code:
paste - - - - < Qualtrimmed_bran11.fastq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > Qualtrimmed_bran11.fasta
I am stumped as to why my FASTA format is invalid, so any thoughts/help would be greatly appreciated! Thanks!
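
The paste/cut/sed/tr conversion quoted above can be tried on a tiny made-up FASTQ file to see what it produces. This is only a minimal sketch: the function name `fastq_to_fasta` and the two reads are invented for illustration, and it assumes every record is exactly four lines with no tab characters in the header.

```shell
#!/usr/bin/env bash
# Convert 4-line FASTQ records on stdin into 2-line FASTA records on stdout.
# Same pipeline as in the post, wrapped in a (hypothetical) helper function.
fastq_to_fasta() {
  paste - - - - |     # join each group of 4 lines into one tab-separated line
    cut -f 1,2 |      # keep only the header and the sequence
    sed 's/^@/>/' |   # FASTQ header marker -> FASTA header marker
    tr "\t" "\n"      # put header and sequence back on separate lines
}

# Two invented reads, written to a temporary file.
demo="$(mktemp)"
printf '@read1 1:N:0:AGTTCC\nACGT\n+\nIIII\n@read2 1:N:0:AGTTCC\nGGCC\n+\nJJJJ\n' > "$demo"

fastq_to_fasta < "$demo"
rm -f "$demo"
```

The header keeps its space-separated comment because `cut` splits on tabs only, which matches the FASTA headers shown in the post.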

Last edited by GenoMax; Today at 11:36 AM.
Old 02-11-2019, 11:26 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,884
Default

Do you get an error right away or does the program run for some time?

Are these bbmerged reads? Wonder if you should try to do the binning with original fastq data. Is that a possibility?
Old 02-11-2019, 11:37 AM   #3
PinkTips
Junior Member
 
Location: Athens, GA

Join Date: Feb 2019
Posts: 7
Default

From what I can tell, the error shows up right away. I only get an email when my job is finished on the cluster, but the log file shows the error appearing right away (right after the reference files are merged).

Yes, these reads were merged with bbmerge.

I was under the impression that bbsplit wanted the reads as FASTA files, but I will try with the FASTQ files!

Thank you, and I will let you know how it goes!
Old 02-11-2019, 02:20 PM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,884
Default

BBSplit will take fastq reads and bin them. Let me know how that works.

You can convert the merged fastq reads afterwards with
Code:
reformat.sh in=merged.fq.gz out=merged.fa
No paste/cut/sed needed :-)
Old 02-12-2019, 06:58 AM   #5
PinkTips
Junior Member
 
Location: Athens, GA

Join Date: Feb 2019
Posts: 7
Default

Unfortunately, I get the same assertion error as before when I use my FASTQ file.

The first few sequences of the FASTQ file:
Code:
@NB502039:96:HGLYGBGX3:1:11101:19340:2795 1:N:0:AGTTCC
GTCCTCTTCCGGGGTCTGGGTGCCAAGGCCCATCGCCTGCAGACCTTCGTTCAGCGGGGTGTACACGGGGCCTTCGAATGCGCCATCGATGACCACGGTCGTCTTGTCATACTCGTTGCCGAAGTTCGCCATTTCGATCTGCAGCGGCTCCAGATCCAGCGTGGTGTAGTCGATGTCCACACGGCTGGGGGGGGGCACGCCGCCGGTGACGAGCCTGTAGGTCTGGCACTCCCC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEAEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEJJJJHJHJHJJJJJJJJJJJDJGFJJJJJJHJJJJJJJJHJ?JHJJJJJJJHJJ7JHJJHJJJJJHJEEEEEEEEEEE/EEAE/EEAA/E/E/EEEEEEEEE/AEEEE/EEEEEAEE/EEEEEAEEEE/EAEEAAEEE/AE/EEEAAAAAA
@NB502039:96:HGLYGBGX3:1:11101:23904:2797 1:N:0:AGTTCC
CCGCCTTCAACGCCAAGAGCGCGAATTATGCGTATAGATGCACTTCTAAGCATCATGAGTTCTCTATCAGAAAGTGTTTGCGCAGGAGCTGCAACTATACTGTCACCTGTATGAACACCAACAGGGTCAAGGTTTTCCATCGAACAAATCGTAATGCAGTTATCTGCG
+
AAAAAEEEEEEEEEEEEEEJJJJJJJHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJEEEEEEEEEEEEAAAAA
@NB502039:96:HGLYGBGX3:1:11101:16907:2810 1:N:0:AGTTCC
GGCACCGAACGCCTTGGCAGCCAAAGCCATAGCCGGCACGAACTGACGGTCGCCGACCGTCTTGCCGCCGCCCGCTCCGGGACGCTGCACCGAGTGGGTACAGTCCATTATCACGCGTGGCGTTATCTGCTTCATATCGGGAATATTGCGGAAATCAACCACCAAGTTATTGTACCCGAAGCTGTTGCCTCGCTCTATCAACCACACGTTTTCGTTACCGCTCTCGCGCACTTTCTGCACGG
+
AAAAAEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEAEEEJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJHJJJJJJJJJJJJJJJJJHJJJJJJJJEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA
@NB502039:96:HGLYGBGX3:1:11101:20216:2823 1:N:0:AGTTCC
TAAAGGCAAATGGCTCTATCATGAAATCCTGGAGCCGGGCGTGTTGGTGCATGTTTCTGAGAGCGGTGCCAAAGTATGGACCGTTCGCTGTGGTTCCCCCCGTCTGGTAACGGTCAATTATGTTCGCG
+
DFDDJJH77HJHJHJHJHJHJJJHJHJHJJJJJJH77JHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJ

Last edited by GenoMax; Today at 11:51 AM.
Old 02-12-2019, 09:09 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,884
Default

Can you validate your fastq files to make sure there are no errors in the file?

Use validateFiles from Kent Utilities (UCSC). Linux version linked. After downloading, add execute permission (chmod a+x validateFiles) before running.

Code:
validateFiles -type=fastq file1.gz file2.gz etc
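
If validateFiles is not at hand, a much rougher structural check can be sketched with awk. This is not a substitute for validateFiles and is not part of BBTools or the Kent Utilities; the function name `check_fastq` is hypothetical, and it assumes an uncompressed file with exactly four lines per record.

```shell
#!/usr/bin/env bash
# Rough structural check for an uncompressed, 4-line-per-record FASTQ file.
# Prints the first problem found and returns non-zero; returns 0 if none is found.
check_fastq() {
  awk '
    NR % 4 == 1 && substr($0, 1, 1) != "@" { print "line " NR ": header does not start with @"; bad = 1; exit }
    NR % 4 == 2 { seqlen = length($0) }
    NR % 4 == 3 && substr($0, 1, 1) != "+" { print "line " NR ": separator does not start with +"; bad = 1; exit }
    NR % 4 == 0 && length($0) != seqlen { print "line " NR ": quality length differs from sequence length"; bad = 1; exit }
    END {
      if (!bad && NR % 4 != 0) { print "line count " NR " is not a multiple of 4"; bad = 1 }
      exit bad + 0
    }
  ' "$1"
}

# Demo on one invented record.
demo="$(mktemp)"
printf '@r1\nACGT\n+\nIIII\n' > "$demo"
check_fastq "$demo" && echo "no structural problems found"
rm -f "$demo"
```

It only looks at line structure, so a file that passes can still be rejected by a stricter validator.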
Old 02-12-2019, 10:31 AM   #7
PinkTips
Junior Member
 
Location: Athens, GA

Join Date: Feb 2019
Posts: 7
Default

I used
Code:
validateFiles -type=fastq QualTrimmed_bran11.fastq
and the output was
Quote:
Error count 0
When I used
Code:
validateFiles -type=fastq QualTrimmed_bran11.fasta
the output was
Quote:
Error [file=QualTrimmed_bran11.fastq, line=1]: sequence name first char invalid (got '@', wanted '>') [@NB502039:96:HGLYGBGX3:1:11101:19340:2795 1:N:0:AGTTCC]
Aborting .. found 1 error
So I figured it was working properly.

(I am using v38.22)

Last edited by PinkTips; 02-12-2019 at 10:31 AM. Reason: forgot to add
Old 02-12-2019, 11:40 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,884
Default

I am wondering if the error you are seeing is a red herring. How much memory are you allocating to this job on the cluster? (bbsplit can need a lot of memory, depending on the size of the reference genomes.)

You should explicitly add the "-Xmx20g" flag (this is 20 GB, just an example) to your bbsplit command. Make sure you request the same amount of memory on the cluster side.
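
One way to keep the two memory numbers in step is to derive the -Xmx value from the amount requested from the scheduler. The sketch below is only an illustration: the helper name `heap_flag`, the 85% fraction, and the file names are assumptions, not anything prescribed in this thread. It prints the command instead of running it, so it works without BBTools installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose a JVM heap size a little below the scheduler request.
# The 85% fraction is an illustrative assumption, not a BBTools requirement.
heap_flag() {
  local requested_gb="$1"
  echo "-Xmx$(( requested_gb * 85 / 100 ))g"
}

requested_gb=100   # should equal what the job script asks the scheduler for
xmx="$(heap_flag "$requested_gb")"

# Placeholder file names; only the in=, ref=, basename= and outu= parameters
# are taken from the commands posted in this thread.
echo "bbsplit.sh $xmx in=reads.fastq ref=host.fasta,symbiont.fasta basename=out_%.fastq outu=clean.fastq"
```

Asking for a heap slightly smaller than the job allocation is a design choice here, since a JVM uses some memory outside the heap; using exactly the same number on both sides, as suggested above, is the simpler rule.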

On a side note:

Code:
validateFiles -type=fastq QualTrimmed_bran11.fasta
generated an error since you need to change the type to fasta to match. So try

Code:
validateFiles -type=fasta QualTrimmed_bran11.fasta
Old 02-12-2019, 12:01 PM   #9
PinkTips
Junior Member
 
Location: Athens, GA

Join Date: Feb 2019
Posts: 7
Default

I was only allotting 2GB from the cluster side, likely not enough for BBSplit to do its thing!
I've tried again with "-Xmx200gb" and will let you know how it goes!

Thanks - I never would have gotten that from java's error message!
Old Today, 10:56 AM   #10
PinkTips
Junior Member
 
Location: Athens, GA

Join Date: Feb 2019
Posts: 7
Default

Hi, I'm back again -- with the same assertion error at the same step.

I allotted 100 GB (from both the BBSplit side and the cluster side) and still get the same assertion error as before. If it's helpful, this is the output from the cluster after my job is run:
Quote:
java -Djava.library.path=/home/hd55218/BBSplit/bbmap/jni/ -ea -Xmx48g -cp /home/hd55218/BBSplit/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_plasmid.fasta basename=out_%.fasta outu=/home/hd55218/BBSplit/cleaned_bran11.fastq -Xmx48g
Executing align2.BBSplitter [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq, ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_plasmid.fasta, basename=out_%.fasta, outu=/home/hd55218/BBSplit/cleaned_bran11.fastq, -Xmx48g]

Converted arguments to [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq, basename=out_%.fasta, outu=/home/hd55218/BBSplit/cleaned_bran11.fastq, ref_p.americana_genome=/home/hd55218/BBSplit/p.americana_genome.fasta, ref_Blattabacterium_genome=/home/hd55218/BBSplit/Blattabacterium_genome.fasta, ref_Blattabacterium_plasmid=/home/hd55218/BBSplit/Blattabacterium_plasmid.fasta]
Creating merged reference file ref/genome/1/merged_ref_3113916972846229527.fa.gz
Ref merge time: 140.410 seconds.
Exception in thread "main" java.lang.AssertionError: Invalid input file: '/home/hd55218/BBSplit/QualTrimmed_bran11.fastq'
at align2.AbstractMapper.preparse0(AbstractMapper.java:821)
at align2.AbstractMapper.<init>(AbstractMapper.java:53)
at align2.BBMap.<init>(BBMap.java:43)
at align2.BBMap.main(BBMap.java:31)
at align2.BBSplitter.main(BBSplitter.java:47)
Below is the bbsplit command I used:
Quote:
/home/hd55218/BBSplit/bbmap/bbsplit.sh in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta basename=out_%.fasta outu=/home/hd55218/BBSplit/cleaned_bran11.fastq -Xmx100g
My reference genomes are 3.43GB, 646KB, and 4KB.

Thanks for helping me work through this!
Old Today, 11:07 AM   #11
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,884
Default

Now we have a different error.

Quote:
Invalid input file: '/home/hd55218/BBSplit/QualTrimmed_bran11.fastq'
Is that file actually in fastq format and is it in that location?

Post the output from
Code:
head -4 /home/hd55218/BBSplit/QualTrimmed_bran11.fastq
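
The "is it in that location?" part can also be checked directly from the shell. File names on Linux are case-sensitive, and this thread spells the file both QualTrimmed_bran11 and Qualtrimmed_bran11, so the sketch below also lists names in the same directory that differ only by letter case. The function name `check_input` is hypothetical, and whether any of this is related to the assertion error is not established here.

```shell
#!/usr/bin/env bash
# Check that an input path exists, is readable, and is not empty.
# If it is missing, list entries in the same directory whose names
# match when letter case is ignored.
check_input() {
  local path="$1"
  if [ ! -e "$path" ]; then
    echo "missing: $path"
    find "$(dirname "$path")" -maxdepth 1 -iname "$(basename "$path")"
    return 1
  fi
  [ -r "$path" ] || { echo "not readable: $path"; return 1; }
  [ -s "$path" ] || { echo "empty: $path"; return 1; }
  echo "ok: $path"
}

# Demo in a temporary directory with an invented file.
dir="$(mktemp -d)"
printf '@r1\nACGT\n+\nIIII\n' > "$dir/reads_demo.fastq"
check_input "$dir/reads_demo.fastq"
check_input "$dir/no_such_file.fastq" || echo "second path failed the check, as expected"
rm -rf "$dir"
```

Running it on the exact string passed to in= tests the same path the mapper is given.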
Old Today, 11:15 AM   #12
PinkTips
Junior Member
 
Location: Athens, GA

Join Date: Feb 2019
Posts: 7
Default

The output from
Quote:
"head -4 /home/hd55218/BBSplit/Qualtrimmed_bran11.fastq"
is:
Code:
@NB502039:96:HGLYGBGX3:1:11101:19340:2795 1:N:0:AGTTCC
GTCCTCTTCCGGGGTCTGGGTGCCAAGGCCCATCGCCTGCAGACCTTCGTTCAGCGGGGTGTACACGGGGCCTTCGAATGCGCCATCGATGACCACGGTCGTCTTGTCATACTCGTTGCCGAAGTTCGCCATTTCGATCTGCAGCGGCTCCAGATCCAGCGTGGTGTAGTCGATGTCCACACGGCTGGGGGGGGGCACGCCGCCGGTGACGAGCCTGTAGGTCTGGCACTCCCC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEAEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEJJJJHJHJHJJJJJJJJJJJDJGFJJJJJJHJJJJJJJJHJ?JHJJJJJJJHJJ7JHJJHJJJJJHJEEEEEEEEEEE/EEAE/EEAA/E/E/EEEEEEEEE/AEEEE/EEEEEAEE/EEEEEAEEEE/EAEEAAEEE/AE/EEEAAAAAA

Last edited by GenoMax; Today at 11:23 AM.
Old Today, 11:27 AM   #13
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,884
Default

I don't think you can split to a fasta format file directly. Can you try the following?

Code:
/home/hd55218/BBSplit/bbmap/bbsplit.sh -Xmx100g threads=2 in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta basename=out_%.fastq outu=/home/hd55218/BBSplit/cleaned_bran11.fastq
Reads that do not match either of the two genomes will end up in the "cleaned_bran11.fastq" file. Just making sure that is what you want.
Tags
bbsplit, invalid file format
