SEQanswers


PinkTips 02-11-2019 06:43 AM

BBSplit assertion error: invalid fasta file
 
Good morning, BBMappers!

I have been trying to run BBSplit (on my university's computing cluster) to remove host sequences from metatranscriptome data of a gut community.

This is the command I am using:
Code:

/home/hd55218/BBSplit/bbmap/bbsplit.sh in=/home/hd55218/BBSplit/QualTrimmed_bran11.fasta ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta basename=out_%.fasta outu=/home/hd55218/BBSplit/cleaned_bran11.fasta
The error message returned after running on the cluster is:
Code:

Exception in thread "main" java.lang.AssertionError: Invalid input file: '/home/hd55218/BBSplit/QualTrimmed_bran11.fasta'
        at align2.AbstractMapper.preparse0(AbstractMapper.java:821)
        at align2.AbstractMapper.<init>(AbstractMapper.java:53)
        at align2.BBMap.<init>(BBMap.java:43)
        at align2.BBMap.main(BBMap.java:31)
        at align2.BBSplitter.main(BBSplitter.java:47)

The first four sequences in my FASTA file appear as:
Code:

>NB502039:96:HGLYGBGX3:1:11101:19340:2795 1:N:0:AGTTCC
GTCCTCTTCCGGGGTCTGGGTGCCAAGGCCCATCGCCTGCAGACCTTCGTTCAGCGGGGTGTACACGGGGCCTTCGAATGCGCCATCGATGACCACGGTCGTCTTGTCATACTCGTTGCCGAAGTTCGCCATTTCGATCTGCAGCGGCTCCAGATCCAGCGTGGTGTAGTCGATGTCCACACGGCTGGGGGGGGGCACGCCGCCGGTGACGAGCCTGTAGGTCTGGCACTCCCC
>NB502039:96:HGLYGBGX3:1:11101:23904:2797 1:N:0:AGTTCC
CCGCCTTCAACGCCAAGAGCGCGAATTATGCGTATAGATGCACTTCTAAGCATCATGAGTTCTCTATCAGAAAGTGTTTGCGCAGGAGCTGCAACTATACTGTCACCTGTATGAACACCAACAGGGTCAAGGTTTTCCATCGAACAAATCGTAATGCAGTTATCTGCG
>NB502039:96:HGLYGBGX3:1:11101:16907:2810 1:N:0:AGTTCC
GGCACCGAACGCCTTGGCAGCCAAAGCCATAGCCGGCACGAACTGACGGTCGCCGACCGTCTTGCCGCCGCCCGCTCCGGGACGCTGCACCGAGTGGGTACAGTCCATTATCACGCGTGGCGTTATCTGCTTCATATCGGGAATATTGCGGAAATCAACCACCAAGTTATTGTACCCGAAGCTGTTGCCTCGCTCTATCAACCACACGTTTTCGTTACCGCTCTCGCGCACTTTCTGCACGG
>NB502039:96:HGLYGBGX3:1:11101:20216:2823 1:N:0:AGTTCC
TAAAGGCAAATGGCTCTATCATGAAATCCTGGAGCCGGGCGTGTTGGTGCATGTTTCTGAGAGCGGTGCCAAAGTATGGACCGTTCGCTGTGGTTCCCCCCGTCTGGTAACGGTCAATTATGTTCGCG

This FASTA file was converted from a FASTQ file using:
Code:

paste - - - - < Qualtrimmed_bran11.fastq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > Qualtrimmed_bran11.fasta
I am stumped as to why my FASTA format is invalid, so any thoughts/help would be greatly appreciated! Thanks!
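
For comparison, the same FASTQ-to-FASTA conversion can be sketched as a single awk call. This is only a minimal sketch: it assumes strict four-line FASTQ records (no wrapped sequence lines), and the function name fastq_to_fasta is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert strict 4-line FASTQ on stdin to FASTA on stdout.
# Assumption: every record is exactly four lines (header, sequence, "+", quality).
fastq_to_fasta() {
    awk '
        NR % 4 == 1 { print ">" substr($0, 2) }   # header line: swap the leading "@" for ">"
        NR % 4 == 2 { print }                     # sequence line: copied unchanged
    '
}

# Usage (file names are placeholders):
#   fastq_to_fasta < reads.fastq > reads.fasta
```

Because only the first line of each record is rewritten, a quality string that happens to begin with "@" is never mistaken for a header.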

GenoMax 02-11-2019 11:26 AM

Do you get an error right away or does the program run for some time?

Are these bbmerged reads? I wonder whether you should try the binning with the original fastq data. Is that a possibility?

PinkTips 02-11-2019 11:37 AM

From what I can tell, the error shows up right away. I only get an email when my job is finished on the cluster, but the log file shows the error appearing right away (right after the reference files are merged).

Yes, these reads were merged with bbmerge.

I was under the impression that bbsplit wanted the reads as FASTA files, but I will try with the FASTQ files!

Thank you, and I will let you know how it goes!

GenoMax 02-11-2019 02:20 PM

BBSplit will take fastq reads and bin them. Let me know how that works.

You can convert the merged fastq reads afterwards with
Code:

reformat.sh in=merged.fq.gz out=merged.fa
No paste/cut/sed needed :-)

PinkTips 02-12-2019 06:58 AM

Unfortunately, I get the same assertion error as before when I use my FASTQ file.

The first few sequences of the FASTQ file:
Code:

@NB502039:96:HGLYGBGX3:1:11101:19340:2795 1:N:0:AGTTCC
GTCCTCTTCCGGGGTCTGGGTGCCAAGGCCCATCGCCTGCAGACCTTCGTTCAGCGGGGTGTACACGGGGCCTTCGAATGCGCCATCGATGACCACGGTCGTCTTGTCATACTCGTTGCCGAAGTTCGCCATTTCGATCTGCAGCGGCTCCAGATCCAGCGTGGTGTAGTCGATGTCCACACGGCTGGGGGGGGGCACGCCGCCGGTGACGAGCCTGTAGGTCTGGCACTCCCC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEAEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEJJJJHJHJHJJJJJJJJJJJDJGFJJJJJJHJJJJJJJJHJ?JHJJJJJJJHJJ7JHJJHJJJJJHJEEEEEEEEEEE/EEAE/EEAA/E/E/EEEEEEEEE/AEEEE/EEEEEAEE/EEEEEAEEEE/EAEEAAEEE/AE/EEEAAAAAA
@NB502039:96:HGLYGBGX3:1:11101:23904:2797 1:N:0:AGTTCC
CCGCCTTCAACGCCAAGAGCGCGAATTATGCGTATAGATGCACTTCTAAGCATCATGAGTTCTCTATCAGAAAGTGTTTGCGCAGGAGCTGCAACTATACTGTCACCTGTATGAACACCAACAGGGTCAAGGTTTTCCATCGAACAAATCGTAATGCAGTTATCTGCG
+
AAAAAEEEEEEEEEEEEEEJJJJJJJHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJEEEEEEEEEEEEAAAAA
@NB502039:96:HGLYGBGX3:1:11101:16907:2810 1:N:0:AGTTCC
GGCACCGAACGCCTTGGCAGCCAAAGCCATAGCCGGCACGAACTGACGGTCGCCGACCGTCTTGCCGCCGCCCGCTCCGGGACGCTGCACCGAGTGGGTACAGTCCATTATCACGCGTGGCGTTATCTGCTTCATATCGGGAATATTGCGGAAATCAACCACCAAGTTATTGTACCCGAAGCTGTTGCCTCGCTCTATCAACCACACGTTTTCGTTACCGCTCTCGCGCACTTTCTGCACGG
+
AAAAAEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEAEEEJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJHJJJJJJJJJJJJJJJJJHJJJJJJJJEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA
@NB502039:96:HGLYGBGX3:1:11101:20216:2823 1:N:0:AGTTCC
TAAAGGCAAATGGCTCTATCATGAAATCCTGGAGCCGGGCGTGTTGGTGCATGTTTCTGAGAGCGGTGCCAAAGTATGGACCGTTCGCTGTGGTTCCCCCCGTCTGGTAACGGTCAATTATGTTCGCG
+
DFDDJJH77HJHJHJHJHJHJJJHJHJHJJJJJJH77JHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJ


GenoMax 02-12-2019 09:09 AM

Can you validate your fastq files to make sure there are no errors in the file?

Use validateFiles from Kent Utilities (UCSC). Linux version linked. After downloading, add execute permissions (chmod a+x validateFiles) before running.

validateFiles -type=fastq file1.gz file2.gz etc
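
If validateFiles is not at hand, a rough structural check can be sketched with awk. This is not a replacement for validateFiles: it assumes an uncompressed file with strict four-line records, and the function name check_fastq is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: basic structural check of an uncompressed 4-line FASTQ file.
# Prints the first problem found and returns 1; prints "OK" and returns 0 otherwise.
check_fastq() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { print "line " NR ": header does not start with @"; bad = 1; exit }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { print "line " NR ": separator does not start with +"; bad = 1; exit }
        NR % 4 == 0 && length($0) != seqlen { print "line " NR ": quality and sequence lengths differ"; bad = 1; exit }
        END {
            if (!bad && NR == 0)     { print "file is empty"; bad = 1 }
            if (!bad && NR % 4 != 0) { print "file ends in the middle of a record"; bad = 1 }
            if (!bad) print "OK"
            exit bad + 0
        }
    ' "$1"
}

# Usage (file name is a placeholder):
#   check_fastq reads.fastq
```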

PinkTips 02-12-2019 10:31 AM

I used
Code:

validateFiles -type=fastq QualTrimmed_bran11.fastq
and the output was
Quote:

Error count 0
When I used
Code:

validateFiles -type=fastq QualTrimmed_bran11.fasta
the output was
Quote:

Error [file=QualTrimmed_bran11.fastq, line=1]: sequence name first char invalid (got '@', wanted '>') [@NB502039:96:HGLYGBGX3:1:11101:19340:2795 1:N:0:AGTTCC]
Aborting .. found 1 error
So I figured it was working properly.

(I am using v38.22)

GenoMax 02-12-2019 11:40 AM

I am wondering if the error you are seeing is a red herring. How much memory are you allocating to this job on the cluster? (BBSplit can need a lot of memory, depending on the size of the reference genomes.)

You should explicitly add the "-Xmx20g" flag (20 GB, just an example) to your bbsplit command. Make sure you request the same amount of memory on the cluster side.
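
One way to keep the two numbers in step is to derive the heap flag from the amount requested from the scheduler. The sketch below is only illustrative: the function name xmx_for_alloc_gb is hypothetical, and the 85% share is an assumed rule of thumb that leaves some headroom for JVM overhead, not a documented requirement. The JVM's -Xmx option takes a number followed by a single-letter unit such as m or g, as in the -Xmx20g example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive a JVM heap flag from the memory (in GB) requested
# from the cluster scheduler.
# Assumption: giving about 85% of the allocation to the heap leaves enough headroom.
xmx_for_alloc_gb() {
    local alloc_gb="$1"
    local heap_gb=$(( alloc_gb * 85 / 100 ))   # integer arithmetic, rounds down
    [ "$heap_gb" -ge 1 ] || heap_gb=1          # never go below 1 GB
    printf '%s\n' "-Xmx${heap_gb}g"
}

# Usage: pass the result to bbsplit.sh alongside the other arguments, e.g.
#   bbsplit.sh "$(xmx_for_alloc_gb 24)" in=... ref=...
```

With a 24 GB allocation this prints -Xmx20g.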

On a side note:

Code:

validateFiles -type=fastq QualTrimmed_bran11.fasta
generated an error since you need to change the type to fasta to match. So try

Code:

validateFiles -type=fasta QualTrimmed_bran11.fasta

PinkTips 02-12-2019 12:01 PM

I was only allotting 2GB from the cluster side, likely not enough for BBSplit to do its thing!
I've tried again with "-Xmx200gb" and will let you know how it goes!

Thanks - I never would have gotten that from java's error message!

PinkTips 03-21-2019 10:56 AM

Hi, I'm back again -- with the same assertion error at the same step.

I allotted 100 GB (from both the BBSplit side and the cluster side) and still get the same assertion error as before. If it's helpful, this is the output from the cluster after my job is run:
Quote:

java -Djava.library.path=/home/hd55218/BBSplit/bbmap/jni/ -ea -Xmx48g -cp /home/hd55218/BBSplit/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_plasmid.fasta basename=out_%.fasta outu=/home/hd55218/BBSplit/cleaned_bran11.fastq -Xmx48g
Executing align2.BBSplitter [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq, ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_plasmid.fasta, basename=out_%.fasta, outu=/home/hd55218/BBSplit/cleaned_bran11.fastq, -Xmx48g]

Converted arguments to [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq, basename=out_%.fasta, outu=/home/hd55218/BBSplit/cleaned_bran11.fastq, ref_p.americana_genome=/home/hd55218/BBSplit/p.americana_genome.fasta, ref_Blattabacterium_genome=/home/hd55218/BBSplit/Blattabacterium_genome.fasta, ref_Blattabacterium_plasmid=/home/hd55218/BBSplit/Blattabacterium_plasmid.fasta]
Creating merged reference file ref/genome/1/merged_ref_3113916972846229527.fa.gz
Ref merge time: 140.410 seconds.
Exception in thread "main" java.lang.AssertionError: Invalid input file: '/home/hd55218/BBSplit/QualTrimmed_bran11.fastq'
        at align2.AbstractMapper.preparse0(AbstractMapper.java:821)
        at align2.AbstractMapper.<init>(AbstractMapper.java:53)
        at align2.BBMap.<init>(BBMap.java:43)
        at align2.BBMap.main(BBMap.java:31)
        at align2.BBSplitter.main(BBSplitter.java:47)
Below is the bbsplit command I used:
Quote:

/home/hd55218/BBSplit/bbmap/bbsplit.sh in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta basename=out_%.fasta outu=/home/hd55218/BBSplit/cleaned_bran11.fastq -Xmx100g
My reference genomes are 3.43GB, 646KB, and 4KB.

Thanks for helping me work through this!

GenoMax 03-21-2019 11:07 AM

Now we have a different error.

Quote:

Invalid input file: '/home/hd55218/BBSplit/QualTrimmed_bran11.fastq'
Is that file actually in fastq format and is it in that location?

Post the output from
Code:

head -4 /home/hd55218/BBSplit/QualTrimmed_bran11.fastq
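
Both questions (right location, right format) can also be checked in one go. The sketch below is an assumption-laden helper, not part of BBTools: the name check_reads_path is hypothetical and it only works on uncompressed files. File names on Linux are case-sensitive, so it also lists names in the same directory that differ only by letter case; both "QualTrimmed" and "Qualtrimmed" appear in this thread, so that may be worth ruling out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: confirm that a reads file exists under exactly this name,
# is readable and non-empty, and starts with the character expected for its format.
# Arguments: path to the file, expected first character ("@" for FASTQ, ">" for FASTA).
check_reads_path() {
    local path="$1" want="$2" dir base first
    if [ ! -e "$path" ]; then
        echo "missing: $path"
        dir="$(dirname "$path")"
        base="$(basename "$path")"
        # Show names in the same directory that match when letter case is ignored.
        ls "$dir" 2>/dev/null | grep -i -x -F -- "$base" | sed 's/^/  similar name: /'
        return 1
    fi
    if [ ! -r "$path" ] || [ ! -s "$path" ]; then
        echo "unreadable or empty: $path"
        return 1
    fi
    first="$(head -c 1 "$path")"
    if [ "$first" != "$want" ]; then
        echo "unexpected first character '$first' (wanted '$want'): $path"
        return 1
    fi
    echo "looks fine: $path"
}

# Usage:
#   check_reads_path /home/hd55218/BBSplit/QualTrimmed_bran11.fastq "@"
```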

PinkTips 03-21-2019 11:15 AM

The output from
Quote:

"head -4 /home/hd55218/BBSplit/Qualtrimmed_bran11.fastq"
is:
Code:

@NB502039:96:HGLYGBGX3:1:11101:19340:2795 1:N:0:AGTTCC
GTCCTCTTCCGGGGTCTGGGTGCCAAGGCCCATCGCCTGCAGACCTTCGTTCAGCGGGGTGTACACGGGGCCTTCGAATGCGCCATCGATGACCACGGTCGTCTTGTCATACTCGTTGCCGAAGTTCGCCATTTCGATCTGCAGCGGCTCCAGATCCAGCGTGGTGTAGTCGATGTCCACACGGCTGGGGGGGGGCACGCCGCCGGTGACGAGCCTGTAGGTCTGGCACTCCCC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEAEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEJJJJHJHJHJJJJJJJJJJJDJGFJJJJJJHJJJJJJJJHJ?JHJJJJJJJHJJ7JHJJHJJJJJHJEEEEEEEEEEE/EEAE/EEAA/E/E/EEEEEEEEE/AEEEE/EEEEEAEE/EEEEEAEEEE/EAEEAAEEE/AE/EEEAAAAAA


GenoMax 03-21-2019 11:27 AM

I don't think you can split to a fasta format file directly. Can you try the following?

Code:

/home/hd55218/BBSplit/bbmap/bbsplit.sh -Xmx100g threads=2 in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta basename=out_%.fastq outu=/home/hd55218/BBSplit/cleaned_bran11.fastq
Reads that do not map to either genome will end up in the "cleaned_bran11.fastq" file. Just making sure that is what you want.
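
As a small convenience, the command can be assembled from variables so that each path is typed only once and can be reviewed before the job is submitted. This is just a sketch: the variable names and the bbsplit_cmd function are made up, while the paths and options are copied from the command above.

```shell
#!/usr/bin/env bash
# Sketch: build the bbsplit command from variables and print it for review.
# Paths and options are taken from the thread; adjust them to your own layout.
bbdir=/home/hd55218/BBSplit
reads="$bbdir/QualTrimmed_bran11.fastq"
refs="$bbdir/p.americana_genome.fasta,$bbdir/Blattabacterium_genome.fasta"

# Hypothetical helper: print the full command line without running it.
bbsplit_cmd() {
    echo "$bbdir/bbmap/bbsplit.sh -Xmx100g threads=2 in=$reads ref=$refs basename=out_%.fastq outu=$bbdir/cleaned_bran11.fastq"
}

# Dry run: show the command that would be submitted.
bbsplit_cmd
```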


All times are GMT -8.