SEQanswers

09-22-2021, 01:52 PM   #1
mforthman
Junior Member
 
Location: California

Join Date: Sep 2021
Posts: 1
bbmap inflating read count and not finding one sequence after header

I'm using bbmap to map transcriptome reads to a set of target loci. I'm working with 12 samples with paired-end reads and 1 sample with single-end reads, all from NCBI's SRA. bbmap reads the paired-end data without problems and completes those analyses correctly. The one sample with single-end reads is causing two issues:

1. The input fasta file has only 288915 reads, which I confirmed with grep ">" file.fasta | wc -l. However, bbmap reports "Reads used: 308655". I have no idea why the read count is inflated; again, this is not an issue with the paired-end data.

2. bbmap fails to recognize a sequence immediately after the fasta header: "Warning: A fasta header with no sequence was encountered: SRR768524.9631". The sequence in question is formatted exactly like all the others in the file, and I have checked the line endings (EOL), which are fine. Below is a snippet of the fasta file, with the 3rd sequence being the one bbmap has a problem with.

Quote:
>SRR768524.9629
CTATCAAAGGGAAATCCCGCTGGCCTGCTATCATACAGTCTTGAACCTCCACATGCAATATCAGCTGAATCTCCAACGTGGCCGCTGACAAATGGAGTAACTACGACTGCCAAAACGAAAGCGCGACCTCCTTTCCATCCCATGGGTAATTGGAGTCTTTGAGGAAATCCACATCGGCCAGCCTCTGAATAATGGAATGGTTCCTTCTGGTTGAGTGCACTGTTTATTGAAGTGTAAAGAGACCTGAATCCTTCTTGGTCATGGATAAAGAGGGGAGAATCTGTTGAACTTCTTTCCCACACATTAACACCCTCAGTCAATTTTACTGGGAAACGGTCAATTTCAAATAAGCTAGTTTTGCTTGGTCATAAGTGAGGAAATGTTCATCAGAATCATATTTCGGTCCAATGAATACTCTCACAATGGCATCATCAGCTTTT
>SRR768524.9630
ATAATGCAATTATAGATTGTTGGAGTGCAGGTAAAGCTACCACTGTTATGATTAAAGATAATCCAAAGGTTGAAATTCTTGATGTAGAAGATGTTAAGGTTGGAAAGATAAGACAATTTTGTGAGTTGGACTTGGCATTGAACATGGCCTTACGAAAGTATTTTGGTAGTGTGTTTGATAAAATGGCAGTTACATCTAATGAAACGCCGTGGAAAGTTGCTTGGAATCCATATTTTATGCCTCATCACATCGTGGCGATAGAGAACGACAAGTACGATGTCTTTTGTATAGATGTGAAAAGAATGGATAAGAATTTACCAGTCCAATTCACTGAGATATTGTGT
>SRR768524.9631
TGTTACTGGGTAGGGCTGTGGCACTGGGACCTTGACTGGATAGGGTACATGCTTCTCGACTGGTACTGGGTATGGCTTGGGAATGTGGACAGGGTATGGCCTATCGACTGGGACCTTCACGGGATAGGGAACTTTCTTCTCGATATGGACTGGATAAGGTACTGGTACCTCCTTGTGGATGGTGATAGCTTTAATGTGGCCGTGTTCCTCATGTCCTCCGAGTTCATAGCCACCTCCATATCCGTATCCTCCTAAGTCATGTCCACCTCCATATCCTCCATGTCCACCGCCGTATCCTCCAAGCTCATAACCACCTCCATATCCTAAGCCAAGAAGTCCTCGTTTCTCCTGCTTCTTGTCATCTGTTGGTGCCGCTGCTTTGTCGGTCTTGGATTCGGCCTTCTTCTCTTCAGCTGATGCTGTGGCAAGCAGTGCCAACAGCCCTACCCACAGTACCTTGGATTGCATTGTTGAGTCGTGGTGTGGTCGGCGTCTCCCAA
>SRR768524.9632
AATTCCCAACGACCAAGTATCTGAACATGAGTGGATCAATGCTGAGATCCTGCCTGCTACTGCTATTCCTAGCTTACTGTGTGTCCTGCTATAGAAGTCATGTTCCTAGAGGCGGGAGTTACTCTCTACCGCCTGGAGTTAATCCAACATTCCCAGGAAGGAACCAAGGACTGCCTCCGGCTTATCATGGAAAATTCAAGAGATCACTGGAAGGAGGTTTAGAACCTGAAGATGGTGGTGTCCTTGCAGTTGATGAACCTGCTGATTATCTGAAAGTCAAAAGGTCAGTGGAAGATGTTGAAGGTGAATTCCTTGTGAACGAAGAACCTCAAGAATTTGAGACACTGAGAGCGCGCCGTGACGTCAGAATAATTCATCCAACT
>SRR768524.9633
TTCCAAACTGTCGATTCATGATGTACACAATACCAAAAAAGGCAAATAAGAAATAAAAGT
I'm at a loss as to how to resolve these issues. I appreciate any help in getting this to run properly on this last fasta file.
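
For anyone wanting to repeat the two checks described above (the header count and the "header with no sequence" warning) outside of bbmap, here is a minimal Python sketch. It assumes a plain-text fasta; the function name scan_fasta and the demo records are made up for illustration, and it says nothing about how bbmap itself parses the file.

```python
def scan_fasta(lines):
    """Count fasta records and flag headers that are followed by no sequence.

    `lines` is any iterable of text lines (an open file handle works).
    Illustrative helper only; not part of bbmap.
    """
    records = []   # one [name, sequence_length] entry per header, in file order
    cr_lines = 0   # lines that still carry a carriage return (Windows/Mac EOL)
    for raw in lines:
        if raw.rstrip("\n").endswith("\r"):
            cr_lines += 1
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            records.append([fields[0] if fields else "", 0])
        elif records:
            # sequence line: add its length to the most recent header
            records[-1][1] += len(line)
    empty = [name for name, length in records if length == 0]
    return {"records": len(records), "empty_headers": empty, "cr_lines": cr_lines}


# Small in-memory demo with made-up records; "r2" has no sequence.
demo = [">r1\n", "ACGT\n", ">r2\n", ">r3\r\n", "GG\r\n"]
print(scan_fasta(demo))
```

To run it on a real file, pass a handle opened with open(path, newline="") so that Python does not silently translate carriage returns before they can be counted. The "records" value should match the grep count if every header starts a line with ">".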
09-23-2021, 03:09 PM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,130

Is there a specific reason you converted these reads to fasta format? If this is data from SRA then you should be able to map the fastq reads directly.

You may be getting secondary alignments, which could be why your read count seems inflated.
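
One way to check the secondary-alignment idea is to tally the records in the SAM output by flag, as in this rough Python sketch. It relies only on the SAM flag bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary); the function name and the demo lines are invented for illustration. Whether bbmap's "Reads used" figure includes secondary alignments is not established here, so this only shows what is in the output file.

```python
SECONDARY = 0x100      # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment
UNMAPPED = 0x4         # SAM flag bit: read is unmapped


def count_sam_records(lines):
    """Tally SAM alignment records by type and count distinct read names."""
    counts = {"records": 0, "primary": 0, "secondary": 0,
              "supplementary": 0, "unmapped": 0}
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2 of a SAM record is FLAG
        counts["records"] += 1
        names.add(fields[0])   # column 1 is the read name (QNAME)
        if flag & SECONDARY:
            counts["secondary"] += 1
        elif flag & SUPPLEMENTARY:
            counts["supplementary"] += 1
        else:
            counts["primary"] += 1
        if flag & UNMAPPED:
            counts["unmapped"] += 1
    counts["distinct_names"] = len(names)
    return counts


# Made-up demo: read1 has a primary and a secondary record, read2 is unmapped.
demo = [
    "@HD\tVN:1.4\n",
    "\t".join(["read1", "0", "locusA", "1", "40", "4M", "*", "0", "0", "ACGT", "*"]) + "\n",
    "\t".join(["read1", "256", "locusB", "5", "0", "4M", "*", "0", "0", "ACGT", "*"]) + "\n",
    "\t".join(["read2", "4", "*", "0", "0", "*", "*", "0", "0", "GGCC", "*"]) + "\n",
]
print(count_sam_records(demo))
```

For single-end data, "distinct_names" should equal the number of input reads if every read name is unique, while "records" grows with each secondary or supplementary line. Comparing the two against 288915 and 308655 would show whether extra alignment records account for the difference.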

Tags
bbmap, bioinformatics, fasta

All times are GMT -8.

