Seqanswers Leaderboard Ad

**Brian Bushnell** · 09-12-2014, 10:10 AM

How long are the sequences, and do you expect 100% identity, or what? Do you already know the sequences you are expecting?

**lyw1** · 09-12-2014, 10:33 AM

30 to 90 bp.

Yes, I expect 100% identity and know the sequences.

**Brian Bushnell** · 09-12-2014, 10:38 AM

In that case, just map the reads to a reference fasta file you make of the sequences. For example, with BBMap, you can output a file containing the counts of reads that mapped to each reference sequence:

bbmap.sh -Xmx2g ref=sequences.fasta in=reads.fastq scafstats=stats.txt

What kind of sequencing are you doing? You may need to adapter-trim the input reads first, for example, if you have a fixed read length with NGS reads.

**lyw1** · 09-12-2014, 10:44 AM

Thanks a lot. I use MiSeq.

**Brian Bushnell** · 09-12-2014, 10:52 AM

In that case, you will need to either trim the adapters first, or use a different approach such as matching substrings, if the reference sequences are sufficiently different.

You can trim adapters like this:

bbduk.sh -Xmx1g in=reads.fq out=clean.fq ktrim=r ref=nextera.fa.gz,truseq.fa.gz k=25 mink=12 hdist=1

...where those nextera and truseq adapter files are included with the BBTools package (which contains BBMap and BBDuk) in the /resources/ directory.

Alternately, you can generate a report directly from BBDuk like this:

bbduk.sh -Xmx1g in=reads.fastq ref=sequences.fasta stats=stats.txt k=29 fbm

...if the reference sequences are sufficiently different that they do not share any 29-mers.

**lyw1** · 09-12-2014, 10:55 AM

If I add a few samples that include inserts with 30 bp unknown sequences, is it possible for me to count reads for 50 sequences that appear most frequently?

**Brian Bushnell** · 09-12-2014, 11:02 AM

The above tools require known sequences. If some are unknown, you could assemble the adapter-trimmed reads (with, say, Velvet) which should result in a fasta file containing one copy of each sequence present in your library, then map to that assembly to get the counts, as in post 4.

**lyw1** · 09-12-2014, 11:13 AM

Brian, it is very helpful. Thanks again.

**Brian Bushnell** · 09-12-2014, 11:44 AM

Oh, I should probably add, this may be difficult to assemble since the coverage is likely to be extremely high. You may need to normalize first. Also, what length are the reads, and are they paired or single?

**lyw1** · 09-12-2014, 12:35 PM

The reads are from both sides and 30 for each side.

**Brian Bushnell** · 09-12-2014, 03:09 PM

In that case, there's an easier solution, if all of the reads of the same product are expected to start at the same location.

kmercountexact.sh in=reads.fq outk=counts.txt mincount=4 k=30

That will produce a file "counts.txt" containing the count of every 30bp string in the file, excluding strings present fewer than 4 times.

**lyw1** · 09-16-2014, 02:58 PM

It works good

until

zheng@zheng-XPS-8500:~/Desktop/bbmap/20140916ngs$ kmercountexact.sh in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16
java -ea -Xmx11628m -cp /home/zheng/Desktop/bbmap/current/ jgi.CountKmersExact in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16
Executing jgi.CountKmersExact [in=ERa-05RNS160m_S6_L001_R2_001.fastq, outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt, mincount=3, k=16]

ways=37
Initial:
Memory: free=187m, used=64m

Final:
Memory: free=125m, used=193m

Input: 396838 reads 11111464 bases.
Result: 0 reads (0.00%) 0 bases (0.00%)
Unique Kmers: 4345841
Exception in thread "main" java.lang.AssertionError: ACCTTTTACCTAGTTT 3

at kmer.KmerNode.dumpKmersAsText(KmerNode.java:155)
at kmer.HashForest.dumpKmersAsText(HashForest.java:211)
at kmer.HashArray.dumpKmersAsText(HashArray.java:249)
at jgi.CountKmersExact.dumpKmersAsText(CountKmersExact.java:749)
at jgi.CountKmersExact.process(CountKmersExact.java:383)
at jgi.CountKmersExact.main(CountKmersExact.java:58)

Here I tried k=25, 20, 16. It works for k=25 or 20,

Some sequences at both ends may show "N". Do this command search short sequence from different start positions? Or always start from a fixed position? thanks,

Zheng

**Brian Bushnell** · 09-16-2014, 03:28 PM

Oh, sorry about that; I left a debug assertion in. I will remove that from the next version. Until then, add the "-da" flag to your command and it will work correctly.

Any subsequence containing an 'N' will be ignored. Otherwise, it looks at all valid subsequences. So, for example:

NNTTANGNNCT

for K=2 would produce:
TT 1
TA 1
CT 1

but nothing with the G that is surrounded by Ns. Thus, the reason I suggest K=30 for length-30 reads is that you will produce exactly one substring per read, while K=20 would produce 11 substrings per read. But if a 30bp read contains even a single N, it will yield no substrings.

**lyw1** · 09-16-2014, 04:20 PM

Thanks.

Seems still have a problem after with flag -da.

zheng@zheng-XPS-8500:~/Desktop/bbmap/20140916ngs$ kmercountexact.sh in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16 -da
java -da -Xmx11773m -cp /home/zheng/Desktop/bbmap/current/ jgi.CountKmersExact in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16 -da
Executing jgi.CountKmersExact [in=ERa-05RNS160m_S6_L001_R2_001.fastq, outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt, mincount=3, k=16, -da]

ways=37
Exception in thread "main" java.lang.RuntimeException: Output file ERa-05RNS160m_S6_L001_R2_001_counts16.txt already exists, and overwrite=false
at jgi.CountKmersExact.<init>(CountKmersExact.java:335)
at jgi.CountKmersExact.main(CountKmersExact.java:55)

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 27 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 27 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

tool to find how many counts for each sequence

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News