Seqanswers Leaderboard Ad

**dho** · 10-01-2017, 10:36 AM

Hi Brian,

I figured out the problem and a workaround.The problem is that even with your recommended settings reads that extended beyond the ends of both short and long reference sequences would only be mapped to the longer reference sequences.

The workaround is to avoid having longer and shorter reference sequences as mapping targets at the same time. I subdivided by reference sequences into multiple reference sequences that each contain sequences of the same size and then map against each of these individually. I can then merge the output from all of these mappings.

Thanks for your help this week and I hope my solution helps others who encounter the same issue!

dave

**Brian Bushnell** · 10-03-2017, 02:41 PM

Originally posted by dho View Post

Hi Brian,

I figured out the problem and a workaround.The problem is that even with your recommended settings reads that extended beyond the ends of both short and long reference sequences would only be mapped to the longer reference sequences.

The workaround is to avoid having longer and shorter reference sequences as mapping targets at the same time. I subdivided by reference sequences into multiple reference sequences that each contain sequences of the same size and then map against each of these individually. I can then merge the output from all of these mappings.

Thanks for your help this week and I hope my solution helps others who encounter the same issue!

dave

Hi Dave,

Thanks for the followup. I apologize for not getting back to you in a timely fashion, I'm pretty swamped currently!

**dho** · 10-04-2017, 04:33 PM

Hi Brian,

No problem! Thanks for all that you do.

For posterity, the solution I proposed above worked for the most part, but I encountered a few cases where it didn't. Since then, I've decided to be fully exhaustive and map against each reference sequence individually. This is computationally intensive but gives the most robust results. Even with these settings, though, reads mapping to very small reference sequences (<50bp) doesn't seem to work as consistently as I'd like, though I haven't figured out the source of this inconsistency.

--dave

**gokhulkrishnakilaru** · 10-10-2017, 08:21 AM

Adding Unique Identifier To Paired End Reads. Editing FASTQ Read ID based on randomer

Hi Brian-

We have a paired end rnaseq data.

For every sequence in read2, we want to extract the first 6nucleotides and append them to the read id in read2 and also to the respective read id in read1.

Please see example below.

Code:

[I]cat read1[/I]
@NB501293:231:HV3CTBGX2:1:11101:2280:1047 1:N:0:CAGATC
CCGCANGTTGCAGAGCGGGTGGGAGCCNCTNCGGGCGCGGCACTGNAGCCCTGANACTGAACCCCGAACCCGAGCC
+
AAAAA#EEEEEEE/EEEEEEEEE6E//#<E#/EEE/E/EAEAEEE#/EA/AEEE#/EAAAEE/EEEAEEE/EE//6
@NB501293:231:HV3CTBGX2:1:11101:9866:1047 1:N:0:CAGATC
CTCAANGGGAGAGACCTTAGATGATACNCANGATGACAGTAGGTANAGGGAACTTATAGAGCCACCTCCATCAGGA
+
AAAAA#EEEEEEEEEAEEEEEEEEEEE#EE#EEEEEEEEEEEEEE#EEEEEEEEEEEEEEEEEAEEEEEEEEEEEE
@NB501293:231:HV3CTBGX2:1:11101:24301:1048 1:N:0:CAGATC
ACGGANCTCTGGCTGTTGTATGGAAAGNTANGCTGTAACACGCACNGACAGAAGAGAGCCATTTTCTCCCTGAACT
+
AA/AA#EEEEEAEAEAEEEEE/EEEEE#E<#6EEAEEAAAE///E#EEAEE//A/</EE</EE//E6E6EEEEEE6
@NB501293:231:HV3CTBGX2:1:11101:16754:1048 1:N:0:CAGATC
CCTGGNAGCCGCCGCAAGCGCCGGACCNCANGCACTCCCAGGCGCGCGCGCTTCTTCTGCAAAAAGTTGAGGGCTC
+
AAAAA#EEAEEEEEE6EEEEEEEEAE/#EE#EE/EE//E/EEAEEEEEEEEAEEEEEEEE/EEEEEEEEEEEEE/E


[I]cat read2[/I]
@NB501293:231:HV3CTBGX2:1:11101:2280:1047 2:N:0:CAGATC
GCTGGGCGAGTAGCTTCTGGATCCTGGCCTCCTGAGCCTGTGGCCCGGGCTAGGCTCGGGGCTCGGGTTCGGGGTT
+
AAAAAEEEEEEE/AEEAEEEEEEEE/AEEEAAAE/EEEEEAEAEEEE6EEEAEEEEEEEEEEE/EAEAAE6EEE<A
@NB501293:231:HV3CTBGX2:1:11101:9866:1047 2:N:0:CAGATC
TGACTGTGGGGTGGCAACCCCATTCCTCACTTGATGTCCTGTCTTCCTGATGGAGGTGGCTCTATAAGTTCCCTCT
+
AAAAAEEEEEEEEEE/AEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE
@NB501293:231:HV3CTBGX2:1:11101:24301:1048 2:N:0:CAGATC
TCGCATCATCCTACACAGCACGGACGCTAGATGACAGGACGTGCCATGACAGTCTAAGTTCAGGGAGAAAATGGCT
+
/AAAAEE//EEEEEEEEEEAEAE/EEEEEE/EA/EEA/AEEEEEEE/E/EEA/EEEEE/EEEEEEEEAEEEEA/EE
@NB501293:231:HV3CTBGX2:1:11101:16754:1048 2:N:0:CAGATC
TGACCAGCCATTGGCTGGTGGGAGTAGTGATGTCACCCATATGACACCCTGATAACGAGTTGAGAGAGAGCCCTCA
+
AA/AAEEEEEAEEEEEEEEEAEEEEEEEEAEEEEEA/EEEEE</</EAEEEEE<EEEAEEEEEEEEAA/EE<6/EE

Desired Output

Code:

[I]cat read1[/I]
@[COLOR="Red"]GCTGGG[/COLOR]_NB501293:231:HV3CTBGX2:1:11101:2280:1047 1:N:0:CAGATC
CCGCANGTTGCAGAGCGGGTGGGAGCCNCTNCGGGCGCGGCACTGNAGCCCTGANACTGAACCCCGAACCCGAGCC
+
AAAAA#EEEEEEE/EEEEEEEEE6E//#<E#/EEE/E/EAEAEEE#/EA/AEEE#/EAAAEE/EEEAEEE/EE//6
@TGACTG_NB501293:231:HV3CTBGX2:1:11101:9866:1047 1:N:0:CAGATC
CTCAANGGGAGAGACCTTAGATGATACNCANGATGACAGTAGGTANAGGGAACTTATAGAGCCACCTCCATCAGGA
+
AAAAA#EEEEEEEEEAEEEEEEEEEEE#EE#EEEEEEEEEEEEEE#EEEEEEEEEEEEEEEEEAEEEEEEEEEEEE
@TCGCAT_NB501293:231:HV3CTBGX2:1:11101:24301:1048 1:N:0:CAGATC
ACGGANCTCTGGCTGTTGTATGGAAAGNTANGCTGTAACACGCACNGACAGAAGAGAGCCATTTTCTCCCTGAACT
+
AA/AA#EEEEEAEAEAEEEEE/EEEEE#E<#6EEAEEAAAE///E#EEAEE//A/</EE</EE//E6E6EEEEEE6
@TGACCA_NB501293:231:HV3CTBGX2:1:11101:16754:1048 1:N:0:CAGATC
CCTGGNAGCCGCCGCAAGCGCCGGACCNCANGCACTCCCAGGCGCGCGCGCTTCTTCTGCAAAAAGTTGAGGGCTC
+
AAAAA#EEAEEEEEE6EEEEEEEEAE/#EE#EE/EE//E/EEAEEEEEEEEAEEEEEEEE/EEEEEEEEEEEEE/E


[I]cat read2[/I]
@[COLOR="Red"]GCTGGG[/COLOR]_NB501293:231:HV3CTBGX2:1:11101:2280:1047 2:N:0:CAGATC
GCTGGGCGAGTAGCTTCTGGATCCTGGCCTCCTGAGCCTGTGGCCCGGGCTAGGCTCGGGGCTCGGGTTCGGGGTT
+
AAAAAEEEEEEE/AEEAEEEEEEEE/AEEEAAAE/EEEEEAEAEEEE6EEEAEEEEEEEEEEE/EAEAAE6EEE<A
@TGACTG_NB501293:231:HV3CTBGX2:1:11101:9866:1047 2:N:0:CAGATC
TGACTGTGGGGTGGCAACCCCATTCCTCACTTGATGTCCTGTCTTCCTGATGGAGGTGGCTCTATAAGTTCCCTCT
+
AAAAAEEEEEEEEEE/AEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE
@TCGCAT_NB501293:231:HV3CTBGX2:1:11101:24301:1048 2:N:0:CAGATC
TCGCATCATCCTACACAGCACGGACGCTAGATGACAGGACGTGCCATGACAGTCTAAGTTCAGGGAGAAAATGGCT
+
/AAAAEE//EEEEEEEEEEAEAE/EEEEEE/EA/EEA/AEEEEEEE/E/EEA/EEEEE/EEEEEEEEAEEEEA/EE
@TGACCA_NB501293:231:HV3CTBGX2:1:11101:16754:1048 2:N:0:CAGATC
TGACCAGCCATTGGCTGGTGGGAGTAGTGATGTCACCCATATGACACCCTGATAACGAGTTGAGAGAGAGCCCTCA
+
AA/AAEEEEEAEEEEEEEEEAEEEEEEEEAEEEEEA/EEEEE</</EAEEEEE<EEEAEEEEEEEEAA/EE<6/EE

Do you know if the BB suite can achieve this? I tried fastx toolkit but it prints the full sequence instead of a partial sequence.

**santiagorevale** · 10-23-2017, 05:43 AM

Hi Brian,

I'm using "filterbyname.sh" script from bbmap v37.60 (using Java 1.8.0_102) to extract some reads from a FastQ file given a list of IDs.

The current FastQ file has 196 Mi reads and I want to keep 85 Mi. Uncompressed FastQ file size is 14G while compressed is only 1.4G. IDs file is 3.1G.

When running the script using 24G of RAM it dies with OutOfMemoryError. Isn't it an excessive use of memory for just filtering a FastQ file? Also, among the script arguments the is no "threads" option, however the script is using all available cores. Any way of limiting both memory as well as threads usage?

Here is the error:

java -ea -Xmx24G -cp /software/bbmap-37.60/current/ driver.FilterReadsByName -Xmx24G include=t in=Sample1.I1.fastq.gz out=filtered.Sample1.I1.fastq.gz names=reads.ids
Executing driver.FilterReadsByName [-Xmx24G, include=t, in=Sample1.I1.fastq.gz, out=filtered.Sample1.I1.fastq.gz, names=reads.ids]

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.String.toLowerCase(String.java:2647)
at java.lang.String.toLowerCase(String.java:2670)
at driver.FilterReadsByName.<init>(FilterReadsByName.java:145)
at driver.FilterReadsByName.main(FilterReadsByName.java:40)

Thank you very much in advance.

Best regards,
Santiago

**bio_d** · 10-23-2017, 11:36 AM

problem using bbmap

Hi Brian,

I am trying a Denovo assembly of a reptilian genome (size comparable to human genome) and have the following Illumina sequence libraries (paired-end, mate-pair and long mate-pair with estimated insert sizes 200bp, 5.2kb and 10kb respectively.).

After quality and adapter trimming (using FASTQc toolkit), I have used bbmap.sh to align the reads to a reference (assembly of contigs from CLC workbench). However, I am unable to generate bam file (using samtools) from the sam file generated using bbmap. I have used "rcomp=t" and "rcs=f" flag for both the mate pair and long mate pair libraries. I had earlier tried with "rcomp=t" flag only but that didn't help either.

Please help.

Best,
D

SNIPPET submit script:

srun ./bbmap/bbmap.sh rcomp=t rcs=f in1=mate_pair_1.fq.gz in2=mate_pair_2.fq.gz outu=mate_pair_u.sam outm=mate_pair_m.sam
srun ./samtools-1.6/samtools view -b -o mate_pair_m.bam mate_pair_m.sam
srun ./samtools-1.6/samtools view -b -o mate_pair_u.bam mate_pair_u.sam

snippet Error:

[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1] Parse error at line 2051364
[main_samview] truncated file.
[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1] Parse error at line 2051364
[main_samview] truncated file.
srun: error: cluster01: task 0: Exited with exit code 1
srun: error: cluster01: task 1: Exited with exit code 1

**GenoMax** · 10-23-2017, 01:03 PM

Can you check what the lines in question look like? Use "zcat mate_pair_1.fq.gz | sed -n '2051364,2051366p' filename" to extract those lines.

**bio_d** · 10-23-2017, 01:17 PM

Hi,

I got this with the command you suggested.

BBBCCGC>DE1>FC1FCCFFGEEEBB1FGGDDCD111CF@EG1FBFFFDC>C><F>CC1:FF:11119:1111B1111E@>GC>G1=:11:=<CFD<1F:0=:F@0BFCG>>0FFB>F000;00F
@HWI-D00294:282:CAB4VANXX:7:1101:13889:60146 1:N:0:GCCAAT
CCAATGGGGAATGGCGAAAGCACTGCTCAGCATTTCTGGCTCTGCCTGAGGCTGGAATGCAGAAAACCCTGCAGTAGAGGGGGATCTTCTCTTTGGGGTGCTCCTCGTGCCTCCCCCTTACTGC

Best,
D

**GenoMax** · 10-23-2017, 03:19 PM

Compare the sequence and quality lengths of record that the first line belongs to, so let us try "zcat mate_pair_1.fq.gz | sed -n '2051361,2051364p" instead.

If the lengths of sequence and quality values lines are different then you have a malformed fastq record. You could delete that record from both R1/R2 files as a work around. Was this data trimmed? Do you have the original files?

**bio_d** · 10-23-2017, 04:08 PM

Hi,

zcat mate_pair_1.fq.gz | sed -n '2051361,2051364p'

TACACATCTAGATGTGTATAAGAGACAGGTAATGGGATTGCCAGGTTCCCCCTCACTTGTAGTTTTGGATTTGGATTTATTATTCTTAATGTATGTATGTAGCACCATAGCTATGTGTGCTCAGG
+
BBBCCGC>DE1>FC1FCCFFGEEEBB1FGGDDCD111CF@EG1FBFFFDC>C><F>CC1:FF:11119:1111B1111E@>GC>G1=:11:=<CFD<1F:0=:F@0BFCG>>0FFB>F000;00F

I checked the lengths of the sequence and quality values it is the same 125. Yes, the fastq files used were trimmed using trim galore toolkit and quality checked using fastqc toolkit. I do have the raw data (illumina) as well.

Is it because trimming was done by the above mentioned tools and not using bbduk.sh ? Can't really understand what is going wrong.

Best,
D

**GenoMax** · 10-23-2017, 06:15 PM

Can you check the R2 file by the same method?

**bio_d** · 10-23-2017, 08:29 PM

Hi,

There seems to be nothing wrong with the second read too

zcat mate_pair_2.fq.gz | sed -n '2051361,2051364p'

@HWI-D00294:282:CAB4VANXX:7:1101:13812:601352:N:0:GCCAAT
TTGAAGCAGCAGTTCAAAAACATTGTCTCAGTCTGTCTTAATTTGGTATAATCCCCTGAATCTATTAAACCAAGACCAGCTGTCTGACATTTTTCACTATTTTCTTTTCTCCGCTTGTTCTTTTC
+
@BBB@FG1EBGG@D1FFGFGGGGGEEGGGFCDGGE=@>DF@F@1@@>FG>FG>DEG1E>1@FDGGGCGECBGEGBE>1@>GGFC=D>FGE@FFGGGG00E>F>DCGGGGGGGGGDF=@>C0B@FG

**santiagorevale** · 11-02-2017, 08:32 AM

Originally posted by santiagorevale View Post

Hi Brian,

I'm using "filterbyname.sh" script from bbmap v37.60 (using Java 1.8.0_102) to extract some reads from a FastQ file given a list of IDs.

The current FastQ file has 196 Mi reads and I want to keep 85 Mi. Uncompressed FastQ file size is 14G while compressed is only 1.4G. IDs file is 3.1G.

When running the script using 24G of RAM it dies with OutOfMemoryError. Isn't it an excessive use of memory for just filtering a FastQ file? Also, among the script arguments the is no "threads" option, however the script is using all available cores. Any way of limiting both memory as well as threads usage?

Here is the error:

java -ea -Xmx24G -cp /software/bbmap-37.60/current/ driver.FilterReadsByName -Xmx24G include=t in=Sample1.I1.fastq.gz out=filtered.Sample1.I1.fastq.gz names=reads.ids
Executing driver.FilterReadsByName [-Xmx24G, include=t, in=Sample1.I1.fastq.gz, out=filtered.Sample1.I1.fastq.gz, names=reads.ids]

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.String.toLowerCase(String.java:2647)
at java.lang.String.toLowerCase(String.java:2670)
at driver.FilterReadsByName.<init>(FilterReadsByName.java:145)
at driver.FilterReadsByName.main(FilterReadsByName.java:40)

Thank you very much in advance.

Best regards,
Santiago

Hi there,

Any hint on what I've previously asked?

Thanks!

**TomHarrop** · 11-06-2017, 08:08 PM

Hi Brian & others,

I tried to run bbnorm with a kmer size of 99, but it crashed with the following error:

Code:

Exception in thread "Thread-371" Exception in thread "Thread-357" Exception in thread "Thread-368" Exception in thread "Thread-377" Exception in thread "Thread-367" Exception in thread "Thread-362" Exception in thread "Thread-380" Exception in thread "Thread-363" Exception in thread "Thread-365" Exception in thread "Thread-364" Exception in thread "Thread-366" Exception in thread "Thread-358" Exception in thread "Thread-361" Exception in thread "Thread-360" Exception in thread "Thread-381" Exception in thread "Thread-387" Exception in thread "Thread-372" Exception in thread "Thread-399" java.lang.AssertionError: this function not tested with k>31
    at jgi.KmerNormalize.correctErrors(KmerNormalize.java:2124)
    at jgi.KmerNormalize.access$19(KmerNormalize.java:2121)
    at jgi.KmerNormalize$ProcessThread.normalizeInThread(KmerNormalize.java:3043)
    at jgi.KmerNormalize$ProcessThread.run(KmerNormalize.java:2806)

I'm wondering if I used a bad combination of parameters. Here's the call:

Code:

java -ea -Xmx132160m -Xms132160m -cp PATH/TO/bbmap/current/ jgi.KmerNormalize bits=32 in=output/trim_decon/reads.fastq.gz threads=50 out=output/k_99/norm/normalised.fastq.gz zl=9 hist=output/k_99/norm/hist_before.txt histout=output/k_99/norm/hist_after.txt target=50 min=5 prefilter ecc k=99 peaks=output/k_99/norm/peaks.txt

Otherwise, is it supported to use bbnorm with larger kmer sizes, or would you recommend estimating the target coverage for k = 99 based on the coverage at k = 31?

I've posted the log on pastebin: https://pastebin.com/jPkKagFs

Thanks again for the bbmap suite!

Tom

**boulund** · 11-07-2017, 02:54 AM

Hi, just want to make sure I'm not missing anything here, but randomreads.sh cannot produce metagenomes according to a specific profile, right? I only find information about it drawing random numbers from an exponential distribution for each reference sequence and thus produces a simulated metagenome from a set of reference sequences, which of course is awesome, but right now I would like to produce a simulated metagenome with a very specific composition.

Topics	Statistics	Last Post
TIGR Systems Offer a Compact Alternative to CRISPR for Gene Editing by seqadmin Started by seqadmin, 03-03-2025, 01:15 PM	0 responses 166 views 0 likes	Last Post by seqadmin 03-03-2025, 01:15 PM
Highlights from AGBT 2025 – Part II by seqadmin Started by seqadmin, 02-28-2025, 12:58 PM	0 responses 252 views 0 likes	Last Post by seqadmin 02-28-2025, 12:58 PM
Highlights from AGBT 2025 – Part I by seqadmin Started by seqadmin, 02-24-2025, 02:48 PM	0 responses 629 views 0 likes	Last Post by seqadmin 02-24-2025, 02:48 PM
Selecting the Right AI Model for Bioinformatics Research by seqadmin Started by seqadmin, 02-21-2025, 02:46 PM	0 responses 265 views 0 likes	Last Post by seqadmin 02-21-2025, 02:46 PM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News