SEQanswers

02-02-2017, 10:45 AM   #1
dacotahm
Member
 
Location: ND, USA

Join Date: Oct 2011
Posts: 24
Cleaning up, merging de novo transcriptomes to create a quality reference

Hello,

I have about 950 million reads from an RNA-Seq data set that covers many developmental time points. Assembling all the reads together doesn't really work because I reach a point where errors are being included at a higher rate than new k-mers (or so I have been advised). Including all of the reads and digitally normalizing to 20x results in a very fragmented, low-quality assembly.

If I assemble multiple time points individually and then merge the transcriptomes, how would I select the best representative isoform from each assembly and jettison the rest to create a nice, clean final reference? What is a good method to filter out the garbage, and what is a good method to merge the assemblies, favoring more complete sequences?

To clarify merging - I'm thinking of selecting individual transcripts from multiple assemblies, not merging actual sequences together to increase length, although that would be a source of improvement.
02-02-2017, 11:15 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Generally, I would recommend assembling them all together, and using error-correction if necessary to deal with the introduction of errors. 20x is also a very low target for normalization prior to assembly; if you want to normalize, I typically recommend a target of 100x.

For optimal assembly, I recommend a bit of preprocessing first. Using BBMap, and starting with the raw reads (assuming these are 2x150bp Illumina data):

Code:
bbduk.sh in=r1.fq in2=r2.fq out=trimmed.fq minlen=90 ktrim=r k=23 mink=11 hdist=1 tbo tpe ref=adapters.fa maxns=0 qtrim=r trimq=10

bbduk.sh in=trimmed.fq out=filtered.fq ref=phix174_ill.ref.fa.gz,sequencing_artifacts.fa.gz k=31

bbmerge.sh in=filtered.fq out=ecco.fq ecco mix strict adapters=default

tadpole.sh in=ecco.fq out=ecct.fq ecc

#Normalization may or may not be helpful; it depends on the dataset and assembler.
#So, I suggest assembling both with and without to see which is better.
bbnorm.sh in=ecct.fq out=normalized.fq target=100 min=2
Then try assembling. If you assemble the libraries separately rather than together, you can use Dedupe to remove duplicate contigs:

Code:
dedupe.sh in=a.fa,b.fa,c.fa out=deduped.fa s=5
This will remove duplicate and contained sequences, allowing up to 5 substitutions.
02-07-2017, 08:54 AM   #3
dacotahm
Member
 
Location: ND, USA

Join Date: Oct 2011
Posts: 24

Thanks, I'll give that a shot and report my results when finished.
02-07-2017, 10:51 AM   #4
dacotahm
Member
 
Location: ND, USA

Join Date: Oct 2011
Posts: 24

I'm getting a weird error in bbduk.sh where it says I have unpaired reads.

Code:
BBDuk version 36.92
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: max=205801m, free=201506m, used=4295m

Added 216529 kmers; time:       0.181 seconds.
Memory: max=205801m, free=191842m, used=13959m

Input is being processed as paired
Started output streams: 0.025 seconds.
Exception in thread "Thread-11" java.lang.AssertionError:
There appear to be different numbers of reads in the paired input files.
The pairing may have been corrupted by an upstream process.  It may be fixable by running repair.sh.
        at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:479)
        at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:344)
        at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:188)
        at java.lang.Thread.run(Thread.java:745)
I ran repair.sh as recommended, and it reported that only 50% of the reads are paired. That can't be right: if I use the Khmer Toolkit to extract pairs, they're all paired. They also appear to be in the correct order. I don't have the output from repair.sh because I didn't capture it from the screen session. Advice? Should I interleave them first with the Khmer Toolkit?

I concatenate all of my gzipped reads from replicates and virtual NextSeq lanes like so:

Code:
cat left1.fastq.gz left2.fastq.gz left3.fastq.gz > left_all.fastq.gz
cat right1.fastq.gz right1.fastq.gz right1.fastq.gz > right_all.fastq.gz
If I run repair.sh on the single, un-concatenated files, 100% of the reads have pairs. Again, they appear to be in order.

Thanks

Last edited by dacotahm; 02-07-2017 at 11:04 AM.
02-07-2017, 12:12 PM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Sounds like there's something strange about the concatenated gzipped files; it might work better if you decompress them, concatenate them, and then compress them again:

Code:
zcat left1.fastq.gz left2.fastq.gz left3.fastq.gz | gzip -c > left_all.fastq.gz
There are some versions of Java that have trouble with certain concatenated gzipped files (they think the file ends at one of the concatenation boundaries), which appears to be a bug. So you can do as above, or interleave them with a different tool.
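Before rerunning BBDuk on the re-compressed files, it may be worth confirming that both mate files hold the same number of records. The following is a minimal sketch, not part of BBTools: the function names `count_fastq_records` and `check_pair_counts` are hypothetical, and it assumes standard four-line FASTQ records with no wrapped sequence lines. It decompresses with `gzip -dc`, which reads every member of a concatenated gzip file.

```shell
#!/usr/bin/env bash
# Count FASTQ records in a gzipped file.
# Assumption: each record is exactly 4 lines (no wrapped sequences).
count_fastq_records() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Compare the record counts of two mate files.
# Prints both counts and returns 0 only when they match.
check_pair_counts() {
    local left right
    left=$(count_fastq_records "$1")
    right=$(count_fastq_records "$2")
    echo "left=$left right=$right"
    [ "$left" = "$right" ]
}

# Example, using the file names from this thread:
# check_pair_counts left_all.fastq.gz right_all.fastq.gz
```

If the counts differ here, the mismatch is in the files themselves rather than in how Java reads them.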
02-21-2017, 06:41 AM   #6
dacotahm
Member
 
Location: ND, USA

Join Date: Oct 2011
Posts: 24

Here are my results, if you're interested. Dedupe.sh didn't remove as many contigs as I expected from one set, so I'm not sure how to interpret that.

I ran all of the norm and error correction steps on the reads for each sample separately. Then I assembled them separately with Trinity, concatenated the assemblies, and deduped them.

I did another run where I concatenated all the error-corrected, normalized reads and assembled them together with Trinity before deduping that single assembly. Results as follows:

Dedupe removed ~42% of the contigs from the concatenated assemblies, but about 1e6 contigs remained, which is very high. It removed 2.7% from the single assembly that contained all of the reads. This is another bee species so still high, but contains isoforms.

The third group of assemblies (below) has >1e6 contigs but length stats other than contig number and total BP are similar to other assemblies. I assume it contains a ton of duplication still, curious about how to tune dedupe.sh.

Code:
#Individual assemblies
filename 			sum n trim_n min med mean max n50 n50_len n90 n90_len
DLmRNA.fasta 		72452700 54350 54350 201 528 1333 34738 7323 2938 29606 455
OLA125mRNA.fasta 	247008720 141467 141467 201 625 1746 25949 17551 4258 69947 639
OLA14mRNA.fasta 	303516447 141371 141371 201 854 2146 27338 19963 4855 67083 991
OLA215mRNA.fasta 	266481896 147914 147914 201 665 1801 31511 18978 4226 72938 686
OLA28mRNA.fasta 	219382126 139428 139428 201 522 1573 31117 15741 4164 69889 519
OLA65mRNA.fasta 	364224298 154717 154717 201 1056 2354 29609 22538 5139 74439 1179
OLNAmRNA.fasta 		137613039 98089 98089 201 488 1402 29749 11716 3511 51283 455
PP12mRNA.fasta 		470714411 191853 191853 201 1026 2453 38729 27127 5531 90517 1207
PP15mRNA.fasta 		403006775 177736 177736 201 934 2267 31633 25075 5056 84637 1071
PP6mRNA.fasta 		189346636 134266 134266 201 494 1410 31708 15023 3658 70874 456
Pplus20mRNA.fasta 	210036076 80828 80828 201 1502 2598 56106 12951 5062 41100 1442
PPmRNA.fasta 		213918690 142302 142302 201 516 1503 35229 16062 3894 72927 496
PUmRNA.fasta 		209833062 147120 147120 201 485 1426 36445 16145 3756 76586 458

#Individual assemblies concatenated and deduped
filename 							sum n trim_n min med mean max n50 n50_len n90 n90_len
01_concatenatedAssemblies.fasta 	3307534876 1751441 1751441 201 662 1888 56106 221789 4561 837145 733
02_dedupedConcatAssemblies.fasta 	2667545297 1016986 1016986 201 1399 2622 56106 161158 5318 509036 1394

#Reads concatenated from all samples, assembled together.
filename 										sum n trim_n min med mean max n50 n50_len n90 n90_len
ReadsConcatenatedBeforeAssembly.fasta          730984902 288974 288974 201 1210 2529 43812 42367 5432 140412 1302
DeDuped_ReadsConcatenatedBeforeAssembly.fasta  712505764 281091 281091 201 1207 2534 43812 41174 5450 136631 1297
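The removal percentages quoted in this post can be recomputed from the contig counts (the `n` column) in the tables above. A small sketch, assuming a hypothetical helper named `percent_removed` that takes the counts before and after deduplication:

```shell
#!/usr/bin/env bash
# Percent of contigs removed, given counts before and after deduplication.
percent_removed() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", 100 * (before - after) / before }'
}

percent_removed 1751441 1016986   # concatenated individual assemblies -> 41.9
percent_removed 288974 281091     # single combined assembly           -> 2.7
```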
02-21-2017, 10:49 AM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Dedupe will only remove identical (aside from the specified number of mismatches/edits) or fully-contained sequences; it will not remove or combine sequences that overlap, but in which neither is fully contained by the other. For that purpose, you might try Minimus2.

It's hard to say why there are so many sequences in the deduped, concatenated assemblies compared to the combined assembly. You can certainly try increasing the allowed number of substitutions, or allow an edit distance to accommodate indels ("e=" flag rather than "s=" flag). Perhaps you can tell me...

1) What is the ploidy of these bees?
2) What is the average heterozygosity rate, or % identity between two individuals?

528 (for example) seems to be short for a median isoform length; it's not even much above your min cutoff of 201bp. How does that compare to the expected median isoform length in bees? I wonder if the RNA is too old and degraded, or if the coverage was too low for good assemblies. Do you have information about the coverage distribution of the contigs? Note that in the assembly where you concatenated all reads prior to assembly, you achieved a much higher median contig length than any of the individual assemblies, which indicates this is a coverage problem. For quantification purposes, you may want to just use that one and map to it rather than assembling the individual samples.
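The suggestion above to raise the allowed substitutions, or to switch to an edit distance, could be tried as a sweep over several thresholds. This is only a sketch: the function name `sweep_dedupe`, the `DEDUPE_CMD` override, the output naming scheme, and the input name `combined.fa` are all hypothetical; only the `in=`, `out=`, `s=`, and `e=` flags come from this thread.

```shell
#!/usr/bin/env bash
# DEDUPE_CMD is a hypothetical override so the sweep can be dry-run,
# e.g. DEDUPE_CMD="echo dedupe.sh" prints the commands instead of running them.
DEDUPE_CMD=${DEDUPE_CMD:-dedupe.sh}

# Run Dedupe once per threshold value.
#   $1 = input FASTA
#   $2 = flag name: "s" (substitutions) or "e" (edits, which allows indels)
#   remaining arguments = threshold values to try
sweep_dedupe() {
    local input=$1 flag=$2 value
    shift 2
    for value in "$@"; do
        # Left unquoted on purpose so a multi-word override still works.
        $DEDUPE_CMD in="$input" out="deduped_${flag}${value}.fa" "${flag}=${value}"
    done
}

# Example (combined.fa is a placeholder file name):
# sweep_dedupe combined.fa e 0 5 10 20
```

Comparing the contig counts of the outputs shows how sensitive the result is to the threshold; a flat response would suggest that the remaining redundancy is overlap rather than containment.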
10-12-2017, 12:00 PM   #8
dacotahm
Member
 
Location: ND, USA

Join Date: Oct 2011
Posts: 24

It's been a while, but I circled back to cleaning this transcriptome up.

I don't know the amount of genetic variation; that's something we're planning to dig into sometime next year.

Anyway, I tried to clean it up a bit. The one with the duplication is by far the most complete (~96%). I made a reciprocal-best-hit subset and used the best single hit against unique honey bee transcripts, but it is quite incomplete.

BUSCO scores:

Code:
					C	D	F	M
ApisBlastUniques			2000	444	237	438
SingleSample				2214	521	178	283
AllSamplesWithDuplication	        248	2110	216	101
I tried running dedupe.sh in a loop with substitution increments of {0..100..5} and still didn't see much change:

Code:
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211470m -Xms211470m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia0subs.fasta maxsubs=0 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia0subs.fasta, maxsubs=0, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212503m, free=209177m, used=3326m

Found 0 duplicates.
Finished exact matches.    Time: 10.134 seconds.
Memory: max=212503m, free=160369m, used=52134m

Found 62027 contained sequences.
Finished containment.      Time: 15.413 seconds.
Memory: max=212503m, free=207364m, used=5139m

Removed 62027 invalid entries.
Finished invalid removal.  Time: 0.309 seconds.
Memory: max=212503m, free=207364m, used=5139m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	62027 reads (22.07%) 	238551461 bases (33.48%)    	70938383 collisions.
Result:                 	219064 reads (77.93%) 	473954303 bases (66.52%)

Printed output.            Time: 2.464 seconds.
Memory: max=212503m, free=206615m, used=5888m

Time:   			28.337 seconds.
Reads Processed:        281k 	9.92k reads/sec
Bases Processed:        712m 	25.14m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211451m -Xms211451m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia5subs.fasta maxsubs=5 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia5subs.fasta, maxsubs=5, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212485m, free=208050m, used=4435m

Found 0 duplicates.
Finished exact matches.    Time: 11.091 seconds.
Memory: max=212485m, free=160910m, used=51575m

Found 62027 contained sequences.
Finished containment.      Time: 9.976 seconds.
Memory: max=212485m, free=157030m, used=55455m

Removed 62027 invalid entries.
Finished invalid removal.  Time: 0.193 seconds.
Memory: max=212485m, free=157030m, used=55455m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	62027 reads (22.07%) 	238551461 bases (33.48%)    	71234164 collisions.
Result:                 	219064 reads (77.93%) 	473954303 bases (66.52%)

Printed output.            Time: 6.894 seconds.
Memory: max=212485m, free=209979m, used=2506m

Time:   			28.170 seconds.
Reads Processed:        281k 	9.98k reads/sec
Bases Processed:        712m 	25.29m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211500m -Xms211500m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia10subs.fasta maxsubs=10 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia10subs.fasta, maxsubs=10, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212533m, free=208097m, used=4436m

Found 0 duplicates.
Finished exact matches.    Time: 5.853 seconds.
Memory: max=212533m, free=160949m, used=51584m

Found 62027 contained sequences.
Finished containment.      Time: 18.619 seconds.
Memory: max=212533m, free=208178m, used=4355m

Removed 62027 invalid entries.
Finished invalid removal.  Time: 0.293 seconds.
Memory: max=212533m, free=208178m, used=4355m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	62027 reads (22.07%) 	238551461 bases (33.48%)    	70560203 collisions.
Result:                 	219064 reads (77.93%) 	473954303 bases (66.52%)

Printed output.            Time: 1.997 seconds.
Memory: max=212533m, free=207070m, used=5463m

Time:   			26.780 seconds.
Reads Processed:        281k 	10.50k reads/sec
Bases Processed:        712m 	26.61m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211435m -Xms211435m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia15subs.fasta maxsubs=15 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia15subs.fasta, maxsubs=15, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212469m, free=209143m, used=3326m

Found 0 duplicates.
Finished exact matches.    Time: 6.209 seconds.
Memory: max=212469m, free=160345m, used=52124m

Found 63110 contained sequences.
Finished containment.      Time: 20.348 seconds.
Memory: max=212469m, free=209159m, used=3310m

Removed 63110 invalid entries.
Finished invalid removal.  Time: 0.207 seconds.
Memory: max=212469m, free=209159m, used=3310m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	63110 reads (22.45%) 	238796524 bases (33.52%)    	70648475 collisions.
Result:                 	217981 reads (77.55%) 	473709240 bases (66.48%)

Printed output.            Time: 1.873 seconds.
Memory: max=212469m, free=208410m, used=4059m

Time:   			28.656 seconds.
Reads Processed:        281k 	9.81k reads/sec
Bases Processed:        712m 	24.86m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211443m -Xms211443m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia20subs.fasta maxsubs=20 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia20subs.fasta, maxsubs=20, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212477m, free=208042m, used=4435m

Found 0 duplicates.
Finished exact matches.    Time: 6.880 seconds.
Memory: max=212477m, free=158134m, used=54343m

Found 64647 contained sequences.
Finished containment.      Time: 29.648 seconds.
Memory: max=212477m, free=206120m, used=6357m

Removed 64647 invalid entries.
Finished invalid removal.  Time: 0.299 seconds.
Memory: max=212477m, free=206120m, used=6357m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	64647 reads (23.00%) 	239192035 bases (33.57%)    	69566767 collisions.
Result:                 	216444 reads (77.00%) 	473313729 bases (66.43%)

Printed output.            Time: 2.137 seconds.
Memory: max=212477m, free=204999m, used=7478m

Time:   			38.978 seconds.
Reads Processed:        281k 	7.21k reads/sec
Bases Processed:        712m 	18.28m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211421m -Xms211421m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia25subs.fasta maxsubs=25 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia25subs.fasta, maxsubs=25, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212455m, free=208021m, used=4434m

Found 0 duplicates.
Finished exact matches.    Time: 6.859 seconds.
Memory: max=212455m, free=159779m, used=52676m

Found 65967 contained sequences.
Finished containment.      Time: 19.185 seconds.
Memory: max=212455m, free=207484m, used=4971m

Removed 65967 invalid entries.
Finished invalid removal.  Time: 0.243 seconds.
Memory: max=212455m, free=207484m, used=4971m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	65967 reads (23.47%) 	239568880 bases (33.62%)    	69863260 collisions.
Result:                 	215124 reads (76.53%) 	472936884 bases (66.38%)

Printed output.            Time: 2.177 seconds.
Memory: max=212455m, free=206730m, used=5725m

Time:   			28.481 seconds.
Reads Processed:        281k 	9.87k reads/sec
Bases Processed:        712m 	25.02m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211414m -Xms211414m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia30subs.fasta maxsubs=30 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia30subs.fasta, maxsubs=30, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212447m, free=208013m, used=4434m

Found 0 duplicates.
Finished exact matches.    Time: 7.152 seconds.
Memory: max=212447m, free=160883m, used=51564m

Found 67307 contained sequences.
Finished containment.      Time: 17.242 seconds.
Memory: max=212447m, free=207899m, used=4548m

Removed 67307 invalid entries.
Finished invalid removal.  Time: 0.282 seconds.
Memory: max=212447m, free=207899m, used=4548m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	67307 reads (23.94%) 	239991769 bases (33.68%)    	69772975 collisions.
Result:                 	213784 reads (76.06%) 	472513995 bases (66.32%)

Printed output.            Time: 1.984 seconds.
Memory: max=212447m, free=206597m, used=5850m

Time:   			26.677 seconds.
Reads Processed:        281k 	10.54k reads/sec
Bases Processed:        712m 	26.71m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211410m -Xms211410m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia35subs.fasta maxsubs=35 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia35subs.fasta, maxsubs=35, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212443m, free=209117m, used=3326m

Found 0 duplicates.
Finished exact matches.    Time: 8.400 seconds.
Memory: max=212443m, free=160324m, used=52119m

Found 68858 contained sequences.
Finished containment.      Time: 9.879 seconds.
Memory: max=212443m, free=156999m, used=55444m

Removed 68858 invalid entries.
Finished invalid removal.  Time: 0.436 seconds.
Memory: max=212443m, free=156999m, used=55444m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	68858 reads (24.50%) 	240523318 bases (33.76%)    	68935338 collisions.
Result:                 	212233 reads (75.50%) 	471982446 bases (66.24%)

Printed output.            Time: 6.270 seconds.
Memory: max=212443m, free=209195m, used=3248m

Time:   			25.000 seconds.
Reads Processed:        281k 	11.24k reads/sec
Bases Processed:        712m 	28.50m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211436m -Xms211436m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia40subs.fasta maxsubs=40 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia40subs.fasta, maxsubs=40, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212469m, free=208035m, used=4434m

Found 0 duplicates.
Finished exact matches.    Time: 6.676 seconds.
Memory: max=212469m, free=159237m, used=53232m

Found 70682 contained sequences.
Finished containment.      Time: 29.239 seconds.
Memory: max=212469m, free=207078m, used=5391m

Removed 70682 invalid entries.
Finished invalid removal.  Time: 0.219 seconds.
Memory: max=212469m, free=207078m, used=5391m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	70682 reads (25.15%) 	241200047 bases (33.85%)    	69348258 collisions.
Result:                 	210409 reads (74.85%) 	471305717 bases (66.15%)

Printed output.            Time: 4.554 seconds.
Memory: max=212469m, free=206327m, used=6142m

Time:   			40.702 seconds.
Reads Processed:        281k 	6.91k reads/sec
Bases Processed:        712m 	17.51m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211397m -Xms211397m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia45subs.fasta maxsubs=45 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia45subs.fasta, maxsubs=45, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212431m, free=209105m, used=3326m

Found 0 duplicates.
Finished exact matches.    Time: 6.755 seconds.
Memory: max=212431m, free=159207m, used=53224m

Found 72888 contained sequences.
Finished containment.      Time: 18.857 seconds.
Memory: max=212431m, free=206186m, used=6245m

Removed 72888 invalid entries.
Finished invalid removal.  Time: 0.251 seconds.
Memory: max=212431m, free=206186m, used=6245m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	72888 reads (25.93%) 	242085988 bases (33.98%)    	68508447 collisions.
Result:                 	208203 reads (74.07%) 	470419776 bases (66.02%)

Printed output.            Time: 2.567 seconds.
Memory: max=212431m, free=205436m, used=6995m

Time:   			28.447 seconds.
Reads Processed:        281k 	9.88k reads/sec
Bases Processed:        712m 	25.05m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211368m -Xms211368m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia50subs.fasta maxsubs=50 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia50subs.fasta, maxsubs=50, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212400m, free=207967m, used=4433m

Found 0 duplicates.
Finished exact matches.    Time: 7.608 seconds.
Memory: max=212400m, free=160293m, used=52107m

Found 75105 contained sequences.
Finished containment.      Time: 22.144 seconds.
Memory: max=212400m, free=207667m, used=4733m

Removed 75105 invalid entries.
Finished invalid removal.  Time: 0.354 seconds.
Memory: max=212400m, free=207667m, used=4733m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	75105 reads (26.72%) 	243040998 bases (34.11%)    	68422636 collisions.
Result:                 	205986 reads (73.28%) 	469464766 bases (65.89%)

Printed output.            Time: 2.299 seconds.
Memory: max=212400m, free=206555m, used=5845m

Time:   			32.416 seconds.
Reads Processed:        281k 	8.67k reads/sec
Bases Processed:        712m 	21.98m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211365m -Xms211365m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia55subs.fasta maxsubs=55 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia55subs.fasta, maxsubs=55, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212399m, free=209074m, used=3325m

Found 0 duplicates.
Finished exact matches.    Time: 7.279 seconds.
Memory: max=212399m, free=159738m, used=52661m

Found 77406 contained sequences.
Finished containment.      Time: 23.601 seconds.
Memory: max=212399m, free=207725m, used=4674m

Removed 77406 invalid entries.
Finished invalid removal.  Time: 0.256 seconds.
Memory: max=212399m, free=207725m, used=4674m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	77406 reads (27.54%) 	244079330 bases (34.26%)    	68953360 collisions.
Result:                 	203685 reads (72.46%) 	468426434 bases (65.74%)

Printed output.            Time: 2.930 seconds.
Memory: max=212399m, free=206615m, used=5784m

Time:   			34.083 seconds.
Reads Processed:        281k 	8.25k reads/sec
Bases Processed:        712m 	20.90m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211348m -Xms211348m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia60subs.fasta maxsubs=60 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia60subs.fasta, maxsubs=60, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212380m, free=209056m, used=3324m

Found 0 duplicates.
Finished exact matches.    Time: 7.808 seconds.
Memory: max=212380m, free=159171m, used=53209m

Found 80107 contained sequences.
Finished containment.      Time: 20.234 seconds.
Memory: max=212380m, free=207257m, used=5123m

Removed 80107 invalid entries.
Finished invalid removal.  Time: 0.292 seconds.
Memory: max=212380m, free=207257m, used=5123m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	80107 reads (28.50%) 	245395205 bases (34.44%)    	68096667 collisions.
Result:                 	200984 reads (71.50%) 	467110559 bases (65.56%)

Printed output.            Time: 2.464 seconds.
Memory: max=212380m, free=206142m, used=6238m

Time:   			30.816 seconds.
Reads Processed:        281k 	9.12k reads/sec
Bases Processed:        712m 	23.12m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211339m -Xms211339m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia65subs.fasta maxsubs=65 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia65subs.fasta, maxsubs=65, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212372m, free=207940m, used=4432m

Found 0 duplicates.
Finished exact matches.    Time: 6.644 seconds.
Memory: max=212372m, free=160271m, used=52101m

Found 82733 contained sequences.
Finished containment.      Time: 23.494 seconds.
Memory: max=212372m, free=207506m, used=4866m

Removed 82733 invalid entries.
Finished invalid removal.  Time: 0.348 seconds.
Memory: max=212372m, free=207506m, used=4866m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	82733 reads (29.43%) 	246749815 bases (34.63%)    	66829379 collisions.
Result:                 	198358 reads (70.57%) 	465755949 bases (65.37%)

Printed output.            Time: 2.585 seconds.
Memory: max=212372m, free=206201m, used=6171m

Time:   			33.086 seconds.
Reads Processed:        281k 	8.50k reads/sec
Bases Processed:        712m 	21.53m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211341m -Xms211341m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia70subs.fasta maxsubs=70 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia70subs.fasta, maxsubs=70, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212374m, free=209050m, used=3324m

Found 0 duplicates.
Finished exact matches.    Time: 8.722 seconds.
Memory: max=212374m, free=159718m, used=52656m

Found 84870 contained sequences.
Finished containment.      Time: 18.281 seconds.
Memory: max=212374m, free=206725m, used=5649m

Removed 84870 invalid entries.
Finished invalid removal.  Time: 0.299 seconds.
Memory: max=212374m, free=206725m, used=5649m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	84870 reads (30.19%) 	247863057 bases (34.79%)    	67629054 collisions.
Result:                 	196221 reads (69.81%) 	464642707 bases (65.21%)

Printed output.            Time: 2.148 seconds.
Memory: max=212374m, free=205611m, used=6763m

Time:   			29.466 seconds.
Reads Processed:        281k 	9.54k reads/sec
Bases Processed:        712m 	24.18m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211314m -Xms211314m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia75subs.fasta maxsubs=75 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia75subs.fasta, maxsubs=75, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212346m, free=207914m, used=4432m

Found 0 duplicates.
Finished exact matches.    Time: 6.865 seconds.
Memory: max=212346m, free=161360m, used=50986m

Found 86735 contained sequences.
Finished containment.      Time: 11.489 seconds.
Memory: max=212346m, free=158139m, used=54207m

Removed 86735 invalid entries.
Finished invalid removal.  Time: 0.330 seconds.
Memory: max=212346m, free=158139m, used=54207m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	86735 reads (30.86%) 	248866866 bases (34.93%)    	67694034 collisions.
Result:                 	194356 reads (69.14%) 	463638898 bases (65.07%)

Printed output.            Time: 3.616 seconds.
Memory: max=212346m, free=157544m, used=54802m

Time:   			22.449 seconds.
Reads Processed:        281k 	12.52k reads/sec
Bases Processed:        712m 	31.74m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211325m -Xms211325m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia80subs.fasta maxsubs=80 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia80subs.fasta, maxsubs=80, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212358m, free=207926m, used=4432m

Found 0 duplicates.
Finished exact matches.    Time: 6.819 seconds.
Memory: max=212358m, free=160817m, used=51541m

Found 88418 contained sequences.
Finished containment.      Time: 17.209 seconds.
Memory: max=212358m, free=207540m, used=4818m

Removed 88418 invalid entries.
Finished invalid removal.  Time: 0.239 seconds.
Memory: max=212358m, free=207540m, used=4818m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	88418 reads (31.46%) 	249777410 bases (35.06%)    	67227638 collisions.
Result:                 	192673 reads (68.54%) 	462728354 bases (64.94%)

Printed output.            Time: 1.770 seconds.
Memory: max=212358m, free=206791m, used=5567m

Time:   			26.053 seconds.
Reads Processed:        281k 	10.79k reads/sec
Bases Processed:        712m 	27.35m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211274m -Xms211274m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia85subs.fasta maxsubs=85 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia85subs.fasta, maxsubs=85, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212306m, free=207875m, used=4431m

Found 0 duplicates.
Finished exact matches.    Time: 7.388 seconds.
Memory: max=212306m, free=160223m, used=52083m

Found 89992 contained sequences.
Finished containment.      Time: 20.680 seconds.
Memory: max=212306m, free=208425m, used=3881m

Removed 89992 invalid entries.
Finished invalid removal.  Time: 0.295 seconds.
Memory: max=212306m, free=208425m, used=3881m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	89992 reads (32.02%) 	250676105 bases (35.18%)    	67539779 collisions.
Result:                 	191099 reads (67.98%) 	461829659 bases (64.82%)

Printed output.            Time: 2.440 seconds.
Memory: max=212306m, free=206955m, used=5351m

Time:   			30.817 seconds.
Reads Processed:        281k 	9.12k reads/sec
Bases Processed:        712m 	23.12m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211297m -Xms211297m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia90subs.fasta maxsubs=90 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia90subs.fasta, maxsubs=90, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212330m, free=209006m, used=3324m

Found 0 duplicates.
Finished exact matches.    Time: 8.632 seconds.
Memory: max=212330m, free=159133m, used=53197m

Found 91593 contained sequences.
Finished containment.      Time: 24.517 seconds.
Memory: max=212330m, free=207263m, used=5067m

Removed 91593 invalid entries.
Finished invalid removal.  Time: 0.249 seconds.
Memory: max=212330m, free=207263m, used=5067m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	91593 reads (32.58%) 	251634118 bases (35.32%)    	66353054 collisions.
Result:                 	189498 reads (67.42%) 	460871646 bases (64.68%)

Printed output.            Time: 1.902 seconds.
Memory: max=212330m, free=206513m, used=5817m

Time:   			35.315 seconds.
Reads Processed:        281k 	7.96k reads/sec
Bases Processed:        712m 	20.18m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211283m -Xms211283m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia95subs.fasta maxsubs=95 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia95subs.fasta, maxsubs=95, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212316m, free=208992m, used=3324m

Found 0 duplicates.
Finished exact matches.    Time: 6.716 seconds.
Memory: max=212316m, free=159123m, used=53193m

Found 93092 contained sequences.
Finished containment.      Time: 21.864 seconds.
Memory: max=212316m, free=208434m, used=3882m

Removed 93092 invalid entries.
Finished invalid removal.  Time: 0.264 seconds.
Memory: max=212316m, free=208434m, used=3882m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	93092 reads (33.12%) 	252564221 bases (35.45%)    	66316379 collisions.
Result:                 	187999 reads (66.88%) 	459941543 bases (64.55%)

Printed output.            Time: 2.543 seconds.
Memory: max=212316m, free=207130m, used=5186m

Time:   			31.403 seconds.
Reads Processed:        281k 	8.95k reads/sec
Bases Processed:        712m 	22.69m bases/sec
java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211275m -Xms211275m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia100subs.fasta maxsubs=100 maxedits=2 minidentity=95
Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia100subs.fasta, maxsubs=100, maxedits=2, minidentity=95]

Dedupe version 36.92
Initial:
Memory: max=212308m, free=208985m, used=3323m

Found 0 duplicates.
Finished exact matches.    Time: 6.941 seconds.
Memory: max=212308m, free=160224m, used=52084m

Found 94648 contained sequences.
Finished containment.      Time: 32.346 seconds.
Memory: max=212308m, free=206767m, used=5541m

Removed 94648 invalid entries.
Finished invalid removal.  Time: 0.252 seconds.
Memory: max=212308m, free=206767m, used=5541m

Input:                  	281091 reads 		712505764 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	94648 reads (33.67%) 	253576350 bases (35.59%)    	66583705 collisions.
Result:                 	186443 reads (66.33%) 	458929414 bases (64.41%)

Printed output.            Time: 2.083 seconds.
Memory: max=212308m, free=205653m, used=6655m

Time:   			41.636 seconds.
Reads Processed:        281k 	6.75k reads/sec
Bases Processed:        712m 	17.11m bases/sec
You don't see any duplicates because, I assume, they were removed during the first pass. Even at 100 substitutions there are still a lot of transcripts left (186443 of the 281091 input sequences). At this point, including the massive amount of sequence data that I have is more of an academic challenge than a practical one. The reference that I eventually use will most likely be assembled from a small subset.
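To compare the runs above side by side, here is a minimal sketch that pulls the maxsubs value and the "Result:" line out of each run. It assumes the console output has the same layout as the log pasted above; the function name summarize_dedupe_log and the embedded sample are only for illustration and are not part of BBTools.

```python
import re

# "maxsubs=NN" appears in both the java command line and the "Executing" line.
MAXSUBS_RE = re.compile(r"\bmaxsubs=(\d+)")
# e.g. "Result:   196221 reads (69.81%)   464642707 bases (65.21%)"
RESULT_RE = re.compile(
    r"^Result:\s+(\d+) reads \(([\d.]+)%\)\s+(\d+) bases \(([\d.]+)%\)"
)


def summarize_dedupe_log(text):
    """Return (maxsubs, reads_kept, pct_reads, bases_kept, pct_bases) per run.

    A "Result:" line is only recorded if a maxsubs value was seen before it,
    so a truncated first run with no command line is skipped.
    """
    rows = []
    current = None
    for line in text.splitlines():
        m = MAXSUBS_RE.search(line)
        if m:
            current = int(m.group(1))
            continue
        r = RESULT_RE.match(line.strip())
        if r and current is not None:
            rows.append(
                (current, int(r.group(1)), float(r.group(2)),
                 int(r.group(3)), float(r.group(4)))
            )
            current = None  # wait for the next run's command line
    return rows


if __name__ == "__main__":
    # Two runs copied from the log above (maxsubs=70 and maxsubs=75).
    sample = (
        "Executing jgi.Dedupe [in=in.fasta, out=out70.fasta, maxsubs=70, maxedits=2, minidentity=95]\n"
        "Result:                 \t196221 reads (69.81%) \t464642707 bases (65.21%)\n"
        "Executing jgi.Dedupe [in=in.fasta, out=out75.fasta, maxsubs=75, maxedits=2, minidentity=95]\n"
        "Result:                 \t194356 reads (69.14%) \t463638898 bases (65.07%)\n"
    )
    for row in summarize_dedupe_log(sample):
        print("maxsubs=%d kept %d reads (%.2f%%), %d bases (%.2f%%)" % row)
```

Run over the full log, this should make the trend easier to see: each step of 5 in maxsubs removes only another 1500 to 2100 transcripts or so.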

Tags
assembly, de novo, transcriptome
