
  • Cleaning up, merging de novo transcriptomes to create a quality reference

    Hello,

    I have about 950 million reads from an RNA-Seq data set that covers many developmental time-points. Assembling all the reads doesn't really work because I reach a point where errors are being included at a higher rate than new k-mers (or so I have been advised...including all of the reads and digital-normalizing to 20x results in a very fragmented, low quality assembly).

    If I assemble multiple time-points individually and then merge the transcriptomes, how would I select the best representative isoform from each assembly and jettison the rest to create a nice, clean final reference? What is a good method to filter the garbage out, and what is a good method to merge them, favoring more complete sequences?

    To clarify merging - I'm thinking of selecting individual transcripts from multiple assemblies, not merging actual sequences together to increase length, although that would be a source of improvement.

  • #2
    Generally, I would recommend assembling them all together, and using error-correction if necessary to deal with the introduction of errors. 20x is also a very low target for normalization prior to assembly; if you want to normalize, I typically recommend a target of 100x.

    For optimal assembly, I recommend a bit of preprocessing first. Using BBMap, and starting with the raw reads (assuming these are 2x150bp Illumina data):

    Code:
    bbduk.sh in=r1.fq in2=r2.fq out=trimmed.fq minlen=90 ktrim=r k=23 mink=11 hdist=1 tbo tpe ref=adapters.fa maxns=0 qtrim=r trimq=10
    
    bbduk.sh in=trimmed.fq out=filtered.fq ref=phix174_ill.ref.fa.gz,sequencing_artifacts.fa.gz k=31
    
    bbmerge.sh in=filtered.fq out=ecco.fq ecco mix strict adapters=default
    
    tadpole.sh in=ecco.fq out=ecct.fq ecc
    
    #Normalization may or may not be helpful; it depends on the dataset and assembler.
    #So, I suggest assembling both with and without to see which is better.
    bbnorm.sh in=ecct.fq out=normalized.fq target=100 min=2
    Then try assembling. If you assemble the libraries separately rather than together, you can use Dedupe to remove duplicate contigs:

    Code:
    dedupe.sh in=a.fa,b.fa,c.fa out=deduped.fa s=5
    This will remove duplicate and contained sequences, allowing up to 5 substitutions.
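
When the libraries are processed separately, the same preprocessing steps have to be repeated for every sample. The following is only a sketch of how that could be scripted: it prints the commands from the post with a sample prefix substituted in, rather than running them. The function name and the `<sample>_R1.fq` / `<sample>_R2.fq` file naming are assumptions for illustration, not something from the thread.

```shell
#!/usr/bin/env bash
# Print (do not run) the preprocessing commands from the post for one sample.
# Hypothetical naming: <sample>_R1.fq and <sample>_R2.fq hold the raw read pairs.
print_preprocess_cmds() {
    local s="$1"
    cat <<EOF
bbduk.sh in=${s}_R1.fq in2=${s}_R2.fq out=${s}.trimmed.fq minlen=90 ktrim=r k=23 mink=11 hdist=1 tbo tpe ref=adapters.fa maxns=0 qtrim=r trimq=10
bbduk.sh in=${s}.trimmed.fq out=${s}.filtered.fq ref=phix174_ill.ref.fa.gz,sequencing_artifacts.fa.gz k=31
bbmerge.sh in=${s}.filtered.fq out=${s}.ecco.fq ecco mix strict adapters=default
tadpole.sh in=${s}.ecco.fq out=${s}.ecct.fq ecc
bbnorm.sh in=${s}.ecct.fq out=${s}.normalized.fq target=100 min=2
EOF
}

# Example: emit the commands for two hypothetical samples.
for sample in sampleA sampleB; do
    print_preprocess_cmds "$sample"
done
```

Printing first makes it easy to inspect the command lines before committing compute time; once they look right, the output can be piped to `bash`.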



    • #3
      Thanks, I'll give that a shot and report my results when finished



      • #4
        I'm getting a weird error in bbduk.sh where it says I have unpaired reads.

        Code:
        BBDuk version 36.92
        maskMiddle was disabled because useShortKmers=true
        Initial:
        Memory: max=205801m, free=201506m, used=4295m
        
        Added 216529 kmers; time:       0.181 seconds.
        Memory: max=205801m, free=191842m, used=13959m
        
        Input is being processed as paired
        Started output streams: 0.025 seconds.
        Exception in thread "Thread-11" java.lang.AssertionError:
        There appear to be different numbers of reads in the paired input files.
        The pairing may have been corrupted by an upstream process.  It may be fixable by running repair.sh.
                at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:479)
                at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:344)
                at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:188)
                at java.lang.Thread.run(Thread.java:745)
        I ran repair.sh as recommended, and it reported that only 50% of the reads are paired. That isn't true: if I use the Khmer Toolkit to extract pairs, they're all paired. They also appear to be in the correct order. I don't have the output from repair.sh because I didn't capture it from the screen session. Any advice? Should I interleave them first with the Khmer Toolkit?

        I concatenate all of my gzip reads from replicates and virtual NextSeq lanes like so:

        Code:
        cat left1.fastq.gz left2.fastq.gz left3.fastq.gz > left_all.fastq.gz
        cat right1.fastq.gz right1.fastq.gz right1.fastq.gz > right_all.fastq.gz
        If I run repair.sh on the single, un-concatenated files 100% of the reads have pairs. Again, they appear to be in order....

        Thanks
        Last edited by dacotahm; 02-07-2017, 12:04 PM.



        • #5
          Sounds like there's something strange about the concatenated gzipped files; it might work better if you decompress them, concatenate them, and then compress them again:

          Code:
          zcat left1.fastq.gz left2.fastq.gz left3.fastq.gz | gzip -c > left_all.fastq.gz
          There are some versions of Java that have trouble with certain concatenated gzipped files (they think the file ends at one of the concatenation boundaries), which appears to be a bug. So you can do as above, or interleave them with a different tool.
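
One way to check the combined files independently of Java is to count reads with `gzip` itself, which reads every member of a concatenated gzip file. The following is a minimal sketch; the function names are made up for illustration, and it assumes standard 4-line FASTQ records.

```shell
#!/usr/bin/env bash
# Count FASTQ records in a gzipped file (assumes 4 lines per record).
# gzip -dc decompresses all members of a concatenated .gz file.
count_reads() {
    gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Succeeds only if both mate files hold the same number of reads.
same_read_count() {
    [ "$(count_reads "$1")" = "$(count_reads "$2")" ]
}

# Example usage (file names as in the posts above):
#   count_reads left_all.fastq.gz
#   same_read_count left_all.fastq.gz right_all.fastq.gz && echo "counts match"
```

If the count for a combined file equals the sum of the counts for its parts, the file itself is intact, and a tool reporting fewer reads is probably stopping at a concatenation boundary. Equal counts alone do not prove the mates are in the same order.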



          • #6
            Here are my results, if you're interested. Dedupe.sh didn't remove as many as I expected from one set, so I'm not sure how to interpret that.

            I ran all of the norm and error correction steps on the reads for each sample separately. Then I assembled them separately with Trinity, concatenated the assemblies, and deduped them.

            I did another run where I concatenated all the error-corrected, normalized reads and assembled them together with Trinity before deduping that single assembly. Results as follows:

            Dedupe removed ~42% from the concatenated assemblies, but still had 1e6 contigs, which is very high. It removed 2.7% from the single assembly that contained all of the reads. This is another bee species so still high, but contains isoforms.

            The third group of assemblies (below) has >1e6 contigs but length stats other than contig number and total BP are similar to other assemblies. I assume it contains a ton of duplication still, curious about how to tune dedupe.sh.

            Code:
            #Individual assemblies
            filename 			sum n trim_n min med mean max n50 n50_len n90 n90_len
            DLmRNA.fasta 		72452700 54350 54350 201 528 1333 34738 7323 2938 29606 455
            OLA125mRNA.fasta 	247008720 141467 141467 201 625 1746 25949 17551 4258 69947 639
            OLA14mRNA.fasta 	303516447 141371 141371 201 854 2146 27338 19963 4855 67083 991
            OLA215mRNA.fasta 	266481896 147914 147914 201 665 1801 31511 18978 4226 72938 686
            OLA28mRNA.fasta 	219382126 139428 139428 201 522 1573 31117 15741 4164 69889 519
            OLA65mRNA.fasta 	364224298 154717 154717 201 1056 2354 29609 22538 5139 74439 1179
            OLNAmRNA.fasta 		137613039 98089 98089 201 488 1402 29749 11716 3511 51283 455
            PP12mRNA.fasta 		470714411 191853 191853 201 1026 2453 38729 27127 5531 90517 1207
            PP15mRNA.fasta 		403006775 177736 177736 201 934 2267 31633 25075 5056 84637 1071
            PP6mRNA.fasta 		189346636 134266 134266 201 494 1410 31708 15023 3658 70874 456
            Pplus20mRNA.fasta 	210036076 80828 80828 201 1502 2598 56106 12951 5062 41100 1442
            PPmRNA.fasta 		213918690 142302 142302 201 516 1503 35229 16062 3894 72927 496
            PUmRNA.fasta 		209833062 147120 147120 201 485 1426 36445 16145 3756 76586 458
            
            #Individual assemblies concatenated and deduped
            filename 							sum n trim_n min med mean max n50 n50_len n90 n90_len
            01_concatenatedAssemblies.fasta 	3307534876 1751441 1751441 201 662 1888 56106 221789 4561 837145 733
            02_dedupedConcatAssemblies.fasta 	2667545297 1016986 1016986 201 1399 2622 56106 161158 5318 509036 1394
            
            #Reads concatenated from all samples, assembled together.
            filename 										sum n trim_n min med mean max n50 n50_len n90 n90_len
            ReadsConcatenatedBeforeAssembly.fasta          730984902 288974 288974 201 1210 2529 43812 42367 5432 140412 1302
            DeDuped_ReadsConcatenatedBeforeAssembly.fasta  712505764 281091 281091 201 1207 2534 43812 41174 5450 136631 1297

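
The removal percentages quoted in this post can be reproduced from the contig counts (the `n` column) in the table. A small sketch, with a helper name chosen only for illustration:

```shell
#!/usr/bin/env bash
# Percentage of sequences removed, given counts before and after deduplication.
pct_removed() {
    awk -v b="$1" -v a="$2" 'BEGIN {
        if (b == 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * (b - a) / b
    }'
}

pct_removed 1751441 1016986   # concatenated assemblies: prints 41.9
pct_removed 288974 281091     # single combined assembly: prints 2.7
```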


            • #7
              Dedupe will only remove identical (aside from the specified number of mismatches/edits) or fully-contained sequences; it will not remove or combine sequences that overlap but in which neither is fully contained by the other. For that purpose, you might try Minimus2.
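
The difference between "contained" and "overlapping" can be shown with a toy example. This is not how Dedupe works internally; it is a plain exact-substring test on one strand, with made-up sequences, meant only to illustrate the distinction.

```shell
#!/usr/bin/env bash
# Toy check: is the first sequence an exact substring of the second?
# Ignores mismatches, indels and reverse complements.
is_contained() {
    case "$2" in
        *"$1"*) return 0 ;;
        *)      return 1 ;;
    esac
}

A=ACGTACGTTTGA
B=GTACGT     # lies entirely inside A
C=TTGACCC    # shares TTGA with the end of A, then extends past it

is_contained "$B" "$A" && echo "B is contained in A (a candidate for removal)"
is_contained "$C" "$A" || echo "C only overlaps A (kept)"
```

Sequences like `C` are the case where an overlap-based merger would be needed instead.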

              It's hard to say why there are so many sequences in the deduped, concatenated assemblies compared to the combined assembly. You can certainly try increasing the allowed number of substitutions, or allow an edit distance to accommodate indels ("e=" flag rather than "s=" flag). Perhaps you can tell me...

              1) What is the ploidy of these bees?
              2) What is the average heterozygosity rate, or % identity between two individuals?

              528 (for example) seems to be short for a median isoform length; it's not even much above your min cutoff of 201bp. How does that compare to the expected median isoform length in bees? I wonder if the RNA is too old and degraded, or if the coverage was too low for good assemblies. Do you have information about the coverage distribution of the contigs? Note that in the assembly where you concatenated all reads prior to assembly, you achieved a much higher median contig length than any of the individual assemblies, which indicates this is a coverage problem. For quantification purposes, you may want to just use that one and map to it rather than assembling the individual samples.
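
For comparing assemblies on these length statistics, the numbers can be recomputed directly from a FASTA file. The following is a rough sketch with standard tools; the function name is hypothetical, and it reports only the N50 length, not the contig count that the `n50` column in the table above appears to hold.

```shell
#!/usr/bin/env bash
# Print basic length statistics for a FASTA file:
# number of sequences, total bases, min, median, mean, max and N50 length.
fasta_length_stats() {
    awk '
        /^>/ { if (len) print len; len = 0; next }   # new record: emit previous length
        { len += length($0) }                        # sequence may span several lines
        END { if (len) print len }
    ' "$1" | sort -n | awk '
        { L[NR] = $1; sum += $1 }
        END {
            if (NR == 0) exit
            med = (NR % 2) ? L[(NR + 1) / 2] : (L[NR / 2] + L[NR / 2 + 1]) / 2
            # N50: walk from the longest contig down until half of all bases are covered.
            acc = 0
            for (i = NR; i >= 1; i--) {
                acc += L[i]
                if (acc >= sum / 2) { n50 = L[i]; break }
            }
            printf "n=%d sum=%.0f min=%d med=%g mean=%d max=%d n50_len=%d\n",
                   NR, sum, L[1], med, int(sum / NR), L[NR], n50
        }'
}

# Example usage:
#   fasta_length_stats DLmRNA.fasta
```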



              • #8
                It's been a while but I circled back to cleaning this transcriptome up.

                I don't know the amount of genetic variation; that's something we're planning to dig into sometime next year.

                Anyway, I tried to clean it up a bit. The one with the duplication is by far the most complete (~96%). I made a reciprocal-best-hit subset and used the best single hit against unique honey bee transcripts, but it is quite incomplete.

                BUSCO scores:

                Code:
                					C	D	F	M
                ApisBlastUniques			2000	444	237	438
                SingleSample				2214	521	178	283
                AllSamplesWithDuplication	        248	2110	216	101
                I tried running dedupe.sh in a loop with substitution increments of {0..100..5} and still didn't see much change:
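
A sweep like this could be scripted roughly as follows. This is a sketch, not the exact loop that was run: it prints the command lines instead of executing them, the function name and output file pattern are made up, and `maxedits=2 minidentity=95` are copied from the log. The log itself follows.

```shell
#!/usr/bin/env bash
# Print one dedupe.sh command per substitution setting, from 0 to 100 in steps of 5.
# Hypothetical output naming: Deduped<N>subs.fasta
print_dedupe_sweep() {
    local in_fa="$1" s
    for s in $(seq 0 5 100); do
        echo "dedupe.sh in=${in_fa} out=Deduped${s}subs.fasta maxsubs=${s} maxedits=2 minidentity=95"
    done
}

# Example usage:
#   print_dedupe_sweep Trin_deduped.fasta.dammit.fasta
#   print_dedupe_sweep Trin_deduped.fasta.dammit.fasta | bash   # actually run them
```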

                Code:
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211470m -Xms211470m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia0subs.fasta maxsubs=0 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia0subs.fasta, maxsubs=0, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212503m, free=209177m, used=3326m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 10.134 seconds.
                Memory: max=212503m, free=160369m, used=52134m
                
                Found 62027 contained sequences.
                Finished containment.      Time: 15.413 seconds.
                Memory: max=212503m, free=207364m, used=5139m
                
                Removed 62027 invalid entries.
                Finished invalid removal.  Time: 0.309 seconds.
                Memory: max=212503m, free=207364m, used=5139m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	62027 reads (22.07%) 	238551461 bases (33.48%)    	70938383 collisions.
                Result:                 	219064 reads (77.93%) 	473954303 bases (66.52%)
                
                Printed output.            Time: 2.464 seconds.
                Memory: max=212503m, free=206615m, used=5888m
                
                Time:   			28.337 seconds.
                Reads Processed:        281k 	9.92k reads/sec
                Bases Processed:        712m 	25.14m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211451m -Xms211451m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia5subs.fasta maxsubs=5 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia5subs.fasta, maxsubs=5, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212485m, free=208050m, used=4435m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 11.091 seconds.
                Memory: max=212485m, free=160910m, used=51575m
                
                Found 62027 contained sequences.
                Finished containment.      Time: 9.976 seconds.
                Memory: max=212485m, free=157030m, used=55455m
                
                Removed 62027 invalid entries.
                Finished invalid removal.  Time: 0.193 seconds.
                Memory: max=212485m, free=157030m, used=55455m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	62027 reads (22.07%) 	238551461 bases (33.48%)    	71234164 collisions.
                Result:                 	219064 reads (77.93%) 	473954303 bases (66.52%)
                
                Printed output.            Time: 6.894 seconds.
                Memory: max=212485m, free=209979m, used=2506m
                
                Time:   			28.170 seconds.
                Reads Processed:        281k 	9.98k reads/sec
                Bases Processed:        712m 	25.29m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211500m -Xms211500m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia10subs.fasta maxsubs=10 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia10subs.fasta, maxsubs=10, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212533m, free=208097m, used=4436m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 5.853 seconds.
                Memory: max=212533m, free=160949m, used=51584m
                
                Found 62027 contained sequences.
                Finished containment.      Time: 18.619 seconds.
                Memory: max=212533m, free=208178m, used=4355m
                
                Removed 62027 invalid entries.
                Finished invalid removal.  Time: 0.293 seconds.
                Memory: max=212533m, free=208178m, used=4355m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	62027 reads (22.07%) 	238551461 bases (33.48%)    	70560203 collisions.
                Result:                 	219064 reads (77.93%) 	473954303 bases (66.52%)
                
                Printed output.            Time: 1.997 seconds.
                Memory: max=212533m, free=207070m, used=5463m
                
                Time:   			26.780 seconds.
                Reads Processed:        281k 	10.50k reads/sec
                Bases Processed:        712m 	26.61m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211435m -Xms211435m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia15subs.fasta maxsubs=15 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia15subs.fasta, maxsubs=15, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212469m, free=209143m, used=3326m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 6.209 seconds.
                Memory: max=212469m, free=160345m, used=52124m
                
                Found 63110 contained sequences.
                Finished containment.      Time: 20.348 seconds.
                Memory: max=212469m, free=209159m, used=3310m
                
                Removed 63110 invalid entries.
                Finished invalid removal.  Time: 0.207 seconds.
                Memory: max=212469m, free=209159m, used=3310m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	63110 reads (22.45%) 	238796524 bases (33.52%)    	70648475 collisions.
                Result:                 	217981 reads (77.55%) 	473709240 bases (66.48%)
                
                Printed output.            Time: 1.873 seconds.
                Memory: max=212469m, free=208410m, used=4059m
                
                Time:   			28.656 seconds.
                Reads Processed:        281k 	9.81k reads/sec
                Bases Processed:        712m 	24.86m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211443m -Xms211443m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia20subs.fasta maxsubs=20 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia20subs.fasta, maxsubs=20, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212477m, free=208042m, used=4435m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 6.880 seconds.
                Memory: max=212477m, free=158134m, used=54343m
                
                Found 64647 contained sequences.
                Finished containment.      Time: 29.648 seconds.
                Memory: max=212477m, free=206120m, used=6357m
                
                Removed 64647 invalid entries.
                Finished invalid removal.  Time: 0.299 seconds.
                Memory: max=212477m, free=206120m, used=6357m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	64647 reads (23.00%) 	239192035 bases (33.57%)    	69566767 collisions.
                Result:                 	216444 reads (77.00%) 	473313729 bases (66.43%)
                
                Printed output.            Time: 2.137 seconds.
                Memory: max=212477m, free=204999m, used=7478m
                
                Time:   			38.978 seconds.
                Reads Processed:        281k 	7.21k reads/sec
                Bases Processed:        712m 	18.28m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211421m -Xms211421m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia25subs.fasta maxsubs=25 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia25subs.fasta, maxsubs=25, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212455m, free=208021m, used=4434m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 6.859 seconds.
                Memory: max=212455m, free=159779m, used=52676m
                
                Found 65967 contained sequences.
                Finished containment.      Time: 19.185 seconds.
                Memory: max=212455m, free=207484m, used=4971m
                
                Removed 65967 invalid entries.
                Finished invalid removal.  Time: 0.243 seconds.
                Memory: max=212455m, free=207484m, used=4971m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	65967 reads (23.47%) 	239568880 bases (33.62%)    	69863260 collisions.
                Result:                 	215124 reads (76.53%) 	472936884 bases (66.38%)
                
                Printed output.            Time: 2.177 seconds.
                Memory: max=212455m, free=206730m, used=5725m
                
                Time:   			28.481 seconds.
                Reads Processed:        281k 	9.87k reads/sec
                Bases Processed:        712m 	25.02m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211414m -Xms211414m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia30subs.fasta maxsubs=30 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia30subs.fasta, maxsubs=30, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212447m, free=208013m, used=4434m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 7.152 seconds.
                Memory: max=212447m, free=160883m, used=51564m
                
                Found 67307 contained sequences.
                Finished containment.      Time: 17.242 seconds.
                Memory: max=212447m, free=207899m, used=4548m
                
                Removed 67307 invalid entries.
                Finished invalid removal.  Time: 0.282 seconds.
                Memory: max=212447m, free=207899m, used=4548m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	67307 reads (23.94%) 	239991769 bases (33.68%)    	69772975 collisions.
                Result:                 	213784 reads (76.06%) 	472513995 bases (66.32%)
                
                Printed output.            Time: 1.984 seconds.
                Memory: max=212447m, free=206597m, used=5850m
                
                Time:   			26.677 seconds.
                Reads Processed:        281k 	10.54k reads/sec
                Bases Processed:        712m 	26.71m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211410m -Xms211410m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia35subs.fasta maxsubs=35 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia35subs.fasta, maxsubs=35, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212443m, free=209117m, used=3326m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 8.400 seconds.
                Memory: max=212443m, free=160324m, used=52119m
                
                Found 68858 contained sequences.
                Finished containment.      Time: 9.879 seconds.
                Memory: max=212443m, free=156999m, used=55444m
                
                Removed 68858 invalid entries.
                Finished invalid removal.  Time: 0.436 seconds.
                Memory: max=212443m, free=156999m, used=55444m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	68858 reads (24.50%) 	240523318 bases (33.76%)    	68935338 collisions.
                Result:                 	212233 reads (75.50%) 	471982446 bases (66.24%)
                
                Printed output.            Time: 6.270 seconds.
                Memory: max=212443m, free=209195m, used=3248m
                
                Time:   			25.000 seconds.
                Reads Processed:        281k 	11.24k reads/sec
                Bases Processed:        712m 	28.50m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211436m -Xms211436m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia40subs.fasta maxsubs=40 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia40subs.fasta, maxsubs=40, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212469m, free=208035m, used=4434m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 6.676 seconds.
                Memory: max=212469m, free=159237m, used=53232m
                
                Found 70682 contained sequences.
                Finished containment.      Time: 29.239 seconds.
                Memory: max=212469m, free=207078m, used=5391m
                
                Removed 70682 invalid entries.
                Finished invalid removal.  Time: 0.219 seconds.
                Memory: max=212469m, free=207078m, used=5391m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	70682 reads (25.15%) 	241200047 bases (33.85%)    	69348258 collisions.
                Result:                 	210409 reads (74.85%) 	471305717 bases (66.15%)
                
                Printed output.            Time: 4.554 seconds.
                Memory: max=212469m, free=206327m, used=6142m
                
                Time:   			40.702 seconds.
                Reads Processed:        281k 	6.91k reads/sec
                Bases Processed:        712m 	17.51m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211397m -Xms211397m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia45subs.fasta maxsubs=45 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia45subs.fasta, maxsubs=45, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212431m, free=209105m, used=3326m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 6.755 seconds.
                Memory: max=212431m, free=159207m, used=53224m
                
                Found 72888 contained sequences.
                Finished containment.      Time: 18.857 seconds.
                Memory: max=212431m, free=206186m, used=6245m
                
                Removed 72888 invalid entries.
                Finished invalid removal.  Time: 0.251 seconds.
                Memory: max=212431m, free=206186m, used=6245m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	72888 reads (25.93%) 	242085988 bases (33.98%)    	68508447 collisions.
                Result:                 	208203 reads (74.07%) 	470419776 bases (66.02%)
                
                Printed output.            Time: 2.567 seconds.
                Memory: max=212431m, free=205436m, used=6995m
                
                Time:   			28.447 seconds.
                Reads Processed:        281k 	9.88k reads/sec
                Bases Processed:        712m 	25.05m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211368m -Xms211368m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia50subs.fasta maxsubs=50 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia50subs.fasta, maxsubs=50, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212400m, free=207967m, used=4433m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 7.608 seconds.
                Memory: max=212400m, free=160293m, used=52107m
                
                Found 75105 contained sequences.
                Finished containment.      Time: 22.144 seconds.
                Memory: max=212400m, free=207667m, used=4733m
                
                Removed 75105 invalid entries.
                Finished invalid removal.  Time: 0.354 seconds.
                Memory: max=212400m, free=207667m, used=4733m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	75105 reads (26.72%) 	243040998 bases (34.11%)    	68422636 collisions.
                Result:                 	205986 reads (73.28%) 	469464766 bases (65.89%)
                
                Printed output.            Time: 2.299 seconds.
                Memory: max=212400m, free=206555m, used=5845m
                
                Time:   			32.416 seconds.
                Reads Processed:        281k 	8.67k reads/sec
                Bases Processed:        712m 	21.98m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211365m -Xms211365m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia55subs.fasta maxsubs=55 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia55subs.fasta, maxsubs=55, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212399m, free=209074m, used=3325m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 7.279 seconds.
                Memory: max=212399m, free=159738m, used=52661m
                
                Found 77406 contained sequences.
                Finished containment.      Time: 23.601 seconds.
                Memory: max=212399m, free=207725m, used=4674m
                
                Removed 77406 invalid entries.
                Finished invalid removal.  Time: 0.256 seconds.
                Memory: max=212399m, free=207725m, used=4674m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	77406 reads (27.54%) 	244079330 bases (34.26%)    	68953360 collisions.
                Result:                 	203685 reads (72.46%) 	468426434 bases (65.74%)
                
                Printed output.            Time: 2.930 seconds.
                Memory: max=212399m, free=206615m, used=5784m
                
                Time:   			34.083 seconds.
                Reads Processed:        281k 	8.25k reads/sec
                Bases Processed:        712m 	20.90m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211348m -Xms211348m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia60subs.fasta maxsubs=60 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia60subs.fasta, maxsubs=60, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212380m, free=209056m, used=3324m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 7.808 seconds.
                Memory: max=212380m, free=159171m, used=53209m
                
                Found 80107 contained sequences.
                Finished containment.      Time: 20.234 seconds.
                Memory: max=212380m, free=207257m, used=5123m
                
                Removed 80107 invalid entries.
                Finished invalid removal.  Time: 0.292 seconds.
                Memory: max=212380m, free=207257m, used=5123m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	80107 reads (28.50%) 	245395205 bases (34.44%)    	68096667 collisions.
                Result:                 	200984 reads (71.50%) 	467110559 bases (65.56%)
                
                Printed output.            Time: 2.464 seconds.
                Memory: max=212380m, free=206142m, used=6238m
                
                Time:   			30.816 seconds.
                Reads Processed:        281k 	9.12k reads/sec
                Bases Processed:        712m 	23.12m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211339m -Xms211339m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia65subs.fasta maxsubs=65 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia65subs.fasta, maxsubs=65, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212372m, free=207940m, used=4432m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 6.644 seconds.
                Memory: max=212372m, free=160271m, used=52101m
                
                Found 82733 contained sequences.
                Finished containment.      Time: 23.494 seconds.
                Memory: max=212372m, free=207506m, used=4866m
                
                Removed 82733 invalid entries.
                Finished invalid removal.  Time: 0.348 seconds.
                Memory: max=212372m, free=207506m, used=4866m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	82733 reads (29.43%) 	246749815 bases (34.63%)    	66829379 collisions.
                Result:                 	198358 reads (70.57%) 	465755949 bases (65.37%)
                
                Printed output.            Time: 2.585 seconds.
                Memory: max=212372m, free=206201m, used=6171m
                
                Time:   			33.086 seconds.
                Reads Processed:        281k 	8.50k reads/sec
                Bases Processed:        712m 	21.53m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211341m -Xms211341m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia70subs.fasta maxsubs=70 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia70subs.fasta, maxsubs=70, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212374m, free=209050m, used=3324m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 8.722 seconds.
                Memory: max=212374m, free=159718m, used=52656m
                
                Found 84870 contained sequences.
                Finished containment.      Time: 18.281 seconds.
                Memory: max=212374m, free=206725m, used=5649m
                
                Removed 84870 invalid entries.
                Finished invalid removal.  Time: 0.299 seconds.
                Memory: max=212374m, free=206725m, used=5649m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	84870 reads (30.19%) 	247863057 bases (34.79%)    	67629054 collisions.
                Result:                 	196221 reads (69.81%) 	464642707 bases (65.21%)
                
                Printed output.            Time: 2.148 seconds.
                Memory: max=212374m, free=205611m, used=6763m
                
                Time:   			29.466 seconds.
                Reads Processed:        281k 	9.54k reads/sec
                Bases Processed:        712m 	24.18m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211314m -Xms211314m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia75subs.fasta maxsubs=75 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia75subs.fasta, maxsubs=75, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212346m, free=207914m, used=4432m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 6.865 seconds.
                Memory: max=212346m, free=161360m, used=50986m
                
                Found 86735 contained sequences.
                Finished containment.      Time: 11.489 seconds.
                Memory: max=212346m, free=158139m, used=54207m
                
                Removed 86735 invalid entries.
                Finished invalid removal.  Time: 0.330 seconds.
                Memory: max=212346m, free=158139m, used=54207m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	86735 reads (30.86%) 	248866866 bases (34.93%)    	67694034 collisions.
                Result:                 	194356 reads (69.14%) 	463638898 bases (65.07%)
                
                Printed output.            Time: 3.616 seconds.
                Memory: max=212346m, free=157544m, used=54802m
                
                Time:   			22.449 seconds.
                Reads Processed:        281k 	12.52k reads/sec
                Bases Processed:        712m 	31.74m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211325m -Xms211325m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia80subs.fasta maxsubs=80 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia80subs.fasta, maxsubs=80, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212358m, free=207926m, used=4432m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 6.819 seconds.
                Memory: max=212358m, free=160817m, used=51541m
                
                Found 88418 contained sequences.
                Finished containment.      Time: 17.209 seconds.
                Memory: max=212358m, free=207540m, used=4818m
                
                Removed 88418 invalid entries.
                Finished invalid removal.  Time: 0.239 seconds.
                Memory: max=212358m, free=207540m, used=4818m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	88418 reads (31.46%) 	249777410 bases (35.06%)    	67227638 collisions.
                Result:                 	192673 reads (68.54%) 	462728354 bases (64.94%)
                
                Printed output.            Time: 1.770 seconds.
                Memory: max=212358m, free=206791m, used=5567m
                
                Time:   			26.053 seconds.
                Reads Processed:        281k 	10.79k reads/sec
                Bases Processed:        712m 	27.35m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211274m -Xms211274m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia85subs.fasta maxsubs=85 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia85subs.fasta, maxsubs=85, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212306m, free=207875m, used=4431m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 7.388 seconds.
                Memory: max=212306m, free=160223m, used=52083m
                
                Found 89992 contained sequences.
                Finished containment.      Time: 20.680 seconds.
                Memory: max=212306m, free=208425m, used=3881m
                
                Removed 89992 invalid entries.
                Finished invalid removal.  Time: 0.295 seconds.
                Memory: max=212306m, free=208425m, used=3881m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	89992 reads (32.02%) 	250676105 bases (35.18%)    	67539779 collisions.
                Result:                 	191099 reads (67.98%) 	461829659 bases (64.82%)
                
                Printed output.            Time: 2.440 seconds.
                Memory: max=212306m, free=206955m, used=5351m
                
                Time:   			30.817 seconds.
                Reads Processed:        281k 	9.12k reads/sec
                Bases Processed:        712m 	23.12m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211297m -Xms211297m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia90subs.fasta maxsubs=90 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia90subs.fasta, maxsubs=90, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212330m, free=209006m, used=3324m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 8.632 seconds.
                Memory: max=212330m, free=159133m, used=53197m
                
                Found 91593 contained sequences.
                Finished containment.      Time: 24.517 seconds.
                Memory: max=212330m, free=207263m, used=5067m
                
                Removed 91593 invalid entries.
                Finished invalid removal.  Time: 0.249 seconds.
                Memory: max=212330m, free=207263m, used=5067m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	91593 reads (32.58%) 	251634118 bases (35.32%)    	66353054 collisions.
                Result:                 	189498 reads (67.42%) 	460871646 bases (64.68%)
                
                Printed output.            Time: 1.902 seconds.
                Memory: max=212330m, free=206513m, used=5817m
                
                Time:   			35.315 seconds.
                Reads Processed:        281k 	7.96k reads/sec
                Bases Processed:        712m 	20.18m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211283m -Xms211283m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia95subs.fasta maxsubs=95 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia95subs.fasta, maxsubs=95, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212316m, free=208992m, used=3324m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 6.716 seconds.
                Memory: max=212316m, free=159123m, used=53193m
                
                Found 93092 contained sequences.
                Finished containment.      Time: 21.864 seconds.
                Memory: max=212316m, free=208434m, used=3882m
                
                Removed 93092 invalid entries.
                Finished invalid removal.  Time: 0.264 seconds.
                Memory: max=212316m, free=208434m, used=3882m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	93092 reads (33.12%) 	252564221 bases (35.45%)    	66316379 collisions.
                Result:                 	187999 reads (66.88%) 	459941543 bases (64.55%)
                
                Printed output.            Time: 2.543 seconds.
                Memory: max=212316m, free=207130m, used=5186m
                
                Time:   			31.403 seconds.
                Reads Processed:        281k 	8.95k reads/sec
                Bases Processed:        712m 	22.69m bases/sec
                java -Djava.library.path=/home/dacotah/bbmap/jni/ -ea -Xmx211275m -Xms211275m -cp /home/dacotah/bbmap/current/ jgi.Dedupe in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta out=DedupedOsmia100subs.fasta maxsubs=100 maxedits=2 minidentity=95
                Executing jgi.Dedupe [in=/home/dacotah/Reference/OsmiaLignaria/Trin_deduped.fasta.dammit.fasta, out=DedupedOsmia100subs.fasta, maxsubs=100, maxedits=2, minidentity=95]
                
                Dedupe version 36.92
                Initial:
                Memory: max=212308m, free=208985m, used=3323m
                
                Found 0 duplicates.
                Finished exact matches.    Time: 6.941 seconds.
                Memory: max=212308m, free=160224m, used=52084m
                
                Found 94648 contained sequences.
                Finished containment.      Time: 32.346 seconds.
                Memory: max=212308m, free=206767m, used=5541m
                
                Removed 94648 invalid entries.
                Finished invalid removal.  Time: 0.252 seconds.
                Memory: max=212308m, free=206767m, used=5541m
                
                Input:                  	281091 reads 		712505764 bases.
                Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                Containments:           	94648 reads (33.67%) 	253576350 bases (35.59%)    	66583705 collisions.
                Result:                 	186443 reads (66.33%) 	458929414 bases (64.41%)
                
                Printed output.            Time: 2.083 seconds.
                Memory: max=212308m, free=205653m, used=6655m
                
                Time:   			41.636 seconds.
                Reads Processed:        281k 	6.75k reads/sec
                Bases Processed:        712m 	17.11m bases/sec
                You don't see any duplicates because, I assume, those were removed during the first pass. Even at 100 substitutions there are still a lot of transcripts left (186,443 of the 281,091 input sequences). At this point, including the massive amount of sequence data that I have is more of an academic challenge than a practical one. The reference that I eventually use will most likely be assembled from a small subset.
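To make the trend across runs easier to read, here is a minimal sketch that condenses a saved copy of the output above into one line per run (the maxsubs value and the number of sequences left). It assumes each run echoes an "Executing jgi.Dedupe [...]" line followed later by a "Result:" line, as in the logs above; the function name `summarize_dedupe_log` and the log file name are hypothetical, not part of BBTools.

```shell
# Summarize a Dedupe parameter sweep: print "maxsubs<TAB>sequences kept" per run.
# Hypothetical helper; it only reads the log text and does not run Dedupe itself.
summarize_dedupe_log() {
    awk '
        /Executing jgi.Dedupe/ {
            # Pull the number out of "maxsubs=N" in the echoed argument list.
            # "maxsubs=" is 8 characters long, so skip those to reach the digits.
            if (match($0, /maxsubs=[0-9]+/)) {
                subs = substr($0, RSTART + 8, RLENGTH - 8)
            }
        }
        /Result:/ {
            # The sequence count is the field right after the "Result:" label.
            for (i = 1; i <= NF; i++) {
                if ($i == "Result:") { print subs "\t" $(i + 1); break }
            }
        }
    ' "$1"
}
```

Usage would be something like `summarize_dedupe_log dedupe_sweep.log`, where `dedupe_sweep.log` is a placeholder for wherever the console output was saved. A run whose "Executing" line is missing from the file is printed with an empty first column.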
