Old 02-21-2017, 06:41 AM   #6
dacotahm

Here are my results, if you're interested. I'm not sure how to interpret them, because dedupe.sh removed fewer contigs than I expected from one set.

I ran all of the normalization and error-correction steps on the reads for each sample separately. Then I assembled each sample separately with Trinity, concatenated the assemblies, and deduped the result.

In a second run, I concatenated all of the error-corrected, normalized reads, assembled them together with Trinity, and then deduped that single assembly. Results are as follows:

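Dedupe removed ~42% of the contigs from the concatenated assemblies (1,751,441 down to 1,016,986), but that still leaves about 1e6 contigs, which is very high. It removed only 2.7% of the contigs from the single assembly built from all of the reads (288,974 down to 281,091). This is another bee species, so that count is still high, but the assembly does contain isoforms.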
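
To put numbers on those percentages, here is a minimal Python sketch that recomputes them from the n (contig count) and sum (total bases) columns in the tables below. The dictionary keys and the `percent_removed` helper are names made up for this illustration; only the figures come from the tables.

```python
# Contig counts (n) and total bases (sum) copied from the stats tables in this post.
# The keys are hypothetical labels, not file names.
STATS = {
    "concat": {"n": 1751441, "sum": 3307534876},
    "concat_deduped": {"n": 1016986, "sum": 2667545297},
    "coassembly": {"n": 288974, "sum": 730984902},
    "coassembly_deduped": {"n": 281091, "sum": 712505764},
}


def percent_removed(before, after):
    """Return the percentage of `before` that is missing from `after`."""
    return 100.0 * (before - after) / before


for label, raw, deduped in [
    ("concatenated assemblies", "concat", "concat_deduped"),
    ("single co-assembly", "coassembly", "coassembly_deduped"),
]:
    by_contigs = percent_removed(STATS[raw]["n"], STATS[deduped]["n"])
    by_bases = percent_removed(STATS[raw]["sum"], STATS[deduped]["sum"])
    print(f"{label}: {by_contigs:.1f}% of contigs, {by_bases:.1f}% of bases removed")
```

By this arithmetic, dedupe dropped about 42% of the contigs from the concatenated set but only about 19% of its bases, which suggests (though it does not prove) that the removed contigs were shorter than average.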

The deduped concatenated set (02_dedupedConcatAssemblies.fasta, below) still has >1e6 contigs, but its length stats, apart from contig count and total bp, are similar to those of the other assemblies. I assume it still contains a ton of duplication, and I'm curious about how to tune dedupe.sh.
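
Since the comparison leans on the length stats, here is a rough Python sketch of one common reading of the n50 / n50_len and n90 / n90_len columns in the tables below: sort contigs longest first, then count how many are needed to cover 50% (or 90%) of the total bases, and report the length of the last contig added. That reading is an assumption about the stats tool's output, not something confirmed in this thread, and tools can differ on details such as tie handling. The `nx_stats` function and the toy lengths are invented for illustration.

```python
def nx_stats(lengths, fraction=0.5):
    """Return (count, length) for the given coverage fraction.

    count:  number of longest contigs needed to reach `fraction` of total bases
    length: length of the last (shortest) contig in that set
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return count, length
    # Only reachable if fraction > 1; fall back to the full set.
    return len(ordered), ordered[-1]


# Toy example with made-up contig lengths (not data from this post).
toy = [2, 2, 2, 3, 3, 4, 8, 8]
print(nx_stats(toy))        # (2, 8): two contigs of length 8 cover half of 32 bases
print(nx_stats(toy, 0.9))   # (7, 2)
```

Under that reading, a large contig count with a similar n50_len is what you would expect if many near-identical long contigs are still present, since extra copies add to both the total and the long end of the length distribution.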

Code:
# Individual assemblies
filename           sum        n       trim_n  min  med   mean  max    n50    n50_len  n90    n90_len
DLmRNA.fasta       72452700   54350   54350   201  528   1333  34738  7323   2938     29606  455
OLA125mRNA.fasta   247008720  141467  141467  201  625   1746  25949  17551  4258     69947  639
OLA14mRNA.fasta    303516447  141371  141371  201  854   2146  27338  19963  4855     67083  991
OLA215mRNA.fasta   266481896  147914  147914  201  665   1801  31511  18978  4226     72938  686
OLA28mRNA.fasta    219382126  139428  139428  201  522   1573  31117  15741  4164     69889  519
OLA65mRNA.fasta    364224298  154717  154717  201  1056  2354  29609  22538  5139     74439  1179
OLNAmRNA.fasta     137613039  98089   98089   201  488   1402  29749  11716  3511     51283  455
PP12mRNA.fasta     470714411  191853  191853  201  1026  2453  38729  27127  5531     90517  1207
PP15mRNA.fasta     403006775  177736  177736  201  934   2267  31633  25075  5056     84637  1071
PP6mRNA.fasta      189346636  134266  134266  201  494   1410  31708  15023  3658     70874  456
Pplus20mRNA.fasta  210036076  80828   80828   201  1502  2598  56106  12951  5062     41100  1442
PPmRNA.fasta       213918690  142302  142302  201  516   1503  35229  16062  3894     72927  496
PUmRNA.fasta       209833062  147120  147120  201  485   1426  36445  16145  3756     76586  458

# Individual assemblies concatenated and deduped
filename                          sum         n        trim_n   min  med   mean  max    n50     n50_len  n90     n90_len
01_concatenatedAssemblies.fasta   3307534876  1751441  1751441  201  662   1888  56106  221789  4561     837145  733
02_dedupedConcatAssemblies.fasta  2667545297  1016986  1016986  201  1399  2622  56106  161158  5318     509036  1394

# Reads concatenated from all samples, assembled together
filename                                       sum        n       trim_n  min  med   mean  max    n50    n50_len  n90     n90_len
ReadsConcatenatedBeforeAssembly.fasta          730984902  288974  288974  201  1210  2529  43812  42367  5432     140412  1302
DeDuped_ReadsConcatenatedBeforeAssembly.fasta  712505764  281091  281091  201  1207  2534  43812  41174  5450     136631  1297