Seqanswers Leaderboard Ad

**HESmith** · 09-19-2015, 08:20 AM

I would be useful to post several lines of the non-standard format that you've encountered.

**mcarson** · 09-19-2015, 08:52 AM

@MISEQ06:359:000000000-AEFPR:1:1101:16407:1545 1:N:0:ACATCTTGACG
TTCTAGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTACAGCGCACGTAGGCGGTTTTTTAAGTCATTTTTGACATCCCCCGGCTCCACCTTGGCACTTCCTTTGCTACTTTCCTTCTTGAGTCTCTCAGAGTTCCTTTGCTTTCCCGTTTTAGCGGTGACCTTCTTCGATATCTGTATGCACCCCCTTGTCGCCTGCTGCTTTCTGTTTTCTCCCTGACTCTTCGGTCCGCACGCCTGGTGAGCCACCATGATTAGATACCCTTTTAGTCCTTCT
+
@BCCCGGGGFG@FGFFF@F:FGDGGGGGGDGGGGGGGEGGCGGGGGC,B,6CF,CC9C,CCGG+B,C6>+@>C####################################################################################################################################################################################################################################
@MISEQ06:359:000000000-AEFPR:1:1101:16407:1545 2:N:0:ACATCTTGACG
CCGCCCTCCTCGTTTTTCTTCTCCTTTTTCCTCCCCCCTCTTTCGCTCCTCCCCCTCTCTTTTTACCCCCTCCCCTCCCTTCCCCCTCTTTTTTCTTCCCTCTCTCTACCCTTTTCCCCTCTCCCTCCTCCCTTCCCCTCTCCTCTCTTCCACTCCCCTCCCTCTGTTCCCCTTTCCTTTCCCCTTTTTCCCCCTCCTCTTTCCCCCCTCCCTTCCCCCCCCCCCTCCCCCCCCTTTTCCCCCCTTCTTTCCTTTTCCCCCCCCCCCCCCCCTTTTTTCCCCCCCCTCTCCCCCTTCTCCC
+
-8,-,86,8;C---9696;,,6;;;6CC<,;6<EEE#########################################################################################################################################################################################################################################################################
@MISEQ06:359:000000000-AEFPR:1:1101:20149:1557 1:N:0:ACATCTTGACG
TGCAAGTGCCAGCCGCCGCGGTCATACGGAGGGTGCGAGCGTTGTTCTGCATTCTTGGGCGTACCGCGCGTGTAGGCGGTTTTTTCCGTCTTTTTTGCCCTCCCTCTTCTCCCCCCTGTCCCTGCCTTCGCTCCTTTCTTTCTTTCTTTCTTTCTCTTTTTTTTTCCTTCCTTTTTTCTCTTTTCACTCCTTCTCTCTCCTTCTTCACCCCTTTTTCTCCTTCTTCTTTCTTTCTCTCTCCTTACTCTCCTTCTCTCCCTCTTATTGCTCCCCCTTTATTATCTACCCTTTTAGTCCTTTT
+
@8ACCGGGFFG8FF:CF@F:FF+CFGGGGGGGGGGGGFGGDFGGGGF,C,<<C,CC,,:CFFC+,,8+>>>+B,,,:++86?AA@,,+:B?##################################################################################################################################################################################################################
@MISEQ06:359:000000000-AEFPR:1:1101:20149:1557 2:N:0:ACATCTTGACG
CCGCCCTACCTTTTTTTCTCCTCCTTTTTCCTCCCCTATCCTTCTTCCATCCCCCTCTCTCTTCCCCCCCTCCCCCCCCTTCCCCCCCTTTTTTCCTCCTTTTCTCTCCCTCTTTCCCCCCTCCCCCCCCCTTTCCCCTCCCCTCTCCCTTCTCCCCCTCTCCCCTTTTCCTCTCCCCTTCCCCTTTTCCCCCCCTCCCTTTCCCCTCCTCCTTTCCCCCCCCCCTCCCCCCCCTTTCCCCCCTTTCCTTCCTCCCCTCCCCCCCCCCCCCCTTTTTCCCCCTCTCGCTTCCCCTTCCCTC

This is the raw data from JGI. 1:N:0 is the forward 2:N:0 is the reverse and I've used extract_reads_from_interleaved_file.py to split them with no problems. From what I can tell the barcode is in the label not the sequence and you can extract them if need be using extract_barcodes.py -c barcode_in_label. Beyond that I'm a little lost. I've checked the fastq's and am fairly certain the forward read has a 5bp linker and the reverse has a 2bp one, followed by the V4 region primers (F5'GTGCCAGCMGCCGCGGTAA:R5'GGACTACHVGGGTWTCTAAT). I've also heard there were phiX internal controls added.

I hope to end up with linkers/adapters removed, as well as any other filtering/label renaming that needs to be done prior to aligning forward a reverse reads in PANDAseq.

**HESmith** · 09-19-2015, 11:21 AM

The data are standard MiSeq output, PE-300, with reads one and two interleaved. As you note, the data have already been demultiplexed (indicated by the index 'ACATCTTGACG' in the read identifier). The phiX controls lack indices and will not be present in demultiplexed data.

The forward primer largely matches your expectation (5bp linker + GTGCCAGCMGCCGCGGTAA at the start of read one, although there's a penultimate C instead of A in the second example). However, the reverse primer sequence is not detectable in either example of read two. The caveat is that read quality is very low and the nucleotide bias is nearly 100% C/T, so it's hard to say for sure. It would be useful to check with high-quality examples of read two.

I would recommend evaluating the data with FASTQ for quality metrics. You can use BBMap's BBDuk.sh command to trim by quality, length, or sequence string. By renaming, I assume you mean adding read groups, which can be done using Picard.

**mcarson** · 09-19-2015, 11:58 AM

Thanks for the suggestions, and I'm pretty confident in the reverse read primer as I get tens of thousands of hits when I search the different variants in a text file. I'll look into BBMap, are there any "standards" you'd suggest for places to start with options for BBDuk.sh in terms of trim length or quality? After those steps will I more or less be at the normal Sample_Lane_R1_001.fastq outputs most places send?

The labels are more for the filenames as QIIME seems to have automatic outputs that only allow you to alter the directory name not the actual output filename. So all of mine are now just Reverse/Forward.fastq when I want them to all be unique in some way so I can loop them through PANDAseq using a InputFilename txt file reference. I can just refernce individual directories that hold the files but it's more my OCD.

**Brian Bushnell** · 09-20-2015, 11:03 AM

JGI's reads are always interleaved like that. You can deinterleave them if you want like this (using the BBMap package):

reformat.sh in=reads.fq out1=read1.fq out2=read2.fq

...but that just makes all the commands longer, so I don't recommend it until you get to a tool that will not accept interleaved reads.

If you plan to merge the reads (which I am assuming you do, since they're 16S data) you can adapter-trim them first, but you don't really need to quality-trim first, and in fact it can reduce your merge rate. You can adapter-trim right adapters caused by short inserts using BBDuk like this:

bbduk.sh in=reads.fq out=trimmed.fq ref=adapters.fa tbo tpe ktrim=r hdist=1 k=23 mink=11

...but short inserts should not be very common in amplicon libraries unless you are targeting a region shorter than read length. There is a thread here discussing BBDuk options in more depth.

To merge the reads, I recommend BBMerge, like this:

bbmerge.sh in=reads.fq out=merged.fq outu=unmerged.fq qtrim2=r trimq=10

"qtrim2" will first try to merge the reads, then only do quality-trimming if merging is unsuccessful at first. This will result in a higher merge rate than quality-trimming first.

**mcarson** · 09-20-2015, 12:13 PM

Brian thanks for the info. Do you happen to know of any online courses or lectures that cover these topics in more detail? I've been thrown into the bioinformatics realm and a big part of the struggle is just understanding all of the terms and steps associated with the data processing.

I'll give the BBMap suite a look over and see if I can't get this all sorted out this week. Would the adapters.fa be a file supplied by JGI, is there a defalt in the command, or is there some other standard I should be using? I don't think there was one available in the downloads file I have access to*.

*Just found the adapters.fa in the resources folder, I'd assume I just have to feed it the file pathway?

**Brian Bushnell** · 09-20-2015, 06:39 PM

Sorry, I can't really point you toward any online courses for learning bioinformatics... I kind of learned it on the job. I know they exist, though. But, there are some threads here, and also some at Biostars (in the "tutorial" category) that reference various online resources.

As for adapters.fa, you can either copy it to the same directory, or set the absolute or relative path as the argument for "ref=".

**mcarson** · 09-21-2015, 12:32 PM

Brian, a couple final questions. I know the "normal" JGI pipeline uses PANDAseq to merge reads so I assume BBMap is used to adapter trim and quality filter. However it seems like your bbmerge is equivilant or better especially using the qtrim2 option. Any benefits to doing one over the other (you're not biased right...)?

HESmith also mentioned that the phiX should be gone since these were already demultiplexed. I'm assuming that's true or should I also do a contaminant filter?

Finally we normally get paired end reads without linkers and typically without primers. I'm assuming I should trim off the linkers as well as primers prior to bbmerge since it doesn't appear that there is an option for that (or does this not really affect alignment at all?). I think there is a sequence trim option in bbduck that I should be able to get that done in with, I'd assume it takes the standard nucleotide lettereing system?

Just an FYI,here is a test sample results:
bbduk
Input: 314652 reads 94710252 bases.
KTrimmed: 1344 reads (0.43%) 61190 bases (0.06%)
Trimmed by overlap: 4252 reads (1.35%) 8290 bases (0.01%)
Result: 314648 reads (100.00%) 94640772 bases (99.93%)

bbduk quality trim
Input: 314648 reads 94640772 bases.
QTrimmed: 278601 reads (88.54%) 16334356 bases (17.26%)
Result: 312960 reads (99.46%) 78306416 bases (82.74%)

bbmerge
Pairs: 157324
Joined: 136389 86.693%
Ambiguous: 11862 7.540%
No Solution: 9073 5.767%
Too Short: 0 0.000%

Avg Insert: 299.1
Standard Deviation: 6.9
Mode: 299

Insert range: 51 - 478
90th percentile: 299
75th percentile: 299
50th percentile: 299
25th percentile: 299
10th percentile: 299

Thanks again for all the help on this, I really appreciate it.

**Brian Bushnell** · 09-21-2015, 02:23 PM

Originally posted by mcarson View Post

Brian, a couple final questions. I know the "normal" JGI pipeline uses PANDAseq to merge reads so I assume BBMap is used to adapter trim and quality filter. However it seems like your bbmerge is equivilant or better especially using the qtrim2 option. Any benefits to doing one over the other (you're not biased right...)?

JGI's Itagger pipeline currently uses Flash by default for 16S, and PANDAseq for ITS. I'm not sure why it is set up that way. But, these were determined when it was written, which was before BBMerge existed. There is a plan to switch to BBMerge when it is updated, but I don't know when that might happen.

BBMerge exceeds the accuracy of all other mergers that I've tested, which includes all that I am aware of with the exception of PANDAseq. Unfortunately, PANDAseq has very specific requirements for read header names and it will refuse to process data with headers that do not look like they are from Illumina. So, I was unable to benchmark it on synthetic data with known answers stored in the read headers, which allows the results to be verified. I contacted the author, but he does not intend to change that requirement.

My test results from a few months ago:

The different points for each program indicate the true and false positive merge rates at different confidence thresholds; upper-left corner of the graph is optimal. The dotted line indicates the highest possible correct merge rate due to overlap information alone, as only 84% of the reads actually overlapped. The points with black centers indicate default settings for the program.

As for whether I'm biased, well... it's hard to judge in yourself

But I don't recommend my tools in cases where I know of a better alternative.

HESmith also mentioned that the phiX should be gone since these were already demultiplexed. I'm assuming that's true or should I also do a contaminant filter?

That's correct; all of the phiX should be gone. In practice, though, sometimes the reads break and chimerically rejoin; sometimes adapters ligate to the wrong thing; and sometimes clusters are too close together, causing barcode misassignment. These can yield phiX in the demultiplexed output (as well as other cross-contamination and assorted junk). The level should be very low, but I always find some, if there enough reads.

Finally we normally get paired end reads without linkers and typically without primers. I'm assuming I should trim off the linkers as well as primers prior to bbmerge since it doesn't appear that there is an option for that (or does this not really affect alignment at all?).

Trimming adapters prior to merging will very slightly improve accuracy, as there are slightly fewer incorrect alternative answers from which to select the optimal overlap.

I think there is a sequence trim option in bbduk that I should be able to get that done in with, I'd assume it takes the standard nucleotide lettering system?

Yes, you can specify your own adapter sequence as a fasta file, or on the command line, e.g. "literal=ACGTTGCA...".

bbmerge
Pairs: 157324
Joined: 136389 86.693%
Ambiguous: 11862 7.540%
No Solution: 9073 5.767%
Too Short: 0 0.000%

Avg Insert: 299.1
Standard Deviation: 6.9
Mode: 299

Insert range: 51 - 478
90th percentile: 299
75th percentile: 299
50th percentile: 299
25th percentile: 299
10th percentile: 299

Looks like your insert sizes are generally 299bp

With that knowledge, you could increase the merge rate by, for example, adding the flags "mininsert=50 minoverlap=50" to reduce the search space, and thus reduce the number of ambiguous pairs. Or even "mininsert=100 minoverlap=100".

Thanks again for all the help on this, I really appreciate it.

You're welcome!

Attached Files

BBMerge_May_2015.png (47.0 KB, 166 views)

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Hello and JGI problems

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News