SEQanswers

Go Back   SEQanswers > Introductions



Similar Threads
Thread Thread Starter Forum Replies Last Post
Getting genbank annotation file from JGI EricSchon Bioinformatics 3 05-27-2015 11:25 AM
some problems about Bambus2 Yue Xu Bioinformatics 0 10-15-2013 01:37 AM
Hello from Jim Tripp at JGI, Walnut Creek CA hjtripp Introductions 4 11-29-2012 09:35 AM
Anyone have enrichment problems? Ryanloop 454 Pyrosequencing 1 07-27-2012 03:59 PM
Tools for converting JGI annotation data to NCBI .asn AppleInformatics Bioinformatics 0 04-06-2011 10:00 AM

Reply
 
Thread Tools
Old 09-19-2015, 07:51 AM   #1
mcarson
Junior Member
 
Location: Canada

Join Date: Sep 2015
Posts: 5
Default Hello and JGI problems

Hello, totally new to the forum and somewhat new to the sequencing world. I've dealt with miseq data before when it arrives in the "standard" forward/reverse fastq reads with or without primers. I then go straight into PANDAseq, USEARCH8, and QIIME then process the outputs in R and I'm fairly comfortable with all of that.

My recent probelm comes from a large project that is using JGI for sequencing (miseq), which gives you data in a non-standard format. Does anyone have experience dealing with this to get me to the point I'm more familiar with. I've already figured out how to split the forward and reverse reads and extract barcodes but am having troubble with the quality filtering and trimming as well as the naming of the files. I'll be putting a large number of samples through this pipeline that will eventually have reused barcodes so any advice on renaming files is also appreciated.

Thanks in advance for the help and I look forward to hearing peoples suggestions!
mcarson is offline   Reply With Quote
Old 09-19-2015, 08:20 AM   #2
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 509
Default

I would be useful to post several lines of the non-standard format that you've encountered.
HESmith is offline   Reply With Quote
Old 09-19-2015, 08:52 AM   #3
mcarson
Junior Member
 
Location: Canada

Join Date: Sep 2015
Posts: 5
Default

@MISEQ06:359:000000000-AEFPR:1:1101:16407:1545 1:N:0:ACATCTTGACG
TTCTAGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTACAGCGCACGTAGGCGGTTTTTTAAGTCATTTTTGACATCCCCCGGCTCCACCTTGGCACTTCCTTTGCTACTTTCCTTCTTGAGTCTCTCAGAGTTCCTTTGCTTTCCCGTTTTAGCGGTGACCTTCTTCGATATCTGTATGCACCCCCTTGTCGCCTGCTGCTTTCTGTTTTCTCCCTGACTCTTCGGTCCGCACGCCTGGTGAGCCACCATGATTAGATACCCTTTTAGTCCTTCT
+
@BCCCGGGGFG@FGFFF@F:FGDGGGGGGDGGGGGGGEGGCGGGGGC,B,6CF,CC9C,CCGG+B,C6>+@>C####################################################################################################################################################################################################################################
@MISEQ06:359:000000000-AEFPR:1:1101:16407:1545 2:N:0:ACATCTTGACG
CCGCCCTCCTCGTTTTTCTTCTCCTTTTTCCTCCCCCCTCTTTCGCTCCTCCCCCTCTCTTTTTACCCCCTCCCCTCCCTTCCCCCTCTTTTTTCTTCCCTCTCTCTACCCTTTTCCCCTCTCCCTCCTCCCTTCCCCTCTCCTCTCTTCCACTCCCCTCCCTCTGTTCCCCTTTCCTTTCCCCTTTTTCCCCCTCCTCTTTCCCCCCTCCCTTCCCCCCCCCCCTCCCCCCCCTTTTCCCCCCTTCTTTCCTTTTCCCCCCCCCCCCCCCCTTTTTTCCCCCCCCTCTCCCCCTTCTCCC
+
-8,-,86,8;C---9696;,,6;;;6CC<,;6<EEE#########################################################################################################################################################################################################################################################################
@MISEQ06:359:000000000-AEFPR:1:1101:20149:1557 1:N:0:ACATCTTGACG
TGCAAGTGCCAGCCGCCGCGGTCATACGGAGGGTGCGAGCGTTGTTCTGCATTCTTGGGCGTACCGCGCGTGTAGGCGGTTTTTTCCGTCTTTTTTGCCCTCCCTCTTCTCCCCCCTGTCCCTGCCTTCGCTCCTTTCTTTCTTTCTTTCTTTCTCTTTTTTTTTCCTTCCTTTTTTCTCTTTTCACTCCTTCTCTCTCCTTCTTCACCCCTTTTTCTCCTTCTTCTTTCTTTCTCTCTCCTTACTCTCCTTCTCTCCCTCTTATTGCTCCCCCTTTATTATCTACCCTTTTAGTCCTTTT
+
@8ACCGGGFFG8FF:CF@F:FF+CFGGGGGGGGGGGGFGGDFGGGGF,C,<<C,CC,,:CFFC+,,8+>>>+B,,,:++86?AA@,,+:B?##################################################################################################################################################################################################################
@MISEQ06:359:000000000-AEFPR:1:1101:20149:1557 2:N:0:ACATCTTGACG
CCGCCCTACCTTTTTTTCTCCTCCTTTTTCCTCCCCTATCCTTCTTCCATCCCCCTCTCTCTTCCCCCCCTCCCCCCCCTTCCCCCCCTTTTTTCCTCCTTTTCTCTCCCTCTTTCCCCCCTCCCCCCCCCTTTCCCCTCCCCTCTCCCTTCTCCCCCTCTCCCCTTTTCCTCTCCCCTTCCCCTTTTCCCCCCCTCCCTTTCCCCTCCTCCTTTCCCCCCCCCCTCCCCCCCCTTTCCCCCCTTTCCTTCCTCCCCTCCCCCCCCCCCCCCTTTTTCCCCCTCTCGCTTCCCCTTCCCTC


This is the raw data from JGI. 1:N:0 is the forward 2:N:0 is the reverse and I've used extract_reads_from_interleaved_file.py to split them with no problems. From what I can tell the barcode is in the label not the sequence and you can extract them if need be using extract_barcodes.py -c barcode_in_label. Beyond that I'm a little lost. I've checked the fastq's and am fairly certain the forward read has a 5bp linker and the reverse has a 2bp one, followed by the V4 region primers (F5'GTGCCAGCMGCCGCGGTAA:R5'GGACTACHVGGGTWTCTAAT). I've also heard there were phiX internal controls added.

I hope to end up with linkers/adapters removed, as well as any other filtering/label renaming that needs to be done prior to aligning forward a reverse reads in PANDAseq.
mcarson is offline   Reply With Quote
Old 09-19-2015, 11:21 AM   #4
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 509
Default

The data are standard MiSeq output, PE-300, with reads one and two interleaved. As you note, the data have already been demultiplexed (indicated by the index 'ACATCTTGACG' in the read identifier). The phiX controls lack indices and will not be present in demultiplexed data.

The forward primer largely matches your expectation (5bp linker + GTGCCAGCMGCCGCGGTAA at the start of read one, although there's a penultimate C instead of A in the second example). However, the reverse primer sequence is not detectable in either example of read two. The caveat is that read quality is very low and the nucleotide bias is nearly 100% C/T, so it's hard to say for sure. It would be useful to check with high-quality examples of read two.

I would recommend evaluating the data with FASTQ for quality metrics. You can use BBMap's BBDuk.sh command to trim by quality, length, or sequence string. By renaming, I assume you mean adding read groups, which can be done using Picard.
HESmith is offline   Reply With Quote
Old 09-19-2015, 11:58 AM   #5
mcarson
Junior Member
 
Location: Canada

Join Date: Sep 2015
Posts: 5
Default

Thanks for the suggestions, and I'm pretty confident in the reverse read primer as I get tens of thousands of hits when I search the different variants in a text file. I'll look into BBMap, are there any "standards" you'd suggest for places to start with options for BBDuk.sh in terms of trim length or quality? After those steps will I more or less be at the normal Sample_Lane_R1_001.fastq outputs most places send?

The labels are more for the filenames as QIIME seems to have automatic outputs that only allow you to alter the directory name not the actual output filename. So all of mine are now just Reverse/Forward.fastq when I want them to all be unique in some way so I can loop them through PANDAseq using a InputFilename txt file reference. I can just refernce individual directories that hold the files but it's more my OCD.
mcarson is offline   Reply With Quote
Old 09-20-2015, 11:03 AM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

JGI's reads are always interleaved like that. You can deinterleave them if you want like this (using the BBMap package):

reformat.sh in=reads.fq out1=read1.fq out2=read2.fq

...but that just makes all the commands longer, so I don't recommend it until you get to a tool that will not accept interleaved reads.

If you plan to merge the reads (which I am assuming you do, since they're 16S data) you can adapter-trim them first, but you don't really need to quality-trim first, and in fact it can reduce your merge rate. You can adapter-trim right adapters caused by short inserts using BBDuk like this:

bbduk.sh in=reads.fq out=trimmed.fq ref=adapters.fa tbo tpe ktrim=r hdist=1 k=23 mink=11

...but short inserts should not be very common in amplicon libraries unless you are targeting a region shorter than read length. There is a thread here discussing BBDuk options in more depth.

To merge the reads, I recommend BBMerge, like this:

bbmerge.sh in=reads.fq out=merged.fq outu=unmerged.fq qtrim2=r trimq=10

"qtrim2" will first try to merge the reads, then only do quality-trimming if merging is unsuccessful at first. This will result in a higher merge rate than quality-trimming first.
Brian Bushnell is offline   Reply With Quote
Old 09-20-2015, 12:13 PM   #7
mcarson
Junior Member
 
Location: Canada

Join Date: Sep 2015
Posts: 5
Default

Brian thanks for the info. Do you happen to know of any online courses or lectures that cover these topics in more detail? I've been thrown into the bioinformatics realm and a big part of the struggle is just understanding all of the terms and steps associated with the data processing.

I'll give the BBMap suite a look over and see if I can't get this all sorted out this week. Would the adapters.fa be a file supplied by JGI, is there a defalt in the command, or is there some other standard I should be using? I don't think there was one available in the downloads file I have access to*.

*Just found the adapters.fa in the resources folder, I'd assume I just have to feed it the file pathway?

Last edited by mcarson; 09-20-2015 at 12:34 PM. Reason: added content
mcarson is offline   Reply With Quote
Old 09-20-2015, 06:39 PM   #8
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Sorry, I can't really point you toward any online courses for learning bioinformatics... I kind of learned it on the job. I know they exist, though. But, there are some threads here, and also some at Biostars (in the "tutorial" category) that reference various online resources.

As for adapters.fa, you can either copy it to the same directory, or set the absolute or relative path as the argument for "ref=".
Brian Bushnell is offline   Reply With Quote
Old 09-21-2015, 12:32 PM   #9
mcarson
Junior Member
 
Location: Canada

Join Date: Sep 2015
Posts: 5
Default

Brian, a couple final questions. I know the "normal" JGI pipeline uses PANDAseq to merge reads so I assume BBMap is used to adapter trim and quality filter. However it seems like your bbmerge is equivilant or better especially using the qtrim2 option. Any benefits to doing one over the other (you're not biased right...)?

HESmith also mentioned that the phiX should be gone since these were already demultiplexed. I'm assuming that's true or should I also do a contaminant filter?

Finally we normally get paired end reads without linkers and typically without primers. I'm assuming I should trim off the linkers as well as primers prior to bbmerge since it doesn't appear that there is an option for that (or does this not really affect alignment at all?). I think there is a sequence trim option in bbduck that I should be able to get that done in with, I'd assume it takes the standard nucleotide lettereing system?

Just an FYI,here is a test sample results:
bbduk
Input: 314652 reads 94710252 bases.
KTrimmed: 1344 reads (0.43%) 61190 bases (0.06%)
Trimmed by overlap: 4252 reads (1.35%) 8290 bases (0.01%)
Result: 314648 reads (100.00%) 94640772 bases (99.93%)

bbduk quality trim
Input: 314648 reads 94640772 bases.
QTrimmed: 278601 reads (88.54%) 16334356 bases (17.26%)
Result: 312960 reads (99.46%) 78306416 bases (82.74%)

bbmerge
Pairs: 157324
Joined: 136389 86.693%
Ambiguous: 11862 7.540%
No Solution: 9073 5.767%
Too Short: 0 0.000%

Avg Insert: 299.1
Standard Deviation: 6.9
Mode: 299

Insert range: 51 - 478
90th percentile: 299
75th percentile: 299
50th percentile: 299
25th percentile: 299
10th percentile: 299


Thanks again for all the help on this, I really appreciate it.
mcarson is offline   Reply With Quote
Old 09-21-2015, 02:23 PM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by mcarson View Post
Brian, a couple final questions. I know the "normal" JGI pipeline uses PANDAseq to merge reads so I assume BBMap is used to adapter trim and quality filter. However it seems like your bbmerge is equivilant or better especially using the qtrim2 option. Any benefits to doing one over the other (you're not biased right...)?
JGI's Itagger pipeline currently uses Flash by default for 16S, and PANDAseq for ITS. I'm not sure why it is set up that way. But, these were determined when it was written, which was before BBMerge existed. There is a plan to switch to BBMerge when it is updated, but I don't know when that might happen.

BBMerge exceeds the accuracy of all other mergers that I've tested, which includes all that I am aware of with the exception of PANDAseq. Unfortunately, PANDAseq has very specific requirements for read header names and it will refuse to process data with headers that do not look like they are from Illumina. So, I was unable to benchmark it on synthetic data with known answers stored in the read headers, which allows the results to be verified. I contacted the author, but he does not intend to change that requirement.

My test results from a few months ago:



The different points for each program indicate the true and false positive merge rates at different confidence thresholds; upper-left corner of the graph is optimal. The dotted line indicates the highest possible correct merge rate due to overlap information alone, as only 84% of the reads actually overlapped. The points with black centers indicate default settings for the program.

As for whether I'm biased, well... it's hard to judge in yourself But I don't recommend my tools in cases where I know of a better alternative.

Quote:
HESmith also mentioned that the phiX should be gone since these were already demultiplexed. I'm assuming that's true or should I also do a contaminant filter?
That's correct; all of the phiX should be gone. In practice, though, sometimes the reads break and chimerically rejoin; sometimes adapters ligate to the wrong thing; and sometimes clusters are too close together, causing barcode misassignment. These can yield phiX in the demultiplexed output (as well as other cross-contamination and assorted junk). The level should be very low, but I always find some, if there enough reads.

Quote:
Finally we normally get paired end reads without linkers and typically without primers. I'm assuming I should trim off the linkers as well as primers prior to bbmerge since it doesn't appear that there is an option for that (or does this not really affect alignment at all?).
Trimming adapters prior to merging will very slightly improve accuracy, as there are slightly fewer incorrect alternative answers from which to select the optimal overlap.

Quote:
I think there is a sequence trim option in bbduk that I should be able to get that done in with, I'd assume it takes the standard nucleotide lettering system?
Yes, you can specify your own adapter sequence as a fasta file, or on the command line, e.g. "literal=ACGTTGCA...".

Quote:
bbmerge
Pairs: 157324
Joined: 136389 86.693%
Ambiguous: 11862 7.540%
No Solution: 9073 5.767%
Too Short: 0 0.000%

Avg Insert: 299.1
Standard Deviation: 6.9
Mode: 299

Insert range: 51 - 478
90th percentile: 299
75th percentile: 299
50th percentile: 299
25th percentile: 299
10th percentile: 299
Looks like your insert sizes are generally 299bp With that knowledge, you could increase the merge rate by, for example, adding the flags "mininsert=50 minoverlap=50" to reduce the search space, and thus reduce the number of ambiguous pairs. Or even "mininsert=100 minoverlap=100".

Quote:
Thanks again for all the help on this, I really appreciate it.

You're welcome!
Attached Images
File Type: png BBMerge_May_2015.png (47.0 KB, 29 views)
Brian Bushnell is offline   Reply With Quote
Reply

Tags
bioinfomatics, jgi, miseq, quality filtering

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:51 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO