Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Hello and JGI problems

    Hello, totally new to the forum and somewhat new to the sequencing world. I've dealt with miseq data before when it arrives in the "standard" forward/reverse fastq reads with or without primers. I then go straight into PANDAseq, USEARCH8, and QIIME then process the outputs in R and I'm fairly comfortable with all of that.

    My recent probelm comes from a large project that is using JGI for sequencing (miseq), which gives you data in a non-standard format. Does anyone have experience dealing with this to get me to the point I'm more familiar with. I've already figured out how to split the forward and reverse reads and extract barcodes but am having troubble with the quality filtering and trimming as well as the naming of the files. I'll be putting a large number of samples through this pipeline that will eventually have reused barcodes so any advice on renaming files is also appreciated.

    Thanks in advance for the help and I look forward to hearing peoples suggestions!

  • #2
    I would be useful to post several lines of the non-standard format that you've encountered.

    Comment


    • #3
      @MISEQ06:359:000000000-AEFPR:1:1101:16407:1545 1:N:0:ACATCTTGACG
      TTCTAGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTACAGCGCACGTAGGCGGTTTTTTAAGTCATTTTTGACATCCCCCGGCTCCACCTTGGCACTTCCTTTGCTACTTTCCTTCTTGAGTCTCTCAGAGTTCCTTTGCTTTCCCGTTTTAGCGGTGACCTTCTTCGATATCTGTATGCACCCCCTTGTCGCCTGCTGCTTTCTGTTTTCTCCCTGACTCTTCGGTCCGCACGCCTGGTGAGCCACCATGATTAGATACCCTTTTAGTCCTTCT
      +
      @BCCCGGGGFG@FGFFF@F:FGDGGGGGGDGGGGGGGEGGCGGGGGC,B,6CF,CC9C,CCGG+B,C6>+@>C####################################################################################################################################################################################################################################
      @MISEQ06:359:000000000-AEFPR:1:1101:16407:1545 2:N:0:ACATCTTGACG
      CCGCCCTCCTCGTTTTTCTTCTCCTTTTTCCTCCCCCCTCTTTCGCTCCTCCCCCTCTCTTTTTACCCCCTCCCCTCCCTTCCCCCTCTTTTTTCTTCCCTCTCTCTACCCTTTTCCCCTCTCCCTCCTCCCTTCCCCTCTCCTCTCTTCCACTCCCCTCCCTCTGTTCCCCTTTCCTTTCCCCTTTTTCCCCCTCCTCTTTCCCCCCTCCCTTCCCCCCCCCCCTCCCCCCCCTTTTCCCCCCTTCTTTCCTTTTCCCCCCCCCCCCCCCCTTTTTTCCCCCCCCTCTCCCCCTTCTCCC
      +
      -8,-,86,8;C---9696;,,6;;;6CC<,;6<EEE#########################################################################################################################################################################################################################################################################
      @MISEQ06:359:000000000-AEFPR:1:1101:20149:1557 1:N:0:ACATCTTGACG
      TGCAAGTGCCAGCCGCCGCGGTCATACGGAGGGTGCGAGCGTTGTTCTGCATTCTTGGGCGTACCGCGCGTGTAGGCGGTTTTTTCCGTCTTTTTTGCCCTCCCTCTTCTCCCCCCTGTCCCTGCCTTCGCTCCTTTCTTTCTTTCTTTCTTTCTCTTTTTTTTTCCTTCCTTTTTTCTCTTTTCACTCCTTCTCTCTCCTTCTTCACCCCTTTTTCTCCTTCTTCTTTCTTTCTCTCTCCTTACTCTCCTTCTCTCCCTCTTATTGCTCCCCCTTTATTATCTACCCTTTTAGTCCTTTT
      +
      @8ACCGGGFFG8FF:CF@F:FF+CFGGGGGGGGGGGGFGGDFGGGGF,C,<<C,CC,,:CFFC+,,8+>>>+B,,,:++86?AA@,,+:B?##################################################################################################################################################################################################################
      @MISEQ06:359:000000000-AEFPR:1:1101:20149:1557 2:N:0:ACATCTTGACG
      CCGCCCTACCTTTTTTTCTCCTCCTTTTTCCTCCCCTATCCTTCTTCCATCCCCCTCTCTCTTCCCCCCCTCCCCCCCCTTCCCCCCCTTTTTTCCTCCTTTTCTCTCCCTCTTTCCCCCCTCCCCCCCCCTTTCCCCTCCCCTCTCCCTTCTCCCCCTCTCCCCTTTTCCTCTCCCCTTCCCCTTTTCCCCCCCTCCCTTTCCCCTCCTCCTTTCCCCCCCCCCTCCCCCCCCTTTCCCCCCTTTCCTTCCTCCCCTCCCCCCCCCCCCCCTTTTTCCCCCTCTCGCTTCCCCTTCCCTC


      This is the raw data from JGI. 1:N:0 is the forward 2:N:0 is the reverse and I've used extract_reads_from_interleaved_file.py to split them with no problems. From what I can tell the barcode is in the label not the sequence and you can extract them if need be using extract_barcodes.py -c barcode_in_label. Beyond that I'm a little lost. I've checked the fastq's and am fairly certain the forward read has a 5bp linker and the reverse has a 2bp one, followed by the V4 region primers (F5'GTGCCAGCMGCCGCGGTAA:R5'GGACTACHVGGGTWTCTAAT). I've also heard there were phiX internal controls added.

      I hope to end up with linkers/adapters removed, as well as any other filtering/label renaming that needs to be done prior to aligning forward a reverse reads in PANDAseq.

      Comment


      • #4
        The data are standard MiSeq output, PE-300, with reads one and two interleaved. As you note, the data have already been demultiplexed (indicated by the index 'ACATCTTGACG' in the read identifier). The phiX controls lack indices and will not be present in demultiplexed data.

        The forward primer largely matches your expectation (5bp linker + GTGCCAGCMGCCGCGGTAA at the start of read one, although there's a penultimate C instead of A in the second example). However, the reverse primer sequence is not detectable in either example of read two. The caveat is that read quality is very low and the nucleotide bias is nearly 100% C/T, so it's hard to say for sure. It would be useful to check with high-quality examples of read two.

        I would recommend evaluating the data with FASTQ for quality metrics. You can use BBMap's BBDuk.sh command to trim by quality, length, or sequence string. By renaming, I assume you mean adding read groups, which can be done using Picard.

        Comment


        • #5
          Thanks for the suggestions, and I'm pretty confident in the reverse read primer as I get tens of thousands of hits when I search the different variants in a text file. I'll look into BBMap, are there any "standards" you'd suggest for places to start with options for BBDuk.sh in terms of trim length or quality? After those steps will I more or less be at the normal Sample_Lane_R1_001.fastq outputs most places send?

          The labels are more for the filenames as QIIME seems to have automatic outputs that only allow you to alter the directory name not the actual output filename. So all of mine are now just Reverse/Forward.fastq when I want them to all be unique in some way so I can loop them through PANDAseq using a InputFilename txt file reference. I can just refernce individual directories that hold the files but it's more my OCD.

          Comment


          • #6
            JGI's reads are always interleaved like that. You can deinterleave them if you want like this (using the BBMap package):

            reformat.sh in=reads.fq out1=read1.fq out2=read2.fq

            ...but that just makes all the commands longer, so I don't recommend it until you get to a tool that will not accept interleaved reads.

            If you plan to merge the reads (which I am assuming you do, since they're 16S data) you can adapter-trim them first, but you don't really need to quality-trim first, and in fact it can reduce your merge rate. You can adapter-trim right adapters caused by short inserts using BBDuk like this:

            bbduk.sh in=reads.fq out=trimmed.fq ref=adapters.fa tbo tpe ktrim=r hdist=1 k=23 mink=11

            ...but short inserts should not be very common in amplicon libraries unless you are targeting a region shorter than read length. There is a thread here discussing BBDuk options in more depth.

            To merge the reads, I recommend BBMerge, like this:

            bbmerge.sh in=reads.fq out=merged.fq outu=unmerged.fq qtrim2=r trimq=10

            "qtrim2" will first try to merge the reads, then only do quality-trimming if merging is unsuccessful at first. This will result in a higher merge rate than quality-trimming first.

            Comment


            • #7
              Brian thanks for the info. Do you happen to know of any online courses or lectures that cover these topics in more detail? I've been thrown into the bioinformatics realm and a big part of the struggle is just understanding all of the terms and steps associated with the data processing.

              I'll give the BBMap suite a look over and see if I can't get this all sorted out this week. Would the adapters.fa be a file supplied by JGI, is there a defalt in the command, or is there some other standard I should be using? I don't think there was one available in the downloads file I have access to*.

              *Just found the adapters.fa in the resources folder, I'd assume I just have to feed it the file pathway?
              Last edited by mcarson; 09-20-2015, 12:34 PM. Reason: added content

              Comment


              • #8
                Sorry, I can't really point you toward any online courses for learning bioinformatics... I kind of learned it on the job. I know they exist, though. But, there are some threads here, and also some at Biostars (in the "tutorial" category) that reference various online resources.

                As for adapters.fa, you can either copy it to the same directory, or set the absolute or relative path as the argument for "ref=".

                Comment


                • #9
                  Brian, a couple final questions. I know the "normal" JGI pipeline uses PANDAseq to merge reads so I assume BBMap is used to adapter trim and quality filter. However it seems like your bbmerge is equivilant or better especially using the qtrim2 option. Any benefits to doing one over the other (you're not biased right...)?

                  HESmith also mentioned that the phiX should be gone since these were already demultiplexed. I'm assuming that's true or should I also do a contaminant filter?

                  Finally we normally get paired end reads without linkers and typically without primers. I'm assuming I should trim off the linkers as well as primers prior to bbmerge since it doesn't appear that there is an option for that (or does this not really affect alignment at all?). I think there is a sequence trim option in bbduck that I should be able to get that done in with, I'd assume it takes the standard nucleotide lettereing system?

                  Just an FYI,here is a test sample results:
                  bbduk
                  Input: 314652 reads 94710252 bases.
                  KTrimmed: 1344 reads (0.43%) 61190 bases (0.06%)
                  Trimmed by overlap: 4252 reads (1.35%) 8290 bases (0.01%)
                  Result: 314648 reads (100.00%) 94640772 bases (99.93%)

                  bbduk quality trim
                  Input: 314648 reads 94640772 bases.
                  QTrimmed: 278601 reads (88.54%) 16334356 bases (17.26%)
                  Result: 312960 reads (99.46%) 78306416 bases (82.74%)

                  bbmerge
                  Pairs: 157324
                  Joined: 136389 86.693%
                  Ambiguous: 11862 7.540%
                  No Solution: 9073 5.767%
                  Too Short: 0 0.000%

                  Avg Insert: 299.1
                  Standard Deviation: 6.9
                  Mode: 299

                  Insert range: 51 - 478
                  90th percentile: 299
                  75th percentile: 299
                  50th percentile: 299
                  25th percentile: 299
                  10th percentile: 299


                  Thanks again for all the help on this, I really appreciate it.

                  Comment


                  • #10
                    Originally posted by mcarson View Post
                    Brian, a couple final questions. I know the "normal" JGI pipeline uses PANDAseq to merge reads so I assume BBMap is used to adapter trim and quality filter. However it seems like your bbmerge is equivilant or better especially using the qtrim2 option. Any benefits to doing one over the other (you're not biased right...)?
                    JGI's Itagger pipeline currently uses Flash by default for 16S, and PANDAseq for ITS. I'm not sure why it is set up that way. But, these were determined when it was written, which was before BBMerge existed. There is a plan to switch to BBMerge when it is updated, but I don't know when that might happen.

                    BBMerge exceeds the accuracy of all other mergers that I've tested, which includes all that I am aware of with the exception of PANDAseq. Unfortunately, PANDAseq has very specific requirements for read header names and it will refuse to process data with headers that do not look like they are from Illumina. So, I was unable to benchmark it on synthetic data with known answers stored in the read headers, which allows the results to be verified. I contacted the author, but he does not intend to change that requirement.

                    My test results from a few months ago:



                    The different points for each program indicate the true and false positive merge rates at different confidence thresholds; upper-left corner of the graph is optimal. The dotted line indicates the highest possible correct merge rate due to overlap information alone, as only 84% of the reads actually overlapped. The points with black centers indicate default settings for the program.

                    As for whether I'm biased, well... it's hard to judge in yourself But I don't recommend my tools in cases where I know of a better alternative.

                    HESmith also mentioned that the phiX should be gone since these were already demultiplexed. I'm assuming that's true or should I also do a contaminant filter?
                    That's correct; all of the phiX should be gone. In practice, though, sometimes the reads break and chimerically rejoin; sometimes adapters ligate to the wrong thing; and sometimes clusters are too close together, causing barcode misassignment. These can yield phiX in the demultiplexed output (as well as other cross-contamination and assorted junk). The level should be very low, but I always find some, if there enough reads.

                    Finally we normally get paired end reads without linkers and typically without primers. I'm assuming I should trim off the linkers as well as primers prior to bbmerge since it doesn't appear that there is an option for that (or does this not really affect alignment at all?).
                    Trimming adapters prior to merging will very slightly improve accuracy, as there are slightly fewer incorrect alternative answers from which to select the optimal overlap.

                    I think there is a sequence trim option in bbduk that I should be able to get that done in with, I'd assume it takes the standard nucleotide lettering system?
                    Yes, you can specify your own adapter sequence as a fasta file, or on the command line, e.g. "literal=ACGTTGCA...".

                    bbmerge
                    Pairs: 157324
                    Joined: 136389 86.693%
                    Ambiguous: 11862 7.540%
                    No Solution: 9073 5.767%
                    Too Short: 0 0.000%

                    Avg Insert: 299.1
                    Standard Deviation: 6.9
                    Mode: 299

                    Insert range: 51 - 478
                    90th percentile: 299
                    75th percentile: 299
                    50th percentile: 299
                    25th percentile: 299
                    10th percentile: 299
                    Looks like your insert sizes are generally 299bp With that knowledge, you could increase the merge rate by, for example, adding the flags "mininsert=50 minoverlap=50" to reduce the search space, and thus reduce the number of ambiguous pairs. Or even "mininsert=100 minoverlap=100".

                    Thanks again for all the help on this, I really appreciate it.

                    You're welcome!
                    Attached Files

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM
                    • seqadmin
                      Techniques and Challenges in Conservation Genomics
                      by seqadmin



                      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                      Avian Conservation
                      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                      03-08-2024, 10:41 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 06:37 PM
                    0 responses
                    10 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, Yesterday, 06:07 PM
                    0 responses
                    9 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-22-2024, 10:03 AM
                    0 responses
                    51 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-21-2024, 07:32 AM
                    0 responses
                    67 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X