Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • jweger1988
    Member
    • Apr 2017
    • 37

    Brian,

    Wow, that's very surprising to me.

    things we used to see on our NextSeqs until it eventually got better after various chemistry and software revisions
    Are you still currently using a nextseq? Do you recommend a MiSeq or a Hiseq instead? I assume by chemistry and software revisions you mean those put out by Illumina? I wonder if we don't have the latest software?

    Any suggestion to improve this in the future?

    Comment

    • GenoMax
      Senior Member
      • Feb 2008
      • 7142

      A MiSeq/HiSeq generally generates better data so if you do have the option of using one of those then do switch. Especially for lowfreq SNP calls.

      Comment

      • jweger1988
        Member
        • Apr 2017
        • 37

        Originally posted by GenoMax View Post
        A MiSeq/HiSeq generally generates better data so if you do have the option of using one of those then do switch. Especially for lowfreq SNP calls.
        Thanks Genomax.

        I'm just curious why Brian said the data was really low quality. We are using a Nextseq 500 and V2 chemistry. The Q30>30 was 80% and it was actually underclustered.

        We only have access to Nextseq at the moment, is there anything we can do to improve this or is it just how it is with Nextseq?

        Comment

        • Brian Bushnell
          Super Moderator
          • Jan 2014
          • 2709

          Originally posted by jweger1988 View Post
          I'm just curious why Brian said the data was really low quality. We are using a Nextseq 500 and V2 chemistry. The Q30>30 was 80% and it was actually underclustered.
          I would not use the "Q30" number for anything; I consider it marketing fluff. The only way to determine data quality is to actually map it and count the number of sequencing errors. You can see from the graph that when you do that, the average starts at Q11 (which is incredibly low and very unusual), goes up to a peak of just under 30, and drops to Q12 at the end, which is also very low. At no point does the average quality even reach Q30.

          You can see this visually in IGV - the screen is always littered with random blips indicating indels or mismatches to the reference, which is very unusual for Illumina data (at least, HiSeq 2500 Illumina data). The screen should look very clean except where mismatches to the reference form an obvious column, indicating a real mutation.

          We only have access to Nextseq at the moment, is there anything we can do to improve this or is it just how it is with Nextseq?
          You might want to make sure you are using the latest chemistry and software, and talk to your Illumina rep about your concerns with the quality. We are still using NextSeq routinely, and the quality is usually much better than that, but I don't know the reason.

          I don't think the platform is appropriate for low-frequency variant-calling anyway, but on any platform, I'd recommend paired-end reads for that (preferably overlapping so they can be merged), and definitely not strand-specific protocols, which incur bias (particularly, certain genomic features, read in a certain direction, are prone to miscalls). How did you end up with only minus-strand reads, anyway?

          Comment

          • boulund
            Member
            • Jan 2017
            • 18

            Hi,

            I have a question about statswrapper.sh. It says in the help text description that it supposedly runs stats.sh on multiple assemblies to produce one output line per file, but I only get a concatenation of regular stats.sh outputs, making it far less useful to me than I was hoping for. Is this the intended behavior or am I misunderstanding the help text formulation somehow?

            I was hoping to get some kind of overviewable summary table of all files in a single file, with one output line per file.

            Comment

            • GenoMax
              Senior Member
              • Feb 2008
              • 7142

              @boulund: Can you include the exact starswrapper.sh command you are running?

              Comment

              • Brian Bushnell
                Super Moderator
                • Jan 2014
                • 2709

                Hi Boulund,

                Please add the flag "format=3" to get one line per file.

                Comment

                • boulund
                  Member
                  • Jan 2017
                  • 18

                  Oh, that was easy. Thanks a lot for your quick response, both of you. Sorry for not spotting that option!

                  Originally posted by GenoMax View Post
                  @boulund: Can you include the exact starswrapper.sh command you are running?

                  The command I used was simple:
                  Code:
                  statswrapper.sh in=assembly_1.fasta in=assembly_2.fasta
                  With the "format=3" option it produces exactly the output I was expecting.

                  Comment

                  • luc
                    Senior Member
                    • Dec 2010
                    • 469

                    I am sorry I can't find the info - I am sure it is somewhere.

                    How does BBmap define insert size ( e.g of the insert size histograms)?
                    Distance between start point of the reads (forward and reverse) or between the ends of the reads?

                    #############
                    I just checked with a test file; as expected the insert size seems to be the distance between the start point of the reads.
                    Last edited by luc; 09-07-2017, 08:52 AM.

                    Comment

                    • Brian Bushnell
                      Super Moderator
                      • Jan 2014
                      • 2709

                      Hi Luc,

                      To make this very clear, it's the estimate of the size of sequenced genomic molecule. So, for example, if you have 2 reads like this, where R1 maps to the plus strand and R2 maps to the minus strand:

                      Code:
                         111111111111111
                      -------------------------------------------------------------------
                                                                22222222222222222
                      ...then it would generally be the distance from the leftmost 1 to the rightmost 2. But if either read contains insertions, those are added to the insert size, and deletions are subtracted from the insert size. Otherwise, the insert sizes would be dramatically overestimated in spliced RNA-seq data.

                      Comment

                      • darthsequencer
                        Member
                        • Feb 2012
                        • 35

                        Originally posted by Brian Bushnell View Post
                        "ambiguous=best" is a bit misleading, but it means the genomically first location with a maxmimum score will be used. "ambiguous=all" will report all locations within the ambiguity threshold of the first. This does not mean they need exactly the same score; it means that they are very close, so much so that none can be confidently determined to be the correct mapping location. Normally they're identical, but if for example one mapping had a single 1bp deletion and another mapping had two 1bp substitutions, the scores would be different, but would be close enough to be both reported. But if there was a third potential mapping with, say, 5 substitutions, that would be excluded. This can be controlled with the "secondarysitescoreratio" flag; if you set it to 1.0, only mappings with identical scores to the best score will be reported.
                        Hi - So if I set secondarysitescoreratio to 1.0 do I need to set secondary = T if ambiguous=all and increase the maxsites setting?

                        I guess what I'm really asking is what do I need to set to ensure that I'm getting all mapping locations of read with all the same top score.

                        Comment

                        • Brian Bushnell
                          Super Moderator
                          • Jan 2014
                          • 2709

                          That's correct for single-ended reads, but I do not recommend setting "secondarysitescoreratio = 1.0" for paired reads because one pairing is currently slightly favored over others due to the difficulty of tracking all possible pairings. So, just set "secondary=t ambig=all maxsites=100". In other words, it's not currently possible to ensure you get exactly and only all reads that have the exact same (best) alignment score unless they are processed unpaired.

                          Comment

                          • luc
                            Senior Member
                            • Dec 2010
                            • 469

                            Hi Brian,

                            thanks for the details -- there is always a another twist to it.

                            Originally posted by Brian Bushnell View Post
                            Hi Luc,

                            To make this very clear, it's the estimate of the size of sequenced genomic molecule. So, for example, if you have 2 reads like this, where R1 maps to the plus strand and R2 maps to the minus strand:

                            Code:
                               111111111111111
                            -------------------------------------------------------------------
                                                                      22222222222222222
                            ...then it would generally be the distance from the leftmost 1 to the rightmost 2. But if either read contains insertions, those are added to the insert size, and deletions are subtracted from the insert size. Otherwise, the insert sizes would be dramatically overestimated in spliced RNA-seq data.

                            Comment

                            • boulund
                              Member
                              • Jan 2017
                              • 18

                              Hi,

                              I have a question about the PercentOfPairs output found in the insert size histogram output (ihist). What does it mean, percent of pairs? Is it the proportion of read pairs where an insert size could be computed?

                              Comment

                              • Brian Bushnell
                                Super Moderator
                                • Jan 2014
                                • 2709

                                Originally posted by boulund View Post
                                Hi,

                                I have a question about the PercentOfPairs output found in the insert size histogram output (ihist). What does it mean, percent of pairs? Is it the proportion of read pairs where an insert size could be computed?
                                Yes, that's correct. This is especially important for BBMerge, where insert size can only be computed from reads that overlap; so if the average calculated insert size is say 270bp for 2x150bp reads, but "PercentOfPairs" is only 20%, then presumably the actual insert size is substantially higher but those pairs did not overlap so it could not be computed. It's less important for BBMap but still useful in some cases - if PercentOfPairs is very low, something probably went wrong and the insert size is not trustworthy. For example, if you have an extremely fragmented reference with contig length shorter than insert size, such that most paired reads map to different contigs, then the pairing rate will be low and the calculated insert size will be untrustworthy (shorter than the actual insert size).

                                Comment

                                Latest Articles

                                Collapse

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by SEQadmin2, 06-05-2026, 10:09 AM
                                0 responses
                                12 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-04-2026, 08:59 AM
                                0 responses
                                23 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-02-2026, 12:03 PM
                                0 responses
                                28 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-02-2026, 11:40 AM
                                0 responses
                                22 views
                                0 reactions
                                Last Post SEQadmin2  
                                Working...