Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #46
    Originally posted by M4TTN View Post
    But ideally, it would be good to rid our gDNA preps of the plasmid as much as possible, so I'll look into some protocols to see if it is possible to selectively partition the gDNA and 2 micron.
    This should be very simple to do with informatically with BBSplit.

    BTW: Does UGENE have latest versions of all software (bwa/bowtie2)? Are you able to tweak parameters (I have not used UGENE)?

    It is normal to have a fraction of your read not map at all. You do have a high percentage of duplicates. Are they PCR-type duplicates (identical start and end when mapped)?
    Last edited by GenoMax; 03-27-2015, 08:54 AM.

    Comment


    • #47
      The tlen field is determined by the aligner, so different aligners will give you different numbers there. BBMap does produce correct insert size histograms for overlapping reads, without having to process the sam file, so I'd be interested to see what its histogram looks like. You can generate it like this:

      bbmap.sh ref=reference.fa in=reads.fq ihist=ihist.txt

      Comment


      • #48
        Originally posted by GenoMax View Post
        This should be very simple to do with informatically with BBSplit.

        BTW: Does UGENE have latest versions of all software (bwa/bowtie2)? Are you able to tweak parameters (I have not used UGENE)?

        It is normal to have a fraction of your read not map at all. You do have a high percentage of duplicates. Are they PCR-type duplicates (identical start and end when mapped)?
        We are making progress: At least in some of the samples, the 10% unmapped reads are coming from the 2 micron plasmid (approx 200x copies per cell it seems), and also from mitochondrial genome, which although 10x larger than the 2 micron plasmid, is present at only about 1-2 copies per cell so is not a big deal.)

        Interestingly we have some other samples (with 25% unmapped initially), that still have 15-20% unmapped after including 2 micron and mtDNA in the reference file.

        These were slow growing clones, so I am assuming we have some seriously contaminated cultures - perhaps bacteria or some other yeast species was competing more efficiently that normal. About to do some BLAST with the unmapped reads.

        Regarding the duplicate rate: I really think this is due to so many pairs of reads being perfect reads due to the insert size being 30 bp or shorter: thus read 1 and read 2 perfectly overlap.

        In more recent samples, we used less Ampure beads to purify, which should size select large inserts. Indeed, in these samples, the duplicate rate is only 5-10%. Disregarding the potential for PCR duplicates, this suggests that only 5-10% of molecules were 300 bp or shorter.

        However, the best way I guess would be to do a single end alignment. The prediction being that our duplication rate would pretty much disappear (other than PCR duplicates. If that makes sense I can try that.

        Comment


        • #49
          Originally posted by Brian Bushnell View Post
          The tlen field is determined by the aligner, so different aligners will give you different numbers there. BBMap does produce correct insert size histograms for overlapping reads, without having to process the sam file, so I'd be interested to see what its histogram looks like. You can generate it like this:

          bbmap.sh ref=reference.fa in=reads.fq ihist=ihist.txt
          bbmap is taking quite a long time... (8 core mac mini).

          In the absence of a growing output file, is there any way to tell how far it may have got? It's been running for at least 15 mins now at 800% CPU usage. Each fasq has about 460,000 reads.

          Edit: After all that, it appears to have failed:
          Code:
          BBMap version 34.72
          Set insert size histogram output to /Desktop/bbmap_analysis/bbmap_aligner/ihist.txt
          Retaining first best site only for ambiguous mappings.
          No output file.
          NOTE:	Ignoring reference file because it already appears to have been processed.
          NOTE:	If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
          Set genome to 1
          
          Loaded Reference:	0.179 seconds.
          Loading index for chunk 1-1, build 1
          Generated Index:	1.947 seconds.
          Analyzed Index:   	3.244 seconds.
          Cleared Memory:    	0.218 seconds.
          Processing reads in paired-ended mode.
          Started read stream.
          Started 8 mapping threads.
          Exception in thread "Thread-14" java.lang.AssertionError: 
          There appear to be different numbers of reads in the paired input files.
          The pairing may have been corrupted by an upstream process.  It may be fixable by running repair.sh.
          	at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:492)
          	at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:358)
          	at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:195)
          	at java.lang.Thread.run(Thread.java:695)
          Indeed, by opening in FastQC: read1 461,558 reads has and read2 has 461,540 reads.

          Or did something go wrong do you think?

          Edit: actually, I don't think the Terminal has finished yet. I don't have a -bash prompt. (But CPU has dropped to minimal usage.)
          Last edited by M4TTN; 03-31-2015, 04:50 AM.

          Comment


          • #50
            Looks like your input files are corrupt - the reads are no longer correctly paired. Go ahead and kill BBMap; it won't terminate (I'll try to add a way for it to exit gracefully in that situation).

            You need to either start over with the original input data and trim making sure you always use both files as input at the same time when trimming (I apologize if I did not make that clear - with paired input files, running bbduk, always use in1= and in2= rather than doing them independently). Or you can fix them with repair.sh, but re-trimming from the original files would be better because that allows you to use the "tbo" flag which won't work otherwise.

            Comment


            • #51
              Originally posted by Brian Bushnell View Post
              Looks like your input files are corrupt - the reads are no longer correctly paired. Go ahead and kill BBMap; it won't terminate (I'll try to add a way for it to exit gracefully in that situation).

              You need to either start over with the original input data and trim making sure you always use both files as input at the same time when trimming (I apologize if I did not make that clear - with paired input files, running bbduk, always use in1= and in2= rather than doing them independently). Or you can fix them with repair.sh, but re-trimming from the original files would be better because that allows you to use the "tbo" flag which won't work otherwise.
              Thanks Brian. I need to check when back in work tomorrow, but I don't think I used fasta files that had been processed with bbduk first for this bbmap attempt. I think they were the raw files supplied by Illumina from basespace.

              Is it normal that Bowtie2 accepts fast files of that are not "paired" - it didn't throw any errors when using the same fast files as input.

              We are also experiencing some oddities with our files that we ran as 350+250bp sequencing run. On manual inspection it appears that a significant number of unmapped reads contain adapter sequences embedded in the middle of them - even though the "read" is either 350 or 250 bp long. I don't quite understand how the MiSeq can have sequenced so far past the adapter (over 150 bp beyond in some molecules!). Perhaps it starts calling bases from the adjacent clusters?

              For these aberrant reads, only the sequence to the "left" of the adapter maps to teh reference genome.

              Since we haven't noticed this problem before and we don't normally process the fasta files in any way prior to Bowtie2 alignment, I can only come to the conclusion that the automated Illumina trimming algorithm fails (for unknown reasons) when the reads are not identical in length.

              Thoughts?
              Last edited by M4TTN; 03-31-2015, 01:04 PM.

              Comment


              • #52
                Unlike, say, the PacBio platform, Illumina platforms do not know or care how long the molecule is that they are sequencing. They identify the locations of clusters in the first X cycles (~20, I think?) and then call the base at that cluster location according to the dominant light color at each cycle. If sequencing goes of the end of a molecule it will keep calling each from stray light noise, which as you say is probably from nearby clusters. The quality is likely to be low, though, so quality-trimming may sometimes get rid of these bases. And yes, with short inserts from normal fragment libraries, the sequence to the left of the adapter is genomic, and to the right is just noise.

                For determining whether your files are correctly paired according to Illumina naming conventions, you can run "reformat.sh in=r1.fq in2=r2.fq vpair" which will examine the names. Not sure about Bowtie2 - some programs allow mismatched read files, and some don't; I try to do a lot of checking and make my programs crash if the input is corrupt so that people will be forced to stop and fix the input, rather than end up with incorrect results later.

                Unfortunately, I don't know anything about the built-in Illumina trimming routines; we always keep them disabled, as that's just another variable that we don't have control over which could cause problems. But if you are sure the left and right reads came from the same lane and run (which is another thing to check), then you should fix them with repair.sh, then adapter trim them (as pairs), then try mapping again:

                repair.sh in1=r1.fq in2=r2.fq out1=fixed1.fq out2=fixed2.fq outs=singletons.fq

                Also, it might help if you could post the first few headers from each file.

                Comment


                • #53
                  The files are correctly paired since the resulting sam files we generate clearly have pairs of reads in sequential lines that map to the same genomic location (and most of the time overlap significantly with each other - sometimes completely. i.e. where the insert is shorter than the read length).

                  I'm not sure why the the fastq files are apparently of different length though. I mean: why they contain a different number of reads.

                  Good point about the way illumina works. I had assumed that once the end of the molecule was reached, the lack of light emission for any base was interpreted as the end of the molecule, but I guess it is simpler to just keep all the data (no matter how noisy it has become) and then trim it back down to the adapter location.

                  Regarding trimming: I wasn't aware there was a method to disable auto-trimming of the FASTQ files. We'll have to look into this in the future now that we know there may be some problems.

                  We've only just started doing NGS (this is only our 4th MiSeq run), so there is a lot to learn, and a lot ways to improve the efficiency of our analysis pipeline. Our main stumbling block is that we are a wet lab with no dedicated bioinformatics support, so we are needing to learn a lot.

                  Thanks for your help!

                  Edit: Question: if there is are one or more sequencing errors in the adapter sequence, will bbduk still recognise and trim them?
                  Last edited by M4TTN; 03-31-2015, 01:21 PM.

                  Comment


                  • #54
                    Originally posted by M4TTN View Post
                    Regarding trimming: I wasn't aware there was a method to disable auto-trimming of the FASTQ files.
                    Look into doing a Fastq Only/GenerateFASTQ run.

                    Edit: Question: if there is are one or more sequencing errors in the adapter sequence, will bbduk still recognise and trim them?
                    Yes. That is where the hdist option comes in. Add hdist=1 (hamming distance) to one mismatch.

                    Comment


                    • #55
                      As far as I know we do just run a FASTQ only on the MiSeq. But my understanding was that the output sequences are already pretrimmed.

                      Thanks for the bbduk options explanation - I am slowly getting there. I see that Brian also gave me this info earlier.

                      Comment


                      • #56
                        It turns out that on the last run (350+250), adapter trimming had not been enabled in MISeq. We are requeuing the analysis with Illumina so at least the output fastas will have been processed the same as all our others.

                        This is presumably why our mapping dropped to 75%.

                        Comment


                        • #57
                          Is there a reason you are doing odd cycle runs (350 + 250)? I am guessing that the 50 extra cycles on read 1 are resulting in crappy qualities towards the end of that read with increasing chances that are you reading into adapter (and past) at that point.

                          Comment


                          • #58
                            Originally posted by GenoMax View Post
                            Is there a reason you are doing odd cycle runs (350 + 250)? I am guessing that the 50 extra cycles on read 1 are resulting in crappy qualities towards the end of that read with increasing chances that are you reading into adapter (and past) at that point.
                            Actually I am not sure that quality drops off much at all. It appears to in the MiSeq reporter software, but this is just because so many inserts are shorter than the read length - and as Brian pointed out: once the adpater has been sequenced, quality tanks rapidly.

                            The motivation was actually based on the fact that the average qaulity of read 1 was greater than read2, and thus that t might be better to read more bases in read 1 than read 2.

                            It was also based on the (probable, mis)assumption that by having differing read lengths it might prevent misinterpretation of the insert length (TLEN), since we'll have fewer paired reads than fully overlap. In actual fact, on thinking about this more, this probably won't make much difference since there are still lots of inserts that are shorter than 250 bp, then these will fully overlap causing the TLEN column to report "odd" insert lengths (negative etc)

                            Comment


                            • #59
                              This thread has become so long that I don't remember if you ever showed us what the Q-scores look like (FastQC report) for this data. Of course this would not be useful if you are pre-trimming your data on MiSeq since the result file would have lost the bad Q-score data.

                              Comment


                              • #60
                                Hi Genomax. Actually we are really quite perplexed, and are starting to wonder if our last run was particularly poor quality (for as yet unknown reasons).

                                On the first run (3 months ago; 2x300), the Median Q-score as reported by MiSeq reporter stayed high until about cycle 250. In read 2, the overall quality tailed off sooner. The quality on the bottom surface of the flowcell was marginally worse than the top surface, but it is pretty similar.

                                Now, on our most recent run (350+250), the Median Q-score as reported by MiSeq reporter is much worse, dropping off at around cycle 200 in read1, and there is a rather dramatic difference between top and bottom surfaces.

                                Specifically: On the top surface, the quality starts to drop off after about cycle 220 on read1 and cycle 500 (i.e. 150bp)on read 2.

                                On the bottom surface, the quality starts to drop off after cycle 150 in read 1 and cycle 450 (i.e. 100bp) in read2.

                                One thing I don't understand is how the Q-score is calculated. Without really thinking about it, I assumed the main reason the quality dropped off was due to sequencing into adpaters and off the end of molecules. Indeed I am pretty sure this is why the % ATGC content becomes skewed with increasing length of sequence (and which is not seen int he most recent run in which we size-selected for larger inserts).

                                However, if I open up trimmed FASTQ files in FastQC, it is clear that the Q-scores are almost the same as those reported for the untrimmed MiSeq reporter (the MiSeq reporter gives this info live and thus is not doing any trimming yet at this point as far as I am aware). Yet the % ATGC content is dramatically improved (now flat across the entire read length rather than becoming skewed).

                                Also if I open up trimmed (bbduk) versus untrimmed Fastq files in FastQC for the most recent run, there is only a very subtle change in the Qscore distribution. I do not have the untrimmed FastQ files for the first run since these were automatically trimmed by Illumina.

                                SO what is going on? Did we just have a bad run? Is the machine damaged?

                                Cluster density in the first (good) run was 33 million with 30 M passing filter (above spec). Then new run was 18.4 M with 14.5 million passing. Is that too low?

                                Was it because we size-selected for larger inserts by using less Ampure beads? And clustering was both inefficient and resulted in low quality clusters?

                                The overal % Q>=30 as reported by Illumina was 80.3% for run 1, and 67.3 for recent run.
                                I attach some images here from FastQC and MiSeq reporter for the 1st and recent run.

                                Even the two index reads were a lot worse: 94.6% and 97.8% in run 1, but 83.5% and 83.4% in the recent run.

                                I guess these may all be things I need to discuss with Illumina tech support.



                                Last edited by M4TTN; 04-01-2015, 07:44 AM.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 06:37 PM
                                0 responses
                                10 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 06:07 PM
                                0 responses
                                9 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                50 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                67 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X