Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Hi Felix - thanks for the prompt reply, and sorry I forgot to include my library type/bowtie version. I was just making sure this was the intended behaviour before I updated my pipeline to ignore the /1 /2 tags in remove PCR duplicates - thanks

    Comment


    • Hi all

      I am trying to re-align my reads again using a new assembly of the reference genome and I am finding many problems in the alignment step.

      After running genome_preparation I am trying to run bismark for paired end reads trimmed with trim_galore (--phred64 -t --paired filename_2 filename_2). Right after the script creates the two version os the reads (c-t, a-g) it gives back a un ending number of errors. Many of them are related to the impossibility to determine the genomic context (due to Cs located at the end of the contigs) but others are quite new for me.

      The resulting SAM file is fairly small and the number of Cs analyzed (from the report file) are ridiculously low; around 50 000.

      I am using the last version of Bismark with the --bowtie2 option (also --botwtie2 option for the genome preparation step)

      I am running in parallel an alignment with the same .fq files against a masked version of the genome and no errors so far (4 millions reads...)

      Any thoughts?

      Cheers


      PS. some of the lines with multiple errors

      Reading in the sequence files BPM24_FIS17_trim_galore_1.fq and BPM24_FIS17_trim_galore_2.fq
      Processed 100000 sequences so far
      Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 203898.
      Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 203898.
      Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1101:2724:10559#ACCAGACT/1 scaffold_1495
      Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 223364.
      Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 223364.
      Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1101:7055:11458#ACCAGACT/1 scaffold_402
      Processed 200000 sequences so far
      Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 592170.
      Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 592170.
      Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1101:8410:27041#ACCAGACT/1 scaffold_1917
      Processed 300000 sequences so far
      Processed 400000 sequences so far
      Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 957754.
      Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 957754.
      Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1102:14266:27423#ACCAGACT/1 scaffold_1862
      Processed 500000 sequences so far
      Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 1095048.
      Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 1095048.
      Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1102:14067:33341#ACCAGACT/1 scaffold_297
      Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 1171846.
      Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 1171846.
      Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1102:20587:36545#ACCAGACT/1 scaffold_391
      Processed 600000 sequences so far
      Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 1323878.
      Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 1323878.
      Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1101:10776:58374#ACCAGACT/1 scaffold_1028
      Processed 700000 sequences so far
      Processed 800000 sequences so far
      Processed 900000 sequences so far
      Processed 1000000 sequences so far
      Processed 1100000 sequences so far
      Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 2259346.
      Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 2259346.
      Last edited by oria34; 07-05-2013, 10:53 AM.

      Comment


      • Hi oria,

        Interestingly I encountered the same error messages yesterday myself; they are caused by trying to print out the starting position of the read which aligns so close to the edge of a chromosome or contig/scaffold that additional 2bp sequence can't be extracted. These warnings are harmless though and can be ignored safely; I have still fixed this bug yesterday (if you want the new version just send me an email).

        From the output below it seems that you are getting roughly 1 of these warning per 100000 analysed sequences, so this can hardly be the reason for poor overall mapping results. Are you by any chance running the other alignment to the repeat masked genome from the same files in the same folder? This would certainly have the potential to interfere with the results... In any case, if you continue to have problems can you email me further details of your problems? If necessary I could also create an FTP server for you to upload your files in question so I could take a look myself.

        Comment


        • Thanks for your quick reply!

          I am not really worried for the errors, I remember reading in this thread that they are due to the Cs located at the very end and, as you said, they are not relevant...Is the outcome what is killing me!

          I have created two different folders for the two different alignments (unmasked and masked), each with copies of bismark´s scripts. Each folder has two subfolders, one per individual (two .fq files each)...I have even duplicated the bowtie2 folder just in case....

          It is true that the two genomes are in different folders but each genome is used at the same time for the alignment of the different individuals (2x)...but this doesn´t seem to affect to the masked version of the genome.

          I´ll post tomorrow how it is going.

          Many thanks!

          Comment


          • Alright, thanks for keeping me posted.

            Comment


            • Hi again,

              OK, I have some more news. I made a mistake checking online my runs ....the one that is running error-free is the alignment against the standard genome.

              All the errors I´ve reported have been produced in the alignment run against the masked genome. These runs are now over (see report below) and they didn´t work at all.

              To my understanding, Bowtie doesn´t have any type of restrictions in the use of masked genomes, does it? Do you think I could miss something during the genome preparation step?

              Cheers


              Bismark report for: M24_FIS17_trim_galore_1.fq and M24_FIS17_trim_galore_2.fq (version: v0.7.12)
              Bowtie was run against the bisulfite genome of /Bisulfite/BISMARK_REALIGNMENT/MASKED_Version_Genome/ with the specified options: -q --phred64 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --maxins 500

              Option '--directional' specified: alignments to complementary strands will be ignored (i.e. not performed)
              Final Alignment report
              ======================
              Sequence pairs analysed in total: 62787165
              Number of paired-end alignments with a unique best hit: 1587
              Mapping efficiency: 0.0%
              Sequence pairs with no alignments under any condition: 62785501
              Sequence pairs did not map uniquely: 77
              Sequence pairs which were discarded because genomic sequence could not be extracted: 0

              Number of sequence pairs with unique best (first) alignment came from the bowtie output:
              CT/GA/CT: 557 ((converted) top strand)
              GA/CT/CT: 0 (complementary to (converted) top strand)
              GA/CT/GA: 0 (complementary to (converted) bottom strand)
              CT/GA/GA: 1030 ((converted) bottom strand)

              Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

              Final Cytosine Methylation Report
              =================================
              Total number of C's analysed: 50806

              Total methylated C's in CpG context: 1118
              Total methylated C's in CHG context: 99
              Total methylated C's in CHH context: 172

              Total C to T conversions in CpG context: 2166
              Total C to T conversions in CHG context: 16576
              Total C to T conversions in CHH context: 30675

              C methylated in CpG context: 34.0%
              C methylated in CHG context: 0.6%
              C methylated in CHH context: 0.6%
              Last edited by oria34; 07-06-2013, 12:39 AM.

              Comment


              • Good to hear that the alignments against the standard genome went well.

                The errors from the masked genome are quite likely to be a result of the masking process itself, do you have an idea about how much sequence has been masked into Ns (presumably)?

                As far as I am aware Bowtie2 supports indexing of Ns in there reference sequence (whereas Bowtie 1 does not), so in theory it should work in the same way. N's in the reads or reference do receive a mapping penalty as well though (albeit not as much as true mismatches), so if you simply have masked too much of your genome this might just exceed the fairly stringent penalty allowance of Bismark's default setting which is governed by the "--score_min" parameter. The longer your reads and the paired-end nature of your library could make it possible that effectively all reads would fall into a 'masked' region.

                To see if it is the degree of masking you could just run a few sample reads (e.g. with the parameter --qupto 500000 for the firt 500K reads), run only Read 1 in single end mode and use --score_min L,0,-0.6. This should finish in a couple of minutes and tell you quickly if the mapping efficiency increases notably.

                Just out of interest, why would you want to align reads to both a masked and unmasked genome in parallel (especially since the normal genome alignments were fine)?

                Comment


                • Well, apparently errors started to appear also in the alignment run against the standard genome, but a bit later I left the office (see below). Not really concerning if I get a good enough mapping efficiency.... Still running

                  Well this is embarrassing…… No idea how "they" masked the genome for the interspersed repeats. I can't even ask till next Monday...Sorry for this also, but I have no idea how to extract the first 500K reads. I am quite newbie in all this

                  The reason for aligning to both versions of the genome is just to compare results...well, I am just curious about how the interspersed repeats can determine certain patterns of methylation in certain parts of a couple of particular chromosomes. I just thought it was worth the shot.



                  Processed 47700000 sequences so far
                  Processed 47800000 sequences so far
                  Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 95642710.
                  Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 95642710.
                  Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2314:2821:99210#TCTTATAT/1 scaffold_1878
                  Processed 47900000 sequences so far
                  Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 95989234.
                  Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 95989234.
                  Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2314:12271:44482#TCTTATAT/1 scaffold_232
                  Processed 48000000 sequences so far
                  Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 96116284.
                  Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 96116284.
                  Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2315:3124:23585#TCTTATAT/1 scaffold_297
                  Processed 48100000 sequences so far
                  Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 96230734.
                  Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 96230734.
                  Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2314:15481:55756#TCTTATAT/1 scaffold_1105
                  Processed 48200000 sequences so far
                  Processed 48300000 sequences so far
                  Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 96765254.
                  Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 96765254.
                  Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2316:2905:27999#TCTTATAT/1 scaffold_1711
                  Processed 48400000 sequences so far
                  Processed 48500000 sequences so far
                  Processed 48600000 sequences so far
                  Processed 48700000 sequences so far
                  Processed 48800000 sequences so far
                  Processed 48900000 sequences so far


                  FYI - Updated info: The alignment to standard genome is now finished. It took around 17h and the mapping efficiency was fairly high....thanks again for this worderful tool! I copy/paste the report:

                  Final Alignment report
                  ======================
                  Sequence pairs analysed in total: 48900263
                  Number of paired-end alignments with a unique best hit: 31991336
                  Mapping efficiency: 65.4%

                  Sequence pairs with no alignments under any condition: 13461118
                  Sequence pairs did not map uniquely: 3447809
                  Sequence pairs which were discarded because genomic sequence could not be extracted: 215

                  Number of sequence pairs with unique best (first) alignment came from the bowtie output:
                  CT/GA/CT: 15981506 ((converted) top strand)
                  GA/CT/CT: 0 (complementary to (converted) top strand)
                  GA/CT/GA: 0 (complementary to (converted) bottom strand)
                  CT/GA/GA: 16009615 ((converted) bottom strand)

                  Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

                  Final Cytosine Methylation Report
                  =================================
                  Total number of C's analysed: 1298097901

                  Total methylated C's in CpG context: 122944840
                  Total methylated C's in CHG context: 870088
                  Total methylated C's in CHH context: 2311989

                  Total C to T conversions in CpG context: 71226620
                  Total C to T conversions in CHG context: 301248948
                  Total C to T conversions in CHH context: 799495416

                  C methylated in CpG context: 63.3%
                  C methylated in CHG context: 0.3%
                  C methylated in CHH context: 0.3%
                  Last edited by oria34; 07-06-2013, 05:19 AM.

                  Comment


                  • Good news. So apparently the warning messages you saw amounted to a total of 215 (out of 49 million read pairs), indeed something you should be able to live with.

                    Bismark has an option '--qupto' to just use the first so many reads; just type bismark --help to see further explanations on it

                    So you could run single end alignments against your masked genome for the first 500000 reads with not-so-stringent criteria using the following command:

                    bismark --bowtie2 --qupto 500000 --score_min L,0,-0.6 /path/to/genome/ file_val_1.fq

                    Comment


                    • M-Bias plot and alignment stats pie chart

                      We have just released a new version of Bismark (v0.8.0). The main purpose of this release is the introduction of an M-bias plot for the methylation extractor (thanks to K.D. Hansen for the suggestion, see reference below). The M-bias plot allows researchers to see whether there are fundamental technical biases in the methylation calling of their reads.

                      Attached are two examples:

                      a) the profile of a PBAT-Seq library (read1) which should in theory start with a 'random' 4-mer oligo that was used to bind to bisulfite converted reads. The M-bias (and also base composition profiles of these libraries as judged by FastQC (not shown here)) do show some serious biases primarily within, but not limited to, the first 4bp of the library.

                      b) Read 2 of any paired-end BS-Seq library. Due to the end-repair reaction and A-tailing procedure for Illumina libaries, the 3' ends of sonicated fragments are filled in by the Klenow enzyme with unmethylated cytosines. These will then be converted to Ts in the bisulfite treatment, resulting in a very high amount of apparently unmethylated cytosines at the first couple of positions of Read 2. (The problem is somewhat reminiscent of the fill-in problem in RRBS libraries).

                      In addition to just the methylation profile, the M-bias plot also shows the actual number of events for an entire read file. This should enable investigators to make an informed decision of whether to remove biased positions in the reads, e.g. by using the option --ignore <int> of the methylation extractor or by hard-trimming read 2 by 2bp prior to running the alignments, or whether it is OK to live with what they observe.

                      Other changes in the latest version include:
                      Bismark: Added new option '--prefix' to add a prefix to the output filenames. For example, '--prefix test' with 'file.fq' would result in the output file 'test.file.fq_bismark.sam' etc.
                      Bismark: Fixed a warning message that occurred when chromosomal sequences could not be extracted in paired-end Bowtie2 mode
                      Bismark: will now generate a pie chart with the alignment statistics once a run has finished; this allows to get a quick overview of how many sequences aligned uniquely or sequences that did not align, either due to producing no alignment at all, multiple mapping or because it was impossible to extract the chromosomal sequence

                      Methylation Extractor: upon completion, the methylation extractor will now produce an M-bias (methylation bias) plot, which shows the methylation proportion across each possible position in the reads (described in: Hansen et al., Genome Biology, 2012, 13:R83). The data for the M-bias plot will be written into a text file (to generate graphs by alternative means) and drawn into a .png file. The plot also contains the absolute number of methylation calls per position (methylated + unmethylated)

                      Bismark is avaialable for download here: https://www.bioinformatics.babraham....jects/bismark/
                      Attached Files

                      Comment


                      • Hi Felix,

                        Sorry for the delay in replying. For some reason I couldn't use the --qupto option, Bismark gives back an error message:

                        Unknown option: qupto
                        Please respecify command line options

                        Any thoughts?

                        Thanks for the new release of Bismark! This new M-bias tool is really helpful!

                        Also, I am looking for some advice regarding the post processing of all the methylation calls. In order to have a statistically based tool to decide whether a C is actually a methylated C, I've decide to perform the well-known Binomial probability test.

                        When I started in my present position, all the sequencing experiments were already performed and the data ready to be analyzed so, to my knowledge, no lambda control sequence exists. The question is, Is there any straightforward way to use the methylation extractor output as a input file for any other soft performing this Binomial probability test?

                        I would be very grateful in receiving some advice on this

                        Thanks a lot

                        Comment


                        • Hi oria,

                          Sorry the upto command is actually:

                          Code:
                          -u/--upto <int>          Only aligns the first <int> reads or read pairs from the input. Default: no limit.
                          (just do bismark --help if in doubt).

                          We personally use SeqMonk for most downstream methylation analyses, so I am probably not the best person to advise you on other downstream applications. There are couple of packages though that devoted to such analyses, e.g. methylKit by A. Akalin or the Bioconductor package bsseq by K.D. Hansen.

                          Comment


                          • We just released a new version of Bismark (v0.8.1) which addresses a few minor issues such as plot colours and legends, but notably also extends the '--ignore' option of the methylation extractor to now work independently for Read 1 and Read 2 of paired-end files. This allows more flexible control of removing biases that can now be visualised via the M-bias plot. Here are the current changes in more detail:

                            • Methylation Extractor: Changed the function of the option '--ignore <int>' to ignore the first <int> bp from the 5' end of single-end reads or Read 1 of paired-end files. In addition, added a new option '--ignore_r2 <int>' to ignore the first <int> bp from the 5' end of Read 2 of paired-end files. Since the first couple bases in Read 2 of BS-Seq experiments show a severe bias towards non-methylation as a result of the end-repair of sonicated fragments with unmethylated cytosines (see M-bias plot), it is recommended that the the first couple of bp of Read 2 are removed before starting downstream analysis. Please see the section on M-bias plots in the Bismark User Guide for more details
                            • Methylation Extractor: Changed colours, legends and background colour of the M-bias plot
                            • Bismark: Changed the way in which the alignment overview file is being named to actually work

                            Together with the options '--clip_r1' and '--clip_r2' of Trim Galore one can now choose to remove known biases, e.g. for PBAT-Seq, prior to running the alignments using Trim Galore, or ignoring biased methylation calls after the alignments have been carried out using the methylation extractor.

                            Bismark can be downloaded here: https://www.bioinformatics.babraham....jects/bismark/

                            Comment


                            • Hi Felix,

                              Any chance read groups could be incorporated into the output bams in a future version of bismark?

                              Cheers!

                              Comment


                              • Originally posted by frozenlyse View Post
                                Hi Felix,

                                Any chance read groups could be incorporated into the output bams in a future version of bismark?

                                Cheers!
                                Hi frozenlyse,

                                Sorry but I am not too familiar with the concept of read groups in the output BAMs, would that be more or less just adding an @RG tag and ID into the header section? Or would you want to specify more details yourself?

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                24 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                25 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                22 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                52 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X