  • Out of Memory Issues

    I have 8 GB of RAM allocated to SeqMonk and, although it takes a while, I am able to import all of my methylation extractor data into SeqMonk and generate probes using the contig probe generator with a depth of one. However, when I try to quantitate the probes I get an out-of-memory error. I have three samples that I would like to analyze together, but I have also tried quantitating just one sample and still ran out of memory. Any suggestions?

    Comment


    • Originally posted by rmorey View Post
      I have 8 GB of RAM allocated to SeqMonk and, although it takes a while, I am able to import all of my methylation extractor data into SeqMonk and generate probes using the contig probe generator with a depth of one. However, when I try to quantitate the probes I get an out-of-memory error. I have three samples that I would like to analyze together, but I have also tried quantitating just one sample and still ran out of memory. Any suggestions?
      Designing probes over every single C is a very memory-intensive way of going about the analysis. I don't know the exact numbers, but all CpGs (top and bottom strand) in a human genome may well amount to around 50 million probes (and this number would be a lot higher for non-CG contexts), so one can easily imagine that 8 GB is not enough for this many calculations.

      What we normally use to compare different samples is the Bisulfite Quantitation pipeline over features. This lets you select certain genomic features you are interested in (basically any type of annotation or even the current probe set), e.g. CGIs, promoters, gene bodies etc., and calculates a percentage methylation for each feature. While doing so you may select a certain coverage threshold for a position to be considered, or a certain number of observations per feature. To get an overall idea about your different datasets you should be able to design 1 kb tiles over the entire genome, then quantify using an appropriate coverage and number of observations per 1 kb tile, e.g. 5 methylation counts per C and 10 Cs per tile, and then you can start looking at differences using appropriate filtering options.
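
      The tile-based quantitation described above can be sketched outside SeqMonk roughly as follows. This is only an illustration, not SeqMonk's implementation: the function name `tile_methylation`, the input tuples and the choice to average per-cytosine percentages within a tile are all assumptions made for the example.

```python
from collections import defaultdict

def tile_methylation(calls, tile_size=1000, min_calls_per_c=5, min_cs_per_tile=10):
    """Percent methylation per fixed-size tile (hypothetical helper).

    calls: iterable of (chrom, pos, is_methylated), one tuple per observed call.
    Returns {(chrom, tile_start): percent} for tiles that pass both thresholds.
    """
    # Count methylated and total calls for every cytosine position.
    per_c = defaultdict(lambda: [0, 0])
    for chrom, pos, is_methylated in calls:
        counts = per_c[(chrom, pos)]
        if is_methylated:
            counts[0] += 1
        counts[1] += 1

    # Assign each sufficiently covered cytosine to its tile.
    tiles = defaultdict(list)
    for (chrom, pos), (methylated, total) in per_c.items():
        if total >= min_calls_per_c:
            tile_start = (pos // tile_size) * tile_size
            tiles[(chrom, tile_start)].append(100.0 * methylated / total)

    # Keep tiles with enough informative cytosines and average their values.
    return {
        tile: sum(values) / len(values)
        for tile, values in tiles.items()
        if len(values) >= min_cs_per_tile
    }
```

      With the defaults, a tile is reported only if at least 10 cytosines in it each have at least 5 calls, matching the example thresholds in the post.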

      Comment


      • Hi Simon,
        Thanks for making SeqMonk available. I would like to use it for bismark output. I am using the test data that came with bismark.
        There is an item in the import data menu for bismark files, but no data tracks show, just errors.
        I can import the BAM file that is there, but that does not show the methylation of Cs that I was hoping to be able to view with SeqMonk.
        I have seen the instruction videos on YouTube (great and helpful), but nothing about methylation data.
        Looking forward to your answer,
        Jacques

        Comment


        • Originally posted by JacquesWvdH View Post
          Hi Simon,
          Thanks for making SeqMonk available. I would like to use it for bismark output. I am using the test data that came with bismark.
          There is an item in the import data menu for bismark files, but no data tracks show, just errors.
          I can import the BAM file that is there, but that does not show the methylation of Cs that I was hoping to be able to view with SeqMonk.
          I have seen the instruction videos on YouTube (great and helpful), but nothing about methylation data.
          Looking forward to your answer,
          Jacques
          Dear Jacques,

          SeqMonk is currently only able to import methylation information directly if Bismark was run in '--vanilla' mode. SAM files need to be run through the Bismark methylation extractor and can then be imported into SeqMonk using the Generic Text Import option.

          Enabling methylation import directly from SAM/BAM files has been on Simon's to-do list for a while, but unfortunately there are quite a few other things on it, too.

          Comment


          • Custom genome in seqmonk

            I have a bacterial genome that is well documented at NCBI, meaning it has a FASTA file as well as annotation. However, this particular organism is not available in SeqMonk. Can I use custom annotation for my analysis?

            Thanks

            Comment


            • transcript expression in SeqMonk

              Is it possible to calculate transcript differential expression with SeqMonk? I see exon, UTR and other feature selections for probes, but not transcript. Thanks for your help.

              Comment


              • Originally posted by mathew View Post
                Is it possible to calculate transcript differential expression with SeqMonk? I see exon, UTR and other feature selections for probes, but not transcript. Thanks for your help.
                Yes - I've been meaning to put up a new instruction video for this on our YouTube site, but there are now reasonably good tools to do differential expression analysis in SeqMonk. Most of the details for this can be found in our Advanced Course Manual, which actually focuses on the analysis of RNA-Seq data, but the general pipeline is:
                1. Select a set of transcripts you want to analyse and make these into an annotation track (or use the whole of the mRNA feature track)
                2. Use the RPKM quantitation pipeline to quantitate your data using default parameters. I need to change the name of this since it doesn't actually do RPKM calculations by default (more like LRPM)
                3. Check the normalisation of your samples and use the available tools (percentile distribution normalisation, match distributions normalisation) to correct any significant differences if required
                4. Make replicate sets from any groups of biological replicates you have (optional if you don't have them)
                5. Use the intensity difference filter to find a set of differentially expressed transcripts
                6. If you have biological replicates use the replicate set stats filter to remove variable transcripts from the initial candidate list
                7. Use the deduplication filter to select the most significantly changing transcript for each gene (optional)
                8. If you have more than two conditions use the hierarchical clustering plot to separate your hits into related clusters
                9. Report your list of hits and proceed to other downstream analyses.


                We've been using this methodology for a while now and it seems to be pretty robust for us.
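
                The quantitation in step 2 can be illustrated with a small sketch. It assumes that "LRPM" stands for log2 reads per million, which is an interpretation of the abbreviation rather than something stated in the post; the function names, the pseudocount and the dictionary layout are hypothetical and are not SeqMonk's own code.

```python
import math

def log2_rpm(counts, total_reads, pseudocount=1.0):
    """Convert raw per-transcript read counts to log2 reads per million.

    counts: {transcript_name: raw_read_count} for one sample.
    total_reads: total number of reads in that sample.
    pseudocount: added before the log so that zero counts stay finite (assumption).
    """
    scale = 1_000_000 / total_reads
    return {
        name: math.log2(count * scale + pseudocount)
        for name, count in counts.items()
    }

def log2_difference(sample_a, sample_b):
    """Per-transcript difference (b minus a) between two log2-scaled samples."""
    return {
        name: sample_b[name] - sample_a[name]
        for name in sample_a
        if name in sample_b
    }
```

                Because the values are scaled by library size before the log, a transcript with twice the raw count in a library that is twice as deep gives a difference of zero.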

                Comment


                • Originally posted by simonandrews View Post
                  If you're interested in looking at alternative splicing then if you haven't seen this already then a really neat option is to import just the spliced introns into your project. If you have a spliced mapped SAM/BAM file (eg from TopHat), then if you import this and select "Split Spliced Reads" and "Import Introns rather than exons" then you'll see just the splices which you've observed. You can quantitatively analyse these by using the Read Position Probe Generator followed by the Exact Overlap Count Quantitation. We've found this way of looking at the data to be really helpful in deciding if there really is a change in the splicing pattern between samples.
                  This might be a silly question, but does this mean that one could look at and identify retained introns using SeqMonk?

                  Comment


                  • Originally posted by krespim View Post
                    This might be a silly question, but does this mean that one could look at and identify retained introns using SeqMonk?
                    Yes, you could specifically quantitate over introns (even excluding other overlapping exons) and quantitate reads in these to see if you can see significant levels of coverage. There are no pipelines set up for this type of analysis but you should be able to build something suitable from the generic sets of tools which are available.
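
                    As a rough illustration of the idea, the sketch below counts reads that overlap an intron while skipping any read that also touches an exon. It is a simplification under stated assumptions: intervals are inclusive (start, end) pairs on a single chromosome, `intron_read_counts` is a hypothetical name, and excluding whole reads that touch an exon is just one possible way to handle overlapping exons.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]

def intron_read_counts(introns, reads, exons=()):
    """Count reads per intron, ignoring reads that overlap any exon.

    introns, reads, exons: sequences of inclusive (start, end) tuples,
    all assumed to lie on the same chromosome.
    """
    # Reads touching an exon could come from spliced transcripts, so drop them.
    intronic_reads = [
        read for read in reads
        if not any(overlaps(read, exon) for exon in exons)
    ]
    return {
        intron: sum(1 for read in intronic_reads if overlaps(read, intron))
        for intron in introns
    }
```

                    Introns whose counts stay high relative to the flanking exons would then be candidates for retention; the quadratic loops are fine for a sketch but too slow for a whole genome.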

                    Comment


                    • Hi Simon,
                      I'm currently using SeqMonk to analyze a set of RNA-seq libraries with read depths ranging from 12-100 million. I've used the RPKM pipeline and the cumulative distribution plot lines are close, but still parallel to each other. How different should these lines be before doing further normalization, and which normalization would work the best for datasets with this difference in total reads? Thanks!

                      Comment


                      • After the RPKM pipeline (which will have its name changed in the next release given that by default it doesn't do RPKM calculations!), you should look at the cumulative distribution plot and see how your samples look. If they're well overlaid then you can stop there. If they're different then you'd normally run the percentile normalisation quantitation method to get them to match up.

                        Within the percentile normalisation you have the choice to match your datasets by either adding or multiplying by a factor to get the data to match. If your samples have a small number of highly differential regions then you can end up with profiles which fall parallel to each other. In this case you'd use the 'add' option to make them line up. If your samples have differing degrees of duplication in them then you can get profiles where the degree of separation is proportional to the read count (the lines get further apart through the plot). In this case you'd use the 'multiply' option.

                        If you're not sure you can always try both and see which works best.
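
                        The difference between the 'add' and 'multiply' options can be sketched as below. This is a minimal illustration, not SeqMonk's code: the nearest-rank percentile, the default 75th percentile and the choice to align every sample to the mean of the per-sample percentiles are all assumptions made for the example.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def percentile_normalise(samples, pct=75.0, mode="add"):
    """Align the chosen percentile of every sample to a common target.

    samples: {sample_name: [quantitated values]}.
    mode: "add" shifts each sample by a constant offset,
          "multiply" scales each sample by a constant factor.
    """
    anchors = {name: percentile(values, pct) for name, values in samples.items()}
    target = sum(anchors.values()) / len(anchors)

    normalised = {}
    for name, values in samples.items():
        if mode == "add":
            offset = target - anchors[name]
            normalised[name] = [value + offset for value in values]
        elif mode == "multiply":
            factor = target / anchors[name]
            normalised[name] = [value * factor for value in values]
        else:
            raise ValueError("mode must be 'add' or 'multiply'")
    return normalised
```

                        Parallel profiles differ by a constant, so 'add' overlays them; profiles that diverge in proportion to the value differ by a factor, so 'multiply' overlays those.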

                        Hope this helps

                        Simon.

                        Comment


                        • Hello,
                          I'm working with bisulfite data. After mapping with Bismark and running the methylation extractor, I have 5 files: CHG, CG, CHH, bedGraph and the genome-wide report.
                          I tried to load the data into SeqMonk using the Text (Generic) import option.

                          I created one file per chromosome because the genome-wide report file is really huge.

                          I want to know how to visualize the peaks (the number of methylation calls per position). I also tried the probe window generator with 20 bp windows, but I did not see any difference between windows.

                          So if someone has an idea how to analyse or visualize this kind of data...


                          Thanks

                          Mohamed

                          Comment


                          • Firstly I'd have a think about exactly which contexts you care about in your data. CHH and CHG are normally huge datasets, and if you're only really interested in CpG then just working with that should make your life much easier. On a reasonable PC you should be able to load a whole CpG dataset fairly easily. You'll need a reasonable amount of RAM to do a whole CHH/CHG dataset.

                            Once you've loaded the data the way to calculate methylation levels is to use one of the quantitation pipelines (Data > Quantitation Pipelines), specifically the 'Bisulphite Methylation over feature' pipeline. This gives you several options to calculate an overall methylation value for each one of a class of features (CpG islands, promoters, exons, genes, whatever). Once you have the methylation values you can start to compare these between datasets. Exactly how you do this will depend on the datasets you have and what it is you want to look for.

                            Hopefully this is enough to get you started.

                            Comment


                            • Originally posted by shadow19c View Post
                              Hello,
                              I'm working with bisulfite data. After mapping with Bismark and running the methylation extractor, I have 5 files: CHG, CG, CHH, bedGraph and the genome-wide report.
                              I tried to load the data into SeqMonk using the Text (Generic) import option.

                              I created one file per chromosome because the genome-wide report file is really huge.

                              I want to know how to visualize the peaks (the number of methylation calls per position). I also tried the probe window generator with 20 bp windows, but I did not see any difference between windows.

                              So if someone has an idea how to analyse or visualize this kind of data...


                              Thanks

                              Mohamed
                              Further to what Simon wrote, you can't import the genome wide cytosine report as this basically has only one entry per cytosine. This means that the entire genome has a coverage of only a single read which is probably why you don't see any differences. SeqMonk regards every line of an input file as an extra read, which is why you have to import the output of the methylation extractor (CpG_*, CHG_* or CHH_*) and not the cytosine report.
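
                              To see why the one-line-per-read extractor files carry coverage information, here is a small sketch that tallies calls per position. It assumes the usual tab-separated extractor layout (read ID, methylation state as + or -, chromosome, position, call letter) with a header line starting with "Bismark"; check this against your own files, and note that `count_calls` is a hypothetical helper, not part of Bismark or SeqMonk.

```python
from collections import defaultdict

def count_calls(lines):
    """Tally (methylated, total) calls per cytosine from extractor-style lines.

    lines: iterable of text lines from a CpG_*, CHG_* or CHH_* file.
    Returns {(chrom, position): (methylated_calls, total_calls)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for line in lines:
        # Skip the version header and any blank lines.
        if not line.strip() or line.startswith("Bismark"):
            continue
        read_id, state, chrom, pos, call = line.rstrip("\n").split("\t")
        entry = counts[(chrom, int(pos))]
        if state == "+":  # "+" marks a methylated call in the assumed layout
            entry[0] += 1
        entry[1] += 1  # every line is one read-level observation
    return {key: (meth, total) for key, (meth, total) in counts.items()}
```

                              Each input line adds one observation, so positions covered by many reads accumulate many calls, whereas a file with a single line per cytosine would give every position a total of one.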

                              Comment


                              • SE vs PE counting

                                Hey all,

                                how does SeqMonk count paired-end reads? Is each pair only counted once (or once per gene?), or is each read counted individually? In any case, is there any way to switch between these two modes?

                                And if you allow a second question: Paired-end reads are shown as "complement" on mouse-over, but is there also a way to tell which reads belong together, maybe similar to how duplicate reads are visualized (they turn green on mouse-over)?

                                The reason I ask is that I have mixed samples from SE and PE sequencing and want to compare raw count numbers (i.e. I use SeqMonk to prepare count tables for DESeq).

                                Thanks,

                                Roman

                                Comment
