
  • Correlation coefficient in cluster

    I have a question about SeqMonk. When we perform clustering on analyzed data, we can get different subsets of clusters depending on the correlation coefficient. How does it calculate the correlation coefficient? Is the correlation calculated across whatever the data matrix is? In that case there should be one correlation value per row, so how can one single correlation coefficient be used to cut the rows? I may be missing something. Any suggestions, please?

    Comment


    • Originally posted by mathew View Post
      I have a question about SeqMonk. When we perform clustering on analyzed data, we can get different subsets of clusters depending on the correlation coefficient. How does it calculate the correlation coefficient? Is the correlation calculated across whatever the data matrix is? In that case there should be one correlation value per row, so how can one single correlation coefficient be used to cut the rows? I may be missing something. Any suggestions, please?
      The correlation clustering is an iterative process where you start by making a set of clusters with only one probe in each. In each round the program finds the two most correlated clusters and joins them together. It keeps doing this until all of the clusters are joined together. Since the most strongly correlated clusters are always joined in each round, the level of correlation decreases as the clustering continues. It also means that every cluster join has a specific R value associated with it.

      When you adjust the clustering stringency with the slider in SeqMonk, what you're actually doing is moving through the cluster tree to find the largest cluster set for which the R value which joined that cluster is at or above the R value that you set. High R values will most likely be found early on in the clustering but will generate only small clusters; lower or negative R values will be late-stage joins of large clusters. Adjusting this threshold therefore allows you to define the stringency of clustering.

      Hope this clears things up.
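
To make the join-and-cut idea above concrete, here is a minimal Python sketch. It is not SeqMonk's code. It assumes Pearson correlation, assumes a cluster is represented by the mean profile of its member rows, and assumes the cut simply replays the joins in order until one falls below the chosen R. All function names and the toy data are made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson R between two equal-length profiles (assumes neither is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def mean_profile(rows, members):
    """Assumption: a cluster is summarised by the mean of its member rows."""
    width = len(rows[0])
    return [sum(rows[m][k] for m in members) / len(members) for k in range(width)]

def cluster_joins(rows):
    """Start with one probe per cluster, then repeatedly join the two most
    correlated clusters. Returns the joins as (members_a, members_b, r)."""
    clusters = [[i] for i in range(len(rows))]
    joins = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                r = pearson(mean_profile(rows, clusters[i]),
                            mean_profile(rows, clusters[j]))
                if best is None or r > best[0]:
                    best = (r, i, j)
        r, i, j = best
        joins.append((clusters[i], clusters[j], r))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return joins

def clusters_at(joins, n_rows, r_min):
    """Replay the joins in order and stop at the first one whose R is below r_min."""
    clusters = [frozenset([i]) for i in range(n_rows)]
    for a, b, r in joins:
        if r < r_min:
            break
        fa, fb = frozenset(a), frozenset(b)
        clusters = [c for c in clusters if c not in (fa, fb)] + [fa | fb]
    return sorted(sorted(c) for c in clusters)

# Toy data: rows 0 and 1 rise together, rows 2 and 3 fall together.
rows = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1], [8, 6, 4, 2.5]]
joins = cluster_joins(rows)
print(clusters_at(joins, len(rows), 0.9))   # strict threshold: small clusters
print(clusters_at(joins, len(rows), -1.0))  # loosest threshold: everything joined
```

So each row does not get its own R value. The R values belong to the joins in the tree, and the slider just decides how far down the tree you go.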

      Comment


      • Originally posted by Neuromancer View Post
        Hi Simon,

        Just a short question about genome versions:
        As far as I know, SeqMonk genomes are derived from Ensembl genome releases, right?
        So is the current SeqMonk mouse genome (GRCm38) the same as the annotation and coordinates in Ensembl release 73 (i.e. GRCm38p1 + new annotations by Ensembl)?

        [This current release has 38561 genes (Ensembl gene IDs), SeqMonk's probe generator (v0.25.0) generates 32029 genes (feature probes over genes, nothing removed)...]

        What's the status of the SeqMonk (mouse) genome then?
        In general we only update the genomes for new assemblies and the gene builds we distribute are the initial builds for that assembly. GRCm38 hasn't changed its sequence since the initial Ensembl build so the gene models are still on Ensembl v68. If there is a significant improvement in the gene builds then we can update these and SeqMonk will pick up the updates, but we didn't build in a place to record the specific annotation version when we built the back end (would have been nice in retrospect) so we're generally reluctant to do this.

        If you want a newer gene build you can always download the GTF file for any specific build and import that as an additional annotation set. You can prefix all of the features with a specific string so you can tell them apart from the core features.

        Comment


        • Originally posted by simonandrews View Post
          If you want a newer gene build you can always download the GTF file for any specific build and import that as an additional annotation set. You can prefix all of the features with a specific string so you can tell them apart from the core features.
          Thank you, that was what I had in mind as a solution as well! Thanks for the quick answer.

          Comment


          • I've just released SeqMonk v0.26.0 onto our project web site. The immediate reason for this is to fix a problem which occurred with the program launcher in the new OSX Mavericks release. We have also, though, included another tool we've been working on which makes it much easier to create and work with custom genomes, so that if you just have a collection of fasta files or a GTF file then it's now much easier to use these with SeqMonk.

            Please try out the new release and send your experiences either back to us directly or post them in this forum.

            Comment


            • how to predict gene from transcriptome data by mapping of transcriptome to genome

              Hi everyone,
              I want to ask how to predict genes from transcripts by mapping them to the genome. Please reply if anyone knows about this. I am new to this field.

              Comment


              • Originally posted by rajeshgazara View Post
                Hi everyone,
                I want to ask how to predict genes from transcripts by mapping them to the genome. Please reply if anyone knows about this. I am new to this field.
                Probably the most commonly used tool for this would be cufflinks. Since you're asking in a SeqMonk thread I should point out that we've done this kind of analysis and then loaded the raw mapped data and the GTF file from cufflinks in to SeqMonk to check the results. We've found that it's been very variable whether the predictions it made matched with what we expected from looking at the data ourselves.

                Comment


                • Hi Simon.

                  I'm trying to use the tool for constructing genomes via gff and fasta files. I'm trying to get a draft version of the kiwifruit genome (from http://bioinfo.bti.cornell.edu/cgi-b...i/download.cgi) into seqmonk.

                  I've been loading the gff file - Kiwifruit_pseudomolecule.gff3 - it appears to accept it and then I get nothing when I create a new project. Just 25 blank chromosomes. I've tried adding in the various fasta files, with little success. When I load the scaffold file into the genome creation tool alongside the others, it'll show me the scaffold track, but that is of little value.

                  Is there some modification to the gff file I need to make that I'm missing? The seqmonk version I'm using was downloaded a week or so ago, so that should be current.

                  Any thoughts/pointers would be much appreciated.

                  Cheers
                  Ben.

                  Comment


                  • Originally posted by tirohia View Post
                    Hi Simon.

                    I'm trying to use the tool for constructing genomes via gff and fasta files. I'm trying to get a draft version of the kiwifruit genome (from http://bioinfo.bti.cornell.edu/cgi-b...i/download.cgi) into seqmonk.

                    I've been loading the gff file - Kiwifruit_pseudomolecule.gff3 - it appears to accept it and then I get nothing when I create a new project. Just 25 blank chromosomes. I've tried adding in the various fasta files, with little success. When I load the scaffold file into the genome creation tool alongside the others, it'll show me the scaffold track, but that is of little value.
                    Hi Ben,

                    I had a look at this. It's a bug in SeqMonk - it doesn't pick up files with .gff3 extensions when it creates the default annotation set for a new custom genome. If you change the extension to just .gff and rebuild the custom genome it should work.

                    I'll fix this in the next release. Thanks for spotting and reporting this.

                    Cheers

                    Simon.
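
If there are several annotation files to rename, the workaround above can be scripted. This is only a small sketch using the Python standard library; the function name is made up, and it assumes the files sit directly in one folder and that an existing .gff file should never be overwritten.

```python
from pathlib import Path

def rename_gff3_to_gff(folder):
    """Rename every *.gff3 file in `folder` to *.gff and return the new paths."""
    renamed = []
    for path in sorted(Path(folder).glob("*.gff3")):
        target = path.with_suffix(".gff")
        if target.exists():
            # Assumption: leave things alone rather than overwrite a file.
            continue
        path.rename(target)
        renamed.append(target)
    return renamed

# Example call (hypothetical folder name):
# rename_gff3_to_gff("kiwifruit_annotation")
```

After renaming, the custom genome still has to be rebuilt in SeqMonk for the change to take effect.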

                    Comment


                    • Ah. That works. Brilliant.

                      Ta muchly.

                      Ben.

                      Comment


                      • Hi Simon.

                        Possibly a rehash (of sorts) of an old question, if I may. I'm getting an error when I try to import a bam file into seqmonk - "Couldn't extract a valid name from <name>". It's the same one that was in this post a while back.

                        I've read the article that you linked to in your response.

                        So my gff file, where the reference data came from, has entries like this:

                        Chr6 glean gene 14845357 14856602 . + . ID=Achn215061; status=novel;
                        Chr6 glean mRNA 14845357 14856602 0.240873 + . ID=Achn215061-TA; Parent=Achn215061; status=novel;
                        Chr6 glean CDS 14845357 14845647 . + 0 Parent=Achn215061-TA;
                        Chr6 glean CDS 14848222 14849031 . + 0 Parent=Achn215061-TA;

                        The corresponding line in the header of the sam file that I have been attempting to import (actually been trying in bam, but this is the sam file it is derived from), has an entry like this:

                        @SQ SN:Achn215061 LN:4998

                        Both files have data from all the available chromosomes in them but the sam file only contains accession numbers, not chromosome details. It's all just Achn******. So how does one go about setting up aliases as indicated in the article that you linked to?

                        I'm assuming at this point that I'll have to add something to the names of the various genes in the sam file, but the article indicates that no regexp matching is attempted on the aliases provided in the files. That would mean that if I added a chromosome name as a prefix/suffix to all the entries in the sam file (Achn215061chr6, maybe), it wouldn't pick them up. I'm not sure where/how I would add the chromosome information in the SAM file.

                        Am I missing something obvious?

                        Cheers
                        Ben.

                        Comment


                        • Originally posted by tirohia View Post
                          I'm getting an error when I try and import a bam file into seqmonk - "Couldn't extract a valid name from <name>". It's the same one that was in this post a while back.

                          I've read the article that you linked to in your response.

                          So my gff file, where the reference data came from, has entries like this:

                          Chr6 glean gene 14845357 14856602 . + . ID=Achn215061; status=novel;
                          Chr6 glean mRNA 14845357 14856602 0.240873 + . ID=Achn215061-TA; Parent=Achn215061; status=novel;
                          Chr6 glean CDS 14845357 14845647 . + 0 Parent=Achn215061-TA;
                          Chr6 glean CDS 14848222 14849031 . + 0 Parent=Achn215061-TA;

                          The corresponding line in the header of the sam file that I have been attempting to import (actually been trying in bam, but this is the sam file it is derived from), has an entry like this:

                          @SQ SN:Achn215061 LN:4998

                          Both files have data from all the available chromosomes in them but the sam file only contains accession numbers, not chromosome details. It's all just Achn******. So how does one go about setting up aliases as indicated in the article that you linked to?
                          Setting up chromosome aliases is fairly simple. You simply need to create a file called aliases.txt in the folder containing your seqmonk genome and then add in alias[tab]chromosome name pairs to allow seqmonk to do the lookup when importing.
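
As an illustration of the file layout just described, this Python sketch writes an aliases.txt with one alias[tab]chromosome name pair per line. The function name, the folder and the example names are hypothetical, and the optional offset mentioned further down is not covered here.

```python
from pathlib import Path

def write_aliases(genome_folder, alias_to_chromosome):
    """Write aliases.txt into the genome folder, one 'alias<TAB>chromosome' per line."""
    lines = [f"{alias}\t{chrom}" for alias, chrom in alias_to_chromosome.items()]
    path = Path(genome_folder) / "aliases.txt"
    path.write_text("\n".join(lines) + "\n")
    return path

# Example call (hypothetical folder and names):
# write_aliases("my_seqmonk_genome", {"chromosome_6": "Chr6", "chromosome_7": "Chr7"})
```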

                          However, in this case I suspect you might have a different problem. It's difficult to tell from the information you've supplied but I think your data might have been mapped against a transcriptome rather than a genome, so although your genome has assembled chromosomes, the coordinates in your BAM file might be within transcripts rather than being genomic positions. If this is the case then it's not just a case of adding an alias since the positions will be offset in the genome. The aliases file does allow for supplying an offset position as well as an alias, but if you're working in a species which does splicing then even this isn't going to be enough since you will have a different offset for each exon.

                          It's theoretically possible to translate transcriptome coordinates to genomic coordinates (tophat does this internally, for example). I've never actually tried this and don't know of a simple way to do it, but if your BAM file is mapped against a transcriptome and you want to view the data on a genome then this is what you'd need to do.
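
To show why a single offset is not enough, here is a rough Python sketch of the translation for one position. It is not how tophat or SeqMonk does it. It assumes 1-based inclusive coordinates, that the transcript sequence is exactly the listed exons joined together, and it uses only the two CDS lines quoted above, which are presumably just part of that gene.

```python
def transcript_to_genome(pos, exons, strand="+"):
    """Map a 1-based position within a spliced transcript to a genomic position.

    exons: list of (start, end) genomic intervals, 1-based and inclusive.
    """
    if pos < 1:
        raise ValueError("position must be 1 or greater")
    # On the minus strand the transcript starts at the highest coordinate.
    ordered = sorted(exons, reverse=(strand == "-"))
    remaining = pos
    for start, end in ordered:
        length = end - start + 1
        if remaining <= length:
            if strand == "+":
                return start + remaining - 1
            return end - remaining + 1
        remaining -= length
    raise ValueError("position lies beyond the end of the transcript")

# The two CDS intervals quoted in the question (Chr6, + strand).
cds = [(14845357, 14845647), (14848222, 14849031)]
print(transcript_to_genome(291, cds))  # last base of the first interval
print(transcript_to_genome(292, cds))  # first base of the second interval
```

Positions 291 and 292 are adjacent in the transcript but land thousands of bases apart in the genome, which is the per-exon offset problem. A real conversion would also have to split reads that span an exon boundary.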

                          If you can give us a bit more information about where this data came from and how the mapping was done we can probably give a more concrete answer.

                          Comment


                          • So the reference genome that I've loaded into Seqmonk was the one that I was trying to load a week or two ago - from gff file at http://bioinfo.bti.cornell.edu/cgi-b...i/download.cgi. (Your fix worked well, thanks for that).

                            I've taken the corresponding file of coding sequences from that site (ftp://bioinfo.bti.cornell.edu/pub/ki...ruit_cds.fa.gz), which corresponds to the 39040 genes in the gff file, and used that to create an index for bwa.
                            I then used bwa to map transcriptome data against those coding sequences, ending up with a sam file. I imagine this is where the problem is. The genes get split up into chromosomes when the reference is loaded into seqmonk, but there's no chromosome information in the cds file - thus no chromosome information when the mapped reads are put into the sam file.

                            Though, for reference, when I first started, I mapped my transcriptome data against the kiwifruit pseudomolecule sequence - which are the chromosomes - and I was getting the same error, though a lot fewer of them.
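
A quick way to see this kind of mismatch before importing is to compare the reference names in the SAM header with the sequence names in the first column of the GFF. The Python sketch below does that on lines of text; the function names are made up, it assumes tab-separated input as both formats specify, and it only approximates whatever name matching SeqMonk itself performs.

```python
def sam_reference_names(sam_lines):
    """Collect the SN: values from the @SQ lines of a SAM header."""
    names = []
    for line in sam_lines:
        if not line.startswith("@"):
            break  # the header is over once alignment lines begin
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names

def gff_sequence_names(gff_lines):
    """Collect the sequence names from column 1 of a GFF file."""
    names = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t")[0])
    return names

def unmatched_references(sam_lines, gff_lines):
    """SAM reference names that never appear as a GFF sequence name."""
    known = gff_sequence_names(gff_lines)
    return [name for name in sam_reference_names(sam_lines) if name not in known]

sam = ["@SQ\tSN:Achn215061\tLN:4998"]
gff = ["Chr6\tglean\tgene\t14845357\t14856602\t.\t+\t.\tID=Achn215061; status=novel;"]
print(unmatched_references(sam, gff))
```

For real files, pass open file handles instead of the small lists used here.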

                            Any other info that would be helpful?

                            Comment


                            • Sorry. I may have answered my own question. When I was using the kiwifruit pseudomolecule sequences (i.e. the chromosome sequences) to map my transcriptome data against, I was getting the same error - but I didn't, at the time, find the link about the mapping of the chromosomes.

                              I'm repeating the mapping with the chromosome sequences, and I'll see if I get results wherein I can figure out how to set up the aliases.

                              Cheers
                              Ben.

                              Comment


                              • Hi Ben,

                                You should be fine if you map against the chromosome sequences. You'll need to use a splice aware mapper such as tophat to do the mapping, but you can also pass in your GTF file to the mapper so that it will effectively map against the transcriptome first, but will give you genomic coordinates.

                                Let me know if it works out OK, but hopefully this batch of mapped reads will be OK.

                                Simon.
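
For anyone following along, a tophat run with a supplied annotation might be assembled like the Python sketch below. It only builds and prints the command and does not run anything. The -G (annotation GTF) and -o (output folder) options are tophat's, but the index prefix and file names are placeholders, and the annotation has to be in a format tophat accepts.

```python
import shlex

def tophat_command(index_prefix, reads, gtf, out_dir="tophat_out"):
    """Build a tophat command line that maps against the genome index but
    uses the supplied annotation to guide spliced alignment."""
    return ["tophat", "-G", gtf, "-o", out_dir, index_prefix, ",".join(reads)]

# Placeholder names, not real files.
cmd = tophat_command("kiwifruit_chromosomes", ["sample1.fastq"], "kiwifruit_genes.gtf")
print(shlex.join(cmd))
```

The resulting BAM file carries genomic coordinates, so it can be imported against the chromosome-based genome.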

                                Comment
