  • cDNA analysis 454 assembler

    Hello,
    Could anybody explain, from their experience, the output files from a 454 cDNA assembly (isotigs, contigs, graph, etc.)? For example, which file should be used for further analysis, 454AllIsotigs or AllContigs, and what exactly is the difference? How can the graph be visualized? It is impossible to understand anything from the graph.txt output file. Thanks a lot!

  • #2
    Isotigs are transcripts, build out of the contigs. Different isogroups within the same isogroup represent alternative splice variants. This makes the isogroup the equivalent of a gene.

    Take this with a grain of salt, though: it is based on mining the contig graph for subgraphs (isogroups) and traversing all possible subgraphs (isotigs). We find, for example, that small variations (SNPs, indels) generate almost identical isotigs. So clustering the isotigs with CD-HIT might help.

    Visualizing the graph is a wish we all have.



    • #3
      more about cDNA

      Thanks a lot. I have read your blog, which explains things very well. Still, some questions are left:
      1. In the file 454AllContigs, there are some "contigs" with one or a few nucleotides.
      What are those "contigs"?
      2. Some isogroups include only contigs and no isotigs (the first 2 groups in our case), and the short "contigs" from the previous question are also assigned to such an isogroup. So what is this isogroup? Is it all the same gene? Different genes? Why are there no isotigs?
      3. In the "454 graph" file there is a scaffold section; however, we used non-paired-end sequencing, so what is this scaffold based on?
      4. Which of the files are recommended for further analysis, such as BLAST? The 454Isotigs.fna? The 454AllContigs.fna (and then how should all the very short sequences be treated)?



      • #4
        1. In the file 454AllContigs, there are some "contigs" with one or a few nucleotides.
        What are those "contigs"?
        These very small contigs seem to be produced when Newbler has difficulty resolving the edges of real contigs. We often see these in very highly abundant transcripts, presumably because the number of sequencing errors is high enough to make Newbler think these are real variations. So if the edge of an exon looks like:


        ...CATGCATGAAA
        ...CATGCATGAAA
        ...CATGCATGAAA
        ...CATGCATGAAAA
        ...CATGCATGAAAA


        Newbler might consider that fourth 'A' in the last two reads to be a separate exon/contig.


        2. Some isogroups include only contigs and no isotigs (the first 2 groups in our case), and the short "contigs" from the previous question are also assigned to such an isogroup. So what is this isogroup? Is it all the same gene? Different genes? Why are there no isotigs?
        The isotigs are computed by traversing the contig graph, and Newbler has limits on how deep it will recurse when doing this. So if you have a bunch of these false contigs, it will eventually give up on trying to produce isotigs. You can try increasing the default limits, but in my experience even the maximum allowed values are not always sufficient.
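To see why a handful of false contigs can make path traversal blow up, here is a toy sketch. It is not Newbler's actual algorithm: the graph layout, the `enumerate_paths` function and the `max_paths` cap are all made up for illustration, and the graph is assumed to be acyclic. Each spurious "bubble" doubles the number of possible paths, so a fixed limit is exceeded quickly.

```python
from typing import Dict, List, Optional


def enumerate_paths(graph: Dict[str, List[str]], starts: List[str],
                    max_paths: int) -> Optional[List[List[str]]]:
    """Return every start-to-sink path, or None once more than max_paths exist.

    Assumes the graph is acyclic (hypothetical toy model, not Newbler's code).
    """
    paths: List[List[str]] = []

    def walk(node: str, trail: List[str]) -> bool:
        trail = trail + [node]
        successors = graph.get(node, [])
        if not successors:                 # sink reached: one complete path
            paths.append(trail)
            return len(paths) <= max_paths
        for nxt in successors:
            if not walk(nxt, trail):       # limit exceeded further down
                return False
        return True

    for start in starts:
        if not walk(start, []):
            return None                    # "give up": no paths reported
    return paths


def bubble_chain(n: int) -> Dict[str, List[str]]:
    """Chain of n two-way bubbles, giving 2**n distinct paths."""
    graph: Dict[str, List[str]] = {}
    for i in range(n):
        graph[f"n{i}"] = [f"a{i}", f"b{i}"]
        graph[f"a{i}"] = [f"n{i + 1}"]
        graph[f"b{i}"] = [f"n{i + 1}"]
    graph[f"n{n}"] = []
    return graph


# A clean isogroup: contig c2 is skipped in one variant, so two paths exist.
clean = {"c1": ["c2", "c3"], "c2": ["c3"], "c3": ["c4"], "c4": []}
print(enumerate_paths(clean, ["c1"], max_paths=100))

# Ten spurious bubbles give 1024 paths, which exceeds the cap of 100.
print(enumerate_paths(bubble_chain(10), ["n0"], max_paths=100))
```

In this toy model the clean graph yields two paths, while the noisy one returns nothing at all, which mirrors an isogroup that ends up with contigs but no isotigs.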

        4. Which of the files are recommended for further analysis, such as BLAST? The 454Isotigs.fna? The 454AllContigs.fna (and then how should all the very short sequences be treated)?
        Unfortunately, the only way to make sure your further analyses are using all your data is to take the 454Isotigs.fna plus the larger contigs from those isogroups where proper isotig formation failed.
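A minimal sketch of the "larger contigs" part, assuming a plain FASTA file and an arbitrary length cutoff. The names `parse_fasta` and `keep_long` and the `min_len` value are hypothetical, and the sketch does not restrict the contigs to isogroups without isotigs; that selection would have to come from the layout files.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_long(records: Iterable[Tuple[str, str]],
              min_len: int) -> List[Tuple[str, str]]:
    """Keep only records whose sequence has at least min_len bases."""
    return [(h, s) for h, s in records if len(s) >= min_len]


# Toy input; the header layout is assumed, not taken from a real file.
toy = """>contig00001  length=12   numreads=5
ACGTACGTACGT
>contig00002  length=1   numreads=2
A
>contig00003  length=8   numreads=3
ACGT
ACGT
"""

for header, seq in keep_long(parse_fasta(toy), min_len=8):
    print(header.split()[0], len(seq))
```

The one-nucleotide "contig" is dropped, and the two longer ones could then be appended to the isotig file.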



        • #5
          Originally posted by litali
          3. In the "454 graph" file there is a scaffold section; however, we used non-paired-end sequencing, so what is this scaffold based on?
          The scaffolding is not really scaffolding here, just a description of the relation between the contigs and the isotigs. The same description is given in different ways in the 454IsotigsLayout.txt and 454Isotigs.txt files.



          • #6
            Originally posted by flxlex
            Different isogroups within the same isogroup represent alternative splice variants.
            I guess you meant: Different "isotigs" within the same isogroup represent (...)



            • #7
              Originally posted by CHRYSES
              I guess you meant: Different "isotigs" within the same isogroup represent (...)
              Yep. Thanks...



              • #8
                Hi all!
                I did a Newbler transcriptome assembly a year ago, and it was very difficult to find information about the output (flxlex, thank you very much for your blog!). I tried to find out how many reads were assembled, and I got different results depending on which file I looked at. For instance, according to 454AllContigs.fna, 12310 reads were assembled for a sample identified by a MID tag (multiplexed); I summed the numreads= values in the last column of the header lines. However, the 454NewblerMetrics.txt file reports:
                numberAssembled = 6603;
                numberPartial = 5359;
                numberSingleton = 8674;
                numberRepeat = 1101;
                numberOutlier = 723;
                Total reads = 22460
                What could be the reason for this discrepancy?
                I did the assembly with the release 1.1.03.24 of Newbler.
                Regards,
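The arithmetic can be checked with a short sketch. The metric values are the ones quoted above; the `sum_numreads` helper and the toy header lines are illustrative, and treating "assembled plus partial" as the reads that contribute to contigs is an assumption, not something the metrics file states.

```python
import re
from typing import Iterable

# Values quoted from the 454NewblerMetrics.txt excerpt in this post.
metrics = {
    "numberAssembled": 6603,
    "numberPartial": 5359,
    "numberSingleton": 8674,
    "numberRepeat": 1101,
    "numberOutlier": 723,
}

total = sum(metrics.values())
# Assumption: only fully and partially assembled reads end up in contigs.
in_contigs = metrics["numberAssembled"] + metrics["numberPartial"]
fasta_total = 12310  # sum of numreads= over 454AllContigs.fna, as reported
excess = fasta_total - in_contigs


def sum_numreads(headers: Iterable[str]) -> int:
    """Add up the numreads= field of FASTA header lines (assumed layout)."""
    count = 0
    for header in headers:
        match = re.search(r"numreads=(\d+)", header)
        if match:
            count += int(match.group(1))
    return count


print(total, in_contigs, excess)
print(sum_numreads([">contig00001  length=12   numreads=5",
                    ">contig00002  length=1   numreads=2"]))
```

The five categories do add up to the 22460 total reads, while the per-contig sum is 348 higher than assembled plus partial under the assumption above.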



                • #9
                  Originally posted by jordi
                  Hi all!
                  Which could be the reason for this discrepancy?
                  I suspect that some of the reads are being split among the contigs. Such reads would be counted twice.



                  • #10
                    mmmmm, Ponder

                    Hello.
                    I also want to make sure every possible sequence is used in my further data analyses:

                    Originally posted by flxlex
                    "Isotigs are transcripts, build out of the contigs."
                    Originally posted by cram
                    "Unfortunately, the only way to make sure your further analyses are using all your data is to take the 454Isotigs.fna plus the larger contigs from those isogroups where proper isotig formation failed."
                    Originally posted by flxlex
                    CD-hit would help
                    Thanks flxlex, that program is a real help.

                    To clarify; would combining the Isotig.fna and the contigs.fna files into a single file and then running CD-hit give you a comprehensive, non-redundant set of transcripts from your 454 transcriptome for further analyses?

                    Are there any single reads elsewhere that are neither contigs nor isotigs but are still useful?

                    Thank you for any advice,

                    John.



                    • #11
                      Originally posted by poisson200
                      To clarify; would combining the Isotig.fna and the contigs.fna files into a single file and then running CD-hit give you a comprehensive, non-redundant set of transcripts from your 454 transcriptome for further analyses?
                      Hmm, that could actually work, hadn't thought of that. I always thought of running CD-HIT per isogroup with some looping script. Taking all contigs and isotigs into a CD-HIT run might collapse paralogues, though...
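The per-isogroup idea could start from something like the sketch below, which only does the grouping step. It assumes isotig headers carry a `gene=isogroupNNNNN` field; that layout, the toy headers and the `group_by_isogroup` name are assumptions for illustration. Each resulting group would then be written to its own file and clustered separately.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_isogroup(headers: Iterable[str]) -> Dict[str, List[str]]:
    """Map each isogroup name to the IDs of the sequences assigned to it.

    Headers without a gene= field are collected under "unassigned".
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for header in headers:
        seq_id = header.split()[0].lstrip(">")
        match = re.search(r"gene=(\S+)", header)
        key = match.group(1) if match else "unassigned"
        groups[key].append(seq_id)
    return dict(groups)


# Toy headers in the assumed layout.
toy_headers = [
    ">isotig00001  gene=isogroup00001  length=1171  numContigs=3",
    ">isotig00002  gene=isogroup00001  length=1102  numContigs=2",
    ">isotig00003  gene=isogroup00002  length=640  numContigs=1",
]

for isogroup, members in group_by_isogroup(toy_headers).items():
    print(isogroup, members)
```

Clustering within a group keeps sequences from different isogroups apart, which is what avoids collapsing paralogues across genes.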

                      Are there any single reads elsewhere that are neither contigs nor isotigs but are still useful?
                      Yep, but so far Newbler does not output them in a separate file. You can get the IDs of the singleton reads from the 454ReadStatus file. Further, check this post:

                      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc
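Pulling the singleton IDs out of the read status file might look like the sketch below. It assumes a tab-separated layout with the read accession in the first column and the status in the second, and a status value spelled "Singleton"; check these against your own file, since the column layout here is an assumption.

```python
from typing import Iterable, List


def singleton_ids(lines: Iterable[str]) -> List[str]:
    """Return accessions whose status column reads 'Singleton'.

    Assumes tab-separated columns: accession, status, then optional extras.
    """
    ids: List[str] = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1].strip() == "Singleton":
            ids.append(fields[0].strip())
    return ids


# Toy lines in the assumed layout (made-up accessions).
toy_status = [
    "Accno\tRead Status\n",
    "READ0001\tAssembled\tcontig00001\n",
    "READ0002\tSingleton\n",
    "READ0003\tOutlier\n",
    "READ0004\tSingleton\n",
]

print(singleton_ids(toy_status))
```

With a real file, the lines would come from `open(path)` instead of the toy list, and the resulting IDs can be used to fetch the reads from the original input.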



                      • #12
                        Hi flxlex,
                        Thanks for the quick reply and the answers.

                        Originally posted by flxlex
                        Taking all contigs and isotigs into a CD-HIT run might collapse paralogues, though...
                        Looking at CD-hit, by default it looks for 98% identity or greater, which I think should be stringent enough not to collapse any paralogs (paralogs would have to be from a very recent gene duplication event or from a CNV for that to happen) but it is a good point to bear in mind.

                        To correct myself: for cd-hit-est, the identity threshold is 0.9 by default, so for my purposes it should be set to 0.98.

                        Thanks again,

                        John.
                        Last edited by poisson200; 10-28-2010, 05:36 AM.
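For reference, a small sketch that builds the corresponding command line. The `-i`, `-o` and `-c` options are cd-hit-est's input, output and identity-threshold flags; the file names and the `cdhit_est_command` helper are placeholders. The sketch only prints the command and does not run it.

```python
import shlex
from typing import List


def cdhit_est_command(fasta_in: str, fasta_out: str,
                      identity: float = 0.98) -> List[str]:
    """Build an argument list for cd-hit-est with an explicit identity cutoff."""
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be in the interval (0, 1]")
    return ["cd-hit-est", "-i", fasta_in, "-o", fasta_out, "-c", str(identity)]


# Placeholder file names: the combined isotig + contig FASTA and the output.
cmd = cdhit_est_command("combined.fna", "nonredundant.fna", identity=0.98)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once cd-hit-est is installed and the input file exists.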
