  • cufflinks question

    Hello,

    I work with a polyploid species. I would like to use cufflinks to get a better assembly out of my data than I have been able to get with SOAP or ABySS.

    I have used novoalign for my alignments since, to the best of my knowledge, it allows the most mismatches between reference and reads. I am not a bioinformatics guy, though.

    Regardless, I aligned with novoalign using the sam output option. I converted to bam, added the custom XS tag with a perl script, sorted the bam file and then plugged it into cufflinks.

    However, that is where it all went wrong!!

    This was my command:

    cufflinks -v 10lap.XS.bam

    I didn't set many of the advanced options because I just want to produce the assembly file; I'm not worried about anything else yet.

    This was the output:

    Warning: Your version of Cufflinks is not up-to-date. It is recommended that you upgrade to Cufflinks v1.2.1 to benefit from the most recent features and bug fixes (http://cufflinks.cbcb.umd.edu).
    [15:18:52] Inspecting reads and determining fragment length distribution.
    Inspecting bundle NODE_3_length_1029_cov_324.759949:0-1083 with 2007 reads
    Bad intron table has 0 introns: (0 alloc'd, 0 used)
    Map has 2007 hits, 965 are non-redundant
    Processed 1 loci.
    > Map Properties:
    > Total Map Mass: 2007.00
    > Fragment Length Distribution: Truncated Gaussian (default)
    > Default Mean: 200
    > Default Std Dev: 80
    0 ReadHits still live
    Found 2 reference contigs
    Total map density: 2007.000000
    [15:18:52] Assembling transcripts and estimating abundances.
    Segmentation fault

    I would greatly appreciate any hints on how to make that segmentation fault go away!

    Thank you for your time,

    - James
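The XS-tagging step described in this post can be sketched roughly as follows. This is not James's perl script, which is not shown in the thread; it is a minimal Python illustration that appends an `XS:A:+` or `XS:A:-` tag to each mapped SAM record. The strand rule used here (read orientation gives the transcript strand, with the second mate of a pair flipped) is an assumption that only holds for a suitably stranded library, and the function name `add_xs_tag` is hypothetical.

```python
def add_xs_tag(sam_lines):
    """Yield SAM lines with an XS:A strand tag appended to mapped records.

    Assumption: the library is stranded so that read 1 (or a single-end
    read) lies on the transcript strand and read 2 lies on the opposite one.
    """
    for line in sam_lines:
        line = line.rstrip("\n")
        # Header lines and blank lines pass through untouched.
        if not line or line.startswith("@"):
            yield line
            continue
        fields = line.split("\t")
        flag = int(fields[1])
        # Skip unmapped reads (0x4) and records that already carry XS:A.
        if flag & 0x4 or any(f.startswith("XS:A:") for f in fields[11:]):
            yield line
            continue
        reverse = bool(flag & 0x10)   # read aligned to the reverse strand
        if flag & 0x80:               # second mate in a pair: flip
            reverse = not reverse
        fields.append("XS:A:-" if reverse else "XS:A:+")
        yield "\t".join(fields)
```

With an unstranded library the read orientation says nothing about the transcript strand, so a rule like this would attach arbitrary strands to the alignments.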

  • #2
    Hi James,

    Your post made me curious. Could you please tell us some more about what you're working with?
    - What are your reads from? Are they RNA-seq data or from genomic DNA?
    - What version of cufflinks are you using (since you're getting an update warning)?
    - What sort of genome size do you expect? Is it very small? Eukaryotic?
    - Do you have a reference annotation?

    I ask because most of your output looks quite odd to me, even before the segmentation fault, so I wondered whether Cufflinks might not be appropriate for what you're trying to do. Compare your map mass and number of loci processed to mine. Here is my output so you can see what I mean (although it's not from the most up-to-date version of cufflinks either):

    Code:
    cufflinks: /usr/lib64/libz.so.1: no version information available (required by cufflinks)
    You are using Cufflinks v1.1.0, which is the most recent release.
    [15:21:00] Loading reference annotation.
    [15:21:05] Inspecting reads and determining fragment length distribution.
    > Processed 149777 loci.                       [*************************] 100%
    > Map Properties:
    >       Total Map Mass: 60566592.87
    >       Read Type: 0bp single-end
    >       Fragment Length Distribution: Truncated Gaussian (default)
    >                     Default Mean: 200
    >                  Default Std Dev: 80
    [15:30:57] Assembling transcripts and estimating abundances.
    > Processed 149777 loci.                       [*************************] 100%
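One way to check the kind of oddity pointed out above (a single locus, an intron table with 0 introns) before running Cufflinks is to summarise the alignment file directly. The sketch below is only an illustration, assuming plain SAM text as input; the function name `summarize_sam` is hypothetical. It counts mapped records, records whose CIGAR string contains an `N` (skipped region, i.e. a spliced alignment), and the number of distinct reference sequences that received hits.

```python
def summarize_sam(sam_lines):
    """Count mapped reads, spliced reads, and reference sequences hit."""
    mapped = 0
    spliced = 0
    references = set()
    for line in sam_lines:
        # Ignore header and blank lines.
        if line.startswith("@") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, cigar = int(fields[1]), fields[2], fields[5]
        # Skip unmapped records.
        if flag & 0x4 or rname == "*":
            continue
        mapped += 1
        references.add(rname)
        # 'N' in a CIGAR string marks a skipped reference region (intron).
        if "N" in cigar:
            spliced += 1
    return {"mapped": mapped, "spliced": spliced,
            "references_hit": len(references)}
```

A `spliced` count of zero would be consistent with an aligner that does not produce gapped alignments across introns, or with reads aligned to a reference that has no introns, such as assembled transcript contigs.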


    • #3
      Hi,

      I have RNA seq data and version 1.2.0 of cufflinks.

      I work with sugarcane, which has a large, complex genome. S. officinarum has a monoploid genome of about 930 Mbp, slightly larger than Sorghum at 760 Mbp.

      That being said, I have two choices for a reference: Sorghum, or a decent assembly that I have from 42 million paired-end reads from S. officinarum, specifically LA Purple. Its N50 is around 950.

      I have not been able to get decent assemblies from any of my other sequenced sugarcanes. I expect this is due to the high ploidy: contigs get pulled apart because of haplotypes, or something similar, arising from the differences between the many copies of homologous chromosomes.

      I am hoping to get around that by using cufflinks to create contigs based on the alignment. I can get 70-80% of my reads aligned to a reference, so by creating contigs of a sort based on the alignment, I may be able to get a decent assembly, which would greatly aid my work.

      Thanks for your help!


      • #4
        Hi again,

        First of all, a couple more questions. When you say you could not get good assemblies, do you mean using RNA-seq data or genomic data? If RNA-seq, do you think the poor assembly is due to alternative splicing isoforms?

        I am still trying to get my head around what you're trying to do, but it does not seem to be the standard use of Cufflinks. Cufflinks is really only good for one type of thing: using alignments of reads to define where the genes lie on your reference chromosome/contig. It will help you figure out which reads originate from the same gene where alternative splicing is occurring, and the structure of your gene, but that is about all. It will output genomic coordinates but generally not contigs, and if you do get contigs, they will be based on the reference species' genomic sequence, with mismatches between your reads and the genome assumed to be errors. What type of information are you hoping to obtain? Gene expression, gene structure, something else?
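The point about contigs being based on the reference sequence can be illustrated with a toy example. The sketch below is not Cufflinks code; it assumes exon coordinates in the 1-based, inclusive style used by GTF files, handles the plus strand only, and uses a hypothetical function name `transcript_from_reference`. It shows that a sequence built from coordinates plus a reference contains only reference bases, whatever the aligned reads looked like.

```python
def transcript_from_reference(reference, exons):
    """Concatenate reference bases for a list of exon intervals.

    reference: reference sequence as a string.
    exons: (start, end) pairs, 1-based inclusive, plus strand assumed.
    """
    return "".join(reference[start - 1:end] for start, end in sorted(exons))


if __name__ == "__main__":
    reference = "AAACCCGGGTTT"
    exons = [(1, 3), (7, 9)]
    # A read such as "AAAGTG" that differs from the reference at one base
    # has no influence on the result: only reference bases are emitted.
    print(transcript_from_reference(reference, exons))  # prints AAAGGG
```

So variants specific to the sequenced sugarcane line would not appear in sequences derived this way from a Sorghum reference.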

        I just looked at your original post again and wondered whether you should use TopHat to align your reads anyway. TopHat will chop your reads into smaller chunks/segments (I think ~25-40 bp is the default, but check the manual), and you can then specify how many mismatches to allow per segment. You should have no problem getting as many mismatches as you want with this method, and you can also change the size of the segments if you wish. In doing this, TopHat will produce alignments with gaps for introns, which will result in the best output from Cufflinks in the long run.
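The segment idea described above can be sketched as simple arithmetic. This is an illustration only, not TopHat's implementation: how TopHat treats a short trailing segment, and any separate cap on mismatches over the whole read, may differ. The values 25 bp and 2 mismatches are placeholders in the range mentioned above, and both function names are hypothetical. TopHat exposes the corresponding settings as `--segment-length` and `--segment-mismatches`; check the manual for the defaults in your version.

```python
def split_into_segments(read, segment_length=25):
    """Split a read into consecutive chunks of at most segment_length bases."""
    return [read[i:i + segment_length]
            for i in range(0, len(read), segment_length)]


def max_total_mismatches(read_length, segment_length=25, per_segment=2):
    """Upper bound on mismatches if each segment may carry per_segment."""
    n_segments = -(-read_length // segment_length)  # ceiling division
    return n_segments * per_segment


if __name__ == "__main__":
    read = "A" * 75
    print([len(s) for s in split_into_segments(read)])  # prints [25, 25, 25]
    print(max_total_mismatches(len(read)))              # prints 6
```

Shorter segments with the same per-segment allowance raise the overall number of mismatches a read can tolerate, which is the lever suggested here for aligning divergent reads.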
