  • multi-read-correct not working in Cuffdiff 2.2.1

    Hi all,

    I've posted this to the Cufflinks mailing list as well:


    I'm running cuffdiff v2.2.1 from the prebuilt binaries on both Linux and OS X. When I use the --multi-read-correct flag, the only output file that changes is "run.info", which records that the flag was set. The log shows a second progress bar when --multi-read-correct is enabled, so cuffdiff does appear to recognize the flag. Every other output file is identical with and without multi-read correction when I diff them. I've tried all the binaries since 2.0.0, and this behavior seems to have started in 2.1.0.
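
For anyone who wants to repeat the comparison, here is a minimal Python sketch of the with/without check described above. It assumes the two runs were written to separate output directories; the function name `differing_files` and the directory names in the usage note are hypothetical, and "run.info" is skipped by default because it is expected to differ.

```python
import filecmp
from pathlib import Path


def differing_files(dir_a, dir_b, ignore=("run.info",)):
    """Return sorted names of files that differ or exist in only one directory."""
    names_a = {p.name for p in Path(dir_a).iterdir() if p.is_file()}
    names_b = {p.name for p in Path(dir_b).iterdir() if p.is_file()}
    changed = []
    for name in sorted((names_a | names_b) - set(ignore)):
        in_both = name in names_a and name in names_b
        # shallow=False forces a byte-by-byte comparison of the contents.
        if not in_both or not filecmp.cmp(
            Path(dir_a) / name, Path(dir_b) / name, shallow=False
        ):
            changed.append(name)
    return changed
```

Calling `differing_files("with_mrc", "without_mrc")` on two such directories and getting an empty list back would correspond to the behavior reported here.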

    I've also noticed that multi-reads that map to duplicated genes in the genome lead to an FPKM and count of 0 for each duplicate. I'll post after this with concrete examples.

    Does anyone else notice no change in their output when enabling or disabling "--multi-read-correct" with cuffdiff 2.2.1?

    Michael Leonard

  • #2
    I've created a toy genome from Chlamydomonas reinhardtii to demonstrate what I'm seeing in the full genome. I extracted seven genes of interest with 1 kbp flanking regions and made each its own contig. I also extracted the reads mapping to these regions and remapped them. The last contig was duplicated exactly to simulate an exact duplication event. I'm using STAR to map the reads, but I see similar behavior with TopHat. The entire example and all files can be accessed here:


    The following pairs of genes show my debugging process. For each gene I report the count and FPKM for the first sample, along with the raw read pileup from bamtools (the "coverage" value). I see no difference with and without multi-read-correct:


    Biologically distinct genes, should both have FPKMs
    chromosome_test1 Cre01.g004300 count:6239 FPKM:122170 coverage:6443
    chromosome_test2 Cre01.g004500 count:5275 FPKM:118732 coverage:5452



    Small duplicated region: multi-read-correct seems to work correctly here.
    However, reads mapping to the duplicated region don't appear to be counted at all.
    97% of the first gene is contained in the second gene (30% global coverage).
    chromosome_test3 Cre17.g707450 count:0 FPKM:0 coverage:1657
    chromosome_test4 Cre07.g333746 count:4629 FPKM:57610 coverage:6373



    Very low count and FPKM for genes with a large duplicated region
    BLAST reports 100% sequence identity with 70% coverage between the genes
    chromosome_test5 Cre17.g738650 count:6 FPKM:123.825 coverage:1758
    chromosome_test6 Cre17.g698299 count:225 FPKM:2909 coverage:1965



    Duplicated full gene, expect 0 FPKM for each based on the above
    chromosome_test7 Cre01.g000900.1 count:0 FPKM:0 coverage:1959
    chromosome_test8 Cre01.g000900.2 count:0 FPKM:0 coverage:1959
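
The pattern in the numbers above can be checked mechanically. The following is a small sketch that parses the eight reported lines and lists the genes with a zero count despite a nonzero pileup. The names `REPORT`, `parse_line`, and `silent_genes` are illustrative, and the "coverage" field is taken to be the bamtools pileup.

```python
REPORT = """\
chromosome_test1 Cre01.g004300 count:6239 FPKM:122170 coverage:6443
chromosome_test2 Cre01.g004500 count:5275 FPKM:118732 coverage:5452
chromosome_test3 Cre17.g707450 count:0 FPKM:0 coverage:1657
chromosome_test4 Cre07.g333746 count:4629 FPKM:57610 coverage:6373
chromosome_test5 Cre17.g738650 count:6 FPKM:123.825 coverage:1758
chromosome_test6 Cre17.g698299 count:225 FPKM:2909 coverage:1965
chromosome_test7 Cre01.g000900.1 count:0 FPKM:0 coverage:1959
chromosome_test8 Cre01.g000900.2 count:0 FPKM:0 coverage:1959
"""


def parse_line(line):
    """Split 'contig gene key:value ...' into a dict with numeric values."""
    contig, gene, *fields = line.split()
    values = {key: float(val) for key, val in (f.split(":", 1) for f in fields)}
    return {"contig": contig, "gene": gene, **values}


def silent_genes(report):
    """Genes that have pileup coverage but were assigned a count of zero."""
    rows = [parse_line(line) for line in report.splitlines() if line.strip()]
    return [r["gene"] for r in rows if r["count"] == 0 and r["coverage"] > 0]
```

Here `silent_genes(REPORT)` returns Cre17.g707450 and both copies of Cre01.g000900, the three genes that sit entirely or almost entirely inside a duplicated region.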




    It appears that cuffdiff 2.1.0 through 2.2.1 ignores duplicated regions completely. I've tried this test on every binary I could run, with and without the multi-read-correct flag:


    The parameters I use for everything are in the "scripts" folder of the zip file above. These are my cuffdiff parameters:

    cuffdiff \
    --labels ${SAMPLE_LABELS} \
    --output-dir . \
    --num-threads 4 \
    --multi-read-correct \
    --max-bundle-frags 1000000000 \
    ${GTF} \
    ${SAMPLE_LIST}

    I'm assuming that multi-read-correct is supposed to assign each multi-read to whichever transcript better fits the expression model, and that with multi-read-correct disabled each duplicate should get half of the reads. Are these assumptions correct, and is the behavior I'm seeing intended? As it stands, I can imagine genes that are expressed but report 0 expression simply because they are duplicated somewhere else in the genome.
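
To make the two assumptions concrete, here is a toy Python sketch of what they would predict. This is not Cufflinks' implementation, only an illustration: `uniform_counts` splits each read equally among the genes it maps to, and `reweighted_counts` splits it in proportion to a hypothetical prior abundance estimate (`prior`), falling back to an equal split when no prior is available.

```python
from collections import defaultdict


def uniform_counts(read_hits):
    """Divide each read's weight equally among all genes it maps to."""
    counts = defaultdict(float)
    for genes in read_hits.values():
        share = 1.0 / len(genes)
        for gene in genes:
            counts[gene] += share
    return dict(counts)


def reweighted_counts(read_hits, prior):
    """Divide each read in proportion to a prior abundance per gene."""
    counts = defaultdict(float)
    for genes in read_hits.values():
        total = sum(prior.get(gene, 0.0) for gene in genes)
        for gene in genes:
            if total > 0:
                counts[gene] += prior.get(gene, 0.0) / total
            else:
                # No prior information: fall back to an equal split.
                counts[gene] += 1.0 / len(genes)
    return dict(counts)
```

Under either scheme, two exact duplicates that share all of their reads each end up with a nonzero count, which is why a count of 0 for both copies looks unexpected.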

    Michael Leonard
