I'm working with some RNA-seq data (4 samples, 2 lanes on an Illumina HiSeq 2000, ~600 million 100 bp PE reads) from a vertebrate without a sequenced genome. I initially wanted both to assemble an accurate de novo transcriptome from my combined reads and to calculate differential expression for each transcript between the 4 samples. I've done the de novo assembly with Trinity, which looks great and is BLASTing well. However, I am now trying to map the reads back to this transcriptome using the Tuxedo suite (TopHat/Cufflinks) so I can calculate accurate FPKM values and fold changes.
Problem 1: Mapping to the transcriptome, which has ~450k transcripts, generates BAM/SAM files with massive headers that are difficult to manage and will not work with Cufflinks/Cuffcompare/Cuffdiff at all. I cannot modify the source to allow larger headers because I am running everything on servers with precompiled binaries that I cannot touch or replace.
Problem 2: I've artificially joined my transcripts into "scaffolds" of 20k transcripts each, with strings of 10 N's between them. This gives me 21 scaffolds that I can map to, producing BAMs of workable size that go through Cufflinks without issue. What I can't do is prevent TopHat from integrating the N's into the mapped "exons". I've tried limiting the intron size, which prevents transcripts from spanning the N's, but TopHat will still map across them and integrate them into the exons.
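To make the setup concrete, here is a minimal Python sketch of the kind of scaffolding described above. It is an illustration only, not the exact script used: the function name `build_scaffolds`, the `scaffold_N` naming, and the choice to record 0-based half-open coordinates for each transcript are all assumptions.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str]      # (transcript_id, sequence)
Span = Tuple[str, int, int]   # (transcript_id, start, end), 0-based, half-open


def build_scaffolds(
    transcripts: Iterable[Record],
    per_scaffold: int = 20000,
    spacer_len: int = 10,
) -> List[Tuple[str, str, List[Span]]]:
    """Concatenate transcripts into artificial scaffolds separated by N spacers.

    Returns a list of (scaffold_name, scaffold_sequence, spans), where spans
    gives the position of every transcript inside its scaffold.
    """
    spacer = "N" * spacer_len
    records = list(transcripts)
    scaffolds = []
    for i in range(0, len(records), per_scaffold):
        chunk = records[i:i + per_scaffold]
        name = f"scaffold_{i // per_scaffold + 1}"  # hypothetical naming scheme
        parts: List[str] = []
        spans: List[Span] = []
        offset = 0
        for j, (tid, seq) in enumerate(chunk):
            if j > 0:
                # Spacer goes between transcripts only, not at scaffold ends.
                parts.append(spacer)
                offset += spacer_len
            spans.append((tid, offset, offset + len(seq)))
            parts.append(seq)
            offset += len(seq)
        scaffolds.append((name, "".join(parts), spans))
    return scaffolds
```

Keeping the per-transcript coordinates (for example written out as a BED or GTF file) would at least make it possible to check afterwards which alignments overlap a spacer, though it does not by itself stop the aligner from placing reads across the N's.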
Does anyone have a solution that can prevent TopHat from mapping across these patches of N's? Or another way to limit my header size and still allow Cufflinks/Cuffcompare/Cuffdiff to run?