Hi All,
CLIP/CRAC is an assay that allows researchers to identify direct interactions between proteins and RNA. Proteins are covalently cross-linked to RNA, and the RNA is then shortened using RNases. We subsequently ligate adapters to each end and either do Sanger sequencing or, if we expect the protein to have multiple binding sites or to bind lots of different RNAs, Illumina Solexa sequencing.
I've written a number of scripts in Python that I think are quite useful for processing the data. The scripts take Novoalign or SAM output (single or paired end) plus GTF files and can calculate UTR/intron overlap (for genes and transcripts), count the number of hits for each gene (sense or anti-sense), and generate output files for viewing in a genome browser (.sgr and GTF). They can also remove repetitive reads and reads with multiple alignment locations, and count clusters (i.e. assemblies of reads containing at least two overlapping reads). I've also written scripts to deal with barcoded 5' linkers, and scripts that generate nice multiple sequence alignments (more flexible, I think, than SAMtools) and simple pileups. Alignments can be generated against either genomic transcript sequences or coding sequences, the latter if you expect that your protein only binds to mature mRNAs.
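To give a feel for what "counting clusters" means here, below is a minimal sketch, not the code from the actual scripts. It assumes each read has already been reduced to a (chromosome, strand, start, end) tuple with inclusive coordinates; the function name count_clusters and the min_reads parameter are made up for illustration.

```python
from collections import defaultdict


def count_clusters(reads, min_reads=2):
    """Merge overlapping reads and return the clusters found.

    reads: iterable of (chromosome, strand, start, end) tuples with
           inclusive coordinates (an assumed, simplified read format).
    Returns a sorted list of (chromosome, strand, start, end, n_reads)
    for every group of overlapping reads with at least min_reads members.
    """
    # Group read intervals by chromosome and strand so that reads on
    # opposite strands never end up in the same cluster.
    by_location = defaultdict(list)
    for chrom, strand, start, end in reads:
        by_location[(chrom, strand)].append((start, end))

    clusters = []
    for (chrom, strand), intervals in by_location.items():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        n_reads = 1
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Read overlaps the current cluster: extend it.
                cur_end = max(cur_end, end)
                n_reads += 1
            else:
                # Gap found: close the current cluster and start a new one.
                if n_reads >= min_reads:
                    clusters.append((chrom, strand, cur_start, cur_end, n_reads))
                cur_start, cur_end, n_reads = start, end, 1
        if n_reads >= min_reads:
            clusters.append((chrom, strand, cur_start, cur_end, n_reads))
    return sorted(clusters)


example_reads = [
    ("chrI", "+", 100, 130),
    ("chrI", "+", 120, 150),  # overlaps the read above
    ("chrI", "+", 300, 330),  # isolated read, not a cluster
    ("chrI", "-", 110, 140),  # opposite strand, kept separate
]
for cluster in count_clusters(example_reads):
    print(cluster)
```

Sorting the intervals first means each read only has to be compared with the cluster currently being built, so a single pass per chromosome and strand is enough.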
If anybody is interested in trying these scripts, please let me know.
So far they have only been tested on Linux and Mac OS X.
G
I've also written Python classes for Novoalign, SAM and GTF files that can be used to parse these file types, and with a few lines of code you can use these parsers in your own scripts. I realise that there are tons of programs out there, but I designed these programs to be user friendly and, as mentioned before, they can be quite easily incorporated into your own scripts.
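As a rough illustration of the "few lines of code" idea, here is a hypothetical minimal GTF parser. It is not the interface of the classes described above; the names GTFRecord and parse_gtf are invented for this sketch, and it only assumes the standard nine tab-separated GTF columns with key "value"; attribute pairs.

```python
import re
from typing import Dict, Iterable, Iterator, NamedTuple


class GTFRecord(NamedTuple):
    """One GTF line (hypothetical record type for this sketch)."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    attributes: Dict[str, str]


# Matches attribute pairs such as: gene_id "YAL001C";
_ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf(lines: Iterable[str]) -> Iterator[GTFRecord]:
    """Yield a GTFRecord for every non-comment, non-empty line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError("expected 9 tab-separated fields, got %d" % len(fields))
        attributes = dict(_ATTRIBUTE.findall(fields[8]))
        yield GTFRecord(fields[0], fields[1], fields[2],
                        int(fields[3]), int(fields[4]),
                        fields[5], fields[6], fields[7], attributes)


example_lines = [
    "# example annotation (made-up coordinates)",
    'chrI\texample\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
]
for record in parse_gtf(example_lines):
    print(record.attributes["gene_id"], record.strand, record.start, record.end)
```

Because parse_gtf accepts any iterable of lines, an open file handle can be passed in directly, which is the kind of use in your own scripts I have in mind.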
I'm looking for volunteers interested in testing these scripts. They are still a work in progress but they could be quite useful to people doing a lot of RNAseq, CLIP or CRAC.