Asking for help on "BUILDING EXPANDED GENOMES" by Caltech ERANGE

jiexu

Junior Member

Join Date: Oct 2010

Posts: 3
#1

10-26-2010, 10:13 AM

I am referring to http://woldlab.caltech.edu/erange/README.build-rds and have the following questions:

0. Is "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz" under "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/" the right table to give to getsplicefa.py for hg19?

1. The README says "Download the chromosomes from UCSC". Which exact file should I download?
1.1 Is it fine to use bowtie-inspect to "output a FASTA file containing the sequences of the original references (with all non-A/C/G/T characters converted to Ns)" (http://bowtie-bio.sourceforge.net/ma...ndex-inspector) from the pre-built index at ftp://ftp.cbcb.umd.edu/pub/data/bowt.../hg19.ebwt.zip (listed under http://bowtie-bio.sourceforge.net/index.shtml), and then use the result as "the chromosomes from UCSC" for getsplicefa.py?

1.2 http://woldlab.caltech.edu/erange/RN...lysisSteps.txt has a sample command line: "python $ERANGEPATH/getsplicefa.py hsapiens /my/path/to/human/knownGene.txt hg18splice32.fa 28"

However, when I tried a similar command, it did not seem to work: "
xu@linux18> python getsplicefa.py Human.txt knownGene.txt expandedgenome_spacer2maxBorder27.fa 27
psyco not running
getsplicefa.py: version 3.5
72402
655896
10002 genes
20003 genes
30004 genes
40005 genes
50006 genes
60007 genes
70008 genes
3624 splices too short to be seen
555 splices will be under-reported"

It seems that Human.txt (the result of reverse-engineering ftp://ftp.cbcb.umd.edu/pub/data/bowt.../hg19.ebwt.zip, mentioned above) was not recognized by getsplicefa.py at all.

1.3 I am also completely lost on how to set up H_sapiens (a set of chromo1.bin~chromo22.bin and chromoX/Y.bin files plus hsapiens.genedb) so that the Cistematic core and ERANGE's getsplicefa.py can find it.

2. Where should the downloaded "chromosomes from UCSC" be placed relative to the scripts under ../ERANGE3.2.1/commoncode?

3. "http://woldlab.caltech.edu/erange/README.rna-seq" mentions "build expanded genomes with splices and spikes", and there is a file "mm9splices_spikes.tgz (the files for building the expanded genomes and remapping splices)". My questions are:
. What is the relationship between mm9splices_spikes.tgz and knownGene.txt?
. Is such a file, separate from knownGene.txt, necessary to build the expanded genome for hg19? If so, where can I get it?

4. For Cistematic 3.0, the instructions say "You will need to download the following packages: * cistematic3.0.tgz * db2.0.tgz".
However, where should the files in db2.0.tgz be put? In .../cistematic3.0/db, the folder that comes with cistematic3.0.tgz?

5. There are many unresolved threads at http://seqanswers.com/ on how to set CISTEMATIC_ROOT, ERANGEPATH, PYTHONPATH, and CISTEMATIC_TEMP.
Could you please post a working setup: a detailed overall picture of your variable/path settings and of how the files of ERANGE, Cistematic, the "chromosomes from UCSC", and knownGene.txt are organized?
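
For anyone comparing setups on question 5, a minimal Python sketch that only reports whether the four variables named above are set and point at existing directories. It does not know what the correct values are; the helper name report_paths is hypothetical and not part of ERANGE or Cistematic.

```python
import os

# The variable names come from the question above. What each one SHOULD
# point to is not known here, so this only reports the current state.
ERANGE_VARS = ["CISTEMATIC_ROOT", "ERANGEPATH", "PYTHONPATH", "CISTEMATIC_TEMP"]


def report_paths(env=None):
    """Return {variable name: status string} for the variables above."""
    env = os.environ if env is None else env
    report = {}
    for name in ERANGE_VARS:
        value = env.get(name)
        if not value:
            report[name] = "not set"
            continue
        # PYTHONPATH may hold several entries separated by os.pathsep.
        parts = [p for p in value.split(os.pathsep) if p]
        missing = [p for p in parts if not os.path.isdir(p)]
        report[name] = "ok" if not missing else "missing: " + ", ".join(missing)
    return report


if __name__ == "__main__":
    for name, status in report_paths().items():
        print("%s: %s" % (name, status))
```

Posting the printed report together with a directory listing would make it easier for others to spot a wrong path.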

6. My RNA-seq reads vary in length (from 13 nt to 31 nt). How should I set spacer and maxBorder for hg19 with such varying read lengths?
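
To make the arithmetic behind question 6 concrete, here is a small sketch. It ASSUMES the relation maxBorder = read length - 2 * spacer, which is only inferred from the numbers in this post: the sample command pairs 28 with a file named hg18splice32.fa, and the run above pairs 27 with a spacer of 2 for reads up to 31 nt. The documentation quoted here does not confirm that rule, and the function name max_border is hypothetical.

```python
def max_border(read_len, spacer=2):
    """Compute maxBorder under the ASSUMED rule read_len - 2 * spacer.

    The rule is inferred from the examples in this thread (32 -> 28,
    31 -> 27 with spacer 2); it is not taken from the ERANGE source.
    """
    border = read_len - 2 * spacer
    if border <= 0:
        raise ValueError("read length %d is too short for spacer %d"
                         % (read_len, spacer))
    return border


if __name__ == "__main__":
    # Shortest and longest reads from the question, plus the 32 nt example.
    for read_len in (13, 31, 32):
        print(read_len, max_border(read_len))
```

If the assumed rule holds, the shortest and longest reads give very different values, which is exactly why a single setting for mixed read lengths is unclear.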

7. Once the expanded genome from getsplicefa.py is ready, may I immediately run ./bowtie-build to generate the index and then map with ./bowtie?

8. How should the spliced mapping results be fed to a peak finder? I have no experience with the workflow downstream of spliced mapping.

Sorry for so many questions, but they seem to be common among newcomers to ERANGE. Please give me some guidance if possible.

Best,
jie