
**mike.t** · 05-25-2012, 06:01 AM

Originally posted by flyyuan

I have tried to use RepeatMasker to analyze the repeat elements of the soybean genome; however, the percentage of masked sequence is only 30%, much less than the 60% reported in the published paper. Also, the README file for RepeatMasker records a repeat percentage of 35% for Hg18, whereas other papers describe it as 50%. Could anyone tell me why there is such a difference?

Are you using RepBase? They probably used their own library of repeats which may have more repeat families than are in RepBase.

Originally posted by flyyuan

Another strange thing is that when I run RepeatMasker on chr10 or chr15 of Hg19 (I only tested these two chromosomes of Hg19), the program always stops at the output-processing stage (at cycle 9), but it works well on the soybean chromosome sequences and whole-genome sequence.

What do you mean by 'stop'? Does it crash and give you an error message?

**flyyuan** · 05-25-2012, 06:56 AM

Originally posted by mike.t

Are you using RepBase? They probably used their own library of repeats which may have more repeat families than are in RepBase.

I used the latest RepeatMasker library downloaded from the RepBase site. The repeat elements for soybean were only added in the latest version, so it may not include all the repeat families. So I tried testing the program on Hg19 to make sure I was using it correctly.

Originally posted by mike.t

What do you mean by 'stop'? Does it crash and give you an error message?

The program hangs at "cycle 9": it does not exit and there is no error message. The screen shows the following:
-----------------------------------------------
cycle7.....................................................
cycle8......................................................
cycle9
------------------------------------------------
The program would finish once cycle 10 completed, but it never gets that far. I checked the thread information and found that it consumed about 10 GB of memory, which looks strange.

**gprakhar** · 05-27-2012, 10:51 PM

Originally posted by flyyuan

I have tried to use RepeatMasker to analyze the repeat elements of the soybean genome; however, the percentage of masked sequence is only 30%, much less than the 60% reported in the published paper. Also, the README file for RepeatMasker records a repeat percentage of 35% for Hg18, whereas other papers describe it as 50%. Could anyone tell me why there is such a difference?

The soybean people do have a custom repeat library for it. I got it from here: http://SoyTEdb.org
I use this command:
$ RepeatMasker -s -nolow -norna -no_is -gff -lib Glycine_max_TE_lib.fa <sequence-file.fa>
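
One way to compare runs with different libraries is to measure the masked fraction directly from the masked FASTA that RepeatMasker writes. The following is a minimal Python sketch, not part of RepeatMasker. It assumes hard masking with N or X, or soft masking as lowercase, and the function name `masked_fraction` is made up for illustration. N runs already present in the input (assembly gaps) are counted as masked too, so treat the number as a rough check only.

```python
from typing import Iterable


def masked_fraction(fasta_lines: Iterable[str]) -> float:
    """Return the fraction of bases that look masked in a FASTA stream.

    Counts N/X (hard masking) and any lowercase letter (soft masking).
    Illustrative helper only; it is not a RepeatMasker function.
    """
    total = 0
    masked = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        total += len(line)
        masked += sum(1 for ch in line if ch in "NX" or ch.islower())
    return masked / total if total else 0.0


# Tiny invented example: 20 bases, 4 hard-masked and 4 soft-masked.
example = [">chr_demo", "ACGTNNNNAC", "GTacgtACGT"]
print(f"{100 * masked_fraction(example):.1f}% masked")
```

To apply it to real output, pass an open file handle instead of the list, for example the `.masked` file that RepeatMasker writes for the input sequence.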

Originally posted by flyyuan

Another strange thing is that when I run RepeatMasker on chr10 or chr15 of Hg19 (I only tested these two chromosomes of Hg19), the program always stops at the output-processing stage (at cycle 9), but it works well on the soybean chromosome sequences and whole-genome sequence.
Thanks in advance.

What exact command did you use to do this analysis?

--
pg

**rahularjun86** · 05-28-2012, 06:16 AM

Hi,
Have you tried RepeatScout (http://bix.ucsd.edu/repeatscout/)? It uses RepeatMasker in one of its 6 steps. I used it and got nice results.
Best wishes,
Rahul

**Artem** · 05-29-2012, 11:52 AM

Repeats in the genome are not rigorously defined. The common quote of repetitive elements making up 45% of the human genome is made using a relatively liberal definition of repeats (this is to account for the divergence in sequence) from the original human genome paper in 2001.

RepeatMasker uses a slightly modified RepBase library for its searching, and for the annotation it uses a cutoff of, I believe, 30% divergence, which is more on the conservative side. What you get, then, is an annotation of all elements that are within 30% divergence of the consensus sequence for that element.

The biological hypothesis, on the other hand (along with some more recent papers looking for deep homology), is that repeat-derived elements make up somewhere on the order of 70% of bulk human DNA. The difference between 70% and 45% consists of sequence that is highly divergent and difficult to recognize, or for which the common ancestor cannot be reconstructed. Further, as a repetitive sequence becomes older and older, the probability that it acquires endogenous function increases, so some very old elements (and even some young ones) become necessary for an organism to survive. Syncytin-A in mice is the textbook example: an endogenous retroviral envelope gene was 'exapted' in early placenta formation to form the syncytiotrophoblast and is necessary for survival.

So in answer to your question: if you want to increase the % of repeats masked, you can increase the divergence cutoff, but you risk masking informative endogenous sites. The cutoffs are what 'works', and if you want to do a detailed study of something like gene-regulation sites, it may even be beneficial to include repeats, as they can greatly affect gene expression.
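
To see how much of an annotation depends on the divergence level, one option is to filter the final RepeatMasker `.out` table after the run. The sketch below is a hedged Python illustration and does not change any RepeatMasker setting. It assumes the usual whitespace-separated `.out` layout (score, % divergence, % deletion, % insertion, query name, query begin, query end, and so on), the function name `bases_within_divergence` is hypothetical, and the two data rows are invented. Overlapping hits are counted twice, so the totals are approximate.

```python
from typing import Iterable


def bases_within_divergence(out_lines: Iterable[str], max_div: float) -> int:
    """Sum the query span of hits whose % divergence is <= max_div.

    Assumed column order: score, %div, %del, %ins, query, begin, end, ...
    Header lines are skipped because their first field is not an integer.
    """
    total = 0
    for line in out_lines:
        fields = line.split()
        if len(fields) < 11 or not fields[0].isdigit():
            continue  # header, blank, or malformed line
        perc_div = float(fields[1])
        begin, end = int(fields[5]), int(fields[6])
        if perc_div <= max_div:
            total += end - begin + 1  # coordinates are inclusive
    return total


# Invented rows in the assumed layout, preceded by header-like lines.
rows = [
    "   SW   perc perc perc  query  position in query  matching repeat",
    "score   div. del. ins.  sequence  begin end (left)  repeat class/family",
    "",
    " 1524 12.3 1.0 14.2 chr22 100 399 (1000) + AluY SINE/Alu 2 301 (10) 1",
    " 1462 27.7 0.3 17.7 chr22 500 599 (800) + L2 LINE/L2 1 100 (3000) 2",
]
for cutoff in (20.0, 30.0):
    print(cutoff, bases_within_divergence(rows, cutoff))
```

Comparing the totals at a few cutoffs gives a rough idea of how much of the annotated sequence comes from older, more diverged copies.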

I'm new to the repeats field, so if I made any mistakes above please forgive me, but that's the gist of it.

From the RepeatMasker Manual:

"The program can be run at three levels of speed or sensitivity. The only difference between these settings is the minimum match or word length in the initial (not quite) hashing step of the cross_match program (see the cross_match/phrap documentation). The "slow" setting will take about 3 times longer and will find and mask 0-5% more repetitive DNA sequences than the default setting. The "quick" settings miss 5-10% of the sequences masked by default, but will be 3 to 6 times faster. The alignments may extend more or be somewhat more accurate in the more sensitive settings as well.

At the sensitive settings RepeatMasker currently finds, on average, 47% of human genomic DNA to be derived from interspersed repeats. RepeatMasker is very sensitive in comparison with other programs, although comparison to some is skewed because of the use of much smaller databases."

Try using the slower settings.

**mege** · 06-06-2012, 08:33 PM

I am repeating the second question of flyyuan, because I have the same problem and there is no answer to this so far.
I have used RepeatMasker on the Drosophila, Caenorhabditis and Arabidopsis genomes without any problem. However, when I run RepeatMasker on the human genome with the 'nolow' option, it does not get through. There is no problem running it with the 'noint' option. With the 'nolow' option RepeatMasker gets through most of the jobs, but when processing repeats it blocks at cycle 9 (no error message, but nothing happens for 2-3 hours). I tried cutting the genome into smaller pieces (I suspected memory issues). It works fine for Chr19 (59 Mb) or ChrY, but not for Chr18 (78 Mb). So I cut all the chromosomes into 50 Mb pieces and tried to run RepeatMasker on a single fragment. This did not help; RepeatMasker still blocks at cycle 9. Do you have any idea where the problem comes from and how to fix it? The command used is as follows:
RepeatMasker -pa 4 -nolow -spec homo -dir temp input_file.fas
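
For anyone wanting to repeat the chunking test described above, here is a minimal Python sketch of cutting a sequence into fixed-size pieces. The function name `split_sequence` and the `name:start-end` labels are assumptions made for illustration. As reported in this post, chunking did not get past cycle 9, so this is a diagnostic step and not a fix; repeats that span a chunk boundary will also be split.

```python
from typing import Iterator, Tuple


def split_sequence(
    name: str, seq: str, chunk_size: int = 50_000_000
) -> Iterator[Tuple[str, str]]:
    """Yield (label, piece) pairs covering seq in chunk_size steps.

    Labels use 1-based inclusive coordinates, e.g. 'chr18:1-50000000'.
    """
    for start in range(0, len(seq), chunk_size):
        end = min(start + chunk_size, len(seq))
        yield f"{name}:{start + 1}-{end}", seq[start:end]


# Small invented example with a chunk size of 4 bases.
for label, piece in split_sequence("chr_demo", "ACGTACGTAC", chunk_size=4):
    print(f">{label}")
    print(piece)
```

Keeping the original coordinates in the label makes it easier to map any hits back onto the whole chromosome afterwards.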

Another question, again about the 'nolow' option: when this option is used, does RepeatMasker skip the detection of low-complexity regions and simple sequences altogether, or does it first detect these regions, then TEs, and simply not print the low-complexity and simple-sequence hits? In the first case I would expect false TE hits to be reported in low-complexity regions, while in the second this is less likely.

**mike.t** · 06-06-2012, 08:52 PM

Have you tried contacting the developers directly?

It looks like the problem is in the Perl script ProcessRepeats. Setting $DEBUG = 1; in that file may provide you with some more helpful output.

This part of the script seems to be finding and removing hits that are nested within other hits. You might check your repeat library to ensure there are no duplicates or nested elements.
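
A quick way to run that library check is sketched below in Python. It is a rough illustration under stated assumptions: it only finds exact duplicates and exact containment on the forward strand (real nested elements are usually diverged, so it will miss most of them), it compares every pair and is therefore only practical for small libraries, and the helper names `read_fasta` and `find_redundant` are made up for this example.

```python
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {name: uppercase sequence}."""
    parts: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word of the header
            parts[name] = []
        elif line and name is not None:
            parts[name].append(line.upper())
    return {n: "".join(p) for n, p in parts.items()}


def find_redundant(
    records: Dict[str, str]
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Return (duplicates, nested); nested pairs are (inner, outer)."""
    duplicates, nested = [], []
    names = sorted(records)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sa, sb = records[a], records[b]
            if not sa or not sb:
                continue  # ignore empty records
            if sa == sb:
                duplicates.append((a, b))
            elif sa in sb:
                nested.append((a, b))
            elif sb in sa:
                nested.append((b, a))
    return duplicates, nested


# Invented mini-library: A and B are identical, C sits inside both.
lib = [">A", "ACGTACGT", ">B", "acgtacgt", ">C", "GTAC", ">D", "TTTT"]
dups, inner = find_redundant(read_fasta(lib))
print("duplicates:", dups)
print("nested:", inner)
```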

You might try censor instead of RepeatMasker:

Software - Censor - GIRI

http://www.girinst.org/downloads/software/censor/

**A.N.Other** · 07-08-2012, 02:03 PM

I too am having the same issue as above, with RepeatMasker stopping at cycle 9 after identifying the elements in the input sequence. In fact, googling 'repeatmasker cycle 9' brought me to this thread.

I've had RepeatMasker quite happily do the entire mouse genome, but it won't manage any of the human chromosomes I've tried. After 2-3 days (over a weekend), the process is much as described above (~10-12 GB RAM, 1-2% processor usage) without progressing a single dot on the progress bar for cycle 9. This is using the latest downloads of RepeatMasker, TRF, RM-BLAST and the library from RepBase. The computer certainly isn't a limiting factor for my issues: 128 GB RAM, 64 CPU cores, etc.

Have either of you solved this issue, by any chance? I've set the debug flags to 1 for the RepeatMasker Perl script itself and for the ProcessRepeats script, but it doesn't give me anything to work with; nothing at all, in fact. I haven't tried to get more useful output from the scripts yet, but will next week.

I've emailed the developers, but I understand they've probably got other things on their minds, so if we can sort it out ourselves, it's probably better.

**A.N.Other** · 07-09-2012, 02:37 AM

Setting the debug flags for cycle 9 within the ProcessRepeats script doesn't shed any light on the issue for me. Execution stops after the following, with human chr22 for a fast run:

Code:

cycle 8 ..............................................................................
cycle 9 Cycle9: Considering: 1524 12.3 1.0 14.2 chr22                     0        0 30218631 + DELETE_ME#SINE/Alu        2      120      192   2 13072 b364s2i1       
Unhide Inserts:
 --vs-->1462 17.7 0.3 17.7 chr22                     0        0 20842511 + DELETE_ME#SINE/Alu        1      122      190   2 37457 b526s2i0       
 --vs-->1452 14.4 0.0 1.8 chr22                     0        0 11929977 + DELETE_ME#SINE/Alu      286      289       23   1 62701 b680s1i11

To avoid complication, settings were simply '-s -pa 16 -species "homo sapiens"'.

**rhubley** · 09-11-2012, 10:40 AM

A user just pointed out this forum to us. We are sorry to hear that people have been having problems running RepeatMasker. Luckily, with this user's help we tracked down the problem and have a fix for everyone. We will be releasing a new version of RepeatMasker soon; until then, please download a new ProcessRepeats script from http://www.repeatmasker.org/ProcessRepeats.gz , place it in your RepeatMasker directory and uncompress it with "gunzip ProcessRepeats.gz". If you have any further problems, please let us know using the feedback page on our website: http://www.repeatmasker.org.

**amitbik** · 01-20-2014, 01:20 AM

Repeatmasker

I am trying RepeatScout to find plant repeats. There are 6 steps, and one of them uses RepeatMasker with the command

./RepeatMasker -s -lib your_repeats_filtered1.fasta yourgenome.fasta

but on the command line it says it is checking for E. coli insertion elements, not any plant repeats.
Do I have to make a custom library for this?

Thanks

**mike.t** · 01-20-2014, 01:43 AM

The file your_repeats_filtered1.fasta IS your custom library. There is also a flag to enable/disable the E. coli insertion element check, but I usually disable it.

**amitbik** · 01-20-2014, 02:33 AM

Repeatmasker

Thanks mike.t

Actually, I read it here: http://seqanswers.com/forums/archive...hp/t-5448.html

My your_repeats_filtered1.fasta file is derived from my genome.fasta file, processed through several steps.

So do I have to make a repeat file by merging all the files into one, and then run it against my genome.fasta file?

Thanks.

**lwhitmore** · 04-10-2014, 02:04 PM

Hey everyone, I am new to RepeatMasker, so this may be a basic question, but does anyone know if this program can align sequences with IUPAC codes in them (such as R, K, and M)? I cannot seem to find a definitive answer anywhere.

Thanks for your help

question about Repeatmasker