  • Repeatmasker - How to make searches for low complexity regions less stringent?

    I am using RepeatMasker only to find low-complexity regions of DNA. With the default settings, "100 bp stretch of DNA is masked when it is >87% AT or >89% GC, a 30 bp stretch has to contain 29 A/T (or GC) nucleotides." What can I do to loosen these criteria and experiment with the settings?

    Perhaps there is a better program for what I want to accomplish? I have a list of 60000 rather short sequences (each about 600 bases long).

    Thanks
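
For readers who want to experiment with the composition rule quoted in the question outside RepeatMasker, here is a minimal Python sketch. It is not RepeatMasker's code: the function name `mask_biased_windows`, its parameters, and the choice to replace flagged bases with N are all assumptions made for illustration, and it only covers the 100 bp rule.

```python
def mask_biased_windows(seq: str, window: int = 100,
                        at_cutoff: float = 0.87, gc_cutoff: float = 0.89) -> str:
    """Mask with N every base inside a window whose AT or GC fraction exceeds its cutoff.

    Hypothetical helper: a simplified re-creation of the quoted 100 bp rule,
    not RepeatMasker's actual algorithm.
    """
    seq = seq.upper()
    masked = list(seq)
    # Slide a fixed-size window one base at a time; short sequences get one window.
    for start in range(0, max(1, len(seq) - window + 1)):
        chunk = seq[start:start + window]
        if not chunk:
            continue
        at_fraction = (chunk.count("A") + chunk.count("T")) / len(chunk)
        gc_fraction = (chunk.count("G") + chunk.count("C")) / len(chunk)
        if at_fraction > at_cutoff or gc_fraction > gc_cutoff:
            masked[start:start + len(chunk)] = "N" * len(chunk)
    return "".join(masked)
```

Lowering `at_cutoff` or `gc_cutoff` loosens the criteria, so more of each sequence is masked.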

  • #2
    I made a tool called BBMask, available in the BBTools package, here.

    Usage:
    bbmask.sh -Xmx6g in=file.fa out=masked.fa window=80 entropy=0.75 ke=5

    That will mask areas with entropy below 0.75 on a scale of 0-1, using a window size of 80 and a kmer length of 5 for the entropy calculation. Those are the default settings, but you can customize them. A higher entropy cutoff will mask more sequence. It's extremely fast, so you can play around with the settings (mainly entropy) until it masks the amount you want. It reports how much it masked.
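
To make the idea of a windowed k-mer entropy cutoff concrete, here is a rough Python sketch. It is not BBMask's implementation: the function names, the normalization to a 0-1 scale (dividing by the largest entropy the window could reach), and the one-base window step are assumptions, so the numbers will not match BBMask's output exactly.

```python
import math
from collections import Counter


def kmer_entropy(window: str, k: int = 5) -> float:
    """Shannon entropy of the k-mers in `window`, scaled to 0-1 (assumed normalization)."""
    kmers = [window[i:i + k] for i in range(len(window) - k + 1)]
    if len(kmers) < 2:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    # Upper bound: every k-mer distinct, limited by the 4**k possible DNA k-mers.
    max_entropy = math.log2(min(total, 4 ** k))
    return entropy / max_entropy


def mask_low_entropy(seq: str, window: int = 80, k: int = 5, cutoff: float = 0.75) -> str:
    """Replace with N every base inside a window whose entropy is below `cutoff`."""
    seq = seq.upper()
    masked = list(seq)
    for start in range(0, max(1, len(seq) - window + 1)):
        chunk = seq[start:start + window]
        if kmer_entropy(chunk, k) < cutoff:
            masked[start:start + len(chunk)] = "N" * len(chunk)
    return "".join(masked)
```

In this sketch, a homopolymer run scores 0 and a window with mostly distinct k-mers scores close to 1, which is why raising the cutoff masks more.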



    • #3
      Your program looks like it might do the trick. Quick and easy to change the parameters. But I would like an output file that gives me %masked/input sequence.

      I have tried the covstats and scafstats outputs, but I get "unknown parameter". There is an example below. What I have tried is changing the file format. In the example I have written .fa, but I have also tried other formats, or simply left the extension off. What am I doing wrong?

      Thank you for your kind help

      magnus@magnus-MacBookPro:~/Downloads/bbmap$ bash bbmask.sh -Xmx6g in=/home/magnus/Downloads/Testar_med_farre.fa out=/home/magnus/Documents/BBMask/masked8.fa covstats=/home/magnus/Documents/BBMask/covstats.fa window=20 entropy=0.95 ke=5 overwrite
      bbmask.sh: line 87: module: command not found
      bbmask.sh: line 88: module: command not found
      java -ea -Xmx6g -cp /home/magnus/Downloads/bbmap/current/ jgi.BBMask -Xmx6g in=/home/magnus/Downloads/Testar_med_farre.fa out=/home/magnus/Documents/BBMask/masked8.fa covstats=/home/magnus/Documents/BBMask/covstats.fa window=20 entropy=0.95 ke=5 overwrite
      Executing jgi.BBMask [-Xmx6g, in=/home/magnus/Downloads/Testar_med_farre.fa, out=/home/magnus/Documents/BBMask/masked8.fa, covstats=/home/magnus/Documents/BBMask/covstats.fa, window=20, entropy=0.95, ke=5, overwrite]

      Unknown parameter covstats=/home/magnus/Documents/BBMask/covstats.fa
      Exception in thread "main" java.lang.AssertionError: Unknown parameter covstats=/home/magnus/Documents/BBMask/covstats.fa
      at jgi.BBMask.<init>(BBMask.java:216)
      at jgi.BBMask.main(BBMask.java:45)



      • #4
        Oh... let me clarify. The "readme.txt" file is for BBMap. BBMask's instructions are in its shellscript; you can print them by running the shellscript (bbmask.sh) with no arguments. So, covstats and scafstats are just for BBMap. The percent masked will be printed to the screen. So, the command should be this:

        bash bbmask.sh -Xmx6g in=/home/magnus/Downloads/Testar_med_farre.fa out=/home/magnus/Documents/BBMask/masked8.fa window=20 entropy=0.95 ke=5 overwrite

        A complete run looks like this:

        Code:
        bash bbmask.sh in=Panicum_hallii.fasta out=masked.fasta
        java -ea -Xmx46673m -cp /usr/common/jgi/utilities/bbtools/prod-v33.42/lib/BBTools.jar jgi.BBMask in=Panicum_hallii.fasta out=masked.fasta
        Executing jgi.BBMask [in=Panicum_hallii.fasta, out=masked.fasta]
        
        Loading input
        Loading Time:                   2.920 seconds.
        
        Masking low-entropy (to disable, set 'mle=f')
        Low Complexity Masking Time:    2.703 seconds.
        Ref Bases:                 556945529    206.08m bases/sec
        Low Complexity Bases:         899687
        
        Converting masked bases to N
        Done Masking
        Conversion Time:                1.784 seconds.
        
        Writing output
        Writing Time:                   1.171 seconds.
        
        Total Bases Masked:           899687/556945529  0.162%
        Total Time:                     8.611 seconds.
        That all gets printed to stderr, so if you want to log it in a file, add 2>log.txt at the end, like this:

        bash bbmask.sh -Xmx6g in=/home/magnus/Downloads/Testar_med_farre.fa out=/home/magnus/Documents/BBMask/masked8.fa window=20 entropy=0.95 ke=5 overwrite 2>log.txt
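
BBMask reports only the overall total, so the per-sequence percentage asked for in #3 has to be computed from the masked FASTA afterwards. Below is a minimal Python sketch of one way to do that. It assumes masked bases were converted to N, as in the log above; the function name `percent_masked` is hypothetical, and any N already present in the input would be counted as masked.

```python
from typing import Iterable, Iterator, Tuple


def percent_masked(lines: Iterable[str], mask_char: str = "N") -> Iterator[Tuple[str, int, int, float]]:
    """Yield (name, masked_bases, length, percent_masked) for each FASTA record."""
    name, masked, length = None, 0, 0
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, masked, length, (100.0 * masked / length if length else 0.0)
            name, masked, length = line[1:].strip(), 0, 0
        elif name is not None and line:
            masked += line.upper().count(mask_char)
            length += len(line)
    if name is not None:
        yield name, masked, length, (100.0 * masked / length if length else 0.0)


# Example use (the file name is taken from the command above):
# with open("masked8.fa") as handle:
#     for name, masked, length, pct in percent_masked(handle):
#         print(f"{name}\t{masked}\t{length}\t{pct:.2f}")
```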
