**Risha** · 08-03-2015, 04:53 AM

Hi Asaf

I am also noticing this in our datasets. This is my first time analysing data from a NextSeq, and FastQC reports overrepresented poly-G sequences in Read 2.

Did you figure out what was going on?

**Asaf** · 08-03-2015, 11:38 PM

I emailed Illumina's representatives here in Israel but didn't get an answer. I think the explanation I gave above is reasonable (maybe low efficiency of RT in the cluster?). With v.2 chemistry we had better results, but we only ran one sample, so I can't tell for sure.
What I do is remove reads that have more than 80% G's and/or use a DUST filter to remove low-complexity reads. Beware that besides pure poly-G you'll probably have poly-G reads with other nucleotides appearing randomly in the sequence (these might even map to the genome), which is why I remove them before mapping.
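A minimal sketch of the 80% G filter described above, assuming plain 4-line FASTQ records. The function names (`g_fraction`, `parse_fastq`, `drop_g_rich`) are made up for illustration and are not from any tool mentioned in this thread; the DUST step is not covered here.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)


def g_fraction(seq: str) -> float:
    """Return the fraction of bases in seq that are G (case-insensitive)."""
    if not seq:
        return 0.0
    return seq.upper().count("G") / len(seq)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip stray blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # the '+' separator line
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n"), seq, qual


def drop_g_rich(records: Iterable[Record], max_g: float = 0.8) -> Iterator[Record]:
    """Keep only reads whose G fraction does not exceed max_g."""
    for rec in records:
        if g_fraction(rec[1]) <= max_g:
            yield rec
```

To use it on a file, something like `drop_g_rich(parse_fastq(open("reads_R2.fastq")))` would work (the file name is hypothetical). For paired-end data both mates have to be dropped together so the two files stay in sync.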

**[email protected]** · 08-05-2015, 12:07 AM

There is a tool available on GitHub for removing polyA, polyT, polyC, and polyG reads:

GitHub - OpenGene/AfterQC: Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data

https://github.com/OpenGene/after

It currently supports Illumina 1.8 or newer format.
AFTER can go through all the fastq files in a folder and write a good folder and a bad folder, containing the good reads and the bad reads of each fastq file.

Besides removing polyX reads, it can also:
Trim reads at the front and tail according to bad per-base sequence content
Detect and eliminate bubble artifacts caused by fluid dynamics issues in the sequencer
Filter low-quality reads

**[email protected]** · 08-05-2015, 12:15 AM

Use AFTER to do the filtering. It works well with NextSeq 500 data.

**Holinder** · 10-29-2015, 12:21 AM

I have noticed the same thing with NextSeq data. Mostly poly-G, but some other homopolymers as well (even poly-N). I tried this tool After to remove these reads, but it doesn't seem to work. What other program can work with paired-end reads and remove poly-X reads?

**[email protected]** · 10-29-2015, 12:28 AM

Originally posted by Holinder:

I have noticed the same thing with NextSeq data. Mostly poly-G, but some other homopolymers as well (even poly-N). I tried this tool After to remove these reads, but it doesn't seem to work. What other program can work with paired-end reads and remove poly-X reads?

What error did you get when using AFTER? Let me know and I will help you fix it.

**Holinder** · 10-29-2015, 12:41 AM

With default settings it marked almost all the reads as bad. And good reads had a minimum length of 24 bp, however the default should have been 35 bp.

**[email protected]** · 10-29-2015, 01:03 AM

Originally posted by Holinder:

With default settings it marked almost all the reads as bad. And good reads had a minimum length of 24 bp, however the default should have been 35 bp.

cd to the folder that contains your fastq files and try running:

Code:

python after.py -f0 -t0 -s24

-f0 means no trimming in the front
-t0 means no trimming in the tail
-s24 means set the min read length to 24 bp

**[email protected]** · 10-29-2015, 01:08 AM

And because your reads are extremely short, you should also set the following parameters:

-p POLY_SIZE_LIMIT, --poly_size_limit=POLY_SIZE_LIMIT
if exists one polyX(polyG means GGGGGGGGG...), and its length is >= POLY_SIZE_LIMIT, then this read/pair is bad. Default is 40
-a ALLOW_MISMATCH_IN_POLY, --allow_mismatch_in_poly=ALLOW_MISMATCH_IN_POLY
the count of allowed mismatches when evaluating poly_X. Default 5 means disallow any mismatches

The following options may work:

python after.py -f0 -t0 -s24 -p15 -a2

This means that any read containing a polyX stretch of 15 bp or more, with no more than 2 other bases inside the stretch, will be discarded.

For example:
******AAAAAAAAAATACAA****** will be treated as BAD
******AAACAAAAAATACAA****** will be treated as GOOD
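
To make the rule behind these two examples concrete, here is a rough sketch of a polyX check with allowed mismatches. It is only an illustration of the idea, not AFTER's actual code: the function names are made up, and the sliding-window logic is an assumption about how such a check could be written. The parameters mirror -p and -a from the command above.

```python
def has_poly_x(seq: str, poly_size_limit: int = 15, allow_mismatch: int = 2) -> bool:
    """Return True if some window of poly_size_limit bases is a single
    repeated base with at most allow_mismatch other bases in it."""
    seq = seq.upper()
    if poly_size_limit <= 0 or len(seq) < poly_size_limit:
        return False
    for base in "ACGTN":
        # Count the non-matching bases in the first window.
        mismatches = sum(1 for c in seq[:poly_size_limit] if c != base)
        if mismatches <= allow_mismatch:
            return True
        # Slide the window one base at a time, updating the count.
        for end in range(poly_size_limit, len(seq)):
            mismatches += (seq[end] != base) - (seq[end - poly_size_limit] != base)
            if mismatches <= allow_mismatch:
                return True
    return False


def pair_is_bad(read1: str, read2: str, poly_size_limit: int = 15,
                allow_mismatch: int = 2) -> bool:
    """A pair is bad if either mate contains a polyX stretch."""
    return (has_poly_x(read1, poly_size_limit, allow_mismatch)
            or has_poly_x(read2, poly_size_limit, allow_mismatch))
```

With the defaults above, the first 15-mer in the example has 2 non-A bases and is flagged, while the second has 3 and passes. Note that a simple window like this also flags stretches where the mismatches sit at the window edges, so the real tool may behave differently on borderline reads.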


poly-G in NextSeq
