
**GenoMax** · 11-07-2014, 08:05 AM

Originally posted by Marisa_Miller View Post

Hi all,

My question is, will this matter for mapping and assembly?

Most likely no.

The worrisome part is "blasted some sequences but got no hits". What DB was that against? Are you sure you have the right data?

**simonandrews** · 11-07-2014, 08:37 AM

Originally posted by Marisa_Miller View Post

Hi all,
After reading around on the forums and elsewhere on the internet, it seems like seeing weird results for Kmer overrepresentation and per base sequence content after running FastQC on Nextera XT libraries is common.

Hi Marisa,

I believe the issue with Nextera kits is similar to the one you get with RNA-Seq libraries in that there is a random priming step which creates some selection bias and results in the types of distribution you see in the FastQC reports.

For the RNA-Seq problem there is some literature about this and I suspect something very similar will come from the nextera fragmentation.

You wouldn't expect that the 5' bias is removed by trimming since the bases are not of poor quality and they will be correct sequence from your library. Trimming them would not help with the library, nor would it remove the bias (you'll end up with sequences slightly downstream of biased positions instead of over them).

The way to tell whether the bias is adversely affecting your results would be to look at the evenness of coverage of your genome in your mapped data. For RNA-Seq, although you see priming bias in the library you still get pretty even coverage of transcripts, so the practical effect doesn't seem to be too bad.
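One way to put a number on "evenness of coverage" is the coefficient of variation (standard deviation divided by mean) of per-base depth. This is only a sketch of one possible summary, not something prescribed in this thread; the function name `coverage_cv` is made up for illustration, and the depths are assumed to have been extracted already (for example from the third column of `samtools depth` output).

```python
from statistics import mean, pstdev


def coverage_cv(depths):
    """Coefficient of variation of per-base read depth.

    0.0 means perfectly even coverage; larger values mean less even coverage.
    `depths` is a sequence of per-position depths (an assumed input format).
    """
    avg = mean(depths)
    if avg == 0:
        raise ValueError("mean depth is zero; nothing is covered")
    # Population standard deviation, since the depths cover every position.
    return pstdev(depths) / avg
```

A lower value for one library than another, over the same reference, would suggest more even coverage. What counts as "too uneven" depends on the application, so no cutoff is suggested here.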

It would be great if this bias could be removed at the library prep stage, but for now this is an effect everyone seems to get, and one which doesn't seem to have too detrimental an effect on your results.

Simon.

**Marisa_Miller** · 11-07-2014, 12:57 PM

Originally posted by GenoMax View Post

Most likely no.

The worrisome part is "blasted some sequences but got no hits". What DB was that against? Are you sure you have the right data?

I tried blasting the short heptamers against the Nucleotide collection (nr/nt) database. Not sure why I don't get any hits back.

**Marisa_Miller** · 11-07-2014, 01:00 PM

Originally posted by simonandrews View Post

Hi Marisa,

I believe the issue with Nextera kits is similar to the one you get with RNA-Seq libraries in that there is a random priming step which creates some selection bias and results in the types of distribution you see in the FastQC reports.

For the RNA-Seq problem there is some literature about this and I suspect something very similar will come from the nextera fragmentation.

You wouldn't expect that the 5' bias is removed by trimming since the bases are not of poor quality and they will be correct sequence from your library. Trimming them would not help with the library, nor would it remove the bias (you'll end up with sequences slightly downstream of biased positions instead of over them).

The way to tell whether the bias is adversely affecting your results would be to look at the evenness of coverage of your genome in your mapped data. For RNA-Seq, although you see priming bias in the library you still get pretty even coverage of transcripts, so the practical effect doesn't seem to be too bad.

It would be great if this bias could be removed at the library prep stage, but for now this is an effect everyone seems to get, and one which doesn't seem to have too detrimental an effect on your results.

Simon.

Hi Simon,
Thanks so much for the information. After mapping I will take a look at the coverage distribution to see if the bias is causing an issue. I'm just glad that the problem shouldn't interfere with mapping and assembly too much.

Thanks again,
Marisa

**GenoMax** · 11-07-2014, 03:40 PM

Originally posted by Marisa_Miller View Post

I tried blasting the short heptamers against the Nucleotide collection (nr/nt) database. Not sure why I don't get any hits back.

Heptamers are too short for blast searches. You should search with (a few full length) read(s) to get results.


http://www.ncbi.nlm.nih.gov/blast/Why.shtml

Short sequences (less than 20 bases) will often not find any significant matches to the database entries under the standard nucleotide-nucleotide BLAST settings. The usual reasons for this are that the significance threshold governed by the expect value parameter is set too stringently and the default word size parameter is set too high.
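A rough back-of-the-envelope sketch of why a heptamer cannot give a significant hit: under a deliberately simplified model (uniformly random bases, exact matches only, one strand), the expected number of chance matches is the number of start positions divided by 4^k. This is not BLAST's actual E-value calculation, and the function name and the database size used below are assumptions for illustration only.

```python
def expected_chance_matches(query_length, database_bases):
    """Expected exact matches to a random query in a random database.

    Simplified model: each base is A/C/G/T with equal probability,
    only one strand is searched, and only exact matches count.
    """
    positions = max(database_bases - query_length + 1, 0)
    return positions / 4 ** query_length


if __name__ == "__main__":
    # Hypothetical database of one billion bases (an assumed size).
    for k in (7, 20):
        print(k, expected_chance_matches(k, 10**9))
```

Under these assumptions a 7-mer is expected to occur by chance tens of thousands of times in a billion bases, while a 20-mer is expected far less than once, which is consistent with the advice to search with full-length reads.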

**Marisa_Miller** · 11-08-2014, 08:19 AM

Originally posted by GenoMax View Post

Heptamers are too short for blast searches. You should search with (a few full length) read(s) to get results.

http://www.ncbi.nlm.nih.gov/blast/Why.shtml

I didn't realize that heptamers would be too short to blast. I have searched with a few full reads and they all come up as wheat or wheat wild relative mitochondrial or plastid sequence, which is exactly what they should be since the libraries were prepared from purified organellar DNA from wheat.

**GenoMax** · 11-08-2014, 01:11 PM

Perfect. You can move on to the analysis (do run your sequences through bbduk or trimmomatic to ensure that there are no remnants of adapters etc).

**gab0** · 11-09-2014, 06:41 AM

Originally posted by Marisa_Miller View Post

I didn't realize that heptamers would be too short to blast. I have searched with a few full reads and they all come up as wheat or wheat wild relative mitochondrial or plastid sequence, which is exactly what they should be since the libraries were prepared from purified organellar DNA from wheat.

Hi Marisa:

I faced a similar problem some time ago, and as you, I was puzzled about the weird Kmer content.

What I did was check whether the Kmers overlapped (just by eye), and then I searched for the consensus sequence, which turned out to match Illumina sequences. I don't have the exact information right now, but as I recall these sequences aligned with the sequences from "Process Controls for TruSeq Sample Preparation Kits". There's a PDF from Illumina on the web, illumina-customer-sequence-letter, describing these sequences.

I found that there was over-representation of these sequences in the library. Hence that was why FASTQC was warning about Kmer content.

I finally removed those sequences from my fastq files and then performed assembly. They *should not* affect my assembly as they matched only with the adapters AND the identity % between them was very high. But just in case, I removed them
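The "overlap the Kmers by eye" step could also be sketched in code. The following is a minimal greedy merge, assuming the Kmers overlap exactly (no mismatches) and share at least `min_len` bases; the function names and the example Kmers are made up for illustration and are not from anyone's actual FastQC report.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge_kmers(kmers, min_len=3):
    """Greedily merge Kmers that overlap end to start; return the contigs."""
    seqs = list(dict.fromkeys(kmers))  # drop duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return seqs  # nothing left that overlaps
        merged = seqs[best_i] + seqs[best_j][best_n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)]
        seqs.append(merged)


if __name__ == "__main__":
    # Illustrative heptamers only.
    print(merge_kmers(["AGATCGG", "GATCGGA", "ATCGGAA", "TCGGAAG"]))
```

The merged sequence is long enough to search against a database or a list of adapter and control sequences, which a single heptamer is not.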

Maybe that can help you solve this issue!

Cheers,

Gabriel

**Marisa_Miller** · 11-11-2014, 04:09 AM

Originally posted by gab0 View Post

Hi Marisa:

I faced a similar problem some time ago, and as you, I was puzzled about the weird Kmer content.

What I did was check whether the Kmers overlapped (just by eye), and then I searched for the consensus sequence, which turned out to match Illumina sequences. I don't have the exact information right now, but as I recall these sequences aligned with the sequences from "Process Controls for TruSeq Sample Preparation Kits". There's a PDF from Illumina on the web, illumina-customer-sequence-letter, describing these sequences.

I found that there was over-representation of these sequences in the library. Hence that was why FASTQC was warning about Kmer content.

I finally removed those sequences from my fastq files and then performed assembly. They *should not* affect my assembly as they matched only with the adapters AND the identity % between them was very high. But just in case, I removed them

Maybe that can help you solve this issue!

Cheers,

Gabriel

Hi Gabriel,
Thanks for the info! I will do as you suggest and see if I can overlap those sequences to find out what they are.

Thanks!

**Haiopai** · 11-18-2014, 06:41 AM

Originally posted by simonandrews View Post

...
Does anyone here know if this format is something which is actually generated by an Illumina sequencer, or is it something an individual or maybe the ENA have done to the file? I can add a quick fix to just abandon the module if too many tiles are predicted, but if this is a format which might be more generally about then I should try to cope with this properly.
...

I have to come back to the bug regarding the tile module. I just ran a simulation of RNA-Seq with the Flux Simulator. It puts out reads with identifiers that at first glance look like the ones from Illumina (numbers separated by ':') but have completely different meanings. http://sammeth.net/confluence/displa...ad+Identifiers

I simply renamed the reads afterwards using sed, but in general it would be good to have some kind of mechanism, as you proposed, that skips the tile module when there are too many tiles. The --limits option also seems good (once it works), but I had a hard time figuring out why the error occurred. So automatic "overflow" detection would be good, maybe with a warning in the log file or error output.
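For anyone who would rather not do the renaming with sed, here is a rough Python sketch of the same idea: replace every read identifier with a plain sequential name so that nothing in it looks like an Illumina tile field. It assumes strict 4-line FASTQ records (no wrapped sequence lines); the function name, the `read` prefix, and the identifiers in the usage example are made up for illustration.

```python
def rename_fastq_reads(lines, prefix="read"):
    """Yield FASTQ lines with each header replaced by '@<prefix>_<n>'.

    Assumes every record is exactly four lines: header, sequence,
    separator, qualities. The separator line is reduced to a bare '+'
    so that it cannot repeat the old identifier.
    """
    for index, line in enumerate(lines):
        position = index % 4
        if position == 0:
            yield f"@{prefix}_{index // 4 + 1}\n"
        elif position == 2:
            yield "+\n"
        else:
            yield line


if __name__ == "__main__":
    # Made-up identifiers with colon-separated fields.
    records = [
        "@chr1:100:200:x/1\n", "ACGT\n", "+\n", "IIII\n",
        "@chr1:300:400:y/1\n", "TTTT\n", "+\n", "IIII\n",
    ]
    print("".join(rename_fastq_reads(records)), end="")
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly and the output written line by line.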

**sahodges** · 12-12-2014, 02:06 PM

Color Codes of Per Tile Sequence Quality

I had a run that clearly had a flow cell problem, in that the per tile sequence quality degraded starting around cycle 35 for a cluster of tiles. In the attached plot there are 6 clusters, but these tiles are all actually contiguous (and on both sides of the flow cell).

What I'm trying to understand is how to interpret the colors, i.e., how poor of quality is green/yellow/orange/red? When I look at the fastqc_data file that's generated, there's a "Mean" value for each tile and base but I'm confused as to what this is a mean of? The numbers don't make sense to me as quality scores. For instance for base 1 the mean scores range from -1.6 to 0.73852 (average is 0) for the 96 tiles so this seems to be some sort of variance measure?

thanks for any help in interpretation!

Attached Files

Tile Sequence Quality.pdf (29.3 KB, 21 views)

**simonandrews** · 12-13-2014, 07:55 AM

Originally posted by sahodges View Post

I had a run that clearly had a flow cell problem, in that the per tile sequence quality degraded starting around cycle 35 for a cluster of tiles. In the attached plot there are 6 clusters, but these tiles are all actually contiguous (and on both sides of the flow cell).

What I'm trying to understand is how to interpret the colors, i.e., how poor of quality is green/yellow/orange/red? When I look at the fastqc_data file that's generated, there's a "Mean" value for each tile and base but I'm confused as to what this is a mean of? The numbers don't make sense to me as quality scores. For instance for base 1 the mean scores range from -1.6 to 0.73852 (average is 0) for the 96 tiles so this seems to be some sort of variance measure?

thanks for any help in interpretation!

The tiles on the FastQC plot are sorted by number, so discontiguous numbers might actually be adjacent on the actual flowcell - you'd need to look at the illumina documentation for the flowcell version you used (there's more than one).

The numbers you get are Phred score differences from the mean Phred score for that cycle of sequencing. For example if you had a chemistry cycle where the mean Phred score across all tiles was 30 and in tile 1234 the average Phred score was 25 then the per-tile measure for that tile would be -5.

The colour scale goes from 0 down to 0 minus whatever error threshold you have set in your config file (limits.txt). That threshold is 10 by default, so the default scale runs from 0 to -10.
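To make the arithmetic concrete, here is a small sketch of the per-tile measure for one cycle. It is not FastQC's actual code: it assumes the cycle mean is the plain average of the per-tile means, and the cutoff rule in `tiles_beyond_threshold` is only an illustration built on the default threshold of 10 mentioned above. The function names and the tile numbers in the comments are made up.

```python
from statistics import mean


def per_tile_deviation(tile_means):
    """Map each tile to (its mean Phred score - mean over all tiles).

    `tile_means` maps tile number -> mean Phred score for one cycle.
    Assumption: the cycle mean is the unweighted average of tile means.
    """
    cycle_mean = mean(tile_means.values())
    return {tile: q - cycle_mean for tile, q in tile_means.items()}


def tiles_beyond_threshold(deviations, error_threshold=10):
    """Tiles whose deviation is at or below minus the threshold."""
    return sorted(t for t, d in deviations.items() if d <= -error_threshold)


if __name__ == "__main__":
    # Cycle mean is 30, so tile 1234 (mean 25) gets a deviation of -5.
    print(per_tile_deviation({1234: 25.0, 1235: 35.0}))
```

This also shows why the values average out to 0 across tiles for any given base: they are differences from that cycle's mean, not quality scores themselves.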

Hope this helps.

**sahodges** · 12-13-2014, 08:24 AM

Yes that does help. Thanks very much! Really appreciate your fast response too!

I had figured out the tile location coding - the bad tiles are, in fact, all adjacent.

**tv195** · 12-21-2014, 09:18 PM

FastQC Exception in thread "main" java.awt.HeadlessException

Hi,
A quick question before Christmas, I get the following error:

java -version

java version "1.6.0_33"
OpenJDK Runtime Environment (IcedTea6 1.13.5) (6b33-1.13.5-1ubuntu0.12.04)
OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)

./fastqc

Exception in thread "main" java.awt.HeadlessException
at java.awt.GraphicsEnvironment.checkHeadless(GraphicsEnvironment.java:173)
at java.awt.Window.<init>(Window.java:547)
at java.awt.Frame.<init>(Frame.java:419)
at java.awt.Frame.<init>(Frame.java:384)
at javax.swing.JFrame.<init>(JFrame.java:174)
at uk.ac.babraham.FastQC.FastQCApplication.<init>(FastQCApplication.java:71)
at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:324)

Thanks,
Chris

**GenoMax** · 12-22-2014, 04:26 AM

@chris: Simon will no doubt swing by, but it appears that you are trying to run fastqc in interactive mode on a server where X11 is not set up or not present.

You can get around this by running fastqc non-interactively. Just specify the name of the sequence file(s) on the command line, like this:

Code:

$ fastqc seq_file1 seq_file2
