**aforntacc** · 07-25-2013, 02:16 AM

OK, fine, I get it. The report says HiSeq 2000, base calling pipeline HiSeq Control Software v1.4.5.
But by default bowtie identifies the quality scale. However, I am now piping the fastq file through the Perl script from the link you suggested; I will see when the run is finished.
I am new, maybe that's why it is hard like this.
Thanks a lot.

**mastal** · 07-25-2013, 02:24 AM

Software to filter errors in fastq files?

Bowtie doesn't identify the scale by default; phred33 is the scale Bowtie will use unless you specify that your files use a different one.

But still, I don't think there are any quality encodings that would give you a value of -93.

And yes, we've all been there, things always seem more complicated at the beginning.
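
To make the point about quality scales concrete, here is a minimal Python sketch (not anything Bowtie itself runs) that decodes a quality string under the Phred+33 and Phred+64 conventions. The function name `decode_quals` and the example quality string are made up for illustration. FASTQ quality characters are printable ASCII (codes 33 to 126), so the lowest score these two offsets can produce is 0 (Phred+33) or -31 (Phred+64 applied to `!`), nowhere near -93.

```python
def decode_quals(qual, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    return [ord(ch) - offset for ch in qual]

# Hypothetical quality string, for illustration only.
qual = "II?5+!"

print(decode_quals(qual, offset=33))  # interpreted as Phred+33
print(decode_quals(qual, offset=64))  # interpreted as Phred+64

# Range of scores reachable from printable ASCII (codes 33..126).
lowest_33, highest_33 = 33 - 33, 126 - 33
lowest_64, highest_64 = 33 - 64, 126 - 64
print(lowest_33, highest_33, lowest_64, highest_64)
```

Decoding a Phred+33 file with the wrong offset gives negative scores, which is the usual symptom of a mismatched scale.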

**aforntacc** · 07-30-2013, 08:36 AM

Hello guys,
I saw this summary file in the TopHat output folder.
What does it mean, and why is it only 64%? That seems very low. I googled a bit but became more confused.
What can I do to improve the mapping? I used TopHat's default settings.

Left reads:
Input: 63588486
Mapped: 41120473 (64.7% of input)
of these: 5143253 (12.5%) have multiple alignments (2 have >20)
Right reads:
Input: 63588486
Mapped: 38423206 (60.4% of input)
of these: 4773086 (12.4%) have multiple alignments (0 have >20)
62.5% overall read alignment rate.

Aligned pairs: 31409898
of these: 3418180 (10.9%) have multiple alignments
and: 24649 ( 0.1%) are discordant alignments
49.4% concordant pair alignment rate.

thanks
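
The percentages in the summary above can be reproduced from the raw counts, which helps in reading it. The sketch below is plain Python arithmetic on the numbers as posted. The formulas are inferred from the fact that they reproduce the reported figures, not taken from the TopHat source, and the variable names are made up for illustration.

```python
# Counts copied from the summary posted above.
left_input, left_mapped = 63588486, 41120473
right_input, right_mapped = 63588486, 38423206
aligned_pairs, discordant_pairs = 31409898, 24649

def pct(part, whole):
    """Percentage rounded to one decimal place, as printed in the summary."""
    return round(100.0 * part / whole, 1)

# Per-mate rates: mapped reads divided by input reads.
left_rate = pct(left_mapped, left_input)
right_rate = pct(right_mapped, right_input)

# Overall rate: all mapped reads (both mates) over all input reads.
overall_rate = pct(left_mapped + right_mapped, left_input + right_input)

# Concordant pair rate: aligned pairs minus discordant ones, over input pairs.
concordant_rate = pct(aligned_pairs - discordant_pairs, left_input)

print(left_rate, right_rate, overall_rate, concordant_rate)
```

These come out to the figures in the report, so the 62.5% is a per-read rate across both mates, while the 49.4% counts only pairs in which both mates aligned concordantly.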

**dpryan** · 07-31-2013, 01:05 AM

There are a few questions you'll need to answer before anyone can help you:
1) How long are the reads?
2) Have you quality trimmed yet?
3) What organism is this?
4) What reference did you use?
5) What version of tophat/bowtie was this?
6) What was the exact command line argument used to start alignment?
7) What sort of experiment was this from?

**aforntacc** · 07-31-2013, 07:01 AM

Originally posted by dpryan

There are a few questions you'll need to answer before anyone can help you:
1) How long are the reads?
2) Have you quality trimmed yet?
3) What organism is this?
4) What reference did you use?
5) What version of tophat/bowtie was this?
6) What was the exact command line argument used to start alignment?
7) What sort of experiment was this from?

OK, I see.
1) Reads are 100 bp. I cleaned the fastq file.
2) No, I did not, and honestly I don't know how.
3) The organism is a plant.
4) NCBI mRNA.
5) Bowtie 2.
6) tophat2, path to ref.fa, path to fastq files A1 and A2.
7) RNA-seq.

**dpryan** · 07-31-2013, 07:18 AM

You might use trim_galore/trimmomatic/etc. to quality trim the reads and align again. Also, since you're aligning directly to the transcriptome, your alignment rate will be decreased if whichever plant you're using doesn't have a particularly complete reference transcriptome.
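
For readers wondering what quality trimming does to a read, here is a simplified Python sketch. It is not the algorithm used by trim_galore or trimmomatic; it simply cuts bases off the 3' end while their Phred score is below a threshold. The function name `trim_3prime`, the threshold of 20, the Phred+33 offset, and the example read are all assumptions for illustration.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose Phred score is below min_q.

    seq and qual are the sequence and quality lines of one FASTQ record
    (without newlines). Returns the trimmed (seq, qual) pair.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    end = len(seq)
    # Walk back from the 3' end until a base meets the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

# Made-up read: the last four bases are low quality ('#' is Q2 in Phred+33).
seq = "ACGTACGTAC"
qual = "IIIIII####"
print(trim_3prime(seq, qual))
```

Trimming shortens a read rather than discarding it, so the high-quality part is still available to the aligner.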

**aforntacc** · 07-31-2013, 07:58 AM

Originally posted by dpryan

You might use trim_galore/trimmomatic/etc. to quality trim the reads and align again. Also, since you're aligning directly to the transcriptome, your alignment rate will be decreased if whichever plant you're using doesn't have a particularly complete reference transcriptome.

Please, what is this command-line tool? I have never used it before. How can I get it?
Thanks

**dpryan** · 07-31-2013, 08:02 AM

Originally posted by aforntacc

Please, what is this command-line tool? I have never used it before. How can I get it?
Thanks

Have you googled for "trim_galore" or "trimmomatic"? They come with some documentation.

**mattbawn** · 07-23-2014, 09:37 AM

This was great thanks!!

**Ntobe** · 12-03-2015, 02:21 AM

filtering bad reads

Originally posted by simonandrews

If it's useful to anyone this is a small script I knocked up when we had to process some fastq files which were corrupted during an FTP transfer. You can pipe data through it and it does some basic sanity checks to ensure that the file looks like valid fastq data. It will remove any entries which look broken and leave you just the good stuff.

Code:

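#!/usr/bin/perl
use warnings;
use strict;
# Reads fastq from STDIN or named files and prints only well-formed records.
while (<>) {
  # A record must begin with an @ header line; otherwise skip this line.
  unless (/^\@/) {
    warn "$_ should have had an \@ at the start and it didn't\n";
    next;
  }
  my $id1 = $_;     # header line
  my $seq = <>;     # sequence line
  my $id2 = <>;     # "+" separator line
  my $qual = <>;    # quality line
  last unless defined $seq && defined $id2 && defined $qual;  # incomplete record at end of input
  if ($seq =~/^[@+]/) {
    warn "Sequence '$seq' looked like an id";
    next;
  }
  if ($qual =~/^[@+]/) {  # heuristic: a valid quality line can also start with @ or +
    warn "Quality '$qual' looked like an id";
    next;
  }
  if ($id2 !~ /^\+/) {
    warn "Midline '$id2' didn't start with a +";
    next;
  }
  # A long run of base letters in the quality line suggests a shifted record.
  if ($qual =~ /[GATCN]{20,}/) {
    warn "Quality '$qual' looked like sequence";
    next;
  }
  # Sequence and quality must be the same length.
  if (length($seq) != length($qual)) {
    warn "Seq $seq and Qual $qual weren't the same length";
    next;
  }
  # All checks passed: emit the record unchanged.
  print $id1,$seq,$id2,$qual;


}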

Thank you so much for the script, I used it and it worked with my reads and left only good quality reads (which I managed to map using tophat). I have one question though, will filtering these reads affect any downstream analysis (e.g. cuffdiff step) where differential gene expression is dependent on read quantity between my conditions? I'm a biologist by training and have recently started working with RNA-Seq data. Any response will be highly appreciated.

**Brian Bushnell** · 12-03-2015, 09:59 AM

Quality-filtering will always incur bias in a platform where quality is affected by sequence composition; I don't recommend it for quantitative analysis like differential expression. It's better to quality-trim or simply use an aligner that is capable of mapping the low-quality reads, like BBMap.

**Ntobe** · 12-03-2015, 11:27 PM

Thank you so much for your response, Brian. I will explore BBMap in the meantime. Does anyone know what this error means? It occurred while filtering the reads using the Perl script above.

Can't locate object method "With" via package "Quote" (perhaps you forgot to load "Quote"?) at ./perlscript.pl line 44, <> line 847195456

Thanks.

**Brian Bushnell** · 12-03-2015, 11:40 PM

Sorry, I don't use Perl.

**Ntobe** · 12-03-2015, 11:48 PM

Thanks so much, Brian Bushnell. The reason we are trying to explore cleaning the reads is that our tophat jobs were running out of time on the server before completion. Our raw read files are too big (~13GB per .gz file) and we were thinking that some of the reads might not be of good quality. If that is the case, why not remove them and map only good quality reads? Again, this might not be a good approach, and that is why I'm seeking help.

**GenoMax** · 12-04-2015, 05:45 AM

@Ntobe: That perl script referenced above is only for checking if a file has corrupt fastq records. Is that what you are using it for?

Have you scanned and trimmed (if needed) your raw data files to remove adapter contamination? You can speed up Tophat jobs by using multiple threads. Have you tried using that option?
