SEQanswers


JMFA 02-24-2015 12:27 AM

Casava FASTQ from SRA
 
Dear all,

I am currently trying to perform quality checks on a set of FASTQ files (downloaded from SRA) that (according to the authors) were generated using the CASAVA 1.8.1 pipeline.

Since all files belong to a single sample, I am trying to use the "--casava" option in FASTQC, but I keep getting the "SRR*.fastq.gz didn't look like part of a CASAVA group" error.

To my understanding, FASTQC requires that all CASAVA-generated files be named <sample name>_<barcode sequence>_L<lane>_R<read number>_<0-padded 3-digit set number>.fastq.gz

However, the fastq files from SRA have a very different name: "SRR*_1/_2.fastq.gz"

Is there a way to change the name of these files so that FASTQC recognises them as a single sample or should I just analyse them independently?

Thank you very much in advance,
JMFA
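
For illustration, the renaming asked about above could be sketched as below. This is only a sketch under assumptions: it targets the CASAVA 1.8-style pattern <sample>_<barcode>_L<lane>_R<read>_<set>.fastq.gz, the sample name "sampleA", the run accessions, the lane and the "NoIndex" barcode placeholder are all hypothetical, and the function name casava_style_name is made up for this example. Whether FASTQC then treats the renamed files as one group is not confirmed anywhere in this thread.

```python
import re

# CASAVA 1.8-style name: <sample>_<barcode>_L<lane>_R<read>_<set>.fastq.gz
CASAVA_RE = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[ACGTN]+|NoIndex)"
    r"_L(?P<lane>\d{3})_R(?P<read>[12])_(?P<set>\d{3})\.fastq\.gz$"
)
# SRA-style name as in the thread: <run accession>_<read number>.fastq.gz
SRA_RE = re.compile(r"^(?P<run>[SED]RR\d+)_(?P<read>[12])\.fastq\.gz$")


def casava_style_name(sra_name, sample, set_number, barcode="NoIndex", lane=1):
    """Build a CASAVA-style file name for one SRA-style FASTQ file.

    sample, barcode and lane are placeholders supplied by the caller;
    they are not recovered from the SRA file itself.
    """
    match = SRA_RE.match(sra_name)
    if match is None:
        raise ValueError(f"not an SRA-style paired FASTQ name: {sra_name}")
    return (
        f"{sample}_{barcode}_L{lane:03d}"
        f"_R{match.group('read')}_{set_number:03d}.fastq.gz"
    )


# Hypothetical run accessions, all treated as one sample.
runs = ["SRR0000001", "SRR0000002"]
for set_number, run in enumerate(runs, start=1):
    for read in (1, 2):
        old = f"{run}_{read}.fastq.gz"
        new = casava_style_name(old, sample="sampleA", set_number=set_number)
        print(old, "->", new)
```

Each run becomes one "set" of the same sample, so the read 1 and read 2 files keep their pairing. Creating symbolic links under the new names rather than renaming would leave the original downloads untouched.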

GenoMax 02-24-2015 03:04 AM

Have you tried to run FastQC on the files without worrying about the casava option? What SRR # are you working with?

JMFA 02-24-2015 05:01 AM

Hi!
Thanks for the reply :)

Yes. Running w/out the "--casava" option was the first thing that I did. However, I got some weird Kmer profiles. FASTQC doesn't detect any adapter content / overrepresented sequences but it shows Kmer enrichments around the center and towards the end of the read...(I am attaching one example)

https://5f8360518f2168d337fe1e349c71...r_profiles.jpg

I have limited experience with NGS data so I have no idea whether this has anything to do with the way I am "using" the data. The pattern is actually similar for most fastq.gz files so perhaps this is a problem with the data itself.

sarvidsson 02-24-2015 05:09 AM

Not sure if FastQC includes all adapters - to me that looks like miRNA data before adapter clipping.

TCGTATGCCGTCTTC: http://www.biomedcentral.com/1471-2164/12/176
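
One way to check a suggestion like this is to count how many reads contain the sequence and at which offset it starts. A minimal sketch, assuming plain four-line FASTQ records and exact matches only; the function names and the file path in the usage note are hypothetical.

```python
import gzip
from collections import Counter


def read_sequences(path):
    """Yield the sequence line of each record in a 4-line-per-record FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of every record
                yield line.strip()


def adapter_positions(sequences, adapter):
    """Count, per start offset, the reads whose first exact match begins there."""
    positions = Counter()
    for seq in sequences:
        offset = seq.find(adapter)
        if offset != -1:
            positions[offset] += 1
    return positions
```

Usage would look like adapter_positions(read_sequences("some_run_1.fastq.gz"), "TCGTATGCCGTCTTC"). Many hits spread over the middle and end of the reads would be consistent with untrimmed adapter read-through; exact matching will miss adapters that contain sequencing errors.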

JMFA 02-25-2015 05:25 AM

This is actually 100bp PE DNAseq data.
I've clipped the adapter sequences and the Kmer profile looks much better (not perfect but better)! https://5f8360518f2168d337fe1e349c71...ewKmerPROF.png

However, I have an additional question.
While running cutadapt to remove the adapter sequences I am also setting the "-m" option (used to throw away processed reads shorter than N bases) to 100bp. However, I end up discarding approx. half of the reads this way.

Is there a problem (for instance, in the alignment step) with using reads of varying length?
Again, thank you very much for the input.
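
To illustrate the arithmetic behind the discarded reads: a minimum length equal to the full read length (100) removes every read that lost even one base to trimming. The sketch below uses made-up read lengths and a hypothetical helper, fraction_kept; it only mimics the length filter and is not cutadapt's own code.

```python
def fraction_kept(lengths, minimum_length):
    """Fraction of reads whose post-trimming length is at least minimum_length."""
    if not lengths:
        return 0.0
    kept = sum(1 for length in lengths if length >= minimum_length)
    return kept / len(lengths)


# Made-up post-trimming lengths for eight reads that were all 100 bases long.
# A value below 100 means adapter sequence was removed from that read.
toy_lengths = [100, 100, 87, 64, 100, 35, 99, 100]

for minimum_length in (100, 50, 30):
    share = fraction_kept(toy_lengths, minimum_length)
    print(f"minimum length {minimum_length}: {share:.0%} of reads kept")
```

In this toy set, half of the reads were trimmed at all, so a threshold of 100 drops exactly that half, while lower thresholds keep most of them.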

