
**dpryan** · 02-13-2014, 06:48 AM

That's because the second part isn't part of the read name. There's an option in fastq-dump to put the original read name where it should be rather than just numbering things sequentially.

**splaisan** · 02-13-2014, 06:58 AM

Problem is, I downloaded the pre-made FASTQ files from the EBI repo and mapped them all :-( without figuring this out. I can fix this by patching the FASTQ files, but I will still need to remap the whole shebang...

Thanks for the info anyway (for next time)

**splaisan** · 02-15-2014, 01:59 AM

Picard MarkDuplicates-compatible reads from SRA data

A few days later, the issue is fixed by:

1. NOT downloading the fastq files from SRA, but instead the .sra formatted data using Aspera (I used the browser link).
2. Using the sratoolkit command fastq-dump (thanks Devon) to convert .sra to .fastq and split the reads into paired files. The trick here was to use the specific parameter -F|--origfmt to ensure 'Defline contains only original sequence name' and that the remaining text was discarded.

The resulting command in my case was (after correcting a typo!):

fastq-dump -F --split-3 --gzip *.sra -O fastq_read_folder

TIP: I used P|P|S|S to run the conversions in parallel, which sped this up dramatically for the 26 input files on my 24-thread machine.
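For readers without P|P|S|S, the same idea (one fastq-dump process per .sra file, several at a time) can be sketched with xargs. This is only an illustration, not the command used in the post: the function name `convert_sra_parallel`, the `FASTQ_DUMP` override variable, and the default of 4 jobs are assumptions made for the example, and `-0`/`-P` are GNU/BSD xargs extensions rather than POSIX.

```shell
# Sketch: run fastq-dump on every .sra file in the current directory,
# a few files at a time. All names here are illustrative assumptions.
convert_sra_parallel() {
  outdir=$1            # output folder, e.g. fastq_read_folder
  jobs=${2:-4}         # number of simultaneous fastq-dump processes

  # Collect the .sra files; stop early if the glob matched nothing.
  set -- *.sra
  if [ ! -e "$1" ]; then
    echo "no .sra files found in $(pwd)" >&2
    return 0
  fi

  mkdir -p "$outdir"

  # NUL-separate the file names so odd characters survive, then let xargs
  # start one process per file with at most $jobs running at once.
  # FASTQ_DUMP can be set to another program (e.g. echo) for a dry run.
  printf '%s\0' "$@" |
    xargs -0 -n 1 -P "$jobs" "${FASTQ_DUMP:-fastq-dump}" \
      -F --split-3 --gzip -O "$outdir"
}
```

Called as `convert_sra_parallel fastq_read_folder 12`, this would keep up to 12 conversions running. A dry run with `FASTQ_DUMP=echo convert_sra_parallel fastq_read_folder 12` prints the commands' arguments instead of executing them.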

My reads now have a header line like this:

@HWI-ST188:1:1101:1222:2140
NAGACGAAGGTTCTTCAGTTAAACAGTTTAGAGCCCCATAAGAGCAAACTGTAGTGTAAAGAGGAAAAGTAAGTACAATCTTTCCAGACACACAACTAATA
+HWI-ST188:1:1101:1222:2140
#1:BDDDDHHHHHIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIHGIFHCGIEHIIIHIIIIDEHHCHEHEEEEEECCECCCBCCBBBBCCCCA

which, after tophat mapping, results in the following alignment for that particular read:

HWI-ST188:1:1101:1222:2140 99 chr10 59953037 50 101M = 59953061 125 NAGACGAAGGTTCTTCAGTTAAACAGTTTAGAGCCCCATAAGAGCAAACTGTAGTGTAAAGAGGAAAAGTAAGTACAATCTTTCCAGACACACAACTAATA #1:BDDDDHHHHHIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIHGIFHCGIEHIIIHIIIIDEHHCHEHEEEEEECCECCCBCCBBBBCCCCA AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0C100 YT:Z:UU NH:i:1

Running Picard on this BAM data now identifies a few thousand optical duplicates in the full sample.
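Before mapping, it may be worth checking that the read names actually carry the flowcell coordinates, since remapping is the expensive part. The sketch below assumes the usual Illumina layout shown above, where the last three colon-separated fields of the name are tile, x and y; as far as I know this is what Picard's default read-name parsing relies on, but treat that as an assumption. The function name `check_read_names` is hypothetical.

```shell
# Sketch: count FASTQ headers whose name ends in three numeric,
# colon-separated fields (tile:x:y). Reads FASTQ on stdin and
# prints "<usable> <unusable>".
check_read_names() {
  awk 'NR % 4 == 1 {
         name = substr($1, 2)          # drop the leading "@", keep the first token
         n = split(name, f, ":")
         if (n >= 3 && f[n-2] ~ /^[0-9]+$/ && f[n-1] ~ /^[0-9]+$/ && f[n] ~ /^[0-9]+$/)
           ok++
         else
           bad++
       }
       END { printf "%d %d\n", ok, bad }'
}
```

For a gzipped file this could be run as `gzip -dc reads_1.fastq.gz | head -n 4000 | check_read_names`, where `reads_1.fastq.gz` stands for one of your own files. A non-zero second number suggests the names were renumbered and the original names are not in the first field.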

QED

**GenoMax** · 02-15-2014, 05:41 AM

Don't see a "-F" in your fastq-dump command above. Typo?

**splaisan** · 02-15-2014, 06:15 AM

shame on me! corrected now (thanks)


keep read address using tophat
