Hello,
I've recently performed Illumina PE (2x75) sequencing of enzymatically-fragmented genomic DNA, where the fragments going into the library prep ranged from 20 to 400 bp. I would like to align these reads to a reference genome to obtain locations and insert sizes, while discarding all reads that are not paired.
Based on this workflow, I assume (to use bowtie2 manual terminology and diagrams) that:
1) Mates may 'overlap' each other
2) Mates may 'contain' each other
3) Mates may 'dovetail' each other
I would like to find alignments and accurate insert sizes, even for the 20-bp fragments, whose mates may fall into any of the above situations.
After clipping 3' adapters and low-quality ends, and filtering out low-quality reads, I attempted to align these reads with bowtie2 (command given below).
The resulting SAM file contains only paired alignments. It includes the 'insert sizes' expected for the 20+ bp fragments I am interested in, but also a very large population of 'insert sizes' equal to the read length (75 bp). Importantly, the material that went into the library prep did not contain a large population of ~75-bp molecules.
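For anyone wanting to reproduce this kind of distribution, the insert sizes can be tallied straight from the SAM file. Below is a minimal sketch, assuming a POSIX shell with awk and sort available. `tlen_histogram` is a hypothetical helper name (not part of bowtie2 or samtools). It takes the absolute value of TLEN (SAM column 9) from the first mate of each pair (FLAG bit 0x40), so each pair is counted once regardless of the sign convention.

```shell
# Hypothetical helper: print "insert_size<TAB>pair_count", sorted by size.
# Usage: tlen_histogram aligned.sam
tlen_histogram() {
    awk -F '\t' '
        /^@/ { next }                      # skip SAM header lines
        int($2 / 64) % 2 == 1 {            # FLAG bit 0x40: first mate in the pair
            len = ($9 < 0) ? -$9 : $9      # absolute value of TLEN (column 9)
            if (len > 0) count[len]++      # TLEN of 0 means no size was inferred
        }
        END { for (len in count) printf "%d\t%d\n", len, count[len] }
    ' "$1" | sort -n
}
```

For example, `tlen_histogram aligned.sam | sort -k2,2nr | head` would list the most frequent sizes first, which makes a spike at the read length easy to spot.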
It seems that this population of ~75-bp insert sizes is an alignment artifact. What can I do to test or resolve this? I cannot find another alignment program that explicitly states that it can handle mates that dovetail or contain each other.
1) Mates may 'overlap' each other
Code:
Mate 1:    GCAGATTATATGAGTCAGCTACGATATTGTT
Mate 2:                               TGTTTGGGGTGACACATTACGCGTCTTTGAC
Reference: GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGCGTCTTTGAC
2) Mates may 'contain' each other
Code:
Mate 1:    GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGC
Mate 2:                               TGTTTGGGGTGACACATTACGC
Reference: GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGCGTCTTTGAC

Mate 1:                   CAGCTACGATATTGTTTGGGGTGACACATTACGC
Mate 2:                      CTACGATATTGTTTGGGGTGAC
Reference: GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGCGTCTTTGAC
3) Mates may 'dovetail' each other
Code:
Mate 1:                 GTCAGCTACGATATTGTTTGGGGTGACACATTACGC
Mate 2:            TATGAGTCAGCTACGATATTGTTTGGGGTGACACAT
Reference: GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGCGTCTTTGAC
The bowtie2 command used for the alignment described above:
Code:
bowtie2 --dovetail --no-mixed --no-discordant --no-unal -x reference -1 mates1.fastq -2 mates2.fastq -S aligned.sam
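One way to sanity-check that these options left only concordantly aligned pairs in aligned.sam is to count records by FLAG bit 0x2, which the bowtie2 manual describes as marking one end of a proper paired-end alignment. This is a minimal sketch, assuming a POSIX shell with awk. `count_pair_flags` is a hypothetical helper name, not a bowtie2 or samtools command.

```shell
# Hypothetical helper: count SAM records with and without FLAG bit 0x2 set.
# Usage: count_pair_flags aligned.sam
count_pair_flags() {
    awk -F '\t' '
        /^@/ { next }                              # skip SAM header lines
        int($2 / 2) % 2 == 1 { proper++; next }    # FLAG bit 0x2 is set
        { other++ }                                # any other record
        END { printf "proper=%d other=%d\n", proper, other }
    ' "$1"
}
```

If the filtering behaved as intended, `other` would be expected to come out as 0. A non-zero value would suggest that some unpaired or discordant records made it into the file.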