Deep Learning for Genomic Feature Prediction Using Semantic k-mer Embeddings and a CNN-LSTM Architecture
Poster Number
19
Faculty Mentor Name
Julia Olivieri
Research or Creativity Area
Engineering & Computer Science
Abstract
This study presents a novel deep learning approach for predicting functional features in genomic sequences. Identifying functional elements within genomes remains a fundamental challenge in computational biology. Traditional methods often rely on sequence conservation or specific motifs, and they may miss the complex patterns and dependencies that characterize genomic features. We developed a hybrid convolutional neural network and long short-term memory (CNN-LSTM) architecture, enhanced with k-mer-based Word2Vec embeddings, to capture both local motifs and long-range dependencies in DNA sequences. Our approach splits DNA sequences into overlapping k-mers of varying lengths (3-5 bp), which are then converted to dense vector representations using Word2Vec, allowing the model to learn semantic relationships between sequence patterns. The multi-scale CNN component applies convolutional layers with different kernel sizes to detect sequence motifs at various scales, while the bidirectional LSTM captures long-range interactions and positional context. Using human chromosome 1 data (hg38), we trained the model to distinguish exon regions from non-exon regions. The model was able to make this distinction by learning complex sequence signatures, even in the absence of explicit splice site cues. This approach offers several advantages over traditional methods: (1) it requires no prior knowledge of sequence motifs or conservation, (2) it automatically learns relevant features at multiple scales, and (3) it can potentially be adapted to identify diverse genomic elements.
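The k-mer tokenization step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `kmer_tokens` and `multi_scale_kmers`, the stride of 1, and the example sequence are assumptions made for illustration. The abstract states only that the k-mers overlap and span 3-5 bp.

```python
from typing import Dict, List, Sequence


def kmer_tokens(sequence: str, k: int) -> List[str]:
    """Return every overlapping k-mer of a DNA sequence.

    Assumption: a stride of 1, so consecutive k-mers share k - 1 bases.
    """
    seq = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def multi_scale_kmers(
    sequence: str, k_values: Sequence[int] = (3, 4, 5)
) -> Dict[int, List[str]]:
    """Tokenize one sequence at several k-mer lengths (hypothetical helper)."""
    return {k: kmer_tokens(sequence, k) for k in k_values}


# Example sequence chosen only for illustration.
tokens = multi_scale_kmers("ATGCGTAC")
for k, kmers in tokens.items():
    print(k, kmers)
```

Each token list plays the role of a "sentence" of k-mer "words", which is the kind of input a Word2Vec trainer expects. How the three k-mer scales are combined before the CNN is not specified in the abstract, so the sketch keeps them separate.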
Location
University of the Pacific, DeRosa University Center
Start Date
26-4-2025 10:00 AM
End Date
26-4-2025 1:00 PM