Deep Learning for Genomic Feature Prediction Using Semantic k-mer Embeddings and a CNN-LSTM Architecture

Poster Number

19

Lead Author Affiliation

Computer Science

Lead Author Status

Undergraduate - Senior

Second Author Affiliation

Computer Science

Second Author Status

Faculty Mentor

Faculty Mentor Name

Julia Olivieri

Research or Creativity Area

Engineering & Computer Science

Abstract

This study presents a novel deep learning approach for predicting genomic features in DNA sequences. Identifying functional elements within genomes remains a fundamental challenge in computational biology. Traditional methods often rely on sequence conservation or specific motifs, and so may miss the complex patterns and dependencies that characterize genomic features. We developed a hybrid convolutional neural network and long short-term memory (CNN-LSTM) architecture, enhanced with k-mer-based Word2Vec embeddings, to capture both local motifs and long-range dependencies in DNA sequences. Our approach transforms DNA sequences into overlapping k-mers of varying lengths (3-5 bp), which are then converted to dense vector representations using Word2Vec, allowing the model to learn semantic relationships between sequence patterns. The multi-scale CNN component employs convolutional layers with different kernel sizes to detect sequence motifs at various scales, while the bidirectional LSTM captures long-range interactions and positional context. Using human chromosome 1 data (hg38), we trained the model to distinguish exon regions from non-exon regions. The model learned to separate the two classes from complex sequence signatures, even in the absence of explicit splice site cues. This approach offers several advantages over traditional methods: (1) it requires no prior knowledge of sequence motifs or conservation, (2) it automatically learns relevant features at multiple scales, and (3) it can potentially be adapted to identify diverse genomic elements.
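
The k-mer step described in the abstract can be illustrated with a minimal sketch. It assumes a stride of 1 between consecutive k-mers and keeps one token list per k. The abstract does not say how the three k-mer lengths are combined before embedding, and the function names here are hypothetical, not taken from the authors' code.

```python
def kmer_tokens(sequence, k):
    """Return the overlapping k-mers of a DNA sequence (assumed stride of 1)."""
    seq = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def multi_scale_tokens(sequence, k_values=(3, 4, 5)):
    """Map each k to its token list; keeping the scales separate is an assumption."""
    return {k: kmer_tokens(sequence, k) for k in k_values}


if __name__ == "__main__":
    # Toy sequence for illustration only, not data from the study.
    for k, kmers in multi_scale_tokens("ATGCGTAC").items():
        print(k, kmers)
```

Each token list plays the role of a "sentence" of k-mer "words", which is the input form a Word2Vec implementation expects when learning the dense embeddings.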

Location

University of the Pacific, DeRosa University Center

Start Date

26-4-2025 10:00 AM

End Date

26-4-2025 1:00 PM
