Lexi - ASL-enabled video call web application
Course Instructor
Pramod Gupta
Abstract
This project is a modern web application for video calls that supports American Sign Language (ASL) translation during a call, using each user's real-time camera feed to translate their hand signs into text. The aim is ASL translation across 2,000 common English words (drawn from WLASL, the largest video dataset for ASL), with the hope of supporting basic sentences as a starting point for an application of this kind. All technology used to build the project is open source.
Development consists of three phases: data collection & processing, machine learning model construction, and application development. For the first two phases, web scraping with Selenium, a web automation tool, is used to gather additional videos beyond WLASL. yt-dlp, a command-line tool for downloading videos from YouTube, is then used to download the gathered videos. Once downloaded, OpenCV, a library for computer vision and image processing, is used to extract and store individual frames of those videos based on their estimated motion level (measured via blurriness) so that only informative frames are retained. MediaPipe is then used to locate hand keypoints in each frame, and the keypoints are compiled into a NumPy memory map used to train a simple TensorFlow Sequential neural network model. For application development, React is used to build a dynamic web application with basic features such as device settings and adding contacts; features that require real-time updates are built with Socket.IO, a JavaScript library that enables event-based communication between the web client and server. WebRTC powers the actual video call feature, providing bidirectional, real-time video, audio, and data exchange between user devices by creating peer-to-peer (P2P) connections through signaling and NAT traversal techniques such as ICE and STUN/TURN, which keep communication secure and low-latency. The backend consists of a PostgreSQL database for user information, friend lists, and friend requests, plus the trained TensorFlow model, which receives the front-end video feed, processes the user's hand signs, and sends the translation back as text displayed on the front-end screen.
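As a rough illustration of the frame-selection and keypoint step described above, the following Python sketch keeps only frames whose Laplacian variance (a common sharpness/blurriness measure) passes a threshold and runs MediaPipe Hands on each kept frame. The function name, threshold value, and single-hand assumption are illustrative, not the project's actual code.

# Illustrative sketch (not the project's exact code): filter frames by sharpness
# using the variance of the Laplacian, then extract hand keypoints with MediaPipe.
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_keypoints(video_path, blur_threshold=100.0):
    """Return an array of shape (n_frames, 21 * 3) of hand landmarks
    for the frames that pass the sharpness check."""
    cap = cv2.VideoCapture(video_path)
    keypoints = []
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # Low Laplacian variance indicates motion blur; such frames are skipped
            # here on the assumption that sharper frames carry more useful information.
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if cv2.Laplacian(gray, cv2.CV_64F).var() < blur_threshold:
                continue
            results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if not results.multi_hand_landmarks:
                continue
            landmarks = results.multi_hand_landmarks[0].landmark
            keypoints.append([c for p in landmarks for c in (p.x, p.y, p.z)])
    cap.release()
    return np.array(keypoints, dtype=np.float32)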
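The model-construction step could similarly be pictured as a small Keras Sequential classifier trained on keypoint features read from a NumPy memory map. The file names, layer sizes, and per-frame feature layout below are assumptions rather than the project's real configuration.

# Illustrative sketch (assumed shapes and layer sizes): train a simple
# Sequential classifier on keypoint features stored in a NumPy memory map.
import numpy as np
import tensorflow as tf

NUM_CLASSES = 2000          # one label per WLASL gloss targeted by the project
FEATURES = 21 * 3           # 21 hand landmarks, (x, y, z) each

# np.memmap keeps the training data on disk instead of loading it all into RAM.
X = np.memmap("keypoints.dat", dtype=np.float32, mode="r").reshape(-1, FEATURES)
y = np.load("labels.npy")   # integer class ids; hypothetical file name

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(FEATURES,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=20, batch_size=64, validation_split=0.1)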
This project hopes to serve as a better alternative to existing solutions for ASL translation, specifically in video-call applications and in remote communication more generally, which are often inefficient and costly (e.g. requiring a third party to join the video call and translate the conversation). By using only open-source technology, the project also hopes to bring more awareness to ASL through a user-friendly and easily accessible web application that anyone can use.