Introduction: Why?
Blind individuals often face challenges in accessing and understanding visual content such as images and videos. While many websites and applications provide alternative text descriptions for visual content, these descriptions can be difficult to access and understand, particularly when the content is complex or dynamic. There is a need for better ways to access and understand visual content.
Key Skills
- User Research
- Voice Flow Design
- User Flow
- Alexa Skill Development
Team
Rujula Singh R, Harshita Shyale
CSE Majors, RVCE
Duration
2 Months
How do we solve it?
To address this challenge, we developed an Alexa skill that provides audio descriptions for visual content. The skill lets users request an audio description of any visual content they are viewing, such as an image or video, and generates one on demand. It uses natural language processing (NLP) to understand and interpret user requests, and text-to-speech (TTS) technology to produce the audio descriptions.
Design Process
Define Users and their Needs:
In this case, the user was a blind individual who needed a better way to access and understand visual content.
To understand the user's needs and challenges, we reached out to the National Association for the Blind Karnataka and Mithra Jyothi in Bengaluru, and gathered insights from blind and low-vision users through interviews and surveys. The goal was to come up with a convenient and effective way for them to access and understand visual content.
Based on the interviews, we defined a User Persona for our skill:
User Persona
Main Pain-Points:
- Many websites and applications do not provide alternative text descriptions for their visual content, making it difficult to access and understand this information.
- Even when alternative text descriptions are available, they can be difficult to understand, particularly when the visual content is complex or dynamic.
- Screen reader software like TalkBack and other assistive technologies do not always provide adequate descriptions of visual content.
User Insights:
- There is a need for better ways to access and understand visual content.
- Users with low vision or blindness appreciate the convenience and accessibility of a voice-based interface, such as Alexa, for requesting audio descriptions of visual content.
Define the Problem and Solution:
Based on the user research, we defined the problem as the difficulty that blind individuals face when accessing and understanding visual content, and the solution as a voice-based Alexa skill that generates audio descriptions of that content on request.
Ideation & Design
High-Level Concept
We started by creating a high-level concept for the skill, including the key features and functionality that it would offer. The high-level concept for the Alexa skill that provides audio descriptions for visual content includes the following key features and functionality:
- Provide clear and concise audio descriptions of visual content, using Amazon Rekognition to analyze the content and Amazon Polly's text-to-speech (TTS) to render the description (a sketch of this step follows the list).
- Allow users to request audio descriptions of images and videos, using voice commands and slots to capture the user's intent and the URL of the content they want described.
- Ensure that the skill is easy to use and navigate, using a simple and intuitive voice user interface (VUI) that allows users to quickly and easily request audio descriptions of visual content.
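The analysis step could look roughly like the following, assuming the AWS SDK v3 for JavaScript and Node 18+ (for the global fetch); the helper name describeImage, the region, and the spoken wording are illustrative, not the production code:

// Analyze an image with Amazon Rekognition and turn its labels into a sentence.
const { RekognitionClient, DetectLabelsCommand } = require('@aws-sdk/client-rekognition');

const rekognition = new RekognitionClient({ region: 'us-east-1' }); // region is illustrative

async function describeImage(imageUrl) {
  // Rekognition accepts raw bytes or an S3 object, not a URL, so fetch the image first.
  const response = await fetch(imageUrl);
  const bytes = Buffer.from(await response.arrayBuffer());

  const { Labels } = await rekognition.send(new DetectLabelsCommand({
    Image: { Bytes: bytes },
    MaxLabels: 5,
    MinConfidence: 80,
  }));

  // Turn the most confident labels into a short spoken sentence.
  const names = Labels.map((label) => label.Name.toLowerCase());
  return `This image appears to contain ${names.join(', ')}.`;
}

In the skill itself, the resulting sentence is spoken through Alexa's built-in, Polly-backed text-to-speech, so the sketch does not need a separate Polly call.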
Voice Flow
Sample Dialogue Flows
“Wizard of Oz” Testing
Sample dialogue flows and audio files were prepared.
Role-playing participants interacted with what they believed to be a live Alexa, which was actually operated by an unseen researcher in another room: “the Wizard”.
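One illustrative exchange (the invocation name and wording here are representative, not the final script):

User: “Alexa, open image describer.”
Alexa: “Welcome. I can describe an image or a video for you. What would you like?”
User: “Describe my image.”
Alexa: “This image appears to contain a dog, grass, and a frisbee. Would you like another description?”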
Develop
The skill is built with the Alexa Skills Kit (ASK) SDK for Node.js, and its interaction model defines the invocation name, a collection of intents, sample utterances, and slots that enable the skill to understand and respond to user requests for audio descriptions of visual content.
A Few Code Snippets:
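A representative request handler, assuming the ASK SDK v2 for Node.js; the intent name DescribeImageIntent and the slot name contentUrl are illustrative rather than the exact names used in the skill:

const Alexa = require('ask-sdk-core');

const DescribeImageIntentHandler = {
  canHandle(handlerInput) {
    return Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest'
      && Alexa.getIntentName(handlerInput.requestEnvelope) === 'DescribeImageIntent';
  },
  async handle(handlerInput) {
    // Read the content URL captured by the slot.
    const contentUrl = Alexa.getSlotValue(handlerInput.requestEnvelope, 'contentUrl');
    // describeImage is the Rekognition helper sketched in the Ideation section.
    const description = await describeImage(contentUrl);
    return handlerInput.responseBuilder
      .speak(description)
      .reprompt('Would you like me to describe anything else?')
      .getResponse();
  },
};

exports.handler = Alexa.SkillBuilders.custom()
  .addRequestHandlers(DescribeImageIntentHandler)
  .lambda();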
Test & Refine
We ran the skill on the Alexa simulator to test it with real people, and continually updated utterances and handlers to enable graceful error handling.
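Graceful error handling can be captured in a catch-all error handler of the kind the ASK SDK v2 supports; the spoken wording below is illustrative:

const GenericErrorHandler = {
  canHandle() {
    // Claim every unhandled error so the session never fails silently.
    return true;
  },
  handle(handlerInput, error) {
    console.error(`Error handled: ${error.message}`);
    return handlerInput.responseBuilder
      .speak('Sorry, I could not describe that. Please try again with a different image or video.')
      .reprompt('What would you like me to describe?')
      .getResponse();
  },
};

It is registered alongside the request handlers via .addErrorHandlers(GenericErrorHandler).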
Output & Results
Images and videos are uploaded to Google Drive and made available via URL.
Amazon Rekognition Outputs for Sample Images:
Working Alexa Skill on Alexa Simulator:
Results & Conclusion
Ultimately, we were able to develop an Alexa skill to aid blind and low-vision users, who are ideally already accustomed to using an AI assistant daily. The skill gives audio descriptions of images and videos provided either as a URL or as an upload to a drive.
- Users are able to get a decent idea of visual content on their Alexa devices.
- Using Alexa is a convenient way to get audio descriptions of visual content.
Future Work
While the present input methods serve effectively as a proof of concept, the ways of providing input to Alexa can be improved. One option is to integrate messaging apps or a cloud drive with Alexa: content forwarded to a designated account would be uploaded automatically, analyzed, and turned into an audio description.