Meet Harshita.

Voice to Vision: Audio Descriptions for Visual Content

Introduction: Why?

Blind individuals often face challenges when accessing and understanding visual content, such as images and videos. While many websites and applications provide alternative text descriptions for visual content, these descriptions can be difficult for blind individuals to access and understand, particularly when the content is complex or dynamic. There is a need for more ways to access and understand visual content.

Key Skills

  • User Research
  • Voice Flow Design
  • User Flow
  • Alexa Skill Development

Team

Rujula Singh R, Harshita Shyale

CSE Majors, RVCE

Duration

2 Months

How do we solve?

To address this challenge, we have developed an Alexa skill that provides audio descriptions for visual content. The skill allows users to request an audio description of any visual content that they are viewing, such as an image or video, and then generates an audio description of the content based on the user's request. The skill uses natural language processing (NLP) techniques to understand and interpret user requests, and text-to-speech (TTS) technology to generate the audio descriptions of the visual content.
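At a high level, the skill runs a three-step pipeline: interpret the request, analyze the content, and speak a description. A minimal Node.js sketch of that flow, with the vision stage stubbed out (the helper names are illustrative; in the real skill, analysis is delegated to Amazon Rekognition and speech synthesis to Amazon Polly):

```javascript
// High-level sketch of the request-to-audio pipeline (hypothetical helper
// names; the real skill uses Amazon Rekognition for analysis and Amazon
// Polly for text-to-speech).

// Step 1: interpret the user's request — pull the content URL out of the
// slots resolved by Alexa's NLP.
function parseRequest(slots) {
  return { contentUrl: slots.contentUrl.value };
}

// Step 2: analyze the content. Stubbed here; Rekognition would return
// labels such as [{ Name: 'Dog', Confidence: 98.1 }, ...].
function analyzeContent(contentUrl) {
  return [{ Name: 'Dog', Confidence: 98.1 }, { Name: 'Beach', Confidence: 95.4 }];
}

// Step 3: turn the labels into a sentence suitable for text-to-speech.
function composeDescription(labels) {
  const names = labels.map((l) => l.Name.toLowerCase());
  return `This image appears to contain: ${names.join(', ')}.`;
}

function describe(slots) {
  const { contentUrl } = parseRequest(slots);
  return composeDescription(analyzeContent(contentUrl));
}
```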

Design Process

image

Define Users and their Needs:

In this case, the user is a blind individual who needs a better way to access and understand visual content.

To understand users' needs and challenges, we reached out to the National Association for the Blind Karnataka and Mithra Jyothi in Bengaluru, and gathered insights from blind and low-vision users through interviews and surveys. The goal was to come up with a convenient and effective way to access and understand visual content.

Based on the interviews, we defined a User Persona for our Skill:

User Persona

image

Main Pain-Points:

  • Many websites and applications do not provide alternative text descriptions for their visual content, making it difficult to access and understand this information.
  • Even when alternative text descriptions are available, they can be difficult to understand, particularly when the visual content is complex or dynamic.
  • Screen-reader software like TalkBack and other assistive technologies do not always provide adequate descriptions of visual content.

User Insights:

  • There is a need for more ways to access and understand visual content.
  • Users with low vision or blindness appreciate the convenience and accessibility of a voice-based interface, such as Alexa, for requesting audio descriptions of visual content.

Define the Problem and Solution:

Based on the user research, we defined the problem as the difficulty that blind individuals face when accessing and understanding visual content.

image

Ideation & Design

High-Level Concept

We started by creating a high-level concept for the skill, covering the key features and functionality it would offer:

  • Provide clear and concise audio descriptions of visual content, using natural language processing (NLP) and Amazon Rekognition and Amazon Polly to analyze the content and generate an audio description using text-to-speech (TTS) technology.
  • Allow users to request audio descriptions of images, and videos, using voice commands and slots to capture the user's intent and the URL of the content that they want to describe.
  • Ensure that the skill is easy to use and navigate, using a simple and intuitive voice user interface (VUI) that allows users to quickly and easily request audio descriptions of visual content.

Voice Flow

image

Sample Dialogue Flows

image

“Wizard of Oz” Testing

Sample dialogue flows and audio files were prepared.

Role-playing participants interacted with Alexa, which they believed to be live but which was actually operated by an unseen researcher in another room: "the Wizard".

image

Develop

The code uses the Alexa Skills Kit (ASK) SDK for Node.js to create and implement the skill, and includes the invocation name, collection of intents, sample utterances, and slots that enable the skill to understand and respond to user requests for audio descriptions of visual content.
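The SDK dispatches each incoming request to the first registered handler whose canHandle returns true. A plain-Node sketch of that dispatch contract (handler names and reply text are illustrative; the real skill registers its handlers with Alexa.SkillBuilders.custom().addRequestHandlers(...)):

```javascript
// Sketch of how the ASK SDK routes requests: every handler exposes a
// canHandle/handle pair, and the dispatcher picks the first match.
const LaunchHandler = {
  canHandle: (input) => input.requestEnvelope.request.type === 'LaunchRequest',
  handle: () => 'Welcome to Voice Vision.',
};

const GetAudioDescriptionHandler = {
  canHandle: (input) =>
    input.requestEnvelope.request.type === 'IntentRequest' &&
    input.requestEnvelope.request.intent.name === 'GetAudioDescriptionIntent',
  handle: () => 'Fetching an audio description...',
};

// Mimics the SDK's request dispatcher: first matching handler wins.
function dispatch(handlerInput, handlers) {
  const handler = handlers.find((h) => h.canHandle(handlerInput));
  if (!handler) throw new Error('No handler matched the request');
  return handler.handle(handlerInput);
}
```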

A Few Code Snippets:

GetAudioDescriptionIntent with slots and sample utterances:
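The intent is declared in the skill's interaction model. A sketch of what that JSON could look like (the slot type and utterance wording are assumptions for illustration, not the skill's exact model):

```javascript
// Illustrative interaction-model fragment for GetAudioDescriptionIntent.
// The contentUrl slot captures the user's reference to the image or video.
const interactionModel = {
  intents: [
    {
      name: 'GetAudioDescriptionIntent',
      slots: [{ name: 'contentUrl', type: 'AMAZON.SearchQuery' }],
      samples: [
        'describe the content at {contentUrl}',
        'what is in the image at {contentUrl}',
        'give me an audio description of {contentUrl}',
      ],
    },
  ],
};
```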
Interacting with Amazon Rekognition & Polly
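A sketch of the Rekognition-to-Polly round trip. The clients are injected so the shape is testable without AWS credentials; in the skill they would be aws-sdk clients (new AWS.Rekognition(), new AWS.Polly()), and the parameter names below match those APIs (detectLabels, synthesizeSpeech):

```javascript
// Turn Rekognition labels into one spoken sentence, keeping only
// reasonably confident labels.
function labelsToDescription(labels, minConfidence = 80) {
  const names = labels
    .filter((l) => l.Confidence >= minConfidence)
    .map((l) => l.Name.toLowerCase());
  return `I can see: ${names.join(', ')}.`;
}

// Detect labels in an S3-hosted image, then synthesize the description
// with Polly. `rekognition` and `polly` are injected AWS SDK clients.
async function describeImage(rekognition, polly, bucket, key) {
  const { Labels } = await rekognition
    .detectLabels({
      Image: { S3Object: { Bucket: bucket, Name: key } },
      MaxLabels: 10,
      MinConfidence: 80,
    })
    .promise();

  const text = labelsToDescription(Labels);

  const { AudioStream } = await polly
    .synthesizeSpeech({ Text: text, OutputFormat: 'mp3', VoiceId: 'Joanna' })
    .promise();

  return { text, audio: AudioStream };
}
```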
LaunchRequestHandler is used when the skill is invoked by saying "Open voice-vision"
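A sketch of that handler (the welcome wording is illustrative). It follows the SDK's canHandle/handle contract and builds the reply with responseBuilder:

```javascript
// Greets the user when the skill is opened with "Open voice-vision".
const LaunchRequestHandler = {
  canHandle(handlerInput) {
    return handlerInput.requestEnvelope.request.type === 'LaunchRequest';
  },
  handle(handlerInput) {
    const speech =
      'Welcome to Voice Vision. Say describe, followed by a content link, ' +
      'to hear an audio description.';
    return handlerInput.responseBuilder
      .speak(speech)
      .reprompt('What would you like me to describe?')
      .getResponse();
  },
};
```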
Handler that fetches content from the source, generates the audio description, and plays it for the user.
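A sketch of that handler. The fetch-and-describe pipeline is abstracted behind an injected describe function (an assumption for the sketch, standing in for the Rekognition/Polly calls; the real SDK passes only handlerInput to handle). The reply embeds the synthesized MP3 in SSML via an audio tag, which is how Alexa plays short pre-rendered audio:

```javascript
// Fetches the requested content's description and plays it back.
// `describe` is an injected async (url) => mp3Url function — an assumption
// standing in for the Rekognition/Polly pipeline.
const GetAudioDescriptionIntentHandler = {
  canHandle(handlerInput) {
    const { request } = handlerInput.requestEnvelope;
    return (
      request.type === 'IntentRequest' &&
      request.intent.name === 'GetAudioDescriptionIntent'
    );
  },
  async handle(handlerInput, describe) {
    const url =
      handlerInput.requestEnvelope.request.intent.slots.contentUrl.value;
    const mp3Url = await describe(url);
    // Embed the synthesized audio in SSML so Alexa plays it directly.
    return handlerInput.responseBuilder
      .speak(`<audio src="${mp3Url}"/>`)
      .getResponse();
  },
};
```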
Some more Intents
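Alexa's built-in intents round out the model; Help and Stop/Cancel handlers could look like this (reply wording is illustrative):

```javascript
// Explains how to use the skill.
const HelpIntentHandler = {
  canHandle(handlerInput) {
    const { request } = handlerInput.requestEnvelope;
    return (
      request.type === 'IntentRequest' &&
      request.intent.name === 'AMAZON.HelpIntent'
    );
  },
  handle(handlerInput) {
    return handlerInput.responseBuilder
      .speak('You can say: describe the content at, followed by a link.')
      .getResponse();
  },
};

// Ends the session on "stop" or "cancel".
const CancelAndStopIntentHandler = {
  canHandle(handlerInput) {
    const { request } = handlerInput.requestEnvelope;
    return (
      request.type === 'IntentRequest' &&
      ['AMAZON.CancelIntent', 'AMAZON.StopIntent'].includes(request.intent.name)
    );
  },
  handle(handlerInput) {
    return handlerInput.responseBuilder.speak('Goodbye!').getResponse();
  },
};
```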

Test & Refine

Ran the skill on the Alexa simulator to test it with real people, and continually updated utterances to enable graceful error handling.
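Graceful error handling in an ASK skill typically ends with a catch-all ErrorHandler whose canHandle always returns true, so unmatched utterances and runtime failures produce an apology instead of a crash. A sketch (apology wording is illustrative):

```javascript
// Catch-all error handler: matches every error, logs it, and asks the
// user to try again instead of ending the session abruptly.
const ErrorHandler = {
  canHandle() {
    return true;
  },
  handle(handlerInput, error) {
    console.error(`Error handled: ${error.message}`);
    return handlerInput.responseBuilder
      .speak("Sorry, I couldn't do that. Please try again.")
      .reprompt('What would you like me to describe?')
      .getResponse();
  },
};
```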

Output & Results

Images and videos are uploaded to a Google Drive and made available via a URL.

Amazon Rekognition Outputs for Sample Images:

image
image
image

Working Alexa Skill on Alexa Simulator:

Here, a personal Google Drive is pre-configured, where users can upload visual content with the aid of a screen reader. Additionally, users may use a screen-reader app or text-to-speech software on their device to read out the URL of the content.

Results & Conclusion

Ultimately, we were able to develop an Alexa skill to aid blind and low-vision users, who are ideally already accustomed to using an AI assistant daily. The skill gives audio descriptions of images or videos, either from a URL or after they are uploaded to a drive.

  • Users are able to get a decent idea of visual content on their Alexa devices.
  • Using Alexa is a convenient way to get audio descriptions of visual content.

Future Work

While the present input methods serve effectively as a proof of concept, the ways of providing input to Alexa can be improved. One option is to integrate messaging apps, or a drive, with Alexa: content forwarded to a designated account would be uploaded automatically, analysed, and turned into an audio description.


©harshitashyale
