School of Science and Technology 科技學院
Computing Programmes 電腦學系

Visually Impaired Assistant With Neural Network And Image Recognition

Arthur On Chun LIU, Ki Shun LI, LI Qi YAN

Programme: Bachelor of Computing with Honours in Internet Technology
Supervisor: Prof. Vanessa Ng
Areas: Intelligent Applications
Year of Completion: 2020
Award: Champion, IEEE (HK) Computational Intelligence Chapter 17th FYP Competition

Objectives

The aim of this project is to implement a visually impaired assistant with neural networks and image recognition to help visually impaired people go outside efficiently and safely. The assistant acts like a companion walking with the user, using image recognition to describe the nearby environment to the visually impaired user in real time so that they know what is going on in front of them. The obstacle detector tells them about obstacles ahead so that they can prepare early.

Navigation is also essential for visually impaired people to reach their destinations. Although several apps on the Internet already provide navigation, this system makes navigation more useful for assisting visually impaired users in getting to their destinations.

The main aim of the project is to implement an application that helps visually impaired people go outside. The project also defines several sub-objectives as follows:

  • To design and develop a scene description system that describes the scene, letting users know what is happening in front of them.
  • To design and develop an obstacle prompt system that describes the obstacles ahead so that users can prepare early.
  • To design and develop a voice navigation system that guides the user to the destination without requiring them to watch the phone screen.
  • To design and develop a mobile application (Android only) that handles the voice output of the three systems above and provides system initiation and speech recognition.

Video Demonstration

Techniques and technologies used

Image Captioning

To provide scene descriptions, we use image captioning technology. We chose "Show, Attend and Tell" as the image captioning model for our project, since our target users are people in Hong Kong. A minimal sketch of the model's attention step follows the figures below.

Figure 1: Principle of Show, Attend and Tell, retrieved from Xu et al., 2015

Figure 2: Prediction Caption
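At each decoding step, Show, Attend and Tell computes soft attention weights over the CNN's spatial feature vectors against the current LSTM hidden state and feeds the resulting context vector to the decoder. The following NumPy sketch shows that single attention step only; the parameter names, projection sizes, and random inputs are our own illustrative assumptions, not the project's code.

```python
import numpy as np

def soft_attention(features, hidden, W_f, W_h, w_a):
    """One soft-attention step in the spirit of Show, Attend and Tell (Xu et al., 2015).

    features: (L, D) CNN annotation vectors (L spatial locations, D channels)
    hidden:   (H,)   current LSTM decoder hidden state
    W_f, W_h, w_a:   learned projection parameters (hypothetical shapes here)
    Returns the context vector fed to the decoder and the attention weights.
    """
    scores = np.tanh(features @ W_f + hidden @ W_h) @ w_a   # (L,) relevance of each location
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                    # softmax over locations
    context = alpha @ features                              # (D,) attention-weighted sum
    return context, alpha

# Toy dimensions: a 14x14 feature map (L = 196) with 512 channels and a 256-d hidden state.
L, D, H, A = 196, 512, 256, 64
rng = np.random.default_rng(0)
context, alpha = soft_attention(rng.normal(size=(L, D)), rng.normal(size=H),
                                rng.normal(size=(D, A)), rng.normal(size=(H, A)),
                                rng.normal(size=A))
```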

Speech Recognition

For speech recognition, Graves and Jaitly (2014) proposed a model that combines recurrent neural networks with a Connectionist Temporal Classification (CTC) layer, using RNN-CTC acoustic and pronunciation models (A. Graves and N. Jaitly, 2014).
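As a concrete illustration of the RNN-CTC idea, the sketch below runs one training step of a bidirectional LSTM acoustic model with CTC loss on dummy log-mel features. The framework (PyTorch), layer sizes, and alphabet size are our own assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SpeechRNN(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_chars=29):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden * 2, n_chars + 1)        # +1 for the CTC blank symbol

    def forward(self, x):                                   # x: (batch, time, n_mels)
        out, _ = self.rnn(x)
        return self.fc(out).log_softmax(-1)                 # per-frame log-probabilities

model = SpeechRNN()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(4, 200, 80)                 # dummy batch of log-mel features
log_probs = model(feats).transpose(0, 1)        # CTCLoss expects (time, batch, classes)
targets = torch.randint(1, 30, (4, 30))         # dummy character targets (0 is the blank)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 200, dtype=torch.long),
           target_lengths=torch.full((4,), 30, dtype=torch.long))
loss.backward()
```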

Object Detection

To provide obstacle prompts, we use object detection technology. A single-shot multi-box detector (SSD) with a MobileNetV2 backbone is lightweight and suitable for most mobile and embedded devices. It is used as the backend of our obstacle prompt system, since real-time performance is especially important when the user may not be aware of the obstacle. A sketch of running such a detector follows the figure below.

Figure 3: Example output of object detection, retrieved from TensorFlow
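A minimal sketch of running such a detector with the TensorFlow Lite interpreter is shown below. The model file name, the 300x300 uint8 input, the output tensor ordering, and the 0.6 confidence threshold are assumptions that depend on the particular SSD MobileNetV2 export and are not taken from the project's code.

```python
import numpy as np
import tensorflow as tf
from PIL import Image

# Hypothetical model file: a COCO-trained SSD MobileNetV2 exported to TFLite.
interpreter = tf.lite.Interpreter(model_path="ssd_mobilenet_v2.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
outs = interpreter.get_output_details()

# Many SSD TFLite exports expect a fixed 300x300 uint8 image.
frame = Image.open("frame.jpg").resize((300, 300))
interpreter.set_tensor(inp["index"], np.expand_dims(np.asarray(frame, dtype=np.uint8), 0))
interpreter.invoke()

# Typical output order for the TFLite detection post-process: boxes, classes, scores.
boxes = interpreter.get_tensor(outs[0]["index"])[0]     # [N, 4] normalised ymin, xmin, ymax, xmax
classes = interpreter.get_tensor(outs[1]["index"])[0]   # [N] class ids
scores = interpreter.get_tensor(outs[2]["index"])[0]    # [N] confidences

for box, cls, score in zip(boxes, classes, scores):
    if score > 0.6:                                     # keep only confident detections
        ymin, xmin, ymax, xmax = box
        centre_x = (xmin + xmax) / 2                    # later used to say "left/centre/right"
        print(int(cls), float(score), float(centre_x))
```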

Depth Estimation

To provide more accurate obstacle prompts, we use depth estimation technology to determine which obstacle is most dangerous to the user. High Quality Monocular Depth Estimation via Transfer Learning (DenseDepth) is an encoder-decoder model that produces a high-resolution depth map from a single RGB image. A sketch of how the depth map can be combined with the detections follows the figures below.

Figure 4: Original Color Image

Figure 5: Depth map
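One possible way to combine the detector's boxes with the DenseDepth output is sketched below: take the median predicted depth inside each detection box and treat the detection with the smallest depth as the most dangerous obstacle. The detection format and the use of the median are our own illustrative assumptions, not the project's exact logic.

```python
import numpy as np

def most_dangerous(detections, depth_map):
    """Rank detected obstacles by estimated proximity.

    detections: list of (label, score, (ymin, xmin, ymax, xmax)) with normalised boxes,
                e.g. as produced by the detector sketch above (hypothetical format).
    depth_map:  (H, W) array from DenseDepth; larger values are assumed to mean farther away.
    Returns (label, depth) of the nearest obstacle, or None if nothing was detected.
    """
    H, W = depth_map.shape
    ranked = []
    for label, score, (ymin, xmin, ymax, xmax) in detections:
        patch = depth_map[int(ymin * H):int(ymax * H), int(xmin * W):int(xmax * W)]
        if patch.size:
            # the median depth inside the box is robust to a few noisy pixels
            ranked.append((float(np.median(patch)), label))
    if not ranked:
        return None
    depth, label = min(ranked)        # smallest depth = nearest, i.e. most urgent
    return label, depth
```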

System architecture

The system consists of three components: the Android Client, the Raspberry Pi, and the Machine Learning Server. The responsibilities of the components are shown below.

Figure 6: Major Components

Figure 7: Components of the system

Android Client

  • Acts as the voice interface to the user
  • Returns navigation information upon user request
  • Uses an MQTT client to connect to the Raspberry Pi and prompts the user when danger warnings or scene descriptions are received (a sketch of this flow follows this list)
  • Opens a Wi-Fi hotspot for the Raspberry Pi to connect to
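The production client is an Android app, but the subscribe-and-prompt flow can be illustrated with a short Python sketch using paho-mqtt. The broker address and topic names below are hypothetical; on the phone the received text would be handed to the text-to-speech service rather than printed.

```python
import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    # In the Android app this payload would be spoken by the text-to-speech service.
    print(f"[{msg.topic}] {msg.payload.decode()}")

client = mqtt.Client()
client.on_message = on_message
client.connect("192.168.43.1", 1883)              # hypothetical Pi address on the phone's hotspot
client.subscribe([("assistant/obstacle", 0),      # hypothetical topic names
                  ("assistant/caption", 0)])
client.loop_forever()
```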

Raspberry Pi

  • Connects to the Android Client's Wi-Fi hotspot
  • Hosts an MQTT broker for data transfer with the Android Client
  • Serves as a hands-free camera device that can easily be mounted on a walking stick or worn around the neck, since the user may need a hand free to hold the walking stick.
  • Hosts a video streaming server that performs motion detection and serves images to clients over MJPEG.
  • When motion is detected, the video streaming server on the Raspberry Pi broadcasts an MQTT message. Once notified, the object detection client and the image caption client each send an HTTP request to the Machine Learning Server and forward the response data to the MQTT broker (a minimal sketch of this step follows this list).
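A minimal sketch of the motion-detect-and-notify step on the Raspberry Pi, using OpenCV frame differencing and paho-mqtt, is shown below. The thresholds and the topic name are illustrative assumptions rather than the project's actual values.

```python
import cv2
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("localhost", 1883)                 # the MQTT broker runs on the Pi itself

cap = cv2.VideoCapture(0)
prev = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.GaussianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (21, 21), 0)
    if prev is None:
        prev = gray
        continue
    # simple frame differencing: count pixels that changed noticeably
    delta = cv2.absdiff(prev, gray)
    changed = cv2.countNonZero(cv2.threshold(delta, 25, 255, cv2.THRESH_BINARY)[1])
    if changed > 5000:                            # hypothetical motion threshold
        client.publish("assistant/motion", "detected")   # wakes the detection/caption clients
    prev = gray
```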

Machine Learning Server

  • Serves requests through Flask, a Python web application framework (a minimal sketch of the server follows this list)
  • Since the inference time of deep neural networks (DNNs) is relatively long, object detection requests and image caption requests are processed separately.
  • The object detection server returns the object name and its relative direction when the detection confidence is high and the object may be dangerous to the user.
  • The image caption server returns sentences describing the scene.
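A minimal Flask sketch of the two endpoints is given below. run_detector and run_captioner are placeholders standing in for the SSD MobileNetV2 and Show, Attend and Tell models, and the routes and payload format are assumptions rather than the project's actual API.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_detector(image_bytes):
    # placeholder: in the real server this would call the SSD MobileNetV2 model
    return [{"label": "person", "direction": "left", "score": 0.9}]

def run_captioner(image_bytes):
    # placeholder: in the real server this would call the Show, Attend and Tell model
    return "a person is walking on the street"

@app.route("/detect", methods=["POST"])
def detect():
    # the request body carries a JPEG frame pushed by the Raspberry Pi client
    return jsonify(run_detector(request.data))

@app.route("/caption", methods=["POST"])
def caption():
    return jsonify({"caption": run_captioner(request.data)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```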

System Design and Implementation

Figure 8: Opening "Visually Impaired Assistant"

Figure 9: Voice Recognition Function

Figure 10: Voice Recognition Function

Figure 11: Voice Navigation screen

Figure 12: Navigation screen

Figure 13: Scene Description screen

The voice recognition interface is displayed when the user taps the voice recognition button on the main screen or the voice navigation item on the bottom navigation bar. Users can say where they want to go, and the application automatically opens the navigation screen.

The navigation screen shows the route to the destination, and the route information is read out by the text-to-speech service.

The scene description interface shows the image from the external camera and the scene description of that picture, which is read aloud by the text-to-speech service according to priority.
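One simple way to read prompts out according to priority is a small priority queue in front of the text-to-speech service, as sketched below. The priority ordering (obstacle warnings before navigation, navigation before scene captions) is our own illustrative assumption.

```python
import queue

# lower number = higher priority (assumed ordering, not the project's exact scheme)
PRIORITY = {"obstacle": 0, "navigation": 1, "caption": 2}

speech_queue = queue.PriorityQueue()

def enqueue(kind, text):
    speech_queue.put((PRIORITY[kind], text))

def speak_next(tts_speak):
    """Pop the most urgent prompt and hand it to a text-to-speech callback."""
    if not speech_queue.empty():
        _, text = speech_queue.get()
        tts_speak(text)

enqueue("caption", "a man is walking a dog on the street")
enqueue("obstacle", "lamppost ahead on the left")
speak_next(print)    # the obstacle warning is spoken first
```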

Figure 14: Setting screen

Figure 15: A user holding the devices with a walking stick

Figure 16: A user wearing the devices around the neck

The settings screen lets the user turn 3D direction prompts on or off and adjust the related prompt intervals.

The user can hang the Raspberry Pi and the external camera (the devices) around the neck or mount them on a walking stick. After that, the user puts on the earpiece and microphone and then opens the app on their mobile phone.

The user can listen to descriptions of the nearby environment through the earpiece; if there are obstacles ahead such as lampposts, stairs, or signs, the application prompts the user by voice.

Evaluation

In our observations, testers collided with objects and people less often when using our equipment and application than when not using them. The number of collisions with objects was reduced by about one third, and the number of collisions with people dropped significantly, by more than one half. The data indicate that using our devices and application can reduce the chance of such accidents and thus improve user safety. Besides, users reached the destination more efficiently when using our solution with 3D direction prompts, as can be seen from the time taken to find the correct direction and the time taken to travel to the destination.

Table 1: User Evaluation Average Result

Conclusion and Future Development

This project successfully implemented a navigation system that uses the Mapbox API and converts its voice output into Cantonese so that the system works for users in Hong Kong. On top of the navigation system, a 3D direction prompt system was also implemented, allowing users to tell whether they are facing the correct direction through the left and right audio channels and the louder or softer beep sound. In addition, the voice recognition system uses DialogFlow to extract the destination from the user's speech and passes it to the Mapbox API to search for an appropriate route.
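As an illustration of how such a 3D direction prompt can work, the sketch below maps the heading error to left/right channel volume and overall beep loudness. The linear mapping and the choice to make the beep louder when the user is nearly on course are our own assumptions, not the project's exact implementation.

```python
def direction_prompt(user_heading, route_bearing):
    """Map the heading error to stereo beep parameters.

    Angles are in degrees clockwise from north.
    Returns (left_volume, right_volume, overall_volume), each in [0, 1].
    """
    error = (route_bearing - user_heading + 180) % 360 - 180   # signed error in (-180, 180]
    pan = max(-1.0, min(1.0, error / 90.0))     # -1 = beep fully in the left ear, +1 = right ear
    left, right = (1 - pan) / 2, (1 + pan) / 2
    overall = 1.0 - min(abs(error), 90) / 90    # louder beep when nearly facing the right way
    return left, right, overall

# Facing north while the route heads north-east: the beep plays mostly in the right ear.
print(direction_prompt(user_heading=0, route_bearing=45))
```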

The environment description system achieves real-time scene description: the user can learn what is happening in front of the camera through voice output from the text-to-speech service. The environment description lets visually impaired users obtain information about their surroundings without relying solely on other people.

The obstacle prompt system has also been successfully implemented. It uses image recognition technology to recognize objects in the camera image and prompts the user by voice when an obstacle is found.

Future work

The system can be improved in both hardware and software.

Hardware

  • Combine the Raspberry Pi and its battery into a single unit to reduce the overall size
  • Add a microphone and buttons so that basic services can be provided without a phone
  • Reduce dependence on network connectivity by using a Jetson Nano or another single-board computer with a neural network acceleration chip
  • Add an infrared lens or an ultrasonic sensor to sense the distance of objects ahead more accurately

Software

  • Improve the accuracy of obstacle recognition and the number of object classes recognized by changing the image recognition model
  • Improve the language and accuracy of scene descriptions by changing the image captioning model and training it on our own dataset
  • Combine the image caption model and the object detection model by sharing the same encoder/backbone