Andrew Xia

Learning Digits via Audio-Visual Representations

We trained two CNNs, one on MNIST (images) and one on TIDIGITS (audio); this was our final project for the Machine Learning class (6.867).

Abstract

Our goal is to explore models for language learning (in this case, learning numerical digits in their spoken and visual representations) in the manner in which humans learn languages as children. Namely, children do not have intermediary text transcriptions linking the visual and audio inputs from the world around them; rather, they make connections directly between what they see and what they hear. In this paper, we construct models for direct bi-directional classification of speech and images, inspired by prior work in this area. We experiment with the architectures of two convolutional neural networks, one trained on the TIDIGITS data set (audio) and the other on the MNIST data set (visual), to obtain joint representations of single digits from spoken utterances and images. Finally, we experiment with an alignment model that ties the two convnets together to learn these joint representations. We report an overall image annotation accuracy of 88.5% and an overall image retrieval accuracy of 87.6%.
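For readers curious what such a two-branch setup can look like in practice, here is a minimal sketch, assuming a PyTorch implementation, 28×28 MNIST inputs, log-mel spectrogram inputs for the spoken digits, and a contrastive-style alignment loss. The layer sizes, input shapes, and loss function are illustrative assumptions, not the exact architecture from our paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageBranch(nn.Module):
    """Small convnet embedding a 28x28 MNIST digit into a shared space."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(32 * 7 * 7, embed_dim)

    def forward(self, x):  # x: (batch, 1, 28, 28)
        return self.fc(self.conv(x).flatten(1))


class AudioBranch(nn.Module):
    """Small convnet embedding a log-mel spectrogram of a spoken digit."""
    def __init__(self, embed_dim=64, n_mels=40, n_frames=100):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(32 * (n_mels // 4) * (n_frames // 4), embed_dim)

    def forward(self, x):  # x: (batch, 1, n_mels, n_frames)
        return self.fc(self.conv(x).flatten(1))


def alignment_loss(img_emb, aud_emb, temperature=0.1):
    """Pull matching image/audio pairs together, push mismatched pairs apart."""
    img = F.normalize(img_emb, dim=1)
    aud = F.normalize(aud_emb, dim=1)
    logits = img @ aud.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric cross-entropy: image-to-audio and audio-to-image retrieval.
    return F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)
```

Once both branches map into the same space, annotation (finding the spoken digit that matches an image) and retrieval (the reverse direction) reduce to nearest-neighbor search over the learned embeddings.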

I worked with Sitara Persad and Karan Kashyap on this final project.

See our paper writeup here.

Our GitHub repository for the project is available here.
