Learning An Invariant Speech Representation

Evangelopoulos, Georgios; Voinea, Stephen; Zhang, Chiyuan; Rosasco, Lorenzo; Poggio, Tomaso

dc.contributor.author	Evangelopoulos, Georgios
dc.contributor.author	Voinea, Stephen
dc.contributor.author	Zhang, Chiyuan
dc.contributor.author	Rosasco, Lorenzo
dc.contributor.author	Poggio, Tomaso
dc.date.accessioned	2015-12-10T23:51:52Z
dc.date.available	2015-12-10T23:51:52Z
dc.date.issued	2014-06-15
dc.identifier.uri	http://hdl.handle.net/1721.1/100186
dc.description.abstract	Recognition of speech, and in particular the ability to generalize and learn from small sets of labelled examples like humans do, depends on an appropriate representation of the acoustic input. We formulate the problem of finding robust speech features for supervised learning with small sample complexity as a problem of learning representations of the signal that are maximally invariant to intraclass transformations and deformations. We propose an extension of a theory for unsupervised learning of invariant visual representations to the auditory domain and empirically evaluate its validity for voiced speech sound classification. Our version of the theory requires the memory-based, unsupervised storage of acoustic templates — such as specific phones or words — together with all the transformations of each that normally occur. A quasi-invariant representation for a speech segment can be obtained by projecting it to each template orbit, i.e., the set of transformed signals, and computing the associated one-dimensional empirical probability distributions. The computations can be performed by modules of filtering and pooling, and extended to hierarchical architectures. In this paper, we apply a single-layer, multicomponent representation for phonemes and demonstrate improved accuracy and decreased sample complexity for vowel classification compared to standard spectral, cepstral and perceptual features.	en_US
dc.description.sponsorship	This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.	en_US
dc.language.iso	en_US	en_US
dc.publisher	Center for Brains, Minds and Machines (CBMM), arXiv	en_US
dc.relation.ispartofseries	CBMM Memo Series;022
dc.rights	Attribution-NonCommercial 3.0 United States	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc/3.0/us/	*
dc.subject	Speech Recognition	en_US
dc.subject	Invariance	en_US
dc.subject	Machine Learning	en_US
dc.subject	Language	en_US
dc.title	Learning An Invariant Speech Representation	en_US
dc.type	Technical Report	en_US
dc.type	Working Paper	en_US
dc.type	Other	en_US
dc.identifier.citation	arXiv:1406.3884v1	en_US
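
The projection-and-pooling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: circular shifts stand in for the (unspecified) transformations that generate each template orbit, a fixed-bin histogram stands in for the one-dimensional empirical distribution, and the names `circular_shifts`, `orbit_histogram`, and `signature` are hypothetical.

```python
import math


def circular_shifts(template):
    """Orbit of a template under an ASSUMED transformation group: all circular shifts."""
    return [template[k:] + template[:k] for k in range(len(template))]


def normalize(x):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else list(x)


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def orbit_histogram(signal, orbit, n_bins=8):
    """Filtering: project the signal onto every element of one template orbit.
    Pooling: summarize the projections as a normalized histogram over [-1, 1]."""
    s = normalize(signal)
    counts = [0] * n_bins
    for t in orbit:
        # Rounding guards against floating-point noise at bin edges.
        p = round(dot(s, normalize(t)), 9)
        idx = min(int((p + 1.0) / 2.0 * n_bins), n_bins - 1)
        counts[idx] += 1
    return [c / len(orbit) for c in counts]


def signature(signal, templates, n_bins=8):
    """Concatenate the per-template histograms into one multicomponent representation."""
    sig = []
    for t in templates:
        sig.extend(orbit_histogram(signal, circular_shifts(t), n_bins))
    return sig


# Illustrative data only (not from the paper).
templates = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0], [1.0, -1.0, 0.0, 0.0, 0.0, 0.0]]
x = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0]
x_shifted = x[2:] + x[:2]  # a transformed version of the same signal

sig_x = signature(x, templates)
sig_shifted = signature(x_shifted, templates)
```

Because shifting the signal only permutes the set of projections onto a full orbit, `sig_x` and `sig_shifted` coincide in this toy setting. With orbits that are only partially observed, the same construction would give an approximately (quasi-) invariant signature.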

Files in this item

Name: CBMM-Memo-022.pdf
Size: 1.808Mb
Format: PDF

Name: license_rdf
Size: 1.346Kb
Format: application/rdf+xml

This item appears in the following Collection(s)

CBMM Memo Series