dc.contributor.author         Evangelopoulos, Georgios
dc.contributor.author         Voinea, Stephen
dc.contributor.author         Zhang, Chiyuan
dc.contributor.author         Rosasco, Lorenzo
dc.contributor.author         Poggio, Tomaso
dc.date.accessioned           2015-12-10T23:51:52Z
dc.date.available             2015-12-10T23:51:52Z
dc.date.issued                2014-06-15
dc.identifier.uri             http://hdl.handle.net/1721.1/100186
dc.description.abstract       Recognition of speech, and in particular the ability to generalize and learn from small sets of labelled examples like humans do, depends on an appropriate representation of the acoustic input. We formulate the problem of finding robust speech features for supervised learning with small sample complexity as a problem of learning representations of the signal that are maximally invariant to intraclass transformations and deformations. We propose an extension of a theory for unsupervised learning of invariant visual representations to the auditory domain and empirically evaluate its validity for voiced speech sound classification. Our version of the theory requires the memory-based, unsupervised storage of acoustic templates — such as specific phones or words — together with all the transformations of each that normally occur. A quasi-invariant representation for a speech segment can be obtained by projecting it to each template orbit, i.e., the set of transformed signals, and computing the associated one-dimensional empirical probability distributions. The computations can be performed by modules of filtering and pooling, and extended to hierarchical architectures. In this paper, we apply a single-layer, multicomponent representation for phonemes and demonstrate improved accuracy and decreased sample complexity for vowel classification compared to standard spectral, cepstral and perceptual features.   en_US
dc.description.sponsorship    This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.   en_US
dc.language.iso               en_US   en_US
dc.publisher                  Center for Brains, Minds and Machines (CBMM), arXiv   en_US
dc.relation.ispartofseries    CBMM Memo Series;022
dc.rights                     Attribution-NonCommercial 3.0 United States   *
dc.rights.uri                 http://creativecommons.org/licenses/by-nc/3.0/us/   *
dc.subject                    Speech Recognition   en_US
dc.subject                    Invariance   en_US
dc.subject                    Machine Learning   en_US
dc.subject                    Language   en_US
dc.title                      Learning An Invariant Speech Representation   en_US
dc.type                       Technical Report   en_US
dc.type                       Working Paper   en_US
dc.type                       Other   en_US
dc.identifier.citation        arXiv:1406.3884v1   en_US
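
The abstract describes a concrete single-layer computation: normalize a speech segment, project it onto every stored, transformed version of a template (the template's orbit), and pool the resulting projections into a one-dimensional empirical distribution. The following NumPy sketch illustrates that filtering-and-pooling step under stated assumptions; the function name, the unit-norm normalization, and histogram pooling with 32 bins are choices made for this example, not details confirmed by the memo.

    import numpy as np

    def invariant_signature(segment, template_orbits, n_bins=32):
        # segment: 1-D array, e.g. a windowed frame of a phone or word.
        # template_orbits: list of (n_transforms, dim) arrays, each holding
        # the stored transformed versions of one acoustic template (its orbit).
        segment = np.asarray(segment, dtype=float)
        segment = segment / (np.linalg.norm(segment) + 1e-12)
        parts = []
        for orbit in template_orbits:
            orbit = np.asarray(orbit, dtype=float)
            orbit = orbit / (np.linalg.norm(orbit, axis=1, keepdims=True) + 1e-12)
            # "Filtering": dot products of the segment with each orbit element.
            proj = orbit @ segment
            # "Pooling": summarize the projections by their 1-D empirical
            # distribution; unit vectors give projections in [-1, 1].
            hist, _ = np.histogram(proj, bins=n_bins, range=(-1.0, 1.0), density=True)
            parts.append(hist)
        # Concatenating per-template histograms gives a multicomponent signature.
        return np.concatenate(parts)

A toy call, with random stand-ins for real template orbits:

    rng = np.random.default_rng(0)
    orbits = [rng.standard_normal((50, 128)) for _ in range(10)]
    x = rng.standard_normal(128)
    feat = invariant_signature(x, orbits)  # shape: (10 * 32,)

The histogram is one choice of pooling; moments or a max over the projections would fit the same filtering-and-pooling scheme.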

