Show simple item record

dc.contributor.author	Poggio, Tomaso
dc.contributor.author	Cooper, Yaim
dc.date.accessioned	2020-07-01T19:43:03Z
dc.date.available	2020-07-01T19:43:03Z
dc.date.issued	2020-07-01
dc.identifier.uri	https://hdl.handle.net/1721.1/126041
dc.description.abstract	Consider a loss function $L = \sum_{i=1}^{n} l_i^2$ with $l_i = f(x_i) - y_i$, where $f(x)$ is a deep feedforward network with $R$ layers, no bias terms, and scalar output. Assume the network is overparametrized, that is, $d \gg n$, where $d$ is the number of parameters and $n$ is the number of data points. The networks are assumed to interpolate the training data (i.e., the minimum of $L$ is zero). If GD converges, it converges to a critical point of $L$, namely a solution of $\sum_{i=1}^{n} l_i \nabla l_i = 0$. There are two kinds of critical points: those for which each term of the above sum vanishes individually, and those for which the sum only vanishes when all the terms are added together. The main claim in this note is that while GD can converge to both types of critical points, SGD can only converge to the first kind, which includes all global minima.	en_US
dc.description.sponsorship	This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.	en_US
dc.publisher	Center for Brains, Minds and Machines (CBMM)	en_US
dc.relation.ispartofseries	CBMM Memo;107
dc.title	Loss landscape: SGD can have a better view than GD	en_US
dc.type	Technical Report	en_US
dc.type	Working Paper	en_US
dc.type	Other	en_US
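The abstract's distinction between the two kinds of critical points can be made concrete with a minimal numerical sketch. This is an assumed toy example, not taken from the memo: a one-parameter model f(x; w) = w·x and two training points chosen so that w = 0 is a critical point of the second kind, where the summed gradient vanishes (so GD is stationary) while the individual per-sample terms l_i ∇l_i do not (so a single-sample SGD step moves away).

```python
import numpy as np

# Assumed toy example (not from the memo): one-parameter model f(x; w) = w * x
# with two training points chosen so that w = 0 is a "second kind" critical
# point: the sum of the terms l_i * dl_i/dw vanishes, but the terms themselves
# do not.
x = np.array([1.0, 1.0])
y = np.array([1.0, -1.0])

def terms(w):
    """Per-sample terms l_i * dl_i/dw; the gradient of L = sum_i l_i^2 is twice their sum."""
    l = w * x - y          # residuals l_i = f(x_i) - y_i
    return l * x

w, lr = 0.0, 0.1
print(terms(w))            # [-1.  1.]  -> individual terms are nonzero
print(terms(w).sum())      # 0.0        -> the full (GD) gradient vanishes

# GD is stationary at w = 0, but a single-sample SGD step is not:
w_gd  = w - lr * 2 * terms(w).sum()   # stays at 0.0
w_sgd = w - lr * 2 * terms(w)[0]      # gradient of l_0^2 alone pushes w to 0.2
print(w_gd, w_sgd)
```

This toy model is neither overparametrized nor interpolating, so it only illustrates the stationarity argument behind the claim: SGD cannot remain at a critical point where the individual terms l_i ∇l_i are nonzero, whereas GD can.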

