Data Mining Courseware: DL-Tutorial.pptx
Deep Learning Tutorial
(Thanks to Hung-yi Lee)

Deep learning attracts lots of attention (see the rising curve on Google Trends, 2007-2015) and obtains many exciting results. Of the talks this afternoon, this one focuses on the technical part.

Outline:
- Part I: Introduction of Deep Learning
- Part II: Why Deep?
- Part III: Tips for Training Deep Neural Network
- Part IV: Neural Network with Memory

Part I: Introduction of Deep Learning
(What people already knew in the 1980s)

Example application: handwriting digit recognition. The machine takes an image and outputs a label such as "2". The input is a 16 x 16 = 256-dimensional vector $x_1, x_2, \dots, x_{256}$, with 1 for ink and 0 for no ink. The output is a 10-dimensional vector $y_1, y_2, \dots, y_{10}$, where each dimension represents the confidence of a digit.
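A minimal sketch of this input/output encoding in Python (NumPy assumed; the image data and the network output here are random stand-ins):

```python
import numpy as np

# Stand-in for a 16 x 16 handwritten-digit image (True where there is ink).
image = np.random.rand(16, 16) > 0.5

# Flatten to the 256-dimensional input vector: ink -> 1, no ink -> 0.
x = image.astype(float).reshape(256)

# A trained network would map x to 10 confidences; the prediction is the
# digit whose dimension has the maximum value.
y = np.random.rand(10)                 # stand-in for the network output
predicted = int(np.argmax(y))
```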
For example, outputs $y_1 = 0.1$ ("is 1"), $y_2 = 0.7$ ("is 2"), ..., $y_{10} = 0.2$ ("is 0") mean the image is a "2", since $y_2$ is the largest.

Element of a neural network - the neuron. Given inputs $a_1, \dots, a_K$, weights $w_1, \dots, w_K$, and a bias $b$, a neuron computes

$$z = a_1 w_1 + a_2 w_2 + \dots + a_K w_K + b$$

and outputs $a = \sigma(z)$, where $\sigma$ is the activation function. A neural network arranges such neurons into an input layer $x_1, \dots, x_N$, hidden layers (Layer 1 through Layer L), and an output layer $y_1, \dots, y_M$; "deep" means many hidden layers. A common activation function is the sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Example of a neural network: for the input $(1, -1)$, a first-layer neuron with weights $(1, -2)$ and bias $1$ gets $z = 4$ and outputs $\sigma(4) \approx 0.98$, while the neuron with weights $(-1, 1)$ and bias $0$ gets $z = -2$ and outputs $\sigma(-2) \approx 0.12$. Passing these activations through two more layers gives the network outputs $(0.62, 0.83)$; the same network maps the input $(0, 0)$ to $(0.51, 0.85)$. Different parameters define different functions.
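A minimal sketch of the first layer of this example (NumPy assumed; the weights, biases, and input are the numbers from the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, -1.0])          # example input

# One row of W and one entry of b per neuron.
W = np.array([[ 1.0, -2.0],
              [-1.0,  1.0]])
b = np.array([1.0, 0.0])

z = W @ x + b                      # [ 4., -2.]
a = sigmoid(z)                     # [~0.98, ~0.12]
```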
Matrix operation: a whole layer is one matrix-vector product plus an elementwise activation. With weight matrices $W^1, \dots, W^L$ and biases $b^1, \dots, b^L$,

$$a^1 = \sigma(W^1 x + b^1), \quad a^2 = \sigma(W^2 a^1 + b^2), \quad \dots$$

so the network computes

$$y = f(x) = \sigma(W^L \cdots \sigma(W^2\, \sigma(W^1 x + b^1) + b^2) \cdots + b^L)$$

Writing the computation this way lets us use parallel computing techniques to speed up the matrix operations.
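A minimal sketch of this layer-by-layer forward pass (NumPy assumed; the layer sizes and random parameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """y = f(x): apply a = sigma(W a + b) layer by layer."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# Illustrative 256 -> 64 -> 64 -> 10 network with random parameters.
rng = np.random.default_rng(0)
sizes = [256, 64, 64, 10]
weights = [0.1 * rng.standard_normal((m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

y = forward(rng.random(256), weights, biases)   # 10 output values
```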
Softmax: use a softmax layer as the output layer. With an ordinary layer, $y_i = \sigma(z_i)$, so in general the output of the network can be any value, which may not be easy to interpret. A softmax layer instead computes

$$y_i = \frac{e^{z_i}}{\sum_{j=1}^{3} e^{z_j}}$$

For example, $z = (3, 1, -3)$ gives $e^z \approx (20, 2.7, 0.05)$ and $y \approx (0.88, 0.12, 0)$; the outputs are positive and sum to 1.
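A minimal sketch of softmax with the slide's numbers (NumPy assumed; subtracting the max before exponentiating is a standard numerical-stability trick, not something the slide mentions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

z = np.array([3.0, 1.0, -3.0])
print(softmax(z))               # ~[0.88, 0.12, 0.002]
```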
How to set the network parameters? For the digit recognizer (input $x_1, \dots, x_{256}$ with ink = 1 and no ink = 0, softmax output $y_1, \dots, y_{10}$), we want $y_1$ to have the maximum value when the input is a "1", $y_2$ to have the maximum value when the input is a "2", and so on. How do we let the neural network achieve this?

Training data: prepare training data as images together with their labels (e.g., images labeled "5", "0", "4", "1", "3", "1", "2", "9"), and use the training data to find the network parameters.

Cost: for a training image of "1" the target is $(1, 0, 0, \dots)$, and an output such as $(0.2, 0.3, 0.5, \dots)$ incurs a cost $C$ measuring how far the network output is from the target. The cost can be the Euclidean distance or the cross entropy of the network output and the target.

Total cost: run all $R$ training examples $x^1, \dots, x^R$ through the network to get $y^1, \dots, y^R$, and sum the costs over all training data:

$$C = \sum_{r=1}^{R} C^r$$
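A minimal sketch of the two costs mentioned (NumPy assumed; using the squared Euclidean distance is one common convention, and the one-hot target follows the slide's example):

```python
import numpy as np

def euclidean_cost(y, target):
    return np.sum((y - target) ** 2)

def cross_entropy_cost(y, target, eps=1e-12):
    # target is one-hot; eps guards against log(0)
    return -np.sum(target * np.log(y + eps))

y = np.array([0.2, 0.3, 0.5])        # network output from the slide
target = np.array([1.0, 0.0, 0.0])   # the label is "1"
print(euclidean_cost(y, target), cross_entropy_cost(y, target))
```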
Gradient descent: assume there are only two parameters $w_1$ and $w_2$ in the network. The error surface plots $C$ over this parameter space (the colors represent the value of $C$). Starting from a random point and repeatedly moving against the gradient, we would eventually reach a minimum.

Local minima: gradient descent never guarantees the global minimum; different starting points reach different minima, so we get different results. ("Who is Afraid of Non-Convex Loss Functions?") Along the cost curve over the parameter space, learning can be very slow at a plateau, stuck at a saddle point, or stuck at a local minimum.

Momentum: in the physical world, momentum carries a rolling ball through such spots. How about putting this phenomenon into gradient descent? With momentum, the real movement is

    movement = negative of gradient + momentum

so even where the gradient is 0, the accumulated momentum keeps the parameters moving. This still does not guarantee reaching the global minimum, but it gives some hope.
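A minimal sketch of a momentum update (the specific update rule and the coefficients lr and beta are a common formulation, not given on the slide):

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    """Movement = negative of gradient + momentum (decayed past movement)."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

# Toy 2-parameter error surface C(w) = w1^2 + 10 * w2^2.
w, v = np.array([3.0, 1.0]), np.zeros(2)
for _ in range(100):
    grad = np.array([2.0 * w[0], 20.0 * w[1]])
    w, v = momentum_step(w, grad, v)
print(w)    # approaches the minimum at (0, 0)
```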
Mini-batch: randomly partition the training examples into mini-batches (say $x^1, x^{31}, \dots$ in the 1st batch and $x^2, x^{16}, \dots$ in the 2nd). Pick the 1st batch and update the parameters using only the cost on that batch; pick the 2nd batch and update again; continue until all mini-batches have been picked - that is one epoch. Then repeat the above process. A sketch of one epoch appears below.

Note that $C$ is different each time we update the parameters, since each update sees a different batch. Compared with original gradient descent (where the colors of the error surface represent the total $C$ on all training data), the mini-batch trajectory looks unstable - but it is faster and better!
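A minimal sketch of one epoch of mini-batch training (the batch size and learning rate are illustrative, and `compute_gradient` is a hypothetical helper standing in for backpropagation):

```python
import numpy as np

def run_epoch(params, X, T, compute_gradient, batch_size=16, lr=0.1):
    """One epoch: one parameter update per mini-batch."""
    order = np.random.permutation(len(X))        # shuffle the examples
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]    # pick the next mini-batch
        grad = compute_gradient(params, X[idx], T[idx])
        params = params - lr * grad              # update on this batch only
    return params
```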
Backpropagation: a network can have millions of parameters. Backpropagation is the way to compute the gradients efficiently (not covered today); ref: http://speech.ee.ntu.edu.tw/tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html. Many toolkits can compute the gradients automatically; ref: http://speech.ee.ntu.edu.tw/tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
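Whether the gradients come from hand-written backpropagation or a toolkit, finite differences give a cheap correctness check; a minimal sketch (NumPy assumed; the toy cost is chosen for illustration):

```python
import numpy as np

def numerical_gradient(C, w, eps=1e-6):
    """Approximate dC/dw_i by central differences, one parameter at a time."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        d = np.zeros_like(w)
        d[i] = eps
        grad[i] = (C(w + d) - C(w - d)) / (2 * eps)
    return grad

C = lambda w: np.sum(w ** 2)      # toy cost whose true gradient is 2w
w = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(C, w))   # ~[2., -4., 6.]
```

This needs two cost evaluations per parameter, which is exactly why an efficient method like backpropagation matters for networks with millions of parameters.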
Part II: Why Deep?

Deeper is better?

    Layer X Size    Word Error Rate (%)      Layer X Size    Word Error Rate (%)
    1 X 2k          24.2
    2 X 2k          20.4
    3 X 2k          18.4
    4 X 2k          17.8
    5 X 2k          17.2                     1 X 3772        22.5
    7 X 2k          17.1                     1 X 4634        22.6
                                             1 X 16k         22.1

(Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech, 2011.)

Not surprising - more parameters, better performance.
Universality theorem: any continuous function

$$f : \mathbb{R}^N \to \mathbb{R}^M$$

can be realized by a network with one hidden layer, given enough hidden neurons. So why a "deep" neural network rather than a "fat" one?

Fat + short v.s. thin + tall: take a shallow network and a deep network with the same number of parameters - which one is better? The table above (Seide, Li, and Yu, Interspeech 2011) answers it: the thin-and-tall networks win, e.g., 5 X 2k reaches 17.2% word error rate against 22.5% for 1 X 3772, and 7 X 2k reaches 17.1% against 22.6% for 1 X 4634 and 22.1% for 1 X 16k.
Why deep? Deep means modularization. Consider classifying images of people into four classes: girls with long hair, boys with long hair, girls with short hair, and boys with short hair. Training Classifiers 1-4 directly runs into trouble: most classes have plenty of examples, but there are only a few examples of boys with long hair, so that classifier is weak.

Modularization: first train basic classifiers for the attributes - "long or short hair?" and "boy or girl?". Since the same data are split along one attribute at a time, each basic classifier has sufficient training examples. These attribute classifiers are then shared by the following classifiers as modules: each of Classifiers 1-4 just combines the two attribute outputs, so even the class with little data (boys with long hair) can be trained fine.

In a deep network, the first layer $x_1, \dots, x_N$ learns the most basic classifiers; the second layer uses the first layer as modules to build more complex classifiers; and so on. The modularization is automatically learned from data. Less training data needed? Indeed, deep learning also works on small data sets like TIMIT.
Compare with SVM: an SVM applies a hand-crafted kernel function and then a simple classifier (source of image: http://www.gipsa-lab.grenoble-inp.fr/transfert/seminaire/455_Kadri2013Gipsa-lab.pdf). In deep learning, the hidden layers act as a learnable kernel, and the output layer is the simple classifier.

Historically it was hard to get the power of deep: before 2006, deeper usually did not imply better.

Part III: Tips for Training DNN

Recipe for learning (http:/.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/): don't forget to check performance on the training data first, and improve it with a better optimization strategy or by modifying the network. Only afterwards deal with overfitting, i.e., good training results but poor test results, by techniques for preventing overfitting.

Concretely:
- Modify the network: new activation functions, for example ReLU or Maxout.
- Better optimization strategy: adaptive learning rates.
- Prevent overfitting: dropout. Only use this approach when you have already obtained good results on the training data.

New activation function: ReLU

The Rectified Linear Unit (ReLU), $a = \max(0, z)$, is attractive for several reasons (a sketch follows the list):
1. It is fast to compute.
2. There is a biological reason for it.
3. It behaves like an infinite number of sigmoids with different biases.
4. It helps with the vanishing gradient problem.
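A minimal sketch of ReLU next to the sigmoid it replaces (NumPy assumed):

```python
import numpy as np

def relu(z):
    # a = z for z > 0, a = 0 otherwise: just a comparison, cheap to compute
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 4.0])
print(relu(z))      # [0., 0., 4.]
print(sigmoid(z))   # [~0.12, 0.5, ~0.98]
```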
(Xavier Glorot et al., AISTATS'11; Andrew L. Maas et al., ICML'13; Kaiming He et al., arXiv'15.)

Vanishing gradient problem: in a deep sigmoid network, the layers near the input $x_1, \dots, x_N$ get smaller gradients and learn very slowly, so they stay almost random, while the layers near the output $y_1, \dots, y_M$ get larger gradients, learn very fast, and quickly converge - based on nearly random features!? In 2006, people worked around this with RBM pre-training; in 2015, people use ReLU.

An intuitive way to compute the gradient makes the problem clear: because each sigmoid squashes, a large change of its input produces a small change of its output, so the effect shrinks layer by layer and the gradients near the input are smaller. With ReLU, neurons in the zero region output 0 and can be dropped from the network, leaving a thinner linear network over the active neurons - and a linear network does not have smaller gradients at the early layers.

Maxout: a learnable activation function (Ian J. Goodfellow et al., ICML'13). The inputs $x_1, x_2, \dots$ feed several linear units per group, and each maxout unit outputs the max within its group; ReLU is a special case of Maxout, as the sketch below shows.
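A minimal sketch of a maxout layer (NumPy assumed; group size 2 and the particular weights are illustrative). Pairing each linear unit with a constant-zero unit makes the group max equal $\max(z, 0)$, which is exactly the ReLU special case:

```python
import numpy as np

def maxout(x, W, b, group_size=2):
    """Each output is the max over a group of linear units z = W x + b."""
    z = W @ x + b
    return z.reshape(-1, group_size).max(axis=1)

x = np.array([1.0, -1.0])
# Two maxout units, each the max of one real linear unit and one zero unit.
W = np.array([[ 1.0, -2.0],
              [ 0.0,  0.0],    # constant 0 -> max(z1, 0) == ReLU(z1)
              [-1.0,  1.0],
              [ 0.0,  0.0]])   # constant 0 -> max(z2, 0) == ReLU(z2)
b = np.array([1.0, 0.0, 0.0, 0.0])
print(maxout(x, W, b))          # [4., 0.] == ReLU of z = (4, -2)
```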