6.S094: Deep Learning for Self-Driving Cars
Learning to Move: Deep Reinforcement Learning for Motion Planning
Lex Fridman | fridman@mit.edu | Website: cars.mit.edu | January 2017

Administrative
- Website: cars.mit.edu
- Contact email: deepcars@mit.edu
- Required: create an account on the website; follow the tutorial for each of the 2 projects.
- Recommended: ask questions; win the competition!

Schedule

DeepTraffic: Solving Traffic with Deep Reinforcement Learning

Types of machine learning:
- Supervised learning
- Unsupervised learning
- Semi-supervised learning
- Reinforcement learning
Standard supervised learning pipeline. (References: 81)

Perceptron: Weighing the Evidence
- Evidence in, decisions out. (References: 78)
Perceptron: Implement a NAND Gate
- Universality: NAND gates are functionally complete, meaning we can build any logical function out of them. (References: 79)
- With weights (-2, -2) and bias 3, a perceptron computes NAND (output 1 when the weighted sum is positive; see the sketch after these slides):

  x1 x2 | weighted sum             | output
  0  0  | (-2)*0 + (-2)*0 + 3 =  3 | 1
  0  1  | (-2)*0 + (-2)*1 + 3 =  1 | 1
  1  0  | (-2)*1 + (-2)*0 + 3 =  1 | 1
  1  1  | (-2)*1 + (-2)*1 + 3 = -1 | 0

Perceptron NAND Gate
- Both circuits can represent arbitrary logical functions.
- But "perceptron circuits" can learn. (References: 80)
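A minimal sketch of the table above, assuming a step activation (the function and variable names are mine):

```python
def perceptron_nand(x1, x2):
    """Perceptron with weights (-2, -2) and bias 3 computes NAND."""
    weighted_sum = -2 * x1 - 2 * x2 + 3
    return 1 if weighted_sum > 0 else 0  # step activation

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, '->', perceptron_nand(x1, x2))  # prints 1, 1, 1, 0
```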
The Process of Learning: Small Change in Weights → Small Change in Output
- This requires "smoothness": replace the perceptron's step activation with a smooth activation function (a neuron).
- Smoothness of the activation function means that a small change in the output is approximately a linear function of the small changes in the weights and bias.
- Learning is the process of gradually adjusting the weights to achieve any gradual change in the output. (References: 80)
Combining Neurons into Layers
- Feed-forward neural network
- Recurrent neural network: has state memory, but is hard to train

Task: Classify an Image of a Number
- Input: 28x28 grayscale image (784 values)
- Network: maps the 784 pixel inputs to 10 outputs, one per digit (References: 80)

Task: Classify an Image of a Number
- Ground truth for "6": a target vector with a 1 in the position for "6" and 0 elsewhere
- "Loss" function: measures how far the network's output is from the ground truth (References: 63, 80)
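A quadratic (mean-squared-error) cost is one standard choice for this loss (my reconstruction, not necessarily the slide's exact formula); here y(x) is the one-hot ground truth and a is the network's output for input x, over n training examples:

```latex
% MSE ("quadratic") cost over the training set
C(w,b) = \frac{1}{2n} \sum_x \lVert y(x) - a \rVert^2
```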
Philosophical Motivation for Reinforcement Learning
- Takeaway from supervised learning: neural networks are great at memorization and not (yet) great at reasoning.
- Hope for reinforcement learning: brute-force propagation of outcomes to knowledge about states and actions. This is a kind of brute-force "reasoning".

Agent and Environment
- At each step the agent: executes an action, receives an observation (new state), receives a reward.
- The environment: receives the action, emits an observation (new state), emits a reward. (References: 80)

Reinforcement Learning
Reinforcement learning is a general-purpose framework for decision-making:
- An agent operates in an environment (e.g., Atari Breakout)
- An agent has the capacity to act
- Each action influences the agent's future state
- Success is measured by a reward signal
- The goal is to select actions that maximize future reward (References: 85)
Markov Decision Process
- An episode is a sequence of states, actions, and rewards ending in a terminal state:
  s_0, a_0, r_1, s_1, a_1, r_2, ..., s_{n-1}, a_{n-1}, r_n, s_n (terminal state)
  (References: 84)

Major Components of an RL Agent
An RL agent may include one or more of these components:
- Policy: the agent's behavior function
- Value function: how good each state and/or action is
- Model: the agent's representation of the environment

Robot in a Room
- A grid world with a START state; actions: UP, DOWN, LEFT, RIGHT
- Actions are stochastic: choosing UP moves UP 80% of the time, LEFT 10%, RIGHT 10%
- Reward: +1 at cell (4,3), -1 at cell (4,2), and -0.04 for each step
- What's the strategy to achieve max reward?
- What if the actions were deterministic?
Is this a solution?
- Only if actions are deterministic; not in this case (actions are stochastic)
- A solution/policy is a mapping from each state to an action

Optimal policy
- The optimal policy maps each state to the action with the highest expected return; the sketch below computes it, and changing the step reward reproduces the variations on the next slides.
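A minimal value-iteration sketch for this world, assuming the classic 4x3 layout (wall at (2,2), γ = 1, blocked moves leave the robot in place); all names here are mine:

```python
# Value iteration for the 4x3 "robot in a room" gridworld.
STEP_REWARD = -0.04
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
WALL = (2, 2)
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != WALL]

MOVES = {'UP': (0, 1), 'DOWN': (0, -1), 'LEFT': (-1, 0), 'RIGHT': (1, 0)}
PERP = {'UP': ('LEFT', 'RIGHT'), 'DOWN': ('LEFT', 'RIGHT'),
        'LEFT': ('UP', 'DOWN'), 'RIGHT': ('UP', 'DOWN')}

def move(s, a):
    """Deterministic move; stay in place if blocked by a wall or the edge."""
    c, r = s
    dc, dr = MOVES[a]
    t = (c + dc, r + dr)
    return t if t in STATES else s

def transitions(s, a):
    """Stochastic dynamics: 80% intended direction, 10% each perpendicular."""
    left, right = PERP[a]
    return [(0.8, move(s, a)), (0.1, move(s, left)), (0.1, move(s, right))]

V = {s: 0.0 for s in STATES}
for _ in range(100):  # sweeps until (near) convergence
    for s in STATES:
        if s in TERMINALS:
            V[s] = TERMINALS[s]
        else:
            V[s] = STEP_REWARD + max(
                sum(p * V[t] for p, t in transitions(s, a)) for a in MOVES)

policy = {s: max(MOVES, key=lambda a: sum(p * V[t] for p, t in transitions(s, a)))
          for s in STATES if s not in TERMINALS}
print(policy)
```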
Reward for each step: -2
Reward for each step: -0.1
Reward for each step: -0.04
Reward for each step: -0.01
Reward for each step: +0.01
(Each step reward produces a different optimal policy: the more costly each step, the more risk the agent accepts to reach +1 quickly; with a positive step reward it avoids the terminal states and collects step rewards forever.)

Value Function
- Future reward: R = r_1 + r_2 + ... + r_n; from time step t: R_t = r_t + r_{t+1} + ... + r_n
- Discounted future reward (the environment is stochastic):
  R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... = r_t + γ (r_{t+1} + γ (r_{t+2} + ...)) = r_t + γ R_{t+1}
- A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward. (References: 84)
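The recursion R_t = r_t + γ R_{t+1} in code (a tiny sketch; the function name is mine):

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute R_t = r_t + gamma * R_(t+1) for every t, right to left."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

print(discounted_returns([0, 0, 1]))  # [0.81, 0.9, 1.0]: early steps worth less
```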
Q-Learning
- State value function V^π(s): expected return when starting in s and following policy π
- State-action value function Q^π(s,a): expected return when starting in s, performing a, and following π
- Useful for finding the optimal policy
- Can be estimated from experience (Monte Carlo)
- Pick the best action using Q(s,a)
Q-learning is off-policy:
- Use any policy to estimate Q that maximizes future reward
- Q directly approximates Q* (Bellman optimality equation)
- Independent of the policy being followed
- Only requirement: keep updating each (s,a) pair

Q-Learning
- Update rule: Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)]
- Here s is the old state, s' the new state, r the reward, α the learning rate, and γ the discount factor.
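The update rule as code, a minimal tabular sketch (the state/action encodings are placeholders):

```python
from collections import defaultdict

Q = defaultdict(float)  # Q-table: (state, action) -> value, default 0
ACTIONS = ['UP', 'DOWN', 'LEFT', 'RIGHT']

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```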
Exploration vs Exploitation
- A key ingredient of reinforcement learning
- A deterministic/greedy policy won't explore all actions: we don't know anything about the environment at the beginning and need to try all actions to find the optimal one
- Maintain exploration by using soft policies instead: π(s,a) > 0 for all (s,a)
- ε-greedy policy: with probability 1-ε perform the optimal/greedy action; with probability ε perform a random action
- This keeps exploring the environment; slowly move it towards a greedy policy: ε → 0
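ε-greedy action selection over the Q-table above (a minimal sketch):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Soft policy: every action keeps a nonzero chance of being tried."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit (greedy)
```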
Q-Learning: Value Iteration
- Repeatedly sweep over the (s,a) pairs, applying the update rule above (old state s, new state s', reward r, learning rate α, discount factor γ) until the Q-table converges. (References: 84)

Q-Learning: Representation Matters
- In practice, value iteration over a table is impractical: it handles only very limited states/actions and cannot generalize to unobserved states
- Think about the Breakout game. State: screen pixels; image size 84x84 (resized); 4 consecutive images; grayscale with 256 gray levels
- That is 256^(84x84x4) ≈ 10^67970 rows in the Q-table! (References: 83, 84)
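A quick check of that arithmetic:

```python
from math import log10

pixels = 84 * 84 * 4              # four stacked 84x84 grayscale frames
digits = pixels * log10(256)      # log10 of 256**pixels possible states
print(pixels, round(digits))      # 28224 pixels -> a ~67970-digit state count
```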
30、uJanuary2017Course 6.S094:Deep Learning for Self-Driving CarsPhilosophical Motivation for Deep Reinforcement LearningTakeaway from Supervised Learning:Neural networks are great at memorization and not(yet)great at reasoning.Hope for Reinforcement Learning:Brute-force propagation of outcomes to knowl
31、edge about states and actions.This is a kind of brute-force“reasoning”.Hope for Deep Learning+Reinforcement Learning:General purpose artificial intelligence through efficient generalizable learning of the optimal thing to do given a formalized set of actions and states(possibly huge).Lex Fridman:fri
32、dmanmit.eduWebsite:cars.mit.eduJanuary2017Course 6.S094:Deep Learning for Self-Driving CarsDeep Q-LearningUse a function(with parameters)to approximate the Q-function Linear Non-linear:Q-NetworkReferences:83Lex Fridman:fridmanmit.eduWebsite:cars.mit.eduJanuary2017Course 6.S094:Deep Learning for Self
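A minimal Q-network sketch, written here in PyTorch (an assumption for illustration; the in-browser DeepTraffic project itself runs on JavaScript):

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),   # hidden layer size is arbitrary here
            nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        return self.net(state)          # shape: (batch, num_actions)
```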
Deep Q-Network: Atari
- Mnih et al., "Playing Atari with deep reinforcement learning," 2013. (References: 83)

Deep Q-Network Training
- Bellman equation: Q(s,a) = r + γ max_a' Q(s',a')
- Loss function (squared error): L = ½ [r + γ max_a' Q(s',a') - Q(s,a)]² (References: 83)

Deep Q-Network Training
Given a transition <s, a, r, s'>, the Q-table update rule in the previous algorithm must be replaced with the following (a code sketch follows this list):
1. Do a feedforward pass for the current state s to get predicted Q-values for all actions.
2. Do a feedforward pass for the next state s' and calculate the maximum over all network outputs, max_a' Q(s',a').
3. Set the Q-value target for action a to r + γ max_a' Q(s',a') (use the max calculated in step 2). For all other actions, set the Q-value target to the same values originally returned from step 1, making the error 0 for those outputs.
4. Update the weights using backpropagation. (References: 83)
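One training step following steps 1-4, again sketched in PyTorch; `q_net` and `optimizer` are assumed to exist, `a` is a tensor of action indices, and terminal-state handling is omitted for brevity:

```python
import torch

def dqn_train_step(q_net, optimizer, s, a, r, s_next, gamma=0.99):
    q_pred = q_net(s)                                    # step 1
    with torch.no_grad():
        max_next = q_net(s_next).max(dim=1).values       # step 2
        target = q_pred.detach().clone()                 # step 3: copy predictions,
        target[torch.arange(len(a)), a] = r + gamma * max_next  # override taken action
    loss = ((q_pred - target) ** 2).mean()  # error is 0 for the other actions
    optimizer.zero_grad()
    loss.backward()                                      # step 4: backpropagation
    optimizer.step()
    return loss.item()
```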
Exploration vs Exploitation (revisited)
- The same ε-greedy recipe applies when acting with a Q-network: soft policy π(s,a) > 0, random action with probability ε, greedy action otherwise, annealing ε → 0.

Atari Breakout
- A few tricks are needed, most importantly: experience replay. (References: 83)
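A minimal experience replay sketch: store transitions as they happen, then train on random minibatches to break the correlation between consecutive samples (class and method names are mine):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)
```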
Deep Q-Learning Algorithm
- The full algorithm (Mnih et al., 2013) wires together the pieces above: an ε-greedy policy over the Q-network's outputs, an experience replay buffer, and the network update step.
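Sketched as a training loop; `env` (a Gym-style environment), `num_actions`, and the batching helper `to_tensors` are hypothetical glue, not part of the lecture:

```python
import random
import torch

buffer, epsilon = ReplayBuffer(), 1.0
for episode in range(500):
    s, done = env.reset(), False
    while not done:
        if random.random() < epsilon:                    # explore
            a = random.randrange(num_actions)
        else:                                            # exploit
            with torch.no_grad():
                state = torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)
                a = q_net(state).argmax().item()
        s_next, r, done = env.step(a)                    # simplified API
        buffer.push(s, a, r, s_next, done)               # store for replay
        if len(buffer.buffer) >= 32:
            s_b, a_b, r_b, s2_b, _ = to_tensors(buffer.sample(32))
            dqn_train_step(q_net, optimizer, s_b, a_b, r_b, s2_b)
        s = s_next
    epsilon = max(0.1, epsilon * 0.995)                  # anneal toward greedy
```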
Atari Breakout
- Gameplay after 10, 120, and 240 minutes of training. (References: 85)

DQN Results in Atari (References: 83)

Gorila (General Reinforcement Learning Architecture)
- 10x faster than Nature DQN on 38 out of 49 Atari games
- Applied to recommender systems within Google
- Nair et al., "Massively parallel methods for deep reinforcement learning," 2015.
The Game of Traffic
- Open question (again): Is driving closer to chess or to everyday conversation?

DeepTraffic: Solving Traffic with Deep Reinforcement Learning
- Goal: achieve the highest average speed over a long period of time.
- Requirement for students: follow the tutorial to achieve a speed of 65 mph.

The Road, The Car, The Speed
- State representation
Simulation Speed

Display Options

Safety System

Driving/Learning

Learning Input
Evaluation
- Scoring: average speed
- Method: collect the average speed over ten runs of about 30 (simulated) minutes of game each; the result is the median speed of the 10 runs
- Done server-side after you submit (no cheating possible! we also look at the code)
- You can try it locally to get an estimate: it uses exactly the same evaluation procedure/code, but with some influence of randomness; our number is what counts in the end!

Evaluation (Locally)
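The scoring protocol as a sketch; `run_simulation` is a placeholder for one ~30-simulated-minute run returning its average speed:

```python
from statistics import median

def evaluate(run_simulation, num_runs=10):
    """Score = median of the average speeds across num_runs runs."""
    return median(run_simulation() for _ in range(num_runs))
```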
Coding / Changing the Net Layout
- Watch out: changing the layout kills the trained state!

Training
- Done on a separate thread (Web Workers, yay!), with its own simulation, resets, state, etc.
- A lot faster (1000+ fps)
- The net state gets shipped to the main simulation from time to time, so you get to see the improvements/learning live

Loading/Saving