Optimization Overview in Machine Learning
Optimization
A branch of mathematics that encompasses many diverse areas of minimization and optimization. Optimization theory is the more modern term for operations research. It includes the calculus of variations, control theory, convex optimization theory, decision theory, game theory, linear programming, Markov chains, network analysis, queueing systems, etc.
Optimization in Deep Learning
Summary
TBD.
Common Methods
In learning problems, the objective function (loss function) and the transformations (activation functions) are usually non-linear. Thus, closed-form solutions are generally not available in practice. In addition, it is not easy to analyze the monotonicity of the loss function.
Instead, iterative search methods are more practical for solving learning problems. Below we summarize some widely adopted optimization methods; a minimal sketch of two of them (plain SGD and Adam) follows the list.
- Stochastic Gradient Descent (SGD)
- Adagrad
- Adadelta
- RMSprop
- Momentum
- Adam
- Adamax
- Nesterov
- Nadam
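The sketch below is a minimal NumPy illustration, not a reference implementation: it applies two of the update rules named above, plain gradient descent (SGD without sampling noise) and Adam, to a toy quadratic loss. The loss, hyperparameters, and variable names are all illustrative.

```python
# Minimal sketch (NumPy only) of two update rules from the list above:
# plain gradient descent / SGD and Adam, on a toy quadratic loss
# f(w) = ||w - w_star||^2.  Hyperparameters and names are illustrative.
import numpy as np

w_star = np.array([3.0, -2.0])                # known minimizer of the toy loss

def grad(w):
    """Gradient of f(w) = ||w - w_star||^2 (here the 'full batch' gradient)."""
    return 2.0 * (w - w_star)

# --- plain gradient descent (SGD without sampling noise) ---
w_sgd = np.zeros(2)
lr = 0.1
for _ in range(100):
    w_sgd -= lr * grad(w_sgd)                 # w <- w - lr * g

# --- Adam: adaptive moment estimation ---
w_adam = np.zeros(2)
m = np.zeros(2)                               # first-moment estimate (mean of g)
v = np.zeros(2)                               # second-moment estimate (mean of g^2)
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 101):
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g           # biased first moment
    v = beta2 * v + (1 - beta2) * g * g       # biased second moment
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w_adam -= lr * m_hat / (np.sqrt(v_hat) + eps)

print("target:", w_star, "SGD:", w_sgd, "Adam:", w_adam)
```

The other methods in the list (Adagrad, Adadelta, RMSprop, Momentum, Adamax, Nesterov, Nadam) differ mainly in how the per-parameter step size and the momentum term are accumulated.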
More Discussion on SGD
- Shuffling and Curriculum Learning
- Batch Normalization
- Early stopping
- Gradient Noise (a combined sketch of early stopping and gradient noise follows this list)
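The sketch below is a minimal NumPy illustration with hypothetical synthetic data, noise schedule, and patience value. It combines two of the tricks above, annealed Gaussian gradient noise added to each SGD step and early stopping on a held-out validation set, with per-epoch shuffling as well.

```python
# Sketch of early stopping and gradient noise around an SGD loop (NumPy only).
# The synthetic data, noise schedule, and patience value are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=200)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

def grad(w, X, y):
    return 2.0 * X.T @ (X @ w - y) / len(y)

w = np.zeros(5)
lr, patience = 0.05, 10
best_val, best_w, bad_epochs = np.inf, w.copy(), 0

for epoch in range(500):
    order = rng.permutation(len(y_train))     # shuffle every epoch
    for i in order:
        g = grad(w, X_train[i:i + 1], y_train[i:i + 1])
        # gradient noise: annealed Gaussian noise added to each gradient
        g += rng.normal(scale=0.3 / (1 + epoch) ** 0.55, size=g.shape)
        w -= lr * g

    # early stopping: track the best validation loss, stop after `patience`
    val = mse(w, X_val, y_val)
    if val < best_val:
        best_val, best_w, bad_epochs = val, w.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break

print("stopped after epoch", epoch, "best validation MSE:", best_val)
# best_w holds the parameters selected by early stopping
```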
Optimization Method Selection in DL
- For sparse data, prefer algorithms with adaptive learning rates; they need no manual tuning, and the default parameters usually work best.
- SGD usually takes the longest to train, but with good initialization and a learning-rate schedule its results are often more reliable. However, SGD tends to get stuck at saddle points, a drawback that should not be ignored.
- If convergence speed matters and you need to train fairly deep and complex networks, adaptive-learning-rate methods are recommended.
- Adagrad, Adadelta, and RMSprop are closely related algorithms and perform about the same.
- Wherever RMSprop with momentum or Adam can be used, Nadam often achieves better results (see the sketch after this list).
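As one way to act on these guidelines, the sketch below assumes PyTorch and constructs the optimizers mentioned above; the model and all hyperparameter values are illustrative placeholders, not recommended settings.

```python
# Sketch (assuming PyTorch) of the selection guidelines above.
# The model and all hyperparameter values are illustrative placeholders.
import torch

model = torch.nn.Linear(100, 10)   # placeholder model

# Sparse data / no manual tuning: an adaptive method with its default parameters.
opt_sparse = torch.optim.Adagrad(model.parameters())

# Slow but often reliable: plain SGD with momentum plus a hand-designed
# learning-rate schedule (this does not address the saddle-point issue).
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(opt_sgd, step_size=30, gamma=0.1)

# Deep / complex networks where convergence speed matters: an adaptive method.
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3)

# Where RMSprop with momentum or Adam would be used, Nadam is often a drop-in
# alternative (torch.optim.NAdam is available in recent PyTorch releases).
opt_nadam = torch.optim.NAdam(model.parameters(), lr=2e-3)
```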