Source | Zhihu    Author | sticky    Link | https://zhuanlan.zhihu.com/p/137940586
This article is shared for academic exchange only; if there is any infringement, please contact us for removal.

I recently came across a gem of a paper from CVPR 2019: Bag of Tricks for Image Classification with Convolutional Neural Networks. Taking ResNet-50 as the baseline, it presents effective tricks from three angles:

1. Efficient training on modern hardware
2. Minor tweaks to the ResNet-50 model
3. Training refinements
Neural networks have traditionally been trained in 32-bit floating-point precision (FP32). Newer hardware, however, provides enhanced arithmetic units for lower-precision data types: an Nvidia V100 delivers 14 TFLOPS in FP32 but over 100 TFLOPS in FP16. Switching to FP16 (together with a larger batch size) speeds up overall training by roughly 2-3x.

Figure: Comparison of the training time and validation accuracy for ResNet-50 between the baseline (BS=256 with FP32) and a more hardware-efficient setting (BS=1024 with FP16).
Table: The breakdown effect for each effective training heuristic on ResNet-50.
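The paper's experiments use MXNet/GluonCV; purely as an illustration, here is a minimal mixed-precision training loop sketched with PyTorch's `torch.cuda.amp` (the optimizer settings and batch size are assumptions, not the authors' exact configuration):

```python
# Minimal FP16 mixed-precision training sketch (illustrative assumption;
# the paper's reference implementation is MXNet/GluonCV, not PyTorch).
import torch
import torch.nn as nn
from torchvision.models import resnet50

device = torch.device("cuda")
model = resnet50().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.4, momentum=0.9)  # lr scaled up for a large batch, e.g. 1024
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 gradient underflow

def train_step(images, labels):
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # forward pass runs in FP16 where it is numerically safe
        loss = criterion(model(images), labels)
    scaler.scale(loss).backward()          # backward pass on the scaled loss
    scaler.step(optimizer)                 # unscales gradients, then updates the FP32 master weights
    scaler.update()
    return loss.item()
```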
2. Model Tweaks
Figure: The architecture of ResNet-50. The convolution kernel size, output channel size and stride size (default is 1) are illustrated; similar for pooling layers.

The tweaks target the downsampling block and the input stem (the parts highlighted in the figure above):

1. Downsampling block: a 1×1 conv with stride 2 ignores 3/4 of the feature map. ResNet-B therefore swaps the strides of the first two convolutions in path A, using a stride-1 1×1 conv followed by a stride-2 3×3 conv, so the output shape is unchanged but no activations are skipped. ResNet-D further replaces the stride-2 1×1 conv in path B with a stride-2 2×2 AvgPool followed by a stride-1 1×1 conv (a sketch of this block follows the figure captions below).
2. Input stem: a 7×7 conv costs about 5.4 times as much computation as a 3×3 conv, so ResNet-C replaces it with three consecutive 3×3 convs.

Figure: Three ResNet tweaks. ResNet-B modifies the downsampling block of ResNet. ResNet-C further modifies the input stem. On top of that, ResNet-D again modifies the downsampling block.
Table: Compare ResNet-50 with the three model tweaks on model size, FLOPs and ImageNet validation accuracy.
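As a rough illustration of the downsampling tweaks, here is a PyTorch sketch of a bottleneck downsampling block with the ResNet-B and ResNet-D changes applied (channel sizes and module structure are my assumptions for the example, not the authors' code):

```python
# Illustrative sketch of the ResNet-B/D downsampling block in PyTorch
# (an assumption; the paper's reference implementation is in GluonCV).
import torch
import torch.nn as nn

class DownsampleBottleneckD(nn.Module):
    """Bottleneck downsampling block with the ResNet-B and ResNet-D tweaks."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=2):
        super().__init__()
        # Path A (ResNet-B): stride moved from the 1x1 conv to the 3x3 conv
        self.path_a = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, stride=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, stride=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Path B (ResNet-D): a 2x2 average pool does the downsampling,
        # followed by a stride-1 1x1 conv, so no activations are discarded
        self.path_b = nn.Sequential(
            nn.AvgPool2d(kernel_size=2, stride=stride, ceil_mode=True),
            nn.Conv2d(in_ch, out_ch, 1, stride=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.path_a(x) + self.path_b(x))

# Example: halving a 56x56 feature map, as in the first downsampling stage of ResNet-50
x = torch.randn(1, 256, 56, 56)
print(DownsampleBottleneckD(256, 128, 512)(x).shape)  # torch.Size([1, 512, 28, 28])
```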
3. Training Refinements
3.1 Cosine Learning Rate Decay
The conventional schedule is "step decay": the learning rate is cut by a constant factor (e.g. ×0.1) every fixed number of epochs. Cosine decay instead lowers the learning rate continuously as training progresses, following η_t = η/2 · (1 + cos(tπ/T)), where η is the initial learning rate, t is the current batch and T is the total number of batches (ignoring the warm-up phase).

Figure: Visualization of learning rate schedules with warm-up. Top: cosine and step schedules for batch size 1024. Bottom: Top-1 validation accuracy curve with regard to the two schedules.
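A minimal sketch of the schedule, combining linear warm-up with cosine decay (the base learning rate and epoch counts below are illustrative assumptions, not the paper's exact hyperparameters):

```python
# Cosine learning-rate decay with linear warm-up (illustrative sketch).
import math

def cosine_lr(epoch, base_lr=0.1, total_epochs=120, warmup_epochs=5):
    """Return the learning rate for a given epoch."""
    if epoch < warmup_epochs:
        # linear warm-up from 0 up to base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    t = epoch - warmup_epochs
    T = total_epochs - warmup_epochs
    return 0.5 * base_lr * (1 + math.cos(math.pi * t / T))

# Usage: update the optimizer once per epoch, e.g.
# for g in optimizer.param_groups: g["lr"] = cosine_lr(epoch)
```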
Table: The validation accuracies on ImageNet for stacking training refinements one by one. We repeat each refinement on ResNet-50-D for 4 times with different initialization, and report the mean and standard deviation in the table.