Reproducing ImageNet in 18 minutes
The code to reproduce #ImageNet in 18 minutes is posted in the GitHub repo. It actually becomes «ImageNet in 12 minutes» if you stop at 74.9% top-1, the accuracy used in Chainer's "ImageNet in 15 minutes" paper; the last few bits of accuracy are the hardest.
Link: https://github.com/diux-dev/imagenet18
ImageNet/ResNet-50 training time dramatically reduced (6.6 min -> 224 sec)
ResNet-50 on ImageNet now (allegedly) down to 224 sec (3.7 min) using 2176 V100s. Ingredients: an increasing batch-size schedule, LARS, a 5-epoch LR warmup, sync BN without moving averages, mixed-precision fp16 training, and a "2D-Torus" all-reduce on NCCL2 over NVLink2 and 2x IB EDR interconnects.
1.28M images over 90 epochs at a ~68K batch size, so the entire optimization is ~1,700 updates to converge.
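A minimal sketch of two of these ingredients, the 5-epoch linear LR warmup and a LARS-style layer-wise trust ratio (NumPy, illustrative hyperparameters, not the authors' implementation):

import numpy as np

def warmup_lr(base_lr, epoch, warmup_epochs=5):
    # Linear warmup from ~0 to base_lr over the first warmup_epochs epochs.
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr

def lars_step(w, grad, lr, eta=0.001, weight_decay=1e-4):
    # One LARS-style update: scale the global LR by a per-layer trust ratio
    # ||w|| / ||grad + wd * w||, so huge batches don't blow up any one layer.
    update = grad + weight_decay * w
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    trust_ratio = eta * w_norm / (u_norm + 1e-9) if w_norm > 0 else 1.0
    return w - lr * trust_ratio * update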
ArXiV: https://arxiv.org/abs/1811.05233
#ImageNet #ResNet
Do Better ImageNet Models Transfer Better?
Finding: architectures that score better on ImageNet tend to work better on other datasets too. Surprise: pretraining on ImageNet sometimes doesn't help very much.
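For context, a minimal torchvision sketch of the kind of comparison being run (an assumed setup, not the paper's code): fine-tune an ImageNet-pretrained ResNet-50 versus training the same architecture from scratch on the target dataset.

import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

def build_model(num_classes, pretrained=True):
    # Load ImageNet weights (or random init) and swap the classification head.
    weights = ResNet50_Weights.IMAGENET1K_V1 if pretrained else None
    model = resnet50(weights=weights)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

finetuned = build_model(num_classes=37, pretrained=True)    # e.g. Oxford-IIIT Pets
from_scratch = build_model(num_classes=37, pretrained=False)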
ArXiV: https://arxiv.org/abs/1805.08974
#ImageNet #finetuning #transferlearning
"Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet"
A "bag of words" of nets on tiny 17x17 patches suffice to reach AlexNet-level performance on ImageNet. A lot of the information is very local.
Paper: https://openreview.net/forum?id=SkfMWhAqYQ
#fun #CNN #CV #ImageNet
A "bag of words" of nets on tiny 17x17 patches suffice to reach AlexNet-level performance on ImageNet. A lot of the information is very local.
Paper: https://openreview.net/forum?id=SkfMWhAqYQ
#fun #CNN #CV #ImageNet
OpenReview
Approximating CNNs with Bag-of-local-Features models works...
Aggregating class evidence from many small image patches suffices to solve ImageNet, yields more interpretable models and can explain aspects of the decision-making of popular DNNs.
Critics: AI competitions don't produce useful models
A post arguing that AI competitions never seem to lead to products, showing how one can overfit a held-out test set, and why #Imagenet results since the mid-2010s are suspect.
Link: https://lukeoakdenrayner.wordpress.com/2019/09/19/ai-competitions-dont-produce-useful-models/
#critics #meta #AI #kaggle #imagenet #lenet
What's Hidden in a Randomly Weighted Neural Network?
Amazingly, this paper finds a subnetwork with random weights inside a Wide ResNet-50 that outperforms the trained weights of a ResNet-34 on ImageNet!
Background: the recent ICLR Lottery Ticket Hypothesis paper showed that you can take a trained big net and throw out up to 95% of the weights, and the rest can be retrained to the same quality when started from the same initialization.
The follow-up, Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask, found that you can keep the weights at their initialization and learn only the mask, throwing unnecessary connections out of the network; this already gives around 40% accuracy on CIFAR while training not the model's weights but only its structure. Similar observations were made for simple RL tasks, see Weight Agnostic Neural Networks.
However, it was not clear how well structure-only training works on large datasets and large nets, or without the "right" initial weights.
In this paper the authors run structure-only training on ImageNet for the first time. To do so:
- They take a big, wide net (a Wide ResNet-50 for the headline result); weights are initialized from a "binarized" Kaiming normal, i.e. each weight is set to +std or -std rather than drawn from the normal distribution.
- Each weight gets an additional scalar, a score s, indicating how important it is for a good prediction. At inference we keep the top-k% of weights by score and zero out the rest.
- With the weights fixed, we train only the scores. The main trick: although the forward pass, like inference, uses only the top-k weights, in the backward pass the gradient flows through all the scores (see the sketch below). This is dual to LRD, where all weights are used in the forward pass and only a small subset is updated in the backward pass.
Thus we can prune a random Wide ResNet-50, get 73.3% accuracy on ImageNet, and end up with fewer active weights than a ResNet-34. Magic.
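A hedged PyTorch sketch of that top-k / straight-through trick (the paper calls the algorithm edge-popup); the layer, score init, and k below are illustrative, not the authors' exact settings.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GetSubnet(torch.autograd.Function):
    # Forward: binary mask keeping the top-k fraction of scores.
    # Backward: straight-through, the gradient reaches *all* scores.
    @staticmethod
    def forward(ctx, scores, k):
        flat = scores.flatten()
        num_keep = max(1, int(k * flat.numel()))
        threshold = torch.topk(flat, num_keep).values.min()
        return (scores >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # no gradient for k

class SubnetLinear(nn.Module):
    def __init__(self, in_f, out_f, k=0.3):
        super().__init__()
        self.k = k
        # Fixed signed-constant init, roughly the "binarized" Kaiming normal:
        # every weight is +std or -std and is never trained.
        std = (2.0 / in_f) ** 0.5
        self.weight = nn.Parameter(std * torch.sign(torch.randn(out_f, in_f)),
                                   requires_grad=False)
        # Only these scores are trained.
        self.scores = nn.Parameter(0.01 * torch.randn(out_f, in_f))

    def forward(self, x):
        mask = GetSubnet.apply(self.scores, self.k)
        return F.linear(x, self.weight * mask)

layer = SubnetLinear(128, 10)
loss = layer(torch.randn(4, 128)).sum()
loss.backward()  # gradients land on layer.scores only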
ArXiV: https://arxiv.org/pdf/1911.13299.pdf
YouTube explanation: https://www.youtube.com/watch?v=C6Tj8anJO-Q
via @JanRocketMan
#ImageNet #ResNet