Data Science by ODS.ai 🦜
First Telegram Data Science channel. Covering all technical and popular stuff about anything related to Data Science: AI, Big Data, Machine Learning, Statistics, general Math and the applications of the former. To reach the editors, contact: @haarrp
Reproducing ImageNet in 18 minutes

The code to reproduce #ImageNet training in 18 minutes is posted in the GitHub repo. It actually becomes «ImageNet in 12 minutes» if you only target the 74.9% top-1 accuracy used in Chainer's "ImageNet in 15 minutes" paper; the last few bits of accuracy are the hardest.

Link: https://github.com/diux-dev/imagenet18
ImageNet/ResNet-50 training time dramatically reduced (6.6 min -> 224 sec)

ResNet-50 on ImageNet now (allegedly) down to 224 sec (3.7 min) using 2176 V100s. The recipe: an increasing batch size schedule, LARS, a 5-epoch LR warmup, synchronized BN without moving averages, mixed-precision fp16 training, and a "2D-Torus" all-reduce on NCCL2 over NVLink2 and 2x IB EDR interconnects.
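To make the LARS ingredient concrete, here is a minimal sketch of a layer-wise adaptive rate scaling step (my own simplification, not the paper's code; momentum is omitted and the trust_coef / weight_decay values are assumptions):

```python
import torch

def lars_step(params, lr, trust_coef=0.001, weight_decay=1e-4, eps=1e-9):
    # One LARS-style update: each layer's step is rescaled by ||w|| / ||g + wd*w||,
    # which keeps huge-batch training stable across layers of very different scale.
    with torch.no_grad():
        for w in params:
            if w.grad is None:
                continue
            g = w.grad + weight_decay * w
            w_norm, g_norm = w.norm(), g.norm()
            local_lr = trust_coef * w_norm / (g_norm + eps) if w_norm > 0 else 1.0
            w -= lr * local_lr * g

# usage (after loss.backward()): lars_step(model.parameters(), lr=current_lr)
```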

That is 1.28M images over 90 epochs at a batch size of 68K, so the entire optimization takes only ~1700 updates to converge.
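A quick back-of-the-envelope check of that update count (the 1.28M / 90 / 68K figures are the ones quoted above):

```python
images = 1_281_167      # ImageNet-1k training set size (~1.28M)
epochs = 90
batch_size = 68_000     # final large-batch size quoted above
updates = epochs * images / batch_size
print(round(updates))   # ~1696 optimizer steps, i.e. the "~1700 updates" above
```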

ArXiV: https://arxiv.org/abs/1811.05233

#ImageNet #ResNet
Do Better ImageNet Models Transfer Better?

Finding: better ImageNet architectures tend to work better on other datasets too. Surprise: pretraining on ImageNet sometimes doesn't help very much.

ArXiV: https://arxiv.org/abs/1805.08974

#ImageNet #finetuning #transferlearning
"Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet"

A "bag of words" of nets on tiny 17x17 patches suffice to reach AlexNet-level performance on ImageNet. A lot of the information is very local.

Paper: https://openreview.net/forum?id=SkfMWhAqYQ

#fun #CNN #CV #ImageNet
📹 What's Hidden in a Randomly Weighted Neural Network?

Amazingly, this paper finds a subnetwork with random weights inside a Wide ResNet-50 that outperforms a ResNet-34 with trained weights on ImageNet!

In the Lottery Ticket Hypothesis paper from the last ICLR, the authors showed that it is possible to take a trained big net and throw out about 95% of the weights so that the rest can be retrained to the same quality, starting from the same initialization.
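A minimal sketch of that procedure, assuming you already have dicts of the trained and the initial weight tensors (names and the 5% keep fraction are illustrative, not the paper's code):

```python
import torch

def lottery_ticket(init_state, trained_state, keep_frac=0.05):
    # Keep only the ~5% largest-magnitude weights of each trained tensor, then
    # rewind the survivors to their initial values; everything else is zeroed out.
    masks, rewound = {}, {}
    for name, w in trained_state.items():
        k = max(1, int(keep_frac * w.numel()))
        cutoff = w.abs().flatten().topk(k).values.min()
        masks[name] = (w.abs() >= cutoff).float()
        rewound[name] = init_state[name] * masks[name]
    return masks, rewound   # retrain only the unmasked weights, starting from `rewound`
```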
The follow-up, Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask, found that it is possible to leave the weights at their initialization and learn only the mask, pruning the unnecessary connections from the network; this gets around 40% accuracy on CIFAR while training not the model's weights but only its structure. Similar observations were made for simple RL tasks, see Weight Agnostic Neural Networks.
However, it was not clear how well structure-only training works on full-scale datasets and large nets, without access to properly trained weights.

In this paper the authors run structure-only training on ImageNet for the first time. To do so:
- Take a fat network (a.k.a. DenseNet); the weights are initialized from a "binarized" Kaiming normal (either +std or -std instead of a sample from the normal).
- For each weight, keep an additional scalar, the score s, which shows how important that weight is for a good prediction. At inference we take the top-k% of weights by score and zero out the rest.
- With the weights fixed, we train the scores. The main trick: although the forward pass, like inference, uses only the top-k weights, in the backward pass the gradient flows through all the scores. It is a kind of inverse of LRD (Learning Rate Dropout), where all weights are used in the forward pass and only a small subset in the backward pass. See the sketch after this list.
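As referenced above, a minimal PyTorch sketch of such a layer (my own reconstruction, not the authors' code; the class names, score init, and k=0.3 default are assumptions): weights frozen at a signed-constant init, a trainable score per weight, top-k selection in the forward pass, straight-through gradients to all scores in the backward pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMask(torch.autograd.Function):
    # Forward: binary mask keeping the top-k fraction of scores.
    # Backward: straight-through, the gradient reaches ALL scores.
    @staticmethod
    def forward(ctx, scores, k):
        mask = torch.zeros_like(scores)
        idx = scores.flatten().topk(int(k * scores.numel())).indices
        mask.view(-1)[idx] = 1.0
        return mask

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None          # no gradient for k

class SubnetConv(nn.Conv2d):
    def __init__(self, *args, k=0.3, **kwargs):
        super().__init__(*args, bias=False, **kwargs)
        self.k = k
        self.scores = nn.Parameter(torch.rand_like(self.weight) * 0.01)
        # "binarized" Kaiming normal: every weight is +std or -std, then frozen.
        std = (2.0 / self.weight[0].numel()) ** 0.5
        self.weight.data = std * torch.sign(torch.randn_like(self.weight))
        self.weight.requires_grad = False   # structure-only training

    def forward(self, x):
        mask = TopKMask.apply(self.scores.abs(), self.k)
        return F.conv2d(x, self.weight * mask, None,
                        self.stride, self.padding, self.dilation, self.groups)

# usage: only the scores receive gradients, the random weights never change
layer = SubnetConv(3, 64, 3, padding=1, k=0.3)
opt = torch.optim.SGD([layer.scores], lr=0.1, momentum=0.9)
out = layer(torch.randn(2, 3, 32, 32))
```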

Thus we can prune a random Wide ResNet-50 and get 73.3% accuracy on ImageNet, with fewer active weights than in a ResNet-34. Magic.

ArXiV: https://arxiv.org/pdf/1911.13299.pdf
YouTube explanation: https://www.youtube.com/watch?v=C6Tj8anJO-Q
via @JanRocketMan

#ImageNet #ResNet