Transformer Reasoning Network for Image-Text Matching and Retrieval
A new approach to image-text matching that uses Faster-RCNN Bottom-Up features and BERT.
Downstream applications usually use ResNet or one of its variants as the backbone CNN, since its simple and modular design is easy to adapt to various tasks. However, ResNet models were originally designed for image classification, so they may be a poor fit for some downstream applications because of their limited receptive-field size and lack of cross-channel interaction.
The authors propose an architecture in which images and texts are first processed separately, and their representations are combined only afterwards.
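A rough sketch of the "process separately, combine late" idea in plain Python. The embeddings below are made-up toy vectors standing in for the outputs of the visual and textual pipelines, and using cosine similarity as the combination step is an assumption for illustration, not the paper's code.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def rank_images(text_embedding, image_embeddings):
    """Return image ids sorted by decreasing similarity to the text embedding."""
    scores = {
        img_id: cosine_similarity(text_embedding, emb)
        for img_id, emb in image_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical precomputed image embeddings (toy values, not real model outputs).
image_embeddings = {
    "img_dog": [0.9, 0.1, 0.0],
    "img_cat": [0.1, 0.9, 0.0],
    "img_car": [0.0, 0.1, 0.9],
}
# Hypothetical embedding of a caption about a dog.
text_embedding = [0.8, 0.2, 0.1]

ranking = rank_images(text_embedding, image_embeddings)
print(ranking)
```

Because the two sides only meet at the similarity step, image embeddings can be computed once and reused for every new text query.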
Main contributions of the paper:
- TERN Architecture
- NDCG metric in addition to Recall@K
- State-of-the-art results on the benchmark
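To see why NDCG adds something on top of Recall@K, here is a minimal sketch of both metrics for a single query. The graded relevance values are made up, and the DCG form used (relevance divided by log2 of the 1-based position plus one) is one common variant; the paper's exact formulation and relevance scores may differ.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any exactly matching item appears in the top k, else 0.0."""
    return 1.0 if any(i in relevant_ids for i in ranked_ids[:k]) else 0.0

def dcg(relevances):
    """Discounted cumulative gain: items further down the list count less."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg_at_k(ranked_ids, relevance, k):
    """DCG of the top k results, normalized by the DCG of the ideal ordering."""
    gains = [relevance.get(i, 0.0) for i in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical ranking returned for one query.
ranked = ["a", "b", "c", "d"]
# Hypothetical graded relevance: "b" is the exact match, "a" and "c" are partly relevant.
relevance = {"a": 0.2, "b": 1.0, "c": 0.5}
exact_matches = {"b"}

print(recall_at_k(ranked, exact_matches, 1))  # exact match is not at rank 1
print(recall_at_k(ranked, exact_matches, 2))  # exact match is within the top 2
print(ndcg_at_k(ranked, relevance, 3))        # partial credit for near matches
```

Recall@K only checks whether the exact match made it into the top K, while NDCG also rewards a ranking for placing partly relevant items near the top.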
Paper: https://arxiv.org/abs/2004.09144
Code: https://github.com/mesnico/TERN
#computervision #deeplearning #bert #imagetextmatching