Fine-tuning code and pre-trained models
Explore the official paper »
Statistical pattern recognition, nowadays often known as "machine learning",
is a key element of modern computer science. Its goal is to find, learn, and recognize patterns in complex data,
for example in images, speech, biological pathways, or the internet.
- This repo is a gist-style implementation of the Vision Transformer, which was introduced in the paper: An Image is Worth 16x16 Words
- This repository uses the PyTorch implementation available here
- The PyTorch repository provides pre-trained weights
The code is a straightforward rewrite of the VisionTransformer class with minor modifications
and simplifications, so the class is easier to run and modify in future work that patches and embeds images for classification.
Commonly used resources that I find helpful are listed in the acknowledgements.
The implementation is built using Python 3.7.9 and pip 20.0.
The Vision Transformer is an image classifier: it takes in an image and outputs the class and sub-class prediction. However,
it does this without any convolutional layers; instead it uses attention layers, which are already widely used in NLP. An attention mechanism is an attempt to implement, in deep neural networks, the same behavior of selectively concentrating on a few relevant things
while ignoring others. In computer vision, however, convolutional neural networks (CNNs) are still the norm, and self-attention has only slowly begun to enter the main body of research.
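The attention operation described above can be sketched in a few lines. This is a minimal NumPy illustration of scaled dot-product attention, not the repo's actual PyTorch code; the function name and the random toy inputs are my own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (n_tokens, d) arrays. Each output token is a weighted
    # average of the values, with weights given by query-key similarity --
    # this is the "selective concentration" mechanism.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)       # (n_tokens, n_tokens) similarities
    weights = softmax(scores, axis=-1)  # rows sum to 1 over the keys
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
out, w = scaled_dot_product_attention(q, k, v)
```

In the full model this runs per head (`n_heads` of them), with q, k, v produced by learned linear projections of the token sequence.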
The network is trained in three steps, where the image is first turned into a sequence of 1D tokens so the transformer architecture can be applied:
- Fine-tune the global features pre-trained on ImageNet and flatten the patches into 1D vectors.
- Run mask inference to obtain the cropped images and fine-tune the local features. Here, the weights of the global features are fixed.
- Concatenate the global and local feature outputs and fine-tune the fusion features while freezing the weights of the other features.
- The position embedding allows the network to determine what part of the image a specific patch came from.
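The tokenization step the list above relies on can be sketched as follows: split the image into fixed-size patches, flatten each patch into a 1D vector, and add a per-patch position embedding. This is a NumPy illustration only; the helper name and the random embedding initialization are my own, not the repo's API.

```python
import numpy as np

def patchify(img, patch_size=16):
    # img: (C, H, W) array -> (n_patches, C * patch_size * patch_size) tokens,
    # scanning patches left-to-right, top-to-bottom.
    c, h, w = img.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = []
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            patches.append(img[:, i:i + patch_size, j:j + patch_size].reshape(-1))
    return np.stack(patches)

img = np.zeros((3, 384, 384))          # default input size from the config
tokens = patchify(img)                  # (576, 768): 24x24 patches of 768 values

# Position embedding: one (learned) vector per patch, added to its token,
# so the transformer knows where in the image each patch came from.
pos_embed = np.random.default_rng(0).standard_normal(tokens.shape) * 0.02
tokens = tokens + pos_embed
```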
Install the dependencies before running the compute.py file:
- pip
$ pip install -r requirements.txt
First, build & download the model using the command:
python run_model.py
You can change the attributes & parameters below; the default image size is 384x384:
custom_config = {
"img_size": 384,
"in_chans": 3,
"patch_size": 16,
"embed_dim": 768,
"depth": 12,
"n_heads": 12,
"qkv_bias": True,
"mlp_ratio": 4,
}
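As a sanity check on these settings, the sequence length the transformer sees follows directly from `img_size` and `patch_size`. A small sketch (the +1 assumes the standard ViT classification token, which this repo's class may or may not prepend):

```python
img_size, patch_size, in_chans, embed_dim = 384, 16, 3, 768

n_patches = (img_size // patch_size) ** 2  # 24 * 24 = 576 patches
patch_dim = in_chans * patch_size ** 2     # 3 * 256 = 768 values per flattened patch
seq_len = 1 + n_patches                    # +1 for the [CLS] token -> 577 tokens

print(n_patches, patch_dim, seq_len)       # 576 768 577
```

Note that changing `img_size` or `patch_size` changes the token count, so pre-trained position embeddings would need to be interpolated to match.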
To run the classification function and predict probability output:
python compute.py -image <image destination, usually the base dir>   (short form: -i)
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/FeaturePatch-VisionTransformation)
- Commit your Changes (git commit -m 'Add some updates')
- Push to the Branch (git push origin feature/FeaturePatch-VisionTransformation)
- Open a Pull Request