Vision Transformer for Image Classification Using Transfer Learning

  • Published: 12 Jan 2025

Comments •

  • @dr.noushathshaffi7515
    @dr.noushathshaffi7515 1 year ago +3

    I've been searching for this tutorial for a long time, and I can't express how thankful I am, Aarohi! Your YouTube channel is an absolute gem, and it truly deserves a multitude of subscriptions. The way you effortlessly share your expertise is not only enlightening but also engaging. Keep up the exceptional work!

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago

      Thank you for your heartwarming comment 🙂

  • @debjitdas1714
    @debjitdas1714 11 months ago +1

    Very informative tutorial, thank you. I have the following questions:
    1) During training, how do I save only the best model after each epoch (e.g. based on the lowest validation loss) and load that best model after training for future use?
    2) How do I generate the confusion matrix and also the F1 score, precision, and recall?
    3) Finally, how do I identify which test samples are predicted correctly and which are not?
    4) After the first 4-5 epochs, the gap between training loss and test loss (and between train accuracy and test accuracy) keeps increasing, so the model needs further fine-tuning. Please suggest how to do that.

    • @salihsalur4855
      @salihsalur4855 8 months ago

      Hello, could you answer question 2 (F1 score, precision, ...)? Do you have code to compute the F1 score?

  • @shounakdas1001
    @shounakdas1001 1 year ago

    Thanks Aarohi, it is brilliant. A great help for learning ViT.

  • @soravsingla6574
    @soravsingla6574 1 year ago

    Hello Ma’am
    Your AI and Data Science content is consistently impressive! Thanks for making complex concepts so accessible. Keep up the great work! 🚀 #ArtificialIntelligence #DataScience #ImpressiveContent 👏👍

  • @sanjoetv5748
    @sanjoetv5748 1 year ago

    Please make a video on landmark detection with a vision transformer. I greatly need to finish this project, and the task is to detect 13 landmarks using a vision transformer. I can't find any resources that teach how to do landmark detection with a vision transformer. This channel is my only hope.

  • @soravsingla6574
    @soravsingla6574 1 year ago

    Code with Aarohi is the best YouTube channel for Artificial Intelligence
    #BestChannel #YouTubeChannel #ArtificialIntelligence #CodeWithAarohi #DataScience #Engineering #MachineLearning #DataAnalysis #BestLearning #LearnDataScience #DataScienceCourse #ArtificialIntelligenceCourse #Codewithaarohi #CodeWithAarohi

  • @ambikajadoonanan2852
    @ambikajadoonanan2852 1 year ago

    Good day. Thank you for this wonderful demo. I have a few questions:
    1. Are there any other existing vision transformer models that you know of?
    2. How do I train a model on images paired with nutritional values stored in a column range of a separate Excel file, so that it outputs the predicted values for a single image? Each image's file name is matched to its values in the Excel file.
    Many many thanks in advance for the assistance. :)

  • @JKaks-gr5zm
    @JKaks-gr5zm 10 months ago +3

    I am getting the error "ModuleNotFoundError: No module named 'going_modular'" even though the going_modular folder and the Notebook are under the same folder. I am working in Colab. Please Help Ma'am.

    • @Ikramkrt
      @Ikramkrt 10 months ago

      I have the same problem, but in Jupyter. Did you resolve it?

    • @Ritam_Goswami_
      @Ritam_Goswami_ 7 months ago

      I am currently having problems with epochs not running; training keeps taking a very long time. What should I do?

    • @Ritam_Goswami_
      @Ritam_Goswami_ 7 months ago

      Just install the module from the directory that contains it, in a separate cell.

  • @aakashyadav1589
    @aakashyadav1589 3 months ago

    How do I predict on a very large dataset? Let's say you have 30,000 images; a for loop over single images will be computationally expensive, so what's the best way to run inference with a pretrained model on large datasets?

  • @sathishkumars4463
    @sathishkumars4463 1 year ago +1

    Awesome upload. How do I save the model or weights which I can load and perform inference later?

  • @danielasefa8087
    @danielasefa8087 1 year ago

    Thanks so much, I was waiting for this video from you.

  • @neelshah1651
    @neelshah1651 1 year ago

    Thank you for such great content!!

  • @НиколайНовичков-е1э

    Thank you! Your video is very informative!

  • @anishmgeorge207
    @anishmgeorge207 7 months ago

    Madam, I have one doubt. Here we use a pretrained model and train it again on our dataset. So my questions are: where do we get the pretrained model from, and on which dataset was it trained? Also, after retraining the model on our dataset, will all the weights have changed?

  • @MaryBrockyn
    @MaryBrockyn 1 year ago

    Hi again, when I print the summary of the Vision Transformer, the input shapes for each layer start with 32. I understand that the very first input [32, 3, 224, 224] means we originally have an image of size 224x224 with 3 colour channels. What does the 32 mean? Is that the batch size, and if so, do I have to change that value if I change my batch size for training?

    • @CodeWithAarohi
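      @CodeWithAarohi 1 year ago

      Yes, you are correct! The "32" in the input shape [32, 3, 224, 224] refers to the batch size. It is only the example batch size used to print the summary, so you don't have to change it when you train with a different batch size.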

  • @swatimishra1555
    @swatimishra1555 1 month ago +1

    I am getting this error
    "ModuleNotFoundError: No module named 'going_modular'"
    when trying to run it on Google Colab.
    How do I fix it in Google Colab? Please reply.

    • @CodeWithAarohi
      @CodeWithAarohi 1 month ago +1

      This is a folder. You can get it from my repo. Place it in the directory where your Jupyter notebook is.

    • @swatimishra1555
      @swatimishra1555 1 month ago +1

      @CodeWithAarohi thanks a lot

    • @swatimishra1555
      @swatimishra1555 1 month ago +1

      @CodeWithAarohi How can I use this going_modular folder in Google Colab? Is there any way?

    • @CodeWithAarohi
      @CodeWithAarohi 1 month ago +1

      Paste it in the Google Drive folder where your Colab notebook is.
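
The import error in this thread usually means Python cannot see the folder that contains going_modular. Here is a minimal sketch of one way to make it importable from a notebook. The helper name add_project_to_path and the path /content/drive/MyDrive/vit_project are hypothetical examples, not taken from the video, and in Colab the Drive path only exists once Google Drive has been mounted.

```python
import sys
from pathlib import Path


def add_project_to_path(project_dir: str) -> bool:
    """Put project_dir on sys.path so packages inside it can be imported.

    Returns True if the directory exists and is now on sys.path,
    False if the directory does not exist.
    """
    path = Path(project_dir).expanduser().resolve()
    if not path.is_dir():
        return False
    if str(path) not in sys.path:
        sys.path.insert(0, str(path))
    return True


# Hypothetical location of the folder that CONTAINS going_modular.
ok = add_project_to_path("/content/drive/MyDrive/vit_project")
print("project folder found:", ok)
# If ok is True, the import used in the tutorial should resolve:
# from going_modular.going_modular import engine
```

If the function prints False, the folder is not where the notebook expects it, which matches the advice in the replies to check the folder location first.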

    • @swatimishra1555
      @swatimishra1555 1 month ago +1

      @CodeWithAarohi thanks

  • @Vibhu-ts8dh
    @Vibhu-ts8dh 8 months ago

    Ma'am, how do I save and then load the model? After saving and loading it, I am not able to get the same predictions. Are there any resources I can refer to?

  • @hulkbaiyo8512
    @hulkbaiyo8512 1 year ago

    I combined your code with my own training code and added a learning rate scheduler and GPU memory garbage collection. The results and training speed are now much better, without worrying about the GPU running out of memory.

  • @FERNANDOVALLE-ig8gl
    @FERNANDOVALLE-ig8gl 8 months ago

    Could you add how to calculate the confusion matrix and other metrics please?

  • @devavratpro7061
    @devavratpro7061 1 year ago

    Hi, thanks for your great video. I would like to train the model with a different input size, such as 448x448. However, the model only accepts a 224x224 input and otherwise gives an error. How can I make the necessary changes?

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago

      You'll need to adapt the architecture to accommodate the larger input size.
      The key components to consider are:
      1- Patch size. In the original ViT, the input image is divided into non-overlapping patches of 16x16 pixels, so a 224x224 image gives a 14x14 grid of 196 patches. For a 448x448 input you can either keep 16x16 patches (a 28x28 grid of 784 patches) or enlarge the patch size. A patch size of 28x28 gives 448/28 = 16 patches per side.
      2- Number of patches. It depends on the input size and patch size. For a 448x448 input and 28x28 patches, you'll have 16x16 = 256 patches.
      3- Embedding dimension. It is a free hyperparameter (768 in ViT-Base) and does not have to change with the input size, but the position embeddings must match the new number of patches.
      4- Number of transformer blocks. This does not have to change either; more blocks may help performance at a higher compute cost.
      Note that changing the patch size means the pretrained patch-embedding weights no longer fit, so a model configured this way starts from random weights and has to be trained from scratch.
      Example- Using PyTorch and the Hugging Face Transformers ViT model for a 448x448 input size:
      import torch
      from transformers import ViTConfig, ViTForImageClassification, ViTImageProcessor
      # Resize incoming images to the desired input size
      image_processor = ViTImageProcessor(size={"height": 448, "width": 448})
      # Describe the modified ViT architecture
      config = ViTConfig(
          image_size=448,
          patch_size=28,  # Adjusted patch size
          num_labels=1000,  # Adjust the number of output classes
          # Modify other parameters as needed (hidden_size, num_hidden_layers, etc.)
      )
      # Build a randomly initialised model from the config
      model = ViTForImageClassification(config)
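
The patch arithmetic in this reply can be checked with a few lines of plain Python. This is only a sketch of the counting rule for square images and square, non-overlapping patches; the function name num_patches is an illustrative choice.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


for image_size, patch_size in [(224, 16), (448, 28), (448, 16)]:
    n = num_patches(image_size, patch_size)
    # +1 accounts for the class token prepended to the patch sequence
    print(f"{image_size}x{image_size}, patch {patch_size}: {n} patches, sequence length {n + 1}")
```

Keeping 16x16 patches at 448x448 quadruples the sequence length compared with 224x224, which is the main reason a larger patch size is sometimes preferred.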

  • @harshavenkatesh4409
    @harshavenkatesh4409 1 year ago

    "Could not generate a random directory for manager socket". How do I resolve this error?

  • @Sunil-ez1hx
    @Sunil-ez1hx 1 year ago

    Thank you soo much mam for this amazing video

  • @nandiniloku7747
    @nandiniloku7747 1 year ago

    Thank you, very good explanation. Which pretrained model are you using here? Is it the same as a pretrained CNN model, or are you using only the weights of the pretrained model?

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago

      You can check this: github.com/pytorch/vision/blob/main/torchvision/models/vision_transformer.py. There, look at class ViT_B_16_Weights(WeightsEnum); its default entry, IMAGENET1K_V1, was trained on ImageNet-1k.

  • @MaryBrockyn
    @MaryBrockyn 1 year ago

    Thanks for the tutorial! Is there a quick way to classify all images in a folder with the trained model and also compute the confusion matrix and other metrics for them?

    • @MaryBrockyn
      @MaryBrockyn 1 year ago

      I am also wondering how to convert the images that I want to classify into the proper input shape. Can you help with that?
      Thanks in advance!

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago +1

      # image_size should be a (height, width) tuple such as (224, 224) so every image gets the same shape
      image_transform = transforms.Compose(
          [
              transforms.Resize(image_size),
              transforms.ToTensor(),
              transforms.Normalize(
                  mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
              ),
          ]
      )

    • @MaryBrockyn
      @MaryBrockyn 1 year ago

      @CodeWithAarohi Thank you! Do you maybe also have an answer to my first question? (Is there a quick way to classify all images in a folder with the trained model and also compute the confusion matrix and other metrics such as accuracy, recall and F1 score?)

  • @maharaniizza4601
    @maharaniizza4601 1 year ago

    Hi Ms. Aarohi, thank you so much for your video. May I ask: if I want to add an early stopping callback, is it correct to modify the epoch loop in the engine file? Thank you.
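
One way to picture early stopping inside an epoch loop is a small helper that tracks the best validation loss. This is a generic sketch, not the code from the tutorial's engine file; the class name EarlyStopping, the patience value and the loss numbers are all made up for illustration.

```python
class EarlyStopping:
    """Signal a stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Stand-in validation losses; in practice these come from the test step of each epoch.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print("stopped at epoch", stopped_at, "best loss", stopper.best_loss)
```

The branch where best_loss improves is also a natural place to save a checkpoint of the best model so far.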

  • @mehedihasanshojib5831
    @mehedihasanshojib5831 1 year ago

    I prepared my dataset like you did, but when I try to train I get "OSError: Caught OSError in DataLoader worker process 0" and "image file is truncated (40 bytes not processed)". I followed your code exactly and only swapped in my own dataset. Can you tell me how to fix it?

  • @sandhyarani-wk4mn
    @sandhyarani-wk4mn 1 year ago

    Ma'am, I am getting a "no module found" error when importing engine from going_modular. I have downloaded the folder and copied it into the directory. Please help.

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago

      Check the location of the going_modular folder and your Jupyter notebook. Both should be in the same folder.

  • @올라쿤레아요데지오몰

    Hi Aarohi, you made it look easy. I have a challenge: I am getting this error: ModuleNotFoundError: No module named 'helper_functions'

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago +1

      You can get the helper_functions.py file from here and paste it in your directory: github.com/AarohiSingla/Image-Classification-Using-Vision-transformer

    • @올라쿤레아요데지오몰
      @올라쿤레아요데지오몰 1 year ago

      @CodeWithAarohi Thank you. It worked! One more thing: which activation function did you use, and at what stage did you apply it?

  • @loveofmylifesoumyarashmi9972
    @loveofmylifesoumyarashmi9972 1 year ago +1

    Do you have code to print the accuracy, F1 score, precision and recall?

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago +1

      Will create a separate video on it.
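
Until such a video exists, the metrics asked about here can be sketched in plain Python from lists of true and predicted labels. This is an illustrative sketch rather than code from the tutorial: the function names and the example labels are made up, and in practice the lists would be collected by running the trained model over the test set.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} computed from the confusion matrix."""
    cm = confusion_matrix(y_true, y_pred, labels)
    metrics = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(len(labels))) - tp  # predicted as label, but wrong
        fn = sum(cm[i]) - tp  # truly label, but predicted as something else
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Made-up labels for illustration only
y_true = ["daisy", "daisy", "daisy", "dandelion", "dandelion", "dandelion"]
y_pred = ["daisy", "daisy", "dandelion", "dandelion", "dandelion", "daisy"]
labels = ["daisy", "dandelion"]

print(confusion_matrix(y_true, y_pred, labels))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
for label, (p, r, f1) in per_class_metrics(y_true, y_pred, labels).items():
    print(label, round(p, 3), round(r, 3), round(f1, 3))
```

Comparing y_true and y_pred element by element also shows which individual test samples were misclassified.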

    • @salihsalur4855
      @salihsalur4855 8 months ago

      Do you have code for the F1 score, precision, ...?

  • @ericobeng3139
    @ericobeng3139 8 months ago

    Thank you for this great video. Can this be applied to video datasets? or do you have a video link to training ViT on Video dataset? Thank you.

    • @CodeWithAarohi
      @CodeWithAarohi 8 months ago +1

      Yes, ViT can be applied to video datasets. While ViT was initially designed for processing static images, researchers have extended its application to video data by incorporating temporal information.

  • @YashSharma-le3mo
    @YashSharma-le3mo 1 year ago

    Hi Ma'am,
    I have CUDA available,
    but I get an assertion error
    and am unable to run with CUDA.

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago

      Check your PyTorch version and whether it was compiled with CUDA support.
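
A quick way to run that check from a notebook cell is shown below. It is a small sketch that only reports what the installed PyTorch build can see and falls back to the CPU when no usable GPU is found.

```python
import torch

print("PyTorch version:", torch.__version__)
# torch.version.cuda is None for CPU-only builds
print("Built with CUDA:", torch.version.cuda)
print("CUDA available at runtime:", torch.cuda.is_available())

# Fall back to the CPU when no usable GPU is found
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.ones(2, 2).to(device)
print("tensor lives on:", x.device)
```

If the second line prints None, the installed build is CPU-only, and moving a model to "cuda" will fail even on a machine that has a GPU.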

  • @princekhunt1
    @princekhunt1 1 month ago

    Nice tutorial

  • @dipankarporey2171
    @dipankarporey2171 1 year ago

    Could you please make one single video completely on "Attention"(including self-attention) architecture? Thank you for these videos.

  • @pranavdubal-c9j
    @pranavdubal-c9j 10 months ago

    I am getting an error of module 'torchvision.models' has no attribute 'ViT_B_16_Weights'
    1 # 1. Get pretrained weights for ViT-Base
    ----> 2 pretrained_vit_weights = torchvision.models.ViT_B_16_Weights.DEFAULT
    3
    4 # 2. Setup a ViT model instance with pretrained weights
    5 pretrained_vit = torchvision.models.vit_B_16(weights=pretrained_vit_weights).to(device)
    AttributeError: module 'torchvision.models' has no attribute 'ViT_B_16_Weights'

  • @lavanyaravilla1511
    @lavanyaravilla1511 21 days ago

    Hi Aarohi, I'm trying this code in a Jarvis PyTorch environment and I'm getting this error even though the path is correct: "FileNotFoundError: Found no valid file for the classes .ipynb_checkpoints. Supported extensions are: .jpg, .jpeg, .png, .ppm, .bmp, .pgm, .tif, .tiff, .webp"

    • @CodeWithAarohi
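      @CodeWithAarohi 19 days ago

      Your code is treating the .ipynb_checkpoints directory as a class folder. Jupyter creates this hidden directory automatically, and it contains no image files, so the dataset loader fails. Delete it from your dataset folders before loading the data.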
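
A small helper can clear these hidden directories out of a dataset tree before the data is loaded. This is a sketch using only the standard library; the function name remove_checkpoint_dirs and the dataset root "data" are hypothetical.

```python
import shutil
from pathlib import Path


def remove_checkpoint_dirs(dataset_root: str) -> list:
    """Delete every .ipynb_checkpoints directory under dataset_root.

    Returns the list of removed paths (as strings).
    """
    removed = []
    root = Path(dataset_root)
    if not root.is_dir():
        return removed
    # Materialise the matches first so nothing is deleted while iterating
    for path in list(root.rglob(".ipynb_checkpoints")):
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(str(path))
    return removed


# "data" is a hypothetical dataset root with one sub-folder per class
print(remove_checkpoint_dirs("data"))
```

Jupyter may recreate the directory when files are opened from the file browser, so it can help to run this right before building the datasets.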

  • @soravsingla6574
    @soravsingla6574 1 year ago

    Well done

  • @salmatiru8797
    @salmatiru8797 1 day ago

    Where did you get the path of the image used for prediction? Please tell me.

    • @CodeWithAarohi
      @CodeWithAarohi 1 day ago

      The test.jpg image is in the same folder as the Jupyter notebook.

  • @sidharthpisharody
    @sidharthpisharody 1 year ago

    Ma'am, is it possible to implement the paper "GA-Nav: Efficient Terrain Segmentation for Robot Navigation in Unstructured Outdoor Environments"? I tried it, but there is a "ModuleNotFoundError: No module named 'mmcv._ext'" error that I am not able to fix. If you could show it, it would be very helpful.

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago

      I will try, but only after finishing the work already in the pipeline.

  • @MaryBrockyn
    @MaryBrockyn 1 year ago

    Hello again,
    I am wondering why you are using categorical cross-entropy as the loss function. I tried to use binary cross-entropy instead, as it is a binary classification problem. I used loss_fn = torch.nn.BCELoss(). Somehow it does not work with your model. Do you have any idea why?

    • @MaryBrockyn
      @MaryBrockyn 1 year ago

      I am receiving this error: "Using a target size (torch.Size([4])) that is different to the input size (torch.Size([4, 2])) is deprecated. Please ensure they have the same size."

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago

      I used cross-entropy loss (CrossEntropyLoss) because it is well suited to multi-class classification, and it works for two classes as well: the model simply outputs one logit per class.

    • @CodeWithAarohi
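      @CodeWithAarohi 1 year ago

      The error you're encountering indicates a mismatch between the size of your target labels and the size of the model's output. The model outputs two values per image (shape [4, 2]) while your labels have shape [4]; BCELoss expects the prediction and the target to have the same shape.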

    • @MaryBrockyn
      @MaryBrockyn 1 year ago

      @CodeWithAarohi But we are dealing with a binary problem, not a multi-class classification problem, right? That's why I assume BCE would be a better loss function.

    • @MaryBrockyn
      @MaryBrockyn 1 year ago

      Also, my program runs perfectly fine with CrossEntropyLoss(). As soon as I simply change the loss to BCELoss, I get the error.
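
The shapes involved in this thread can be compared side by side. The sketch below uses random logits in place of a real model, so the numbers are meaningless; it only shows which output and target shapes each loss accepts. Switching to a binary cross-entropy loss would also require changing the classifier head to a single output unit, which is an assumption about how one might adapt the tutorial code rather than something shown in the video.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size = 4
labels = torch.tensor([0, 1, 1, 0])  # class indices, shape [4]

# Option 1: two output units + CrossEntropyLoss
logits_two = torch.randn(batch_size, 2)  # shape [4, 2]
ce_loss = nn.CrossEntropyLoss()(logits_two, labels)

# Option 2: one output unit + BCEWithLogitsLoss
# The head must produce a single logit per image,
# and the targets must be floats with the same shape as the logits.
logits_one = torch.randn(batch_size, 1)  # shape [4, 1]
targets = labels.float().unsqueeze(1)  # shape [4, 1]
bce_loss = nn.BCEWithLogitsLoss()(logits_one, targets)

print(ce_loss.item(), bce_loss.item())
```

BCEWithLogitsLoss applies the sigmoid internally, whereas plain BCELoss expects probabilities that have already been passed through a sigmoid.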

  • @Shaggysus
    @Shaggysus 10 months ago

    Hi, this video is so helpful. I'm facing an issue with helper_functions. How can I resolve it?

    • @CodeWithAarohi
      @CodeWithAarohi 10 months ago

      Download the helper_functions.py file from here and paste it in your working directory: github.com/AarohiSingla/Image-Classification-Using-Vision-transformer

  • @YashSharma-le3mo
    @YashSharma-le3mo 1 year ago

    Ma'am, what does it actually mean that you modified the classifier head and paused all other layers?

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago +1

      Modified the Classifier Head: Modifying the classifier head means that you are changing the architecture or parameters of the top layers responsible for making predictions. This can include adding or removing layers, changing the number of neurons, or making other architectural changes to better suit your specific task.
      Paused All Other Layers: "Pausing" or "freezing" layers means that you are preventing the weights of the layers in the feature extraction backbone from being updated during training. In other words, you are keeping these layers fixed and not allowing them to learn new features during fine-tuning.

    • @YashSharma-le3mo
      @YashSharma-le3mo 1 year ago

      @CodeWithAarohi OK Ma'am, thank you

  • @joshuahentinlal205
    @joshuahentinlal205 1 year ago

    Ma'am, I have a problem importing engine from going_modular. Can you help, please?

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago +1

      Download the going_modular folder from github.com/AarohiSingla/Image-Classification-Using-Vision-transformer and put it in your current working directory.

    • @joshuahentinlal205
      @joshuahentinlal205 1 year ago

      @CodeWithAarohi Thanks a lot, Ma'am.
      It really helped me.
      One more question: using your code to train on my dataset of just 2000 images, I have been training for more than an hour and not even 1 epoch has completed. It seems to run forever. Can you please help?

  • @shrikar7341
    @shrikar7341 11 months ago

    If I were to give the model an image of, say, an orange, but the only classes are dandelion and daisy, what would the prediction be?

    • @CodeWithAarohi
      @CodeWithAarohi 11 months ago

      If you have added a background class for random images that belong to neither of these 2 classes, the model will label the orange image as background. If you only have these 2 classes, the model will still assign one of them to the orange image, so its prediction will not be meaningful in this case.
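
The reason a two-class model always picks one of its classes is that softmax probabilities sum to 1 over the known classes. The sketch below shows this with made-up logits and adds a simple confidence threshold as one possible heuristic for rejecting unfamiliar images. The threshold of 0.8 and the function names are assumptions, and a network can still be confidently wrong on images unlike its training data.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_threshold(logits, class_names, threshold=0.8):
    """Return the top class, or "unknown" when the top probability is below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "unknown", probs[best]
    return class_names[best], probs[best]


class_names = ["daisy", "dandelion"]
# Made-up logits: a confident prediction and an unsure one
confident = predict_with_threshold([4.0, 0.5], class_names)
unsure = predict_with_threshold([0.3, 0.1], class_names)
print(confident)
print(unsure)
```

A dedicated background class, as suggested in the reply, is usually the more reliable option when out-of-class images are expected.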

  • @MaryBrockyn
    @MaryBrockyn 1 year ago

    Hello again, how can I save the model to use it later on again?

    • @CodeWithAarohi
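      @CodeWithAarohi 1 year ago

      You need to do something like this (for the torchvision ViT used in this tutorial; create_model is a placeholder for the code that builds the ViT with your classifier head):
      # save model weights
      MODEL_PATH = 'custom-model.pth'
      torch.save(model.state_dict(), MODEL_PATH)
      # loading model: rebuild the same architecture, then load the weights
      model = create_model()
      model.load_state_dict(torch.load(MODEL_PATH, map_location=DEVICE))
      model.to(DEVICE)
      model.eval()  # switch off dropout before predicting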
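
For a plain PyTorch model such as the torchvision ViT, a common pattern is to save the state_dict and load it into a freshly built copy of the same architecture. The round trip below is a sketch on a small stand-in network; build_model and the file name are hypothetical, and with the tutorial's model the build step would recreate the ViT with its modified classifier head.

```python
import os
import tempfile

import torch
from torch import nn


def build_model(num_classes: int = 2) -> nn.Module:
    """Stand-in for the code that builds your network (hypothetical small model)."""
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, num_classes))


model = build_model()
model.eval()
sample = torch.rand(1, 3, 8, 8)
with torch.no_grad():
    before = model(sample)

# Save only the learned weights
model_path = os.path.join(tempfile.mkdtemp(), "vit_weights.pth")
torch.save(model.state_dict(), model_path)

# Later: rebuild the same architecture, load the weights, switch to eval mode
restored = build_model()
restored.load_state_dict(torch.load(model_path, map_location="cpu"))
restored.eval()
with torch.no_grad():
    after = restored(sample)

print(torch.allclose(before, after))
```

If predictions differ after reloading, two things worth checking are whether eval() was called and whether the same image transforms are applied at inference time.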

    • @MaryBrockyn
      @MaryBrockyn 1 year ago

      @CodeWithAarohi Thank you!

  • @rajatchakraborty2058
    @rajatchakraborty2058 8 months ago

    When I try to predict an image from my dataset, it shows the error "The size of tensor a (4) must match the size of tensor b (3) at non-singleton dimension 0". Can anyone please help?

    • @CodeWithAarohi
      @CodeWithAarohi 8 months ago

      This means that you're trying to perform an operation that requires the two tensors to have the same size along their first dimension, but they don't match.
      For example, tensor "a" might have a shape of [4, X], where 4 represents the size of the first dimension. Tensor "b" might have a shape of [3, Y], where 3 represents the size of its first dimension. The error is raised because the size (4) of the first dimension of tensor "a" does not match the size (3) of the first dimension of tensor "b".
      When predicting on an image, a likely cause is a 4-channel image (for example a PNG with an alpha channel) being normalized with the 3 mean and std values meant for RGB; converting the image to RGB before applying the transforms usually resolves it.
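
One common way to hit a 4-versus-3 mismatch at prediction time is loading a PNG that carries an alpha channel. The sketch below shows the conversion with Pillow on an in-memory stand-in image; whether this is the actual cause in a given case is an assumption that should be checked by printing the image mode.

```python
from PIL import Image

# Stand-in for a PNG with transparency; normally this comes from Image.open(path)
rgba_image = Image.new("RGBA", (224, 224), (255, 128, 0, 255))
print(rgba_image.mode, len(rgba_image.getbands()))

# Drop the alpha channel so the image has the 3 channels the transforms expect
rgb_image = rgba_image.convert("RGB")
print(rgb_image.mode, len(rgb_image.getbands()))
```

Calling convert("RGB") right after opening every image also handles grayscale files, which have only 1 channel.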

  • @kadapallanithin
    @kadapallanithin 1 year ago

    Thanks for the video

  • @safiullah353
    @safiullah353 1 year ago

    from going_modular.going_modular import engine
    A problem occurs at this line and I'm unable to fix it. Please help me.

  • @soravsingla8782
    @soravsingla8782 1 year ago

    Awesome

  • @sharifimroz6231
    @sharifimroz6231 1 year ago

    could you please share the dataset link?

  • @fatematujjohora6163
    @fatematujjohora6163 1 year ago

    How do I install going_modular? Please answer.

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago

      going_modular is a folder in my repo. You need to put it in your current working directory.

  • @AbdulQadeerRasooli-l8k
    @AbdulQadeerRasooli-l8k 1 year ago

    Hi, thanks for your great video. I ran into this error: "ModuleNotFoundError: No module named 'going_modular'". How do I download the going_modular folder from your link? I cannot download it.

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago

      You can get the folder from here: github.com/AarohiSingla/Image-Classification-Using-Vision-transformer

  • @rohitsk5300
    @rohitsk5300 1 year ago

    How can I export the trained model to use it in an app?

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago

      MODEL_PATH = 'custom-model.pth'
      torch.save(model.state_dict(), MODEL_PATH)

    • @rohitsk5300
      @rohitsk5300 1 year ago

      @CodeWithAarohi I'm not sure how to make that work. My code is almost the same as the code you explained. What exactly should be done to save the model and load it back?

  • @DataTheory92
    @DataTheory92 1 year ago

    Can you make lectures on MLops please?

  • @pifordtechnologiespvtltd5698
    @pifordtechnologiespvtltd5698 10 months ago

    Amazing

  • @priyanshupandey3148
    @priyanshupandey3148 1 year ago +1

    Please upload the notebooks. It is not there.

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago +1

      github.com/AarohiSingla/Image-Classification-Using-Vision-transformer

    • @priyanshupandey3148
      @priyanshupandey3148 1 year ago

      @CodeWithAarohi Thank you very much!

  • @cyreneschannel5017
    @cyreneschannel5017 1 year ago

    I like your video on image classification with a vision transformer. Can you also make a video on using a vision transformer with a video dataset?

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago +1

      Sure

    • @ericobeng3139
      @ericobeng3139 8 months ago

      @CodeWithAarohi Has the video on using ViT for videos already been made?

  • @shresthjain7557
    @shresthjain7557 1 year ago

    How do I download that dataset?

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago

      You can prepare your own dataset by creating 2 folders, one per class, and putting some images of that class in each folder.

  • @imrankhan-el2zp
    @imrankhan-el2zp 8 months ago

    how to resolve this issue???
    ModuleNotFoundError Traceback (most recent call last)
    Cell In[1], line 6
    4 from torch import nn
    5 from torchvision import transforms
    ----> 6 from helper_functions import set_seeds
    ModuleNotFoundError: No module named 'helper_functions'

    • @CodeWithAarohi
      @CodeWithAarohi 8 months ago

      Paste the helper_functions.py file where your Jupyter notebook is.

  • @SuryaPrakash-mp8bz
    @SuryaPrakash-mp8bz 19 days ago

    Ma'am, please provide a dataset of different fruits, and explain how to download it, what changes we need to make in the code, and how to train the model. Please help me.

    • @CodeWithAarohi
      @CodeWithAarohi 19 days ago

      Download images of different fruits from the internet. Then create a separate folder for each fruit and place the related images in it. Changes in code: 1- Change the path of the dataset. 2- Change the number of classes.
      Watch this video again and you will see where I discuss the number of classes and the dataset.

  • @teetanrobotics5363
    @teetanrobotics5363 1 year ago

    Awesome tutorials. Aarohi, could you please make a code tutorial for video super-resolution using ESRGAN?

    • @CodeWithAarohi
      @CodeWithAarohi 1 year ago

      Sure, I will make the video after finishing the videos already in the pipeline.

  • @cyberhard
    @cyberhard 1 year ago

    Excellent as usual! How well do vision transformers compare to traditional CNNs for image classification?

    • @DataTheory92
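      @DataTheory92 1 year ago

      Vision transformers can perform better than CNNs on image tasks when they are pretrained on very large datasets, as reported by researchers; on smaller datasets CNNs often remain competitive.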

    • @DataTheory92
      @DataTheory92 1 year ago

      For more complex tasks we now have LLMs, which some consider to have made classical ML and ordinary neural networks outdated. First understand why a framework was designed and how it operates, then implement it in code. You will understand more that way.

  • @PawanKumar-fu2fh
    @PawanKumar-fu2fh 9 months ago

    ModuleNotFoundError: No module named 'going_modular'

    • @CodeWithAarohi
      @CodeWithAarohi 9 months ago

      going_modular is a folder. You need to put it in your current working directory and please check the path of it.

  • @hussamsarfraz7952
    @hussamsarfraz7952 1 year ago

    thnx alot

  • @gitgat-wx4vq
    @gitgat-wx4vq 9 months ago

    import torch
    import maxvit
    # from .maxvit import MaxViT, max_vit_tiny_224, max_vit_small_224, max_vit_base_224, max_vit_large_224
    # Tiny model
    network: maxvit.MaxViT = maxvit.max_vit_tiny_224(num_classes=1000)
    input = torch.rand(1, 3, 224, 224)
    output = network(input)
    My goal is to give an image of shape (1, 3, 224, 224) as input and generate a text description of it as output. How should I do that, and what should I add to this code?

    • @CodeWithAarohi
      @CodeWithAarohi 9 months ago

      To achieve this, you'll need to use a different model architecture and approach, as image classification models like MaxViT are not designed for generating textual descriptions.