The Challenge
Answering natural-language questions about images usually depends on expensive cloud APIs, while deploying vision-language models locally is complex.
The Solution
Deployed the Salesforce/blip-vqa-base model on CPU behind a Gradio interface, enabling real-time visual question answering without a GPU.
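A minimal sketch of what such a deployment might look like is given here for illustration; it is not the project's actual code. It assumes the `transformers`, `torch`, `gradio`, and `Pillow` packages are installed. The function names (`load_model`, `answer_question`, `build_demo`), the 20-token generation limit, and the UI labels are hypothetical choices made for this example.

```python
from functools import lru_cache

MODEL_ID = "Salesforce/blip-vqa-base"  # model named in the text
EMPTY_INPUT_MESSAGE = "Please provide both an image and a question."  # hypothetical wording


@lru_cache(maxsize=1)
def load_model():
    """Load the processor and model once and keep them on CPU.

    Heavy imports are deferred so that importing this file stays cheap.
    """
    from transformers import BlipForQuestionAnswering, BlipProcessor

    processor = BlipProcessor.from_pretrained(MODEL_ID)
    model = BlipForQuestionAnswering.from_pretrained(MODEL_ID)
    model.to("cpu").eval()
    return processor, model


def answer_question(image, question):
    """Return the model's answer for a PIL image and a question string."""
    # Reject empty input before touching the model.
    if image is None or not question or not question.strip():
        return EMPTY_INPUT_MESSAGE

    import torch

    processor, model = load_model()
    inputs = processor(
        images=image.convert("RGB"),
        text=question.strip(),
        return_tensors="pt",
    )
    # Inference only: no gradients are needed.
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=20)  # assumed limit
    return processor.decode(output_ids[0], skip_special_tokens=True)


def build_demo():
    """Wrap answer_question in a simple Gradio interface."""
    import gradio as gr

    return gr.Interface(
        fn=answer_question,
        inputs=[gr.Image(type="pil", label="Image"), gr.Textbox(label="Question")],
        outputs=gr.Textbox(label="Answer"),
        title="Visual Question Answering (BLIP, CPU)",
    )

# To serve locally: build_demo().launch()
```

Caching the loaded model means the weights are read only on the first request, which matters on CPU where load time is noticeable. Actual response latency will depend on the machine and image size.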
Key Achievement
Enabled multimodal AI capabilities on standard hardware, opening possibilities for AR/VR applications in EdTech and MedTech.
Technologies Used
Transformers, PyTorch, Gradio, Computer Vision, NLP
