Build a Multi-Modal Search App with Chroma and CLIP

May 29, 2025 By Tessa Rodriguez

Search isn't limited to words anymore. People expect to use an image or a short phrase and get back something that feels relevant. That’s where multi-modal search comes in. It lets you compare across different data types—like text and images—using shared meaning rather than matching exact words.

Chroma makes this setup approachable. It’s a vector database that stores data in the form of embeddings—numerical representations of content. Whether your input is a sentence, a photo, or a caption, once it’s turned into an embedding, Chroma can compare it with everything else in your collection and bring back similar results.

The steps below outline how to create a working multi-modal search app using Chroma. This isn't about building a full product from scratch—it's about laying the groundwork for a search system that can understand both language and visuals.

Steps to Build a Multi-Modal Search App with Chroma

Step 1: Understand What You’re Working With

Before you start building, it helps to understand what makes multi-modal search work. The key is embeddings. You’ll use a model that can convert different types of content—like photos and sentences—into a form that can be compared numerically. That’s how the system can relate a photo of a dog to the words "a golden retriever catching a ball."

These embeddings are stored in Chroma. When someone searches, their input is converted the same way, and Chroma compares it with what’s already there to find the closest matches.

To begin, make sure you have:

  • An embedding model that works for both text and images (CLIP is a popular choice)
  • Chroma installed and ready to run
  • Your dataset prepared for embedding—images resized and cleaned, text well-formatted

This is where the core concept sits: all data types get translated into the same language—vectors. Once that’s done, Chroma handles the comparison.
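To make the idea concrete, here is a tiny, hypothetical sketch: once two pieces of content are vectors of the same length, a single similarity function can compare them, whether they started as a photo or a sentence. The numbers below are made up for illustration; real CLIP (ViT-B/32) embeddings have 512 dimensions.

python

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; values near 0 mean unrelated content
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" for illustration only
photo_of_dog = [0.9, 0.1, 0.3, 0.0]
caption_dog = [0.8, 0.2, 0.4, 0.1]
caption_car = [0.0, 0.9, 0.1, 0.8]

print(cosine_similarity(photo_of_dog, caption_dog))  # high: related content
print(cosine_similarity(photo_of_dog, caption_car))  # low: unrelated content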

Step 2: Set Up Chroma and Load Your Data

With the concepts in place, it's time to set up Chroma and feed it your content.

Install Chroma with:

bash

pip install chromadb

Next, create a collection that acts like a container for your content:

python

import chromadb
from chromadb.config import Settings

# Start an in-memory Chroma client and create a collection to hold mixed content
client = chromadb.Client(Settings())
collection = client.create_collection(name="multi_modal_store")

Now, let's bring in the embedding model. CLIP works well here because it supports both image and text inputs and maps them into the same embedding space. This allows you to store them side by side and search across types without extra layers of conversion.
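One practical note: the clip module below refers to OpenAI's original CLIP package, which is usually installed from its GitHub repository alongside PyTorch rather than from PyPI under that name. Pip-installable alternatives such as open_clip or the Hugging Face transformers CLIP implementation work in much the same way, though their function names differ from the snippet below.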

python

from PIL import Image
import torch
import clip

# Load the CLIP model and its matching image preprocessing pipeline
model, preprocess = clip.load("ViT-B/32")

def get_image_embedding(image_path):
    # Preprocess the image and encode it into CLIP's shared embedding space
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    with torch.no_grad():
        return model.encode_image(image).squeeze().tolist()

def get_text_embedding(text):
    # Tokenize the text and encode it into the same embedding space
    tokens = clip.tokenize([text])
    with torch.no_grad():
        return model.encode_text(tokens).squeeze().tolist()

Add both images and text into the same collection:

python

image_vec = get_image_embedding("dog.jpg")
text_vec = get_text_embedding("A golden retriever playing fetch.")

collection.add(
    ids=["img_1", "txt_1"],
    embeddings=[image_vec, text_vec],
    metadatas=[{"type": "image"}, {"type": "text"}],
    documents=["dog.jpg", "A golden retriever playing fetch."]
)

This mix of content types in one collection is what gives your app the flexibility to search across formats without switching systems.
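In practice, you'll usually ingest more than one file at a time. Here's a rough sketch of batch ingestion, assuming a hypothetical local folder named images/ and the helper functions and collection defined above:

python

import os

image_dir = "images"  # hypothetical folder of image files
ids, embeddings, metadatas, documents = [], [], [], []

for i, filename in enumerate(sorted(os.listdir(image_dir))):
    if not filename.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    path = os.path.join(image_dir, filename)
    ids.append(f"img_{i}")
    embeddings.append(get_image_embedding(path))
    metadatas.append({"type": "image", "source": path})
    documents.append(path)

# A single add() call handles the whole batch
collection.add(ids=ids, embeddings=embeddings, metadatas=metadatas, documents=documents)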

Step 3: Create the Search Logic

Now that your content is stored and indexed, you can move on to defining how users will interact with it.

Every query goes through the same embedding process you used when uploading the data. Whether it’s a sentence or an image, it gets converted into a vector, and Chroma finds the nearest stored vectors.

Let’s start with a text input:

python

query = "A dog catching a ball"
vec = get_text_embedding(query)

results = collection.query(
    query_embeddings=[vec],
    n_results=3
)

The results will give you document IDs and metadata, which you can use to retrieve the original files or text snippets.
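For example, here is a small sketch of unpacking the response from the query above; Chroma nests each field per query, so the first index selects the results for our single query embedding:

python

for doc_id, doc, meta, dist in zip(
    results["ids"][0],
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0],
):
    print(f"{doc_id} ({meta['type']}, distance {dist:.3f}): {doc}")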

For image input:

python

img_vec = get_image_embedding("fetch_dog.jpg")

results = collection.query(
    query_embeddings=[img_vec],
    n_results=3
)

This approach doesn’t need separate pipelines for each data type. The model does the heavy lifting by embedding everything into one shared space, and Chroma takes care of finding the closest matches.

If you want to allow users to narrow results—say, only return images—you can filter by metadata during the query.
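Chroma accepts a where filter on metadata at query time. A short sketch, reusing the text query vector from above:

python

results = collection.query(
    query_embeddings=[vec],
    n_results=3,
    where={"type": "image"}  # only return items stored with image metadata
)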

Step 4: Connect the Interface

With the core logic working, you need a way to let people use it. A web interface using something like Flask can provide a simple entry point.

Here’s an outline of a basic Flask setup:

python

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/search", methods=["POST"])
def search():
    data = request.json
    mode = data.get("mode")
    content = data.get("input")

    if mode == "text":
        vec = get_text_embedding(content)
    elif mode == "image":
        vec = get_image_embedding(content)
    else:
        return jsonify({"error": "Invalid input type"}), 400

    results = collection.query(
        query_embeddings=[vec],
        n_results=5
    )
    return jsonify(results)

if __name__ == "__main__":
    app.run(debug=True)

This lets users send a POST request with either a text string or an image path. The server handles embedding and querying and sends back results ready to be shown in the interface.
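As a quick sanity check, a client could call the endpoint like this (a sketch using the requests library, assuming the server is running locally on Flask's default port):

python

import requests

response = requests.post(
    "http://127.0.0.1:5000/search",
    json={"mode": "text", "input": "A dog catching a ball"},
)
print(response.json())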

If you're building a full front end, you can wire it up to handle file uploads, display image previews, and support features like filters or result categories.

What You Can Build With It

Once this foundation is in place, you’re free to adapt it to whatever kind of search experience you want to build. The logic stays the same, but the context can change:

  • Search tools for e-commerce, where shoppers find products using a description or a photo
  • Art or media databases where users search by mood, subject, or visual style
  • Educational tools that return similar material using sample input—like showing related flashcards from a single image or line of text

What makes the system work across all of these cases is the consistent embedding and search setup. You store a wide mix of content types, but the user doesn’t need to know that. All they do is provide an input, and the app does the rest.

Conclusion

Building a multi-modal search app with Chroma is straightforward once you break it down. You store your data as embeddings, feed those into Chroma, and use a shared model to interpret every incoming search. This creates a simple yet flexible way to let users search across types—whether they're working with photos, phrases, or a mix of both.

The hardest part is done once the model and data setup are working. From there, you can add interfaces, apply filters, or scale up the content gradually without changing your base logic.
