Build a Multi-Modal Search App with Chroma and CLIP

May 29, 2025 By Tessa Rodriguez

Search isn't limited to words anymore. People expect to use an image or a short phrase and get back something that feels relevant. That’s where multi-modal search comes in. It lets you compare across different data types—like text and images—using shared meaning rather than matching exact words.

Chroma makes this setup approachable. It’s a vector database that stores data in the form of embeddings—numerical representations of content. Whether your input is a sentence, a photo, or a caption, once it’s turned into an embedding, Chroma can compare it with everything else in your collection and bring back similar results.

The steps below outline how to create a working multi-modal search app using Chroma. This isn't about building a full product from scratch—it's about laying the groundwork for a search system that can understand both language and visuals.

Steps to Build a Multi-Modal Search App with Chroma

Step 1: Understand What You’re Working With

Before you start building, it helps to understand what makes multi-modal search work. The key is embeddings. You’ll use a model that can convert different types of content—like photos and sentences—into a form that can be compared numerically. That’s how the system can relate a photo of a dog to the words "a golden retriever catching a ball."

These embeddings are stored in Chroma. When someone searches, their input is converted the same way, and Chroma compares it with what’s already there to find the closest matches.

To begin, make sure you have:

  • An embedding model that works for both text and images (CLIP is a popular choice)
  • Chroma installed and ready to run
  • Your dataset prepared for embedding—images resized and cleaned, text well-formatted

This is where the core concept sits: all data types get translated into the same language—vectors. Once that’s done, Chroma handles the comparison.
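To make the idea concrete, here is a tiny, hypothetical sketch: once two pieces of content are vectors of the same length, a single similarity function can compare them, whether they started as a photo or a sentence. The numbers below are made up for illustration; real CLIP (ViT-B/32) embeddings have 512 dimensions.

python

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; values near 0 mean unrelated content
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" for illustration only
photo_of_dog = [0.9, 0.1, 0.3, 0.0]
caption_dog = [0.8, 0.2, 0.4, 0.1]
caption_car = [0.0, 0.9, 0.1, 0.8]

print(cosine_similarity(photo_of_dog, caption_dog))  # high: related content
print(cosine_similarity(photo_of_dog, caption_car))  # low: unrelated content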

Step 2: Set Up Chroma and Load Your Data

With the concepts in place, it's time to set up Chroma and feed it your content.

Install Chroma with:

bash

pip install chromadb

Next, create a collection that acts like a container for your content:

python

import chromadb
from chromadb.config import Settings

# Start an in-memory Chroma client and create a collection to hold mixed content
client = chromadb.Client(Settings())
collection = client.create_collection(name="multi_modal_store")

Now, let's bring in the embedding model. CLIP works well here because it supports both image and text inputs and maps them into the same embedding space. This allows you to store them side by side and search across types without extra layers of conversion.
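One practical note: the clip module below refers to OpenAI's original CLIP package, which is usually installed from its GitHub repository alongside PyTorch rather than from PyPI under that name. Pip-installable alternatives such as open_clip or the Hugging Face transformers CLIP implementation work in much the same way, though their function names differ from the snippet below.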

python

from PIL import Image
import torch
import clip

# Load the CLIP model and its matching image preprocessing pipeline
model, preprocess = clip.load("ViT-B/32")

def get_image_embedding(image_path):
    # Preprocess the image and encode it into CLIP's shared embedding space
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    with torch.no_grad():
        return model.encode_image(image).squeeze().tolist()

def get_text_embedding(text):
    # Tokenize the text and encode it into the same embedding space
    tokens = clip.tokenize([text])
    with torch.no_grad():
        return model.encode_text(tokens).squeeze().tolist()

Add both images and text into the same collection:

python

image_vec = get_image_embedding("dog.jpg")
text_vec = get_text_embedding("A golden retriever playing fetch.")

collection.add(
    ids=["img_1", "txt_1"],
    embeddings=[image_vec, text_vec],
    metadatas=[{"type": "image"}, {"type": "text"}],
    documents=["dog.jpg", "A golden retriever playing fetch."]
)

This mix of content types in one collection is what gives your app the flexibility to search across formats without switching systems.
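In practice, you'll usually ingest more than one file at a time. Here's a rough sketch of batch ingestion, assuming a hypothetical local folder named images/ and the helper functions and collection defined above:

python

import os

image_dir = "images"  # hypothetical folder of image files
ids, embeddings, metadatas, documents = [], [], [], []

for i, filename in enumerate(sorted(os.listdir(image_dir))):
    if not filename.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    path = os.path.join(image_dir, filename)
    ids.append(f"img_{i}")
    embeddings.append(get_image_embedding(path))
    metadatas.append({"type": "image", "source": path})
    documents.append(path)

# A single add() call handles the whole batch
collection.add(ids=ids, embeddings=embeddings, metadatas=metadatas, documents=documents)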

Step 3: Create the Search Logic

Now that your content is stored and indexed, you can move on to defining how users will interact with it.

Every query goes through the same embedding process you used when uploading the data. Whether it’s a sentence or an image, it gets converted into a vector, and Chroma finds the nearest stored vectors.

Let’s start with a text input:

python

query = "A dog catching a ball"
vec = get_text_embedding(query)

results = collection.query(
    query_embeddings=[vec],
    n_results=3
)

The results will give you document IDs and metadata, which you can use to retrieve the original files or text snippets.
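For example, here is a small sketch of unpacking the response from the query above; Chroma nests each field per query, so the first index selects the results for our single query embedding:

python

for doc_id, doc, meta, dist in zip(
    results["ids"][0],
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0],
):
    print(f"{doc_id} ({meta['type']}, distance {dist:.3f}): {doc}")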

For image input:

python

img_vec = get_image_embedding("fetch_dog.jpg")

results = collection.query(
    query_embeddings=[img_vec],
    n_results=3
)

This approach doesn’t need separate pipelines for each data type. The model does the heavy lifting by embedding everything into one shared space, and Chroma takes care of finding the closest matches.

If you want to allow users to narrow results—say, only return images—you can filter by metadata during the query.
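Chroma accepts a where filter on metadata at query time. A short sketch, reusing the text query vector from above:

python

results = collection.query(
    query_embeddings=[vec],
    n_results=3,
    where={"type": "image"}  # only return items stored with image metadata
)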

Step 4: Connect the Interface

With the core logic working, you need a way to let people use it. A web interface using something like Flask can provide a simple entry point.

Here’s an outline of a basic Flask setup:

python

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/search", methods=["POST"])
def search():
    data = request.json
    mode = data.get("mode")
    content = data.get("input")

    if mode == "text":
        vec = get_text_embedding(content)
    elif mode == "image":
        vec = get_image_embedding(content)
    else:
        return jsonify({"error": "Invalid input type"}), 400

    results = collection.query(
        query_embeddings=[vec],
        n_results=5
    )
    return jsonify(results)

if __name__ == "__main__":
    app.run(debug=True)

This lets users send a POST request with either a text string or an image path. The server handles embedding and querying and sends back results ready to be shown in the interface.
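As a quick sanity check, a client could call the endpoint like this (a sketch using the requests library, assuming the server is running locally on Flask's default port):

python

import requests

response = requests.post(
    "http://127.0.0.1:5000/search",
    json={"mode": "text", "input": "A dog catching a ball"},
)
print(response.json())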

If you're building a full front end, you can wire it up to handle file uploads, display image previews, and support features like filters or result categories.

What You Can Build With It

Once this foundation is in place, you’re free to adapt it to whatever kind of search experience you want to build. The logic stays the same, but the context can change:

  • Search tools for e-commerce, where shoppers find products using a description or a photo
  • Art or media databases where users search by mood, subject, or visual style
  • Educational tools that return similar material using sample input—like showing related flashcards from a single image or line of text

What makes the system work across all of these cases is the consistent embedding and search setup. You store a wide mix of content types, but the user doesn’t need to know that. All they do is provide an input, and the app does the rest.

Conclusion

Building a multi-modal search app with Chroma is straightforward once you break it down. You store your data as embeddings, feed those into Chroma, and use a shared model to interpret every incoming search. This creates a simple yet flexible way to let users search across types—whether they're working with photos, phrases, or a mix of both.

The hardest part is done once the model and data setup are working. From there, you can add interfaces, apply filters, or scale up the content gradually without changing your base logic.
