Search isn't limited to words anymore. People expect to use an image or a short phrase and get back something that feels relevant. That’s where multi-modal search comes in. It lets you compare across different data types—like text and images—using shared meaning rather than matching exact words.
Chroma makes this setup approachable. It’s a vector database that stores data in the form of embeddings—numerical representations of content. Whether your input is a sentence, a photo, or a caption, once it’s turned into an embedding, Chroma can compare it with everything else in your collection and bring back similar results.
The steps below outline how to create a working multi-modal search app using Chroma. This isn't about building a full product from scratch—it's about laying the groundwork for a search system that can understand both language and visuals.
Before you start building, it helps to understand what makes multi-modal search work. The key is embeddings. You’ll use a model that can convert different types of content—like photos and sentences—into a form that can be compared numerically. That’s how the system can relate a photo of a dog to the words "a golden retriever catching a ball."
These embeddings are stored in Chroma. When someone searches, their input is converted the same way, and Chroma compares it with what’s already there to find the closest matches.
To begin, make sure you have a working Python environment where you can install packages, plus a small set of images and short text snippets to index.
This is where the core concept sits: every data type gets translated into a single shared language of vectors. Once that's done, Chroma handles the comparison.
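To make "comparing embeddings" concrete, here is a toy illustration of cosine similarity, the kind of metric vector search relies on. The three-dimensional vectors are made up purely for demonstration; real CLIP embeddings have hundreds of dimensions.
python
import numpy as np

def cosine_similarity(a, b):
    # Higher values mean the two vectors point in more similar directions.
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings.
photo_of_dog = [0.9, 0.1, 0.3]
caption_dog = [0.8, 0.2, 0.25]
caption_car = [0.1, 0.9, 0.7]

print(cosine_similarity(photo_of_dog, caption_dog))  # relatively high
print(cosine_similarity(photo_of_dog, caption_car))  # relatively low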
With the concepts in place, it's time to set up Chroma and feed it your content.
Install Chroma with:
bash
pip install chromadb
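The examples below also rely on PyTorch, Pillow, and OpenAI's reference CLIP package. One common way to install them (assuming you want that reference implementation rather than an alternative such as open_clip) is:
bash
pip install torch torchvision pillow ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git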
Next, create a collection that acts like a container for your content:
python
import chromadb
from chromadb.config import Settings

# Start an in-memory Chroma client; fine for experimenting locally.
client = chromadb.Client(Settings())

# A collection is the container that will hold both image and text embeddings.
collection = client.create_collection(name="multi_modal_store")
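The in-memory client above loses everything when the process exits. If you want the collection to survive restarts, recent Chroma releases also offer a persistent client; a minimal sketch, assuming a local directory path of your choosing:
python
import chromadb

# Embeddings and metadata are written to the given directory on disk.
client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection(name="multi_modal_store")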
Now, let's bring in the embedding model. CLIP works well here because it supports both image and text inputs and maps them into the same embedding space. This allows you to store them side by side and search across types without extra layers of conversion.
python
from PIL import Image
import torch
import clip

# Load CLIP once; it maps images and text into the same embedding space.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def get_image_embedding(image_path):
    # Preprocess the image and encode it into a CLIP embedding.
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        return model.encode_image(image).squeeze().tolist()

def get_text_embedding(text):
    # Tokenize the text and encode it into the same embedding space.
    tokens = clip.tokenize([text]).to(device)
    with torch.no_grad():
        return model.encode_text(tokens).squeeze().tolist()
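One practical detail: CLIP similarity is normally computed on unit-length vectors, while Chroma's default index ranks by L2 distance. If rankings look off, you can either normalize each embedding before storing and querying it, or create the collection with a cosine index. A hedged sketch of both options (the helper name normalize is just for illustration):
python
import torch

def normalize(vec):
    # Scale an embedding to unit length so L2 distance ranks results like cosine similarity.
    t = torch.tensor(vec)
    return (t / t.norm()).tolist()

# Alternative: ask Chroma for a cosine index when creating the collection.
# collection = client.create_collection(
#     name="multi_modal_store",
#     metadata={"hnsw:space": "cosine"},
# )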
Add both images and text into the same collection:
python
image_vec = get_image_embedding("dog.jpg")
text_vec = get_text_embedding("A golden retriever playing fetch.")

collection.add(
    ids=["img_1", "txt_1"],
    embeddings=[image_vec, text_vec],
    metadatas=[{"type": "image"}, {"type": "text"}],
    documents=["dog.jpg", "A golden retriever playing fetch."]
)
This mix of content types in one collection is what gives your app the flexibility to search across formats without switching systems.
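In practice you will index more than two items. A loop over a folder of images follows the same pattern, reusing the get_image_embedding helper and collection from above; the ./images directory here is a hypothetical example.
python
import os

image_dir = "./images"  # hypothetical folder of image files
ids, embeddings, metadatas, documents = [], [], [], []

for i, name in enumerate(sorted(os.listdir(image_dir))):
    if not name.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    path = os.path.join(image_dir, name)
    ids.append(f"img_{i}")
    embeddings.append(get_image_embedding(path))
    metadatas.append({"type": "image"})
    documents.append(path)

if ids:
    collection.add(ids=ids, embeddings=embeddings, metadatas=metadatas, documents=documents)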
Now that your content is stored and indexed, you can move on to defining how users will interact with it.
Every query goes through the same embedding process you used when uploading the data. Whether it’s a sentence or an image, it gets converted into a vector, and Chroma finds the nearest stored vectors.
Let’s start with a text input:
python
query = "A dog catching a ball"
vec = get_text_embedding(query)

results = collection.query(
    query_embeddings=[vec],
    n_results=3
)
The results will give you document IDs and metadata, which you can use to retrieve the original files or text snippets.
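Chroma returns the matches as parallel lists, one entry per query embedding. A quick way to inspect them:
python
# Index 0 selects the results for our single query embedding.
for doc_id, doc, meta, dist in zip(
    results["ids"][0],
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0],
):
    print(f"{doc_id} ({meta['type']}, distance={dist:.3f}): {doc}")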
For image input:
python
img_vec = get_image_embedding("fetch_dog.jpg")

results = collection.query(
    query_embeddings=[img_vec],
    n_results=3
)
This approach doesn’t need separate pipelines for each data type. The model does the heavy lifting by embedding everything into one shared space, and Chroma takes care of finding the closest matches.
If you want to allow users to narrow results—say, only return images—you can filter by metadata during the query.
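With the metadata added earlier, that filter is a where clause on the query; for example, restricting matches to stored images:
python
results = collection.query(
    query_embeddings=[vec],
    n_results=3,
    where={"type": "image"}  # only return entries stored with image metadata
)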
With the core logic working, you need a way to let people use it. A web interface using something like Flask can provide a simple entry point.
Here’s an outline of a basic Flask setup:
python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/search", methods=["POST"])
def search():
    data = request.json
    mode = data.get("mode")
    content = data.get("input")

    # Embed the query with the same model used at indexing time.
    if mode == "text":
        vec = get_text_embedding(content)
    elif mode == "image":
        vec = get_image_embedding(content)
    else:
        return jsonify({"error": "Invalid input type"}), 400

    results = collection.query(
        query_embeddings=[vec],
        n_results=5
    )
    return jsonify(results)

if __name__ == "__main__":
    app.run(debug=True)
This lets users send a POST request with either a text string or an image path. The server handles embedding and querying and sends back results ready to be shown in the interface.
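For a quick check without a front end, you can exercise the endpoint with curl. The address below assumes Flask's default local debug server.
bash
curl -X POST http://127.0.0.1:5000/search \
  -H "Content-Type: application/json" \
  -d '{"mode": "text", "input": "A dog catching a ball"}'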
If you're building a full front end, you can wire it up to handle file uploads, display image previews, and support features like filters or result categories.
Once this foundation is in place, you're free to adapt it to whatever kind of search experience you want to build; the logic stays the same even as the context changes.
What makes the system work across these different cases is the consistent embedding and search setup. You store a wide mix of content types, but the user doesn't need to know that. All they do is provide an input, and the app does the rest.
Building a multi-modal search app with Chroma is straightforward once you break it down. You store your data as embeddings, feed them into Chroma, and use a shared model to interpret every incoming search. The result is a simple yet flexible way to let users search across types, whether they're working with photos, phrases, or a mix of both.
The hardest part is done once the model and data setup are working. From there, you can build richer interfaces, apply filters, or scale up the content gradually without changing your base logic.