Search isn't limited to words anymore. People expect to use an image or a short phrase and get back something that feels relevant. That’s where multi-modal search comes in. It lets you compare across different data types—like text and images—using shared meaning rather than matching exact words.
Chroma makes this setup approachable. It’s a vector database that stores data in the form of embeddings—numerical representations of content. Whether your input is a sentence, a photo, or a caption, once it’s turned into an embedding, Chroma can compare it with everything else in your collection and bring back similar results.
The steps below outline how to create a working multi-modal search app using Chroma. This isn't about building a full product from scratch—it's about laying the groundwork for a search system that can understand both language and visuals.
Before you start building, it helps to understand what makes multi-modal search work. The key is embeddings. You’ll use a model that can convert different types of content—like photos and sentences—into a form that can be compared numerically. That’s how the system can relate a photo of a dog to the words "a golden retriever catching a ball."
These embeddings are stored in Chroma. When someone searches, their input is converted the same way, and Chroma compares it with what’s already there to find the closest matches.
To begin, make sure you have a working Python environment, the chromadb package, the CLIP model along with PyTorch and Pillow, and a small set of images and text snippets to index.
This is where the core concept sits: all data types get translated into the same language—vectors. Once that’s done, Chroma handles the comparison.
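As a toy illustration (not part of the app itself), here is how two small vectors might be compared with cosine similarity, the kind of "closeness" score a vector database relies on. The numbers and the use of NumPy here are purely for demonstration; real CLIP embeddings have hundreds of dimensions:

```python
import numpy as np

# Toy vectors standing in for real embeddings of a caption and a photo.
caption_vec = np.array([0.9, 0.1, 0.3])
photo_vec = np.array([0.8, 0.2, 0.25])

# Cosine similarity: close to 1.0 means similar meaning, close to 0.0 means unrelated.
similarity = np.dot(caption_vec, photo_vec) / (
    np.linalg.norm(caption_vec) * np.linalg.norm(photo_vec)
)
print(f"similarity: {similarity:.3f}")
```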
With the concepts in place, it's time to set up Chroma and feed it your content.
Install Chroma with:
```bash
pip install chromadb
```
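The rest of the walkthrough also relies on PyTorch, Pillow, OpenAI's CLIP package, and Flask. If you don't already have them, an install along these lines should cover it (installing CLIP from its GitHub repository is one common approach; adjust to your environment):

```bash
pip install torch pillow flask
pip install git+https://github.com/openai/CLIP.git
```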
Next, create a collection that acts like a container for your content:
```python
import chromadb
from chromadb.config import Settings

# Start a Chroma client and create a collection to hold both image and text embeddings.
client = chromadb.Client(Settings())
collection = client.create_collection(name="multi_modal_store")
```
Now, let's bring in the embedding model. CLIP works well here because it supports both image and text inputs and maps them into the same embedding space. This allows you to store them side by side and search across types without extra layers of conversion.
```python
from PIL import Image
import torch
import clip

# Load CLIP; it maps images and text into the same embedding space.
model, preprocess = clip.load("ViT-B/32")

def get_image_embedding(image_path):
    # Preprocess the image and encode it into a vector.
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    with torch.no_grad():
        return model.encode_image(image).squeeze().tolist()

def get_text_embedding(text):
    # Tokenize the text and encode it into a vector in the same space.
    tokens = clip.tokenize([text])
    with torch.no_grad():
        return model.encode_text(tokens).squeeze().tolist()
```
Add both images and text into the same collection:
```python
image_vec = get_image_embedding("dog.jpg")
text_vec = get_text_embedding("A golden retriever playing fetch.")

collection.add(
    ids=["img_1", "txt_1"],
    embeddings=[image_vec, text_vec],
    metadatas=[{"type": "image"}, {"type": "text"}],
    documents=["dog.jpg", "A golden retriever playing fetch."]
)
```
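If you have more than a handful of items, the same add() call accepts whole lists at once. Here's a minimal sketch, assuming a local folder named images/ full of .jpg files and reusing the get_image_embedding helper defined above:

```python
import os

image_dir = "images"  # assumed folder of .jpg files to index
filenames = [f for f in os.listdir(image_dir) if f.lower().endswith(".jpg")]

collection.add(
    # Start numbering at 2 so the ids don't collide with img_1 added earlier.
    ids=[f"img_{i}" for i, _ in enumerate(filenames, start=2)],
    embeddings=[get_image_embedding(os.path.join(image_dir, f)) for f in filenames],
    metadatas=[{"type": "image"} for _ in filenames],
    documents=filenames,
)
```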
This mix of content types in one collection is what gives your app the flexibility to search across formats without switching systems.
Now that your content is stored and indexed, you can move on to defining how users will interact with it.
Every query goes through the same embedding process you used when uploading the data. Whether it’s a sentence or an image, it gets converted into a vector, and Chroma finds the nearest stored vectors.
Let’s start with a text input:
```python
query = "A dog catching a ball"
vec = get_text_embedding(query)

results = collection.query(
    query_embeddings=[vec],
    n_results=3
)
```
The results will give you document IDs and metadata, which you can use to retrieve the original files or text snippets.
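Concretely, the query response is a dictionary of parallel lists, with one inner list per query embedding. A short sketch of how you might walk through it:

```python
# results holds parallel lists, one inner list per query embedding.
for doc_id, doc, meta in zip(
    results["ids"][0], results["documents"][0], results["metadatas"][0]
):
    # For images, the stored document is a file path; for text, it is the snippet itself.
    print(doc_id, meta["type"], doc)
```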
For image input:
```python
img_vec = get_image_embedding("fetch_dog.jpg")

results = collection.query(
    query_embeddings=[img_vec],
    n_results=3
)
```
This approach doesn’t need separate pipelines for each data type. The model does the heavy lifting by embedding everything into one shared space, and Chroma takes care of finding the closest matches.
If you want to allow users to narrow results—say, only return images—you can filter by metadata during the query.
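For example, Chroma's query() accepts a where clause that is matched against the metadata you stored with each item, so limiting a search to images takes one extra argument:

```python
results = collection.query(
    query_embeddings=[vec],
    n_results=3,
    where={"type": "image"}  # only return items stored with metadata {"type": "image"}
)
```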
With the core logic working, you need a way to let people use it. A web interface using something like Flask can provide a simple entry point.
Here’s an outline of a basic Flask setup:
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/search", methods=["POST"])
def search():
    data = request.json
    mode = data.get("mode")
    content = data.get("input")

    # Embed the query with the same helpers used at indexing time.
    if mode == "text":
        vec = get_text_embedding(content)
    elif mode == "image":
        vec = get_image_embedding(content)
    else:
        return jsonify({"error": "Invalid input type"}), 400

    results = collection.query(
        query_embeddings=[vec],
        n_results=5
    )
    return jsonify(results)

if __name__ == "__main__":
    app.run(debug=True)
```
This lets users send a POST request with either a text string or an image path. The server handles embedding and querying and sends back results ready to be shown in the interface.
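For instance, with the server running locally on Flask's default port, a text query could be sent like this (the JSON keys match what the route above reads):

```bash
curl -X POST http://127.0.0.1:5000/search \
  -H "Content-Type: application/json" \
  -d '{"mode": "text", "input": "A dog catching a ball"}'
```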
If you're building a full front end, you can wire it up to handle file uploads, display image previews, and support features like filters or result categories.
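As one possible starting point, here's a hypothetical upload route, assuming the front end posts the image under a file field and that uploads are saved to a local uploads/ folder before being embedded. It reuses the app, helpers, and collection defined earlier:

```python
import os
from flask import request, jsonify

UPLOAD_DIR = "uploads"  # assumed local folder for incoming images
os.makedirs(UPLOAD_DIR, exist_ok=True)

@app.route("/search-image", methods=["POST"])
def search_image():
    # Save the uploaded file, then reuse the same embedding and query path.
    uploaded = request.files["file"]
    path = os.path.join(UPLOAD_DIR, uploaded.filename)
    uploaded.save(path)

    vec = get_image_embedding(path)
    results = collection.query(query_embeddings=[vec], n_results=5)
    return jsonify(results)
```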
Once this foundation is in place, you're free to adapt it to whatever kind of search experience you want to build, whether that's a product catalog, a photo library, or an internal document archive. The logic stays the same; only the context changes.
What makes the system work across all of these cases is the consistent embedding and search setup. You store a wide mix of content types, but the user doesn’t need to know that. All they do is provide an input, and the app does the rest.
Building a multi-modal search app with Chroma is straightforward once you break it down. You store your data as embeddings, feed those into Chroma, and use a shared model to interpret every incoming search. The result is a simple yet flexible way to let users search across types, whether they're working with photos, phrases, or a mix of both.
The hardest part is done once the model and data setup are working. From there, building useful interfaces, applying filters, or scaling the content can grow gradually without changing your base logic.