
Build a Multi-Modal Search App with Chroma and CLIP

May 29, 2025 By Tessa Rodriguez

Search isn't limited to words anymore. People expect to use an image or a short phrase and get back something that feels relevant. That’s where multi-modal search comes in. It lets you compare across different data types—like text and images—using shared meaning rather than matching exact words.

Chroma makes this setup approachable. It’s a vector database that stores data in the form of embeddings—numerical representations of content. Whether your input is a sentence, a photo, or a caption, once it’s turned into an embedding, Chroma can compare it with everything else in your collection and bring back similar results.

The steps below outline how to create a working multi-modal search app using Chroma. This isn't about building a full product from scratch—it's about laying the groundwork for a search system that can understand both language and visuals.

Steps to Build a Multi-Modal Search App with Chroma

Step 1: Understand What You’re Working With

Before you start building, it helps to understand what makes multi-modal search work. The key is embeddings. You’ll use a model that can convert different types of content—like photos and sentences—into a form that can be compared numerically. That’s how the system can relate a photo of a dog to the words "a golden retriever catching a ball."

These embeddings are stored in Chroma. When someone searches, their input is converted the same way, and Chroma compares it with what’s already there to find the closest matches.

To begin, make sure you have:

  • An embedding model that works for both text and images (CLIP is a popular choice)
  • Chroma installed and ready to run
  • Your dataset prepared for embedding—images resized and cleaned, text well-formatted

This is where the core concept sits: all data types get translated into the same language—vectors. Once that’s done, Chroma handles the comparison.
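To make the shared-vector idea concrete, here is a minimal sketch of how two embeddings can be compared with cosine similarity, the kind of similarity measure vector search builds on. The three-dimensional vectors below are made up purely for illustration; real CLIP embeddings are 512-dimensional for the ViT-B/32 model used later.

python

import numpy as np

# Made-up example vectors; real embeddings have hundreds of dimensions
image_vec = np.array([0.9, 0.1, 0.3])   # pretend this came from a dog photo
text_vec = np.array([0.8, 0.2, 0.25])   # pretend this came from "a golden retriever"
other_vec = np.array([-0.5, 0.9, 0.1])  # pretend this came from an unrelated caption

def cosine_similarity(a, b):
    # Close to 1.0 means the vectors point the same way; lower means less related
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(image_vec, text_vec))   # high: photo and caption are close
print(cosine_similarity(image_vec, other_vec))  # lower: unrelated content is farther away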

Step 2: Set Up Chroma and Load Your Data

With the concepts in place, it's time to set up Chroma and feed it your content.

Install Chroma with:

bash

pip install chromadb

Next, create a collection that acts like a container for your content:

python

import chromadb
from chromadb.config import Settings

# Start a Chroma client and create a collection to hold both image and text embeddings
client = chromadb.Client(Settings())
collection = client.create_collection(name="multi_modal_store")

Now, let's bring in the embedding model. CLIP works well here because it supports both image and text inputs and maps them into the same embedding space. This allows you to store them side by side and search across types without extra layers of conversion.

python

from PIL import Image
import torch
import clip

# Load CLIP; it maps images and text into the same embedding space
model, preprocess = clip.load("ViT-B/32")

def get_image_embedding(image_path):
    # Preprocess the image, encode it, and return a plain list Chroma can store
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    with torch.no_grad():
        return model.encode_image(image).squeeze().tolist()

def get_text_embedding(text):
    # Tokenize the text and encode it with the same model
    tokens = clip.tokenize([text])
    with torch.no_grad():
        return model.encode_text(tokens).squeeze().tolist()

Add both images and text into the same collection:

python

image_vec = get_image_embedding("dog.jpg")
text_vec = get_text_embedding("A golden retriever playing fetch.")

collection.add(
    ids=["img_1", "txt_1"],
    embeddings=[image_vec, text_vec],
    metadatas=[{"type": "image"}, {"type": "text"}],
    documents=["dog.jpg", "A golden retriever playing fetch."]
)

This mix of content types in one collection is what gives your app the flexibility to search across formats without switching systems.
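If you have more than a couple of files, the same pattern scales to a whole folder. The sketch below assumes a local directory called images/ and a placeholder ID scheme; adjust both to fit your dataset.

python

import os

image_dir = "images"  # hypothetical folder of image files

ids, embeddings, metadatas, documents = [], [], [], []
for i, filename in enumerate(sorted(os.listdir(image_dir))):
    if not filename.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    path = os.path.join(image_dir, filename)
    ids.append(f"batch_img_{i}")
    embeddings.append(get_image_embedding(path))
    metadatas.append({"type": "image"})
    documents.append(path)

# One add() call with parallel lists stores the whole batch at once
collection.add(ids=ids, embeddings=embeddings, metadatas=metadatas, documents=documents)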

Step 3: Create the Search Logic

Now that your content is stored and indexed, you can move on to defining how users will interact with it.

Every query goes through the same embedding process you used when uploading the data. Whether it’s a sentence or an image, it gets converted into a vector, and Chroma finds the nearest stored vectors.

Let’s start with a text input:

python

query = "A dog catching a ball"
vec = get_text_embedding(query)

results = collection.query(
    query_embeddings=[vec],
    n_results=3
)

The results will give you document IDs and metadata, which you can use to retrieve the original files or text snippets.
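Chroma returns those results as a dictionary of parallel lists, with one inner list per query embedding. A short loop like this sketch shows one way to unpack what came back:

python

# Each field holds one inner list per query embedding, so index [0] for a single query
for doc_id, doc, meta in zip(results["ids"][0],
                             results["documents"][0],
                             results["metadatas"][0]):
    print(doc_id, meta["type"], doc)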

For image input:

python

img_vec = get_image_embedding("fetch_dog.jpg")

results = collection.query(
    query_embeddings=[img_vec],
    n_results=3
)

This approach doesn’t need separate pipelines for each data type. The model does the heavy lifting by embedding everything into one shared space, and Chroma takes care of finding the closest matches.

If you want to allow users to narrow results—say, only return images—you can filter by metadata during the query.
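The query call accepts a where argument for exactly this. For example, to keep only items stored with image metadata:

python

results = collection.query(
    query_embeddings=[vec],
    n_results=3,
    where={"type": "image"}  # only return items whose metadata marks them as images
)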

Step 4: Connect the Interface

With the core logic working, you need a way to let people use it. A web interface using something like Flask can provide a simple entry point.

Here’s an outline of a basic Flask setup:

python

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/search", methods=["POST"])
def search():
    data = request.json
    mode = data.get("mode")
    content = data.get("input")

    # Embed the query with the same functions used when the data was stored
    if mode == "text":
        vec = get_text_embedding(content)
    elif mode == "image":
        vec = get_image_embedding(content)
    else:
        return jsonify({"error": "Invalid input type"})

    results = collection.query(
        query_embeddings=[vec],
        n_results=5
    )
    return jsonify(results)

if __name__ == "__main__":
    app.run(debug=True)

This lets users send a POST request with either a text string or an image path. The server handles embedding and querying and sends back results ready to be shown in the interface.
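As a quick check, you can call the endpoint from a short client script once the server is running; the URL below assumes Flask's default local address and port.

python

import requests

# Send a text query to the local development server started above
response = requests.post(
    "http://127.0.0.1:5000/search",
    json={"mode": "text", "input": "A dog catching a ball"}
)
print(response.json())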

If you're building a full front end, you can wire it up to handle file uploads, display image previews, and support features like filters or result categories.
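One rough way to handle uploads is an extra route that accepts a file, writes it to a temporary path, and reuses the same image embedding function. The route name and form field here are placeholders rather than part of the setup above.

python

import os
import tempfile

@app.route("/search-image", methods=["POST"])
def search_image():
    uploaded = request.files.get("image")  # hypothetical form field name
    if uploaded is None:
        return jsonify({"error": "No image uploaded"})

    # Write the upload to a temporary file so PIL can open it by path
    fd, temp_path = tempfile.mkstemp(suffix=".jpg")
    os.close(fd)
    uploaded.save(temp_path)

    vec = get_image_embedding(temp_path)
    os.remove(temp_path)

    results = collection.query(query_embeddings=[vec], n_results=5)
    return jsonify(results)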

What You Can Build With It

Once this foundation is in place, you’re free to adapt it to whatever kind of search experience you want to build. The logic stays the same, but the context can change:

  • Search tools for e-commerce, where shoppers find products using a description or a photo
  • Art or media databases where users search by mood, subject, or visual style
  • Educational tools that return similar material using sample input—like showing related flashcards from a single image or line of text

What makes the system work across all of these cases is the consistent embedding and search setup. You store a wide mix of content types, but the user doesn’t need to know that. All they do is provide an input, and the app does the rest.

Conclusion

Building a multi-modal search app with Chroma is straightforward once you break it down. You store your data as embeddings, feed those into Chroma, and use a shared model to interpret every incoming search. This creates a simple yet flexible way to let users search across types, whether they're working with photos, phrases, or a mix of both.

The hardest part is done once the model and data setup are working. From there, you can add interfaces, apply filters, or scale the content gradually without changing your base logic.