# Database Embedding

## Markdown to ChromaDB Indexer

This tool indexes markdown files into ChromaDB for efficient semantic search, supporting both the default ChromaDB embeddings and OpenAI's `text-embedding-3-small` model for enhanced search quality.
### Features

- Flexible embedding options:
  - Default ChromaDB embeddings (no API key required)
  - Optional OpenAI `text-embedding-3-small` model for enhanced quality
- Recursively processes markdown files in a directory
- Intelligent text chunking with configurable size and overlap
- Sentence-aware splitting to maintain context
- Extracts and preserves frontmatter metadata
- Converts markdown to searchable text
- Stores documents with their metadata in ChromaDB
- Supports semantic search queries
- Batch processing for large datasets
### Installation

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- (Optional) Set up OpenAI embeddings by creating a `.env` file with your OpenAI API key:

  ```
  OPENAI_API_KEY=your_api_key_here
  ```
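To confirm the key is visible to the process at runtime, here is a minimal sketch; it assumes the key is loaded with python-dotenv, which this README does not confirm:

```python
# Minimal check that the OpenAI key is available.
# Assumes python-dotenv; the tool itself may load the key differently.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY not set; add it to .env or export it")
```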
### Usage

To index your markdown files:

```bash
python index_markdown.py index /path/to/your/markdown/directory
```
Optional arguments:

- `--db-path`: Custom path for ChromaDB persistence (default: `chroma_db`)
- `--chunk-size`: Maximum number of characters per chunk (default: 500)
- `--chunk-overlap`: Number of characters to overlap between chunks (default: 50)
- `--use-openai`: Use OpenAI embeddings instead of the default embeddings (requires an API key)
### Examples

Index with custom settings:

```bash
# Using default embeddings
python index_markdown.py index /path/to/markdown --chunk-size 1000 --chunk-overlap 100

# Using OpenAI embeddings (requires API key)
python index_markdown.py index /path/to/markdown --use-openai
```
### Python API Usage

```python
from index_markdown import MarkdownIndexer

# Initialize the indexer with custom settings
indexer = MarkdownIndexer(
    persist_dir="chroma_db",
    chunk_size=500,    # characters per chunk
    chunk_overlap=50,  # overlap between chunks
    use_openai=True,   # set to True to use OpenAI embeddings
)

# Index a directory of markdown files
indexer.index_directory("/path/to/markdown/files")

# Query the indexed documents
results = indexer.query_documents("your search query", n_results=5)
```
### Text Chunking
The indexer uses an intelligent chunking strategy (a sketch of the approach follows this list):
- Sentence-Aware Splitting: Text is split at sentence boundaries to maintain context
- Configurable Chunk Size: Control the size of each chunk (default: 500 characters)
- Overlap Between Chunks: Maintains context between chunks (default: 50 characters)
- Metadata Preservation: Each chunk maintains:
  - Original document metadata
  - Chunk index
  - Total chunks in document
  - Source file path
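For illustration only, a minimal sketch of this strategy; it is not the tool's actual implementation, and the sentence-boundary regex is an assumption:

```python
# Illustrative sentence-aware chunker with character overlap.
import re

def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text at sentence boundaries into ~chunk_size chunks with overlap."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > chunk_size:
            chunks.append(current)
            # Seed the next chunk with the tail of the previous one for context.
            current = current[-chunk_overlap:].lstrip() + " " + sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```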
### Batch Processing
Documents are processed in batches (100 chunks per batch) to efficiently handle large datasets and manage memory usage.
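A minimal sketch of the idea (the helper below is illustrative, not the tool's actual code):

```python
# Add chunks to a ChromaDB collection in fixed-size batches to bound memory use.
BATCH_SIZE = 100  # chunks per collection.add call, as described above

def add_in_batches(collection, ids, documents, metadatas, batch_size=BATCH_SIZE):
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(
            ids=ids[start:end],
            documents=documents[start:end],
            metadatas=metadatas[start:end],
        )
```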
### Notes

- Processes all files with `.md` or `.markdown` extensions
- Each chunk is stored with complete metadata for traceability
- Uses BeautifulSoup for robust HTML parsing
- The ChromaDB persistence directory is created if it doesn't exist
- Unique IDs are generated for each chunk (format: `filename_chunk_N`)
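As an illustration of these notes, a sketch of the load-and-identify steps; it assumes the python-frontmatter and markdown packages alongside BeautifulSoup (only BeautifulSoup is confirmed by this README):

```python
# Illustrative loader: frontmatter -> metadata, markdown -> HTML -> plain text.
from pathlib import Path

import frontmatter  # python-frontmatter package
import markdown
from bs4 import BeautifulSoup

def load_markdown(path: str) -> tuple[str, dict]:
    post = frontmatter.load(path)  # separates YAML frontmatter from the body
    html = markdown.markdown(post.content)
    text = BeautifulSoup(html, "html.parser").get_text()
    return text, dict(post.metadata)

# Chunk IDs follow the filename_chunk_N format noted above; whether the file
# extension is kept in "filename" is not specified, so using .stem is a guess.
ids = [f"{Path('guide.md').stem}_chunk_{n}" for n in range(3)]
# -> ['guide_chunk_0', 'guide_chunk_1', 'guide_chunk_2']
```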
### Querying the Index

#### Using the Command Line
The script provides a simple command-line interface for searching:
```bash
# Basic search
python index_markdown.py query "your search query"

# Search with more results
python index_markdown.py query "your search query" --n-results 10

# Search using OpenAI embeddings
python index_markdown.py query "your search query" --use-openai

# Filter results by file path
python index_markdown.py query "your search query" --path-filter "docs/"

# Specify different database path
python index_markdown.py query "your search query" --db-path "custom_db"
```
#### Using Python
You can query the indexed documents in two ways:
- Using the existing indexer:
  ```python
  from index_markdown import MarkdownIndexer

  # Initialize with the same settings used for indexing
  indexer = MarkdownIndexer(
      persist_dir="chroma_db",  # use the same directory as indexing
      use_openai=True,          # set to True if you used OpenAI embeddings for indexing
  )

  # Simple query
  results = indexer.query_documents(
      query_text="your search query here",
      n_results=5,  # number of results to return
  )

  # Process results
  for i, (document, metadata, score) in enumerate(zip(
      results['documents'][0],
      results['metadatas'][0],
      results['distances'][0],
  )):
      print(f"\nResult {i+1} (similarity: {1 - score:.3f})")
      print(f"Source: {metadata['source_path']}")
      print(f"Chunk: {metadata['chunk_index'] + 1}/{metadata['total_chunks']}")
      print("Content:", document)
  ```
- Using ChromaDB directly:
  ```python
  import chromadb
  from chromadb.utils import embedding_functions

  # Initialize the client
  client = chromadb.PersistentClient(path="chroma_db")

  # Get the collection
  collection = client.get_collection(
      name="markdown_docs",
      # Use the same embedding function as during indexing
      embedding_function=embedding_functions.SentenceTransformerEmbeddingFunction(),
      # Or for OpenAI:
      # embedding_function=embedding_functions.OpenAIEmbeddingFunction(
      #     api_key="your_key",
      #     model_name="text-embedding-3-small",
      # ),
  )

  # Query with filters
  results = collection.query(
      query_texts=["your search query"],
      n_results=5,
      # Optional: filter by an exact metadata value (illustrative path).
      # ChromaDB metadata filters support equality/comparison operators
      # ($eq, $ne, $gt, $lt, $in, ...); substring matching ($contains) only
      # applies to document text via where_document, not to metadata fields.
      where={"source_path": "docs/example.md"},
      # Optional: choose what to include in the results
      include=["metadatas", "documents", "distances"],
  )
  ```
#### Advanced Query Features
- Metadata Filtering: Filter results based on metadata fields:
  ```python
  # Filter by an exact source path (illustrative value); metadata filters do
  # not support substring matching, so match the stored path exactly
  results = collection.query(
      query_texts=["query"],
      where={"source_path": {"$eq": "docs/example.md"}},
  )

  # Filter by chunk index
  results = collection.query(
      query_texts=["query"],
      where={"chunk_index": {"$lt": 3}},  # only the first 3 chunks
  )
  ```
- Batch Queries: Search multiple queries at once:
  ```python
  results = collection.query(
      query_texts=["query1", "query2", "query3"],
      n_results=3,
  )
  ```
#### Query Results

Results include:

- Document Content: The text chunk that matches your query
- Metadata:
  - `source_path`: Original markdown file path
  - `chunk_index`: Position of the chunk in the document
  - `total_chunks`: Total number of chunks in the document
  - Any frontmatter metadata from the original markdown
- Distance Score: Lower scores indicate better matches (cosine distance; similarity can be computed as 1 - distance, as in the example above)
Results are ordered by semantic similarity to the query, with the most relevant chunks appearing first.
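For reference, `query_documents` (like `collection.query`) returns a dict of parallel lists, one inner list per query text; the values below are illustrative:

```python
# Illustrative shape of a query result (values are made up):
results = {
    "ids":       [["guide_chunk_2", "intro_chunk_0"]],
    "documents": [["matching chunk text...", "another chunk..."]],
    "metadatas": [[
        {"source_path": "docs/guide.md", "chunk_index": 2, "total_chunks": 5},
        {"source_path": "docs/intro.md", "chunk_index": 0, "total_chunks": 3},
    ]],
    "distances": [[0.21, 0.35]],  # lower distance = closer match
}
```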