Providers
Providers are external libraries and services that supply the core functionality for embedding models and text splitting. These integrations connect Foundation4.ai to pre-trained models and algorithms that drive your pipelines.
Embedding Models
Embedding models are machine learning models that transform text into vector representations (embeddings). These vectors capture semantic meaning, so texts with similar meanings have similar embeddings even when they use different words.
How Embeddings Work
- Input: Text (a word, sentence, or document)
- Processing: The embedding model encodes the text into a high-dimensional vector
- Output: A numerical representation that can be compared with other embeddings using distance metrics
Embeddings are essential for vector search: documents whose embeddings lie close to a query's embedding are returned as semantically similar results.
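The comparison step can be illustrated with a toy example. The three-dimensional vectors below are invented for illustration; real embedding models produce hundreds of dimensions, but the distance metric works the same way:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means identical direction, near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real models produce 384+ dimensions).
query   = [0.9, 0.1, 0.2]
doc_near = [0.8, 0.2, 0.1]  # semantically close to the query
doc_far  = [0.1, 0.9, 0.7]  # semantically distant

print(cosine_similarity(query, doc_near) > cosine_similarity(query, doc_far))  # True
```

Vector search generalizes this idea: the store ranks all documents by their similarity to the query embedding and returns the top matches.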
Embedding Model Providers
Foundation4.ai integrates with several embedding providers:
- FastEmbed: Lightweight, fast embedding models optimized for speed and efficiency
- Hugging Face: Access to a vast library of pre-trained transformer models for embeddings
- GPT4All: Local embedding models that run on your infrastructure without external dependencies
- OpenAI: Hosted embedding models accessed through the OpenAI API
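Whichever provider you pick, the integration reduces to the same contract: text in, vector out. The sketch below illustrates that contract with a stand-in backend; the class and method names are illustrative, not Foundation4.ai's actual API:

```python
from typing import Protocol

class EmbeddingModel(Protocol):
    """Illustrative provider contract: any embedding backend maps texts to vectors."""
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class ToyHashEmbedder:
    """Stand-in backend that hashes characters into a fixed-size vector.
    A real provider (FastEmbed, Hugging Face, GPT4All, OpenAI) would run a
    trained model here instead."""
    def __init__(self, dim: int = 8):
        self.dim = dim

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            vec = [0.0] * self.dim
            for ch in text:
                vec[ord(ch) % self.dim] += 1.0
            vectors.append(vec)
        return vectors

model: EmbeddingModel = ToyHashEmbedder()
print(len(model.embed(["hello world"])[0]))  # 8
```

Because every backend satisfies the same contract, swapping providers is a configuration change rather than a pipeline rewrite.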
Important Considerations
When selecting an embedding model provider, keep in mind that the underlying libraries must be loaded in your environment. This means:
- Required dependencies must be installed before use
- The specific model files may need to be downloaded
- Different providers have different resource requirements (CPU, memory, disk space)
- Once configured in a pipeline, the same embedding model is used consistently for all documents
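Because provider libraries are optional dependencies, a defensive import check surfaces missing packages early instead of failing mid-pipeline. The package names below are examples; substitute whichever providers you have configured:

```python
import importlib.util

def provider_available(package: str) -> bool:
    """Return True if the provider's Python package is importable in this environment."""
    return importlib.util.find_spec(package) is not None

# Check each configured provider before building the pipeline.
for pkg in ("fastembed", "nltk"):
    status = "installed" if provider_available(pkg) else "missing - run: pip install " + pkg
    print(f"{pkg}: {status}")
```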
Text Splitters
Text splitters are algorithms that break documents into smaller, manageable chunks. Since embedding models typically have token limits and context windows, splitting large documents ensures they can be properly processed and indexed.
Why Text Splitting Matters
- Improves search quality: Smaller chunks are more focused and easier to match with user queries
- Respects model limits: Embedding models have maximum token counts; splitting ensures compliance
- Preserves context: Overlapping chunks carry shared text across chunk boundaries, so meaning is not lost mid-sentence
- Enables RAG: Smaller chunks retrieved during vector search provide more precise context to LLMs
Text Splitter Strategies
Text splitters vary in how they segment content:
- Fixed-size splitting: Divides text into chunks of a specified character or token count
- Semantic splitting: Breaks at natural boundaries (sentences, paragraphs) to preserve meaning
- Overlap-based splitting: Creates overlapping chunks to maintain context across boundaries
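The first and third strategies combine naturally. The sketch below splits on characters with a configurable overlap; production splitters typically count tokens rather than characters, but the sliding-window logic is the same:

```python
def split_fixed(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks; consecutive chunks share
    `overlap` characters so context at chunk boundaries is preserved."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_fixed("abcdefghij", chunk_size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note how each chunk repeats the last two characters of its predecessor: that repetition is the overlap doing its job.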
Text Splitter Providers
Like embedding models, text splitters are provided by external libraries:
- FastEmbed: Includes efficient text splitting integrated with its embedding capabilities
- Hugging Face: Provides tokenizers and splitters aligned with specific transformer models
- NLTK: Natural Language Toolkit offering language-aware sentence and word splitting
- Custom implementations: Build your own splitter for specialized document formats
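A custom implementation can be as simple as a regular expression over sentence punctuation. This naive sketch is for illustration only; a language-aware splitter such as NLTK's sentence tokenizer handles abbreviations and edge cases far better:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    Will mis-split abbreviations like 'e.g.' - prefer a language-aware
    splitter (e.g. NLTK) for production documents."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("Chunks matter. Smaller is better! Right?"))
# ['Chunks matter.', 'Smaller is better!', 'Right?']
```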
Important Considerations
Similar to embedding models, text splitter implementations require their libraries to be loaded:
- Provider libraries and dependencies must be installed
- Some splitters need language-specific data files (e.g., sentence tokenizers)
- The choice of splitter should align with your document type and language
- Once selected for a pipeline, the same splitting strategy applies to all new documents