Providers
Providers are external libraries and services that supply the core functionality for embedding models and text splitting. These integrations connect Foundation4.ai to pre-trained models and algorithms that drive your pipelines.
Embedding Models
Embedding models are machine learning models that transform text into vector representations (embeddings). These vectors capture semantic meaning, so texts with similar meanings have similar embeddings even when they use different words.
How Embeddings Work
- Input: Text (a word, sentence, or document)
- Processing: The embedding model encodes the text into a high-dimensional vector
- Output: A numerical representation that can be compared with other embeddings using distance metrics
Embeddings are essential for vector search: documents whose embeddings lie close to a query's embedding are returned as semantically similar results.
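The comparison step can be illustrated with a toy example. The three-dimensional vectors below are invented for illustration; real embedding models produce hundreds of dimensions, but the distance metric works the same way:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means identical direction, near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real models produce 384+ dimensions).
query   = [0.9, 0.1, 0.2]
doc_near = [0.8, 0.2, 0.1]  # semantically close to the query
doc_far  = [0.1, 0.9, 0.7]  # semantically distant

print(cosine_similarity(query, doc_near) > cosine_similarity(query, doc_far))  # True
```

Vector search generalizes this idea: the store ranks all documents by their similarity to the query embedding and returns the top matches.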
Embedding Model Providers
Foundation4.ai integrates with several embedding providers:
- FastEmbed: Lightweight, fast embedding models optimized for speed and efficiency
- Hugging Face: Access to a vast library of pre-trained transformer models for embeddings
- GPT4All: Local embedding models that run on your infrastructure without external dependencies
- OpenAI: Hosted embedding models accessed through the OpenAI API
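Whichever provider you pick, the integration reduces to the same contract: text in, vector out. The sketch below illustrates that contract with a stand-in backend; the class and method names are illustrative, not Foundation4.ai's actual API:

```python
from typing import Protocol

class EmbeddingModel(Protocol):
    """Illustrative provider contract: any embedding backend maps texts to vectors."""
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class ToyHashEmbedder:
    """Stand-in backend that hashes characters into a fixed-size vector.
    A real provider (FastEmbed, Hugging Face, GPT4All, OpenAI) would run a
    trained model here instead."""
    def __init__(self, dim: int = 8):
        self.dim = dim

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            vec = [0.0] * self.dim
            for ch in text:
                vec[ord(ch) % self.dim] += 1.0
            vectors.append(vec)
        return vectors

model: EmbeddingModel = ToyHashEmbedder()
print(len(model.embed(["hello world"])[0]))  # 8
```

Because every backend satisfies the same contract, swapping providers is a configuration change rather than a pipeline rewrite.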
Important Considerations
When selecting an embedding model provider, keep in mind that the underlying libraries must be loaded in your environment. This means:
- Required dependencies must be installed before use
- The specific model files may need to be downloaded
- Different providers have different resource requirements (CPU, memory, disk space)
- Once configured in a pipeline, the same embedding model is used consistently for all documents
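Because provider libraries are optional dependencies, a defensive import check surfaces missing packages early instead of failing mid-pipeline. The package names below are examples; substitute whichever providers you have configured:

```python
import importlib.util

def provider_available(package: str) -> bool:
    """Return True if the provider's Python package is importable in this environment."""
    return importlib.util.find_spec(package) is not None

# Check each configured provider before building the pipeline.
for pkg in ("fastembed", "nltk"):
    status = "installed" if provider_available(pkg) else "missing - run: pip install " + pkg
    print(f"{pkg}: {status}")
```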
Text Splitters
Text splitters are algorithms that break documents into smaller, manageable chunks. Since embedding models typically have token limits and context windows, splitting large documents ensures they can be properly processed and indexed.
Why Text Splitting Matters
- Improves search quality: Smaller chunks are more focused and easier to match with user queries
- Respects model limits: Embedding models have maximum token counts; splitting ensures compliance
- Preserves context: Overlapping chunks carry shared text across chunk boundaries, so meaning is not lost mid-sentence
- Enables RAG: Smaller chunks retrieved during vector search provide more precise context to LLMs
Text Splitter Strategies
Text splitters vary in how they segment content:
- Fixed-size splitting: Divides text into chunks of a specified character or token count
- Semantic splitting: Breaks at natural boundaries (sentences, paragraphs) to preserve meaning
- Overlap-based splitting: Creates overlapping chunks to maintain context across boundaries
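The first and third strategies combine naturally. The sketch below splits on characters with a configurable overlap; production splitters typically count tokens rather than characters, but the sliding-window logic is the same:

```python
def split_fixed(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks; consecutive chunks share
    `overlap` characters so context at chunk boundaries is preserved."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_fixed("abcdefghij", chunk_size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note how each chunk repeats the last two characters of its predecessor: that repetition is the overlap doing its job.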
Text Splitter Providers
Like embedding models, text splitters are provided by external libraries:
- FastEmbed: Includes efficient text splitting integrated with its embedding capabilities
- Hugging Face: Provides tokenizers and splitters aligned with specific transformer models
- NLTK: Natural Language Toolkit offering language-aware sentence and word splitting
- Custom implementations: Build your own splitter for specialized document formats
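A custom implementation can be as simple as a regular expression over sentence punctuation. This naive sketch is for illustration only; a language-aware splitter such as NLTK's sentence tokenizer handles abbreviations and edge cases far better:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    Will mis-split abbreviations like 'e.g.' - prefer a language-aware
    splitter (e.g. NLTK) for production documents."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("Chunks matter. Smaller is better! Right?"))
# ['Chunks matter.', 'Smaller is better!', 'Right?']
```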
Important Considerations
Similar to embedding models, text splitter implementations require their libraries to be loaded:
- Provider libraries and dependencies must be installed
- Some splitters need language-specific data files (e.g., sentence tokenizers)
- The choice of splitter should align with your document type and language
- Once selected for a pipeline, the same splitting strategy applies to all new documents