Semantic Joins with Fenic

This example demonstrates how to use Fenic's semantic joins to perform LLM-powered data matching based on natural language reasoning rather than exact equality or similarity scores.

Overview

Semantic joins let you join DataFrames using natural language predicates that are evaluated by language models. Traditional joins require exact key matches, and embedding-based similarity joins compare vector representations. Semantic joins instead ask a language model whether each candidate pair satisfies the predicate, so they can capture relationships that depend on meaning and context.

This example showcases two practical use cases:

Content Recommendation: Matching user interests to relevant articles
Product Recommendations: Suggesting complementary products based on purchase history

Key Features Demonstrated

Natural Language Predicates: Using human-readable join conditions
LLM-Powered Reasoning: Leveraging GPT models for intelligent matching
Cross-Domain Understanding: Connecting concepts across different contexts
Zero-Shot Matching: No training data or examples required

How Semantic Joins Work

Basic Syntax

left_df.semantic.join(
    right_df,
    join_instruction="Natural language predicate with {column:left} and {column:right}"
)

Join Instruction Format

Reference exactly two columns, one from each DataFrame
Mark each column with a :left or :right suffix to indicate which DataFrame it comes from
Phrase the instruction as a boolean predicate that the LLM can evaluate as True or False
Keep the wording clear and unambiguous so that results stay consistent
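
The format rules above can be checked mechanically before any LLM call is made. The following is a minimal sketch in plain Python, not part of Fenic's API: `parse_join_instruction` and `render_pair` are hypothetical helpers that assume placeholders look exactly like `{column:left}` and `{column:right}`.

```python
import re

# Matches placeholders such as {interests:left} or {description:right}.
PLACEHOLDER = re.compile(r"\{(\w+):(left|right)\}")


def parse_join_instruction(instruction: str) -> dict[str, str]:
    """Return {'left': column, 'right': column}, or raise ValueError.

    Hypothetical helper: enforces the "exactly two columns, one per side" rule.
    """
    found = PLACEHOLDER.findall(instruction)  # list of (column, side) tuples
    sides = [side for _, side in found]
    if sorted(sides) != ["left", "right"]:
        raise ValueError(
            "expected exactly one :left and one :right placeholder, "
            f"found {found}"
        )
    return {side: column for column, side in found}


def render_pair(instruction: str, left_row: dict, right_row: dict) -> str:
    """Fill the template with values from one left row and one right row."""
    parse_join_instruction(instruction)  # validate before substituting

    def substitute(match: re.Match) -> str:
        column, side = match.group(1), match.group(2)
        row = left_row if side == "left" else right_row
        return str(row[column])

    return PLACEHOLDER.sub(substitute, instruction)


instruction = (
    "A person with interests '{interests:left}' would be interested "
    "in reading about '{description:right}'"
)
columns = parse_join_instruction(instruction)
prompt = render_pair(
    instruction,
    {"interests": "fixing engines"},
    {"description": "car engine maintenance"},
)
```

The rendered `prompt` is the kind of yes/no statement a model would be asked to judge for a single row pair. How Fenic builds its actual prompts is not covered by this sketch.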

Example 1: Content Recommendation

Data Setup

User Profiles:

Sarah: "I love cooking Italian food and trying new pasta recipes"
Mike: "I enjoy working on cars and fixing engines in my spare time"
Emily: "Gardening is my passion, especially growing vegetables and flowers"
David: "I'm interested in learning about car maintenance and automotive repair"

Articles:

Cooking Pasta Recipes
Car Engine Maintenance
Gardening for Beginners
Advanced Automotive Repair

Semantic Join Implementation

users_df.semantic.join(
    articles_df,
    join_instruction="A person with interests '{interests:left}' would be interested in reading about '{description:right}'"
)

Matching Results

Sarah → Cooking Pasta Recipes ✅
Mike → Car Engine Maintenance + Advanced Automotive Repair ✅
Emily → Gardening for Beginners ✅
David → Car Engine Maintenance + Advanced Automotive Repair ✅
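
Conceptually, the join above behaves like a nested-loop join in which the predicate is decided per row pair. The sketch below models that in plain Python under stated assumptions: `nested_loop_semantic_join` and `toy_judge` are hypothetical names, the rows are toy data rather than the example's dataset, and a word-overlap check stands in for the LLM verdict. It illustrates the shape of the computation, not Fenic's implementation.

```python
from itertools import product
from typing import Callable

Row = dict[str, str]


def nested_loop_semantic_join(
    left_rows: list[Row],
    right_rows: list[Row],
    judge: Callable[[Row, Row], bool],
) -> tuple[list[Row], int]:
    """Keep every (left, right) pair the judge accepts.

    Returns the merged rows and the number of judge evaluations, which is
    always len(left_rows) * len(right_rows).
    """
    matches: list[Row] = []
    evaluations = 0
    for left_row, right_row in product(left_rows, right_rows):
        evaluations += 1
        if judge(left_row, right_row):
            # Assumes the two sides use distinct column names.
            matches.append({**left_row, **right_row})
    return matches, evaluations


def toy_judge(user: Row, article: Row) -> bool:
    """Placeholder for the LLM: True when the two texts share a word."""
    user_words = set(user["interests"].lower().split())
    article_words = set(article["description"].lower().split())
    return bool(user_words & article_words)


# Toy rows for illustration only.
users = [
    {"name": "Sarah", "interests": "pasta recipes"},
    {"name": "Emily", "interests": "growing vegetables"},
]
articles = [
    {"title": "Cooking Pasta Recipes", "description": "pasta recipes for weeknights"},
    {"title": "Gardening for Beginners", "description": "growing vegetables at home"},
]

matches, evaluations = nested_loop_semantic_join(users, articles, toy_judge)
pairs = [(row["name"], row["title"]) for row in matches]
```

A word-overlap judge would miss pairs such as "cars" and "Car Engine Maintenance", which is exactly the gap an LLM-evaluated predicate is meant to close.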

Example 2: Product Recommendations

Sample Data

Customer Purchases:

Alice: Professional DSLR Camera
Bob: Gaming Laptop
Carol: Yoga Mat
Dan: Coffee Maker

Product Catalog:

Camera Lens Kit, Tripod Stand (Photography)
Gaming Mouse, Mechanical Keyboard (Gaming)
Yoga Blocks, Exercise Resistance Bands (Fitness)
Coffee Beans, French Press (Food & Beverage)

Recommendation Logic

purchases_df.semantic.join(
    products_df,
    join_instruction="A customer who bought '{purchased_product:left}' would also be interested in '{product_name:right}'"
)

Recommendation Results

Alice (DSLR Camera) → Camera Lens Kit + Tripod Stand ✅
Bob (Gaming Laptop) → Gaming Mouse + Mechanical Keyboard ✅
Carol (Yoga Mat) → Yoga Blocks + Exercise Resistance Bands ✅
Dan (Coffee Maker) → Coffee Beans + French Press ✅

Technical Details

Session Configuration

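import fenic as fc

config = fc.SessionConfig(
    app_name="semantic_joins",
    semantic=fc.SemanticConfig(
        language_models={
            # "mini" is the alias used to refer to this model
            "mini": fc.OpenAIModelConfig(
                model_name="gpt-4o-mini",
                rpm=500,       # requests-per-minute limit
                tpm=200_000,   # tokens-per-minute limit
            )
        }
    ),
)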

Performance Characteristics

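Complexity: O(m × n), where m and n are the row counts of the left and right DataFrames
LLM Calls: One API call per candidate row pair, so a 100 × 100 join issues 10,000 calls
Rate Limiting: Respects the RPM (requests per minute) and TPM (tokens per minute) limits configured in the session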
Batching: Efficiently batches requests to optimize API usage
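
These characteristics make it possible to estimate the size of a join before running it. The sketch below is a rough back-of-the-envelope calculator, not a Fenic feature: `estimate_join_cost` is a hypothetical helper, it assumes one request per row pair as listed above, and `tokens_per_pair` is a value you would have to measure or guess for your own data.

```python
def estimate_join_cost(
    left_rows: int,
    right_rows: int,
    tokens_per_pair: int,
    rpm: int,
    tpm: int,
) -> dict[str, float]:
    """Estimate request count, token volume, and a lower bound on runtime.

    The runtime bound only reflects the rate limits; model latency and
    retries would add to it.
    """
    pairs = left_rows * right_rows            # O(m x n) candidate pairs
    total_tokens = pairs * tokens_per_pair
    minutes_by_rpm = pairs / rpm              # time forced by the request limit
    minutes_by_tpm = total_tokens / tpm       # time forced by the token limit
    return {
        "pairs": pairs,
        "total_tokens": total_tokens,
        "min_minutes": max(minutes_by_rpm, minutes_by_tpm),
    }


# Limits taken from the session configuration above; 100 tokens per pair
# is an assumed figure for illustration.
small = estimate_join_cost(4, 4, tokens_per_pair=100, rpm=500, tpm=200_000)
large = estimate_join_cost(1_000, 1_000, tokens_per_pair=100, rpm=500, tpm=200_000)
```

Under these assumptions the 4 × 4 joins in this example need 16 requests, while a 1,000 × 1,000 join needs a million requests and is bounded by the request limit rather than the token limit.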

When to Use Semantic Joins

Ideal Use Cases:

Content personalization and recommendation systems
Product cross-selling and upselling
Skill-job matching in recruitment
Entity resolution across different data sources
Question-answer pairing for knowledge bases
Matching customers to services based on their needs

Advantages:

No training data required (zero-shot)
Handles complex reasoning and context
Understands domain-specific relationships
Works with natural language descriptions
Flexible and interpretable join conditions

Considerations:

Higher latency than traditional joins
API costs for LLM usage
Rate limiting for large datasets
Best for moderate-sized datasets (hundreds to low thousands of rows)

Usage

# Ensure your OpenAI API key is configured
export OPENAI_API_KEY="your-api-key"

# Run the semantic joins example
python semantic_joins.py

Expected Output

The script demonstrates both use cases with clear before/after data views:

User-Article Matching: Shows how semantic understanding connects user interests to relevant content
Product Recommendations: Demonstrates intelligent product relationship detection for cross-selling

Learning Outcomes

This example teaches:

How to construct effective natural language join predicates
When semantic joins are preferable to traditional or similarity-based joins
Practical applications in recommendation systems and personalization
The trade-offs between accuracy, performance, and cost

This example is a good starting point for applying LLM reasoning to data joins that go beyond simple keyword matching or embedding similarity.