Document Metadata Extraction with Fenic
This example demonstrates how to extract structured metadata from unstructured text data, using Fenic's semantic extraction capabilities. It compares two different approaches for defining extraction schemas.
Overview
Document metadata extraction is a common use case for LLMs, allowing you to automatically parse and structure information from various document types including research papers, product announcements, meeting notes, news articles, and technical documentation.
This example showcases:
- Two Schema Approaches: ExtractSchema vs Pydantic models
- Structured Extraction: Converting unstructured text to structured metadata
- Zero-Shot Extraction: No examples required
Schema Approaches Compared
ExtractSchema Approach
doc_metadata_schema = fc.ExtractSchema([
fc.ExtractSchemaField(
name="keywords",
data_type=fc.ExtractSchemaList(element_type=fc.StringType),
description="List of key topics and terms"
)
])
Advantages:
- ✅ Supports complex data types (lists, nested structures)
- ✅ Native Fenic schema definition
- ✅ Type-safe with proper list handling
Pydantic Model Approach
from typing import Literal
class DocumentMetadata(BaseModel):
keywords: str = Field(..., description="Comma-separated list of key topics and terms")
document_type: Literal["research paper", "product announcement", "meeting notes", "news article", "technical documentation"] = Field(..., description="Type of document")
Advantages:
- ✅ Familiar Python class syntax
- ✅ Leverages existing Pydantic knowledge
- ✅ Supports Literal strings for constraining output values
- ❌ Limited to simple data types (str, int, float, bool, Literal)
Sample Data
The example processes 5 diverse document types:
- Research Paper - Academic abstract with technical terms
- Product Announcement - Marketing content with features and pricing
- Meeting Notes - Internal documentation with decisions and action items
- News Article - Breaking news with facts and impact
- Technical Documentation - API reference with specifications
Extracted Metadata Fields
- Title: Main subject or heading of the document
- Document Type: Classification (research paper, product announcement, etc.)
- Date: Any relevant date mentioned (publication, meeting, etc.)
- Keywords: Key topics and terms (list vs comma-separated string)
- Summary: One-sentence overview of the document's purpose
Key Differences in Output
ExtractSchema Keywords:
["machine learning", "climate modeling", "neural networks"]
Pydantic Keywords:
"machine learning, climate modeling, neural networks"
Usage
# Ensure you have OpenAI API key configured
export OPENAI_API_KEY="your-api-key"
# Run the extraction example
python document_extraction.py
Expected Output
The script will display:
- Sample Documents - Overview of loaded documents with text lengths
- ExtractSchema Results - Structured extraction using native Fenic schema
- Pydantic Results - Same extraction using Pydantic model
- Method Comparison - Side-by-side analysis of both approaches
When to Use Each Approach
Choose ExtractSchema when:
- You need complex data types (lists, nested objects)
- Working primarily within Fenic ecosystem
- Type safety is important for downstream processing
Choose Pydantic when:
- You prefer familiar Python class syntax
- Your team has existing Pydantic expertise
- Simple data types meet your requirements
Learning Outcomes
This example teaches:
- How to perform zero-shot semantic extraction
- Differences between schema definition approaches
Perfect for understanding Fenic's semantic extraction capabilities and choosing the right schema approach for your use case.