Document Metadata Extraction with Fenic
This example demonstrates how to extract structured metadata from unstructured text data, using Fenic's semantic extraction capabilities.
Overview
Document metadata extraction is a common use case for LLMs, allowing you to automatically parse and structure information from various document types including research papers, product announcements, meeting notes, news articles, and technical documentation.
This example showcases:
- Structured Extraction: Converting unstructured text to structured metadata
- Zero-Shot Extraction: No examples required
How it works
- Schema Definition with Pydantic
Define a Pydantic model to represent the structure of the data you want to extract. Each field must include a natural language description. This schema drives prompt generation and model output parsing.
- LLM Orchestration
Fenic uses the model provider of your choice to call the LLM with a structured output or tool-calling interface. The LLM returns data that conforms to the schema you defined.
- Data Structuring
The extracted data is represented as a struct column in a DataFrame with native Fenic struct fields. From there, it can be:
- Unnested into individual columns
- Exploded if it contains arrays
- Processed in place as nested data
Because Fenic maps Pydantic models to a strongly typed, columnar data model, certain Python types are not currently supported:
- Non-Optional Union types: Not expressible in Fenic's type system
- Dictionaries: Fenic does not yet support map types (future support via a JsonType is planned)
- Custom classes / dataclasses: These are stateful or logic-heavy constructs that don't fit the declarative nature of Fenic's data model
Despite these constraints, you can define complex extraction schemas using nested Pydantic models, optional fields, and lists—enabling robust and expressive structured extraction pipelines.
Defining a Pydantic Model
from typing import Literal
class DocumentMetadata(BaseModel):
keywords: str = Field(..., description="Comma-separated list of key topics and terms")
document_type: Literal["research paper", "product announcement", "meeting notes", "news article", "technical documentation"] = Field(..., description="Type of document")
Sample Data
The example processes 5 diverse document types:
- Research Paper - Academic abstract with technical terms
- Product Announcement - Marketing content with features and pricing
- Meeting Notes - Internal documentation with decisions and action items
- News Article - Breaking news with facts and impact
- Technical Documentation - API reference with specifications
Extracted Metadata Fields
- Title: Main subject or heading of the document
- Document Type: Classification (research paper, product announcement, etc.)
- Date: Any relevant date mentioned (publication, meeting, etc.)
- Keywords: Key topics and terms (list vs comma-separated string)
- Summary: One-sentence overview of the document's purpose
Usage
# Ensure you have OpenAI API key configured
export OPENAI_API_KEY="your-api-key"
# Run the extraction example
python document_extraction.py
Expected Output
The script will display:
- Sample Documents - Overview of loaded documents with text lengths
- Results - Extraction results