Markdown Processing with Fenic

This example demonstrates how to process structured markdown documents using fenic's specialized markdown functions, JSON processing, and text extraction capabilities. We use the "Attention Is All You Need" paper, the foundational work that introduced the Transformer architecture, which powers modern large language models (LLMs).

What This Example Shows

The example exercises the following fenic functions and operations:

  • markdown.generate_toc() - Generate automatic table of contents from document headings
  • markdown.extract_header_chunks() - Extract and structure document sections into DataFrame rows
  • markdown.to_json() - Convert markdown to structured JSON for complex querying
  • json.jq() - Navigate complex document structures with powerful jq queries
  • text.extract() - Parse structured text using templates for field extraction
  • DataFrame operations - Filter, explode, unnest, and transform document data
  • Text processing - Split and parse content using fenic's templating functionality

Files

  • attention_is_all_you_need.md - OCR output of the "Attention Is All You Need" paper
  • markdown_processing.py - Python script demonstrating markdown processing workflow

Running the Example

cd examples/markdown_processing
uv run python markdown_processing.py

Make sure you have your OpenAI API key set in your environment:

export OPENAI_API_KEY="your-api-key-here"

What You'll Learn

This example shows how fenic turns a raw markdown document into structured, queryable data:

1. Document Structure Analysis

  • Load markdown documents into fenic DataFrames
  • Cast content to MarkdownType for specialized processing
  • Generate automatic table of contents from document hierarchy
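
To make the idea of a heading-derived table of contents concrete, here is a minimal sketch in plain Python. It uses only the standard library and is not how fenic's markdown.generate_toc() is implemented; the function name build_toc, the output layout, and the sample headings are assumptions made for illustration.

```python
import re

# Matches ATX headings such as "## 3.1 Encoder" (one to six leading '#').
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def build_toc(markdown: str) -> str:
    """Return an indented plain-text outline of the headings in `markdown`."""
    lines = []
    in_fence = False
    for line in markdown.splitlines():
        # Skip fenced code so '#' comments inside code are not read as headings.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING_RE.match(line)
        if match:
            level = len(match.group(1))
            lines.append("  " * (level - 1) + "- " + match.group(2))
    return "\n".join(lines)


# Illustrative input; the real example reads attention_is_all_you_need.md.
sample = """# Attention Is All You Need
## 1 Introduction
Text.
## 3 Model Architecture
### 3.1 Encoder and Decoder Stacks
"""
toc = build_toc(sample)
print(toc)
```

The indentation depth follows the heading level, so a skipped level in the source shows up as a jump in the outline.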

2. Section Extraction and Structuring

  • Extract document sections using markdown.extract_header_chunks()
  • Convert nested arrays to DataFrame rows with explode() and unnest()
  • Work with structured section data (headings, content, hierarchy paths)
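
The section data described above (heading, content, hierarchy path) can be pictured with a small standard-library sketch. This is an illustration of the general technique, not fenic's markdown.extract_header_chunks(); the record keys, the max_level parameter, and the " > " path separator are all assumptions.

```python
import re
from typing import Any, Dict, List

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def header_chunks(markdown: str, max_level: int = 2) -> List[Dict[str, Any]]:
    """Split `markdown` into one record per heading of depth <= max_level."""
    chunks: List[Dict[str, Any]] = []
    path: List[str] = []  # headings from the root down to the current one
    current = None
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match and len(match.group(1)) <= max_level:
            level = len(match.group(1))
            # Keep only the ancestors of the new heading, then append it.
            path = path[: level - 1] + [match.group(2)]
            current = {
                "heading": match.group(2),
                "level": level,
                "path": " > ".join(path),
                "content": [],
            }
            chunks.append(current)
        elif current is not None:
            # Deeper headings and body text both stay inside the open chunk.
            current["content"].append(line)
    for chunk in chunks:
        chunk["content"] = "\n".join(chunk["content"]).strip()
    return chunks


# Illustrative input with placeholder text.
sample = """# Paper
Intro text.
## 1 Introduction
Recurrent models...
### 1.1 Detail
More.
## References
[1] Placeholder entry.
"""
chunks = header_chunks(sample, max_level=2)
for chunk in chunks:
    print(chunk["level"], chunk["path"])
```

Each record corresponds to what would become one DataFrame row after exploding an array of chunks.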

3. Traditional Text Processing

  • Filter DataFrames to find specific sections (e.g., References)
  • Parse structured content using text splitting with regex patterns
  • Handle academic citation formats and numbered references
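
As a rough illustration of regex-based splitting of numbered references, the sketch below uses Python's re module directly rather than fenic. The pattern, the function name split_references, and the placeholder entries are assumptions; the actual script may use a different pattern.

```python
import re
from typing import List, Tuple


def split_references(section: str) -> List[Tuple[int, str]]:
    """Split a references section into (number, text) pairs."""
    # Split at newlines that are followed by a "[n] " marker; the lookahead
    # keeps the marker attached to the entry it introduces.
    entries = re.split(r"\n(?=\[\d+\]\s)", section.strip())
    parsed = []
    for entry in entries:
        match = re.match(r"\[(\d+)\]\s+(.*)", entry, flags=re.DOTALL)
        if match:
            # Collapse wrapped lines and repeated spaces into single spaces.
            parsed.append((int(match.group(1)), " ".join(match.group(2).split())))
    return parsed


# Placeholder entries, not real citations from the paper.
references = (
    "[1] Placeholder Author A. Placeholder title one,\n"
    "    continued on a second line. 2016.\n"
    "[2] Placeholder Author B. Placeholder title two. 2014.\n"
)
parsed = split_references(references)
for number, text in parsed:
    print(number, text)
```

Splitting on a lookahead rather than on the marker itself means wrapped lines stay with their entry and the reference number is not lost.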

4. JSON-Based Document Processing

  • Convert markdown to structured JSON with markdown.to_json()
  • Navigate complex nested document structures using json.jq()
  • Handle hierarchical content with the jq query language
  • Extract data from deeply nested JSON structures
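
For readers unfamiliar with jq, the following plain-Python sketch shows the kind of recursive traversal that a filter such as `.. | objects | select(.type == "heading") | .text` performs. The JSON shape here is hypothetical; the structure that markdown.to_json() actually produces may use different keys and nesting.

```python
import json
from typing import Any, Iterator


def walk(node: Any) -> Iterator[dict]:
    """Yield every dict in a nested JSON value, depth first."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)


# Hypothetical document tree; keys such as "type" and "children" are assumed.
doc = json.loads("""
{"type": "document", "children": [
  {"type": "heading", "level": 1, "text": "Attention Is All You Need", "children": [
    {"type": "paragraph", "text": "Abstract text."},
    {"type": "heading", "level": 2, "text": "References", "children": [
      {"type": "paragraph", "text": "[1] Placeholder entry."}
    ]}
  ]}
]}
""")

# Collect the text of every heading node, however deeply it is nested.
headings = [node["text"] for node in walk(doc) if node.get("type") == "heading"]
print(headings)
```

A jq query expresses the same traversal in one line, which is why it suits deeply nested document trees.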

5. Template-Based Text Extraction

  • Use text.extract() with templates for structured data parsing
  • Extract multiple fields from text in a single operation
  • Parse academic citations into separate reference numbers and content
  • Handle structured text formats with template patterns
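
Template-based extraction can be understood as compiling a template into a pattern with one named capture per field. The sketch below does that with the standard library; it is not fenic's text.extract(), and the `${name}` placeholder syntax, the helper names, and the sample citation are assumptions for illustration.

```python
import re
from typing import Dict, Optional

PLACEHOLDER_RE = re.compile(r"\$\{(\w+)\}")


def compile_template(template: str) -> "re.Pattern":
    """Turn a template like '[${ref_num}] ${content}' into a regex."""
    parts = []
    pos = 0
    for match in PLACEHOLDER_RE.finditer(template):
        # Literal text between placeholders must match exactly.
        parts.append(re.escape(template[pos:match.start()]))
        # Each placeholder becomes a lazy named group.
        parts.append(f"(?P<{match.group(1)}>.+?)")
        pos = match.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts), flags=re.DOTALL)


def extract(template: str, text: str) -> Optional[Dict[str, str]]:
    """Return the captured fields, or None if `text` does not fit the template."""
    match = compile_template(template).fullmatch(text)
    return match.groupdict() if match else None


# Placeholder citation text, not a real entry from the paper.
fields = extract("[${ref_num}] ${content}", "[12] Placeholder Author. Placeholder title.")
print(fields)
```

Both fields come back from a single match, so the reference number and the citation text never need a second parsing pass.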

6. Advanced DataFrame Operations

  • Combine multiple markdown functions in a single select statement
  • Chain operations for complex data transformations
  • Process hierarchical document structures efficiently
  • Cast between different data types (JsonType ↔ StringType)

These techniques are a good fit for academic paper analysis pipelines, research document processing, citation extraction systems, and preparing structured content for downstream analysis.