fenic.api.functions.markdown

Markdown functions.

Functions:

  • extract_header_chunks

    Splits markdown documents into logical chunks based on heading hierarchy.

  • generate_toc

    Generates a table of contents from markdown headings.

  • get_code_blocks

    Extracts all code blocks from a column of Markdown-formatted strings.

  • to_json

    Converts a column of Markdown-formatted strings into a hierarchical JSON representation.

extract_header_chunks

extract_header_chunks(column: ColumnOrName, header_level: int) -> Column

Splits markdown documents into logical chunks based on heading hierarchy.

Parameters:

  • column (ColumnOrName) –

    Input column containing Markdown strings.

  • header_level (int) –

    Heading level to split on (1-6). Creates a new chunk at every heading of this level, including all nested content and subsections.

Returns:

  • Column ( Column ) –

    A column of arrays containing chunk objects with the following structure:

    ArrayType(StructType([
        StructField("heading", StringType),        # Heading text (clean, no markdown)
        StructField("level", IntegerType),         # Heading level (1-6)
        StructField("content", StringType),        # All content under this heading (clean text)
        StructField("parent_heading", StringType), # Parent heading text (or null)
        StructField("full_path", StringType),      # Full breadcrumb path
    ]))
    

Notes
  • Context-preserving: Each chunk contains all content and subsections under the heading.
  • Hierarchical awareness: Includes parent heading context for better LLM understanding.
  • Clean text output: Strips markdown formatting for direct LLM consumption.

Chunking Behavior

With header_level=2, this markdown:

# Introduction
Overview text

## Getting Started
Setup instructions

### Prerequisites
Python 3.8+ required

## API Reference
Function documentation

Produces 2 chunks:

  1. Getting Started chunk (includes Prerequisites subsection)
  2. API Reference chunk
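
The chunking rule described above can be sketched in plain Python. This is an illustration of the documented behavior only, not fenic's implementation: the function name `sketch_header_chunks`, the `" > "` breadcrumb separator, and the way nested heading text is folded into `content` are all assumptions made for the example. It also ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

# ATX headings only ("# Title" through "###### Title").
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def sketch_header_chunks(markdown_text, header_level):
    """Hypothetical helper: start a chunk at every heading of `header_level`."""
    chunks = []
    ancestors = []     # stack of (level, heading text) describing the current path
    open_chunk = None  # chunk currently collecting content, if any

    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level, text = len(match.group(1)), match.group(2)
            # Leave any sections at the same or a deeper level.
            while ancestors and ancestors[-1][0] >= level:
                ancestors.pop()
            if level <= header_level:
                open_chunk = None  # a same-level or shallower heading closes the chunk
            if level == header_level:
                open_chunk = {
                    "heading": text,
                    "level": level,
                    "lines": [],
                    "parent_heading": ancestors[-1][1] if ancestors else None,
                    # Assumed breadcrumb format; the real separator may differ.
                    "full_path": " > ".join([t for _, t in ancestors] + [text]),
                }
                chunks.append(open_chunk)
            ancestors.append((level, text))
            if level <= header_level:
                continue
            line = text  # nested heading: keep its text inside the open chunk
        if open_chunk is not None and line.strip():
            open_chunk["lines"].append(line)

    return [
        {
            "heading": c["heading"],
            "level": c["level"],
            "content": "\n".join(c["lines"]),
            "parent_heading": c["parent_heading"],
            "full_path": c["full_path"],
        }
        for c in chunks
    ]


SAMPLE = "\n".join([
    "# Introduction",
    "Overview text",
    "",
    "## Getting Started",
    "Setup instructions",
    "",
    "### Prerequisites",
    "Python 3.8+ required",
    "",
    "## API Reference",
    "Function documentation",
])

chunks = sketch_header_chunks(SAMPLE, header_level=2)
for chunk in chunks:
    print(chunk["full_path"], "->", repr(chunk["content"]))
```

With `header_level=2`, the text under `# Introduction` belongs to no chunk, and the `Prerequisites` subsection stays inside the `Getting Started` chunk, matching the two chunks listed above.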
Split articles into top-level sections
df.select(markdown.extract_header_chunks(col("articles"), header_level=1))
Split documentation into feature sections
df.select(markdown.extract_header_chunks(col("docs"), header_level=2))
Create fine-grained chunks for detailed analysis
df.select(markdown.extract_header_chunks(col("content"), header_level=3))
Explode chunks into individual rows for processing
chunks_df = df.select(
    markdown.extract_header_chunks(col("markdown"), header_level=2).alias("chunks")
).explode("chunks")
chunks_df.select(
    col("chunks").heading,
    col("chunks").content,
    col("chunks").full_path
)
Source code in src/fenic/api/functions/markdown.py
@validate_call(config=ConfigDict(strict=True, arbitrary_types_allowed=True))
def extract_header_chunks(column: ColumnOrName, header_level: int) -> Column:
    """Splits markdown documents into logical chunks based on heading hierarchy.

    Args:
        column (ColumnOrName): Input column containing Markdown strings.
        header_level (int): Heading level to split on (1-6). Creates a new chunk at every
                            heading of this level, including all nested content and subsections.

    Returns:
        Column: A column of arrays containing chunk objects with the following structure:
            ```python
            ArrayType(StructType([
                StructField("heading", StringType),        # Heading text (clean, no markdown)
                StructField("level", IntegerType),         # Heading level (1-6)
                StructField("content", StringType),        # All content under this heading (clean text)
                StructField("parent_heading", StringType), # Parent heading text (or null)
                StructField("full_path", StringType),      # Full breadcrumb path
            ]))
            ```
    Notes:
        - **Context-preserving**: Each chunk contains all content and subsections under the heading
        - **Hierarchical awareness**: Includes parent heading context for better LLM understanding
        - **Clean text output**: Strips markdown formatting for direct LLM consumption

    Chunking Behavior:
        With `header_level=2`, this markdown:
        ```markdown
        # Introduction
        Overview text

        ## Getting Started
        Setup instructions

        ### Prerequisites
        Python 3.8+ required

        ## API Reference
        Function documentation
        ```
        Produces 2 chunks:

        1. `Getting Started` chunk (includes `Prerequisites` subsection)
        2. `API Reference` chunk

    Example: Split articles into top-level sections
        ```python
        df.select(markdown.extract_header_chunks(col("articles"), header_level=1))
        ```

    Example: Split documentation into feature sections
        ```python
        df.select(markdown.extract_header_chunks(col("docs"), header_level=2))
        ```

    Example: Create fine-grained chunks for detailed analysis
        ```python
        df.select(markdown.extract_header_chunks(col("content"), header_level=3))
        ```

    Example: Explode chunks into individual rows for processing
        ```python
        chunks_df = df.select(
            markdown.extract_header_chunks(col("markdown"), header_level=2).alias("chunks")
        ).explode("chunks")
        chunks_df.select(
            col("chunks").heading,
            col("chunks").content,
            col("chunks").full_path
        )
        ```
    """
    if header_level < 1 or header_level > 6:
        raise ValidationError(f"header_level must be between 1 and 6 (inclusive), but got {header_level}. Use 1 for # headings, 2 for ## headings, etc.")

    return Column._from_logical_expr(
        MdExtractHeaderChunks(Column._from_col_or_name(column)._logical_expr, header_level)
    )

generate_toc

generate_toc(column: ColumnOrName, max_level: Optional[int] = None) -> Column

Generates a table of contents from markdown headings.

Parameters:

  • column (ColumnOrName) –

    Input column containing Markdown strings.

  • max_level (Optional[int], default: None ) –

    Maximum heading level to include in the TOC (1-6). If None (the default), all levels up to 6 are included.

Returns:

  • Column ( Column ) –

    A column of Markdown-formatted table of contents strings.

Notes
  • The TOC is generated using markdown heading syntax (# ## ### etc.)
  • Each heading in the source document becomes a line in the TOC
  • The heading level is preserved in the output
  • This creates a valid markdown document that can be rendered or processed further
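
The notes above can be pictured with a small plain-Python sketch. It is not fenic's implementation: `sketch_toc` is a hypothetical helper that keeps every ATX heading line at or above `max_level` and drops everything else, under the assumption that `None` means "all six levels".

```python
import re
from typing import Optional

# ATX headings only ("# Title" through "###### Title").
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def sketch_toc(markdown_text: str, max_level: Optional[int] = None) -> str:
    """Hypothetical helper: keep heading lines whose level is <= max_level."""
    limit = 6 if max_level is None else max_level  # assumed meaning of None
    toc_lines = []
    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match and len(match.group(1)) <= limit:
            # Re-emit the heading with its original level markers.
            toc_lines.append(f"{match.group(1)} {match.group(2)}")
    return "\n".join(toc_lines)


SAMPLE = "\n".join([
    "# Guide",
    "Intro",
    "",
    "## Install",
    "Steps",
    "",
    "### Linux",
    "Details",
    "",
    "## Usage",
    "More",
])

toc_all = sketch_toc(SAMPLE)
toc_top2 = sketch_toc(SAMPLE, max_level=2)
print(toc_all)
print("---")
print(toc_top2)
```

Because each output line is itself a heading, the result is still a valid markdown document, which is the property the notes call out.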
Generate a complete TOC
df.select(markdown.generate_toc(col("documentation")))
Generate a simplified TOC with only top 2 levels
df.select(markdown.generate_toc(col("documentation"), max_level=2))
Add TOC as a new column
df = df.with_column("toc", markdown.generate_toc(col("content"), max_level=3))
Source code in src/fenic/api/functions/markdown.py
@validate_call(config=ConfigDict(strict=True, arbitrary_types_allowed=True))
def generate_toc(column: ColumnOrName, max_level: Optional[int] = None) -> Column:
    """Generates a table of contents from markdown headings.

    Args:
        column (ColumnOrName): Input column containing Markdown strings.
        max_level (Optional[int]): Maximum heading level to include in the TOC (1-6).
                                 If None (the default), all levels up to 6 are included.

    Returns:
        Column: A column of Markdown-formatted table of contents strings.

    Notes:
        - The TOC is generated using markdown heading syntax (# ## ### etc.)
        - Each heading in the source document becomes a line in the TOC
        - The heading level is preserved in the output
        - This creates a valid markdown document that can be rendered or processed further

    Example: Generate a complete TOC
        ```python
        df.select(markdown.generate_toc(col("documentation")))
        ```

    Example: Generate a simplified TOC with only top 2 levels
        ```python
        df.select(markdown.generate_toc(col("documentation"), max_level=2))
        ```

    Example: Add TOC as a new column
        ```python
        df = df.with_column("toc", markdown.generate_toc(col("content"), max_level=3))
        ```
    """
    if max_level is not None and (max_level < 1 or max_level > 6):
        raise ValidationError(f"max_level must be between 1 and 6 (inclusive), but got {max_level}. Use 1 for # headings, 2 for ## headings, etc.")
    return Column._from_logical_expr(
        MdGenerateTocExpr(Column._from_col_or_name(column)._logical_expr, max_level)
    )

get_code_blocks

get_code_blocks(column: ColumnOrName, language_filter: Optional[str] = None) -> Column

Extracts all code blocks from a column of Markdown-formatted strings.

Parameters:

  • column (ColumnOrName) –

    Input column containing Markdown strings.

  • language_filter (Optional[str], default: None ) –

    Optional language filter to extract only code blocks with a specific language. By default, all code blocks are extracted.

Returns:

  • Column ( Column ) –

    A column of code blocks. The output column type is:

    ArrayType(StructType([
        StructField("language", StringType),
        StructField("code", StringType),
    ]))

Notes
  • Code blocks are parsed from fenced Markdown blocks (e.g., triple backticks ```).
  • Language identifiers are optional and may be null if not provided in the original Markdown.
  • Indented code blocks without fences are not currently supported.
  • This function is useful for extracting embedded logic, configuration, or examples from documentation or notebooks.
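
To make the notes above concrete, here is a plain-Python sketch of fenced-block extraction. It is not fenic's parser: `sketch_code_blocks` is a hypothetical helper, it handles only triple-backtick fences, and it assumes `language_filter` is an exact match on the fence's language identifier.

```python
from typing import Dict, List, Optional

# Built at runtime so this example can itself sit inside a fenced block.
FENCE = "`" * 3


def sketch_code_blocks(
    markdown_text: str, language_filter: Optional[str] = None
) -> List[Dict[str, Optional[str]]]:
    """Hypothetical helper: collect fenced blocks as {"language", "code"} dicts."""
    blocks = []
    language = None
    body = None  # None means "not currently inside a fenced block"

    for line in markdown_text.splitlines():
        stripped = line.strip()
        if body is None:
            if stripped.startswith(FENCE):
                info = stripped[len(FENCE):].strip()
                # A missing identifier yields None, mirroring the "may be null" note.
                language = info.split()[0] if info else None
                body = []
            # Indented (unfenced) code is ignored, as in the notes above.
        elif stripped == FENCE:
            if language_filter is None or language == language_filter:
                blocks.append({"language": language, "code": "\n".join(body)})
            body = None
        else:
            body.append(line)
    return blocks


SAMPLE = "\n".join([
    "Run this first:",
    FENCE + "bash",
    "echo hello",
    FENCE,
    "Then this:",
    FENCE + "python",
    "print('hello')",
    FENCE,
    "A block without a language:",
    FENCE,
    "no language here",
    FENCE,
])

all_blocks = sketch_code_blocks(SAMPLE)
python_only = sketch_code_blocks(SAMPLE, language_filter="python")
print([b["language"] for b in all_blocks])
print(python_only)
```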
Extract all code blocks
df.select(markdown.get_code_blocks(col("markdown_text")))
Explode code blocks into individual rows
# Explode the list of code blocks into individual rows
df = df.with_column("blocks", markdown.get_code_blocks(col("md"))).explode("blocks")
df = df.select(col("blocks")["language"], col("blocks")["code"])
Source code in src/fenic/api/functions/markdown.py
@validate_call(config=ConfigDict(strict=True, arbitrary_types_allowed=True))
def get_code_blocks(column: ColumnOrName, language_filter: Optional[str] = None) -> Column:
    """Extracts all code blocks from a column of Markdown-formatted strings.

    Args:
        column (ColumnOrName): Input column containing Markdown strings.
        language_filter (Optional[str]): Optional language filter to extract only code blocks with a specific language. By default, all code blocks are extracted.

    Returns:
        Column: A column of code blocks. The output column type is:
            ArrayType(StructType([
                StructField("language", StringType),
                StructField("code", StringType),
            ]))

    Notes:
        - Code blocks are parsed from fenced Markdown blocks (e.g., triple backticks ```).
        - Language identifiers are optional and may be null if not provided in the original Markdown.
        - Indented code blocks without fences are not currently supported.
        - This function is useful for extracting embedded logic, configuration, or examples
          from documentation or notebooks.

    Example: Extract all code blocks
        ```python
        df.select(markdown.get_code_blocks(col("markdown_text")))
        ```

    Example: Explode code blocks into individual rows
        ```python
        # Explode the list of code blocks into individual rows
        df = df.with_column("blocks", markdown.get_code_blocks(col("md"))).explode("blocks")
        df = df.select(col("blocks")["language"], col("blocks")["code"])
        ```
    """
    return Column._from_logical_expr(
        MdGetCodeBlocksExpr(Column._from_col_or_name(column)._logical_expr, language_filter)
    )

to_json

to_json(column: ColumnOrName) -> Column

Converts a column of Markdown-formatted strings into a hierarchical JSON representation.

Parameters:

  • column (ColumnOrName) –

    Input column containing Markdown strings.

Returns:

  • Column ( Column ) –

    A column of JSON-formatted strings representing the structured document tree.

Notes
  • This function parses Markdown into a structured JSON format optimized for document chunking, semantic analysis, and jq queries.
  • The output conforms to a custom schema that organizes content into nested sections based on heading levels. This makes it more expressive than flat ASTs like mdast.
  • The full JSON schema is available at: docs.fenic.ai/topics/markdown-json
Supported Markdown Features
  • Headings with nested hierarchy (e.g., h2 → h3 → h4)
  • Paragraphs with inline formatting (bold, italics, links, code, etc.)
  • Lists (ordered, unordered, task lists)
  • Tables with header alignment and inline content
  • Code blocks with language info
  • Blockquotes, horizontal rules, and inline/flow HTML
Convert markdown to JSON
df.select(markdown.to_json(col("markdown_text")))
Extract all level-2 headings with jq
# Combine with jq to extract all level-2 headings
df.select(json.jq(markdown.to_json(col("md")), '.. | select(.type == "heading" and .level == 2)'))
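
The difference between a nested section tree and a flat list of nodes can be sketched in plain Python. The node shape here is hypothetical and is not fenic's schema (that is documented at the URL in the notes): only the `type` and `level` keys are borrowed from the jq example above, while `text`, `children`, and the helper names are assumptions. The `walk` generator plays the role of jq's recursive descent (`..`).

```python
import json
import re

# ATX headings only ("# Title" through "###### Title").
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def sketch_section_tree(markdown_text):
    """Hypothetical helper: nest content under the closest preceding heading."""
    root = {"type": "document", "children": []}
    stack = [(0, root)]  # (heading level, node); the root acts as level 0
    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level = len(match.group(1))
            node = {"type": "heading", "level": level,
                    "text": match.group(2), "children": []}
            # Close sections at the same or a deeper level before attaching.
            while stack[-1][0] >= level:
                stack.pop()
            stack[-1][1]["children"].append(node)
            stack.append((level, node))
        elif line.strip():
            stack[-1][1]["children"].append(
                {"type": "paragraph", "text": line.strip()}
            )
    return root


def walk(node):
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


SAMPLE = "\n".join([
    "# Guide",
    "Intro",
    "",
    "## Install",
    "Steps",
    "",
    "### Linux",
    "Details",
    "",
    "## Usage",
    "More",
])

tree = sketch_section_tree(SAMPLE)
as_json = json.dumps(tree, indent=2)
level2 = [n["text"] for n in walk(tree)
          if n["type"] == "heading" and n.get("level") == 2]
print(level2)
```

In this shape the `Linux` section lives inside `Install`, so a query for one section returns its subsections with it, which a flat heading list cannot do without extra bookkeeping.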
Source code in src/fenic/api/functions/markdown.py
@validate_call(config=ConfigDict(strict=True, arbitrary_types_allowed=True))
def to_json(column: ColumnOrName) -> Column:
    """Converts a column of Markdown-formatted strings into a hierarchical JSON representation.

    Args:
        column (ColumnOrName): Input column containing Markdown strings.

    Returns:
        Column: A column of JSON-formatted strings representing the structured document tree.

    Notes:
        - This function parses Markdown into a structured JSON format optimized for document chunking,
          semantic analysis, and `jq` queries.
        - The output conforms to a custom schema that organizes content into nested sections based
          on heading levels. This makes it more expressive than flat ASTs like `mdast`.
        - The full JSON schema is available at: docs.fenic.ai/topics/markdown-json

    Supported Markdown Features:
        - Headings with nested hierarchy (e.g., h2 → h3 → h4)
        - Paragraphs with inline formatting (bold, italics, links, code, etc.)
        - Lists (ordered, unordered, task lists)
        - Tables with header alignment and inline content
        - Code blocks with language info
        - Blockquotes, horizontal rules, and inline/flow HTML

    Example: Convert markdown to JSON
        ```python
        df.select(markdown.to_json(col("markdown_text")))
        ```

    Example: Extract all level-2 headings with jq
        ```python
        # Combine with jq to extract all level-2 headings
        df.select(json.jq(markdown.to_json(col("md")), '.. | select(.type == "heading" and .level == 2)'))
        ```
    """
    return Column._from_logical_expr(
        MdToJsonExpr(Column._from_col_or_name(column)._logical_expr)
    )