fenic.api.functions.markdown
Markdown functions.
Functions:
- extract_header_chunks: Splits markdown documents into logical chunks based on heading hierarchy.
- generate_toc: Generates a table of contents from markdown headings.
- get_code_blocks: Extracts all code blocks from a column of Markdown-formatted strings.
- to_json: Converts a column of Markdown-formatted strings into a hierarchical JSON representation.
extract_header_chunks
extract_header_chunks(column: ColumnOrName, header_level: int) -> Column
Splits markdown documents into logical chunks based on heading hierarchy.
Parameters:
- column (ColumnOrName): Input column containing Markdown strings.
- header_level (int): Heading level to split on (1-6). Creates a new chunk at every heading of this level, including all nested content and subsections.
Returns:
- Column (Column): A column of arrays containing chunk objects with the following structure:
ArrayType(StructType([
    StructField("heading", StringType),         # Heading text (clean, no markdown)
    StructField("level", IntegerType),          # Heading level (1-6)
    StructField("content", StringType),         # All content under this heading (clean text)
    StructField("parent_heading", StringType),  # Parent heading text (or null)
    StructField("full_path", StringType),       # Full breadcrumb path
]))
Notes:
- Context-preserving: Each chunk contains all content and subsections under the heading.
- Hierarchical awareness: Includes parent heading context for better LLM understanding.
- Clean text output: Strips markdown formatting for direct LLM consumption.
Chunking Behavior
With header_level=2, this markdown:
# Introduction
Overview text
## Getting Started
Setup instructions
### Prerequisites
Python 3.8+ required
## API Reference
Function documentation
Produces 2 chunks:
- Getting Started chunk (includes the Prerequisites subsection)
- API Reference chunk
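The chunking rule above can be approximated in plain Python. This is a minimal sketch, not fenic's implementation: the helper name `split_on_level` is hypothetical, it assumes ATX-style (`#`) headings, does not skip headings inside fenced code, and leaves inline formatting in the content rather than stripping it.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

def split_on_level(markdown_text: str, header_level: int) -> list[dict]:
    """Start a chunk at each heading of `header_level`; keep nested subsections in it."""
    chunks = []
    current = None   # chunk being filled, or None before the first matching heading
    ancestors = {}   # most recent heading text seen at each shallower level
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        level = len(match.group(1)) if match else None
        if match and level < header_level:
            # A shallower heading closes the open chunk and becomes a possible parent.
            current = None
            ancestors[level] = match.group(2)
            for deeper in [l for l in ancestors if l > level]:
                del ancestors[deeper]
        elif match and level == header_level:
            parent = ancestors[max(ancestors)] if ancestors else None
            current = {"heading": match.group(2), "level": level,
                       "content": [], "parent_heading": parent}
            chunks.append(current)
        elif current is not None:
            # Body text and deeper headings stay inside the open chunk.
            current["content"].append(line)
    for chunk in chunks:
        chunk["content"] = "\n".join(chunk["content"]).strip()
    return chunks

doc = """# Introduction
Overview text
## Getting Started
Setup instructions
### Prerequisites
Python 3.8+ required
## API Reference
Function documentation"""

chunks = split_on_level(doc, header_level=2)
print([c["heading"] for c in chunks])
```

Text that appears before the first heading of the target level (here, the Introduction section) does not land in any chunk, which matches the two-chunk result described above.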
Split articles into top-level sections
df.select(markdown.extract_header_chunks(col("articles"), header_level=1))
Split documentation into feature sections
df.select(markdown.extract_header_chunks(col("docs"), header_level=2))
Create fine-grained chunks for detailed analysis
df.select(markdown.extract_header_chunks(col("content"), header_level=3))
Explode chunks into individual rows for processing
chunks_df = df.select(
    markdown.extract_header_chunks(col("markdown"), header_level=2).alias("chunks")
).explode("chunks")
chunks_df.select(
    col("chunks").heading,
    col("chunks").content,
    col("chunks").full_path
)
Source code in src/fenic/api/functions/markdown.py
generate_toc
generate_toc(column: ColumnOrName, max_level: Optional[int] = None) -> Column
Generates a table of contents from markdown headings.
Parameters:
- column (ColumnOrName): Input column containing Markdown strings.
- max_level (Optional[int], default: None): Maximum heading level to include in the TOC (1-6). When None, all levels (1-6) are included.
Returns:
- Column (Column): A column of Markdown-formatted table of contents strings.
Notes
- The TOC is generated using markdown heading syntax (# ## ### etc.)
- Each heading in the source document becomes a line in the TOC
- The heading level is preserved in the output
- This creates a valid markdown document that can be rendered or processed further
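The behavior described in these notes can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions rather than fenic's implementation: `sketch_toc` is a hypothetical helper, it recognizes only ATX-style (`#`) headings, and it does not skip `#` lines that appear inside fenced code.

```python
import re
from typing import Optional

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

def sketch_toc(markdown_text: str, max_level: Optional[int] = None) -> str:
    """Keep only heading lines no deeper than `max_level`, preserving their `#` depth."""
    limit = 6 if max_level is None else max_level
    toc_lines = []
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match and len(match.group(1)) <= limit:
            toc_lines.append(f"{match.group(1)} {match.group(2)}")
    return "\n".join(toc_lines)

doc = "# Guide\nIntro\n## Install\nSteps\n### Linux\nDetails\n## Usage\nMore"
toc = sketch_toc(doc, max_level=2)
print(toc)
```

Because every output line is itself a heading, the result is a valid markdown document, as the notes state.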
Generate a complete TOC
df.select(markdown.generate_toc(col("documentation")))
Generate a simplified TOC with only top 2 levels
df.select(markdown.generate_toc(col("documentation"), max_level=2))
Add TOC as a new column
df = df.with_column("toc", markdown.generate_toc(col("content"), max_level=3))
get_code_blocks
get_code_blocks(column: ColumnOrName, language_filter: Optional[str] = None) -> Column
Extracts all code blocks from a column of Markdown-formatted strings.
Parameters:
- column (ColumnOrName): Input column containing Markdown strings.
- language_filter (Optional[str], default: None): Optional language filter to extract only code blocks with a specific language. By default, all code blocks are extracted.
Returns:
- Column (Column): A column of code blocks. The output column type is:
ArrayType(StructType([
    StructField("language", StringType),
    StructField("code", StringType),
]))
Notes
- Code blocks are parsed from fenced Markdown blocks (e.g., triple backticks ```).
- Language identifiers are optional and may be null if not provided in the original Markdown.
- Indented code blocks without fences are not currently supported.
- This function is useful for extracting embedded logic, configuration, or examples from documentation or notebooks.
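To make the fenced-block rules in these notes concrete, here is a regex-based approximation in plain Python. It is only a sketch, not how fenic parses Markdown: `sketch_code_blocks` is a hypothetical helper, it handles backtick fences only (no tilde fences or nested fences), and the fence marker is built programmatically so the example can itself be shown inside a fence.

```python
import re
from typing import Optional

TICKS = "`" * 3  # the triple-backtick fence marker

# Opening fence with an optional language tag, a lazy body, then a closing fence line.
FENCE = re.compile(
    r"^" + TICKS + r"[ \t]*([^\s`]*)[^\n]*\n(.*?)^" + TICKS + r"[ \t]*$",
    re.MULTILINE | re.DOTALL,
)

def sketch_code_blocks(markdown_text: str, language_filter: Optional[str] = None) -> list[dict]:
    """Return {"language", "code"} dicts; language is None when the fence has no tag."""
    blocks = []
    for match in FENCE.finditer(markdown_text):
        language = match.group(1) or None
        if language_filter is None or language == language_filter:
            blocks.append({"language": language, "code": match.group(2).rstrip("\n")})
    return blocks

doc = "\n".join([
    "Intro text",
    TICKS + "python",
    "print('hi')",
    TICKS,
    "    indented_code_is_ignored()",
    TICKS,
    "plain block",
    TICKS,
])

all_blocks = sketch_code_blocks(doc)
python_only = sketch_code_blocks(doc, language_filter="python")
print(all_blocks)
```

The indented line in the sample is not returned, mirroring the note that indented code blocks without fences are not supported.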
Extract all code blocks
df.select(markdown.get_code_blocks(col("markdown_text")))
Explode code blocks into individual rows
# Add the code blocks as a column, then explode the list into individual rows
df = df.with_column("blocks", markdown.get_code_blocks(col("md"))).explode("blocks")
df = df.select(col("blocks")["language"], col("blocks")["code"])
to_json
to_json(column: ColumnOrName) -> Column
Converts a column of Markdown-formatted strings into a hierarchical JSON representation.
Parameters:
- column (ColumnOrName): Input column containing Markdown strings.
Returns:
- Column (Column): A column of JSON-formatted strings representing the structured document tree.
Notes
- This function parses Markdown into a structured JSON format optimized for document chunking, semantic analysis, and jq queries.
- The output conforms to a custom schema that organizes content into nested sections based on heading levels. This makes it more expressive than flat ASTs like mdast.
- The full JSON schema is available at: docs.fenic.ai/topics/markdown-json
Supported Markdown Features
- Headings with nested hierarchy (e.g., h2 → h3 → h4)
- Paragraphs with inline formatting (bold, italics, links, code, etc.)
- Lists (ordered, unordered, task lists)
- Tables with header alignment and inline content
- Code blocks with language info
- Blockquotes, horizontal rules, and inline/flow HTML
Convert markdown to JSON
df.select(markdown.to_json(col("markdown_text")))
Extract all level-2 headings with jq
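# Combine with jq to extract all level-2 headings.
# jq string literals need double quotes; `objects` skips arrays and scalars, which cannot be indexed by key.
df.select(json.jq(markdown.to_json(col("md")), '.. | objects | select(.type == "heading" and .level == 2)'))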
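For readers less familiar with jq's recursive descent (`..`), the same selection can be sketched in plain Python. The sample tree below is hypothetical: only the `type` and `level` keys are taken from the jq filter above, while `text`, `children`, and the overall shape are invented for illustration and are not the actual fenic schema (see docs.fenic.ai/topics/markdown-json for that).

```python
import json

def find_headings(node, level):
    """Recursively yield every dict with type == "heading" and the given level."""
    if isinstance(node, dict):
        if node.get("type") == "heading" and node.get("level") == level:
            yield node
        for value in node.values():
            yield from find_headings(value, level)
    elif isinstance(node, list):
        for item in node:
            yield from find_headings(item, level)
    # Scalars (strings, numbers, null) carry no keys, so they are skipped.

# Hypothetical document tree; field names other than "type" and "level" are assumptions.
sample = json.loads("""
{"type": "document", "children": [
  {"type": "heading", "level": 1, "text": "Guide", "children": [
    {"type": "heading", "level": 2, "text": "Install", "children": []},
    {"type": "heading", "level": 2, "text": "Usage", "children": []}
  ]}
]}
""")

titles = [h["text"] for h in find_headings(sample, level=2)]
print(titles)
```

The walker visits every nested value, like `..` does, and tests only dictionaries, which is why scalars and arrays never raise lookup errors.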