fenic.api.functions.embedding

Embedding functions.

Functions:

  • compute_similarity

    Compute similarity between embedding vectors using specified metric.

  • normalize

    Normalize embedding vectors to unit length.

compute_similarity

compute_similarity(column: ColumnOrName, other: Union[ColumnOrName, List[float], ndarray], metric: SemanticSimilarityMetric = 'cosine') -> Column

Compute similarity between embedding vectors using specified metric.

Parameters:

  • column (ColumnOrName) –

    Column containing embedding vectors.

  • other (Union[ColumnOrName, List[float], ndarray]) –

    Either:

    • Another column containing embedding vectors for pairwise similarity
    • A query vector (list of floats or numpy array) for similarity with each embedding
  • metric (SemanticSimilarityMetric, default: 'cosine') –

    The similarity metric to use. Options:

    • cosine: Cosine similarity (range: -1 to 1, higher is more similar)
    • dot: Dot product similarity (raw inner product)
    • l2: L2 (Euclidean) distance (lower is more similar)

Returns:

  • Column ( Column ) –

    A column of float values: similarity scores for cosine and dot, distances for l2.

Raises:

  • ValidationError

    If query vector contains NaN values or has invalid dimensions.
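
The check described above can be approximated before calling `compute_similarity`. The following is only a sketch: `check_query_vector` and its `expected_dim` parameter are hypothetical names, not part of fenic, and it raises a plain `ValueError` rather than fenic's `ValidationError`. The dimension check is an assumption about what "invalid dimensions" means.

```python
import math
from typing import List, Sequence


def check_query_vector(query: Sequence[float], expected_dim: int) -> List[float]:
    """Hypothetical pre-check for a query vector (not a fenic function)."""
    # Assumption: the query must have the same dimension as the embeddings.
    if len(query) != expected_dim:
        raise ValueError(
            f"query has {len(query)} dimensions, expected {expected_dim}"
        )
    # NaN values are rejected, as the Raises entry describes.
    if any(math.isnan(x) for x in query):
        raise ValueError("query vector cannot contain NaN values")
    return [float(x) for x in query]


clean_query = check_query_vector([0.1, 0.2, 0.3], expected_dim=3)
```

Running such a check up front reports a bad query before the query plan is built.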

Notes
  • Cosine similarity normalizes vectors internally, so pre-normalization is not required
  • Dot product does not normalize, so it is most useful when vectors are already normalized
  • L2 distance measures the straight-line distance between vectors
  • When using two columns, dimensions must match between embeddings
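
To make the three metrics concrete, here is a minimal pure-Python sketch of the textbook formulas. It illustrates the math only and is not fenic's implementation; the helper names `dot`, `cosine`, and `l2` are chosen for this example.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Raw inner product; no normalization is applied."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product divided by the product of the two L2 norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line (Euclidean) distance; lower means more similar."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


embedding_vec = [3.0, 4.0]  # not normalized (L2 norm is 5)
query = [0.6, 0.8]          # unit length, same direction

dot_score = dot(embedding_vec, query)        # approximately 5.0
cosine_score = cosine(embedding_vec, query)  # approximately 1.0
l2_distance = l2(embedding_vec, query)       # approximately 4.0
```

The two vectors point in the same direction, so cosine similarity is at its maximum. The dot product and L2 distance still reflect the difference in magnitude.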
Compute dot product with a query vector
# Compute dot product with a query vector
query = [0.1, 0.2, 0.3]
df.select(
    embedding.compute_similarity(col("embeddings"), query, metric="dot").alias("similarity")
)
Compute cosine similarity with a query vector
query = [0.6, 0.8]  # Already normalized
df.select(
    embedding.compute_similarity(col("embeddings"), query, metric="cosine").alias("cosine_sim")
)
Compute pairwise L2 distance between columns
# Compute L2 distance between two columns of embeddings
df.select(
    embedding.compute_similarity(col("embeddings1"), col("embeddings2"), metric="l2").alias("distance")
)
Using numpy array as query vector
# Use numpy array as query vector
import numpy as np
query = np.array([0.1, 0.2, 0.3])
df.select(embedding.compute_similarity("embeddings", query))
Source code in src/fenic/api/functions/embedding.py
@validate_call(config=ConfigDict(strict=True, arbitrary_types_allowed=True))
def compute_similarity(
    column: ColumnOrName,
    other: Union[ColumnOrName, List[float], np.ndarray],
    metric: SemanticSimilarityMetric = "cosine"
) -> Column:
    """Compute similarity between embedding vectors using specified metric.

    Args:
        column: Column containing embedding vectors.

        other: Either:

            - Another column containing embedding vectors for pairwise similarity
            - A query vector (list of floats or numpy array) for similarity with each embedding

        metric: The similarity metric to use. Options:

            - `cosine`: Cosine similarity (range: -1 to 1, higher is more similar)
            - `dot`: Dot product similarity (raw inner product)
            - `l2`: L2 (Euclidean) distance (lower is more similar)

    Returns:
        Column: A column of float values: similarity scores for `cosine` and `dot`, distances for `l2`.

    Raises:
        ValidationError: If query vector contains NaN values or has invalid dimensions.

    Notes:
        - Cosine similarity normalizes vectors internally, so pre-normalization is not required
        - Dot product does not normalize, so it is most useful when vectors are already normalized
        - L2 distance measures the straight-line distance between vectors
        - When using two columns, dimensions must match between embeddings

    Example: Compute dot product with a query vector
        ```python
        # Compute dot product with a query vector
        query = [0.1, 0.2, 0.3]
        df.select(
            embedding.compute_similarity(col("embeddings"), query, metric="dot").alias("similarity")
        )
        ```

    Example: Compute cosine similarity with a query vector
        ```python
        query = [0.6, 0.8]  # Already normalized
        df.select(
            embedding.compute_similarity(col("embeddings"), query, metric="cosine").alias("cosine_sim")
        )
        ```

    Example: Compute pairwise L2 distance between columns
        ```python
        # Compute L2 distance between two columns of embeddings
        df.select(
            embedding.compute_similarity(col("embeddings1"), col("embeddings2"), metric="l2").alias("distance")
        )
        ```

    Example: Using numpy array as query vector
        ```python
        # Use numpy array as query vector
        import numpy as np
        query = np.array([0.1, 0.2, 0.3])
        df.select(embedding.compute_similarity("embeddings", query))
        ```
    """
    column_expr = Column._from_col_or_name(column)._logical_expr

    # Check if other is a column
    if isinstance(other, ColumnOrName):
        other_expr = Column._from_col_or_name(other)._logical_expr
        return Column._from_logical_expr(
            EmbeddingSimilarityExpr(column_expr, other_expr, metric)
        )

    # Otherwise it's a query vector
    if isinstance(other, list):
        query_array = np.array(other, dtype=np.float32)
    else:
        query_array = other.astype(np.float32)

    # Check for NaNs
    if np.any(np.isnan(query_array)):
        raise ValidationError("Query vector cannot contain NaN values")

    return Column._from_logical_expr(
        EmbeddingSimilarityExpr(column_expr, query_array, metric)
    )

normalize

normalize(column: ColumnOrName) -> Column

Normalize embedding vectors to unit length.

Parameters:

  • column (ColumnOrName) –

    Column containing embedding vectors.

Returns:

  • Column ( Column ) –

    A column of normalized embedding vectors with the same embedding type.

Notes
  • Normalizes each embedding vector to have unit length (L2 norm = 1)
  • Preserves the original embedding model in the type
  • Null values are preserved as null
  • Zero vectors become NaN after normalization
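
The behaviors listed in these notes can be illustrated with a small pure-Python sketch. This is not fenic's implementation. `normalize_vector` is a hypothetical helper, and the null and zero-vector handling is written out explicitly to mirror the notes above.

```python
import math
from typing import List, Optional, Sequence


def normalize_vector(vec: Optional[Sequence[float]]) -> Optional[List[float]]:
    """Hypothetical helper: scale a vector to unit L2 norm."""
    if vec is None:
        # Null values are preserved as null.
        return None
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        # A zero vector has no direction; mirror the documented NaN result.
        return [math.nan] * len(vec)
    return [x / norm for x in vec]


unit = normalize_vector([3.0, 4.0])         # approximately [0.6, 0.8]
zero_result = normalize_vector([0.0, 0.0])  # every component is NaN
null_result = normalize_vector(None)        # stays None

# After normalization the dot product with another unit vector equals
# the cosine similarity of the original vectors.
query = [0.6, 0.8]
dot_with_query = sum(u * q for u, q in zip(unit, query))  # approximately 1.0
```

Because zero vectors turn into NaN, it may be worth filtering them out before normalizing if downstream comparisons cannot tolerate NaN scores.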
Normalize embeddings for dot product similarity
# Normalize embeddings for dot product similarity comparisons
df.select(
    embedding.normalize(col("embeddings")).alias("unit_embeddings")
)
Compare normalized embeddings using dot product
# Compare normalized embeddings using dot product (equivalent to cosine similarity)
normalized_df = df.select(embedding.normalize(col("embeddings")).alias("norm_emb"))
query = [0.6, 0.8]  # Already normalized
normalized_df.select(
    embedding.compute_similarity(col("norm_emb"), query, metric="dot").alias("dot_product_sim")
)
Source code in src/fenic/api/functions/embedding.py
def normalize(column: ColumnOrName) -> Column:
    """Normalize embedding vectors to unit length.

    Args:
        column: Column containing embedding vectors.

    Returns:
        Column: A column of normalized embedding vectors with the same embedding type.

    Notes:
        - Normalizes each embedding vector to have unit length (L2 norm = 1)
        - Preserves the original embedding model in the type
        - Null values are preserved as null
        - Zero vectors become NaN after normalization

    Example: Normalize embeddings for dot product similarity
        ```python
        # Normalize embeddings for dot product similarity comparisons
        df.select(
            embedding.normalize(col("embeddings")).alias("unit_embeddings")
        )
        ```

    Example: Compare normalized embeddings using dot product
        ```python
        # Compare normalized embeddings using dot product (equivalent to cosine similarity)
        normalized_df = df.select(embedding.normalize(col("embeddings")).alias("norm_emb"))
        query = [0.6, 0.8]  # Already normalized
        normalized_df.select(
            embedding.compute_similarity(col("norm_emb"), query, metric="dot").alias("dot_product_sim")
        )
        ```
    """
    column_expr = Column._from_col_or_name(column)._logical_expr
    return Column._from_logical_expr(EmbeddingNormalizeExpr(column_expr))