fenic.api.functions.embedding
Embedding functions.
Functions:
- compute_similarity – Compute similarity between embedding vectors using the specified metric.
- normalize – Normalize embedding vectors to unit length.
compute_similarity
compute_similarity(column: ColumnOrName, other: Union[ColumnOrName, List[float], ndarray], metric: SemanticSimilarityMetric = 'cosine') -> Column
Compute similarity between embedding vectors using the specified metric.
Parameters:
- column (ColumnOrName) – Column containing embedding vectors.
- other (Union[ColumnOrName, List[float], ndarray]) – Either:
  - Another column containing embedding vectors for pairwise similarity
  - A query vector (list of floats or numpy array) for similarity with each embedding
- metric (SemanticSimilarityMetric, default: 'cosine') – The similarity metric to use. Options:
  - cosine: Cosine similarity (range: -1 to 1; higher is more similar)
  - dot: Dot product similarity (raw inner product)
  - l2: L2 (Euclidean) distance (lower is more similar)
Returns:
- Column – A column of float values representing similarity scores.
Raises:
- ValidationError – If the query vector contains NaN values or has invalid dimensions.
Notes
- Cosine similarity normalizes vectors internally, so pre-normalization is not required
- Dot product does not normalize; it is useful when vectors are already normalized
- L2 distance measures the straight-line distance between vectors
- When using two columns, dimensions must match between embeddings
Compute dot product with a query vector
# Compute dot product with a query vector
query = [0.1, 0.2, 0.3]
df.select(
    embedding.compute_similarity(col("embeddings"), query, metric="dot").alias("similarity")
)
Compute cosine similarity with a query vector
query = [0.6, 0.8]  # Already normalized
df.select(
    embedding.compute_similarity(col("embeddings"), query, metric="cosine").alias("cosine_sim")
)
Compute L2 distance between two columns of embeddings
# Compute L2 distance between two columns of embeddings
df.select(
    embedding.compute_similarity(col("embeddings1"), col("embeddings2"), metric="l2").alias("distance")
)
Using numpy array as query vector
# Use numpy array as query vector
import numpy as np
query = np.array([0.1, 0.2, 0.3])
df.select(embedding.compute_similarity("embeddings", query))
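As an aside, the relationships between the three metrics described in the Notes can be checked with plain numpy. This sketch is independent of fenic and only illustrates the underlying math (the vectors here are arbitrary illustrative values):

```python
import numpy as np

a = np.array([1.0, 2.0, 2.0])  # an example embedding
b = np.array([2.0, 0.0, 1.0])  # another example embedding

# Cosine similarity normalizes internally: the dot product of unit vectors
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product: raw inner product, no normalization
dot = np.dot(a, b)

# L2 distance: straight-line distance between the vectors (lower is more similar)
l2 = np.linalg.norm(a - b)

# For pre-normalized vectors, dot product equals cosine similarity,
# which is why metric="dot" is a cheap substitute after embedding.normalize
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
assert np.isclose(np.dot(a_unit, b_unit), cosine)
```

This is also why the dimension-match requirement applies: every one of these operations is elementwise over vectors of equal length.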
Source code in src/fenic/api/functions/embedding.py
normalize
normalize(column: ColumnOrName) -> Column
Normalize embedding vectors to unit length.
Parameters:
- column (ColumnOrName) – Column containing embedding vectors.
Returns:
- Column – A column of normalized embedding vectors with the same embedding type.
Notes
- Normalizes each embedding vector to have unit length (L2 norm = 1)
- Preserves the original embedding model in the type
- Null values are preserved as null
- Zero vectors become NaN after normalization
Normalize embeddings for dot product similarity
# Normalize embeddings for dot product similarity comparisons
df.select(
    embedding.normalize(col("embeddings")).alias("unit_embeddings")
)
Compare normalized embeddings using dot product
# Compare normalized embeddings using dot product (equivalent to cosine similarity)
normalized_df = df.select(embedding.normalize(col("embeddings")).alias("norm_emb"))
query = [0.6, 0.8]  # Already normalized
normalized_df.select(
    embedding.compute_similarity(col("norm_emb"), query, metric="dot").alias("dot_product_sim")
)
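The normalization itself is just division by the vector's L2 norm. A small numpy sketch (independent of fenic; the helper name is illustrative, not part of the API) demonstrates the unit-length and zero-vector behaviors listed in the Notes:

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Divide by the L2 norm; a zero vector produces NaN (0 / 0)."""
    return v / np.linalg.norm(v)

v = np.array([3.0, 4.0])
unit = l2_normalize(v)  # [0.6, 0.8]; its L2 norm is exactly 1

# A zero vector has norm 0, so every component becomes NaN
# (numpy emits a RuntimeWarning for the division)
zero = l2_normalize(np.array([0.0, 0.0]))
```

The NaN result for zero vectors is worth handling upstream (e.g. filtering out null or empty embeddings) before computing dot-product similarities on the normalized column.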
Source code in src/fenic/api/functions/embedding.py