Embedding Specification#

See Google’s page about Embedding for a definition and examples.

Terminology#

(Embedding) Table: The lower-dimensional representations of sparse/categorical data. For each token in the vocabulary we have a vector with a size of the embedding dimension.
Embedding ID (Token): Represents an element of the embedding vocabulary.
Vocabulary Size: The total number of unique embedding IDs. This is the number of rows in the embedding table.
Embedding Dimension: The size of the lower dimensional space for the embeddings. This is the number of columns in the embedding table.
Sample (Example): Represents a single training example with multiple tokens.
Feature (Input): Represents a collection of samples.
Max Sequence Length: Defines the maximum number of tokens that a sample can have in a given feature.
Weight/Gain: The weight of each Embedding ID in a given sample.
Combiner: The aggregation function for combining the embeddings for a given sample. For instance, sum or mean.
(Feature) Activations: The weighted aggregation calculated with the Combiner for each sample in a given Input Feature.
(Feature) Gradients: Gradients of the loss function with respect to the feature activations.
(Embedding Table) Optimizer: The update function for the Model parameters and Embedding Table.
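
The terms above can be tied together with a small NumPy sketch. It is only an illustration, not the library's implementation: the table contents, the helper name `activation`, and the choice to compute `mean` as a weighted sum divided by the sum of the weights are all assumptions made for this example.

```python
import numpy as np

vocabulary_size, embedding_dim = 6, 4
# Embedding table: one row per embedding ID, one column per embedding dimension.
table = np.arange(vocabulary_size * embedding_dim, dtype=np.float32).reshape(
    vocabulary_size, embedding_dim
)

# One sample of a feature: its embedding IDs and the weight (gain) of each ID.
sample_ids = np.array([1, 3, 3])
sample_weights = np.array([1.0, 0.5, 0.5], dtype=np.float32)


def activation(table, ids, weights, combiner="sum"):
    """Weighted aggregation of the table rows selected by `ids` (hypothetical helper)."""
    weighted_rows = table[ids] * weights[:, None]
    total = weighted_rows.sum(axis=0)
    if combiner == "sum":
        return total
    if combiner == "mean":
        # Assumption: the mean is normalized by the sum of the weights.
        return total / weights.sum()
    raise ValueError(f"unsupported combiner: {combiner}")


sum_activation = activation(table, sample_ids, sample_weights, "sum")
mean_activation = activation(table, sample_ids, sample_weights, "mean")
```

Each activation has shape `(embedding_dim,)`; stacking the activations of every sample in a feature gives the `[batch_size, embedding_dim]` output.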

API#

class TableSpec(*, name, vocabulary_size, embedding_dim, initializer, optimizer, combiner, max_ids_per_partition=256, max_unique_ids_per_partition=256, suggested_coo_buffer_size_per_device=None, quantization_config=None, _stacked_table_spec=None, _setting_in_stack=None)#

Specifies one embedding table.

TableSpec is virtually immutable (for jax.jit) using eq=True and unsafe_hash=True, but has frozen=False to allow in-place updates when preparing for feature stacking or table stacking. See [dataclass doc](https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass) for more information.
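
The dataclass settings mentioned above can be seen in isolation with a standard-library sketch. `ToyTableSpec` is a hypothetical stand-in with made-up fields, not the real TableSpec; it only shows what eq=True, unsafe_hash=True and frozen=False imply together.

```python
import dataclasses


@dataclasses.dataclass(eq=True, unsafe_hash=True)  # frozen defaults to False
class ToyTableSpec:
    """Hypothetical stand-in for a spec object; not the library class."""

    name: str
    vocabulary_size: int
    max_ids_per_partition: int = 256


a = ToyTableSpec(name="t", vocabulary_size=1000)
b = ToyTableSpec(name="t", vocabulary_size=1000)

# Field-wise equality and hashing: equal specs hash the same, so they can be
# used as dictionary keys or as static (hashable) arguments.
same_before = (a == b) and (hash(a) == hash(b))

# frozen=False still allows in-place updates.
a.max_ids_per_partition = 512
differs_after = a != b
```

Because the hash is derived from the field values, an in-place update changes it. A spec that has already been used as a key or static argument should therefore not be mutated afterwards.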

Parameters:

name (str)
vocabulary_size (int)
embedding_dim (int)
initializer (Initializer | Callable[[...], Array | ndarray | bool | number | bool | int | float | complex])
optimizer (OptimizerSpec)
combiner (str)
max_ids_per_partition (int)
max_unique_ids_per_partition (int)
suggested_coo_buffer_size_per_device (int | None)
quantization_config (QuantizationConfig | None)
_stacked_table_spec (StackedTableSpec | None)
_setting_in_stack (TableSettingInStack | None)

combiner: str#: The aggregation function to compute activations for each sample. For example, sum or mean.

embedding_dim: int#: The number of columns in the embedding table.

initializer: Initializer | Callable[[...], Array | ndarray | bool | number | bool | int | float | complex]#: An initializer for the embedding table. See the jax.nn.initializers module for more details.

max_ids_per_partition: int = 256#: The maximum number of embedding IDs that can be packed into a single partition.

max_unique_ids_per_partition: int = 256#: The maximum number of unique embedding IDs that can be packed into a single partition.

name: str#: Name of the table.

optimizer: OptimizerSpec#: An optimizer for the embedding table.

quantization_config: QuantizationConfig | None = None#: Quantization config (min, max, num_buckets) which represent the float range and number of discrete integer buckets to use for quantization.

property setting_in_stack: TableSettingInStack#: Returns the setting of this table in the stack.

property stacked_table_spec: StackedTableSpec#: Returns the stacked table spec which this table belongs to.

suggested_coo_buffer_size_per_device: int | None = None#: The minimum size of the input buffer that the preprocessing should try to create.

vocabulary_size: int#: The total number of unique embedding IDs. This is the number of rows in the embedding table.
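
To get a feel for max_ids_per_partition and max_unique_ids_per_partition, the sketch below counts how many IDs of a batch fall into each partition. It rests on two assumptions that this page does not confirm: that the MOD strategy places an ID in partition `id % num_partitions`, and that the limits are compared against per-partition counts over the whole batch. The helper name `ids_per_partition` is hypothetical.

```python
import numpy as np


def ids_per_partition(ids, num_partitions):
    """Counts total and unique IDs per partition, assuming partition = id % n."""
    ids = np.asarray(ids)
    total = np.bincount(ids % num_partitions, minlength=num_partitions)
    unique = np.bincount(np.unique(ids) % num_partitions, minlength=num_partitions)
    return total, unique


batch_ids = [0, 4, 4, 8, 1, 5, 2, 2, 2, 3]
total, unique = ids_per_partition(batch_ids, num_partitions=4)

# With a (deliberately tiny) limit of 3, only partition 0 exceeds it.
max_ids_per_partition = 3
over_limit = total > max_ids_per_partition
```

A skewed ID distribution shows up as one partition with a much larger count than the others, which is the situation these two limits are meant to bound.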

class FeatureSpec(*, name, table_spec, input_shape, output_shape, _id_transformation=None)#

Specification for one embedding feature.

Notes

FeatureSpec is virtually immutable (for jax.jit()) using eq=True and unsafe_hash=True, but has frozen=False to allow in-place updates when preparing for feature stacking or table stacking. See [dataclass doc](https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass) for more information.

Warning

For all other purposes use embedding.update_preprocessing_parameters to maintain consistency between features, tables and stacked tables.

Parameters:

name (str)
table_spec (TableSpec)
input_shape (Sequence[int])
output_shape (Sequence[int])
_id_transformation (FeatureIdTransformation | None)

property id_transformation: FeatureIdTransformation#: Returns the transformation to apply to the input feature ids.

input_shape: Sequence[int]#: The shape of the input JAX array: [global_batch_size, feature_valency]. The second element can be omitted for ragged input.

name: str#: Name of the feature.

output_shape: Sequence[int]#: The expected shape of the output activation JAX array: [global_batch_size, embedding_dim].

table_spec: TableSpec#: The table spec for the feature.

class SparseCoreEmbed(feature_specs, sharding_axis='sparsecore_sharding', mesh=None, table_sharding_strategy='MOD', enable_minibatching=False, num_sc_per_device=-1, parent=<flax.linen.module._Sentinel object>, name=None)#

SparseCore embedding layer.

Parameters:

feature_specs (FeatureSpec | Sequence[FeatureSpec] | Mapping[str, FeatureSpec])
sharding_axis (str)
mesh (Mesh)
table_sharding_strategy (str)
enable_minibatching (bool)
num_sc_per_device (int)
parent (Module | Scope | _Sentinel | None)
name (str | None)

apply_gradient(gradients, embedding_lookup_inputs)#

Apply the gradients to the embedding variables.

Parameters:

gradients (Array | Sequence[Array] | Mapping[str, Array]) – The activation gradients.
embedding_lookup_inputs (PreprocessedInput) – The preprocessed data for embedding lookup.

Returns:

The updated embedding tables.

Return type:

Mapping[str, Mapping[str, Array]]

preprocess_inputs(step, features, features_weights, all_reduce_interface=None)#

Preprocesses the input for sparse dense matmul.

This method does not need to be invoked with module.apply().

Parameters:

step (int) – The current step
features (ndarray | Sequence[ndarray] | Mapping[str, ndarray]) – The input features for the current process. The features are expected to be Nested type (defined above). Concretely each leaf node should be either a 2D numpy array or a 1D list or numpy array of numpy arrays with dtype object (in the ragged tensor case).
features_weights (ndarray | Sequence[ndarray] | Mapping[str, ndarray] | None) – The input feature weights. The structure must be identical to the features. If None, uniform weights (1.0) are assumed for all features.
all_reduce_interface (Any | None) – The all reduce interface for minibatching. This can be generated using the get_all_reduce_interface function. Not required for single-host minibatching.

Returns:

The processed data for embedding lookup.

Return type:

PreprocessedInput
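
The two leaf layouts described for features (a 2D array, or a 1D object array of arrays for ragged input) can be built in NumPy as sketched below. The IDs are made up, and the arrays are only constructed here, not passed to the library.

```python
import numpy as np

# Dense case: every sample has exactly 2 IDs, so a 2D array works.
dense_feature = np.array([[3, 7], [1, 4], [9, 2]], dtype=np.int32)

# Ragged case: samples have 1, 3 and 2 IDs, so use a 1D object array of arrays.
rows = [
    np.array([5], dtype=np.int32),
    np.array([3, 7, 1], dtype=np.int32),
    np.array([2, 2], dtype=np.int32),
]
ragged_feature = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    ragged_feature[i] = row

# Weights mirror the structure of the features. All-ones weights spell out
# the uniform default that is assumed when features_weights is None.
ragged_weights = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    ragged_weights[i] = np.ones_like(row, dtype=np.float32)
```

Filling a preallocated object array avoids a NumPy pitfall: `np.array(rows, dtype=object)` silently produces a 2D array when all rows happen to have the same length.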

setup()#

Initializes a Module lazily (similar to a lazy __init__).

setup is called once lazily on a module instance when a module is bound, immediately before any other methods like __call__ are invoked, or before a setup-defined attribute on self is accessed.

This can happen in three cases:

Immediately when invoking apply(), init() or init_and_output().
Once the module is given a name by being assigned to an attribute of another module inside the other module’s setup method (see __setattr__()):
>>> class MyModule(nn.Module):
...   def setup(self):
...     submodule = nn.Conv(...)
...
...     # Accessing `submodule` attributes does not yet work here.
...
...     # The following line invokes `self.__setattr__`, which gives
...     # `submodule` the name "conv1".
...     self.conv1 = submodule
...
...     # Accessing `submodule` attributes or methods is now safe and
...     # causes setup() to be called once.

Once a module is constructed inside a method wrapped with compact(), immediately before another method is called or a setup-defined attribute is accessed.

class SparseCoreEmbed(*args, **kwargs)#

SparseCore embedding layer.

Parameters:

args (Any)
kwargs (Any)

Return type:

Any

tpu_sparse_dense_matmul(preprocessed_inputs, embedding_variables, feature_specs, *, global_device_count, sharding_strategy='MOD', num_sc_per_device=None, enable_minibatching=False, perform_unstacking=True, use_activation_unstack_primitive=False)#

Computes the sparse dense matmul.

This function can be used with jax.jit and/or shard_map or as a complete standalone computation.

Parameters:

preprocessed_inputs (PreprocessedInput | SparseDenseMatmulInput) – The preprocessed inputs for sparse dense matmul.
embedding_variables (Mapping[str, EmbeddingVariables]) – A tuple of embedding tables and slot variables. The first one is always the embedding table, the following ones are slot variables. The tree structure must be identical to the lhs_row_pointers.
feature_specs (FeatureSpec | Sequence[FeatureSpec] | Mapping[str, FeatureSpec]) – The input features for the current process.
global_device_count (int) – The number of global devices (chips). Typically mesh.size.
sharding_strategy (str) – The sharding strategy (e.g., MOD)
num_sc_per_device (int | None) – The number of sparse cores per device. If None, it will be set to the number of sparse cores on the current host machine.
enable_minibatching (bool) – Whether to enable minibatching. Defaults to False.
perform_unstacking (bool) – If True, returns per-feature activations by unstacking the results. If False, returns raw stacked activations.
use_activation_unstack_primitive (bool) – If True, uses the new activation unstack primitive. Defaults to False.

Returns:

The activations, with the same structure as feature_specs.

Raises:

ValueError – The input arrays and tuples are not of the expected structure or the sharding strategy is not supported.

Return type:

Array | Sequence[Array] | Mapping[str, Array]

Examples

Example invocation:

sparse_matmul = functools.partial(
    embedding.tpu_sparse_dense_matmul,
    global_device_count=mesh.size,
    feature_specs=feature_specs,
    sharding_strategy="MOD",
)
sparse_matmul = jax.shard_map(
    sparse_matmul,
    mesh=mesh,
    in_specs=(
        P(mesh.axis_names[0]),
        P(mesh.axis_names[0]),
    ),
    out_specs=P(mesh.axis_names[0]),
    check_vma=False,
)
sparse_matmul = jax.jit(sparse_matmul)
# Both arguments are passed positionally, matching the two entries of in_specs.
activations = sparse_matmul(
    preprocessed_inputs,
    embedding_variables,
)

tpu_sparse_dense_matmul_grad(activation_gradients, preprocessed_inputs, embedding_variables, feature_specs, *, global_device_count=None, sharding_strategy='MOD', label='', step=None, num_sc_per_device=None, enable_minibatching=False, perform_stacking=True, use_gradient_stacking_primitive=False, embedding_var_limits=None)#

Computes the updated embedding variables based on the activation gradients.

Parameters:

activation_gradients (Array | Sequence[Array] | Mapping[str, Array]) – The activation gradients.
preprocessed_inputs (PreprocessedInput | SparseDenseMatmulInput) – The preprocessed inputs for sparse dense matmul.
embedding_variables (Mapping[str, EmbeddingVariables]) – A tuple of embedding tables and slot variables. The first one is always the embedding table, the following ones are slot variables. The tree structure must be identical to the lhs_row_pointers.
feature_specs (FeatureSpec | Sequence[FeatureSpec] | Mapping[str, FeatureSpec]) – The input features for the current process.
global_device_count (int | None) – The number of devices in the global job.
sharding_strategy (str) – The sharding strategy (e.g., MOD)
label (str) – The label for the optimizer computation.
step (Array | int | None) – The current step number.
num_sc_per_device (int | None) – The number of sparse cores per device. If None, it will be set to the number of sparse cores on the current host machine.
enable_minibatching (bool) – Whether to use minibatching. Defaults to False.
perform_stacking (bool) – If True, expects per-feature gradients and stacks them internally. If False, assumes activation_gradients are already stacked.
use_gradient_stacking_primitive (bool) – If True, uses the gradient stacking primitive.
embedding_var_limits (Mapping[str, tuple[None | float, None | float]] | None) – The minimum and maximum values of the embedding table. If None, no bounds are applied.

Returns:

The updated embedding variables.

Return type:

Mapping[str, EmbeddingVariables]

Examples

Example invocation with jit + shard_map:

grad_update = functools.partial(
    embedding.tpu_sparse_dense_matmul_grad,
    feature_specs=feature_specs,
    sharding_strategy="MOD",
)
grad_update = jax.shard_map(
    grad_update,
    mesh=mesh,
    in_specs=(
        P(mesh.axis_names[0]),
        P(mesh.axis_names[0]),
        P(mesh.axis_names[0]),
    ),
    out_specs=P(mesh.axis_names[0]),
    check_vma=False,
)

grad_update = jax.jit(grad_update)
# All three arguments are passed positionally, matching the entries of in_specs.
updated_embedding_variables = grad_update(
    activations_grad,
    preprocessed_inputs,
    embedding_variables,
)
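
The embedding_var_limits parameter above is described as a (minimum, maximum) pair per table, where None means no bound. One plausible reading is an element-wise clip of the table values, sketched below in NumPy. This is an assumption about the semantics, and `apply_limits` is a hypothetical helper, not part of the library.

```python
import numpy as np


def apply_limits(table, limits):
    """Clips `table` to (min, max); a None bound leaves that side open."""
    if limits is None:
        return table
    lower, upper = limits
    if lower is not None:
        table = np.maximum(table, lower)
    if upper is not None:
        table = np.minimum(table, upper)
    return table


table = np.array([[-2.0, 0.5], [1.5, 3.0]], dtype=np.float32)
clipped_both = apply_limits(table, (-1.0, 1.0))
clipped_upper = apply_limits(table, (None, 1.0))
unbounded = apply_limits(table, None)
```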

preprocess_sparse_dense_matmul_input(features, features_weights, feature_specs, local_device_count, global_device_count, *, num_sc_per_device=None, sharding_strategy='MOD', has_leading_dimension=False, allow_id_dropping=False, batch_number=0, enable_minibatching=False, all_reduce_interface=None)#

Preprocesses the input for sparse dense matmul.

Parameters:

features (Nested[ArrayLike]) – The input features for the current process. The features are expected to be Nested type (defined above). Concretely each leaf node should be either a 2D numpy array or a 1D list or numpy array of numpy arrays with dtype object (in the ragged tensor case).
features_weights (Nested[ArrayLike] | None) – The input feature weights. The structure must be identical to the features. If None, uniform weights (1.0) are assumed for all features.
feature_specs (Nested[embedding_spec.FeatureSpec]) – The feature specs. This needs to have the same structure as features and features_weights (e.g., if one of them is a mapping then all of them are).
local_device_count (int) – The number of local devices (chips). Typically mesh.local_mesh.size.
global_device_count (int) – The number of global devices (chips). Typically mesh.size.
num_sc_per_device (int | None) – The number of sparse cores per device. If None, it will be set to the number of sparse cores on the current host machine.
sharding_strategy (str) – The sharding strategy (e.g., MOD)
has_leading_dimension (bool) – If True, the first dimension of the output is the number of local devices, which suits jax.pmap. If False, the first dimension of the output is the number of local devices * the static buffer size, which suits jax.jit. In short, set it to True when using jax.pmap and to False when using jax.jit.
allow_id_dropping (bool) – If set to True, then ids will be dropped if they exceed the max_ids_per_partition or max_unique_ids_per_partition limits.
batch_number (int) – The batch number.
enable_minibatching (bool) – Whether to enable minibatching.
all_reduce_interface (pybind_input_preprocessing.AllReduceInterface | None) – Interface to communicate between multiple hosts. This can be generated using the get_all_reduce_interface function. Not required for single-host minibatching.

Returns:

A tuple of PreprocessedInput and SparseDenseMatmulInputStats.

Return type:

tuple[PreprocessedInput, SparseDenseMatmulInputStats]
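
The effect of has_leading_dimension on the output shape can be pictured with a plain reshape. The sketch below uses a made-up buffer and a hypothetical `buffer_size`; the real buffer contents and sizes are determined by the library.

```python
import numpy as np

local_device_count = 4
buffer_size = 8  # hypothetical static buffer size per device

# A stand-in for one preprocessed buffer (the real contents are library-defined).
flat = np.arange(local_device_count * buffer_size, dtype=np.int32)

# has_leading_dimension=False: shape [local_device_count * buffer_size],
# the layout described for jax.jit.
jit_layout = flat

# has_leading_dimension=True: shape [local_device_count, buffer_size],
# the layout described for jax.pmap, which maps over the leading axis.
pmap_layout = flat.reshape(local_device_count, buffer_size)
```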

preprocess_sparse_dense_matmul_input_from_sparse_tensor(indices, values, dense_shapes, feature_specs, local_device_count, global_device_count, *, num_sc_per_device=None, sharding_strategy='MOD', has_leading_dimension=False, allow_id_dropping=False, batch_number=0, enable_minibatching=False, all_reduce_interface=None)#

Preprocesses the input for sparse dense matmul.

Note

This function assumes that the values tensors contain embedding IDs and must be of integer type. Custom weights are not supported with this input format; uniform weights (1.0) are assumed for all embedding IDs. If you need to provide custom weights, use preprocess_sparse_dense_matmul_input instead.

Parameters:

indices (Nested[ArrayLike]) – A nested structure of 2-D int64 tensors, where each tensor has shape [N, ndims]. It represents the indices of non-zero elements in a sparse tensor, with elements being zero-indexed. For instance, indices=[[1,3], [2,4]] indicates that elements at [1,3] and [2,4] have non-zero values.
values (Nested[ArrayLike]) – A nested structure of 1-D int32 tensors, each with shape [N], representing the embedding IDs corresponding to indices. For example, with indices=[[1,3], [2,4]], values=[18, 3] means the embedding ID at [1,3] is 18 and at [2,4] is 3.
dense_shapes (Nested[ArrayLike]) – A nested structure of 2-element 1-D int64 tensors, defining the dense shape of the sparse tensor. It specifies the number of elements in each dimension. For example, dense_shape=[3,6] represents a 3x6 tensor.
feature_specs (Nested[embedding_spec.FeatureSpec]) – The feature specs. This needs to have the same structure as indices, values and dense_shapes (e.g., if one of them is a mapping then all of them are).
local_device_count (int) – The number of local devices (chips). Typically mesh.local_mesh.size.
global_device_count (int) – The number of global devices (chips). Typically mesh.size.
num_sc_per_device (int | None) – The number of sparse cores per device. If None, it will be set to the number of sparse cores on the current host machine.
sharding_strategy (str) – The sharding strategy (e.g., MOD)
has_leading_dimension (bool) – If True, the first dimension of the output is the number of local devices, which suits jax.pmap. If False, the first dimension of the output is the number of local devices * the static buffer size, which suits jax.jit. In short, set it to True when using jax.pmap and to False when using jax.jit.
allow_id_dropping (bool) – If set to True, then ids will be dropped if they exceed the max_ids_per_partition or max_unique_ids_per_partition limits.
batch_number (int) – The batch number.
enable_minibatching (bool) – Whether to enable minibatching.
all_reduce_interface (pybind_input_preprocessing.AllReduceInterface | None) – Interface to communicate between multiple hosts. This can be generated using the get_all_reduce_interface function. Not required for single-host minibatching.

Returns:

A tuple of PreprocessedInput and SparseDenseMatmulInputStats.

Return type:

tuple[PreprocessedInput, SparseDenseMatmulInputStats]
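
The (indices, values, dense_shape) triple used by this function can be read as one list of embedding IDs per sample. The sketch below reuses the example values from the parameter descriptions; `coo_to_rows` is a hypothetical helper written for illustration only.

```python
import numpy as np

indices = np.array([[1, 3], [2, 4]], dtype=np.int64)
values = np.array([18, 3], dtype=np.int32)
dense_shape = np.array([3, 6], dtype=np.int64)


def coo_to_rows(indices, values, dense_shape):
    """Groups the embedding IDs in `values` by their row (sample) index."""
    rows = [[] for _ in range(int(dense_shape[0]))]
    for (row, _col), value in zip(indices, values):
        rows[int(row)].append(int(value))
    return rows


rows = coo_to_rows(indices, values, dense_shape)
# Sample 0 has no IDs, sample 1 has ID 18, sample 2 has ID 3.
```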

Multivalent (Unordered/Pooled) Features#

For multivalent features, each sample is represented by an unordered set of embedding IDs. The embeddings corresponding to these IDs are aggregated or “pooled” into a single embedding vector for the sample. This is done using the combiner (e.g., sum, mean) specified in the TableSpec.

For example, if a sample has IDs [10, 21, 32] and the combiner is mean, the output activation will be mean(embedding(10), embedding(21), embedding(32)).

The input shape for a batch of such features is [batch_size, max_ids_per_sample], where max_ids_per_sample is the valency. The output shape is [batch_size, embedding_dim].
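
The pooling and the shapes described above can be checked with a short NumPy sketch. The table is random and the second sample's IDs are made up; this shows the arithmetic only, not how the library performs the lookup.

```python
import numpy as np

vocabulary_size, embedding_dim = 40, 3
rng = np.random.default_rng(0)
table = rng.normal(size=(vocabulary_size, embedding_dim)).astype(np.float32)

# Input shape [batch_size, max_ids_per_sample] = [2, 3].
ids = np.array([[10, 21, 32], [0, 1, 2]])

gathered = table[ids]                # [batch_size, valency, embedding_dim]
activations = gathered.mean(axis=1)  # combiner = mean, [batch_size, embedding_dim]
```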

Sequence (Ordered/Concatenated) Features#

For sequence features, each sample is an ordered sequence of items, where each item can be one or more embedding IDs. The embeddings for each item in the sequence are computed and then concatenated to form the final output.

To handle sequence features, you will need to flatten the sequence dimension into the batch dimension before passing the features to the embedding layer. You can then reshape the output back to recover the sequence dimension. This is equivalent to concatenating the embeddings for each item in the sequence.

# input shape: [batch_size, sequence_length, valency]

# 1. Flatten the sequence dimension into the batch dimension
flattened_input = jnp.reshape(input, (batch_size * sequence_length, valency))

# 2. Perform the embedding lookup and combinations (if valency > 1)
flattened_output = embed_layer(flattened_input)
# flattened_output shape: [batch_size * sequence_length, embedding_dim]

# 3. Reshape the output back to the original sequence shape
output = jnp.reshape(flattened_output, (batch_size, sequence_length, embedding_dim))

If you have variable sequence lengths, you will need to pad your inputs to a max_sequence_length.
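
A minimal padding sketch in NumPy follows. The pad ID of 0, the truncation of over-long sequences, and the returned mask are assumptions for illustration; choose a pad ID and masking scheme that fit your model.

```python
import numpy as np


def pad_sequences(sequences, max_sequence_length, pad_id=0):
    """Pads (or truncates) each sequence to max_sequence_length.

    Returns the padded [batch, max_sequence_length] IDs and a boolean mask
    that is True at real positions and False at padding.
    """
    batch = np.full((len(sequences), max_sequence_length), pad_id, dtype=np.int32)
    mask = np.zeros((len(sequences), max_sequence_length), dtype=bool)
    for i, seq in enumerate(sequences):
        seq = seq[:max_sequence_length]
        batch[i, : len(seq)] = seq
        mask[i, : len(seq)] = True
    return batch, mask


padded, mask = pad_sequences([[5, 9, 2], [7], [4, 4, 4, 4, 1]], max_sequence_length=4)
```

The mask can be applied to the reshaped output so that padded positions do not contribute downstream.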

Optimizers#

See the Optimizers page for more details on the available optimizers and how to configure them.

Flax Embedding Layer#

Flax is the most commonly used JAX neural network library. The JAX SparseCore API provides a Flax layer that uses the primitive APIs to support large embeddings.

Flax comes in two flavors: Linen (now deprecated) and the more recent NNX. The Flax project provides a guide for migrating from Linen to NNX.

The SparseCore project provides both Linen and NNX layers for large embedding models. They can be used without modification or extension. These layers are built on the primitive API, use the same Embedding Specification objects to configure the embedding, and accept inputs from the preprocessing API.

You can find the Linen module here: linen.embed.SparseCoreEmbed. The newer NNX module is here: nnx.embed.SparseCoreEmbed.

Caveats#

Caveat 1: As with the primitive API and due to the size of embedding tables, the embedding tables are updated in-place during the gradient calculation. As such, gradients of the embeddings can’t be extracted in the same way as they are with dense layers.
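
The idea of an in-place update can be pictured with the NumPy sketch below. It uses plain SGD and a hypothetical helper `sgd_update_in_place`; it is not the SparseCore implementation. It only shows why no dense gradient with the table's shape needs to be returned to the caller.

```python
import numpy as np


def sgd_update_in_place(table, ids, row_gradients, learning_rate):
    """Applies SGD to the looked-up rows only, mutating `table` directly."""
    # ufunc.at accumulates correctly when the same ID appears more than once.
    np.subtract.at(table, ids, learning_rate * row_gradients)


table = np.ones((5, 2), dtype=np.float32)
ids = np.array([1, 3, 3])
row_gradients = np.array([[1.0, 1.0], [2.0, 0.0], [2.0, 0.0]], dtype=np.float32)

sgd_update_in_place(table, ids, row_gradients, learning_rate=0.5)
# Rows 0, 2 and 4 are untouched; only the rows that were looked up change.
```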
