Custom Embeddings

This guide covers how to register custom embeddings using the EmbeddingRecorder interface provided by the MarkovML Python SDK.

🚧

Setup required

If you haven't already, please install the MarkovML SDK on your system before continuing. See the Setup Your Machine guide for more details.

Custom Embedding Usage

The platform currently uses Sentence Transformers to compute embeddings. However, users may want to visualize their own custom-trained embeddings. This tool enables users to upload their own embeddings for a given dataset, which can then be visualized and analyzed on MarkovML.

Create an Embedding Recorder

Before registering custom embeddings for a dataset, the dataset itself must first be registered with MarkovML. This can be done by following the dataset registration guide.

import markov
from markov import EmbeddingRecorder

# Get the dataset by name
data_set = markov.dataset.get_by_name("my_dataset_name")
dataset_id = data_set.ds_id

embedding_recorder = EmbeddingRecorder(
    name="Custom embedding name",
    dataset_id=dataset_id,
    notes="Optional description for this custom embedding",
)
embedding_recorder.register()

Add Embedding Records

To add an embedding record to a MarkovML embedding recording, you need both the dataset record and its corresponding embedding. Each of these is provided as a list.

embedding_recorder.ds_columns can be used as a reference for the columns whose values must be populated in each dataset record.
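As a rough illustration of what following ds_columns means, the sketch below arranges the values of one record in the order given by a column list. It uses plain Python and makes no SDK calls. The helper record_values, the ds_columns list, and the record dictionary are all hypothetical stand-ins, not part of the MarkovML SDK; in practice the column list would come from embedding_recorder.ds_columns.

```python
from typing import Any, Dict, List


def record_values(record: Dict[str, Any], ds_columns: List[str]) -> List[Any]:
    """Return the record's values in the order given by ds_columns."""
    missing = [col for col in ds_columns if col not in record]
    if missing:
        raise KeyError(f"record is missing columns: {missing}")
    return [record[col] for col in ds_columns]


# Hypothetical column list; in practice, read embedding_recorder.ds_columns.
ds_columns = ["text", "label"]
# Hypothetical dataset row with an extra column the recorder does not need.
record = {"label": "positive", "row_id": 7, "text": "great product"}

values = record_values(record, ds_columns)
print(values)
```

Columns that are not in the list (row_id here) are dropped, and a missing column fails early instead of producing a misaligned record.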

A complete sample is given below.

from typing import List

import markov
from markov import EmbeddingRecorder

# Get the dataset by name
dataset = markov.dataset.get_by_name("my_dataset_name")
dataset_id = dataset.ds_id

# You can also get the dataset by id:
# dataset = markov.dataset.get_by_id(dataset_id)

# Custom embeddings can be added for any segment; the train segment is used here as an example.
train_df = dataset.train.as_df()

embedding_recorder = EmbeddingRecorder(
    name="Custom embedding name",
    dataset_id=dataset_id,
    notes="Optional description for this custom embedding",
)
embedding_recorder.register()

for _, row in train_df[embedding_recorder.ds_columns].iterrows():
    # get_embeddings is a user-defined function (not part of the SDK), usually
    # backed by a trained model. It must return the embedding for this record
    # as a list of floats.
    embedding: List[float] = get_embeddings(row)
    embedding_recorder.add_embedding_record(row.tolist(), embedding)

# It's important to call finish() to signal to MarkovML that all embeddings have been uploaded
embedding_recorder.finish()
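The sample above leaves get_embeddings to the user. To try the flow before a trained model is available, a placeholder along the following lines may help. This is only a sketch: it hashes the row contents into a fixed number of floats, so the resulting vectors carry no semantic meaning. The constant EMBEDDING_DIM and the helper check_embedding are hypothetical names introduced here, and the length check assumes that every embedding in a recording should have the same dimension, which this guide does not state explicitly.

```python
import hashlib
import math
from typing import List, Sequence

EMBEDDING_DIM = 8  # hypothetical size; use your model's output dimension


def get_embeddings(row: Sequence[object], dim: int = EMBEDDING_DIM) -> List[float]:
    """Toy stand-in for a trained model: hash the row into `dim` floats in [0, 1]."""
    text = "|".join(str(value) for value in row)
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    if dim > len(digest):
        raise ValueError(f"dim must be at most {len(digest)} for this toy hash")
    return [digest[i] / 255.0 for i in range(dim)]


def check_embedding(embedding: Sequence[float], dim: int = EMBEDDING_DIM) -> None:
    """Raise ValueError if the embedding has the wrong length or non-finite values."""
    if len(embedding) != dim:
        raise ValueError(f"expected {dim} values, got {len(embedding)}")
    if not all(math.isfinite(x) for x in embedding):
        raise ValueError("embedding contains NaN or infinite values")


row = ["great product", "positive"]
embedding = get_embeddings(row)
check_embedding(embedding)
print(len(embedding))
```

Running a check like check_embedding on each vector before passing it to add_embedding_record catches malformed vectors locally, before anything is uploaded.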