Register Datasets

Dive deep into your data with data profiling, EDA, and visualizations by registering your datasets with MarkovML

Register a Data Family

Data families allow you to group related Datasets that you register with MarkovML.

For example, you can group all the datasets for a sentiment classifier under a Sentiment Analysis data family. Note that datasets on MarkovML are immutable: once registered, a dataset should not be updated.

A data family can also be thought of as a logical collection of all versions of datasets for a specific domain.
When you register a dataset, you are required to specify the parent data family for the dataset.

There are two ways of creating a data family:

  • Through Web UI
  • Through Python SDK

A data family should be created before registering any associated datasets.

Create a Data Family Using the Web UI

You can add a new data family as part of the workflow to register a dataset from the MarkovML web application.

Once logged in, navigate to the Datasets page. Click the "Add New Dataset" button at the top of the screen.

In the final step ("Confirm features"), open the Data family dropdown menu, and you'll see an option to add a new data family.

Give the data family a unique name and an optional brief description, then click Save.

Create a Data Family Using the SDK

You can create a Data Family on Markov using a single line of Python code.

import markov

# Create a new data family for the dataset
df_reg_resp = markov.data.register_datafamily(
    name="Hate Speech Data Family",
    notes="This is a data family for hate speech datasets",
    lang="en-us",
    source="SOURCE_OF_THIS_DATASET",  # e.g. kaggle, customer_alpha, annotation_
)

Now that you've successfully created a data family, let's move on to registering a dataset with MarkovML.

Register a Dataset

Before registering a new dataset, please ensure you have the following:

  • The DataFamily id (df_id) of the data family this dataset will belong to. If you have not registered a data family yet, see the instructions above.
  • The credential_id that maps to the cloud credentials registered with Markov, if you are uploading your dataset from cloud storage.

Register a Dataset Using the SDK

There are three ways to register a dataset with Markov, depending on your requirements.
Choose the method that best fits how your data is stored and processed: in the cloud, in memory as DataFrames, or locally on your file system.

Uploading from Cloud

This method is convenient when your data is already hosted on a cloud platform, specifically in an S3 bucket. You'll need to create a DataFamily, register S3 credentials for access, and then use the from_cloud method to create a DataSet object. Make sure you have the S3 URIs of all the dataset segments.

import markov
from markov.api.data.data_family import DataFamily
from markov.api.data.data_set import DataSet, DataSetRegistrationResponse
from markov.api.mkv_constants import DataCategory

# Create dataset family to tag the dataset
MarkovmlExampleDataFamily = DataFamily(
    notes="Example Data family for Markovml Datasets",
    name="MarkovMLExampleFamily",
)
df_response = MarkovmlExampleDataFamily.register()

# Register the creds to fetch dataset from s3 store (if already created this will return the existing cred_response)
cred_response = markov.credentials.register_s3_credentials(
    name="MarkovmlWorkshopCredentials",
    access_key="<ACCESS KEY>",
    access_secret="<ACCESS SECRET>",
    notes="Creds to access datasets for cloud upload",
)
cred_id = cred_response.credential_id
# Create final dataset object formed from cloud upload path
# Select the x_col_names which are the features and the y_name as the target while registering the dataset
data_set = DataSet.from_cloud(
    df_id=df_response.df_id,  # data family id
    x_col_names=["tweet"],  # feature columns
    y_name="sentiment",  # target column
    delimiter=",",  # delimiter used for the input dataset segments
    name="CloudUploadSDK_01",  # dataset name (should be unique)
    data_category=DataCategory.Text,
    # dataset category (Text: if any feature column is text)
    credential_id=cred_id,  # pass cred id to access the s3 path
    train_source="s3://path_to_dataset/twitter_train.csv",  # path to the dataset segment in s3 bucket
    test_source="s3://path_to_dataset/twitter_test.csv",  # path to the dataset segment in s3 bucket
)

# Register Dataset
ds_response: DataSetRegistrationResponse = data_set.upload()
print(ds_response)

Uploading from DataFrame

This method is a great choice when your data is already in memory as a DataFrame, for example after preprocessing with pandas. Prepare your DataFrame, split it into training and testing sets, create a DataFamily, and use the from_dataframe method to create a DataSet object.

import pandas as pd
from sklearn.model_selection import train_test_split

from markov.api.data.data_family import DataFamily
from markov.api.data.data_set import DataSet, DataSetRegistrationResponse
from markov.api.mkv_constants import DataCategory

# Create dataset family to tag the dataset
MarkovmlExampleDataFamily = DataFamily(
    notes="Example Data family for Markovml Datasets",
    name="MarkovMLExampleFamily",
)
df_response = MarkovmlExampleDataFamily.register()

# Prepare the dataframe to upload (sample dataset available at the URL below)
df = pd.read_csv(
    "https://platform-assets.markovml.com/datasets/sample/twitter_sentiment.csv"
)
train_df, test_df = train_test_split(df, test_size=0.2)

# Create final dataset object formed from dataframe as datasource to upload
# Select the x_col_names which are the features and the y_name as the target while registering the dataset
data_set = DataSet.from_dataframe(
    df_id=df_response.df_id,  # data family id
    x_col_names=["tweet"],  # features column
    y_name="sentiment",  # target column
    delimiter=",",  # delimiter used in the dataset
    name="DataframeUploadSDK_01",  # dataset name (should be unique)
    data_category=DataCategory.Text,
    # dataset category (Text, Numeric, Categorical);
    # if any feature column is text, categorize it as Text
    train_source=train_df,  # train dataset segment
    test_source=test_df,  # test dataset segment
)

# Register Dataset
ds_response: DataSetRegistrationResponse = data_set.upload()
print(ds_response)

Uploading from Filepath

This method is suited to datasets stored on your local machine. If your datasets are stored locally and you want to register them with Markov, create a DataFamily, specify file paths for the training and testing dataset segments, and use the from_filepath method to create a DataSet object.
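As a minimal sketch (assuming pandas and scikit-learn are installed), you could produce the local train/test segments in the data/ folder yourself before registering them. The DataFrame contents here are placeholders; in practice you would load the sample Twitter sentiment CSV instead.

```python
import os

import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in DataFrame; in practice you could load the sample dataset with
# pd.read_csv("https://platform-assets.markovml.com/datasets/sample/twitter_sentiment.csv")
df = pd.DataFrame(
    {
        "tweet": [f"example tweet {i}" for i in range(10)],
        "sentiment": ["positive" if i % 2 else "negative" for i in range(10)],
    }
)

# Write the train/test segments that from_filepath will read from the data/ folder
os.makedirs("data", exist_ok=True)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df.to_csv("data/twitter_train.csv", index=False)
test_df.to_csv("data/twitter_test.csv", index=False)
```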

from markov.api.data.data_family import DataFamily
from markov.api.data.data_set import DataSet, DataSetRegistrationResponse
from markov.api.mkv_constants import DataCategory

# Create dataset family to tag the dataset
MarkovmlExampleDataFamily = DataFamily(
    notes="Example Data family for Markovml Datasets",
    name="MarkovMLExampleFamily",
)
df_response = MarkovmlExampleDataFamily.register()

# Create final dataset object formed from filepath as datasource to upload
# Select the x_col_names which are the features and the y_name as the target while registering the dataset
# here data is the folder that contains your dataset.
# This example is based on twitter sentiment dataset available here
# "https://platform-assets.markovml.com/datasets/sample/twitter_sentiment.csv"
data_set = DataSet.from_filepath(
    df_id=df_response.df_id,  # data family id
    x_col_names=["tweet"],  # features column
    y_name="sentiment",  # target column
    delimiter=",",  # delimiter used in the dataset
    name="FilepathUploadSDK_01",  # dataset name (should be unique)
    data_category=DataCategory.Text,  # dataset category (Text: if any feature column is text)
    train_source="data/twitter_train.csv",  # train dataset segment filepath
    test_source="data/twitter_test.csv",  # test dataset segment filepath
)

# Register Dataset
ds_response: DataSetRegistrationResponse = data_set.upload()
print(ds_response)