Register Dataset

Dive deep into your data with data profiling, EDA, and data visualizations by registering your dataset with MarkovML.

Register your Dataset using the Markov SDK

Before registering a new dataset, make sure you have:

  1. The df_id of the DataFamily where this dataset will go. If you haven't created a data family yet, visit the Register Data Family page or refer to the complete sample code below.
  2. The credential_id for your cloud credentials registered with MarkovML, if you are uploading from the cloud.

You have three options for registering a dataset with MarkovML, depending on how and where your data is stored. Pick the method that suits you best: your data may be hosted in the cloud, available in memory as a DataFrame, or stored locally on your machine.

1. Upload from Cloud

Use this method when your data is already hosted on a cloud platform, particularly in an S3 bucket. First, create a DataFamily and register your S3 credentials for access. Then use the from_cloud() method to create a DataSet object. Ensure you have the S3 file paths (URIs) for all segments of your dataset.

Step 1: Create Data Family

Use Markov's DataFamily() constructor to create a new data family to tag the dataset, then register it with MarkovML using register(). Once registered, the data family is accessible from both the MarkovML Web UI and the SDK.

Sample Code

import markov
from markov.api.data.data_family import DataFamily

# Create dataset family to tag the dataset
MarkovmlExampleDataFamily = DataFamily(
    notes="Example Data family for Markovml Datasets",
    name="MarkovMLExampleFamily",
)
# Register your Data Family with MarkovML --> you can find it in the Web UI
df_response = MarkovmlExampleDataFamily.register()
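
The registration response carries the data family ID that the dataset constructors below expect. A quick way to confirm registration succeeded:

# df_id is the same field the from_cloud()/from_dataframe() calls below read
print(df_response.df_id)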

Step 2: Register S3 Credentials to MarkovML

Register the credentials needed to fetch the dataset from the S3 store using the markov.credentials.register_s3_credentials() method, providing the following details:

  1. name: Give a unique name to the S3 credentials.
  2. access_key: Paste the ACCESS_KEY of your AWS S3 bucket. This key is required to access your S3 storage and retrieve the dataset.
  3. access_secret: Paste the ACCESS_SECRET of your S3 bucket. Similar to the access key, this secret is also needed for authentication when accessing your S3 storage.
  4. notes: Add notes for future reference. (optional)

Sample Code

# Register the creds to fetch dataset from s3 store (if already created this will return the existing cred_response)
cred_response = markov.credentials.register_s3_credentials(
    name="MarkovmlWorkshopCredentials",
    access_key="<ACCESS KEY>",
    access_secret="<ACCESS SECRET>",
    notes="Creds to access datasets for cloud upload",
)
cred_id = cred_response.credential_id

📘 Note

If credentials with this name have already been registered, this call returns the existing cred_response.
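
If you prefer not to hard-code keys in your scripts, you can read them from environment variables instead. A minimal sketch using Python's standard library (the variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are common conventions, not something the SDK requires):

import os

import markov

# Read the keys from the environment instead of pasting them inline
cred_response = markov.credentials.register_s3_credentials(
    name="MarkovmlWorkshopCredentials",
    access_key=os.environ["AWS_ACCESS_KEY_ID"],
    access_secret=os.environ["AWS_SECRET_ACCESS_KEY"],
    notes="Creds to access datasets for cloud upload",
)
cred_id = cred_response.credential_id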

Step 3: Create a DataSet object

Use the from_cloud() method to create a DataSet object from a dataset uploaded to the cloud. Provide the following details:

df_id: ID of the data family.
x_col_names: Names of the feature columns.
y_name: Name of the target column.
delimiter: Delimiter used in the dataset segments.
name: Unique name for the dataset.
data_category: Category of the dataset (e.g., text data).
credential_id: ID of the credentials for accessing the S3 path.
train_source: Path to the training dataset segment in the S3 bucket.
test_source: Path to the testing dataset segment in the S3 bucket.

Sample Code

from markov.api.data.data_set import DataSet
from markov.api.mkv_constants import DataCategory

# Create the final dataset object from the cloud upload path
# Select x_col_names (features) and y_name (target) while registering the dataset
data_set = DataSet.from_cloud(
    df_id=df_response.df_id,  # data family id
    x_col_names=["tweet"],  # feature columns
    y_name="sentiment",  # target column
    delimiter=",",  # delimiter used for the input dataset segments
    name="CloudUploadSDK_01",  # dataset name (should be unique)
    data_category=DataCategory.Text,  # dataset category (use Text if any feature column contains text)
    credential_id=cred_id,  # pass cred id to access the s3 path
    train_source="s3://path_to_dataset/twitter_train.csv",  # path to the dataset segment in s3 bucket
    test_source="s3://path_to_dataset/twitter_test.csv",  # path to the dataset segment in s3 bucket
)
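
Optionally, before registering, you can confirm that the S3 URIs are reachable with the same keys. A minimal sketch using boto3 (boto3 is not part of the MarkovML SDK; the bucket and key names below are placeholders matching the example paths above):

import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="<ACCESS KEY>",
    aws_secret_access_key="<ACCESS SECRET>",
)
# head_object raises a ClientError if the object is missing or unreadable
s3.head_object(Bucket="path_to_dataset", Key="twitter_train.csv")
s3.head_object(Bucket="path_to_dataset", Key="twitter_test.csv")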

Step 4: Register and Upload your Dataset

After finishing the previous three steps, you are ready to register and upload your dataset to MarkovML. Simply use the upload() method to do so.

Sample Code

from markov.api.data.data_set import DataSetRegistrationResponse

# Register and upload your Dataset
ds_response: DataSetRegistrationResponse = data_set.upload()
print(ds_response)
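
Upload failures (bad credentials, unreachable paths, duplicate dataset names) surface at this point. A minimal sketch that catches and reports them, assuming the SDK raises ordinary Python exceptions (an assumption, not documented behavior):

try:
    ds_response = data_set.upload()
    print(ds_response)
except Exception as err:  # the exact exception types are SDK-specific
    print(f"Dataset upload failed: {err}")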

Complete Sample Code for Dataset Registration from Cloud

import markov
from markov.api.data.data_family import DataFamily
from markov.api.data.data_set import DataSet, DataSetRegistrationResponse
from markov.api.mkv_constants import DataCategory

# Create dataset family to tag the dataset
MarkovmlExampleDataFamily = DataFamily(
    notes="Example Data family for Markovml Datasets",
    name="MarkovMLExampleFamily",
)
df_response = MarkovmlExampleDataFamily.register()

# Register the creds to fetch dataset from s3 store (if already created this will return the existing cred_response)
cred_response = markov.credentials.register_s3_credentials(
    name="MarkovmlWorkshopCredentials",
    access_key="<ACCESS KEY>",
    access_secret="<ACCESS SECRET>",
    notes="Creds to access datasets for cloud upload",
)
cred_id = cred_response.credential_id
# Create the final dataset object from the cloud upload path
# Select x_col_names (features) and y_name (target) while registering the dataset
data_set = DataSet.from_cloud(
    df_id=df_response.df_id,  # data family id
    x_col_names=["tweet"],  # feature columns
    y_name="sentiment",  # target column
    delimiter=",",  # delimiter used for the input dataset segments
    name="CloudUploadSDK_01",  # dataset name (should be unique)
    data_category=DataCategory.Text,  # dataset category (use Text if any feature column contains text)
    credential_id=cred_id,  # pass cred id to access the s3 path
    train_source="s3://path_to_dataset/twitter_train.csv",  # path to the dataset segment in s3 bucket
    test_source="s3://path_to_dataset/twitter_test.csv",  # path to the dataset segment in s3 bucket
)

# Register and upload your Dataset
ds_response: DataSetRegistrationResponse = data_set.upload()
print(ds_response)

2. Upload from DataFrames

Use this method when your data is readily available in a DataFrame. It is a great choice when working with in-memory datasets or preprocessing data with pandas. Prepare your DataFrame, split it into training and test sets, create a DataFamily, and use the from_dataframe() method to create a DataSet object. Follow the steps below:

Step 1: Create Data Family

This step is identical to the one mentioned in the "Upload from Cloud" method above. For more details, refer to the previous section or check the complete sample code below.

Step 2: Split the DataFrame into Training and Test Sets

Use the code below to split the DataFrame you want to upload and register into training and test sets.

Sample Code

import pandas as pd
from sklearn.model_selection import train_test_split

# Prepare the dataframe to be uploaded
# You can also download the dataset from the link below
df = pd.read_csv(
    "https://platform-assets.markovml.com/datasets/sample/twitter_sentiment.csv"
)
train_df, test_df = train_test_split(df, test_size=0.2)
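
If the sentiment classes are imbalanced, a stratified split keeps their proportions consistent across the two segments. A minimal variant (stratify and random_state are standard scikit-learn options, not MarkovML requirements):

# Stratify on the target column and fix the seed for reproducibility
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["sentiment"], random_state=42
)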

Step 3: Create a DataSet object

Use the from_dataframe() method to create a final dataset object from a dataframe as the data source for uploading and registering your dataset. Provide the following details:

df_id: ID of the data family.
x_col_names: Names of the feature columns.
y_name: Name of the target column.
delimiter: Separator used in the dataset.
name: Unique name for the dataset.
data_category: Category of the dataset (e.g., text, numeric).
train_source: Training dataset segment.
test_source: Testing dataset segment.

Sample Code

from markov.api.data.data_set import DataSet
from markov.api.mkv_constants import DataCategory

# Create the final dataset object from a dataframe as the data source
# Select x_col_names (features) and y_name (target) while registering the dataset
data_set = DataSet.from_dataframe(
    df_id=df_response.df_id,  # data family id
    x_col_names=["tweet"],  # feature columns
    y_name="sentiment",  # target column
    delimiter=",",  # delimiter used in the dataset
    name="DataframeUploadSDK_01",  # dataset name (should be unique)
    data_category=DataCategory.Text,
    # dataset category (Text, Numeric, Categorical);
    # use Text if any feature column contains text
    train_source=train_df,  # train dataset segment
    test_source=test_df,  # test dataset segment
)

Step 4: Register and Upload your Dataset

This step is identical to the one mentioned in the "Upload from Cloud" method above. For more details, refer to the previous section or check the complete sample code below.

Complete Sample Code for Uploading from DataFrames

import pandas as pd
from sklearn.model_selection import train_test_split

from markov.api.data.data_family import DataFamily
from markov.api.data.data_set import DataSet, DataSetRegistrationResponse
from markov.api.mkv_constants import DataCategory

# Create dataset family to tag the dataset
MarkovmlExampleDataFamily = DataFamily(
    notes="Example Data family for Markovml Datasets",
    name="MarkovMLExampleFamily",
)
df_response = MarkovmlExampleDataFamily.register()

# Prepare the dataframe to be uploaded
# You can also download the dataset from the link below
df = pd.read_csv(
    "https://platform-assets.markovml.com/datasets/sample/twitter_sentiment.csv"
)
train_df, test_df = train_test_split(df, test_size=0.2)

# Create the final dataset object from a dataframe as the data source
# Select x_col_names (features) and y_name (target) while registering the dataset
data_set = DataSet.from_dataframe(
    df_id=df_response.df_id,  # data family id
    x_col_names=["tweet"],  # feature columns
    y_name="sentiment",  # target column
    delimiter=",",  # delimiter used in the dataset
    name="DataframeUploadSDK_01",  # dataset name (should be unique)
    data_category=DataCategory.Text,
    # dataset category (Text, Numeric, Categorical);
    # use Text if any feature column contains text
    train_source=train_df,  # train dataset segment
    test_source=test_df,  # test dataset segment
)

# Register and upload your Dataset
ds_response: DataSetRegistrationResponse = data_set.upload()
print(ds_response)

3. Upload from Filepath

Use this method when your datasets are stored on your local machine and you want to register them with MarkovML. Simply create a DataFamily, provide file paths for the training and testing dataset segments, and use the from_filepath() method to create a DataSet object. Follow the steps below:

Step 1: Create Data Family

This step is identical to the one mentioned in the "Upload from Cloud" method above. For more details, refer to the previous section or check the complete sample code below.

Step 2: Create Dataset Object

Use the from_filepath() method to create the final dataset object from file paths as the data source, providing the following details:

df_id: The ID of the data family.
x_col_names: Names of the feature columns.
y_name: Name of the target column.
delimiter: Separator used in the dataset.
name: Unique name for the dataset being uploaded.
data_category: Category of the dataset (e.g., text, numeric).
train_source: File path for the training dataset segment.
test_source: File path for the testing dataset segment.

Sample Code Snippet

from markov.api.data.data_set import DataSet
from markov.api.mkv_constants import DataCategory

data_set = DataSet.from_filepath(
    df_id=df_response.df_id,  # data family id
    x_col_names=["tweet"],  # feature columns
    y_name="sentiment",  # target column
    delimiter=",",  # delimiter used in the dataset
    name="FilepathUploadSDK_01",  # dataset name (should be unique)
    data_category=DataCategory.Text,  # dataset category (use Text if any feature column contains text)
    train_source="data/twitter_train.csv",  # train dataset segment filepath
    test_source="data/twitter_test.csv",  # test dataset segment filepath
)
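
The snippet above assumes the train and test CSVs already exist under the data/ folder. If you start from a single local file, one way to produce those segments is with pandas and scikit-learn (a sketch under that assumption; neither library is required by from_filepath() itself):

import os

import pandas as pd
from sklearn.model_selection import train_test_split

# Split one local CSV into the two segment files used above
os.makedirs("data", exist_ok=True)
df = pd.read_csv("twitter_sentiment.csv")  # your local copy of the sample dataset
train_df, test_df = train_test_split(df, test_size=0.2)
train_df.to_csv("data/twitter_train.csv", index=False)
test_df.to_csv("data/twitter_test.csv", index=False)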

Step 3: Register and Upload your Dataset

This step is identical to the one mentioned in the "Upload from Cloud" method above. For more details, refer to the previous section or check the complete sample code below.

Complete Sample Code for Uploading from FilePath

from markov.api.data.data_family import DataFamily
from markov.api.data.data_set import DataSet, DataSetRegistrationResponse
from markov.api.mkv_constants import DataCategory

# Create dataset family to tag the dataset
MarkovmlExampleDataFamily = DataFamily(
    notes="Example Data family for Markovml Datasets",
    name="MarkovMLExampleFamily",
)
df_response = MarkovmlExampleDataFamily.register()

# Create the final dataset object from a filepath as the data source
# Select x_col_names (features) and y_name (target) while registering the dataset
# here data is the folder that contains your dataset.
# This example is based on twitter sentiment dataset available here
# "https://platform-assets.markovml.com/datasets/sample/twitter_sentiment.csv"
data_set = DataSet.from_filepath(
    df_id=df_response.df_id,  # data family id
    x_col_names=["tweet"],  # feature columns
    y_name="sentiment",  # target column
    delimiter=",",  # delimiter used in the dataset
    name="FilepathUploadSDK_01",  # dataset name which is being used for upload
    data_category=DataCategory.Text,  # dataset category (Text: If any of the feature column is text)
    train_source="data/twitter_train.csv",  # train dataset segment filepath
    test_source="data/twitter_test.csv",  # test dataset segment filepath
)

# Register and upload your Dataset
ds_response: DataSetRegistrationResponse = data_set.upload()
print(ds_response)

What’s Next