What Happens During Dataset Registration?

Dataset registration with MarkovML

Dataset Type Identification

Registration begins with MarkovML identifying the type of dataset being registered.

Analysis and Processing of Datasets

For datasets containing text, categorical, and numerical features, MarkovML conducts various analyses to extract insights. Users can customize these analyses using the MarkovML user interface (UI).

Pre-processing for Text Datasets

Before analyzing a text dataset, we perform specific pre-processing steps to clean, normalize, and prepare the data. These steps include:

  1. Data Cleansing: Removing duplicates, numbers, URLs, etc.
  2. Data Normalization: Eliminating stop words, converting to lowercase, removing punctuation, stemming, etc.

These pre-processing steps do not alter the original dataset.
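The cleansing and normalization steps listed above can be sketched in a few lines of Python. This is a minimal illustration and not MarkovML's actual pipeline: the function names (`preprocess`, `preprocess_dataset`, `naive_stem`), the tiny stop-word list, and the crude suffix-stripping stemmer are all assumptions made for the example. The sketch returns a new list, so the input rows stay unchanged.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption); real pipelines use fuller lists.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "at", "for"}
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUMBER_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def naive_stem(token):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Clean and normalize a single text value."""
    text = URL_RE.sub(" ", text)         # cleansing: drop URLs
    text = NUMBER_RE.sub(" ", text)      # cleansing: drop numbers
    text = text.lower()                  # normalization: lowercase
    text = text.translate(PUNCT_TABLE)   # normalization: strip punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(naive_stem(t) for t in tokens)


def preprocess_dataset(rows):
    """Return a new, de-duplicated list of cleaned rows; `rows` is not modified."""
    seen = set()
    cleaned = []
    for row in rows:
        result = preprocess(row)
        if result and result not in seen:
            seen.add(result)
            cleaned.append(result)
    return cleaned
```

Duplicates are removed after cleaning here, so rows that differ only in case, punctuation, or URLs collapse into one entry. Whether to de-duplicate before or after normalization is a design choice this sketch makes for simplicity.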

Data Analysis

Text Datasets

For text datasets, you can apply a range of analyzers from the MarkovML Analyzer library. These include N-gram analysis, keyword extraction, topic modeling, and data profiling.
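As an illustration of what N-gram analysis computes, the sketch below counts word-level n-grams across a list of already pre-processed documents. It is a generic example, not the MarkovML Analyzer library's implementation; the function name `ngram_counts` and whitespace tokenization are assumptions.

```python
from collections import Counter


def ngram_counts(documents, n=2):
    """Count word-level n-grams across whitespace-tokenized documents."""
    counts = Counter()
    for doc in documents:
        tokens = doc.split()
        # Slide a window of width n over the tokens; documents shorter
        # than n contribute nothing.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i : i + n])] += 1
    return counts
```

Calling `ngram_counts(docs, n=2).most_common(10)` would list the ten most frequent bigrams, which is the kind of summary an N-gram analyzer typically surfaces.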

Mixed-Categorical Datasets (Text + Categorical Values)

If a dataset contains both categorical and text columns, you can selectively apply analyzers to the text columns.

Numerical Datasets

Datasets primarily composed of numerical data undergo data profiling to extract extensive summary statistics.
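The kind of summary statistics a data profile reports for a numerical column can be sketched with Python's standard library. The exact statistics MarkovML computes are not listed here, so the set below (count, missing values, mean, standard deviation, min, median, max) and the function name `profile_column` are assumptions chosen for illustration.

```python
import statistics


def profile_column(values):
    """Summary statistics for one numerical column; None marks a missing value."""
    present = [v for v in values if v is not None]
    if not present:
        return {"count": 0, "missing": len(values)}
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.fmean(present),
        # Sample standard deviation needs at least two values.
        "std": statistics.stdev(present) if len(present) > 1 else 0.0,
        "min": min(present),
        "median": statistics.median(present),
        "max": max(present),
    }
```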

AutoML-Baseline Model and Quality Analysis

For datasets with a categorical target column, MarkovML conducts two additional analyses:

Baseline Model Analyzer

We train a baseline model using AutoML to establish a benchmark for performance evaluation.

Markov Quality Analyzer

This tool assesses the labeling quality of the dataset using various label error estimation algorithms.
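The specific label error estimation algorithms are not named here. To give a flavor of how such estimation can work, the sketch below applies a simplified, confident-learning-style heuristic: an example is flagged when a model is confident about a class other than its given label. The function name `flag_label_issues`, the per-class mean threshold, and the input format are assumptions for illustration, not a description of the Markov Quality Analyzer.

```python
def flag_label_issues(labels, pred_probs):
    """Return indices whose given label disagrees with a confident prediction.

    labels: list of int class ids.
    pred_probs: list of per-class probability lists, ideally produced
    out-of-sample (for example via cross-validation).
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: mean predicted probability of class k over
    # the examples that are labeled k.
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else 1.0)

    flagged = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model is "confident" about for this example.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            best = max(confident, key=lambda k: p[k])
            if best != y:
                flagged.append(i)
    return flagged
```

Flagged indices are candidates for review rather than confirmed errors; the share of flagged examples gives a rough estimate of label noise.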

Details of Baseline Model Training

When creating the baseline model, MarkovML handles data splits and model selection as follows:

  1. If train/test splits are specified during dataset registration, we use the train set for model training and the test set for evaluating final metrics.
  2. For datasets without specified splits, we automatically split them into train and test sets using stratified sampling, so that both sets preserve the class distribution. We apply k-fold cross-validation on the train set to reduce the risk of overfitting, keeping the test set isolated for final metric generation.
  3. The baseline model is trained via an AutoML pipeline, which automatically selects the best-performing model (e.g., XGBoost, LightGBM, Random Forest, logistic regression) and tunes its hyper-parameters using search techniques such as CFO and BlendSearch.
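The splitting logic in scenario 2 can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not MarkovML's implementation: the function names `stratified_split` and `k_fold`, the 20% test fraction, and the fixed seeds are all hypothetical. The AutoML model and hyper-parameter search is not shown.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices into train/test while preserving per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class (at least one example).
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def k_fold(indices, k=5, seed=0):
    """Yield (fit, validation) index lists; every index is validated exactly once."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation
```

In this sketch, cross-validation runs only over the train indices returned by `stratified_split`, so the test indices never influence model selection and remain available for the final metrics. The folds here are plain shuffled folds; a stratified variant would also balance classes within each fold.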

NOTE

For Enterprise Customers with a Hybrid deployment setup, MarkovML does not store datasets, and all analyzers run within the customer's VPC.