What Happens During Dataset Registration?

Dataset Type Identification

First, we determine the type of dataset that has been registered.

Analysis and Processing of Datasets

For Text, Categorical, and Numerical datasets, we perform various analyses to generate insights. Users can customize the analyses they want to run on their datasets using our user interface (UI).

Pre-processing for Text Datasets

Before running an analysis on a Text dataset, we perform a series of pre-processing steps specific to that analysis. These steps clean, normalize, and prepare the dataset for analysis; the original dataset itself remains unaltered throughout. The pre-processing steps include actions like:

  • Data Cleansing: Removing duplicates, numbers, URLs, etc.
  • Data Normalization: Eliminating stop words, converting to lowercase, removing punctuation, stemming, etc.
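The cleansing and normalization actions above can be sketched in a few lines of Python. This is a minimal illustration, not MarkovML's actual pipeline: the stop-word list is deliberately tiny, and the crude suffix-stripping stands in for a real stemmer (e.g. Porter).

```python
import re

# Placeholder stop-word list for illustration; a real pipeline uses a fuller set.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and"}

def preprocess(text: str) -> list[str]:
    """Clean and normalize one document, returning tokens ready for analysis."""
    text = re.sub(r"https?://\S+", " ", text)  # data cleansing: drop URLs
    text = re.sub(r"\d+", " ", text)           # data cleansing: drop numbers
    text = text.lower()                        # normalization: lowercase
    text = re.sub(r"[^\w\s]", " ", text)       # normalization: strip punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Crude suffix stripping stands in for a real stemmer.
    return [t[:-1] if t.endswith("s") else t for t in tokens]

print(preprocess("Visit https://example.com: the 2 cats are running!"))
# → ['visit', 'cat', 'running']
```

Note that the function returns a new token list and never mutates its input, mirroring the guarantee that the registered dataset is left unaltered.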

Data Analysis

Text Datasets:
After pre-processing, we apply various analyzers from the Markov ML Analyzer library. These include N-gram analysis, keyword extraction, topic modeling, and data profiling, among others.
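To make one of these concrete, N-gram analysis counts runs of adjacent tokens. The helper below is a hypothetical sketch, not the Markov ML Analyzer API:

```python
from collections import Counter

def ngram_counts(tokens: list[str], n: int = 2) -> Counter:
    """Count n-grams (bigrams by default) over a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = ["data", "quality", "matters", "data", "quality"]
print(ngram_counts(tokens).most_common(1))
# → [(('data', 'quality'), 2)]
```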

Mixed-Categorical Datasets (Text + Categorical Values):
If the dataset contains both categorical and text columns, we selectively run analyzers on the text columns.
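Distinguishing text columns from categorical ones might look like the heuristic sketch below. The detection rule (many distinct values, or multi-word entries) is an assumption for illustration, not MarkovML's actual logic:

```python
def is_text_column(values: list[str], max_categories: int = 10) -> bool:
    """Heuristic: treat a column as free text if it has many distinct values
    or multi-word entries; otherwise treat it as categorical."""
    distinct = set(values)
    has_multiword = any(len(v.split()) > 3 for v in distinct)
    return has_multiword or len(distinct) > max_categories

dataset = {
    "review": ["great product, would buy again and again",
               "arrived broken, very disappointed overall"],
    "rating": ["positive", "negative"],
}
# Only the free-text columns are routed to the text analyzers.
text_cols = [name for name, vals in dataset.items() if is_text_column(vals)]
print(text_cols)
# → ['review']
```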

Numerical Datasets:
For datasets primarily composed of numerical data, we run data profilers to extract extensive summary statistics.
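The kind of summary statistics a data profiler reports can be sketched with the standard library. This is illustrative only; the actual profiler extracts far more:

```python
import statistics

def profile(values: list[float]) -> dict:
    """Basic summary statistics of the kind a data profiler reports."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }

print(profile([3.0, 1.0, 4.0, 1.0, 5.0]))
```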

AutoML-Baseline Model and Quality Analysis

If the dataset has a target column with categorical output, we run two additional analyzers:

  • Baseline Model Analyzer: We train a baseline model on the dataset using AutoML (details below).
  • Markov Quality Analyzer: This tool measures the labeling quality of the dataset using various label error estimation algorithms.
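The specific label error estimation algorithms aren't detailed here, but a common approach in this family flags samples whose assigned label receives low out-of-sample predicted probability (confident-learning style). The helper below is a hypothetical sketch, not the Markov Quality Analyzer itself:

```python
def flag_label_errors(labels, predicted_probs, threshold=0.5):
    """Flag indices whose assigned label receives low predicted probability.

    predicted_probs: one dict per sample mapping class -> probability,
    e.g. from cross-validated, out-of-sample model predictions.
    """
    return [i for i, (y, probs) in enumerate(zip(labels, predicted_probs))
            if probs.get(y, 0.0) < threshold]

labels = ["spam", "ham", "spam"]
probs = [{"spam": 0.9, "ham": 0.1},
         {"spam": 0.8, "ham": 0.2},  # labeled "ham" but the model disagrees
         {"spam": 0.6, "ham": 0.4}]
print(flag_label_errors(labels, probs))
# → [1]
```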

Details of Baseline Model Training:
When creating the baseline model, we consider the following scenarios:

  1. If you've specified train/test splits during dataset registration, we use the train set for model training and the test set for generating final metrics.
  2. For datasets without specified splits, we automatically split them into train and test sets using stratified sampling to ensure representativeness. We use k-fold cross-validation to prevent overfitting, and the test set remains isolated for final model metric generation.
  3. The baseline model is trained through an AutoML pipeline, which autonomously selects the optimal model from options like XGBoost, LGBM, Random Forest, and Logistic Regression. It also determines the best hyperparameters using techniques such as CFO and BlendSearch.
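The stratified split from scenario 2 can be sketched as follows. This is a minimal illustration under the stated assumptions (shuffle within each class, reserve a fraction per class for test), not the pipeline's actual implementation:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Split sample indices into train/test while preserving label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_label.values():
        rng.shuffle(idx)
        cut = max(1, int(len(idx) * test_frac))  # at least one test sample per class
        test_idx.extend(idx[:cut])
        train_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)

labels = ["a"] * 8 + ["b"] * 2
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
# → 8 2
```

Because the split is stratified, the rare class "b" still appears in the test set; the test indices then stay isolated from training and cross-validation, as described above.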

NOTE: If you are an Enterprise Customer with a Hybrid deployment setup, MarkovML does not store your dataset. All the analyzers are run in your VPC.