Text Data Normalization

Data normalization pipeline using Markov SDK

The Markov SDK's text normalization utilities clean and standardize text data before it is used in applications such as natural language processing and machine learning. The TextFilter options control which elements are normalized, so you can tailor the process to your specific requirements.

Example Use Cases:

  1. Cleaning Text for NLP Tasks: Cleans text data before natural language processing (NLP) techniques are applied.
  2. Preprocessing Social Media Data: Removes or transforms elements such as mentions, hashtags, and URLs in text from social media platforms.
  3. Data Cleaning for Machine Learning Models: Standardizes text data so that it is suitable for training machine learning models that are sensitive to specific text patterns.
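
To make the social media use case concrete, here is a minimal sketch of what removing URLs, mentions, and hashtags can look like. It uses only Python's standard `re` module and is not the SDK's implementation; the patterns and the `strip_social_noise` helper are illustrative assumptions.

```python
import re

# Illustrative patterns only (assumptions); the SDK's own rules may differ.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
WHITESPACE_RE = re.compile(r"\s+")


def strip_social_noise(text: str) -> str:
    """Remove URLs, @mentions and #hashtags, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


print(strip_social_noise("Loving the new release https://example.com via @alice #launch"))
# Loving the new release via
```

Each element type gets its own pattern here, which mirrors the idea of choosing filters individually rather than applying one fixed cleanup.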

TextFilter Example

The example below shows how to preprocess text data with the Markov SDK's text normalization functionality, focusing on the markov.dataset_utils.normalize method and the TextFilter options.

1. With MarkovML registered dataset

Step 1: Fetch your dataset

Fetch your registered dataset by name or id using the dataset.get_by_name() or dataset.get_by_id() method.

Sample Code

import markov
# Fetch dataset by name
dataset = markov.dataset.get_by_name("Insert dataset name")

Step 2: Normalize your dataset using TextFilter

Use MarkovML's TextFilter to remove specific elements, such as URLs and email addresses, from your dataset. Filtering out this unwanted information leaves cleaner text for downstream use.

Use dataset.train.normalized_df(filters=[TextFilter.URL, TextFilter.EMAIL]) to normalize your training inputs with the URL and EMAIL filters. Then, create an updated normalized data frame using markov.dataset_utils.normalize() as shown below:

Sample Code

from pandas import DataFrame
from markov import TextFilter

# Normalize the train dataset using MarkovML's TextFilter
normalized_df = dataset.train.normalized_df(filters=[TextFilter.URL, TextFilter.EMAIL])
train_df: DataFrame = dataset.train.as_df()
col_names = list(train_df.columns)

# Update the dataframe with normalized data
updated_df = markov.dataset_utils.normalize(
    dataframe=train_df,
    col_names=col_names,
    filters=[TextFilter.URL, TextFilter.EMAIL],
)

Complete Sample Code for Registered Twitter Sentiment Dataset

import pandas as pd
from pandas import DataFrame

import markov
from markov import TextFilter

# Fetch dataset by name.
# Make sure your dataset has a training set to run this sample code.
dataset = markov.dataset.get_by_name("twitter sentiment")

# Normalize the train dataset using MarkovML's TextFilter
normalized_df = dataset.train.normalized_df(filters=[TextFilter.URL, TextFilter.EMAIL])
train_df: DataFrame = dataset.train.as_df()
col_names = list(train_df.columns)

# Update the dataframe with normalized data
updated_df = markov.dataset_utils.normalize(
    dataframe=train_df,
    col_names=col_names,
    filters=[TextFilter.URL, TextFilter.EMAIL],
)
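
After normalizing, it can be useful to confirm that the chosen filters left nothing behind. The sketch below is one possible sanity check, not part of the Markov SDK: `count_leftovers` and its regular expressions are hypothetical helpers written with the standard library, and the patterns are deliberately loose.

```python
import re
from typing import Dict, Iterable

# Loose, illustrative patterns (assumptions): good enough to flag leftovers,
# not to validate URLs or email addresses strictly.
LEFTOVER_PATTERNS = {
    "url": re.compile(r"https?://\S+|www\.\S+"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}


def count_leftovers(values: Iterable[object]) -> Dict[str, int]:
    """Count how many values still contain a URL or an email address."""
    counts = {name: 0 for name in LEFTOVER_PATTERNS}
    for value in values:
        text = str(value)
        for name, pattern in LEFTOVER_PATTERNS.items():
            if pattern.search(text):
                counts[name] += 1
    return counts


sample = ["see https://example.com", "mail info@example.com", "clean text"]
print(count_leftovers(sample))  # {'url': 1, 'email': 1}
```

Assuming updated_df is a pandas DataFrame, as its name in the sample above suggests, a column can be checked with count_leftovers(updated_df[col].tolist()); counts of zero for every column would indicate that no matching text remains.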

2. With any sample dataset

Start by defining your dataset and converting it into a pandas dataframe if it's not already in that format. Then, utilize the markov.dataset_utils.normalize() utility function to normalize your dataset using TextFilter. This function helps filter your data by removing elements such as URLs, email addresses, brackets, and more, as shown in the sample code below.

Complete Sample Code

import pandas as pd
from pandas import DataFrame

import markov
from markov import TextFilter

# Define input data
input_data = {
    "tweet": [
        "This is a sample tweet with a URL: https://example.com",
        "Another tweet mentioning #hashtags and @mentions.",
        "Check out this HTML-encoded text: <p>Hello, world!</p>",
        "Contact us at info@example.com for inquiries.",
        "Remove multiple   whitespaces and      keep one.",
        "Price: $10.99 USD",
        "Call us at +1 (123) 456-7890 for assistance.",
        "This tweet contains emojis: 😃👍❤️",
        "Text inside brackets [like this]",
        'Special characters: !@#$%^&*()_+-={}[]|\\:;"<>,.?/~',
        "This tweet has\na new line.",
    ],
    "summary": [
        "Contains a URL",
        "Mentions #hashtags and @mentions",
        "Contains HTML-encoded text",
        "Contains an email address",
        "Contains multiple whitespaces",
        "Contains a currency amount",
        "Contains a phone number",
        "Contains emojis",
        "Contains text inside brackets",
        "Contains special characters",
        "Contains a new line",
    ],
}

# Create a DataFrame
input_dataframe = pd.DataFrame(input_data)

# Normalization pipeline
input_normalized_df = markov.dataset_utils.normalize(
    dataframe=input_dataframe,
    col_names=input_dataframe.columns,
    filters=[
        TextFilter.URL,
        TextFilter.EMAIL,
        TextFilter.BRACKET,
        TextFilter.EMOJI,
        TextFilter.CURRENCY,
        TextFilter.HTML,
        TextFilter.MENTION_AT,
        TextFilter.MENTION_HASH,
        TextFilter.MULTI_WHITESPACE,
        TextFilter.NEW_LINE,
        TextFilter.NUMBERS,
        TextFilter.SPECIAL_CHARS,
        TextFilter.PHONE,
    ],
)
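
With this many filters enabled, it helps to see which rows were actually changed. The following is a small sketch for comparing text before and after normalization; `changed_rows` is a hypothetical helper, and the `after` list is a hand-written stand-in rather than real SDK output.

```python
from typing import List, Sequence, Tuple


def changed_rows(before: Sequence[str], after: Sequence[str]) -> List[Tuple[int, str, str]]:
    """Return (index, before, after) for every row whose text changed."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    return [(i, b, a) for i, (b, a) in enumerate(zip(before, after)) if b != a]


before = ["Visit https://example.com now", "Nothing to clean"]
after = ["Visit now", "Nothing to clean"]  # hand-written stand-in, not SDK output

for index, old, new in changed_rows(before, after):
    print(f"row {index}: {old!r} -> {new!r}")
```

Assuming the normalized result is a pandas DataFrame with the same columns as the input, the two arguments could be input_dataframe["tweet"].tolist() and input_normalized_df["tweet"].tolist().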
