Text Data Normalization

Data normalization pipeline using the Markov SDK

The Markov SDK's text normalization utilities clean and standardize text before it is used in applications such as natural language processing and machine learning. The TextFilter options control which elements are normalized, so the process can be tailored to specific requirements.
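To make the idea of selectable filters concrete, here is a minimal standard-library sketch. It is not the SDK's implementation: SketchFilter, normalize_text, normalize_rows, and the regular expressions are hypothetical stand-ins, and the choice to delete matches (rather than replace them) is an assumption.

```python
import re
from enum import Enum


class SketchFilter(Enum):
    """Hypothetical stand-ins for a few filter options (not the SDK's enum)."""
    URL = r"https?://\S+"
    EMAIL = r"[\w.+-]+@[\w-]+\.[\w.-]+"
    MULTI_WHITESPACE = r"\s{2,}"


def normalize_text(text, filters):
    """Apply only the selected filters, in the order given."""
    for f in filters:
        # Whitespace runs collapse to one space; other matches are deleted.
        repl = " " if f is SketchFilter.MULTI_WHITESPACE else ""
        text = re.sub(f.value, repl, text)
    return text.strip()


def normalize_rows(rows, col_names, filters):
    """Normalize the named columns of each row; leave other columns untouched."""
    return [
        {col: normalize_text(val, filters) if col in col_names else val
         for col, val in row.items()}
        for row in rows
    ]


rows = [{"tweet": "Mail sales@example.com or see https://example.com", "id": "row-1"}]
cleaned = normalize_rows(
    rows,
    col_names=["tweet"],
    filters=[SketchFilter.URL, SketchFilter.EMAIL, SketchFilter.MULTI_WHITESPACE],
)
print(cleaned)
```

Passing a shorter filter list leaves the other elements in place, which is the kind of control the filter options are described as providing.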

Example Use Cases:

  1. Cleaning Text for NLP Tasks: Normalization cleans text data before natural language processing (NLP) techniques are applied.
  2. Preprocessing Social Media Data: Normalization removes or transforms elements such as mentions, hashtags, and URLs in text from social media platforms.
  3. Data Cleaning for Machine Learning Models: Normalization cleans and standardizes text so that it is suitable for training machine learning models that are sensitive to specific text patterns.
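The social media case can be sketched with plain regular expressions. The function name clean_social_text, the patterns, and the use_tokens switch are illustrative assumptions, not part of the Markov SDK; the sketch only shows the difference between removing an element and transforming it into a placeholder token.

```python
import re

# Hypothetical patterns for illustration; real-world URLs and handles vary more.
PATTERNS = {
    "URL": re.compile(r"https?://\S+"),
    "MENTION": re.compile(r"@\w+"),
    "HASHTAG": re.compile(r"#\w+"),
}


def clean_social_text(text: str, use_tokens: bool = False) -> str:
    """Remove URLs, mentions, and hashtags, or swap them for placeholder tokens."""
    for name, pattern in PATTERNS.items():
        replacement = f"<{name}>" if use_tokens else ""
        text = pattern.sub(replacement, text)
    # Collapse the whitespace left behind by removed elements.
    return re.sub(r"\s+", " ", text).strip()


post = "Great launch today @alice #python https://example.com/post"
print(clean_social_text(post))
print(clean_social_text(post, use_tokens=True))
```

Placeholder tokens keep the information that a URL or mention was present, which some models can use; outright removal gives shorter, cleaner text.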

The example below shows how to use Markov's text normalization functionality, focusing on the markov.dataset_utils.normalize method and the TextFilter options. It first normalizes a dataset's training split, then runs a fuller set of filters on a small hand-built DataFrame.

import pandas as pd
from pandas import DataFrame

import markov
from markov import TextFilter

# Look up a dataset by name
dataset = markov.dataset.get_by_name("Titanic")

# Normalize the training split directly
normalized_df = dataset.train.normalized_df(filters=[TextFilter.URL, TextFilter.EMAIL])

# Alternatively, fetch the training split as a DataFrame and normalize its columns explicitly
train_df: DataFrame = dataset.train.as_df()
col_names = list(train_df.columns)
updated_df = markov.dataset_utils.normalize(
    dataframe=train_df,
    col_names=col_names,
    filters=[TextFilter.URL, TextFilter.EMAIL],
)

# Define input dataframe
input_data = {
    "tweet": [
        "This is a sample tweet with a URL: https://example.com",
        "Another tweet mentioning #hashtags and @mentions.",
        "Check out this HTML-encoded text: <p>Hello, world!</p>",
        "Contact us at info@example.com for inquiries.",  # placeholder address
        "Remove multiple   whitespaces and      keep one.",
        "Price: $10.99 USD",
        "Call us at +1 (123) 456-7890 for assistance.",
        "This tweet contains emojis: 😃👍❤️",
        "Text inside brackets [like this]",
        'Special characters: !@#$%^&*()_+-={}[]|\\:;"<>,.?/~',
        "This tweet has\na new line.",
    ],
    "summary": [
        "Contains a URL",
        "Mentions #hashtags and @mentions",
        "Contains HTML-encoded text",
        "Contains an email address",
        "Contains multiple whitespaces",
        "Contains a currency amount",
        "Contains a phone number",
        "Contains emojis",
        "Contains text inside brackets",
        "Contains special characters",
        "Contains a new line",
    ],
}

# Create a DataFrame
input_dataframe = pd.DataFrame(input_data)

# Normalization pipeline
input_normalized_df = markov.dataset_utils.normalize(
    dataframe=input_dataframe,
    col_names=list(input_dataframe.columns),
    filters=[
        TextFilter.URL,
        TextFilter.EMAIL,
        TextFilter.BRACKET,
        TextFilter.EMOJI,
        TextFilter.CURRENCY,
        TextFilter.HTML,
        TextFilter.MENTION_AT,
        TextFilter.MENTION_HASH,
        TextFilter.MULTI_WHITESPACE,
        TextFilter.NEW_LINE,
        TextFilter.NUMBERS,
        TextFilter.SPECIAL_CHARS,
        TextFilter.PHONE,
    ],
)
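
When several filters are combined, the order in which they run can change the result. The source does not say whether the SDK applies filters in the order listed, so the following is only a standalone sketch with hypothetical patterns (URL, SPECIAL_CHARS, and apply_in_order are illustrative names, not SDK calls) showing why order can matter.

```python
import re

# Hypothetical (pattern, replacement) pairs; not the SDK's implementation.
URL = (r"https?://\S+", "")
SPECIAL_CHARS = (r"[^\w\s]", "")


def apply_in_order(text, steps):
    """Run each substitution in sequence, then collapse leftover whitespace."""
    for pattern, repl in steps:
        text = re.sub(pattern, repl, text)
    return re.sub(r"\s+", " ", text).strip()


sample = "Docs: https://example.com/guide now"

# URL removed first: the link disappears cleanly.
url_first = apply_in_order(sample, [URL, SPECIAL_CHARS])

# Special characters removed first: the link loses "://" and "." and
# no longer matches the URL pattern, so its letters remain in the text.
special_first = apply_in_order(sample, [SPECIAL_CHARS, URL])

print(url_first)
print(special_first)
```

If a normalizer does apply filters sequentially, structured elements such as URLs, emails, and phone numbers are best handled before broad filters such as special-character or number removal.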