Aug 06, 2024●3 reads●No License

Distance Profile Time Step Classifier

Mo Abdelhamid

Introduction

The distance profile is a crucial measure in time series data mining, used extensively for similarity search and nearest-neighbor search tasks. It involves calculating the distance between a query subsequence and every subsequence in a time series. This operation, while conceptually simple, forms the foundation for many advanced analytical tasks, including anomaly detection, motif discovery, and time series segmentation.

In our work, we use MASS (Mueen’s Algorithm for Similarity Search) to compute the distance profile. MASS obtains the dot products between the query and every subsequence with the fast Fourier transform, so a full distance profile costs O(n log n) time regardless of the query length. This efficiency is critical when dealing with real-world time series data, where the volume of data can be substantial.

The distance profile's versatility allows it to handle various types of queries, including weighted queries and those involving multidimensional data. By aggregating distance profiles across multiple dimensions, we can find the nearest neighbors in a multidimensional space, further enhancing the algorithm's applicability.

Distance Profile

Definition

The distance profile of a query subsequence 𝑄 with respect to a time series 𝑇 is a vector where each element represents the distance between 𝑄 and a corresponding subsequence of 𝑇. Formally, if 𝑇 is a time series of length 𝑛 and 𝑄 is a subsequence of length 𝑚, the distance profile 𝐷 is an 𝑛−𝑚+1 length vector where 𝐷[𝑖] is the distance between 𝑄 and the subsequence of 𝑇 starting at index 𝑖.

Calculation

The most common distance measure used in calculating the distance profile is the z-normalized Euclidean distance, which is invariant to changes in scale and offset. The process involves:

Z-normalization: The query 𝑄 and each subsequence of 𝑇 are individually normalized to have zero mean and unit variance.
Distance Computation: The Euclidean distance between the normalized 𝑄 and each normalized subsequence of 𝑇 is computed and stored in the distance profile vector.
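
The effect of the two steps above can be checked on a small example. The following is a minimal sketch, not part of the original implementation; the helper name znorm_distance and the sample values are assumptions chosen for illustration. It shows that a copy of the query with a different scale and offset has distance zero, while the mirrored shape reaches the maximum possible distance of 2√𝑚.

```python
import numpy as np

def z_normalize(x):
    """Rescale x to zero mean and unit variance (assumes x is not constant)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def znorm_distance(a, b):
    """Euclidean distance between the z-normalized versions of a and b."""
    return np.linalg.norm(z_normalize(a) - z_normalize(b))

query = np.array([2.0, 3.0, 5.0, 4.0])
same_shape = 10.0 * query + 7.0                  # same shape, new scale and offset
mirrored = np.array([5.0, 4.0, 2.0, 3.0])        # the query flipped around its mean

d_raw = np.linalg.norm(query - same_shape)       # large: raw values differ a lot
d_same = znorm_distance(query, same_shape)       # ~0: shapes are identical
d_mirrored = znorm_distance(query, mirrored)     # 2 * sqrt(m): perfectly anti-correlated

print(d_raw, d_same, d_mirrored)
```

Because both inputs are normalized before comparison, the distance depends only on shape, which is why it is bounded between 0 and 2√𝑚 for subsequences of length 𝑚.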

Distance Profile Implementation using NumPy

The following code is a simple implementation of the distance profile algorithm on a one-dimensional series.

import numpy as np

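def z_normalize(ts):
    """Z-normalize a time series to zero mean and unit variance."""
    std = np.std(ts)
    if std == 0:
        # A constant sequence has no shape; return zeros instead of dividing by zero
        return np.zeros_like(ts, dtype=float)
    return (ts - np.mean(ts)) / std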

def sliding_window_view(arr, window_size):
    """Generate a sliding window view of the array."""
    return np.lib.stride_tricks.sliding_window_view(arr, window_size)

def distance_profile(query, ts):
    """Compute the distance profile of a query within a time series."""
    query_len = len(query)
    
    # Z-normalize the query
    query = z_normalize(query)
    
    # Generate all subsequences of the time series
    subsequences = sliding_window_view(ts, query_len)
    
    # Z-normalize the subsequences
    subsequences = np.apply_along_axis(z_normalize, 1, subsequences)
    
    # Compute the distance profile
    distances = np.linalg.norm(subsequences - query, axis=1)
    
    return distances

# Example time series and query
time_series = np.array([1, 2, 3, 4, 2, 1, 2, 3, 4, 3, 2, 1, 2, 3, 4])
query = np.array([2, 3, 4])

# Compute the distance profile
dist_profile = distance_profile(query, time_series)

print("Distance Profile:", dist_profile)

This code is provided for illustration purposes only: it normalizes and compares every subsequence explicitly, which takes O(nm) time. For a more efficient implementation, use the matrixprofile or stumpy Python packages.
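
For intuition about where the speed of MASS comes from, the sketch below computes the same distance profile from FFT-based sliding dot products and cumulative sums. It is a simplified illustration under stated assumptions, not the code used by stumpy or matrixprofile: the function names mass_sketch, sliding_dot_product, and naive_profile are made up for this example, and the sketch assumes that neither the query nor any subsequence is constant.

```python
import numpy as np

def sliding_dot_product(query, ts):
    """Dot product of the query with every window of ts, computed via FFT."""
    n, m = len(ts), len(query)
    q_rev = np.zeros(n)
    q_rev[:m] = query[::-1]                      # reversed, zero-padded query
    conv = np.fft.irfft(np.fft.rfft(ts) * np.fft.rfft(q_rev), n)
    return conv[m - 1:]                          # entries free of wrap-around

def mass_sketch(query, ts):
    """Z-normalized distance profile without normalizing each window explicitly."""
    query = np.asarray(query, dtype=float)
    ts = np.asarray(ts, dtype=float)
    m = len(query)
    qt = sliding_dot_product(query, ts)

    # Rolling mean and standard deviation of ts from cumulative sums
    csum = np.cumsum(np.insert(ts, 0, 0.0))
    csum_sq = np.cumsum(np.insert(ts ** 2, 0, 0.0))
    mu_t = (csum[m:] - csum[:-m]) / m
    var_t = (csum_sq[m:] - csum_sq[:-m]) / m - mu_t ** 2
    sigma_t = np.sqrt(np.maximum(var_t, 0.0))

    # d^2 = 2m * (1 - correlation between the query and each window)
    mu_q, sigma_q = query.mean(), query.std()
    corr = (qt - m * mu_q * mu_t) / (m * sigma_q * sigma_t)
    return np.sqrt(np.maximum(2.0 * m * (1.0 - corr), 0.0))

def naive_profile(query, ts):
    """Reference implementation: normalize and compare every window."""
    q = (query - query.mean()) / query.std()
    windows = np.lib.stride_tricks.sliding_window_view(ts, len(query))
    mu = windows.mean(axis=1, keepdims=True)
    sigma = windows.std(axis=1, keepdims=True)
    return np.linalg.norm((windows - mu) / sigma - q, axis=1)

time_series = np.array([1, 2, 3, 4, 2, 1, 2, 3, 4, 3, 2, 1, 2, 3, 4], dtype=float)
query = np.array([2, 3, 4], dtype=float)

fast = mass_sketch(query, time_series)
naive = naive_profile(query, time_series)
print(np.allclose(fast, naive, atol=1e-6))
```

The key identity is that the squared z-normalized distance equals 2𝑚(1 − 𝜌), where 𝜌 is the Pearson correlation between the query and a window, so only dot products and rolling statistics are needed.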

Example of Distance Profile with Plots

Description

This figure illustrates the process of analyzing a time series to identify the occurrence of a specific query subsequence using the distance profile. It comprises three subplots:

Query Subsequence:

This subplot shows the query subsequence that we are searching for within the larger time series.
The query subsequence is a short segment of the time series that serves as the pattern we want to locate in the main time series.

Time Series:

This subplot displays the entire time series in which we are searching for the query subsequence.
The x-axis represents the index of the data points within the time series.
The y-axis represents the amplitude of the time series.
Two vertical red dashed lines indicate the start and end points of where the query subsequence appears within the time series. This helps visualize the exact location of the query within the broader time series.

Distance Profile:

This subplot presents the distance profile, which is a measure of how similar each subsequence within the time series is to the query subsequence.
The x-axis represents the index of the subsequence within the time series.
The y-axis represents the z-normalized Euclidean distance, with lower values indicating higher similarity to the query subsequence.
The distance profile helps identify regions in the time series that closely match the query subsequence. Peaks and troughs in the distance profile highlight the varying degrees of similarity across different parts of the time series.

Interpretation

Query Identification: The vertical lines in the time series plot clearly mark the region that matches the query subsequence, providing a visual cue for where the query is located.
Similarity Measure: The distance profile plot allows us to see the similarity across the entire time series. Low values in the distance profile correspond to high similarity, indicating potential matches for the query subsequence.
Pattern Detection: This combined visualization is particularly useful for identifying recurring patterns, anomalies, or specific behaviors within a time series. The distance profile is an essential tool in time series analysis, offering a quantitative measure of similarity.
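
A figure with the three subplots described above could be produced along the following lines. This is a sketch under assumptions: the synthetic random-walk series, the query position (start = 80, length 25), and the styling are chosen for illustration and are not the data behind the original figure.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line for interactive use
import matplotlib.pyplot as plt

def distance_profile(query, ts):
    """Naive z-normalized distance profile (assumes no constant window)."""
    q = (query - query.mean()) / query.std()
    windows = np.lib.stride_tricks.sliding_window_view(ts, len(query))
    mu = windows.mean(axis=1, keepdims=True)
    sigma = windows.std(axis=1, keepdims=True)
    return np.linalg.norm((windows - mu) / sigma - q, axis=1)

# Synthetic data: a random walk, with the query cut out of it
rng = np.random.default_rng(0)
ts = rng.normal(size=200).cumsum()
start, m = 80, 25
query = ts[start:start + m]
profile = distance_profile(query, ts)

fig, axes = plt.subplots(3, 1, figsize=(8, 7))

axes[0].plot(query)
axes[0].set_title("Query Subsequence")

axes[1].plot(ts)
axes[1].axvline(start, color="red", linestyle="--")          # start of the query
axes[1].axvline(start + m - 1, color="red", linestyle="--")  # end of the query
axes[1].set_title("Time Series")
axes[1].set_xlabel("Index")
axes[1].set_ylabel("Amplitude")

axes[2].plot(profile)
axes[2].set_title("Distance Profile")
axes[2].set_xlabel("Subsequence start index")
axes[2].set_ylabel("Z-normalized Euclidean distance")

fig.tight_layout()
# plt.show()  # uncomment when running interactively
```

Since the query is taken from the series itself, the distance profile drops to zero at the query's start index, which is the trough that the vertical lines in the middle subplot correspond to.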

Multi-Dimensional Matrix Profile Calculation

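In our work, we developed a method to calculate what we refer to as the multi-dimensional matrix profile using Mueen’s Algorithm for Similarity Search (MASS). For each dimension, MASS computes the distance profile of the query against the time series, and the per-dimension profiles are summed into a single profile. This handles multi-dimensional time series data, which is common in real-world applications where data is collected across multiple channels or features simultaneously. Both inputs are two-dimensional arrays with one column per dimension.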

import numpy as np
import stumpy

def multi_dimensional_mass(self, query_subsequence, time_series) -> np.ndarray:
    """
    Calculate the multi-dimensional matrix profile, i.e. the sum of the
    per-dimension MASS distance profiles.

    Args:
        query_subsequence (np.ndarray): The query subsequence, shape (m, d).
        time_series (np.ndarray): The time series, shape (n, d).

    Returns:
        np.ndarray: The aggregated profile, length n - m + 1.
    """
    # This is a method of the classifier class; `self` is not used here.
    n_windows = time_series.shape[0] - query_subsequence.shape[0] + 1
    profile = np.zeros(n_windows)
    for dim in range(time_series.shape[1]):
        # Add the distance profile of this dimension to the running total
        profile += stumpy.core.mass(
            query_subsequence[:, dim], time_series[:, dim]
        )
    return profile

Generating Predictions

We generate predictions from the computed multi-dimensional matrix profiles. For each query, we compare it with every subsequence of the time series, calculate the z-normalized Euclidean distance in each dimension, and sum the distances across dimensions. We then rank the subsequences by this aggregated distance. The subsequences with the lowest distances are considered the best matches, and they provide the predictions about the presence and location of specific patterns within the time series. Because the per-dimension distances are non-negative and summed, a subsequence can only rank highly if it resembles the query in every dimension, so the multi-dimensional nature of the data is taken into account.
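
The ranking step can be sketched with NumPy alone. The code below is an illustrative assumption rather than the original implementation: the function names, the exclusion zone used to skip overlapping matches, the synthetic two-channel data, and the final step of marking matched time steps are all hypothetical choices made for this example.

```python
import numpy as np

def distance_profile_1d(query, ts):
    """Naive z-normalized distance profile (assumes no constant window)."""
    q = (query - query.mean()) / query.std()
    windows = np.lib.stride_tricks.sliding_window_view(ts, len(query))
    mu = windows.mean(axis=1, keepdims=True)
    sigma = windows.std(axis=1, keepdims=True)
    return np.linalg.norm((windows - mu) / sigma - q, axis=1)

def aggregated_profile(query, ts):
    """Sum the per-dimension profiles; query is (m, d), ts is (n, d)."""
    return sum(distance_profile_1d(query[:, k], ts[:, k])
               for k in range(ts.shape[1]))

def top_k_matches(profile, k, exclusion):
    """Start indices of the k lowest-distance windows, skipping near-duplicates."""
    profile = profile.copy()
    matches = []
    for _ in range(k):
        best = int(np.argmin(profile))
        if not np.isfinite(profile[best]):
            break                                # nothing left to pick
        matches.append(best)
        lo = max(0, best - exclusion)
        profile[lo:best + exclusion + 1] = np.inf  # mask neighbours of this match
    return matches

# Synthetic two-channel series with the same pattern planted at 50 and 200
rng = np.random.default_rng(0)
n, m = 300, 20
ts = 0.1 * rng.normal(size=(n, 2))
phase = np.linspace(0, 2 * np.pi, m)
pattern = np.column_stack([np.sin(phase), np.cos(phase)])
ts[50:70] += pattern
ts[200:220] += pattern

query = ts[50:70].copy()
profile = aggregated_profile(query, ts)
matches = top_k_matches(profile, k=2, exclusion=m // 2)

# Hypothetical prediction: flag every time step covered by a match
predicted = np.zeros(n, dtype=bool)
for s in matches:
    predicted[s:s + m] = True

print(matches)
```

The exclusion zone prevents the ranking from returning several windows that overlap the same occurrence, since neighbours of a good match tend to have low distances as well.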

Summary

The distance profile is essential in time series data mining, facilitating tasks like similarity search, anomaly detection, and motif discovery. It calculates the distance between a query subsequence and all other subsequences within a time series, forming the basis for advanced analytical tasks.

We used MASS (Mueen's Algorithm for Similarity Search) for its efficiency and scalability, which are crucial for handling large real-world datasets. The process involves:

Z-normalization of the query and of each subsequence of the time series to remove differences in scale and offset.
Euclidean distance computation for each subsequence against the query.
Visualization through plots to graphically depict the comparison across the time series.

Our implementation supports both one-dimensional and multidimensional data. For multidimensional data, the per-dimension distance profiles are summed, and the subsequences with the lowest aggregated distance are used as predictions of where a pattern occurs.
