Distances and Norms#

masspcf provides GPU-accelerated computation of pairwise distance matrices and norms for collections of piecewise constant functions. These are key building blocks for downstream machine learning tasks such as clustering, classification, and dimensionality reduction.

Mathematical background#

Given two PCFs \(f\) and \(g\), the \(L_p\) distance between them is

\[d_p(f, g) = \left( \int_0^\infty |f(t) - g(t)|^p \, dt \right)^{1/p}.\]

The \(L_p\) norm of a single PCF \(f\) is

\[\| f \|_p = \left( \int_0^\infty |f(t)|^p \, dt \right)^{1/p}.\]

For piecewise constant functions, these integrals reduce to finite sums over the breakpoint intervals, which masspcf evaluates exactly.

Distance between two PCFs#

lp_distance() computes the \(L_p\) distance between two individual PCFs, returning a scalar.

import masspcf as mpcf

f = mpcf.Pcf([[0.0, 3.0], [1.0, 0.0]])
g = mpcf.Pcf([[0.0, 1.0], [2.0, 0.0]])

d = mpcf.lp_distance(f, g)        # L1 distance (default)
d2 = mpcf.lp_distance(f, g, p=2)  # L2 distance

This is the simplest distance API and is useful for quick comparisons or unit testing. For computing distances between all pairs in a collection, use pdist() or cdist() instead.

Pairwise distances#

pdist() computes the full pairwise \(L_p\) distance matrix for a 1-D tensor of PCFs. The result is a DistanceMatrix — a compressed symmetric matrix where entry \((i, j)\) is the \(L_p\) distance between the \(i\)-th and \(j\)-th PCF.

Basic usage#

import masspcf as mpcf
from masspcf.random import noisy_sin

X = noisy_sin((50,), n_points=100)

# L1 distance matrix (default)
D = mpcf.pdist(X)
# D.size is 50, D[i, j] gives the distance between X[i] and X[j]

The p parameter controls which \(L_p\) distance is computed:

D1 = mpcf.pdist(X, p=1)   # L1 distance
D2 = mpcf.pdist(X, p=2)   # L2 distance

Progress output#

By default, progress output is suppressed. To show a progress bar during computation, pass verbose=True:

D = mpcf.pdist(X, verbose=True)

Input requirements#

pdist requires a 1-D PCF tensor. If your data is in a higher-dimensional tensor, slice out the dimension you want first:

A = mpcf.zeros((2, 50))
# ... fill A ...

# Distances between the 50 PCFs in the first row
D = mpcf.pdist(A[0, :])

GPU acceleration#

For large collections of PCFs, pdist automatically offloads the computation to the GPU when one is available. The number of pairwise integrals grows as \(n(n-1)/2\), so GPU acceleration can provide dramatic speedups for large \(n\).

Cross-distances#

cdist() computes the pairwise \(L_p\) distances between two tensors of PCFs. Unlike pdist, the two input tensors can differ in shape and size. The result is a FloatTensor whose shape is the concatenation of the two input shapes.

import masspcf as mpcf
from masspcf.random import noisy_sin, noisy_cos

X = noisy_sin((30,), n_points=100)
Y = noisy_cos((20,), n_points=100)

D = mpcf.cdist(X, Y)          # shape (30, 20)
D2 = mpcf.cdist(X, Y, p=2)   # L2 cross-distances

Higher-dimensional tensors are supported – the output shape is (*X.shape, *Y.shape):

A = noisy_sin((5, 10), n_points=50)
B = noisy_cos((8,), n_points=50)

D = mpcf.cdist(A, B)   # shape (5, 10, 8)

L2 kernel matrices#

l2_kernel() computes the pairwise \(L_2\) inner-product (kernel) matrix for a 1-D tensor of PCFs. The result is a SymmetricMatrix where entry \((i, j)\) is

\[K_{ij} = \langle f_i, f_j \rangle_{L_2} = \int_0^\infty f_i(t) \, f_j(t) \, dt.\]
K = mpcf.l2_kernel(X)
# K[i, j] gives the L2 inner product between X[i] and X[j]

The kernel matrix is symmetric and includes diagonal entries (self inner products). To use it with scikit-learn, convert to a dense NumPy array:

K_dense = mpcf.l2_kernel(X, verbose=False).to_dense()

Distance matrices#

DistanceMatrix provides a compressed storage format for distance matrices. Since a distance matrix is symmetric with zeros on the diagonal, it stores only the strict lower triangle — \(n(n-1)/2\) elements instead of \(n^2\). Entries are enforced to be nonnegative, and writes to the diagonal are rejected unless the value is zero.

from masspcf import DistanceMatrix
from masspcf.typing import float32

m = DistanceMatrix(100, dtype=float32)
m[3, 7] = 2.5
assert m[7, 3] == 2.5   # symmetric access
assert m[3, 3] == 0.0   # diagonal is always zero

To convert to a full NumPy array:

dense = m.to_dense()   # shape (100, 100), dtype float32

Tensors of distance matrices#

Distance matrices can be stored in tensors just like PCFs or point clouds. Use the distmat32 or distmat64 dtypes:

import masspcf as mpcf

# A 1-D tensor holding 10 distance matrices
T = mpcf.zeros((10,), dtype=mpcf.distmat64)

# Assign a matrix into the tensor
m = mpcf.DistanceMatrix(5, dtype=mpcf.float64)
m[0, 1] = 3.14
T[0] = m

# Read it back — symmetric access works as expected
T[0][0, 1]   # 3.14
T[0][1, 0]   # 3.14

Slicing, copying, flattening, and equality comparison all work as for other tensor types:

sub = T[2:5]           # slice — shape (3,)
T2 = T.copy()          # independent deep copy
flat = T.flatten()     # shape (10,)
T == T2                # True

Symmetric matrices#

SymmetricMatrix provides a more general compressed storage format for symmetric matrices without the distance matrix constraints. It stores the lower triangle including the diagonal — \(n(n+1)/2\) elements instead of \(n^2\).

from masspcf import SymmetricMatrix
from masspcf.typing import float32

m = SymmetricMatrix(100, dtype=float32)
m[3, 7] = 2.5
assert m[7, 3] == 2.5   # symmetric access

To convert to a full NumPy array:

dense = m.to_dense()   # shape (100, 100), dtype float32

Tensors of symmetric matrices use the symmat32 or symmat64 dtypes:

T = mpcf.zeros((10,), dtype=mpcf.symmat64)

Norms#

lp_norm() computes the \(L_p\) norm of every PCF in a tensor, returning a NumPy array of the same shape.

import masspcf as mpcf
from masspcf.random import noisy_sin

X = noisy_sin((50,), n_points=100)

# L1 norms of all 50 PCFs
norms = mpcf.lp_norm(X, p=1)
# norms.shape is (50,)

For higher-dimensional tensors, the output shape matches the input:

A = mpcf.zeros((3, 10))
# ... fill A ...

norms = mpcf.lp_norm(A, p=1)
# norms.shape is (3, 10)

Using distances in machine learning#

pdist returns a DistanceMatrix. Call to_dense() to obtain a standard NumPy array for use with scikit-learn and other libraries.

Clustering#

from sklearn.cluster import AgglomerativeClustering

D = mpcf.pdist(X, verbose=False).to_dense()

clustering = AgglomerativeClustering(
    n_clusters=3,
    metric='precomputed',
    linkage='average'
)
labels = clustering.fit_predict(D)

Multidimensional scaling#

from sklearn.manifold import MDS

D = mpcf.pdist(X, verbose=False).to_dense()

mds = MDS(n_components=2, dissimilarity='precomputed')
coords = mds.fit_transform(D)

Classification with distance-based kernels#

A common approach is to convert a distance matrix into a kernel matrix and use it with a kernel SVM. For example, using a Gaussian (RBF) kernel:

\[K_{ij} = \exp\!\left(-\frac{D_{ij}^2}{2\sigma^2}\right)\]
import numpy as np
from sklearn.svm import SVC

D = mpcf.pdist(X, verbose=False).to_dense()
sigma = np.median(D[D > 0])
K = np.exp(-D**2 / (2 * sigma**2))

clf = SVC(kernel='precomputed')
clf.fit(K, labels)