Workshop: Computational Aspects of Learning and Processing Metrical Data

Part of the Computational Geometry Week (CG Week) 2020

Date: June 26th, 2020
Location: On-line
Organizer: Anastasios Sidiropoulos, University of Illinois at Chicago.


Summary

The analysis of metrical data sets is a central motivational task for computational geometry. The success of many data processing methods depends on the metrical representation chosen for the input data. Within computational geometry, an important research thrust for computing metrical representations of data sets is the focus of the metric embedding community. In this context, the overarching goal is to design efficiently computable mappings into some host metric space that preserve the geometry of the input, or that encode a non-metrical object as a metric space with desirable properties. A different research thrust comes from the machine learning community, and more specifically the area of metric learning. There, the goal is to recover a ground truth that is encoded as a metric space from partial proximity information. This workshop aims to bring together researchers from these different disciplines and foster the exchange of research problems and methodologies.


Participants

Arnold Filtser, Columbia University.

Amaury Habrard, University Jean Monnet of Saint-Etienne.

Brian Kulis, Boston University.

Jelani Nelson, UC Berkeley.

Ben Raichel, UT Dallas.

Ilya Razenshteyn, Microsoft Research Redmond.

Yusu Wang, The Ohio State University.


Schedule

16:30 - 17:10 CET Brian Kulis, Boston University.
Title:
Metric Learning: Overview and New Directions
Abstract:
Metric learning is a supervised machine learning problem concerned with learning a task-specific distance function from supervised data. It has found numerous applications in problems such as similarity search, clustering, and ranking. Much of the foundational work in this area focused on the class of so-called Mahalanobis metrics, which may be viewed as Euclidean distances after linear transformations of the data. This talk will describe two recent directions in metric learning: deep metric learning and divergence learning. The first replaces the linear transformations with the output of a neural network, while the second considers a broader class of distances than Mahalanobis metrics. I will discuss some of my recent work along both of these fronts, as well as ongoing attempts to combine these approaches using a novel framework called deep divergences.
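
As a quick illustration of the Mahalanobis viewpoint (a minimal NumPy sketch, not code from the talk): a Mahalanobis metric parameterized by a PSD matrix M = L^T L is exactly a Euclidean distance after applying the linear map L.

    import numpy as np

    # Mahalanobis distance: d_M(x, y)^2 = (x - y)^T M (x - y) with M = L^T L,
    # which equals the Euclidean distance after the linear map x -> L x.
    rng = np.random.default_rng(0)
    d = 5
    L = rng.standard_normal((3, d))   # stand-in for a learned linear map
    M = L.T @ L                       # the induced PSD matrix

    x, y = rng.standard_normal(d), rng.standard_normal(d)
    mahalanobis = np.sqrt((x - y) @ M @ (x - y))
    euclidean_after_map = np.linalg.norm(L @ x - L @ y)
    assert np.isclose(mahalanobis, euclidean_after_map)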
 
17:10 - 17:35 CET Ilya Razenshteyn, Microsoft Research Redmond.
Title:
Scalable Nearest Neighbor Search for Optimal Transport
Abstract:
The Optimal Transport (a.k.a. Wasserstein) distance is an increasingly popular similarity measure for structured data domains, such as images or text documents. This raises the need for fast nearest neighbor search with respect to this distance, a problem that poses a substantial computational bottleneck for various tasks on massive datasets. In this talk, I will discuss fast tree-based approximation algorithms for searching nearest neighbors with respect to the Wasserstein-1 distance. I will start by describing a standard tree-based technique, known as QuadTree, which has previously been shown to obtain good results. Then I'll introduce a variant of this algorithm, called FlowTree, and show that it achieves better accuracy, both in theory and in practice. In particular, the accuracy of FlowTree is in line with previous high-accuracy methods, while its running time is much faster.
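
To make the quadtree technique concrete, here is a minimal sketch of the standard quadtree embedding (my illustration, assuming point masses in [0, 1)^2; this is not the FlowTree code from the talk): each distribution is mapped to a sparse vector with one coordinate per occupied grid cell, weighted by the cell's side length, and the l1 distance between two such vectors approximates Wasserstein-1.

    import numpy as np

    def quadtree_embed(points, weights, levels=6, shift=0.0):
        # One coordinate per occupied cell of a (shifted) hierarchical grid,
        # scaled by the cell side length; finer levels contribute less.
        emb = {}
        for level in range(levels):
            side = 2.0 ** (-level)
            for p, w in zip(points, weights):
                cell = (level, tuple(((p + shift) // side).astype(int)))
                emb[cell] = emb.get(cell, 0.0) + w * side
        return emb

    def l1_distance(e1, e2):
        # l1 distance between two sparse embeddings approximates W_1.
        return sum(abs(e1.get(k, 0.0) - e2.get(k, 0.0))
                   for k in set(e1) | set(e2))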

The talk is based on a joint work with Arturs Backurs, Yihe Dong, Piotr Indyk and Tal Wagner.
 
17:35 - 18:00 CET Yusu Wang, The Ohio State University.
Title:
Recovering metric structure behind perturbed networks
Abstract:
Graphs and network data are ubiquitous across a wide range of scientific and application domains. Often in practice, an input graph can be considered a noisy observed snapshot of a hidden true graph (which could be the 1-skeleton of a hidden domain). In this talk, I will advocate the following network model: We assume that there is a true graph $G^*$ which is a certain proximity graph for points sampled from a hidden domain $X$, and thus captures the geometry of $X$. The observed graph $G$ is an Erdos-Renyi type perturbed version of $G^*$; in particular, edges not in $G^*$ can be added with probability $p$, while edges in $G^*$ can be deleted with probability $q$. We call such an observed graph an ER-perturbed RGG (random geometric graph). We are interested both in recovering the original shortest-path metric of $G^*$ from the noisy observed graph $G$, and in understanding properties of such ER-perturbed RGGs. The talk will focus on the metric recovery problem. We will show how the Jaccard index and the edge-clique number can each be used to recover the shortest-path metric of $G^*$ within a constant factor, but with different pros and cons.
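
A minimal sketch of the Jaccard-based filtering idea (my illustration, using networkx; the threshold tau is a hypothetical parameter that would have to be tuned to the perturbation probabilities p and q):

    import networkx as nx

    def jaccard_filter(G, tau):
        # Keep an edge (u, v) only if the neighborhoods of u and v overlap
        # substantially. Randomly inserted "long" edges tend to have nearly
        # disjoint neighborhoods and are pruned, so shortest paths in the
        # filtered graph approximate the metric of the true graph G*.
        H = nx.Graph()
        H.add_nodes_from(G.nodes)
        for u, v in G.edges:
            nu, nv = set(G[u]), set(G[v])
            if len(nu & nv) / len(nu | nv) >= tau:
                H.add_edge(u, v)
        return H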

This talk is based on joint work with M. Kahle, S. Parthasarathy, D. Sivakoff, and M. Tian.
 
18:00 - 18:25 CET Amaury Habrard, University Jean Monnet of Saint-Etienne.
Title:
Controlling the space projection in Mahalanobis distance learning with respect to specific instances
Abstract:
Mahalanobis distance learning is one of the best-known frameworks for metric learning. It can be interpreted as finding a good projection of the data in which Euclidean distances can be used to perform classification or clustering tasks efficiently. In this talk, I will show that this space can be controlled by applying regression constraints over specific chosen instances. I will also discuss some recent work on designing/learning good metrics in the context of imbalanced data classification, where the idea is to adapt the metric to the type of instances a query point is compared with.
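
As a small illustration of the projection view (a hedged sketch, not the talk's method): given a learned PSD matrix M, one can recover a linear map L with M = L^T L, so that the learned distance is an ordinary Euclidean distance in the projected space.

    import numpy as np

    def projection_from_metric(M):
        # Factor a PSD matrix as M = L^T L via an eigendecomposition;
        # in the space x -> L x, the learned Mahalanobis distance
        # becomes the plain Euclidean distance.
        vals, vecs = np.linalg.eigh(M)
        vals = np.clip(vals, 0.0, None)   # guard against tiny negative eigenvalues
        return np.sqrt(vals)[:, None] * vecs.T

    M = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
    L = projection_from_metric(M)
    assert np.allclose(L.T @ L, M)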
 
18:25 - 18:50 CET Jelani Nelson, UC Berkeley.
Title:
Optimal terminal dimensionality reduction in Euclidean space
Abstract:
The Johnson-Lindenstrauss lemma states that for any $X\subset \mathbb{R}^d$ with $|X| = n$ and for any $\epsilon \in (0, 1)$, there exists a map $f:X\to \mathbb{R}^m$ for $m = O(\log n / \epsilon^2)$ such that: for all $x \in X$, for all $y \in X$, $(1-\epsilon)\|x - y\|_2 \leq \|f(x) - f(y)\|_2 \leq (1+\epsilon)\|x - y\|_2$. We show that this statement can be strengthened. In particular, the above claim holds true even if "for all $y \in X$" is replaced with "for all $y \in \mathbb{R}^d$".
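
For intuition, the classical lemma is easy to check empirically with a random Gaussian map (one standard construction; the terminal version in the talk requires a more careful, non-oblivious embedding). A minimal sketch:

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, eps = 100, 1000, 0.5
    m = int(np.ceil(8 * np.log(n) / eps ** 2))    # m = O(log n / eps^2)
    X = rng.standard_normal((n, d))
    A = rng.standard_normal((m, d)) / np.sqrt(m)  # f(x) = A x

    ratios = [np.linalg.norm(A @ (X[i] - X[j])) / np.linalg.norm(X[i] - X[j])
              for i in range(n) for j in range(i + 1, n)]
    print(min(ratios), max(ratios))  # typically within [1 - eps, 1 + eps]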

Joint work with Shyam Narayanan.
 
18:50 - 19:15 CET Ben Raichel, UT Dallas.
Title:
Generalized Metric Repair
Abstract:
Many modern data analysis algorithms either assume that the distances between data points satisfy a metric, or are considerably more efficient when they do. However, as real data sets are noisy, they often do not possess this fundamental property. To address this issue, here we consider the generalized metric repair problem (GMR). Given a positively weighted connected graph $G$, call a cycle in $G$ broken if it contains an edge whose weight is greater than the sum of the weights of the other edges in the cycle. In GMR the goal is to modify as few edge weights as possible so that no broken cycles remain. We argue that any hitting set of the broken cycles determines a solution, leading to both hardness and approximation results. Specifically, we give 1) an approximation-preserving reduction from MULTICUT, 2) for any constant $c$, a fixed-parameter tractable algorithm for $c$-chordal graphs, and 3) approximations parameterized on measures of how badly the cycles are broken. Moreover, when the graph $G$ is a complete graph, we give an $O(OPT^{1/3})$ approximation, where $OPT$ is the optimal solution size.
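
On complete graphs it suffices to look at 3-cycles: if every triangle satisfies the triangle inequality, then by repeated shortcutting no longer cycle can be broken. A small sketch (my illustration) that lists the broken triangles, i.e. the violations that any repair must hit:

    import itertools

    def broken_triangles(w):
        # w maps an edge (i, j) with i < j to its weight, for a complete graph.
        # A 3-cycle is broken iff its heaviest edge outweighs the other two.
        nodes = sorted({v for edge in w for v in edge})
        bad = []
        for a, b, c in itertools.combinations(nodes, 3):
            e = [w[(a, b)], w[(b, c)], w[(a, c)]]
            if max(e) > sum(e) - max(e):
                bad.append((a, b, c))
        return bad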
 
19:15 - 19:40 CET Arnold Filtser, Columbia University.
Title:
On metric embeddings, shortest path decompositions and face cover of planar graphs
Abstract:
We will discuss the problem of low-distortion embeddings of shortest-path metrics of weighted graphs into $\ell_1$. Shortest path decomposition (SPD) is a hierarchical partition of a graph using shortest paths. Every (weighted) path graph has an SPD of depth $1$. A graph $G$ has an SPD of depth $k$ if, after removing some shortest path $P$, every connected component of $G\setminus P$ has an SPD of depth $k-1$. We present a novel metric embedding technique based on SPD. We will show that any graph with an SPD of depth $k$ can be embedded into $\ell_1$ with distortion $O(\sqrt{k})$. Afterwards, we will show how to use this result to embed a planar graph with $\gamma$ faces into $\ell_1$ with distortion $O(\sqrt{\log \gamma})$. We will also discuss implications for the sparsest cut problem.
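
A greedy upper bound on the SPD depth is straightforward to sketch (my illustration, for unweighted graphs via networkx; the true SPD depth minimizes over all choices of removed paths, whereas this heuristic just picks one): remove a shortest path between a far-apart pair of vertices and recurse on the remaining components.

    import networkx as nx

    def spd_depth_upper_bound(G):
        # Depth 0 for the empty graph; otherwise remove one shortest path
        # per component and recurse, giving an upper bound on the SPD depth.
        if G.number_of_nodes() == 0:
            return 0
        depths = []
        for comp in nx.connected_components(G):
            H = G.subgraph(comp).copy()
            # Double BFS sweep: a standard heuristic for a far-apart pair.
            dist = nx.single_source_shortest_path_length(H, next(iter(H)))
            u = max(dist, key=dist.get)
            dist = nx.single_source_shortest_path_length(H, u)
            v = max(dist, key=dist.get)
            H.remove_nodes_from(nx.shortest_path(H, u, v))
            depths.append(1 + spd_depth_upper_bound(H))
        return max(depths)

    assert spd_depth_upper_bound(nx.path_graph(10)) == 1   # a path has depth 1
    assert spd_depth_upper_bound(nx.cycle_graph(10)) == 2  # a cycle has depth 2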