# Imposing a bottleneck constraint to improve the performance of statistical matching

## Date

## Authors

## Journal Title

## Journal ISSN

## Volume Title

## Publisher

## Abstract

Consider an observational study where a researcher aims to assess the effect of a treatment on a response. Estimation of this effect may be complicated by the presence of confounders—variables that are correlated both with the treatment and the response. Failure to account for confounders may lead to large biases in the estimation of treatment effects. Statistical matching, in which each treated unit is paired with a control unit with similar covariate values, is a common way to account these confounders. In its most basic form, statistical matching pairs each treated unit with a control unit with “similar” values on confounding covariates. Estimates of treatment effects are then obtained by taking the average differences between these pairs. Common measures of the similarity of confounding covariates between treated and control units include the Euclidean distance, the Mahalanobis distance, and the absolute difference in propensity scores. Most matching methods aim to minimize the total dissimilarity between treated and matched control units. However, this may open the door for a matching with significant imbalances on certain covariates within a small proportion of units. Hence, a researcher may also desire to limit the maximum imbalance possible within a matched pair. Additionally, when matching is performed without replacement—that is, when each control unit can be used in a match at most one time—many existing algorithms for finding an optimal matching scale poorly to large datasets. In this dissertation, we formulate statistical matching as a graph theory problem, where nodes are units, edges are drawn between potential matches, and each edge has a weight measuring the dissimilarity between the corresponding match. We argue that the computational complexity of a problem can be substantially reduced by sparsening the original matching graph—that is, by removing edges until the number of edges is proportional to the number of vertices. Further, we outline how algorithms can be constructed for efficiently solving matching problems on these constrained sparse graphs. Further, we show that matching with a constraint on the maximum imbalance on a covariate is equivalent to finding a match within a bottleneck subgraph, in which edges in the graph are drawn if and only if their weight is less than some pre-specified threshold. Thus, the addition of this bottleneck constraint serves two purposes. First, it prevents imbalance on covariates within matches, thereby reducing potential bias in treatment effect estimates. Second, because the matching subgraph is sparse, optimally matching within this subgraph can substantially decrease computational complexity. We apply our method to a study on the efficacy of right heart catheterization. Results indicate that implementing bottleneck constraints can efficiently obtain a statistical matching. The matching successfully eliminate poor matches from the sample and may yield strong overall balance on confounding covariates. Finally, we propose a new algorithm for the minimum cost-optimal pair matching problem by using negative cycles and negative alternating paths of even length (NAPELs). This approach allows for solving the minimum cost matching problem without having to introduce additional structure to the matching graph. We prove that our algorithm will find a minimum cost matching. Additionally, we give simulation results that confirm the optimality of our algorithm, and suggest methods for improving its runtime.