Another busy month in property testing. We have nine papers, covering a range of topics: distribution testing, pattern-freeness testing, and even LLMs! Let's start the journey.
Proving Natural Distribution Properties is Harder than Testing Them by Tal Herman and Guy Rothblum (ECCC). This paper is on interactive proof systems for distribution testing. Suppose an untrusted prover claims to have discovered some property of a distribution (\mathcal{D}). The verifier can sample from (\mathcal{D}) and communicate with the prover, in the hope of verifying this property more efficiently than running a property tester directly. There is a rich line of work showing that many non-trivial properties can be verified much faster than testing directly. But in all these results, the honest prover learns the entire distribution and thus has sample complexity (\Omega(N)) (where (N) is the domain size). This paper asks whether doubly-sublinear non-trivial protocols exist, wherein the honest prover has sample complexity (o(N)) and the verifier is more efficient than directly testing. The main result is hardness: for many natural properties, if the honest prover has (o(N)) sample complexity, then the verifier complexity is the same as the testing complexity. This says that proving is (provably!) more difficult than directly testing. The main technical work is in showing such a result for estimating collision statistics of the distribution. Collision statistics form the basis of most label-invariant distribution testing algorithms, leading to the various results in the paper.
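To make "collision statistics" concrete, here is a minimal sketch (in Python; the function name is ours) of the standard unbiased estimator for the collision probability (\sum_x \mathcal{D}(x)^2) from i.i.d. samples. Statistics of this flavor underlie most label-invariant testers.

```python
from collections import Counter

def collision_probability_estimate(samples):
    """Unbiased estimate of sum_x D(x)^2: the fraction of the
    C(n, 2) pairs of samples that collide (take the same value)."""
    n = len(samples)
    counts = Counter(samples)
    colliding_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return colliding_pairs / (n * (n - 1) / 2)

# For the uniform distribution on a domain of size k, the true value
# is 1/k; deviations from 1/k are evidence of non-uniformity.
```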
Testing forbidden order-pattern properties on hypergrids by Harish Chandramouleeswaran, Ilan Newman, Tomer Pelleg, and Nithin Varma (arXiv). Monotonicity is arguably one of the most fundamental properties studied in property testing. It is a special case of "forbidden order" properties. Consider a function (f:[n]^d \to \mathbb{R}) and a permutation (\pi) of ([k]), for some constant (k). The function is "(\pi)-free" if there do not exist (x_1 \prec x_2 \prec \ldots \prec x_k) such that the values (f(x_1), f(x_2), \ldots, f(x_k)) appear in the relative order given by (\pi). (We use (\prec) for the standard coordinate-wise partial order.) So monotonicity is equivalent to ((2,1))-freeness (or ((1,2))-freeness, depending on ascending/descending order), since it suffices to consider the order of pairs of domain points. Even over the line ((d=1)), forbidden order properties exhibit a rich theory. This paper initiates the study of forbidden order freeness for the (d=2) case. For (\pi = (1,2,3))-freeness, the paper gives a (poly(\log n)) time property tester. But curiously, any one-sided tester for (\pi=(1,3,2))-freeness requires (\Omega(\sqrt{n})) queries. There is an interesting adaptivity gap: there is an adaptive (n^{4/5})-query tester for any pattern (\pi) of length (k=3), but a corresponding non-adaptive (one-sided) tester requires (\Omega(n)) queries.
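To unpack the definition on the line ((d=1)), here is a hedged, brute-force (O(n^k)) checker for (\pi)-occurrence (illustrative only; actual testers query far fewer points, and the names here are ours). It assumes distinct function values.

```python
from itertools import combinations

def pattern_of(vals):
    """Rank sequence of a tuple with distinct entries:
    pattern_of([10, 30, 20]) == (1, 3, 2)."""
    return tuple(sorted(vals).index(v) + 1 for v in vals)

def has_pattern(values, pi):
    """True iff some indices i_1 < ... < i_k realize the order pattern pi.
    Brute force over all k-subsets, so O(n^k) -- for illustration only."""
    k = len(pi)
    return any(pattern_of([values[i] for i in idxs]) == tuple(pi)
               for idxs in combinations(range(len(values)), k))

# A sequence is monotone (non-decreasing) exactly when it is (2,1)-free:
assert not has_pattern([1, 2, 5, 9], (2, 1))
assert has_pattern([1, 5, 3], (1, 3, 2))
```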
Relative-error unateness testing by Xi Chen, Diptaksho Palit, Kabir Peshawaria, William Pires, Rocco A. Servedio, and Yiding Zhang (arXiv). Another variant of monotonicity testing. A function (f:\{0,1\}^n \to \{0,1\}) is unate if it is either non-decreasing or non-increasing along every dimension. (A monotone function must be non-decreasing in every dimension.) Interestingly, depending on the setting, unateness testing can be harder than, or have the same complexity as, monotonicity testing. The paper investigates the recent "relative-error model". In the standard setting, the distance between two functions (f, g) is the (normalized) Hamming distance (\|f - g\|_0/2^n). In the relative-error setting, we measure the distance via the symmetric difference between the "ones", so it is (|f^{-1}(1) \Delta g^{-1}(1)|/|f^{-1}(1)|). This distinction is analogous to dense vs sparse graph property testing. This paper shows that the complexity of relative-error unateness testing is (\widetilde{\Theta}(\varepsilon^{-1} \log N)), where (N = |f^{-1}(1)|). The tester is non-adaptive, while the lower bound holds even for adaptive algorithms. This matches the bound recently obtained for monotonicity in this model.
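A small sanity-check sketch of the two distance notions, for functions given as Python predicates on (\{0,1\}^n) (our names; brute force over the cube, so only for tiny (n)):

```python
def hamming_distance(f, g, n):
    """Standard normalized distance: |{x : f(x) != g(x)}| / 2^n."""
    return sum(f(x) != g(x) for x in range(2 ** n)) / 2 ** n

def relative_error_distance(f, g, n):
    """Relative-error distance: |f^{-1}(1) symdiff g^{-1}(1)| / |f^{-1}(1)|."""
    ones_f = {x for x in range(2 ** n) if f(x)}
    ones_g = {x for x in range(2 ** n) if g(x)}
    return len(ones_f ^ ones_g) / len(ones_f)
```

For a sparse (f) (few ones), a tiny Hamming distance can be a huge relative distance, which is why this model, like sparse graph testing, demands testers whose complexity scales with (N = |f^{-1}(1)|) rather than (2^n).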
Uniformity Testing under User-Level Local Privacy by Clément L. Canonne, Abigail Gentle, and Vikrant Singhal (arXiv). Let us revisit the standard uniformity testing problem, now with user-level privacy. There are (n) users, each of whom holds (m) samples from a distribution (\mathcal{D}). The domain size is denoted (k). These users communicate with a central server, whose aim is to test (\mathcal{D}) for uniformity (or equality, or whatever property you care for). This is a practically motivated problem where each user is a phone that does not want to share its private data, but the server wishes to perform some learning on all the data. If each user holds one sample, this is the classic "local differential privacy" (local DP) setting: we want the algorithm/tester output to be nearly identical if a single user changes their data. It is known that, for public coin protocols, such a tester requires (\Theta(k)) samples in total (in contrast with the (\Theta(\sqrt{k})) samples for the standard, non-private tester). In this paper's setting, the protocol must be differentially private with respect to changes in any single user's entire data. This is much stronger than item-level privacy, since an individual sample is a much smaller fraction of a user's data. If each user held more than (\sqrt{k}) samples, then each user could simply run a distribution tester locally and communicate a single private bit to the server (see the sketch below). The interesting case is when (m \ll \sqrt{k}). This paper gives a natural tradeoff between (m) and (n) that generalizes the local DP guarantee: roughly, when the total sample count (mn) is at least (k), uniformity testing with user-level privacy is possible. Different results hold for private coin protocols, where local DP is harder.
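To see why (m \gg \sqrt{k}) trivializes the problem, here is a hedged sketch of that observation: each user tests their own samples (here via the collision statistic from above) and releases the one-bit verdict through randomized response, a textbook (\varepsilon)-DP mechanism. Since the released bit is a function of the user's entire sample, privatizing it gives user-level privacy. Names and the threshold are illustrative.

```python
import math
import random
from collections import Counter

def local_uniformity_bit(samples, k):
    """One user's local verdict: 1 if the empirical collision rate
    is close to the uniform value 1/k (threshold is illustrative)."""
    n = len(samples)
    collisions = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    return int(collisions / (n * (n - 1) / 2) <= 1.5 / k)

def randomized_response(bit, epsilon):
    """Keep the bit with probability e^eps / (1 + e^eps), else flip it.
    This is epsilon-DP for the user's whole data: user-level privacy."""
    keep = math.exp(epsilon) / (1 + math.exp(epsilon))
    return bit if random.random() < keep else 1 - bit

# The server collects the n noisy bits, debiases for the known flip
# probability, and accepts if a (corrected) majority voted "uniform".
```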
Non-iid hypothesis testing: from classical to quantum by Giacomo De Palma, Marco Fanizza, Connor Mowry, and Ryan O'Donnell (arXiv). This paper studies the "non-iid" distribution hypothesis/equality testing problem. Suppose there are (T) distributions (p_1, p_2, \ldots, p_T) over the domain ([d]). The aim is to test whether the average distribution (\sum_i p_i/T) equals a known distribution (q). We are only allowed a few samples from each distribution. A recent result proved that with just (2) samples from each distribution, equality testing is possible when (T \gg \sqrt{d}) (ignoring (\varepsilon) factors); on the other hand, it is impossible with just a single sample from each distribution. This paper studies the quantum analogue of this problem. But first, it actually improves the (\varepsilon) dependence of the classical bound, matching the optimal (\sqrt{d}/\varepsilon^2) bound. In the quantum setting, each "distribution" is a quantum state, represented as a (d \times d) matrix; a "sample" of the state is, roughly, the outcome of a measurement (a projection onto eigenvectors). It is known that (O(d)) samples suffice to test whether a quantum state is "maximally mixed" (the quantum analogue of the uniform distribution, with density matrix proportional to the identity). This paper shows that, in the non-iid setting, even a single sample from each state suffices for equality testing. This is in contrast with the classical setting, where (2) samples per distribution are required.
Near-Optimal Property Testers for Pattern Matching by Ce Jin and Tomasz Kociumaka (arXiv). This is a result on sublinear algorithms for string matching. Consider a pattern string (P) of length (m) and a text (T) of length (n), where both are part of the input. Our aim is to determine whether (P) is a substring of (T), or whether (P) has Hamming distance at least (k = \varepsilon m) from every length-(m) substring of (T). Following the convention in this literature, the parameter (k) (and not (\varepsilon)) is used. There is a simple sampling algorithm that makes (\widetilde{O}(\sqrt{nm/k} + n/k)) queries, but has linear running time. The main question is to get an algorithm whose running time also matches this bound. Previous work gave an (\widetilde{O}((n^2m/k)^{1/3} + n/k)) time procedure. The main result is a non-adaptive algorithm whose running time matches the (\widetilde{O}(\sqrt{nm/k} + n/k)) query bound. Curiously, one can get an adaptive algorithm that improves the dependence on (n) to (n-m), so it is faster when the pattern and text are roughly of the same length. These upper bounds are matched by lower bounds, establishing an adaptivity gap in this regime. The paper thus gives the complete picture of the time complexity for all ranges of (n-m).
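The "simple sampling algorithm" above is already nontrivial; as a purely illustrative baseline (not the paper's algorithm, and with names of our choosing), here is the naive one-sided tester that samples roughly (m/k) positions per alignment. Its query count is (\widetilde{O}(nm/k)), far above (\widetilde{O}(\sqrt{nm/k} + n/k)), which hints at why the real algorithms must share work across alignments.

```python
import math
import random

def far_from_every_substring(T, P, k, delta=0.01):
    """One-sided tester: never rejects if P occurs exactly in T; if every
    alignment has Hamming distance >= k, rejects (returns True) with
    probability >= 1 - delta. Samples ~ (m/k) * log((n-m+1)/delta)
    positions per alignment, so ~ nm/k queries in total."""
    n, m = len(T), len(P)
    s = math.ceil((m / k) * math.log((n - m + 1) / delta))
    for off in range(n - m + 1):
        positions = random.choices(range(m), k=s)
        if all(T[off + j] == P[j] for j in positions):
            return False  # no mismatch sampled: this could be a match
    return True
```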
Computational Complexity in Property Testing by Renato Ferreira Pinto Jr., Diptaksho Palit, and Sofya Raskhodnikova (arXiv). A generalization of the spirit of the previous paper. The paper asks the fundamental question of how query complexity and time complexity are related in property testing. A property tester is a ((q(n), t(n)))-tester if it makes (q(n)) queries and has running time (t(n)). The paper provides a precise formalization of running time in the RAM model for a sublinear algorithm. The main result is a hierarchy theorem akin to the classic time hierarchy theorem: assuming SETH, for any (reasonable) pair ((q(n), t(n))), there is a property with a ((q(n), t(n)))-tester but no ((q'(n), t'(n)))-tester, where (q'(n) \ll q(n)) and (t'(n) \ll t(n)). (There is also an unconditional hierarchy theorem, with much stronger conditions on (t'(n)).) The gaps between query complexity and running time are explored for the classic problem of halfspace testing. For the distribution-free distance approximation problem, there are (O(d/\varepsilon^2))-query testers, but they run in time exponential in (d). Assuming the fine-grained hardness of (k)-SUM, the paper proves that the running time of any tester must have such an exponential dependence.
Sublinear Algorithms for Estimating Single-Linkage Clustering Costs by Pan Peng, Christian Sohler, and Yi Xu (arXiv). Single-linkage clustering (SLC) is a common practical algorithm. The input is a weighted graph, where edge weights denote distances between the objects/vertices. Given a collection of clusters, the distance between two clusters is the distance between the closest pair of vertices across them. SLC just repeatedly merges the two closest clusters, yielding a hierarchical clustering. This paper investigates SLC through the sublinear lens. One can define an objective function corresponding to SLC: for a given clustering, the cost is the sum, over clusters, of the minimum spanning tree weight of each cluster. For any (k), one can consider the optimal clustering into (k) clusters, and the aim is to approximate these costs. The "SLC profile vector" is the list of these costs over all (k). The main result is a sublinear time algorithm that approximates this profile vector. The running time is (O(\sqrt{W} d)) (ignoring (poly(\varepsilon^{-1} \log n)) factors), where (W) is the maximum weight and (d) is the maximum degree of the graph. Naturally, these results build on the classic sublinear time MST approximation algorithm. The algorithms in the paper are quite interesting, and one wonders whether they could be implemented in practice.
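For intuition about the objective, recall the standard fact that single-linkage merges mirror Kruskal's MST algorithm: the (k)-clustering SLC produces is the MST with its (k-1) heaviest edges deleted, so the whole profile can be read off the sorted MST edge weights. A minimal exact (decidedly not sublinear) sketch, using networkx and assuming a connected weighted graph:

```python
import networkx as nx

def slc_profile(G):
    """Cost of the single-linkage k-clustering for k = 1..n, where the
    cost of a clustering is the total MST weight within its clusters.
    By the Kruskal correspondence, the k-clustering is the MST minus its
    (k-1) heaviest edges, so its cost is the remaining MST weight."""
    mst = nx.minimum_spanning_tree(G)
    weights = sorted((d["weight"] for _, _, d in mst.edges(data=True)),
                     reverse=True)
    total, profile = sum(weights), []
    for k in range(1, G.number_of_nodes() + 1):
        profile.append(total)
        if k - 1 < len(weights):
            total -= weights[k - 1]  # drop the next-heaviest MST edge
    return profile
```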
Prior Makes It Possible: From Sublinear Graph Algorithms to LLM Test-Time Methods by Avrim Blum, Daniel Hsu, Cyrus Rashtchian, and Donya Saless (arXiv). This paper is a fascinating take on Retrieval Augmented Generation (RAG) for Large Language Models (LLMs). While existing LLMs can surprise us with their synthesis abilities, they often struggle to learn new context or produce factual responses. RAG is a technique wherein the LLM is given more context through an existing, curated knowledge base to improve its accuracy. The paper models this problem in terms of a knowledge graph. Suppose there is an unknown ground truth graph (G^*), which one can think of as a representation of all "facts" in an area. Most question/answer or factual retrievals can be thought of as paths in this graph. (Factual data is sometimes stored in graph format, where edges represent relationships between entities.) So our aim is to find a short (even constant length) (s)-(t) path in (G^*). There is a subgraph (G \subseteq G^*) that is known as "prior knowledge". (One could think of this as what the LLM has been trained on.) There is a retrieval mechanism that generates edges from (G^*); in property testing language, this is exactly the query model, and it provides the connection to sublinear graph algorithms. With a few queries to the retrieval mechanism, we wish to find the (s)-(t) path. The main result shows that the prior graph (G) has to be quite dense for constant-query path-finding algorithms to exist. Otherwise, there is a lower bound of (\Omega(\sqrt{n})) queries, where (n) is the number of vertices in (G^*). There are many related results on hallucinations, where spurious edges may be present in the prior.
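A hedged toy rendering of the model (all names ours, not the paper's): a breadth-first search for a short (s)-(t) path that walks prior edges of (G) for free and spends its query budget on a retrieval oracle `retrieve(v)` that returns (v)'s neighbors in (G^*). The lower bound says that unless the prior is dense, no strategy of this kind can succeed with (o(\sqrt{n})) queries.

```python
from collections import deque

def find_short_path(s, t, prior_adj, retrieve, budget, max_len=6):
    """BFS mixing free prior edges with at most `budget` oracle queries.
    prior_adj: dict vertex -> iterable of known neighbors (the prior G).
    retrieve:  oracle returning a vertex's neighbors in the truth graph G*.
    Returns an s-t path of length <= max_len, or None."""
    parents = {s: None}
    frontier = deque([(s, 0)])
    queries = 0
    while frontier:
        v, depth = frontier.popleft()
        if v == t:
            path = []
            while v is not None:
                path.append(v)
                v = parents[v]
            return path[::-1]
        if depth == max_len:
            continue
        neighbors = set(prior_adj.get(v, ()))   # free: prior knowledge
        if queries < budget:
            neighbors |= set(retrieve(v))       # paid: one G* query
            queries += 1
        for u in neighbors:
            if u not in parents:
                parents[u] = v
                frontier.append((u, depth + 1))
    return None
```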