Squarepoint Challenge, also known as Codeforces Round 1055, was the first event of last week (problems, results, top 5 on the left, analysis). This round followed a different problemsetting philosophy, so for me it was mostly about figuring out the best way to implement a solution and actually implementing it, as I got the general idea for solving each problem, including H2, fairly quickly. Unfortunately, I got bogged down with the details of F for almost a whole hour, which meant I could not really compete for the top spots.
Nachia, on the other hand, was one of the fastest on the first 8 problems and managed to figure out the implementation of H2 with almost half an hour to spare. Congratulations on the victory!
Here is that problem F that delayed me: you are given n<=250000 numbers written in sorted order. You need to process q<=250000 queries, each of which is a segment [l, r] of the numbers. For each query, you need to find the maximum number of values within the segment [l, r] that can be taken in such a way that for any three taken values, the difference between the largest and the smallest of them is more than z. The number z is the same for all queries. Can you see how to combine the standard building blocks to obtain a solution here?
The Universal Cup opened its 4th season with Stage 1: Korolyov (problems, results, top 5 on the left, analysis). The usual suspects USA1 were only the third team to solve everything, but they were much faster on the easy problems and won with much lower total penalty time. Congratulations to them and to the teams THU1, PKU1 and HoMaMaOvO who also solved 11!
In my previous summary, I mentioned two problems. The first one (E1, E2) came from Codeforces: there is a hidden array of size 2n-1, where each number between 1 and n appears twice, except one which appears only once, and you need to find out which one. To do that, you can only ask queries of the following form: choose a collection of indices in the array and a number between 1 and n, and you will be told whether this number appears at at least one of those indices.
In E1, we have n<=300 and you need to find the answer using at most 4n+2*ceil(log2n) queries. In E2, we have n=300 exactly and you need to find the answer using at most 925 queries.
Having n=300 instead of n<=300 in E2 strongly hints that the expected solution is randomized, while the 4n+2*ceil(log2n) formula in E1 most likely promises a deterministic one. So the two problems with the same story would seem almost independent in terms of their solutions.
There is a building block that is shared between the solutions, and it is somewhat natural: let us split the entire array into two parts, and ask if a given number appears in each of those parts.
It is easier to see how to make a randomized solution out of this building block: if we know that a number appears in both parts, then it cannot be the answer, as it must have two occurrences. Moreover, if we choose the parts randomly, then with probability around 1/2 the two occurrences will end up in different parts and we will exclude this number from the candidates for the answer. And this, in turn, means that after 1/(1/2)=2 attempts on average we will exclude each number that is not the answer, so after around 2n total attempts we will have only one number remaining which we can return (this is not exactly the case as we will also spend attempts on the number that is in fact the answer, so effectively the least lucky sequence gets multiplied by 2, but this change only adds a logarithmic term to the expected total). Given that each attempt takes 2 queries, we will spend around 4n queries on average, which is too much as 4*300=1200>925.
On a more careful examination, we can notice that we don’t always need 2 queries per attempt. If we got the answer that the first part does not have the number, we do not need to ask about the second part as we know it is there. So on failed attempts (that have both occurrences in one part) we will spend only 1.5 queries on average, while on successful attempts we will spend 2. The number of failed attempts is 2-1=1 per number on average, so we will spend around 3.5n queries in total, which is still a bit too much as 3.5*300=1050>925.
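To make this concrete, here is a minimal Python sketch of the randomized elimination described above. The interface is an assumption of the illustration: I pretend the interactor is available as a function ask(indices, value) that says whether value occurs at any of the given 0-based positions.

```python
import random

def find_unique_randomized(n, ask):
    """Find the value that appears only once in the hidden array of size 2n-1.
    `ask(indices, value)` is the assumed interactor: does `value` occur at any
    of the given 0-based positions?"""
    positions = list(range(2 * n - 1))
    candidates = set(range(1, n + 1))
    while len(candidates) > 1:
        v = random.choice(tuple(candidates))
        random.shuffle(positions)
        half = len(positions) // 2
        part_a, part_b = positions[:half], positions[half:]
        in_a = ask(part_a, v)
        # If v is absent from part_a, all of its occurrences are in part_b,
        # so the second query can be skipped (the 1.5-queries-per-failure saving).
        in_b = ask(part_b, v) if in_a else True
        if in_a and in_b:
            candidates.discard(v)  # v occurs in both parts, so it appears twice
    return candidates.pop()
```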
Let us put E2 aside for a moment and return to E1. How can we make a deterministic solution out of the same building block? Well, let us fix the split into two parts, and ask for all numbers whether they appear in each of the two parts. We spend 2n queries on this, and all numbers are split into three categories: appearing only in the first part, only in the second part, or in both. The key observation here is that by comparing the parity of the size of a part with the parity of the number of values that appear in both parts (and therefore have 1 occurrence in each part) we can tell whether the answer, the only number that appears in just one part and only once there, is in this part or not. This way we have reduced the problem to one of half the size, but now we also have some numbers that are known to appear only once.
We can now apply the same idea recursively to the smaller problem, with the small optimization that for numbers that are known to appear only once, we need only one query to figure out which of the two parts they belong to. Since we spend two queries for a number that appears twice or for the answer, and one query for a number that is known to appear once, the total number of queries to split a part in two is always the size of the part plus one. So the total number of queries we need is roughly 2n+n+n/2+... ~= 4n, and we have solved E1.
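A sketch of this deterministic scheme, under the same assumed ask(indices, value) interface, could look as follows; the parity bookkeeping mirrors the argument above, and each level spends roughly the size of the current part plus one queries.

```python
def find_unique_deterministic(n, ask):
    """Deterministic version under the same assumed `ask(indices, value)` oracle.
    `part` is the set of positions known to contain the answer, `unknown` are the
    values with 1 or 2 occurrences inside `part` (the answer is among them), and
    `singles` are the values known to occur exactly once inside `part`."""
    part = list(range(2 * n - 1))
    unknown = list(range(1, n + 1))
    singles = []
    while len(part) > 1:
        half = len(part) // 2
        half_a, half_b = part[:half], part[half:]
        unknown_a, unknown_b, both = [], [], []
        singles_a, singles_b = [], []
        for v in unknown:                      # two queries per unknown value
            in_a, in_b = ask(half_a, v), ask(half_b, v)
            if in_a and in_b:
                both.append(v)
            elif in_a:
                unknown_a.append(v)
            else:
                unknown_b.append(v)
        for v in singles:                      # one query per known single
            (singles_a if ask(half_a, v) else singles_b).append(v)
        # Occurrences in half_a: one per value in `both` and `singles_a`, two per
        # double in `unknown_a`, plus one if the answer sits in half_a. Comparing
        # parities therefore reveals which half contains the answer.
        if (len(half_a) - len(both) - len(singles_a)) % 2 == 1:
            part, unknown, singles = half_a, unknown_a, both + singles_a
        else:
            part, unknown, singles = half_b, unknown_b, both + singles_b
    return unknown[0]                          # only the answer remains unresolved
```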
Returning to E2, we need to notice that the first step of our solution for E1 is quite efficient: we spend 2n queries, and on average we will have only n/4 candidates left after it, so we eliminate 3n/4 candidates in 2n queries, 8/3 queries per candidate on average, and 8/3 is much less than 3.5 which was our average before. We cannot be as efficient in subsequent steps (because we also have to deal with numbers that are known to appear once), but we do not need to: applying the naive randomized solution which spends 3.5 queries per candidate for the remaining n/4 candidates yields 7n/8 queries, so the total number of queries is 2n+7n/8=23n/8, which for n=300 gives 862.5, well below the allowed amount. This solves E2.
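Purely as an illustrative sanity check (reusing the hypothetical interface and the function names from the sketches above, so it only exercises the two building blocks rather than the combined E2 strategy), one can plug the sketches into a simulated interactor and count queries: the randomized one lands around 3.5n and the deterministic one around 4n.

```python
import random

def make_oracle(n, answer, rng):
    # Simulated hidden array: `answer` appears once, every other value twice.
    arr = [answer] + [v for v in range(1, n + 1) if v != answer] * 2
    rng.shuffle(arr)
    queries = [0]
    def ask(indices, value):
        queries[0] += 1
        return any(arr[i] == value for i in indices)
    return ask, queries

rng = random.Random(2024)
n, answer = 300, 42
for solver in (find_unique_randomized, find_unique_deterministic):
    ask, queries = make_oracle(n, answer, rng)
    assert solver(n, ask) == answer
    print(solver.__name__, "used", queries[0], "queries")
```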
I find the different yet interconnected solutions for the two variants of the problem quite beautiful.
The second one came from AtCoder: you are given a tree with n<=5000 vertices. We choose a real weight ai for each vertex i from the range [-n+1,1] independently and uniformly at random. Then, for each vertex i we compute xi as the highest sum of weights in a connected subtree containing vertex i. Now we define the score of the tree as follows: if for all i we have 0<=xi<=1, then the score is equal to the product of all xi, otherwise it is 0. You need to find the expected value of the score.
First, let us look at the connected subtree B with the maximum sum of weights among all connected subtrees. All vertices in this subtree have the same value of xi=y. Let us choose one of those vertices and root the tree from that vertex. Now suppose we run the standard dynamic programming that computes for each subtree the highest sum of weights in a connected subtree containing the root of that subtree, with the very straightforward transition of dpi=ai+sum(max(0, dpj)) over all children j of i.
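In code, this is a single pass over the rooted tree; here is a minimal sketch, where the adjacency lists adj and the weights a are assumptions of the illustration (it computes the rooted dp values, not the xi themselves).

```python
def subtree_dp(adj, a, root=0):
    """dp[v] = largest total weight of a connected subtree that contains v and
    stays within v's rooted subtree: dp[v] = a[v] + sum(max(0, dp[child]))."""
    parent = [-1] * len(a)
    order = [root]
    for v in order:                  # parents are appended before their children
        for u in adj[v]:
            if u != parent[v]:
                parent[u] = v
                order.append(u)
    dp = [0.0] * len(a)
    for v in reversed(order):        # process children before parents
        dp[v] = a[v] + sum(max(0.0, dp[u]) for u in adj[v] if u != parent[v])
    return dp

# Example: a path 0-1-2 with weights 0.4, -0.3, 0.5 gives dp ≈ [0.6, 0.2, 0.5].
print(subtree_dp([[1], [0, 2], [1]], [0.4, -0.3, 0.5]))
```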
For each vertex i in our best subtree B we have the following constraints: dpi<=y (since otherwise we have a connected subtree with value more than y, which contradicts the choice of y), and dpi>=0 (since otherwise we could remove i together with its subtree from B to obtain another connected subtree with a higher value). Additionally, for the root of B we must have dpi=y.
It turns out that those conditions are not just necessary, but sufficient! If for all vertices i in B we have 0<=dpi<=y, and for the root dpi=y, then for all vertices i in B we have xi=y. We can see this by looking at how an arbitrary connected subtree intersects with B, and realizing that since dpi>=0, we can always add more vertices of B to the subtree without decreasing its value until the intersection becomes a subtree of B, and then its value is bounded by dpi<=y so we can further augment the intersection to the full B without decreasing the value.
Now suppose we choose the values of ai as we run this dynamic programming. Then no matter what the values of dpj are for vertices j not in B, for each vertex i that is in B but is not its root there will be a range of values of ai of length exactly y (out of the overall range [-n+1,1] of length n) that yields 0<=dpi<=y, and for the root there will be exactly one value of ai that yields dpi=y. This means that we can compute the probability density for xi=y in B as a function of y that is independent from what happens outside of B: it is simply y^(b-1)/n^b, where b is the number of vertices in B.
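Spelled out (f_B is just ad-hoc notation for this note), the density is a product of one factor per vertex of B:

```latex
% Each non-root vertex of B contributes a window of length y out of n for a_i,
% and a_i at the root of B must hit one exact value, with b = |B|:
f_B(y) = \left(\frac{y}{n}\right)^{b-1} \cdot \frac{1}{n} = \frac{y^{b-1}}{n^{b}}, \qquad 0 \le y \le 1.
```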
Now let us look at the connected subtree C with the highest value that contains at least one vertex outside of B. All vertices of C that are not in B have the same value of xi=z, and we can assume that those vertices form a connected subtree themselves. Indeed, suppose removing B splits C into multiple independent parts. Since B is the subtree with the highest sum, each of those parts must have a non-positive sum, otherwise we could improve B. But then we can remove all but one of those parts and obtain a new C with at least as high a value of z that is in one piece. So there are essentially two choices to consider: either C is disjoint from B, or C contains the entire B plus some other connected subtree.
Then we can apply exactly the same approach as we did for B and prove that the probability density for each vertex in C-B having xi=z is simply z^(c-1)/n^c, where c is the number of vertices in C-B, and we additionally know that z<=y.
We can keep applying the same step until the entire tree is split into connected components with the same values of xi, with the probability density of that happening being a product of terms of the form y^(b-1)/n^b and some constraints of the form z<=y. And here comes another crucial observation: we can forget about the constraints of the form z<=y! This is because when z>y, we can still obtain this structure, just by taking the component with z before the component with y. So in effect, we just need to split the tree into connected parts, and then we can handle the parts completely independently, with the only constraint being 0<=y<=1.
The product of xi for all i in B is equal to y^b, and its expected value is therefore equal to the integral of y^b*y^(b-1)/n^b over 0<=y<=1, which is equal to 1/(2*b*n^b). The expected value of a product of independent random variables is equal to the product of their expected values, and we already established that we can treat y, z, ... independently. So we just need to find the sum of products of 1/(2*b*n^b) over all possible ways to split the tree into connected subtrees, which is solved with a standard dynamic programming on the tree that remembers in its state the size of the “running” part containing the root of the current subtree. It runs in O(n^2), since the sum of products of sizes of children being merged in such a DP is O(n^2) for any tree, which in turn holds because every pair of vertices is merged into one subtree only once.
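Here is a sketch of that final DP, using exact fractions instead of the modular arithmetic an actual submission would need, and with the adjacency-list representation adj being an assumption of the illustration; dp[s] accumulates, over the ways to split the already processed part of the tree, the products of 1/(2*b*n^b) over the closed parts, where s is the size of the still-open part containing the current vertex.

```python
from fractions import Fraction
import sys

def expected_score(adj):
    """Sum over all ways to split the tree into connected parts of the product
    of 1/(2*b*n^b) over the part sizes b; exact fractions for clarity, whereas
    a real submission would work modulo a prime."""
    n = len(adj)
    w = lambda b: Fraction(1, 2 * b * n**b)       # value of a closed part of size b
    sys.setrecursionlimit(max(1000, 2 * n + 100))

    def dfs(v, parent):
        dp = [Fraction(0), Fraction(1)]           # dp[s]: the open part containing v has size s
        for u in adj[v]:
            if u == parent:
                continue
            child = dfs(u, v)
            # The child's open part is either closed right below v, or merged into
            # v's part; the pairwise size merging is what makes the whole DP O(n^2).
            closed = sum(child[s] * w(s) for s in range(1, len(child)))
            merged = [Fraction(0)] * (len(dp) + len(child) - 1)
            for i in range(1, len(dp)):
                if dp[i]:
                    merged[i] += dp[i] * closed
                    for j in range(1, len(child)):
                        merged[i + j] += dp[i] * child[j]
            dp = merged
        return dp

    root = dfs(0, -1)
    return sum(root[s] * w(s) for s in range(1, len(root)))

# Tiny check: for a single vertex, a_1 is uniform on [0, 1] and the answer is E[a_1] = 1/2.
print(expected_score([[]]))   # 1/2
```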
From another angle, we have “reordered the summation” in this solution twice: first we decide on the values of xi, then we decide on the values of dpi, and only then do we obtain the values of ai, which are the original random variables.
I find this solution quite beautiful as it requires creating a deep understanding of the structure induced by the process described in the problem statement, and this structure ends up being non-trivial yet tractable. Well done to the problemsetters!
Thanks for reading, and check back next week.