combinatorics

Ideal combinatoric strings

This is something I worked on a year ago, so I'll keep it (relatively) brief.

Inspiration

There's a keypad on my apartment building which accepts 5-digit codes. One day on the way in, I started thinking about how long it would take to guess a working code by brute force. The most obvious answer is that with five digits, you have %10^5 = 100,000% possible codes; since each one of these is 5 digits long, you're looking at %500,000% button pushes to guarantee finding an arbitrary working code.

Realization

But then I realized you could be a little smarter about it. If you assume the keypad only keeps track of the most recent five digits, you can do a rolling approach that cuts down your workload. For example, say I hit 12345 67890. If the keypad works as described, you're not just checking two codes there, you're checking 6: 12345, 23456, 34567, 45678, 56789, 67890.
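Here's that bookkeeping in a few lines of Python (written now just to make the counting concrete; the function is mine, not anything the keypad does):

    def codes_covered(presses, code_len=5):
        """Every code a keypad sees if it only remembers the last code_len digits."""
        return [presses[i:i + code_len] for i in range(len(presses) - code_len + 1)]

    print(codes_covered("1234567890"))
    # ['12345', '23456', '34567', '45678', '56789', '67890'] -- six codes for ten presses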

Foundation

The next natural question was how many button-pushes this could actually save. After doing some work, I satisfied myself that you could cut it down by a factor of %~D%, where %D% is the number of digits in a code. So if you typed the right digits in the right order, instead of the %500,000% before, you're only looking at %100,004% (you need an extra 4 to wrap the last few codes up).
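In general, covering every base-%k%, %D%-digit code this way takes

\[ k^D + D - 1 \]

button pushes: one push per new code once the window is rolling, plus %D - 1% extra to fill the very first window. For %k = 10% and %D = 5%, that's the %100,004% above.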

Experimentation

The next natural question was: how do you actually come up with that string of digits? It has to be perfect, in that it contains every possible 5-digit combination without repeating any of them. As with almost any exploratory problem, the best approach is to simplify as much as possible. For instance, consider a binary code three digits long, which only has %2^3 = 8% different codes: %{000, 001, 010, 011, 100, 101, 110, 111}%.

My formula suggested you should be able to hit all eight of those in a %2^3 + 3 - 1 = 10% digit string, and it's easy enough to put one together by a little trial and error: 0 0 0 1 1 1 0 1. That's only eight digits because I found it was easiest to treat these strings as cyclical, so the 0 1 at the end wraps around to give you 0 1 0 and 1 0 0; repeat the first two digits at the end and you get the 10-digit linear version. As a bonus, any rotation of this string will work just as well.
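For the skeptical, here's a quick check (Python written now, not part of the original tinkering) that the cyclic string really does cover all eight codes:

    from itertools import product

    def cyclic_windows(s, n):
        """All n-digit windows of s, read cyclically (wrapping around the end)."""
        doubled = s + s[:n - 1]
        return {doubled[i:i + n] for i in range(len(s))}

    s = "00011101"
    assert cyclic_windows(s, 3) == {"".join(bits) for bits in product("01", repeat=3)}
    print(len(s), "cyclic digits cover all", 2 ** 3, "codes")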

Perspiration

As I scaled the problem up, however, more and more things became clear. First, above a certain point, you start getting multiple viable optimal strings that are not simple transformations of one another. Second, finding an elegant way to generate these strings was not turning out to be easy.

I found one mechanical way of generating a valid string that worked, but I didn't love it. If you list all the combinations you have to cover, and then greedily slot each combination into a buffer at the earliest spot where it fits (overlapping with what's already there where possible), it works out.

E.g.:
[figure: mechanical approach]
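There's also a closely related greedy rule, simpler than the slotting above, which is a classical construction known to land on an optimal string: start from a run of zeros and always append the largest digit that doesn't repeat a window. It isn't the approach in the figure, just a sketch in the same greedy spirit:

    def greedy_optimal_string(k, n):
        """'Prefer the largest digit' greedy: start with n zeros and keep appending
        the biggest digit whose new n-digit window hasn't been seen yet; when no
        digit works, every window has appeared exactly once and we're done."""
        digits = "0123456789"[:k]
        seq = digits[0] * n
        seen = {seq}
        while True:
            for d in reversed(digits):
                window = seq[len(seq) - n + 1:] + d
                if window not in seen:
                    seen.add(window)
                    seq += d
                    break
            else:                       # no digit extends the string
                return seq

    s = greedy_optimal_string(2, 3)
    print(s, len(s))                    # 0001110100, length 2**3 + 3 - 1 = 10
    assert len({s[i:i + 3] for i in range(len(s) - 2)}) == 2 ** 3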

Beautification

At some point I realized the generative process could be viewed as a directed graph, with each node representing an N-length code and a node's successors giving the alternatives for continuing the string. After a few attempts, I got a pretty clear-looking one down (the node labels 0-7 are standing in for their binary counterparts):

[figure: directed graph]

As it turns out, you can start on any node and if you trace a Hamiltonian cycle—touching each vertex only once and ending back at the start—the numbers you hit along the way form a valid optimal string. This approach also scales with the parameters of the problem, but requires a messier or multi-dimensional graph.
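To make that concrete for the binary, 3-digit case, here's a small Python rendering of the graph plus a brute-force backtracking search for a Hamiltonian cycle (my own sketch; for bigger parameters you'd want something less naive):

    from itertools import product

    N = 3
    nodes = ["".join(p) for p in product("01", repeat=N)]
    # Edge u -> v whenever v can follow u in the rolling window:
    # drop u's first digit and append one new digit.
    succ = {u: [u[1:] + d for d in "01"] for u in nodes}

    def hamiltonian_cycle(start):
        """Backtracking search for a cycle visiting every node exactly once."""
        path, visited = [start], {start}

        def extend():
            if len(path) == len(nodes):
                return start in succ[path[-1]]      # can we close the loop?
            for v in succ[path[-1]]:
                if v not in visited:
                    visited.add(v)
                    path.append(v)
                    if extend():
                        return True
                    path.pop()
                    visited.remove(v)
            return False

        return path if extend() else None

    cycle = hamiltonian_cycle("000")
    print(cycle)
    # The digit appended at each step, read cyclically, is a valid optimal string.
    print("".join(v[-1] for v in cycle))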

Confirmation

Whenever I stumble into open-ended problems like this, I avoid Googling for the answer at first, because what's the fun in that? I was pretty sure this would be a solved problem, though, and after spending a while working this through myself, I looked for and found Wikipedia's page on De Bruijn sequences. As usual, I was beaten to the punch by somebody a hundred years earlier. Hilariously, however, in this case the results matched up better than expected. Check out the Wiki page, notably a) the graph on the right, and b) the first line of the section "Uses".

Notation

If you want to see the wild scratch pad which brought me from point A to point B, by all means, enjoy.

Lower bound of big-O for sort

This is a quick story about today's instance of discovering something that's already on Wikipedia. I must be up in the triple digits at this point. That said, this is one of the more obvious ones.

I was reading a thread about which sorting algorithm could be considered "best," and someone mentioned a couple of algorithms which allegedly run in %O(n log log n)% time. This came as a shock, since I'd thought %n log n% was the brick wall for an honest sort, and I wondered whether, if these other sorts were for real, that would imply the existence of a linear-time sort.

I tried to imagine what the lower bound would be, and figured there must be a minimum number of comparisons required for some number of elements. Didn't take long to get from there to realizing that %n% integers can be arranged in %n!% different permutations, which I reasoned meant that one must gather enough information (read: take comparisons) from the data to uniquely identify which of the %n!% different arrangements the input is in.

That, in turn, screams out for a log, specifically %log_2(n!)%. If the sort permutation is truly random, then on average, we should expect to be able to identify it from %log_2(n!)% bits (read: comparisons, again). To be a little more precise, I guess it'd be more like \[ \frac{\lceil{n \log_2 (n!)}\rceil}{n} . \]

I cheated a little here and plugged %lim_{n->\infty} [log_2(n!)]% into Wolfram Alpha, and it was clear the dominating factor is, surprise surprise, %n log n%. As for those mystery %n log log n% algorithms, they were tough to track down too, and there seemed to be a lack of final consensus on the issue. Due to the math described herein if nothing else, they must operate with some limitation or another on domain or input, assuming they do work.
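Going back to the %log_2(n!)% bit: Stirling's approximation gives the same answer without the Wolfram cheat,

\[ \log_2(n!) \;=\; n \log_2 n \;-\; n \log_2 e \;+\; O(\log n) \;=\; \Theta(n \log n) . \]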

Later, at the bottom of the Wikipedia page on sorting algorithms, I saw all of this worked out in much the same way, with the weird surprise that once you get to %n = 12%, the number of comparisons required in the worst case inexplicably comes out one higher than the bound, requiring %30% comparisons where we predict %29%. Sadly, the footnote links explaining those aberrations were in journals behind paywalls, but the gist seemed to be that it was a bit of an open (or at least highly non-trivial) question.
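The %29% there is just the information-theoretic bound evaluated at %n = 12%; the check is a couple of lines of Python:

    import math

    # Information-theoretic lower bound on worst-case comparisons: ceil(log2(n!)).
    for n in range(2, 13):
        print(n, math.ceil(math.log2(math.factorial(n))))
    # ...prints 29 at n = 12, one short of the 30 comparisons actually required.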

The Subset Sum Problem

I have spent time working on almost every impossible problem there is.

This is a lie, but it is rooted in a kernel of truth. I remember my very first impossible problem. When I was eight or nine, my dad gave me a word problem about firemen and houses or some bullshit, but what it boiled down to was:

Draw three points wherever you want on a paper. Label them %A, B, C%.

Draw three other points, also wherever you want. Label them %1, 2, 3%.

Try to draw lines so that %A, B,% and %C% each have three lines going out to %1, 2,% and %3%—and have none of them cross.

[figure: sad diagram]

And I spent days drawing nonsense like that (thanks Google Images), even after he tried to convey with a straightforward topological proof why it was impossible. But I wasn't convinced until I tried and tried and tried. It was especially painful because as it turns out, you can get all but one last little line to connect.

I'm older now, and have some modest understanding of math, so one might think I no longer fall for traps like this. Not so. I still have to try. But at least now I know on a rational level that it's a lost cause.

The most recent example is a couple of nights ago when I was reading about the subset sum problem. It's about as easy to explain as the problem above; in layman's terms, it's "Figure out how to quickly look through a bunch of positive and negative numbers and see if you can pick some out that add up to 0."

In a slightly more rigorous sense, it's a problem known to be NP-complete (you would think that might tip me off). Given an arbitrary set of %N% integers, determine whether or not there is any subset thereof that sums to some given target (classically %0%). The trick, of course, is that you have to do it in polynomial time; the most efficient algorithm offered on Wikipedia requires %O(2^(N//2))%.
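That %O(2^(N//2))% algorithm is, as far as I can tell, the meet-in-the-middle trick: split the set in half, enumerate each half's subset sums, and look for two pieces that cancel. A rough Python sketch of the idea (my own toy rendering, with a hash-set lookup standing in for the sorted-list merging):

    def nonempty_subset_sums(nums):
        """Sums of all nonempty subsets of nums."""
        sums = set()
        for x in nums:
            sums |= {s + x for s in sums} | {x}
        return sums

    def has_zero_subset(nums):
        """Meet-in-the-middle: each half contributes about 2**(N/2) subset sums,
        so the work is roughly O(2**(N/2)) instead of O(2**N)."""
        half = len(nums) // 2
        left = nonempty_subset_sums(nums[:half])
        right = nonempty_subset_sums(nums[half:])
        # A nonempty zero-sum subset either lives entirely in one half, or splits
        # into a left piece and a right piece whose sums cancel.
        return 0 in left or 0 in right or any(-s in right for s in left)

    print(has_zero_subset([3, 9, -4, 11, -14, 7, 2]))   # True: 3 + 9 + 2 - 14 = 0
    print(has_zero_subset([1, 2, 4, 8, -16]))           # False: nothing cancels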

After thinking on it, my essentially random approach was to mod everything. The logic went like this.

You have %N% numbers. A naive approach would be to check the sum of every possible combination of your numbers, which is a combinatoric nightmare on the order of %O(2^N * N)%. No good.
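Spelled out in Python, purely for scale (the %2^N% subsets are what kill you):

    from itertools import combinations

    def has_zero_subset_naive(nums):
        """Try every nonempty subset and see whether any sums to zero:
        2**N - 1 subsets, each costing up to N additions to sum."""
        return any(sum(c) == 0
                   for r in range(1, len(nums) + 1)
                   for c in combinations(nums, r))

    print(has_zero_subset_naive([3, 9, -4, 11, -14, 7, 2]))   # True: 3 + 9 + 2 - 14 = 0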

Well, if we're shooting for %0%, then we can divide our set into positives and negatives. Surely this is an improvement; now, instead of grinding through the whole power set of our %N% numbers (all %2^N% subsets of it), we've made some progress, since we won't waste time trying sets that are all positive numbers, for example. But it turns out this by itself buys us very little.

Then I go, hey, what if you take every integer modulo %2%? Or, in other words, go through and mark down which numbers are even and which are odd.

Why? Because we know that our potential sum is going to be even (i.e. %0%). We already knew that we'd have to take one or more numbers from each of the positives and the negatives. Put those truths together, and it means any group that might work will have a certain property: if the sum of our positive numbers is odd, then the sum of our negatives will have to be odd too, or else the sum of both together would be odd, and therefore not zero. Likewise, if our positive sum is even, the negative sum has to be even too. Or,

\[ \sum_{p \in P} p \equiv x \pmod{2} \;\Longleftrightarrow\; \sum_{q \in Q} q \equiv x \pmod{2}, \]

where %P% and %Q% are the positive and negative parts of any candidate subset that sums to zero.

This is surely good, since on average, half of all possible combinations won't meet that constraint and thus can be discarded out of hand.

But wait, it gets better. Why not take everything modulo %3%? The same principle applies, which is that our overall sum must equal %0 \ ("mod"\ 3)%. So if the sum of the positives %("mod"\ 3)% is %1%, we know we only have to check those groups of negatives which sum to %2 \ ("mod"\ 3)%. And now it seems like we're getting a threefold rather than twofold improvement on our deal.

But wait, it gets better. Why not throw everything so far into a big pot and see what happens? It's not too hard to show that if you take %mod 2%, note which possible combinations remain, and then do the same thing with %mod 3%, you'll generally get a multiplicative reduction in the pool of viable combinations. Then continue this with a handful of ascending integers (maybe primes?) and with almost no work at all, you'll all but perfectly sieve out any invalid combinations.
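Here's a toy rendering of that sieve on a small made-up instance (this only illustrates the filtering; it isn't a speedup claim, for reasons below):

    from collections import defaultdict

    def nonempty_subset_sums(nums):
        sums = set()
        for x in nums:
            sums |= {s + x for s in sums} | {x}
        return sums

    def buckets_mod(sums, m):
        """Group subset sums by their residue mod m."""
        out = defaultdict(set)
        for s in sums:
            out[s % m].add(s)
        return out

    positives, negatives = [3, 9, 11, 7, 2], [-4, -14, -5]
    m = 6                                   # mod 2 and mod 3 folded together
    pos = buckets_mod(nonempty_subset_sums(positives), m)
    neg = buckets_mod(nonempty_subset_sums(negatives), m)

    for r in range(m):
        # Positive sums with residue r only need checking against negative sums
        # with residue -r mod m; every other pairing can't reach zero.
        partners = neg.get((-r) % m, set())
        hits = sorted((p, q) for p in pos.get(r, set()) for q in partners if p + q == 0)
        print(f"residue {r}: {len(pos.get(r, set()))} x {len(partners)} pairs to check, hits: {hits}")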

And so far as I could tell, this is all true. The insidious problem, and this is where I stopped working on this, is that the long cold years have taught me that things are never easy, particularly extremely difficult mathematical things. I have absolutely no doubt that if I sat down and did the math carefully, I would find that I was using up at least as many operations in logistical overhead as I was saving in the sieving process.

That's how these deceptive problems work. You feel like you're making progress, but all you've really done is paint a horse like a zebra; the nature of the thing itself remains unchanged.

Primes are especially brutal about that. Watch out for primes.