Ideal combinatoric strings

This is something I worked on a year ago, so I'll keep it (relatively) brief.


There's a keypad on my apartment building which accepts 5-digit codes. One day on the way in, I started thinking about how long it would take to guess a working code by brute force. The most obvious answer is that with five digits, you have %10^5 = 100,000% possible codes; since each one of these is 5 digits long, you're looking at %500,000% button pushes to guarantee finding an arbitrary working code.


But then I realized you could be a little smarter about it. If you assume the keypad only keeps track of the most recent five digits, you can do a rolling approach that cuts down your workload. For example, say I hit 12345 67890. If the keypad works as described, you're not just checking two codes there, you're checking 6: 12345, 23456, 34567, 45678, 56789, 67890.


The next natural question was how many button-pushes this could actually save. After doing some work, I satisfied myself that you could cut it down by a factor of %~D%, where %D% is the number of digits in a code. So if you typed the right digits in the right order, instead of the %500,000% before, you're only looking at %100,004% (you need an extra 4 to wrap the last few codes up).


The next natural question was: how do you actually come up with that string of digits? It has to be perfect, in that it contains every possible 5-digit combination without repeating any of them. As with almost any exploratory problem, the best approach is to simplify as much as possible. For instance, consider a binary code three digits long, which only has %2^3 = 8% different codes: %{000, 001, 010, 011, 100, 101, 110, 111}%.

My formula suggested you should be able to hit all eight of those in a %2^3 + 3 - 1 = 10% digit string, and it's easy enough to put one together by a little trial and error: 0 0 0 1 1 1 0 1. I found it was easiest to treat these strings as cyclical, so the 0 1 at the end wrap around to give you 0 1 0 and 1 0 0. As a bonus, any rotation of this string will work just as well.


As I scaled the problem up, however, more and more things became clear. First, above a certain point, you start getting multiple viable optimal strings that are not simple transformations of one another. Second, finding an elegant way to generate these strings was not turning out to be easy.

I found one mechanical way of generating a valid string that worked, but I didn't love it. If you list all the combinations you have to cover, and then slot each combination into a buffer greedily, meaning the earliest spot where it can fit (potentially with overlap), it works out.

mechanical approach


At some point I realized the generative process could be viewed as a directed graph, the nodes representing an N-length code, its successors delineating alternatives for continuing the string. After a few attempts, I got a pretty clear-looking one down (the node labels 0-7 are standing in for their binary counterparts):

directed graph

As it turns out, you can start on any node and if you trace a Hamiltonian cycle—touching each vertex only once and ending back at the start—the numbers you hit along the way form a valid optimal string. This approach also scales with the parameters of the problem, but requires a messier or multi-dimensional graph.


Whenever I stumble into open-ended problems like this, I avoid Googling for the answer at first, because what's the fun in that? I was pretty sure this would be a solved problem, though, and after spending a while working this through myself, I looked for and found Wikipedia's page on De Bruijn sequences. As usual, I was beaten to the punch by somebody a hundred years earlier. Hilariously, however, in this case the results matched up better than expected. Check out the Wiki page, notably a) the graph on the right, and b) the first line of the section "Uses".


If you want to see the wild scratch pad which brought me from point A to point B, by all means, enjoy.

Goldbach as bitfield

I've been trying to visualize the problem in many different ways, tables and graphs and geometry and anything else that seems plausible. Here's one I nailed down this morning that's concise and readily graspable for any coder.

First, ignore the %2%. It just gets in the way.

And then we get going. Starting with %3%, we'll build up a list of primes and I'll show how you can easily check for Goldbach numbers while you do. As you build a string of bits, on each new number, you reverse the bits you have so far and slide it over one.

Much easier understood with a demonstration, starting with %n=6 => n/2=3% and incrementing %n=n+2%, as it must be even.

n=6   3
prm?  1
rev   1

n=8   3 5
prm?  1 1
rev   1 1

n=10  3 5 7
prm?  1 1 1
rev   1 1 1

n=12  3 5 7 9
prm?  1 1 1 0
rev   0 1 1 1

n=14  3 5 7 9 11
prm?  1 1 1 0 1
rev   1 0 1 1 1

n=16  3 5 7 9 11 13
prm?  1 1 1 0  1  1
rev   1 1 0 1  1  1

n=18  3 5 7 9 11 13 15
prm?  1 1 1 0  1  1  0
rev   0 1 1 0  1  1  1

n=20  3 5 7 9 11 13 15 17
prm?  1 1 1 0  1  1  0  1
rev   1 0 1 1  0  1  1  1

All we're doing with rev is reversing the order of the "prime?" bits. If you watch from one to the next, you can see that this is actually resulting in a kind of ongoing shift each time. Observe:

prm?   1  1  1  0  1  1  0  1  1  0  1  0  0  1
n      3  5  7  9 11 13 15 17 19 21 23 25 27 29
6      1
8      1  1
10     1  1  1
12     0  1  1  1
14     1  0  1  1  1
16     1  1  0  1  1  1
18     0  1  1  0  1  1  1
20     1  0  1  1  0  1  1  1
22     1  1  0  1  1  0  1  1  1
24     0  1  1  0  1  1  0  1  1  1
26     1  0  1  1  0  1  1  0  1  1  1
28     0  1  0  1  1  0  1  1  0  1  1  1
30     0  0  1  0  1  1  0  1  1  0  1  1  1
32     1  0  0  1  0  1  1  0  1  1  0  1  1  1

Notice all we are doing, for each new line, is taking %prm?(n)%, sticking it at the beginning of the new line, and shoving over all the rest. Same as above, but easier to see here. Note also the prime bits read down vertically in the regular order.


The thing is, you can check for any and all Goldbach numbers by simply bitwise ANDing two corresponding strings together. (This may actually be an efficient way to code this, should such a need arise.) Take %n=28%:

       3  5  7  9 11 13 15 17 19 21 23 25
prm?   1  1  1  0  1  1  0  1  1  0  1  0
rev    0  1  0  1  1  0  1  1  0  1  1  1
  &    0  1  0  0  1  0  0  1  0  0  0  1

The resulting bits identify which numbers are "Goldbach numbers," primes which will sum to %n%. In this case we get two pairs:

%5 + 23 = 28%
%11 + 17 = 28%

This process will give you all valid Goldbach numbers for a given %n%.

  prm? 1  1  1  0  1  1  0  1  1  0  1  0  0  1
n      3  5  7  9 11 13 15 17 19 21 23 25 27 29
6     (1)                                            3+3=6
8     (1  1)                                         3+5=8
10    (1 (1) 1)                                      3+7=10   5+5=10
12     0 (1  1) 1                                    5+7=12
14    (1  0 (1) 1  1)                                3+11=14  7+7=14
16    (1 (1  0  1  1) 1)                             3+13=16  5+11=16
18     0 (1 (1  0  1) 1) 1                           5+13=18  7+11=18

...and so on. I added the parentheses to emphasize the pairs, but it's just matching up by working from the outsides to the center. It did make me notice something I should have seen before, which is that if you don't want to bother with all the reversing and stuff, you accomplish the same thing by taking the prime bit string, counting in from both sides, and watching for matches that are both 1. Same exact thing.

The astute observer may point out that I'm not actually doing anything in this write-up, which is true. Especially viewed this most recent way, all we're doing is literally taking the Goldbach Conjecture by its definition, and looking manually for pairs of primes that add up. If there is something to be gained from this approach, I suspect it lies in study of the "sliding" nature of the reversal version, and seeing how it conspires not to leave holes.

This is almost certainly biased on my hunch that Goldbach's undoing lies in a thorough examination of the mechanism driving its avoidance of symmetry; to violate Goldbach, there would have to be an instance of a certain symmetry, which would in turn be representative of a periodic pattern to a degree verboten to primes.

But hey, that's just me.

A thought of little bit

So here's how I went down the information theory rabbit hole this morning.

How much information is in a bit?

You might say '%1% bit.' I am inclined to agree.

How much information is in two bits?

My answer: it depends.

Disregarding any other considerations, and on average, I would agree that two bits stores two bits of information. However, let's say that the first bit you receive from some black box is a %0%. (Or it could be a %1%, and after inverting the rest of the following argument, it would yield identical results, which perversely almost makes it seems as though that first bit carries no information after all.)

But I digress. So, you have your %0%. Now, out comes another %0% as the second bit.

A word on this black box. You don't know what's inside it. It could be a friend connected through the internet, it could be a random number generator, it could be sequential bits from a vacation picture on your hard drive, it could be aliens trying to make contact, you name it.

Now, the specific details of the type of distribution and manner of bit-generating agents that might be in there is actually an absolutely critical point which is not to be glossed over, and so I will cheerfully completely disregard it, except to say that for our purposes, we will allow in a hand-wavy way that all the possibilities listed each represent valid non-negligible possibilities, as that is the spirit of this thought experiment.

Moving on. A priori, the second bit (%b_2%) could have been a %0% or a %1%. To be able to estimate which was more likely, without knowing the details about what the probability distribution of possible bit-generating entities actually is, is arguably impossible, but with the barest of assumptions, this is no longer the case. I maintain that a second %0% was indeed the more likely bit to arrive, and by a significant margin, too.

Consider it through the lens of entropy. Whatever is in the box, it will be emitting bits following some kind of overarching guidelines, which you can consider ranging from "perfectly ordered" to "perfectly entropic", the latter being more or less synonymous with "completely random." If it is a true RNG in there, then your odds are 50/50 for each bit that comes out, including %b_2%.

However, if it is something highly ordered, say a simple computer program, it may well be printing an endless stream of zeros. As was strongly indicated by my previous excursion into the cartography of computation space, repeated streams of bits are the single easiest thing to generate with any kind of reasonable computational framework; note that computational framework here can refer to anything from a programming language to the behavior of an ant colony on down to physics itself. It simply takes less work to generate some numbers than others. So, all things being roughly equal, I submit that you'll find a second 0 more often than not. This represents the more ordered possibility, while %b_2=1% would be the more entropic possibility.

Because of these probabilities, the underlying information content of that second bit is changed. After all, if something very likely happens, you've gained very little information; if the sun rises tomorrow, you can't conclude much from it. If the sun were not to rise, well, that contains a LOT of information, information you'd get to enjoy for maybe a month or two before the atmosphere freezes over.

Let's try a more mathematical but equally stark demonstration of the same essential argument. Consider: you watch the box for days, and all it does it spit out %0%s, one after another. Millions of them. Each new zero says even less than the one before it (or, arguably, devalues all bits received retroactively—probably moot.)

And then, one day, you wake up and see a single %1% came through. Being a man or woman of science, you eventually think to count exactly how many zeros came out before the %1%, and to your amazement, there are %3,141,592,654% zeros, which you naturally immediately recognize as %\pi \times 10^9%. Even setting aside any deeper implications about what that might portend, it is now obvious that the odds are extremely good that there is an agent or process which is highly structured at work in the box.

All that from a single bit. Clearly that %3,141,592,655^(th)% bit contained more information than the millions before it. You might say it shares its information content among all the bits before it, but it has still, by itself, made an enormous net change to the Shannon entropy and Kolmogorov complexity of the data. As a lower bound, you might say that in such a pattern, where you have %n% %0%s followed by a single %1%, that %1% contains at least a ballpark figure of %log_2 n% bits of actual information, that being the amount of information inherent in the number generated. E.g., %0000 0000 1% gives you %8% zeros, and %log_2 8 = 3%, which is the number of bits you'd require to generate a number between %1% and %8% inclusive.

So what?

I don't know. Stuff is neat.

(Maybe I'll show how primes fit into all this tomorrow!)