complexity theory

Dark integers

(or, "Yet another way to learn more about the underlying structure of the universe.")

Let us assume.

I was reading my article on the distribution of computation space and going back over my results, but this time around, a new potential implication took root in my imagination.

Let's assume for the moment that Turing-completeness is necessary and sufficient for computation of any general sort, and furthermore, that there are no such mechanisms in existence that can compute on a fundamentally higher level. In other words, let us assume that every computing device or mechanism is isomorphic to any other such device or mechanism.

Let's also assume that my experimental analysis of computation space was more or less correct, at least in the broad strokes. If someone else conducted a similar experiment using a substantially different instruction set or methodology, some of the specifics might turn out differently, but I suspect the more fundamental properties I found would still show up.

These aren't trivial assumptions, but they're at the very least plausible. So we proceed.

Scarcity.

Near the end of my article, I speculated that primes would probably be under-represented in a list of numbers generated by random code and sorted by frequency. This was partly an observation, but mostly a logical argument; by their nature, primes eschew association with simple patterns, and simple patterns are the best you're going to get out of a random program generator. I would not expect primes to start showing up in bulk until we reached the level of complexity where a prime sieve might reasonably form, and that's well beyond the scope of what we're dealing with here. (Side note: it would be interesting to figure out the shortest possible prime-generating program using that sort of basic instruction set.)

What the article doesn't include is my follow-up analysis on the subject, which indeed showed that primes seem to be relatively less frequent than other integers in their neighborhood. Here's a plot relating the set of integers (absolute values were taken) to the number of times each one appeared in the cumulative output of a billion random micro-programs.

Integer Frequency through N=600

There is a lot in this plot that bears discussion, but let's stick to the highlights:

  • It is roughly logarithmic.
  • There are conspicuous spikes (and patterns of spikes), most notably around each %2^n%.
  • The red dots—you may have to click on it and zoom in to see them—are primes.

Conspicuous spikes.

If you haven't figured it out yourself, the spikes on the powers of two are a consequence of the binary assembly code in which these micro-programs are written. It's operationally trivial for computers to multiply or divide by two, so you'll see it a lot in a random distribution like this. And getting to %64% is not much more work than getting to %32% (after all, the infrastructure is already in place.) The same goes for %128%, %256%, and so on, which is why you see these spikes and why they're barely even tapering off.

Red dots.

Closer examination of where these dots fall shows that they're almost always in a crevice, or at least on a slope, which is to say that the number in question has a notably smaller chance of being generated than one or both of its neighbors.

However, primes are way too slippery a customer to allow this to always happen. Take, for example, %n=31%. This is the first prime that's sitting up on a mesa. I argued before that primes are hard to generate, so what gives?

Well, for many purposes, the distribution of primes is effectively random. Because of this, some of them are doomed to land right next to a really popular neighbor, as with our poor %31%. As discussed, %32% is especially easy to generate, and as it happens, adding or subtracting %1% is also generally a piece of cake. It's just bad luck for %31% that his neighbor attracts all this traffic for him.

Degrees of introversion.

The example of %31% is particularly egregious, since those %2^n% spikes are the biggest bullies on the playground. But all primes are affected by this phenomenon to some extent. Since a prime by its nature can't be directly conjured via a loop (read: multiplication, exponentiation, hyper-operations), the only way to get to these numbers (short of e.g. prime sieves) is to land on some composite number close to it, and then happen to add or subtract the right constant. The frequency of any prime is ultimately a reflection of its neighborhood.
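To make that concrete, here's a toy cost model—just a sketch in Python, nothing to do with Skynet's actual instruction set—that computes the minimum number of increment, decrement, and double operations needed to reach each integer from zero. An odd prime can't be the result of a doubling, so it inherits the cost of the cheapest number next door, plus one.

from collections import deque

# Toy stand-in for "how hard is n to generate": fewest {+1, -1, *2}
# operations needed to reach n starting from 0 (breadth-first search).
def op_costs(limit):
    cost = {0: 0}
    queue = deque([0])
    while queue:
        v = queue.popleft()
        for nxt in (v + 1, v - 1, v * 2):
            if 0 <= nxt <= limit and nxt not in cost:
                cost[nxt] = cost[v] + 1
                queue.append(nxt)
    return cost

costs = op_costs(600)
print(costs[32], costs[31])    # 6 vs 7: 31 piggybacks on 32
print(costs[128], costs[127])  # 8 vs 9: same story for 127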

There are lots of consequences to this arrangement. Turning it around, one good example is the case of twin primes, which are primes that differ only by %2%. Apart from the lone pair %3% and %5%, you only get twin primes when the number between them is divisible by %6%. Any multiple of %6%, being divisible by %2%, %3%, and %6%, is right up there as an easy target for random generation. The upshot here is that when you have a twin prime pair, you will almost always see a single-value spike between them, since they're surrounding such a juicy target.
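That claim is easy to sanity-check with a throwaway script (plain trial division, nothing to do with Skynet):

def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

# Every twin prime pair above (3, 5) straddles a multiple of 6.
twins = [(p, p + 2) for p in range(3, 10000) if is_prime(p) and is_prime(p + 2)]
print(all((p + 1) % 6 == 0 for p, _ in twins if p > 3))   # True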

But enough fun with plots.

Dark integers.

This is a term I pulled out of my ass which I like so much that I refuse to Google it and find out it's already taken.

We've shown how certain integers are "popular" in terms of computational space frequency, like the powers of two; we have also shown that some tend in the other direction, such as the primes. A dark integer, according to me, is loosely defined as an integer that's especially hard to generate through an algorithm of any kind. I am hampered by the fact that I don't have a working theory as to how to strictly identify these numbers other than by brute force, but I have some idea of their properties.

They are not necessarily prime—in fact, they seemingly prefer to be complicated composites with many disparate factors, often including one or more large primes as a factor. It does make some sense that this would be the ideal setup for a dark integer. The erratic and informationally-dense set of factors it comprises makes it an unpopular number already, and not being prime allows it to be bordered by or nearby to one or more primes, which will not attract any spillover traffic.

'Darkness' in this sense is a relative term, at least for the moment. Perhaps it will make sense to define a dark number as any integer %N% that has a strictly larger Kolmogorov complexity than any %n < N%, although Kolmogorov complexity isn't computable in general, so that would be hard to pin down in practice. At any rate, some numbers are darker than others, and while it should roughly correspond to which numbers are lowest on a plot such as the one above, we have to remember that this is an experimentally-derived data set and prone to noise, especially as frequency decreases.

That said, I would like to tip my hat to %3465%, which was the lowest number that did not appear a single time in my billion programs. Wolfram Alpha has this to say about %3465%:

  • %3465 = 3^2 \times 5 \times 7 \times 11%.
  • %3465% has the representation %3465=2^7 3^3+9%.
  • %3465% divides %34^6-1%.
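For what it's worth, all three facts check out in a few lines:

n = 3465
print(3**2 * 5 * 7 * 11 == n)     # True
print(2**7 * 3**3 + 9 == n)       # True
print((34**6 - 1) % n == 0)       # True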

Whether any of that is significant is too soon to say.

Time to get crazy.

So, do you remember what we're assuming about computers? Basically, that a computer is a computer by any other name? It comes into play now.

  • Assumption: The universe is not magic. It plays by the same Turing-equivalent rules as everything in it. This means physics is ultimately deterministic, and if we had infinite time and infinite memory, we could simulate our universe on a Turing machine.
  • Assumption: The Big Bang happened. The initial configuration of energy at its inception was effectively random.

The universe seemingly operates by following physical laws, with all matter and energy having been kicked off by that hot mess of a Big Bang. This strikes me as loosely analogous to my Skynet micro-program generator, what with random initial conditions being run through clearly defined, step-by-step processes. More to the point, our earlier assumption posits that they are mathematically interchangeable for our purposes.

Hypothesis: For any given countable set of anything in the universe, the quantity of that thing is relatively unlikely to be a dark integer, all things being equal.

To test the hypothesis:

  1. Identify dark integers. The more, the larger, the merrier.
  2. Identify anything that can be quantified.
  3. Quantify the anything.
  4. Repeat to within error bounds for confirmation or disproof.

There may, unfortunately, be some bounds on what the anything could actually be. Ideally, it would be as simple as counting a bunch of atoms of something over and over and tracking the results, but more likely it would have to be something more directly related to the Big Bang, such as star counts in galaxies.
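As a very rough sketch of what step 4 might look like—assuming we already had a set of dark integers and a pile of measured quantities, both of which are placeholders here—the comparison could be as blunt as a one-sided binomial test using SciPy:

from scipy.stats import binomtest   # SciPy >= 1.7

def dark_deficit_test(observed_counts, dark_integers, lo, hi):
    # How often do measured counts land on dark integers, versus how often
    # they would if counts were spread evenly over [lo, hi]? (A real test
    # would need a better null model than "uniform"; this is just the shape.)
    in_range = [c for c in observed_counts if lo <= c <= hi]
    hits = sum(1 for c in in_range if c in dark_integers)
    p_chance = sum(1 for d in dark_integers if lo <= d <= hi) / (hi - lo + 1)
    return binomtest(hits, n=len(in_range), p=p_chance, alternative='less')

# e.g. dark_deficit_test(star_counts_per_galaxy, dark_set, 10**3, 10**6)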

If all of this were sound, conclusions could be drawn.

If dark numbers really do turn up less often as real-world quantities, it may be further evidence of the truly random and arbitrary nature of creation, and more importantly, it may be strong evidence of a deterministic universe and all that that implies.

If there is no discrepancy when it comes to dark numbers, it may imply that there is a true stochastic or otherwise bizarre layer to physics, or it may imply that the Big Bang was ordered more deliberately than one would think, or that there is some agency at work that is disrupting the "natural" order of things in one way or another.

Either way, we learn.


Addendum

The dark integers on the graph above are specific to the instruction set used to generate it. While there is reason to expect that most sets of operations will be similar inasmuch as a binary system seems to be the simplest approach, it is not a guarantee. One can imagine pathological instruction sets that will yield an entirely different contour, such as one that uses the primes as its base units.

Long story short, if the binary analysis doesn't turn up results, it might be worth investigating the possibility of reverse engineering an instruction set by identifying its specific dark integers and trying to tease out what sort of atomic operations would give rise to those holes.

While applying any of this to cosmology is wildly speculative, biology and the mechanism of evolution seem like a perfect candidate for testing the viability of dark integers in general. Evolution is undeniably a finite computational force that lends itself well to the whole idea. Multi-cellular organisms are literally built from code, and it stands to reason that on balance, evolutionary pressures will tend to restrict most quantifiable biological features to easier-to-compute amounts. More on this as I investigate.

Exploring computation space

The birth of Skynet

A couple of years back, I wrote a little program named Skynet. It uses a configurable set of assembly-like instructions to spawn tiny programs, cobbled together completely at random. The original idea was to be able to feed it a set of input and output integer lists representing some function you were trying to find, and have it then blindly fumble its way through millions or billions of those micro-programs until it hit on one that satisfied your conditions.

For example, if you wanted some code that would exponentiate, you might feed it a configuration file signifying

Input 1: 2 3
Output 1: 8
Input 2: 4 3
Output 2: 64

and it would run forever until it stumbled upon a solution, or you killed it.

And it worked, and it was fun to screw around with, if you like that sort of thing. Feel free to grab it from the link above, but it was a quicky that I wrote for myself, so it might be rough to figure out. It's Turing-complete, but it doesn't do anything very clever—no fancy machine learning algorithms here, just straight random goodness. You can specify a few parameters to try to guide its efforts, such as minimum and maximum program length to try, or how many instructions to execute before giving up on a microprogram as a runaway and moving on. I think it also tries to spot and discard obvious dead ends and NOOPs.
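If you don't feel like digging through the C, the core loop boils down to something like this—a toy reimplementation in Python with a made-up six-instruction machine, not the actual Skynet code:

import random

OPS = ['INC', 'DEC', 'DOUBLE', 'PUSH', 'ADD', 'SWAP']   # invented for illustration

def run(program, inputs, max_steps=50):
    # Tiny register machine: two registers, a stack, and the program
    # simply wraps around instead of having real jumps.
    a, b, stack = 0, 0, list(inputs)
    for step in range(max_steps):
        op = program[step % len(program)]
        if op == 'INC':      a += 1
        elif op == 'DEC':    a -= 1
        elif op == 'DOUBLE': a *= 2
        elif op == 'PUSH':   stack.append(a)
        elif op == 'ADD':    a += b
        elif op == 'SWAP':   a, b = b, a
    return stack

def search(examples, length=4, tries=100_000):
    # Generate random programs until one reproduces every example.
    for _ in range(tries):
        prog = [random.choice(OPS) for _ in range(length)]
        if all(run(prog, inp)[:len(out)] == out for inp, out in examples):
            return prog
    return None

# Hunt for something that pushes the Fibonacci numbers onto the stack:
print(search([([], [1, 1, 2, 3, 5, 8, 13])]))

The real thing has a proper program counter, jumps, and a configurable instruction set, but the generate, run, check rhythm is the same.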

Limitations

So yes, it works as intended, but the problem is that the realm of all possible computable functions is one of those seriously immense virtual universes that tend to pop up when dealing with combinatorics or math in general. You might have 16 instructions enabled at one time, but instructions at the assembly level don't accomplish a hell of a lot, and when you consider a branching factor that big, all of a sudden you're looking at %16^10% possible microprograms for even a modest 10-instruction snippet of code.

That's just over a trillion, many of which are likely to be non-halting and thus run several hundred instructions before bailing. The upshot is that to exhaustively try all of them, you're probably talking days or even months, depending on how optimized it all is. After quickly discovering the futility of systematically enumerating them, I said "fuck it" and redesigned things with its current random approach, perfectly happy to try out 50-instruction programs if the pRNG requests it. Of course, if there were actually some program out in the ether that needed to be 50 instructions long, the odds of running into it through chance are almost nothing—you could expect to hit it around the same time the last protons and neutrons are decaying.

But I digress. A lot of simpler algorithms can be squeezed into a tractable space.

Things it's figured out:
  • Fibonacci sequence
  • factorials
  • doubling, cubing
  • for loops
  • adding pairs of input numbers on the stack
  • finding the mean of input numbers
  • determining divisibility of input numbers by 7
  • reversing order of stack
Not so much:
  • primality detection
  • prime generation
  • misc. other NP problems
  • bootstrapping into sentience

Details

Some of those successes were much easier than others. Fibonacci, for example, seems to be about the next easiest thing for it to do besides count. The other day I got GitLab CI going which I'm abusing mercilessly by having it spawn Skynet jobs instead of actually testing anything. But manually, this is how it rolls:

[12:00] ~/progs/Skynet> time ./4bit --file fib.skynet -r
fscanfs return 6111111111  
low=4 up=8 exact=0 stopAfter=1 maxSteps=50

Input 1:  
Output 1: 1 1 2 3 5 8 13  
File loaded successfully.  
Randomizing instructions...  
[ INS  SWAB  MULT  PUSH  POP  INC  ADD  JNE  NAND  DEC  PEEK  SKIP  IFF  NOT  LOCO_1  MOD ]

Satisficing program (length 5):  INC  PUSH  ADD  SWAB  JNE  -=- Tracing operation -=-


                        [init   a=0     b=0     c=0     pc=0    sp=1]           ( __ )
        INC             [1      a=1     b=0     c=0     pc=1    sp=1]           ( __ )
        PUSH            [2      a=1     b=0     c=0     pc=2    sp=2]           ( __ 1 )
        ADD             [3      a=1     b=0     c=0     pc=3    sp=2]           ( __ 1 )
        SWAB            [4      a=0     b=1     c=0     pc=4    sp=2]           ( __ 1 )
        JNE             [5      a=0     b=1     c=0     pc=0    sp=2]           ( __ 1 )
        INC             [6      a=1     b=1     c=0     pc=1    sp=2]           ( __ 1 )
        PUSH            [7      a=1     b=1     c=0     pc=2    sp=3]           ( __ 1 1 )
        ADD             [8      a=2     b=1     c=0     pc=3    sp=3]           ( __ 1 1 )
        SWAB            [9      a=1     b=2     c=0     pc=4    sp=3]           ( __ 1 1 )
        JNE             [10     a=1     b=2     c=0     pc=0    sp=3]           ( __ 1 1 )
        INC             [11     a=2     b=2     c=0     pc=1    sp=3]           ( __ 1 1 )
        PUSH            [12     a=2     b=2     c=0     pc=2    sp=4]           ( __ 1 1 2 )
        ADD             [13     a=4     b=2     c=0     pc=3    sp=4]           ( __ 1 1 2 )
        SWAB            [14     a=2     b=4     c=0     pc=4    sp=4]           ( __ 1 1 2 )
        JNE             [15     a=2     b=4     c=0     pc=0    sp=4]           ( __ 1 1 2 )
        INC             [16     a=3     b=4     c=0     pc=1    sp=4]           ( __ 1 1 2 )
        PUSH            [17     a=3     b=4     c=0     pc=2    sp=5]           ( __ 1 1 2 3 )
        ADD             [18     a=7     b=4     c=0     pc=3    sp=5]           ( __ 1 1 2 3 )
        SWAB            [19     a=4     b=7     c=0     pc=4    sp=5]           ( __ 1 1 2 3 )
        JNE             [20     a=4     b=7     c=0     pc=0    sp=5]           ( __ 1 1 2 3 )
        INC             [21     a=5     b=7     c=0     pc=1    sp=5]           ( __ 1 1 2 3 )
        PUSH            [22     a=5     b=7     c=0     pc=2    sp=6]           ( __ 1 1 2 3 5 )
        ADD             [23     a=12    b=7     c=0     pc=3    sp=6]           ( __ 1 1 2 3 5 )
        SWAB            [24     a=7     b=12    c=0     pc=4    sp=6]           ( __ 1 1 2 3 5 )
        JNE             [25     a=7     b=12    c=0     pc=0    sp=6]           ( __ 1 1 2 3 5 )
        INC             [26     a=8     b=12    c=0     pc=1    sp=6]           ( __ 1 1 2 3 5 )
        PUSH            [27     a=8     b=12    c=0     pc=2    sp=7]           ( __ 1 1 2 3 5 8 )
        ADD             [28     a=20    b=12    c=0     pc=3    sp=7]           ( __ 1 1 2 3 5 8 )
        SWAB            [29     a=12    b=20    c=0     pc=4    sp=7]           ( __ 1 1 2 3 5 8 )
        JNE             [30     a=12    b=20    c=0     pc=0    sp=7]           ( __ 1 1 2 3 5 8 )
        INC             [31     a=13    b=20    c=0     pc=1    sp=7]           ( __ 1 1 2 3 5 8 )
        PUSH            [32     a=13    b=20    c=0     pc=2    sp=8]           ( __ 1 1 2 3 5 8 13 )
        ADD             [33     a=33    b=20    c=0     pc=3    sp=8]           ( __ 1 1 2 3 5 8 13 )
        SWAB            [34     a=20    b=33    c=0     pc=4    sp=8]           ( __ 1 1 2 3 5 8 13 )
        JNE             [35     a=20    b=33    c=0     pc=0    sp=8]           ( __ 1 1 2 3 5 8 13 )
        INC             [36     a=21    b=33    c=0     pc=1    sp=8]           ( __ 1 1 2 3 5 8 13 )
        PUSH            [37     a=21    b=33    c=0     pc=2    sp=9]           ( __ 1 1 2 3 5 8 13 21 )
        ADD             [38     a=54    b=33    c=0     pc=3    sp=9]           ( __ 1 1 2 3 5 8 13 21 )
        SWAB            [39     a=33    b=54    c=0     pc=4    sp=9]           ( __ 1 1 2 3 5 8 13 21 )
        JNE             [40     a=33    b=54    c=0     pc=0    sp=9]           ( __ 1 1 2 3 5 8 13 21 )
        INC             [41     a=34    b=54    c=0     pc=1    sp=9]           ( __ 1 1 2 3 5 8 13 21 )
        PUSH            [42     a=34    b=54    c=0     pc=2    sp=10]          ( __ 1 1 2 3 5 8 13 21 34 )

                                         ...

Total=357252            Timeouts=206816         Redundant=6694


real    0m0.130s  
user    0m0.127s  
sys     0m0.002s  
[12:00] ~/progs/Skynet>

As may or may not be clear, it spits out its working program at the top, and then walks you through a trace of its execution.

One of my discoveries was that program length matters a lot less than I would have guessed. You can set your program length bounds to look for stuff 30-50 instructions long, and it'll still pop out a Fibonacci program after a second. This is presumably because when you throw shit randomly together, you end up with an abundance of NOOPs (at least with my instruction set.)

To put that in slightly more concrete perspective, there are one or two 5-length versions of Fibonacci that exist, out of something like a million possible programs; put another way, you could say ~0.0002% of 5-lengthers are Fibonacci programs. But as it turns out, if you go out to 20 instructions, that percentage doesn't drop by much—in fact, for some things it climbs. The point I'm trying to make, which surprised me, is that regardless of the program length you're searching for, you'll hit anything provided it's simple enough to fit in that length or simpler, and these types of algorithms seem to occupy a pretty constant market share at any depth.

Present day

(Well, yesterday.)

I discovered PyPy and Cython, things which I should have known about a long time ago but didn't. I wrote Skynet in C because it's second to none for this sort of thing, probably a 30x speed increase over the same thing in vanilla Python. But Cython lets you statically type variables, embed C code, and other wily stuff, while PyPy appears to be a magical JIT compiler that makes your pure Python run 5-10x faster (if it's this sort of project).

Anyway, I wanted to try them out, so last night I ported Skynet back to Python, yielding skython, a name which I'm half proud of but made me die a little inside to use. After I got the basics working, I started to adapt it for Cython, but realized I would effectively be porting it awkwardly back to C-ish if I continued, so I've stuck with Python and am using PyPy, which seems to run it at maybe 25% of the speed of the C version. In other words, close enough to be an acceptable trade-off. It's not up to snuff yet, but it is working, and it's certainly faster to add features than it was with C.

To wit: I was thinking about the program-market-share concept I mentioned, and it led me to add a method which creates programs randomly without looking for any input or output, but rather keeping track of everything it churns out on its own. And this is where we actually rejoin the title and original thematic intent of this post, because the idea there was to explore the distribution of algorithmic information in computation space, by which I mean seeing just how often it opts for some sequences over others when selecting out of the vast set of all computable functions.
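In sketch form—re-using the toy OPS/run() machine from the Skynet section above rather than the real skython—the free-running mode is little more than a Counter:

from collections import Counter
import random

def explore(n_programs=1_000_000, length=5):
    # No target I/O: just run random programs and tally whatever they emit.
    seq_counts = Counter()
    for _ in range(n_programs):
        prog = [random.choice(OPS) for _ in range(length)]
        seq_counts[tuple(run(prog, []))] += 1
    return seq_counts

# explore().most_common(10) is dominated by empty and repetitive output,
# much like the real results below.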

And?

And it was interesting. I gave it some pretty neutral parameters and had it run a billion microprograms. There's lots I could have done (and perhaps will do) to clean up and solidify the results, but there was enough to get a sense of things.

A few results
  • Half the programs do nothing. Give or take.
  • Most of the rest, percentage-wise, print out endless strings of 0s or 1s.
  • The last few percent, though, starts to get interesting.
  • When you get down to 0.5% or so, you start to get 0101010... and 123456... sequences.
  • They're followed close behind by the 22222s and the 110110s and the 012012s... etc.

The first eye-catching result is down at 0.00272% computation-space-share (but this is still in the top percent of frequency):

0,1,3,7,15,31,63,127,255,511,1023,2047,4095,8191,16383,32767,65535,131071,262143,524287,1048575,2097151,4194303,8388607,16777215  

The pattern's clear enough. A little ways on, down at 0.00045%, I spot a 1, 4, 7, 10, ... along with the first Fibonacci. Down further, I recognize a pattern that came up in some math problem I worked on months ago:

0,1,2,5,10,21,42,85,170,341,682,1365,2730  

But as I went deeper, I found sequences that clearly had structure but were foreign to me:

1,3,8,20,49,119,288,696,1681,4059,9800,23660,57121,137903  

I started looking them up on OEIS, and found that most of the interesting-looking patterns were indeed in there, and at least frickin' half of them were related in some way to the Fibonacci sequence. Not that this is especially surprising or perplexing in hindsight, I'd just never given much thought to how fundamental the principle is upon which it's built. No wonder that it pops up in nature everywhere; God must have noticed the same thing. So yeah, there are some interesting patterns when you get deep, but I didn't find anything earth-shattering.

Zip 'em up!

When I got tired of that, I moved on to a test of algorithmic information theory. The expectation is that the more frequently an output shows up, the less complex it's apt to be. A quick'n'dirty test for complexity is compressibility—the more complex a thing, the less compressible, to the point where any data sufficiently close to random (which is 'perfectly' complex, in this sense) will only get bigger.
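You can see the idea without leaving Python—zlib at level 9 is roughly what gzip -9 does:

import random
import zlib

def compressed_ratio(text):
    # Rough complexity proxy: compressed size divided by original size.
    raw = text.encode()
    return len(zlib.compress(raw, 9)) / len(raw)

print(compressed_ratio("0,1,0,1,0,1,0,1," * 1000))   # regular: tiny ratio
random.seed(1)
print(compressed_ratio(",".join(str(random.randint(0, 10**6)) for _ in range(2000))))
# near-random: a much larger ratio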

I had my big honkin' 88mb data file, sorted by output frequency, so first I split it into equal parts, going from least to most common outputs:

[13:17] ~/progs/skython> split -da 2 -n 25 1bil.txt
[13:18] ~/progs/skython> ls -l
-rw-r--r-- 1 tc tc  88516009 Aug  1 09:46 1bil.txt
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x00
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x01
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x02
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x03
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x04
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x05
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x06
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x07
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x08
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x09
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x10
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x11
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x12
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x13
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x14
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x15
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x16
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x17
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x18
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x19
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x20
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x21
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x22
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x23
-rw-r--r-- 1 tc tc   3540649 Aug  1 13:18 x24

And then I zipped 'em:

[13:18] ~/progs/skython> gzip -9 x??; ll
-rw-r--r-- 1 tc tc  88516009 Aug  1 09:46 1bil.txt
-rw-r--r-- 1 tc tc    851286 Aug  1 13:18 x00.gz
-rw-r--r-- 1 tc tc    850941 Aug  1 13:18 x01.gz
-rw-r--r-- 1 tc tc    851373 Aug  1 13:18 x02.gz
-rw-r--r-- 1 tc tc    852635 Aug  1 13:18 x03.gz
-rw-r--r-- 1 tc tc    849508 Aug  1 13:18 x04.gz
-rw-r--r-- 1 tc tc    857673 Aug  1 13:18 x05.gz
-rw-r--r-- 1 tc tc    853534 Aug  1 13:18 x06.gz
-rw-r--r-- 1 tc tc    852645 Aug  1 13:18 x07.gz
-rw-r--r-- 1 tc tc    854754 Aug  1 13:18 x08.gz
-rw-r--r-- 1 tc tc    854989 Aug  1 13:18 x09.gz
-rw-r--r-- 1 tc tc    855178 Aug  1 13:18 x10.gz
-rw-r--r-- 1 tc tc    848454 Aug  1 13:18 x11.gz
-rw-r--r-- 1 tc tc    846578 Aug  1 13:18 x12.gz
-rw-r--r-- 1 tc tc    844529 Aug  1 13:18 x13.gz
-rw-r--r-- 1 tc tc    849767 Aug  1 13:18 x14.gz
-rw-r--r-- 1 tc tc    849013 Aug  1 13:18 x15.gz
-rw-r--r-- 1 tc tc    846908 Aug  1 13:18 x16.gz
-rw-r--r-- 1 tc tc    852737 Aug  1 13:18 x17.gz
-rw-r--r-- 1 tc tc    847214 Aug  1 13:18 x18.gz
-rw-r--r-- 1 tc tc    821187 Aug  1 13:18 x19.gz
-rw-r--r-- 1 tc tc    760842 Aug  1 13:18 x20.gz
-rw-r--r-- 1 tc tc    752602 Aug  1 13:18 x21.gz
-rw-r--r-- 1 tc tc    727900 Aug  1 13:18 x22.gz
-rw-r--r-- 1 tc tc    692860 Aug  1 13:18 x23.gz
-rw-r--r-- 1 tc tc    605098 Aug  1 13:18 x24.gz

...and as expected, they all got smaller, because these were text dumps which are always gonna be highly compressible. But that aside, it's evident that gzip had a much easier time with the more common stuff. Again, no surprise, but cool. Note the implicit curve: the compressed sizes stay effectively flat through the first 75 or 80% of the chunks, then drop off sharply for the most common ones.

One last thing to check

I'm beginning to suspect that primes are inherently complex, not just from number theory but in a Kolmogorov-y, robust sense. More Linux tricks:

[10:24] ~/progs/skython> for i in {1..30}; do cut -s --delimiter=, -f $i 1bil.txt >> 1bil_vals.txt; done
[10:27] ~/progs/skython> sed 's/\(\s\|\-\)//g' 1bil_vals.txt | sort | uniq -c | sort -gr > 1bil_sorted_abs_final.txt

which gives a sorted text file of how many times every individual term came up overall, ignoring patterns:

[10:30] ~/progs/skython> head -20 1bil_sorted_abs_final.txt 
16245701  
11338986 1  
8543839 0  
5732203 2  
3037257 3  
1661122 4  
 980248 5
 838860 6
 582802 7
 457519 8
 338796 9
 289842 10
 226633 12
 208586 11
 161242 13
 154787 14
 149209 15
 140625 16
  90908 18
  90094 17

It's my suspicion that primes will be disproportionately low on this list—as they should be, given how computers work—and a first glance confirmed that somewhat, but then I stopped to write all this.
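For the record, checking that suspicion against the tally only takes a few lines, assuming the two-column count/value format shown above (and sympy, or any primality test, for flagging the primes):

from sympy import isprime   # any primality test will do

counts = {}
with open("1bil_sorted_abs_final.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counts[int(parts[1])] = int(parts[0])

# How many primes under 600 sit below the average of their two neighbors?
primes = [p for p in range(5, 600) if isprime(p)
          and all(k in counts for k in (p - 1, p, p + 1))]
dips = sum(counts[p] < (counts[p - 1] + counts[p + 1]) / 2 for p in primes)
print(dips, "of", len(primes), "primes fall below their neighborhood average")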

Okay. That's it. Here's the big ugly data file if you want it. Maybe I'll write up my thoughts on an approach to %P \overset{?}{=} NP% via transfinite induction tomorrow. Peace out.

The Subset Sum Problem

I have spent time working on almost every impossible problem there is.

This is a lie, but it is rooted in a kernel of truth. I remember my very first impossible problem. When I was eight or nine, my dad gave me a word problem about firemen and houses or some bullshit, but what it boiled down to was:

Draw three points wherever you want on a paper. Label them %A, B, C%.

Draw three other points, also wherever you want. Label them %1, 2, 3%.

Try to draw lines so that %A, B,% and %C% each have three lines going out to %1, 2,% and %3%—and have none of them cross.

sad diagram

And I spent days drawing nonsense like that (thanks Google Images), even after he tried to convey with a straightforward topological proof why it was impossible. But I wasn't convinced until I tried and tried and tried. It was especially painful because as it turns out, you can get all but one last little line to connect.

I'm older now, and have some modest understanding of math, so one might think I no longer fall for traps like this. Not so. I still have to try. But at least now I know on a rational level that it's a lost cause.

The most recent example is a couple of nights ago when I was reading about the subset sum problem. It's about as easy to explain as the problem above; in layman's terms, it's "Figure out how to quickly look through a bunch of positive and negative numbers and see if you can pick some out that add up to 0."

In a slightly more rigorous sense, it's a problem known to be NP-complete (you would think that might tip me off). Given an arbitrary set of integers, determine whether or not there is any subset thereof that sums to some given target (classically %0%). The trick, of course, is that you have to do it in polynomial time; the most efficient algorithm offered on Wikipedia requires %O(2^(N//2))%, where %N% is the size of the set.

After thinking on it, my essentially random approach was to mod everything. The logic went like this.

You have %N% numbers. A naive approach would be to check the sum of every possible combination of your numbers, which is a combinatoric nightmare on the order of %O(2^N * N)%. No good.

Well, if we're shooting for %0%, then we can divide our set into positives and negatives. Surely this is an improvement; now instead of dealing with the power set of %N%—that is, the number of subsets in a set of size %N%, which is %2^N%—we've made some progress, since we won't waste time trying sets that are all positive numbers, for example. But it turns out this by itself buys us very little.

Then I go, hey, what if you take every integer modulo %2%? Or, in other words, go through and mark down which numbers are even and which are odd.

Why? Because we know that our potential sum is going to be even (i.e. %0%). We already knew that we'd have to take one or more numbers from each of the positives and the negatives. Put those truths together, and it means we know that any group that might work will have a certain property: if the sum of our positive numbers is odd, then the sum of our negatives will have to be odd too, or else the sum of both together would be odd, which would mean not zero. Likewise, if our positives sum is even, the negs have to be even too. Or,

% \sum P \equiv x \ ("mod"\ 2) \ \leftrightarrow \ |\sum Q| \equiv x \ ("mod"\ 2), % where %P% and %Q% are the positive and negative parts of any candidate subset.

This is surely good, since on average, half of all possible combinations won't meet that constraint and thus can be discarded out of hand.

But wait, it gets better. Why not take everything modulo %3%? The same principle applies, which is that our overall sum must equal %0 \ ("mod"\ 3)%. So if the sum of the positives %("mod"\ 3)% is %1%, we know we only have to check those groups of negatives which sum to %2 \ ("mod"\ 3)%. And now it seems like we're getting a threefold rather than twofold improvement on our deal.

But wait, it gets better. Why not throw everything so far into a big pot and see what happens? It's not too hard to show that if you take %mod 2%, note which possible combinations remain, and then do the same thing with %mod 3%, you'll generally multiplicatively reduce your potential viable combinations. Then continue this with a handful of ascending integers (maybe primes?) and with almost no work at all, you'll all but perfectly sieve out any invalid combinations.
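The filtering idea itself is easy enough to sketch—brute-force subset enumeration underneath, so this only demonstrates the bookkeeping, not a speed-up:

from itertools import combinations

def subset_sum_zero(nums, moduli=(2, 3, 5)):
    # Split into positives and negatives, bucket each side's subset sums by
    # their residues modulo a few small numbers, and only compare buckets
    # whose residues could cancel.
    if 0 in nums:
        return True
    pos = [x for x in nums if x > 0]
    neg = [x for x in nums if x < 0]

    def buckets(side):
        table = {}
        for r in range(1, len(side) + 1):
            for combo in combinations(side, r):
                s = sum(combo)
                table.setdefault(tuple(s % m for m in moduli), []).append(s)
        return table

    pos_b, neg_b = buckets(pos), buckets(neg)
    for key, sums in pos_b.items():
        mirror = tuple((-r) % m for r, m in zip(key, moduli))
        for s in sums:
            if -s in neg_b.get(mirror, []):
                return True
    return False

print(subset_sum_zero([3, 9, -4, -8, 7, -11]))   # True: 3 + 9 - 4 - 8 = 0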

And so far as I could tell, this is all true. The insidious problem, and this is where I stopped working on this, is that the long cold years have taught me that things are never easy, particularly extremely difficult mathematical things. I have absolutely no doubt that if I sat down and did the math carefully, I would find that I was using up at least as many operations in logistical overhead as I was saving in the sieving process.

That's how these deceptive problems work. You feel like you're making progress, but all you've really done is paint a horse like a zebra; the nature of the thing itself remains unchanged.

Primes are especially brutal about that. Watch out for primes.