#### algorithms

This is a quick story about today's thing that I discovered that is already in Wiki. I must be up in the triple digits at this point. That said, this is one of the more obvious ones.

I was reading a thread about which sorting algorithm could be considered "best," and someone mentioned a couple of algorithms which allegedly run in O(n log log n) time. This came as a shock, since I'd thought n log n was the brick wall for an honest sort, and I wondered if these other sorts were for real, whether that would imply the existence of a linear time sort.

I tried to imagine what the lower bound would be, and figured there must be a minimum number of comparisons required for some number of elements. Didn't take long to get from there to realizing that n integers can be arranged in n! different permutations, which I reasoned meant that one must gather enough information (read: take comparisons) from the data to uniquely identify which of n! different arrangements the sort is in.

That, in turn, screams out for a log, specifically log_2(n!). If the sort permutation is truly random, then on average, we should expect to be able to identify it from log_2(n!) bits (read: comparisons, again.) To be a little more precise, I guess it'd be more like $\frac{\lceil{n \log_2 (n!)}\rceil}{n} .$

I cheated a little here and plugged lim_{n->\infty} [log_2(n!)] into Wolfram Alpha, and it was clear the dominating factor is, surprise surprise, n log n. As for those mystery n log log n algorithms, they were tough to track down too, and there seemed to be a lack of final consensus on the issue. Due to the math described herein if nothing else, they must operate with some limitation or another on domain or input, assuming they do work.

Later, seeing the bottom of the Wikipedia page on sorting algorithms, I saw all of this done similarly, with the weird surprise that once you get to n=12, the maximum number of comparisons required suddenly goes up by one inexplicably, requiring 30 comparisons where we predict 29. Sadly, the footnote links explaining those aberrations were in journals behind paywalls, but the gist seemed to be that it was a bit of an open (or at least highly non-trivial) question.

### The birth of Skynet

A couple of years back, I wrote a little program named Skynet. It uses a configurable set of assembly-like instructions to spawn tiny programs, cobbled together completely at random. The original idea was to be able to feed it a set of input and output integer lists representing some function you were trying to find, and have it then blindly fumble its way through millions or billions of those micro-programs until it hit on one that satisfied your conditions.

For example, if you wanted some code that would exponentiate, you might feed it a configuration file signifying

``````Input 1: 2 3
Output 1: 8
Input 2: 4 3
Output 2: 256
``````

and it would run forever until it stumbled upon a solution, or you killed it.

And it worked, and it was fun to screw around with, if you like that sort of thing. Feel free to grab it from the link above, but it was a quicky that I wrote for myself, so it might be rough to figure out. It's Turing-complete, but it doesn't do anything very clever—there's no fancy machine learning algorithms here, just straight random goodness—but you can specify a few parameters to try to guide its efforts, such as minimum and maximum program length to try, or how many instructions to execute before giving up on a microprogram as a runaway and moving on. I think it also tries to spot and discard overtly obvious dead ends and NOOPs.

##### Limitations

So yes, it works as intended, but the problem is that the realm of all possible computable functions is one of those seriously immense virtual universes that tend to pop up when dealing with combinatorics or math in general. You might have 16 instructions enabled at one time, but instructions at the assembly level don't accomplish a hell of a lot, and when you consider a branching factor that big, all of a sudden you're looking at 16^10 possible microprograms for even a modest 10-instruction snippet of code.

That's just over a trillion, many of which are likely to be non-halting and thus run several hundred instructions before bailing. The upshot is that to exhaustively try all of them, you're probably talking days or even months, depending on how optimized it all is. After quickly discovering the futility of systematically enumerating them, I said "fuck it" and resdesigned things with its current random approach, perfectly happy to try out 50-instruction programs if the pRNG requests it. Of course, if there were actually some program out in the ether that needed to be 50 instructions long, the odds of running into it through chance are almost nothing—you could expect to hit it around the same time the last protons and neutrons are decaying.

But I digress. A lot of simpler algorithms can be squeezed into a tractable space.

###### Things it's figured out:
• Fibonacci sequence
• factorials
• doubling, cubing
• for loops
• adding pairs of input numbers on the stack
• finding the mean of input numbers
• determining divisibility of input numbers by 7
• reversing order of stack
###### Not so much:
• primality detection
• prime generation
• misc. other NP problems
• bootstrapping into sentience
##### Details

Some of those successes were much easier than others. Fibonacci, for example, seems to be about the next easiest thing for it to do besides count. The other day I got GitLab CI going which I'm abusing mercilessly by having it spawn Skynet jobs instead of actually testing anything. But manually, this is how it rolls:

``````[12:00] ~/progs/Skynet> time ./4bit --file fib.skynet -r
fscanfs return 6111111111
low=4 up=8 exact=0 stopAfter=1 maxSteps=50

Input 1:
Output 1: 1 1 2 3 5 8 13
Randomizing instructions...
[ INS  SWAB  MULT  PUSH  POP  INC  ADD  JNE  NAND  DEC  PEEK  SKIP  IFF  NOT  LOCO_1  MOD ]

Satisficing program (length 5):  INC  PUSH  ADD  SWAB  JNE  -=- Tracing operation -=-

[init   a=0     b=0     c=0     pc=0    sp=1]           ( __ )
INC             [1      a=1     b=0     c=0     pc=1    sp=1]           ( __ )
PUSH            [2      a=1     b=0     c=0     pc=2    sp=2]           ( __ 1 )
ADD             [3      a=1     b=0     c=0     pc=3    sp=2]           ( __ 1 )
SWAB            [4      a=0     b=1     c=0     pc=4    sp=2]           ( __ 1 )
JNE             [5      a=0     b=1     c=0     pc=0    sp=2]           ( __ 1 )
INC             [6      a=1     b=1     c=0     pc=1    sp=2]           ( __ 1 )
PUSH            [7      a=1     b=1     c=0     pc=2    sp=3]           ( __ 1 1 )
ADD             [8      a=2     b=1     c=0     pc=3    sp=3]           ( __ 1 1 )
SWAB            [9      a=1     b=2     c=0     pc=4    sp=3]           ( __ 1 1 )
JNE             [10     a=1     b=2     c=0     pc=0    sp=3]           ( __ 1 1 )
INC             [11     a=2     b=2     c=0     pc=1    sp=3]           ( __ 1 1 )
PUSH            [12     a=2     b=2     c=0     pc=2    sp=4]           ( __ 1 1 2 )
ADD             [13     a=4     b=2     c=0     pc=3    sp=4]           ( __ 1 1 2 )
SWAB            [14     a=2     b=4     c=0     pc=4    sp=4]           ( __ 1 1 2 )
JNE             [15     a=2     b=4     c=0     pc=0    sp=4]           ( __ 1 1 2 )
INC             [16     a=3     b=4     c=0     pc=1    sp=4]           ( __ 1 1 2 )
PUSH            [17     a=3     b=4     c=0     pc=2    sp=5]           ( __ 1 1 2 3 )
ADD             [18     a=7     b=4     c=0     pc=3    sp=5]           ( __ 1 1 2 3 )
SWAB            [19     a=4     b=7     c=0     pc=4    sp=5]           ( __ 1 1 2 3 )
JNE             [20     a=4     b=7     c=0     pc=0    sp=5]           ( __ 1 1 2 3 )
INC             [21     a=5     b=7     c=0     pc=1    sp=5]           ( __ 1 1 2 3 )
PUSH            [22     a=5     b=7     c=0     pc=2    sp=6]           ( __ 1 1 2 3 5 )
ADD             [23     a=12    b=7     c=0     pc=3    sp=6]           ( __ 1 1 2 3 5 )
SWAB            [24     a=7     b=12    c=0     pc=4    sp=6]           ( __ 1 1 2 3 5 )
JNE             [25     a=7     b=12    c=0     pc=0    sp=6]           ( __ 1 1 2 3 5 )
INC             [26     a=8     b=12    c=0     pc=1    sp=6]           ( __ 1 1 2 3 5 )
PUSH            [27     a=8     b=12    c=0     pc=2    sp=7]           ( __ 1 1 2 3 5 8 )
ADD             [28     a=20    b=12    c=0     pc=3    sp=7]           ( __ 1 1 2 3 5 8 )
SWAB            [29     a=12    b=20    c=0     pc=4    sp=7]           ( __ 1 1 2 3 5 8 )
JNE             [30     a=12    b=20    c=0     pc=0    sp=7]           ( __ 1 1 2 3 5 8 )
INC             [31     a=13    b=20    c=0     pc=1    sp=7]           ( __ 1 1 2 3 5 8 )
PUSH            [32     a=13    b=20    c=0     pc=2    sp=8]           ( __ 1 1 2 3 5 8 13 )
ADD             [33     a=33    b=20    c=0     pc=3    sp=8]           ( __ 1 1 2 3 5 8 13 )
SWAB            [34     a=20    b=33    c=0     pc=4    sp=8]           ( __ 1 1 2 3 5 8 13 )
JNE             [35     a=20    b=33    c=0     pc=0    sp=8]           ( __ 1 1 2 3 5 8 13 )
INC             [36     a=21    b=33    c=0     pc=1    sp=8]           ( __ 1 1 2 3 5 8 13 )
PUSH            [37     a=21    b=33    c=0     pc=2    sp=9]           ( __ 1 1 2 3 5 8 13 21 )
ADD             [38     a=54    b=33    c=0     pc=3    sp=9]           ( __ 1 1 2 3 5 8 13 21 )
SWAB            [39     a=33    b=54    c=0     pc=4    sp=9]           ( __ 1 1 2 3 5 8 13 21 )
JNE             [40     a=33    b=54    c=0     pc=0    sp=9]           ( __ 1 1 2 3 5 8 13 21 )
INC             [41     a=34    b=54    c=0     pc=1    sp=9]           ( __ 1 1 2 3 5 8 13 21 )
PUSH            [42     a=34    b=54    c=0     pc=2    sp=10]          ( __ 1 1 2 3 5 8 13 21 34 )

...

Total=357252            Timeouts=206816         Redundant=6694

real    0m0.130s
user    0m0.127s
sys     0m0.002s
[12:00] ~/progs/Skynet>
``````

As may or may not be clear, it spits out its working program at the top, and then walks you through a trace execution.

One of my discoveries was that program length matters a lot less than I would have guessed. You can set your program length bounds to look for stuff 30-50 instructions long, and it'll still pop out a Fibonacci program after a second. This is presumably because when you throw shit randomly together, you end up with an abundance of NOOPs (at least with my instruction set.)

To put that in slightly more concrete perspective, there are one or two 5-length versions of Fibonacci that exist, out of something like a million possible programs; put another way, you could say ~0.0002 of 5-lengthers are Fibonacci programs. But as it turns out, if you go out to 20 instructions, that percentage doesn't drop by much—in fact, for some things it climbs. The point I'm trying to make, which surprised me, is that regardless of the program length you're searching for, you'll hit anything provided it's simple enough to fit in that length or simpler, and these types of algorithms seem to occupy a pretty constant market share at any depth.

### Present day

(Well, yesterday. )

I discovered PyPy and Cython, things which I should have known about a long time ago but didn't. I wrote Skynet in C because it's second to none for this sort of thing, probably a 30x speed increase over the same thing in vanilla Python. But Cython lets you statically type variables, embed C code, and other wily stuff, while PyPy appears to be a magical JIT compiler that makes your pure Python run 5-10x faster (if it's this sort of project).

Anyway, I wanted to try them out, so last night I ported Skynet back to Python, yielding skython, a name which I'm half proud of but made me die a little inside to use. After I got the basics working, I started to adapt it for Cython, but realized I would effectively be porting it awkwardly back to C-ish if I continued, so I've stuck with Python and am using PyPy, which seems to run it at maybe 25 of the speed of the C version. In other words, close enough to be an acceptable trade-off. It's not up to snuff yet, but it is working, and it's certainly faster to add features than it was with C.

To wit: I was thinking about the program-market-share concept I mentioned, and it led me to add a method which creates programs randomly without looking for any input or output, but rather keeping track of everything it churns out on its own. And this is where we actually rejoin the title and original thematic intent of this post, because the idea there was to explore the distribution of algorithmic information in computation space, by which I mean seeing just how often it opts for some sequences over others when selecting out of the vast set of all computable functions.

#### And?

And it was interesting. I gave it some pretty neutral parameters and had it run a billion microprograms. There's lots I could have done (and perhaps will do) to clean up and solidify the results, but there was enough to get a sense of things.

###### A few results
• Half the programs do nothing. Give or take.
• Most of the rest, percentage-wise, print out endless strings of 0s or 1s.
• The last few percent, though, starts to get interesting.
• When you get down to 0.5 or so, you start to get 0101010... and 123456... sequences.
• They're followed close behind by the 22222s and the 110110s and the 012012s... etc.

The first eye-catching result is down at 0.00272 computation-space-share (but this is still in the top percent of frequency):

``````0,1,3,7,15,31,63,127,255,511,1023,2047,4095,8191,16383,32767,65535,131071,262143,524287,1048575,2097151,4194303,8388607,16777215
``````

The pattern's clear enough. A little ways on, down at 0.00045, I spot a 1, 4, 7, 10, ... along with the first Fibonacci. Down further, I recognize a pattern that came up in some math problem I worked on months ago:

``````0,1,2,5,10,21,42,85,170,341,682,1365,2730
``````

But as I went deeper, I found sequences that clearly had structure but were foreign to me:

``````1,3,8,20,49,119,288,696,1681,4059,9800,23660,57121,137903
``````

I started looking them up on OEIS, and found that most of the interesting-looking patterns were indeed in there, and at least frickin' half of them were related in some way to the Fibonacci sequence. Not that this is especially surprising or perplexing in hindsight, I'd just never given much thought to how fundamental the principle is upon which it's built. No wonder that it pops up in nature everywhere; God must have noticed the same thing. So yeah, there are some interesting patterns when you get deep, but I didn't find anything earth-shattering.

#### Zip 'em up!

When I got tired of that, I moved on to a test of algorithmic information theory. The expectation is that the more frequently a program appears, the less complexity it is apt to have. A quick'n'dirty test for complexity is compressibility—the more complex a thing, the less compressible, to the point where any data sufficiently close to random (which is 'perfectly' complex, in this sense) will only get bigger.

I had my big honkin' 88mb data file, sorted by output frequency, so first I split it into equal parts, going from least to most common outputs:

``````[13:17] ~/progs/skython> split -da 2 -n 25 1bil.txt
[13:18] ~/progs/skython> ls -l
-rw-r--r-- 1 tc tc  88516009 Aug  1 09:46 1bil.txt
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x00
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x01
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x02
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x03
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x04
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x05
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x06
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x07
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x08
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x09
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x10
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x11
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x12
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x13
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x14
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x15
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x16
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x17
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x18
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x19
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x20
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x21
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x22
-rw-r--r-- 1 tc tc   3540640 Aug  1 13:18 x23
-rw-r--r-- 1 tc tc   3540649 Aug  1 13:18 x24
``````

And then I zipped 'em:

``````[13:18] ~/progs/skython> gzip -9 x??; ll
-rw-r--r-- 1 tc tc  88516009 Aug  1 09:46 1bil.txt
-rw-r--r-- 1 tc tc    851286 Aug  1 13:18 x00.gz
-rw-r--r-- 1 tc tc    850941 Aug  1 13:18 x01.gz
-rw-r--r-- 1 tc tc    851373 Aug  1 13:18 x02.gz
-rw-r--r-- 1 tc tc    852635 Aug  1 13:18 x03.gz
-rw-r--r-- 1 tc tc    849508 Aug  1 13:18 x04.gz
-rw-r--r-- 1 tc tc    857673 Aug  1 13:18 x05.gz
-rw-r--r-- 1 tc tc    853534 Aug  1 13:18 x06.gz
-rw-r--r-- 1 tc tc    852645 Aug  1 13:18 x07.gz
-rw-r--r-- 1 tc tc    854754 Aug  1 13:18 x08.gz
-rw-r--r-- 1 tc tc    854989 Aug  1 13:18 x09.gz
-rw-r--r-- 1 tc tc    855178 Aug  1 13:18 x10.gz
-rw-r--r-- 1 tc tc    848454 Aug  1 13:18 x11.gz
-rw-r--r-- 1 tc tc    846578 Aug  1 13:18 x12.gz
-rw-r--r-- 1 tc tc    844529 Aug  1 13:18 x13.gz
-rw-r--r-- 1 tc tc    849767 Aug  1 13:18 x14.gz
-rw-r--r-- 1 tc tc    849013 Aug  1 13:18 x15.gz
-rw-r--r-- 1 tc tc    846908 Aug  1 13:18 x16.gz
-rw-r--r-- 1 tc tc    852737 Aug  1 13:18 x17.gz
-rw-r--r-- 1 tc tc    847214 Aug  1 13:18 x18.gz
-rw-r--r-- 1 tc tc    821187 Aug  1 13:18 x19.gz
-rw-r--r-- 1 tc tc    760842 Aug  1 13:18 x20.gz
-rw-r--r-- 1 tc tc    752602 Aug  1 13:18 x21.gz
-rw-r--r-- 1 tc tc    727900 Aug  1 13:18 x22.gz
-rw-r--r-- 1 tc tc    692860 Aug  1 13:18 x23.gz
-rw-r--r-- 1 tc tc    605098 Aug  1 13:18 x24.gz
``````

...and as expected, they all got smaller, because these were text dumps which are always gonna be highly compressible. But that aside, it's evident that gzip had a much easier time with the more common stuff. Again, no surprise, but cool. Note the implicit curve that drops off sharply until you get to the top 75 or 80 of them, where it levels off and stays effectively flat.

#### One last thing to check

I'm beginning to suspect that primes are inherently complex, not just from number theory but in a Kolmogorov-y, robust sense. More Linux tricks:

``````[10:24] ~/progs/skython> for i in {1..30}; do cut -s --delimiter=, -f \$i 1bil.txt >> 1bil_vals.txt; done
[10:27] ~/progs/skython> sed 's/$\s\|\-$//g' 1bil_vals.txt | uniq -c | sort -gr > 1bil_sorted_abs_final.txt
``````

which gives a sorted text file of how many times every individual term came up overall, ignoring patterns:

``````[10:30] ~/progs/skython> head -20 1bil_sorted_abs_final.txt
16245701
11338986 1
8543839 0
5732203 2
3037257 3
1661122 4
980248 5
838860 6
582802 7
457519 8
338796 9
289842 10
226633 12
208586 11
161242 13
154787 14
149209 15
140625 16
90908 18
90094 17
``````

It's my suspicion that primes will be disproportionately low on this list—as they should be, given how computers work—and a first glance confirmed that somewhat, but then I stopped to write all this.

Okay. That's it. Here's the big ugly data file if you want it. Maybe I'll write up my thoughts on an approach to P=^{?}NP via transfinite induction tomorrow. Peace out.