-x^2 logic

Three things to understand that really make this algorithm faster:
If A and B are binary digits (0 or 1), then:
    ~a=1-a, where ~ is the "not" function and 'a' is a binary digit.
    ~x=-x-1, where x is a binary number, modulo some base power of 2 (in assembly, this is commonly 2^8, 2^16, 2^32, or 2^64)
    a(1-a) is 0, always [it is equivalent to a&~a].
    a(1-b)+b(1-a) = a+b-2ab is the same as a^b, where ^ is the xor function.
So then suppose you have x is a number whose bits are abcdefgh... where 'a' is the least-significant bit.
In base 10, a product might look at the 1s place, 10s place, 100s place, etc for the digits. In binary, we looks at 1s, 2s, 4s, etc. for the bits.
So then the 1s place of ~x is (1-a), the second digit is (1-b), etc.
Then the product of x*~x is:

1s      a(1-a)
2s      a(1-b)+b(1-a)
4s      a(1-c)+b(1-b)+c(1-a)
8s      a(1-d)+b(1-c)+c(1-b)+d(1-a)
16s     a(1-e)+b(1-d)+c(1-c)+d(1-b)+e(1-a)
32s     a(1-f)+b(1-e)+c(1-d)+d(1-c)+e(1-b)+f(1-a)
64s     a(1-g)+b(1-f)+c(1-e)+d(1-d)+e(1-c)+f(1-b)+g(1-a)
128s    a(1-h)+b(1-g)+c(1-f)+d(1-e)+e(1-d)+f(1-c)+g(1-b)+h(1-a)
...

So applying the fact that a(1-a)=0 (this works the same for b(1-b), etc.):
1s      0
2s      a(1-b)+b(1-a)
4s      a(1-c)+c(1-a)
8s      a(1-d)+b(1-c)+c(1-b)+d(1-a)
16s     a(1-e)+b(1-d)+d(1-b)+e(1-a)
32s     a(1-f)+b(1-e)+c(1-d)+d(1-c)+e(1-b)+f(1-a)
64s     a(1-g)+b(1-f)+c(1-e)+e(1-c)+f(1-b)+g(1-a)
128s    a(1-h)+b(1-g)+c(1-f)+d(1-e)+e(1-d)+f(1-c)+g(1-b)+h(1-a)
...

And now applying the trick following that:

1s      0
2s      a^b
4s      a^c
8s      a^d+b^c
16s     a^e+b^d
32s     a^f+b^e+c^d
64s     a^g+b^f+c^e
128s    a^h+b^g+c^f+d^e
...

Oh hey, now *that* looks pretty easy! In fact, we've already filled out what the first 8 bits look like!

Since this is x*~x, this is x*(-x-1)=-x^2-x. So if we add the original x and negate, we get the first 8 bits of x^2.


If x is an 8-bit int and you want the full 16-bit result of x*~x
1s      0
2s      a^b
4s      a^c
8s      a^d+b^c
16s     a^e+b^d
32s     a^f+b^e+c^d
64s     a^g+b^f+c^e
128s    a^h+b^g+c^f+d^e
256s       +b^h+c^g+d^f
512s           +c^h+d^g+e^f
1024s              +d^h+e^g
2048s                  +e^h+f^g
4096s                      +f^h
8192s                          +g^h
16384s  0
32768s  0

Now if this is being applied to a circuit, you can reduce the number of additions:
1s      0
2s      a^b
4s      a^c
8s      a^d+b^c
16s     a^e+b^d
32s     a^f+b^e+c^d
64s     a^g+b^f+c^e
128s    a^h+b^g+c^f+d^e
256s    b^h+c^g+d^f
512s    c^h+d^g+e^f
1024s   d^h+e^g
2048s   e^h+f^g
4096s   f^h
8192s   g^h
16384s  0
32768s  0

a^b+b
a+2b-2ab

And hey, why not add in the original x in there to make it compute -x^2 :
We'll use the identity that a+b-2ab=a^b
1s      0               +a  =a
2s      a^b             +b  =a+2b-2ab
4s      a^c             +c  =a+2c-2ac
8s      a^d+b^c         +d  =a+2d-2ad+b^c
16s     a^e+b^d         +e  =a+2e-2ae+b^d
32s     a^f+b^e+c^d     +f  =a+2f-2af+b^e+c^d
64s     a^g+b^f+c^e     +g  =a+2g-2ag+b^f+c^e
128s    a^h+b^g+c^f+d^e +h  =a+2h-2ah+b^g+c^f+d^e
256s    b^h+c^g+d^f
512s    c^h+d^g+e^f
1024s   d^h+e^g
2048s   e^h+f^g
4096s   f^h
8192s   g^h

Anything multiplied by 2 can get shifted to the next digit position:
1s      a
2s      a
4s      a             +b-ab
8s      a+b^c         +c-ac
16s     a+b^d         +d-ad
32s     a+b^e+c^d     +e-ae
64s     a+b^f+c^e     +f-af
128s    a+b^g+c^f+d^e +g-ag
256s    b^h+c^g+d^f   +h-ah
512s    c^h+d^g+e^f
1024s   d^h+e^g
2048s   e^h+f^g
4096s   f^h
8192s   g^h

Now we'll use a+b-ab=a|b (| is logical OR):
1s      a
2s      a
4s      a|b
8s      a|c+b^c
16s     a|d+b^d
32s     a|e+b^e+c^d
64s     a|f+c^e+b^f
128s    a|g+d^e+b^g+c^f
256s    c^g+d^f-a  +b
512s    d^g+f^e-bh +c
1024s       g^e-ch +d
2048s   f^g+e  -dh
4096s       f  -eh
8192s       g  -fh
16384s  h~g