Take a number and think of all the possible ways you can encode them. Make up some new rules because you feel like it. Oh wait, maybe you should throw in some custom encoding because it feels right. That pretty much sums for the 50 ways you can encode numbers in Packed Encoding Rules.
This essentially means on the receiving side we have no idea how big the integer’s going to be. Could be a 4-bit integer or a 8192-bit integer. In addition the integer could be negative, which gives us a few rules of encoding. Typically the ASN spec for unconstrained integers will look like this:
Given any number we first figure out the minimum number of octets that we can encode them into. For example 0x7f encodes into a single byte, while 0x7fab encodes into 2-bytes. What if the number is negative? Then we encode the number using it’s 2′s complement version. In Ruby and/or C, that would be:
1 + ~-val
One gotcha. If the positive number that we are encoding has the high-bit set and it falls on the octet boundary, then we precede this with 0×00. For example 255 encodes into \x00\xff while -1 encodes into \xff. See the difference?
And the whole thing is always octet-aligned in the ALIGNED variant of PER. Of course since the recipient doesn’t know how many octets there are, the number of octets encoded in the same way precedes the actual number.
What if you knew that there was a minimum bound for the number? The minimum bound doesn’t necessarily have to be positive. In this case, what we end up encoding is the offset from the minimum bound. This means that the offset will always be a positive number and we can encode it like the Unconstrained version, except we don’t have to have the prefix of 0×00 if the number has a high-bit set and it falls on an octet-boundary. So in other words, 255, as a semi-constrained number, encodes is 0xff, not \x00\xff as above. The ASN specification might look something like:
On the receiver side, you better know what the minimum bound is since you need to add this to the decoded offset to get back the final number.
This is also octet-aligned in the ALIGNED version of PER and is preceded by a length encoded in the same way.
Let’s go one step further. Suppose you knew both the min and the max bounds of the number. In this case, the perverse encoding tries to pack this into as little bytes as possible. We are going to introduce the range and offset before we start encoding:
range = max – min + 1
offset = value – min
Range == 1
In this case max == min and because of this the number encodes into zero bits on the wire. The recipient essentially knows that there’s something there. This is very similar to encoding NULL.
Range <= 255
In this case we compute the minimum possible bits that we can encode the range as and then encode the offset using this number of bits. And of course, we don’t align this to anything. Just a bunch of bits being output with no respect for the byte boundary. The minimum number of bits is pretty simple to compute.
nbits = 0
while val > 0
nbits += 1
val >>= 1
So consider the following ASN with foo == 2.
In this case, range is 15 and number of bits is 4 and offset of 2 is 1 all of which encodes into \x10 with the bottom 4-bits to be possibly filled out by the next PER object.
Range == 255
*whew* In this case we take the 1-byte offset and pack it as a simple unsigned char…
[ offset ].pack("C")
…and align it to the octet boundary.
Range <= 65536
Still straight forward. We take the 2-byte offset and pack it as a simple unsigned short in network-byte order…
[ offset ].pack("n')
…and align it to the octet boundary.
Range > 65535
The offset gets encoded as a Semi Constrained number with a minimum bound of zero, aligned to the octet-boundary. However the length preceding has some fanciness to it. We use the range to figure out the maximum possible bits that we will ever need and then encode the offset into that many number of bits.
Let’s try that again. Assuming that the range is 65536, this requires a maximum of 17-bits. This means no matter what the offset is, we can stuff that into 17-bits and that’s exactly what we do.
This is a whacky, forward-looking, backwards-compatible encoding. Typically this shows up in the ASN specification as an ellipsis. For example a v1 specification for a protocol might say this:
foo INTEGER(1..16 ...)
with the intent that somewhere someone might need to send foo outside of the 1 to 16 range. In this case v1 implementations add an extra bit before the integer encoding set to zero, indicating that the value is in root. Let’s say time goes by and someone decides in the v2 specification to do this:
foo INTEGER(1..16 ... 32)
What do we do then? Well, remember that bit we added? We set that to one to indicate the value has been extended and is outside the initially planned range. In this case we have a single bit indicate extensibility followed by the encoding of the value outside the range as an Unconstrained number. This also allows a v2 implementation to send in v2-specific values down to v1 without breaking the parsing.
We think we should be done by now, yeah? Nope. Not too fast. Perversity does come in abundance.
That’s pretty much for numbers in general. What about counts? Like in arrays or strings? Well there’s yet another encoding for that. Let’s say that we don’t necessarily know up front how many elements there are in an array or the number of characters in a string. This is encoded in a few different ways (again).
count <= 127
In this case we encode the count as a single byte. Plain and simple. In other words:
[ count ] .pack("C")
count < 16384
Whoa there. 2 byte count? In this case the high-bit is set and the whole count goes down as a bit-endian 2-byte integer. In reality the first byte of the 2-byte encoding has the top 2-bits set to 10 which gives us a total of 6 + 8 bits to encode this value. That would be:
[ 0x8000 | count ].pack("n")
count >= 16384
This is also called as fragmentation. And yet again, we come up with rules and more rules. Fragments are expected to be chunked as 16K, 32K, 48K or 64K items. Remember we talked about the top two bits in the previous example? Well those 2-bits are now 11. The bottom 6 bits (right justified) can encode values 1-4 to indicate the chunk size. Why 4? Have no idea. It’s just the way it goes. So if you have a total of 16385 items, this is how you would encode them:
[ 0xc0 ].pack("C") + [16K items] + [ 0x01 ].pack("C") + [1 item]
So what’s happening is that we take the total chunks and split it up to be maximum of 16K, 32K, 48K and 64K and then encode the remaining elements as the next sized chunk. *sigh*
There’s just one last way we can encode numbers. Finally!
Abnormally these are pretty large, but normally…Seriously, it’s actually called normally-small in the specification. This is typically used in cases where we expect the number to kinda-sorta be small. I know I’m repeating myself here, but don’t know how else to explain it best.
The encoding itself is not octet-aligned with the hope that it will get squished tight with the previous PER object’s encoding.
count <= 63
That’s right folks. If the count is <= 63 we use a 7-bit encoding with the high bit set to zero and bottom 6 bits encoding the count.
count > 63
In this case there’s a single-bit set to 1 indicating an abnormally large [sic] count followed by an octet-aligned encoding using the unbounded count encoding described above.
Okay, there you have it. All the perversity of encoding numbers. This is not one of those projects you can bring home and boast about. Given what we know, a lot of the open-source parsers that I’ve seen so far have TODO or WTF (yeah, WTF) comments indicating ambiguity about how things need to be parsed. Fuzzing them? Priceless…