The “Why This Should Matter to You” Story
A few scenarios today:
- You don’t understand why the answer to “how many bytes are in a kilobyte” isn’t 1000 (at least, not exactly).
- You see if((x & mask) > 7) and wonder why it compiles despite the author forgetting an ‘&’.
- You feel a little ashamed because in spite of the fact that your non-programmer friends say that you “think in ones and zeroes”, you don’t actually understand how your code becomes ones and zeroes.
Everything here is symptomatic of a lack of understanding of the way data is represented at the most fine grained level in computer science and in programming.
It’s possible to understand some programming concepts without understanding the mechanics of how they work internally. After all, you can understand that a list.sort() method sorts a list without understanding the sort algorithm being used.
So the general story is that your understanding of the data and data structures that go into programming breaks down at some level. You may understand that a string is a series of characters, but perhaps not that characters are actually integers or that integers are actually bytes or that bytes are actually series of bits or that bits are simply Booleans. And that breakdown in understanding somewhere in the chain leads to situations where you don’t fully understand what your peers are talking about.
In the last two posts in this series, I’ve gone through some basic but important points about Boolean algebra, which one might describe as the basic mathematics of truth values. In this system there are two fundamental values to which constants, variables and expressions might evaluate: true and false. Boolean algebra is, in a very real sense, the absolute fundamental building block for everything that we as programmers do, but it may not be entirely obvious as to how.
To get there, let’s leave behind the syntax of Boolean algebra and pick up the syntax of machine language. So forget about “true” and “false” and consider instead 1 and 0, respectively. 1 is true, 0 is false. And further let’s get rid of the conjunction, disjunction and negation operators and consider, in their stead, the more familiar & | ~ operators. (You’ll notice that these are single operators and not the double operators with which you are likely familiar and this is not a typo — there is a method to my madness.) All of the rules from the previous posts still apply so, for instance, the identity law of A | 0 = A applies.
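If you’d rather see the laws hold than take my word for it, they can be checked by brute force, since there are only two values to try. Here’s a minimal sketch in Python (my choice for illustration; the & and | operators behave the same way on 0 and 1 in C-family languages). The bit_not helper is a name I made up, needed because Python’s ~ operates on unbounded integers rather than on a single bit.

```python
# Check a few Boolean-algebra laws using 1 for true and 0 for false.
BITS = (0, 1)

def bit_not(a):
    # Hypothetical helper: Python's ~ works on unbounded integers
    # (~0 == -1), so mask the result down to a single bit.
    return ~a & 1

identity_or = all((a | 0) == a for a in BITS)      # A | 0 = A
identity_and = all((a & 1) == a for a in BITS)     # A & 1 = A
domination_and = all((a & 0) == 0 for a in BITS)   # A & 0 = 0
domination_or = all((a | 1) == 1 for a in BITS)    # A | 1 = 1
double_negation = all(bit_not(bit_not(a)) == a for a in BITS)

print(identity_or, identity_and, domination_and, domination_or, double_negation)
```

Every check comes back True, which is all "the rules still apply" really means.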
So far we’ve just swapped syntactical operators and constants. But let’s now consider an interesting property of basic arithmetic (not Boolean arithmetic) which is that written numbers are really a sequence of answers to a pre-arranged sequence of questions.
Okay, let me clarify with an example. Let’s pick a three digit number like 481. There’s nothing particularly special about this number, but let’s think of it in terms of a Q&A session. It goes like this “how many hundreds are in this number?” followed by the answer “four.” “How many tens are in this number?” “Eight.” “And how many ones?” “One.” This is really all there is to numerical representation of natural numbers (discounting the concept of zero, negative numbers, decimals, etc).
In all numbers that we’re comfortable with in normal society, the question always involves powers of 10: 1, 10, 100, etc. And so the answer is always a number 0 through 9. But what if we asked the question in a different way? What if we did it in terms of 7? Or 34? Or… 2? The answer, respectively, would always be a number 0 through 6, 0 through 33… and 0 through 1. That last one is most interesting to us here. The answer will always be between 0 and 1 or, put another way, the answer will always be 0 or 1. Interesting.
Let’s look at an example. The number 10110, using powers of 2, is asking the question, “how many 16’s, 8’s, 4’s, 2’s and 1’s are there” to which the answers come as “1, 0, 1, 1, 0” respectively. We have 1 sixteen, no eights, 1 four, 1 two and no ones for a total of 22. 10110 is the binary representation of 22. That’s probably not new to you regardless of your CS/math/etc background if you’re a programmer. But what might be new to you is just how fundamental this concept is because it allows translation between physical and “soft”-ware concepts. How so? Because you know what’s actually pretty hard to represent and measure physically? 0 through 9. But you know what isn’t? Booleans. On or off. There or not. Current flowing or nothing. Chad punched or not. Magnetic surface raised or flat. Ion charged or neutral. You get the idea.
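The Q&A idea can be sketched in a few lines of Python. The function names to_digits and from_digits are my own, invented for illustration; the only built-in I’m leaning on is int(text, base), which does the same conversion for strings.

```python
def to_digits(n, base):
    """Answer the 'how many of each power?' questions for a non-negative n.

    Returns the digits most significant first.
    """
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)  # peel off the lowest digit
        digits.append(remainder)
    return digits[::-1]

def from_digits(digits, base):
    """Reassemble a number from its digits (most significant first)."""
    total = 0
    for d in digits:
        total = total * base + d
    return total

print(to_digits(481, 10))               # [4, 8, 1]
print(to_digits(22, 2))                 # [1, 0, 1, 1, 0]
print(from_digits([1, 0, 1, 1, 0], 2))  # 22
print(int("10110", 2))                  # 22, using the built-in
```

The same function works for base 7 or base 34; only the range of possible answers changes.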
We can build computers where information is stored as a series of Booleans representing not “true or false”, but “yes or no” and we can still apply all of the same principles, rules, equivalences and concepts, albeit with slightly different semantics. We can take these “yes or no” values and assemble them into integers using our binary question schemes. We can then assemble those integers into ASCII characters, strings, basic programs, complex programs, multi-media files, and you know the rest.
The math behind all of programming is simple. The basis for the whole thing is simply gigantic sequences of Boolean values assembled into ever-larger pieces. The marriage of Boolean and counting unique to binary encoding means that you know how to assemble the Boolean sequences into meaningful information and also how to manipulate information and store it back, performing operations using identity, domination, negation, and other equivalences (we will get to some interesting optimizations using bit-shifting and other techniques in future posts).
The only missing piece here is how we get from these individual “1 or 0” values, called “bits” to larger values. Well, most of this is arbitrary. For instance, the definition of “byte” is simply “8 consecutive bits,” meaning that a byte can represent one of 256 possible integer values. These different values can also be assigned (again arbitrarily) to non-integral values such as characters. This is the basis for ASCII, a near-universal standard for representing text using integers (and thus binary/bits). This is the mechanism by which the building blocks are assembled.
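To make the assembly line concrete, here’s a rough Python sketch that goes from eight bits to an integer to a character, and then back the other way for a short string. The particular bit pattern and the string "Hi" are just examples I picked.

```python
# Eight bits, most significant first. This pattern is an arbitrary example.
bits = [0, 1, 0, 0, 0, 0, 0, 1]

# Assemble the bits into an integer using the powers-of-two scheme.
value = 0
for bit in bits:
    value = value * 2 + bit

print(2 ** 8)      # 256 possible values in one byte
print(value)       # 65
print(chr(value))  # A  (65 is the ASCII code for 'A')

# And in reverse: text -> integers -> bits.
for byte in "Hi".encode("ascii"):
    print(byte, format(byte, "08b"))
# 72 01001000
# 105 01101001
```

Nothing about 65 is inherently "A"; it’s an agreed-upon lookup, which is the arbitrary part.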
How it Helps You
In a (hyphenated) word: background-knowledge. For instance, with all of this in mind, it’s relatively easy to get to solutions for the mysteries at the beginning of the post. First up is the weirdness surrounding byte counting using KB, MB, GB, etc. Is a kilobyte a thousand bytes the way a kilogram is a thousand grams? Well, it could be, if you define it the way we’re used to seeing those terms defined. But that isn’t actually what winds up happening. What happens instead is the result of an interesting quirk of powers of two. As it turns out, 2 to the 10th is 1,024, which is roughly a ‘kilo’ of bytes. 2 to the 20th is 1,048,576, which is roughly a ‘mega’ (1 million) of bytes. Same kind of “fuzzy” counting happens with Giga, Tera, etc, all the way up the chain with 2 to the 30th, 40th, etc. So why reuse a prefix that means something else, something otherwise very precise? Personally, I think the adopters of this convention were being cute, and I don’t care for it at all (*dons flame retardant suit), but them’s the breaks. Apparently, I’m not the first to feel this way and others have tried and are trying to do something about it.
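To see just how fuzzy the counting gets, here’s a small Python sketch that puts each decimal prefix next to the power of two that borrows its name. The PREFIXES list and the prefix_table function are names I made up for the comparison.

```python
# Each prefix paired with the power of two that commonly borrows its name.
PREFIXES = [("kilo", 10), ("mega", 20), ("giga", 30), ("tera", 40)]

def prefix_table():
    """Return (name, decimal value, binary value, ratio) for each prefix."""
    rows = []
    for name, power in PREFIXES:
        decimal = 10 ** (3 * power // 10)  # 10^3, 10^6, 10^9, 10^12
        binary = 2 ** power                # 2^10, 2^20, 2^30, 2^40
        rows.append((name, decimal, binary, binary / decimal))
    return rows

for name, decimal, binary, ratio in prefix_table():
    print(f"{name}: {decimal:,} vs {binary:,} (ratio {ratio:.3f})")
```

The gap starts at 2.4% for kilo and grows to roughly 10% by tera, so the approximation gets worse the further up the chain you go.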
What about all that bit mask stuff? Well that’s now pretty simple to sort out as well. One common convention that you need for a bit of additional perspective is to know that “base-16” (hexadecimal) is often adopted for representing bytes. The reason is that a byte is 8 bits and each group of 4 bits represents one of 16 possible values. So if you write a byte in base-16, it takes exactly two digits. These take the form of something like 4E because you get 0-9 of base-10 that you’re used to and then another 6 values representing 10 through 15 which are written as the characters A through F. It’s much easier to remember a byte as two digits than it is to remember 8 bits. I mention all of this because “FF” is identical in value to “11111111” and “F0” is identical to “11110000”.
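The hex-to-binary correspondence is easy to check for yourself. This is a quick Python sketch using only the built-in int() and format() functions; the values are the ones from the paragraph above.

```python
# "FF" in base-16 and "11111111" in base-2 are the same number.
print(int("FF", 16), int("11111111", 2))  # 255 255

# Show a byte's 8 bits, zero-padded.
print(format(0xF0, "08b"))  # 11110000
print(format(0x4E, "08b"))  # 01001110

# Each hex digit covers exactly 4 bits: split 0x4E into its two digits.
high, low = divmod(0x4E, 16)
print(high, low)                                   # 4 14  (14 is written as E)
print(format(high, "04b") + format(low, "04b"))    # 01001110
```

Because each hex digit maps to its own 4 bits, you can convert one digit at a time without doing any arithmetic on the whole number.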
You don’t see bit masking as much in managed languages, though you might if you get your hands dirty down in the transport layer doing things like socket management. But those coming from a C background are quite familiar with it. If you get a message over the wire, it may come in as a stream of bytes. But when things like network latency and accuracy are issues (or perhaps in embedded environments when space is at a premium) there is a tendency to squeeze every last modicum of value out of the space available. In a web application, representing a number 1 through 10 with data type “int” (which in managed languages will be a 32-bit/4-byte integer) is not really an issue, but in the environments I’ve mentioned, it’s digital blasphemy to waste that space. You only need 4 bits to represent that — the other 28 are wasted.
But the smallest unit typically dealt with in programming convention is the byte. So what to do? Well, enter bit masking. Let’s say I have two speakers on some embedded device that can be set to volume level 1-10 and I want to write a protocol for querying volume level for both. What I would do is define a message that was 1 byte long and have the first 4 bits of the byte represent the volume of one speaker and the second 4 represent the volume of the other. If I wanted just the value of speaker 2, I would want a way to preserve the second 4 bits of the byte I got while clobbering the first 4. That is the job of the bit mask. If I perform the operation int myValue = 0x0F & incomingSpeakerByte, I get what I’m looking for. Why? Well, let’s consider this.
The value 0x0F can be rewritten in binary as 00001111 and the value of the speakers is whatever is coming back. Let’s say that speaker 2 is set to volume level 9, which is 1001 in binary. But, let’s say that the other speaker is set to volume level 4, which is 0100. That means that both speakers together are represented with the byte 01001001. If I set a local variable (where space is not at a premium) to the byte I’m getting over the wire, it will be equal to 73, and that is nonsense in our 1-10 volume domain. But the bit mask operation performs 00001111 & 01001001 to get the final value of 9 (00001001) correctly. The reason this works is that the “&” operation aligns the two sequences of bits and performs AND on them one by one, recording the value. This is called “bitwise and.” If you think back to the domination and identity laws from the last post, we’re using domination (x & 0 = 0) to clobber the first 4 bits and identity (x & 1 = x) to preserve the second 4, which are the only four that interest us. (Storing the other speaker’s value in a local int is a bit more involved, taking advantage of the aforementioned “bit-shifting,” and beyond the scope of this post). The long and short of it is that when you see bit masking operations performed, you now know what’s going on: the developer is using the same byte to store multiple values and using this technique to extract the ones he or she needs at the moment.
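Here’s the speaker example as a runnable Python sketch. The variable names are my own, and the incoming byte is hard-coded rather than read off a wire, so treat it as an illustration of the masking step only.

```python
LOW_FOUR_BITS_MASK = 0x0F  # 00001111 in binary

# Pretend this byte just arrived over the wire:
# first 4 bits = speaker 1 at volume 4 (0100),
# second 4 bits = speaker 2 at volume 9 (1001).
incoming_speaker_byte = 0b01001001

# Bitwise AND: the mask's 0s clobber the first 4 bits (domination),
# and its 1s preserve the second 4 bits (identity).
speaker_two_volume = LOW_FOUR_BITS_MASK & incoming_speaker_byte

print(incoming_speaker_byte)                  # 73, meaningless as a volume
print(format(incoming_speaker_byte, "08b"))   # 01001001
print(format(LOW_FOUR_BITS_MASK, "08b"))      # 00001111
print(format(speaker_two_volume, "08b"))      # 00001001
print(speaker_two_volume)                     # 9
```

Lining up the three binary strings makes the column-by-column AND visible: wherever the mask has a 0 the result is 0, and wherever it has a 1 the incoming bit passes through unchanged.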
As to thinking in ones and zeroes, you might not be quite there yet, but this is a step in the right direction. It helps you to understand the fundamental building blocks of software-based data and how we get from the physical world to the “soft” representation that we deal with. It also hopefully makes some of the weird-symbol bit manipulation tricks seem a little less like black magic. At the very least, I hope that instead of viewing some of this stuff with suspicion or hopelessness, you now have some basic concepts, terms to google and questions to ask.