diffstat of MySQL 5.6 versus 5.5

Yesterday I wrote about what the diffstat between MySQL 5.5 and MariaDB 5.5 was, and previously to that, about the MariaDB code size as reported by sloccount. Let’s look at MySQL 5.6.

A naive wc based “lines of code” for MySQL 5.6 sql/ directory is ~490kLOC which contasts with MySQL 5.5 being ~375kLOC by the same measure. If we diffstat the sql/ directory like I did for MariaDB 5.5 we get:

 357 files changed, 172871 insertions(+), 67922 deletions(-)

Versus, as you remember from yesterday for MariaDB 5.5 over MySQL 5.5:

 250 files changed, 83639 insertions(+), 23090 deletions(-)

The MySQL 5.5 to 5.6 sql/ changes line up with What I found in my post MySQL modularity, are we there yet? in that the core server code for MySQL has grown by about 100,000 lines of code.

The jump from MySQL 5.5 to MariaDB 5.5 is a smaller one than jumping from MySQL 5.5 to MySQL 5.6, at least in terms of changed server code.

A judgement all on if a smaller diff is a safer jump or not will rest more with the quality of that code more than anything else. As we’ve seen previously, modularity isn’t coming to the MySQL code base any time soon.

So what about the diffstat of MariaDB compared to MySQL?

So, I’ve looked at what sloccount says on the differences between Oracle MySQL over versions of itself and the various MySQL branches around. What I haven’t looked at is the diffstat. Firstly, let’s look at MariaDB.

I’m going to look at MariaDB 5.5.29 as compared to MySQL 5.5.29, both checked out from bzr. A naive diffstat would give us:

 5261 files changed, 1086165 insertions(+), 122751 deletions(-)

And this looks like an awful lot of code that has changed: about 1,086,165 lines! This actually includes a whole other copy of InnoDB in the form of XtraDB. If we take that into account we get:

 5032 files changed, 864997 insertions(+), 125099 deletions(-)

Which is still incredibly high. Let’s look at what’s changed though. We actually see a bunch of changes in the test suite, some of which are relatively harmless, while others, like the change to rpl_tests/rpl_innodb.test have a “–replace_result MyISAM InnoDB” line added to them, which is awfully odd (possibly legitimate, but it stuck out).

In the end, I came up with this diff command which I think leaves us with a best diff for what is the code difference between MySQL 5.5 and MariaDB 5.5:

 diff -Nru --exclude=BUILD* --exclude=.bzr* --exclude debian* \
--exclude=man* --exclude=mysql-test* --exclude=win* \
--exclude=unittest* --exclude=test* \
--exclude=support-files* --exclude=README \
--exclude=Docs --exclude=CMakeLists.txt \
--exclude=COPYING.LESSER --exclude=INSTALL* \
--exclude=KNOWN_BUGS.txt \
--exclude=cmake* mysql-5.5.29/ mariadb-5.5.29/

This is not to discount the build and test changes that MariaDB have made, but in this case I feel they distort the numbers a bit and I’ve previously just been counting C and C++ code, so it’s probably fairer this way.

We end up with a diffstat of:

 1156 files changed, 326081 insertions(+), 42751 deletions(-)

If we then exclude the copyright notice changes and any whitespace by changing the start of the diff command to this:

diff -NruiEbwB --ignore-matching-lines='Copyright.*Monty' \
--ignore-matching-lines='Copyright.*Oracle'

We end up with a diffstat of:

 1129 files changed, 322821 insertions(+), 39588 deletions(-)

Which is a little different to what I found in my previous post (MariaDB code size) that just used sloccount. There we found that MariaDB 5.5 was 187,000 more lines of code than MySQL 5.5 while here we find the difference to be 283,000 lines of code. I suspect these differences to be in how diff and sloccount count things. If you do a naive count of the number of lines in source files in the sql/ directory you get 375kLOC while sloccount says 256kLOC.

There is still some noise in this number as there’s some Copyright notices for some of the strings code that changes, but this doesn’t seem to be too much. What about server code though? If we just diffstat the sql/ directory (core server code), then we get:

 250 files changed, 83639 insertions(+), 23090 deletions(-)

Which is still nothing to sneeze at, sloccount tells me that MySQL 5.5.29 only has 256kLOC in the sql/ directory to begin with and a naive wc count to be about 375kLOC.

Which is bigger: MySQL or PostgreSQL?

From my previous posts, we have some numbers (excluding NDB) for the size of MySQL, so what about PostgreSQL? Here, I used PostgreSQL git trunk and classing things in the contrib/ directory as plugins. I put the number of lines of code in the src/backend/storage directory down as storage engines LoC but did not count it as non-kernel code.

Version Total LoC Plugin LoC Storage Engines LoC Remaining (kernel)
MySQL 5.5.30 858,441 2,706 171,009 684,726 (79% kernel)
MySQL 5.6.10 1,049,344 29,122 236,067 784,155 (74% kernel)
MariaDB 5.5 1,142,118 11,781 304,015 826,322 (72% kernel)
Drizzle trunk 334,810 31,150 130,727 172,933 (51% kernel)
PostgreSQL trunk 648,691 61,934 17,802 586,757 (90% kernel)

What we can see is that the PostgreSQL kernel size is actually smaller than any recent MySQL version (5.1 was slightly smaller). This is rather interesting as it is generally thought that PostgreSQL does more than MySQL. What’s more telling is that total code size, PostgreSQL is about half of MySQL 5.6 or MariaDB 5.5. Only Drizzle ends up being smaller, which makes sense as it “does less”.

Is MySQL bigger than Linux?

I’m going to take the numbers from my previous post, MySQL Modularity, Are We There Yet? for the “kernel” size of MySQL – that is, everything that isn’t a plugin or storage engine.

For Linux kernel, I’m just going to use the a-bit-old git tree I have on my laptop. I’ve decided that the following directories are for “plugins” drivers/ arch/ sound/ firmware/ crypto/ usr/ virt/ tools/ scripts/ fs/*/* and everything else is core kernel code.

Version Total LoC Total Plugin LoC Remaining (kernel)
MySQL 5.6.10 1,049,344 265,189 784,155 (74% kernel)
MariaDB 5.5 1,142,118 315,796 826,322 (72% kernel)
Linux 9,983,269 8,824,121 1,159,148 (11% kernel)

The scary thing is that it’s surprisingly close, MySQL/MariaDB core is roughly 68-71% the  size of the Linux kernel. This is probably an unfairly large number for Linux too as there’s much more of Linux that is pluggable and modular… so I actually suspect they’re closer to exactly the same size.

If we look at the net/ directory in linux, it’s a grand total of 493,000 lines of code, all of which is fairly modular and independent. You could, quite reasonably, claim that the core of Linux is in fact closer to half a million lines of code than a million, making MySQL significantly larger.

So how many engineers are looking after each code base? We know there are over a thousand Linux kernel developers contributing to each release (e.g. https://lwn.net/Articles/395961/ for data back in 2010, and https://lwn.net/Articles/537110/ for Feb 2013).

I’m now going to fudge some things to attempt to work out how many “developers” are working on linux core code rather than drivers and arch specific things. I work out there’s probably about 20-25% of linux developers who work on things that are not drivers, filesystems or arch code. This is around 250-300 developers for each kernel release.

So… how many people have ever committed code to MySQL? This is fairly easy to find out: I simply looked at the entire bzr history, grepped out every committer and then uniqued the list (this required more than just sort -u as people used different email addresses and names). How many people have ever committed code to MySQL (i.e. their code can be found in the MySQL 5.6 bzr tree)? 312.

How many committers to MySQL 5.6 are there? 161. This is pretty amazing, that’s about half of what the total is. However, this number is misleading. For example, my name is there and the last commit to the MySQL tree from me was in 2008. You also see names such as Monty Taylor and Kristian Nielsen – all three of us not having worked for MySQL/Sun/Oracle for a great number of years. At the very least, there’s been a lot of code integration into MySQL 5.6 from many existing sources that were not previously in MySQL trunk.

MySQL modularity, are we there yet?

MySQL is now over four times the size than it was with MySQL 3.23. This has not come in the shape of plugins.

Have we improved modularity over time? I decided to take LoC count for plugins and storage engines (in the case of Drizzle, memory, myisam and innobase are storage engines and everything else comes under plugin). I’ve excluded NDB from these numbers as it is rather massive and is pretty much still a separate thing.

Version Total LoC Plugin LoC Storage Engines LoC Remaining (kernel)
MySQL 3.23.58 371,987 0 (0%) 176,276 195,711 (52% kernel)
MySQL 5.1.68 721,331 228 237,124 483,979 (67% kernel)
MySQL 5.5.30 858,441 2,706 171,009 684,726 (79% kernel)
MySQL 5.6.10 1,049,344 29,122 236,067 784,155 (74% kernel)
MariaDB 5.5 1,142,118 11,781 304,015 826,322 (72% kernel)
Drizzle trunk 334,810 31,150 130,727 172,933 (51% kernel)

I’ve used the non-plugin and non-storage engine code size to be the database “kernel” – i.e. the core of the database server.

What I find really interesting here is that yes, the amount of code that is to some degree modular has increased. The amount of code that is a MySQL plugin is still very small compared to the server size

Drizzle is 20-25% of the size of a modern MySQL or MariaDB server and for many applications does largely or exactly the same thing.

Other MySQL branch code sizes

Continuing on from my previous posts, MySQL code size over releases and MariaDB code size I’ve decided to also look into some other code branches. I’ve used the same methodology as my previous few posts: sloccount for C and C++ code only.

There are also other branches around in pretty widespread use (if only within a single company). I grabbed the Google, Facebook and Twitter patches and examined them too, along with Percona Server 5.1 and 5.5.

Codebase LoC (C, C++) +/- from MySQL
Google v4 patch 5.0.37 970,110 +26,378 (from MySQL 5.0.37)
MySQL@Facebook 1,087,715 +15,768 (from MySQL 5.1.52)
Twitter 5.5.29.t10 1,192,718 +3,624
Percona Server 5.1 trunk 1,066,418 +14,878 (from MySQL 5.1.66)
Percona Server 5.5 trunk 1,208,577 +19,483 (from MySQL 5.5.29) +142,159 (from PS 5.1)
Drizzle trunk 334,810

The Google patch has always had a reputation of being large, and with an extra 26kLOC of code, it certainly is the biggest of any of the more current branches – and that’s actually a surprise to me that it adds this much code.

The Facebook and Percona Server 5.1 branches are amazingly similar in how much extra code they add, and they’re not carbon copies of each other. The Twitter patch quite notable for how little extra code it adds.

For giggles, I included Drizzle – which is (even with all the plugins) less than a third of the size of MySQL 5.1.

It’s clear that the Percona Server and Facebook patches introduce much less code than MariaDB does, which does go with the general wisdom of them being closer to Oracle MySQL than MariaDB is.

If we look at Percona Server, we see that with Percona Server 5.5 there is indeed a bunch more code than was in Percona Server 5.1, with roughly 5,000 more lines of code than we’d expect from a simple port from MySQL 5.1 to MySQL 5.5. This feels about right, we’ve added new things to Percona Server 5.5 that weren’t in Percona Server 5.1.

MariaDB code size

Continuing on from my previous post, MySQL code size over releases.

I wanted to look at the different branches/patch sets of MySQL out there and work out how far from upstream they deviated. I’m just going to compare against whatever upstream version the most easily accessible version is based on (be it 5.0.x, 5.1.x or whatever).

For MariaDB versions, I removed innodb_plugin and replaced it with xtradb for stats purposes as the MariaDB innodb_plugin is essentially the same as upstream and I don’t want to artificially inflate the diff size.

The first three major versions of MariaDB were all based on MySQL 5.1. I used sloccount and only counted C and C++ code.

So, let’s look at some of the MySQL patch sets/branches that are around. Firstly, let’s look at MariaDB:

Codebase LoC (C, C++) +/- from MySQL +/- from prev maj Version
MariaDB 5.1 1,210,168 +157,532 0
MariaDB 5.2 1,227,434 +174,798 +17,266 (since MariaDB 5.1)
MariaDB 5.3 1,264,995 +212,359 +37,561 (since MariaDB 5.2)
MariaDB 5.5 1,377,405 +187,658 (from MySQL 5.5) +112,410 (since MariaDB 5.3)

From my previous post on lines of code in MySQL versions, we learned that with MySQL 5.6 we saw a 354kLOC increase over MySQL 5.5. What is quite surprising is how close some of the MariaDB differences are to this. With MariaDB 5.5, we’re looking at a 187kLOC difference, which is roughly two thirds that of MySQL 5.6. What’s also interesting is that each incremental MariaDB release has not added nearly as much code as the MySQL 5.1 to 5.5 and 5.5 to 5.6 jumps did.

MariaDB LoC over major versions

The MariaDB code size has also been increasing, if we look at the graph above  you can really see the jump in code size over the past few releases.

If we look at the delta between MariaDB and MySQL, the first MariaDB release (MariaDB 5.1) was certainly a large jump. Each incremental MariaDB release (5.2 and 5.3) have been a smaller delta than the initial one. With MariaDB 5.5 we actually decrease the delta from MySQL, which is something that’s interesting to look at.

If we were going a straight port of MariaDB 5.3 to be based off MySQL 5.5, we’d expect the delta to be around 137kLOC (what MySQL 5.1 to 5.5 is) but it isn’t. The difference to MariaDB 5.5 from MariaDB 5.3 is only ~112kLOC, and the on the whole delta decreases.

But what makes up this big initial jump for MariaDB? Let’s look at some of the MariaDB 5.1 only modules and what’s left:

MariaDB 5.1 component LoC (MariaDB 5.1)
PBXT 45,107
FederatedX 3,076
IBM DB2i 13,486
Total 61,669
Other 95,863

So the MariaDB delta is not increase just because they included some existing modules, there’s more code in there, about as much as any major MySQL version bump.

Tomorrow we look at other MySQL branches, and we see that the MariaDB delta truly is significantly larger than any other MySQL branch.

MySQL code size over releases

As the start of a bit of a delve into the various MySQL branches and patch sets that have been around, let’s start looking at the history of MySQL itself. This is how big MySQL has been over all of the major releases since the beginning (where beginning=3.23). (edit: These numbers were all gathered using sloccount and only counting C++ and C source files.)

Codebase LoC (C, C++) +/- from previous MySQL
MySQL 3.23.58 371,987 0
MySQL 4.0.30 368,695 -3,292 (from MySQL 3.23)
MySQL 4.1.24 859,572 +490,877 (from MySQL 4.0)
+174,352 excluding NDB
MySQL 5.0.96 916,667 +57,095 (from MySQL 4.1)
MySQL 5.1.68 1,052,636 +135,969 (from MySQL 5.0)
MySQL 5.5.30 1,189,747 +137,111 (from MySQL 5.1)
MySQL 5.6.10 1,544,202 +354,455 (from MySQL 5.5)
increase in MySQL source code size over version

MySQL code size over major versions

Note the sharp increase in 5.6

LoC Delta for major MySQL release

We can see that MySQL has had some interesting code size changes over time, the big jump in 4.1 over 4.0 was mostly due to the introduction of MySQL Cluster, but even so, it was a big jump.

MySQL 5.6 is the largest MySQL code size increase in a MySQL version ever. The last time we saw anything like this was with the merging of MySQL Cluster in 4.1. At the very least, Oracle is paying people to write lines of code to extent that nobody has before.

unireg.h is finally gone

I got rid of unireg.cc way back in 2009 as I rewrote all the FRM related code inside Drizzle to instead use a nice protobuf based structure. If you’re wondering what was there, I just quote this part of pack_screens() from unireg.cc in MySQL 5.6:

start_row=4; end_row=22; cols=80; fields_on_screen=end_row+1-start_row;

We have gradually pulled things out of unireg.h over the years too. But, let’s go back to ask the question “What is UNIREG?”. To answer that, I’m going to quote from something that was current back when MySQL 3.22 was the latest and greatest:

In 1979, he developed an in-house database tool called UNIREG for managing databases. Since 1979, UNIREG has been rewritten in several different languages and extended to handle big databases.

No doubt the definition of big has changed for most people since then. If we take what the MySQL 3.23 manual said:

Unireg is our tty interface builder, but it uses a low level connection to our ISAM (which is used by MySQL) and because of this it is very quick. It has existed since 1979 (on Unix in C since ~1986).

This is where the FRM file comes from, and likely where unireg.cc and unireg.h in MySQL  originated from. Since we didn’t want to have the limitations of FRM files in Drizzle, and since all code that mentioned UNIREG was generally not modern, we’ve been gradually removing bits or refactoring it since.

So, what was in unireg.h to begin with? I peeked into the MySQl 5.6 unireg.h to remind myself:

  • defines for libc5
  • defines for the MySQL thing that isn’t gettext
  • the PLUGINDIR define
  • some macros for errors
  • a bunch of SPECIAL_ prefixed defines for a bitfield which is a truly odd thing.
  • store_record(), restore_record(), cmp_record(), empty_record() macros
  • defines for flags to openfrm(), openprt() and openfrd() (the latter two not being present in MySQL)
  • defines used in relation to INFORMATION_SCHEMA tables and triggers
  • the BIN_LOG_HEADER_SIZE define
  • the DEFAULT_KEY_CACHE_NAME define

Soo… really a whole bunch of stuff that should never have been put there in the first place as most of that has clearly come in after it was MySQL.

But I’m getting sidetracked, the main point was that I looked at what was remaining in unireg.h inside Drizzle and it was just one thing: Code to abort the server in certain odd situations (that we’d completely rewritten so had no relation to unireg). So, I renamed it.

Renaming a file is a pretty minor change, but that hides all the work leading up to that point so that now I can go and explain Drizzle internals while making one less reference to UNIREG.

Now to go and remove the last tiny bit of unireg_check…

Fun with Coverity found bugs (Episode 1)

Taking the inspiration of  great series of blog posts “Fun with Bugs” (and not http://funwithbugs.com/ which is about both caring for and eating bugs), and since I recently went and run Coverity against Drizzle, I thought I’d have a small series of posts on bugs that it has found (and I’ve fixed).

An idea that has been pervasive in the Drizzle project (and one that I rather subscribe to) is that there is two types of correct: correct and obviously correct. Being obviously correct is much, much better than merely being correct.

The first category of problems that Coverity found was kind of interesting, there was a warning that data_file_name and index_file_name in class ha_myisam weren’t initialized in the ha_myisam constructor nor in any function that it calls. It turns out that this was basically because the code wasn’t exactly optimal, and these variables were used kind of oddly. In fact, in writing this blog post I went back and found that there’s a bunch of extra dead code and these should just be removed, along with the code that “used” them.

The historical use for data_file_name and index_file_name were that (in MySQL) you could specify different paths for MyISAM data and index files, so that the FRM ended up in the server datadir, the data file ended up some other place and the index file was off behind the sofa. Since MyISAM is used only for temporary tables in Drizzle, this is entirely not needed.

Another place where a similar bug was found by Coverity was in the SQLExecutor class of the json_server plugin. The _err variable wasn’t initialized in the constructor. After some careful auditing, I think this was actually a false positive as it was set to something before being used, but it was pretty simple to prevent future bugs by initializing it.

Two instances of the same warning, one just found a bunch of code to delete (rather useful) and the other is rather minor but may help someone in the future.

Coming up next: total embarrassment bugs.

Coverity scan for Drizzle

Coverity is a static analysis tool, which although proprietary itself does offer a free scanning service for free and open source software (which is great by the way, I totally owe people who make that happen a frosty beverage).

Prompted by someone volunteering to get MariaDB into the Coverity Scan, I realized that I hadn’t actually followed through with this for Drizzle. So, I went and submitted Drizzle. As a quick overview, this is the number of problems of each severity both projects got back:

Severity MariaDB Drizzle
High 178 96
Medium 1020 495
Low 47 52

I don’t know what MySQL may be, but it’d be great to see this out in the open too.

Being highly irresponsible (or, HOWTO DoS nearly all RDBMSs)

In my linux.conf.au 2013 talk, I had a big slide telling the audience how to do a simple Denial of Service attack against a MySQL server (post login). This was only one example of many others I could give, but I think it’s the simplest, and only requires the mysql command line tool and a single command. FYI, this also applies to PostgreSQL but I’ll leave the specifics up to somebody else to write.

There is a fundamental flaw in just about all MVCC databases that leaves a giant Denial of Service attack hole. It is the following: START TRANSACTION WITH CONSISTENT SNAPSHOT followed by a bunch of waiting. Sine the database server has to maintain this read view, InnoDB will continue to grow UNDO until it has to extend the ibdata1 file (system table space).

It’s important to remember that you cannot shrink the system table space (unlike with file-per-table where you can just do ALTER TABLE for any individual table suddenly finding itself a lot smaller).

As UNDO grows, InnoDB will faithfully expand the system table space until ENOSPC and then everything will fall in a heap.

In theory, you could have a system table space that doesn’t auto-extend, but then you’re relying on code paths to error out gracefully that I can pretty much bet you are completely untested.

The only real way to avoid this is doing both of the following:

  1. Use kill-idle-transactions feature from Percona Server
  2. have a script that checks for long running transactions and just kills them.

Similar things affect just about any MVCC database system. You’ll also see similar things with file system and volume manager snapshots.

So is it highly irresponsible pointing this out? Of course it isn’t, this should be pretty well known to most DBAs already and so should a whole bunch of other things. Remember all the things you saw in production and then went to hit your developers over the head for? Well, they’re all in this same category.

Go run giant UPDATEs, DELETEs or ALTER TABLE on a giant table in a replication setup, you’ll pretty much DoS your app as everything can’t get up to date read-only information from slaves.

Considering that this is merely scratching the top of the iceberg of ways to DoS a database server, keeping post authentication crashing bugs secret just seems… well… futile, even if you do accept security through obscurity as valid.

Further reading:

Aloo Palak/Spinach with Potatoes

This is the most lazy food blogging ever: I made this and it was yummy. You should make it too: http://www.ecurry.com/blog/indian/curries/dry/aloo-palakspinach-with-potatoes/ I was a it different to what I usually get when ordering Aloo Palak, and that’s a good thing as variety is the spice of life. I was doubly lazy and microwaved potatoes to speed up cooking time.

Those who do not know the future are doomed to repeat it

A couple of weeks ago, I attended the Open Source Developers Conference (OSDC) in Sydney where I gave the dinner keynote. I had previously given the dinner keynote at OSDC 2010 in Melbourne, where I explored a number of interesting topics “that I wasn’t really qualified to talk about.”

In writing the dinner keynote for 2010, I took the idea that people come to conferences to hear from experts in the field and decided that I should instead do the opposite of that. Talk about all the things that I think are interesting but I’m not an expert in. So in 2010 I covered: Drizzle database server (the only thing I was actually qualified to talk about), developing your own film, how much effort it takes to write a book, brewing your own beer, Bluehackers (and mental health in general), Security (it was the time of Stuxnet, oppressive border security), censorship (and how government claims that the internet is both different to and just like publishing a book at the same time), Wikileaks, how perhaps we should go after child pornographers rather than waste money on totalitarian filters, feminism, code of conducts at conferences, homophobia, a history of marriage and the notion of ‘traditional marriage’, the concept of freedom itself and a few pictures of vegetables made to look like faces. In the words of some attendees, “there was something to make everyone at least slightly uncomfortable at some point”.

My 2010 talk went really well, there was much applause and it inspired at least one person to go and brew their own beer (in itself a victory). Many thanks to Donna for spending a non-trivial amount of time helping me polish the final talk and help ensure some of my most important points were communicated properly.

So for 2012, I felt I had some big shoes to fill. Picking a topic (and writing the talk itself) for a dinner keynote is tricky. You have a captive audience with a wide variety of interests (and likely a few partners of attendees who aren’t at all technically minded). I wanted a topic that could have a good amount of humour (after all, we’re at dinner, relaxing and chatting) as well as a serious message that would speak to all the developers in the room (after all, this was the Open Source Developers Conference). Needing a title for the talk much in advance of when I would start writing the talk, I started thinking along the lines of “Those who do not know UNIX are doomed to re-implement it – poorly” and ”Those who do not know the past are doomed to repeat it” – thinking that there must be some good lessons that I’ve learned over the past years that could be turned into a dinner talk. I ended up settling on ”Those who do not know the future are doomed to repeat it”.

I, of course, left most of the specifics to be determined much closer to the conference itself as procrastination seems to be an integral part of writing a talk. Fast forward a while and if you were nearby you would have heard me exclaim “Who had this dumb-ass idea for a talk?” and “well, it seemed like a good idea at the time.”  Setting yourself constraints is good, and at least narrowed the search space for constructing something that’d go down well. Next came “How on earth do I construct a cohesive narrative around that?” as a whole bunch of fun anecdotes about what people in the past considered the future is great, but how do you weave a story around it? In thinking about what used to be the future (and indeed, researching it), I had the realisation that this in itself is a really good story and vehicle to talk about how to produce better software.

And so, I solidified a set of laws, and for mostly humorous purposes, I’ve called these “Stewarts Laws”. So, we started with:

Those who do not know the future are doomed to repeat it.

Stewarts 0th Law

Because in computing, we start counting from zero.

I then went on a grand tour of how we got to have the PC. Early personal computers being iterative improvements on technology that came before, and how packaging technology as an appealing product helps adoption and that no matter how good something is, if it’s too expensive, it’s never going to be mainstream. This last point was a homage to the great Hitchhiker’s Guide to the Galaxy, which was successful over the great Encyclopaedia Galactica for two reasons, one of which was “it was slightly cheaper”.

The platform which is more open will eventually succeed over ones that are more closed. (This really should have been a law… but I missed the opportunity). One example was Mozilla. The initial source release was way back in 1998 and this “quirky open source project” took a very long time to deliver a useful web browser (excluding all the internal Netscape development on this complete rewrite of the browser).

All complete rewrites of any sufficiently complex software takes at least 5 years to be remotely usable.

Stewarts 1st Law

With the insight that the more free platforms (the PC, Windows, the web, Mozilla) eventually win out and being a talk about the future, I could not possibly not cover “The Year of the Linux Desktop”. This was useful to cover the install and user experience of Debian 2.2 (potato). This was Linux in the year 2000 (with IPv6 support, and with World IPv6 day only six months ago, this is certainly the future). It was not friendly.

But there was KNOPPIX that built on what came before and this showed the way so that other distributions could end up creating a situation where there are now many distributions of Linux that make running a free desktop something that is no longer masochistic, it’s something that can be decidedly pleasant.

I (of course) had to cover the freedom in your pants. The cell phone. Specifically, how there is more free software running on a computer that fits in our pants pockets than there was storage in the computers we grew up with. It doesn’t matter if Android is better than an iPhone or not, the more open, free and cheaper platforms always win. But really, it’s just iterative improvement on what came before.

All innovation is really just iterative improvement.

Stewarts 2nd Law

Very rarely (if ever) is there a “eureka” moment that doesn’t build upon the work of others. Find your giant to stand upon so that you can see further.

We can, of course, get it wrong. I used the example of New Coke and wondered if Unity or GNOME3 are our “New Coke” or if Windows 8 is the new Vista. But really, it’s not making a mistake that is bad, it’s not realising it and correcting. What we need is CI. Not Continuous Integration (although that is part of it), I’m talking about Continuous Improvement.

Anybody who took a “Software Engineering” course at university will have read about, studied, and parroted things about “the waterfall model” and “software prototyping” and “incremental build model” and “spiral model” and maybe even “SCRUM” or XP (which seems to be jumping off cliffs and yelling at fish). You probably had to do an assignment where you wrote “We’re going to do X model” and then had to stick to it, quickly finding that it just didn’t quite work that way.

This is because all this static model of software development methodology is a bunch of dairy production byproduct – otherwise known as BOOLSHIT. There is no static way written in stone and there certainly isn’t “one true way.”

The best battle plans don’t survive first contact with the compiler

Stewarts 3rd Law

This law is obviously stolen, which leads me to:

Stealing good ideas is itself a good idea, that you should steal.

Stewarts 4th Law

Software development is evolution by natural selection. Mutations in software battle it out and the fittest survive. This is even more true in free software development, as anyone is free to fork the product, mutate it and compete. In this way, free software accelerates the free market – it forces companies to continually add value rather than vendor lock in.

Our development processes also evolve. We try new things and keep what works. There may be a “state of the art” that we think exists, but really what matters is continuing to improve your development process. You don’t have to suddenly catch up, just improve.

  • Revision Control
    We’ve had RCS, CVS, Subversion. We’ve had bzr, hg and git. Distributed is obviously the current state of the art.
  • Code review
    and improving how we do code review. Could you review code better? Could we have automated code review?
  • assert(), make the compiler do the work, defensive coding
    Write code to do some of your code review for you.
  • Explode at compile time rather than runtime (i.e. not user visible)
    Detecting problems earlier is better.
  • Extensive Unit testing
    Test each component, have components be components, not spaghetti.
  • Extensive testing
    Test the system as a whole
  • Running the test suite
    Actually run the test suite
  • Reliable test suite
    Have the test suite be reliable so that a failure really is a failure and not a false negative.
  • Continuous Integration
    Always test how things go together
  • Test before integration
    Test before pushing to trunk, ensuring even further that trunk is always releasable.
  • Merge captain
    Takes approved code, merges it. This is variants on the Linus model.
  • Automated merges
    Take the manual steps out, we can automate them (who needs to type 10 version control commands in when one will do)
  • Always releasable trunk
    “Release early, release often” refined to “release something that isn’t crap”
  • Release checklists
    There are probably different things you want to do upon release, check that you do them. For adding awesome new features, you want your marketing department to know about it. For awesome bug fixes, you want your support staff to know about it.
  • Continuous Deployment
    There is no environment like production environment.

This led me to two more laws:

Any system of sufficient size will have several versions of each component deployed simultaneously.

Stewarts 5th Law

Constructing software is itself a system of sufficient size.

Stewarts 5th Law, part B

This applies to software you both deploy yourself and release as a tarball (or however you do it). Even if we don’t like to think about it, when we release a software package we are slightly involved in deploying it. We can certainly make it easier or harder to deploy. There are always OS and library updates that will be out there, so there will always be your software running in different environments.

Not only will people using your software use it in different environments, the people developing your software will be too. No two developers have the same development environment setup. One will use a different editor, different shell, slightly different version of the compiler (maybe they haven’t applied an update yet) etc etc. We can’t program to the “one true environment” because no such thing exists.

So, What’s next?

Some of our older problems have good solutions, but many of the newer ones do not. How do we get the state of the art in software development to more people? What’s the next step to explore?

I encourage you to constantly think about your development process and what the future holds for it. After all, it is adapt or perish – the past is littered with the technological corpses of things that were “the future” but failed to innovate any further.

Sierra Nevada Beer Camp 2012 Oatmeal Stout

This, my friends, was good. Supremely drinkable with great flavour, colour and body. I want more of them. We have a bit of a saying around here that the Sierra Nevada brew of any style of beer is a rather good definition of that style – and this is no exception. If you feel like a good oatmeal stout, go grab one of these and you won’t be disappointed.