Lock-free clustering of large PostgreSQL data sets

Since 2012-09-27, I have been collecting overstock data from TF2WH every five minutes and storing a snapshot of this data in my database. This enables me to do some really cool things, such as chart past data as well as make projections about future stock levels. For this amount of time, data collection has been proceeding largely uninterrupted and today I have just shy of 40 million individual records.

However, the optimal physical layout of this data is to have the records for each item type grouped together, sorted by timestamp. Due to the nature of time, these records wind up being grouped first by timestamp, and then by item type. This can make lookups of data for a particular item slow, since many database pages need to be visited in order to collect all of the data about a particular item.

PostgreSQL (among other databases, of course) has two primary solutions to deal with this. The simplest one is the CLUSTER statement, which will reorganize all of the data in a table based on an index, such that lookups making use of that index are quick. Since the table’s primary key is composite around the item type and timestamp, this is convenient and does the right thing. Unfortunately, clustering requires an exclusive lock on the table. Sorting and writing out 40 million rows is not a fast operation, and while this is underway no new data can be added to the table, and nobody can query the table for information. This is not a good situation when new data is coming in every five minutes, and users are frequently requesting data of a web application.

The more complex solution is to use table partitioning. This requires that one empty parent table be created, and many child tables be created, each one for a specific item type. The parent table will allow querying across the entire data set, but since each item’s data is stored in a different table, clustering is no longer required — each table will be ordered by timestamp (since new rows are appended) and the separate tables keep the data sets for each item type physically separate on disk. This is a nice solution in theory, but PostgreSQL does not currently provide any mechanism for automatically partitioning a table based on a set of columns; the child tables have to be maintained either by hand, or by script. I like to avoid this kind of complexity when possible.

A compromise solution I’d thought of would allow collection of data to proceed during a cluster operation, but bring down the web application for the duration of the cluster. Before the cluster operation begins, I would have the data collection script retarget insertions into a different table. Once the cluster operation finishes, the script would return to inserting data into the primary table, and the contents of the other table would be transferred over. This is simple, but of course we want to keep the web application responsive if at all possible.

After asking in #postgresql I received some suggestions, and wound up implementing a modified version of one of them. This one allows me to effectively cluster the data set while allowing data collection to proceed and keeping the web application up (though with degraded performance, as the clustering operation and web application will both be waiting on each others’ disk I/O). Further, I was able to implement this change to the production database schema without interrupting data collection, nor taking the web application offline. (Although the application was unreachable for a few seconds due to an oversight on my part regarding table permissions. Theoretically, if I’d foreseen this requirement then there would have been no downtime.)

The table is partitioned into three child tables: static1, static2, and latest. The parent table has a rule that redirects all inserts into the “latest” partition. This serves as the data collection point, where new rows sit until they are migrated into one of the other two partitions. The “static1″ and “static2″ partitions serve as front and back buffers for the cluster operation. Only one partition will actually have any rows at any given time (the front buffer); the other will be empty (the back buffer).

When I want to perform a cluster operation, the rows from the “latest” partition are moved into the front buffer. Then, the rows from the front buffer are dumped into the back buffer, ordered by the primary key (this is what actually causes the data to be clustered). Finally, the front buffer is truncated to remove the dead rows, and the front and back buffers are swapped. Next time the operation happens, rows will be moved in the opposite direction between the two partitions. This is what the data flow looks like during one cluster operation:

latest --[ rows moved ]--> static1 --[ rows moved and sorted ]--> static2

The next time, it would look like this:

latest --[ rows moved ]--> static2 --[ rows moved and sorted ]--> static1

The magic queries that let all of this happen without interrupting the web application are:

WITH new_rows AS (DELETE FROM some_data_latest RETURNING *)
INSERT INTO some_data_static1
    SELECT * FROM new_rows;

WITH new_rows AS (DELETE FROM some_data_static1 RETURNING *)
INSERT INTO some_data_static2
    SELECT * FROM new_rows
    ORDER BY ...;

TRUNCATE TABLE some_data_static1;

ANALYZE some_data_latest;
ANALYZE some_data_static1;
ANALYZE some_data_static2;

The next time, references to static1 and static2 would be swapped.

There are other options besides TRUNCATE TABLE:

  • TRUNCATE TABLE static1; will very quickly dispose of the deleted records (usually in less than one second). However, it is not transaction-safe; concurrent transactions may wind up seeing no data in the parent table at all.
  • VACUUM static1; will slowly dispose of the deleted records, and will not cause concurrency issues. However, the time it takes to complete this operation is not small, and in my experience it causes sub-optimal layout of rows inserted into this partition in the future.
  • VACUUM FULL static1; will slowly dispose of the deleted records and will ensure that future inserts into this table are done as optimally as possible. However, it requires an exclusive lock on the table. This will prevent the web application from processing requests, and the whole point of this approach is to avoid this.

Based on this pro/con list, TRUNCATE TABLE seems to be the way to go. The window of time during which concurrency issues may result is not very large. However, because of these potential issues, I recommend executing the data-clustering query outside of a transaction — or, at least, only put the first two statements in a transaction. Commit the transaction before truncating the empty table.

(Update 2013-03-04: It’s possible to properly vacuum the table while minimizing the time during which an exclusive lock is held by first doing a standard vacuum and then doing a full one: VACUUM static1; VACUUM FULL static1;. The first vacuum takes a while, but the second one then usually completes in seconds. Of course, any deleted rows that are visible to ongoing transactions will not be removed as a part of this process — and that’s really the only reason why you would vacuum instead of truncate. If you need to reclaim disk space underneath an open transaction, vacuum is the way to go. Otherwise, there is no reason you can’t truncate the table.)

Here is the SQL that creates the tables enabling this approach. (Column names have been replaced with ... so that you may easily tailor this SQL to your own environment.)

-- Parent table.  This is the logical table against which queries
-- are executed.
CREATE TABLE some_data ( ... );

-- Partition for new records.
CREATE TABLE some_data_latest ()
INHERITS (some_data);

-- One of the buffers.
CREATE TABLE some_data_static1 ()
INHERITS (some_data);

-- The other buffer.
CREATE TABLE some_data_static2 ()
INHERITS (some_data);

-- Primary keys.  These will not enforce uniqueness across the
-- entire logical table, but they will serve to index the data.
ALTER TABLE ONLY some_data_latest
    ADD PRIMARY KEY (...);

ALTER TABLE ONLY some_data_static1
    ADD PRIMARY KEY (...);

ALTER TABLE ONLY some_data_static2
    ADD PRIMARY KEY (...);

-- Rule directing inserts on the main table to the latest table.
CREATE RULE some_data_insert AS
    ON INSERT TO some_data
    DO INSTEAD
        INSERT INTO some_data_latest
        VALUES (NEW.*);

This is the script that was used to apply this change to a production database. Thankfully, PostgreSQL implements transactional DDL, so we can rename tables out from under the web application and data collections scripts! They will use the old table up until the transaction is committed, at which point they will immediately begin using the new partitioned table.

BEGIN;

ALTER TABLE some_data RENAME TO some_data_latest;

CREATE TABLE some_data ( ... );

ALTER TABLE some_data_latest INHERITS (some_data);

CREATE TABLE some_data_static1 ()
INHERITS (some_data);

CREATE TABLE some_data_static2 ()
INHERITS (some_data);

ALTER TABLE ONLY some_data_static1
    ADD PRIMARY KEY (...);

ALTER TABLE ONLY some_data_static2
    ADD PRIMARY KEY (...);

CREATE RULE some_data_insert AS
    ON INSERT TO some_data
    DO INSTEAD
        INSERT INTO some_data_latest
        VALUES (NEW.*);

COMMIT;

Don’t forget any GRANT statements required to make the new tables accessible by the web application! This is the step I forgot, and it took down the web application for a minute or so.

Once this is complete, run the clustering script given above to perform the initial cluster of your data. While it’s running, you should notice that the web application will continue to be responsive, and new collected data will be immediately visible — just as before! Once the operation concludes, you should notice an increase in performance as lookups for past data will be using a mostly-clustered data set.

Object Copying in C#

When working on some sort of data-driven project, I frequently have the need to allow deep-copying of data objects. There are several patterns that accomplish this, but I’ve settled on one in particular.

Most .NET developers are probably familiar with the ICloneable interface. While this is a good starting point, it is not what I choose to rely on, for two reasons. First, the return type of the Clone method is object, so a cast is required. Second, the interface doesn’t really give you any special functionality. Nonetheless, implementing interfaces is usually a good thing, so my approach does use the interface, if only as a tag.

(I am leaving out equality test implementations for the sake of brevity, but one usually wants to implement equality testing when an object can be cloned.)

The common pattern I see when implementing a Clone method is to declare one public virtual method on the base class:

abstract class Animal
{
    public IList<Animal> Children { get; set; }

    public virtual Animal Clone()
    {
        var copy = (Animal)MemberwiseClone();

        // Deep-copy children
        copy.Children = Children.Select(c => c.Clone()).ToList();

        return copy;
    }
}

Simple enough. Let’s say that we have a derived class that needs some additional logic. We just override the Clone method, right?

class Dog : Animal
{
    public Collar Collar { get; set; }

    public virtual Animal Clone()
    {
        var copy = (Dog)base.Clone();

        copy.Collar = Collar.Clone();

        return copy;
    }
}

This works, but as soon as you try to clone a Dog directly, the ugliness of this pattern is apparent.

Animal animal = otherAnimal.Clone(); // Great!
Dog dog = otherDog.Clone();          // Compile-time error

Unfortunately, the return type of Animal.Clone() is Animal and subclasses may not change the return type, not even to narrow it. So to clone a Dog into a variable of type Dog means we have to cast:

Dog dog = (Dog)otherDog.Clone();

Yuck. This is passable, but it’s hardly optimal.

The good news is that with just one tweak, we can make this pleasant to deal with. First, the Clone method needs to be made protected and renamed. Second, we create a new public Clone method that is not virtual and calls the protected virtual method. Subclasses hide this method with a new implementation that does the same thing, but casts the result.

Here’s the full implementation:

abstract class Animal : ICloneable
{
    public IList<Animal> Children { get; set; }

    public Animal Clone()
    {
        return CloneImpl();
    }

    object ICloneable.Clone()
    {
        return CloneImpl();
    }

    protected virtual Animal CloneImpl()
    {
        var copy = (Animal)MemberwiseClone();

        // Deep-copy children
        copy.Children = Children.Select(c => c.Clone()).ToList();

        return copy;
    }
}

class Dog : Animal
{
    public Collar Collar { get; set; }

    new public Dog Clone()
    {
        return (Dog)CloneImpl();
    }

    protected virtual Animal CloneImpl()
    {
        var copy = (Dog)base.CloneImpl();

        copy.Collar = Collar.Clone();

        return copy;
    }
}

Now, we have nice class-specific methods that will return a properly-typed reference to the new copy.

Animal animal = otherAnimal.Clone(); // Works
Dog dog = otherDog.Clone();          // Also works!

It’s worth noting that both patterns will properly copy objects of more specific types than the reference you use to copy them. For example, given this variable:

Animal animal = new Dog();

animal.Clone() will return a Dog instance, typed as Animal. This is what we’d expect.

httpd migration complete

I’ve finished the httpd migration process. chrishowie.com is now using nginx as its primary httpd, which reverse-proxies to Apache for only a few mod_python and mod_mono web applications. Over the next few weeks, I’ll be trying to eliminate Apache entirely.

httpd and URI-to-site-mapping migration

Over the next week or so I’ll be working on migrating this site from Apache to nginx, as well as altering the way that various URLs map to sites/applications. I will be trying very hard to avoid any service interruptions by fully testing my nginx configuration before replacing Apache, but who knows what might happen.

To give some insight on this, I am switching to nginx because the primary performance bottleneck on the server is available RAM, and nginx tends to be extremely conservative with RAM while also consuming less CPU resources.

I’m altering the URL mapping configuration to better cope with the fact that I host many different services on the primary www.chrishowie.com virtual host. The first one I installed was this WordPress blog, and I installed it at the virtual host document root. This means that any sites hosted on the same virtual host effectively live in the same URI-space as my blog, and I really don’t like this. After the migration, each site will have its own subdomain to better isolate them. Right now, the subdomain configuration is very simple:

  • www.chrishowie.com hosts this blog and several applications.
  • chrishowie.com redirects to www.chrishowie.com.

After the migration, this will change:

  • www.chrishowie.com will be deprecated. Various URIs will redirect (with HTTP status code 301 “Moved Permanently”) to other subdomains in an effort to preserve the functionality of existing incoming links.
  • blog.chrishowie.com will host this blog.
  • chrishowie.com will host a simple static “business card” type site, linking to my blog and a few other applications.
  • Other subdomains will be created as necessary to support the other applications running on this site.

TF2 item store launched

So I had some free time last weekend and coded an item store for Team Fortress 2. I collect items and sell them for varying prices, usually one scrap metal each. If you’re interested, check out the store!

The store is built on a few components, all written in Python. There is a script to download the TF2 item schema, which includes data such as item names, quality names, item image URLs, and more. I selectively import this data into a table in a PostgreSQL database. Then, another script fetches the contents of my backpack and stores it into another table. One more table references the backpack table, listing the items I currently have for sale and their prices in item quantities (one scrap metal = two items). Finally, a mod_python publisher script fetches this data and renders it as a visually-pleasing item list, all without any client-side JavaScript.

If the store has something you’re interested in, feel free to contact me on Steam! You can add me to your friends list from the store, and I will reply as soon as possible.

C++ references, continued

So I got some feedback about my last C++ post. The comment states that references are not pointers, they are just names for another object.

Sorry for reopening a topic after nearly 6 months. But I cannot stay silent.
I think you got it wrong. Completely.
Although a reference might behave like “some sort” of a pointer, it is *not* a pointer. Your statement: “A reference is effectively a pointer, but this is hidden by the language.” is completely wrong.
To quote the C++ standard: “A reference is an alterantive name for an object.” It is just a new name for something that you’ve defined elsewhere. That’s the very reason why it cannot be null –> You cannot have an alternative name for an object that you do not have yet.

–Willi Burkhardt

Great, in theory. Unfortunately, none of the compilers I have used treat references as anything other than pointers. References are, on some level, supposed to guarantee non-null-ness as well as that they reference a valid object. This is not true in any compiler I have ever used.

Take this example (see it run):

#include <iostream>

static int const a_const = 5;

int const& A() {
	return a_const;
}

static int const* b_ptr = 0;

int const& B() {
	return *b_ptr;
}

int main() {
	int const& a_ref = A();
	
	std::cout << "Called A()" << std::endl;
	std::cout << "a_ref: " << a_ref << std::endl;
	
	int const& b_ref = B();
	
	std::cout << "Called B()" << std::endl;
	std::cout << "b_ref: " << b_ref << std::endl;
	
	return 0;
}

If we are to believe that references are simply another name for an object, then converting *b_ptr to a reference should have caused a runtime error. After all, we dereferenced a null pointer, right? The compiler should emit code to prevent this, right?

In an ideal world, this would cause an error — but it does not. The segmentation fault does not come until b_ref is used; indeed, we see “Called B()” in the program output, indicating that B() successfully returned a reference, which was stored in b_ref. Obviously, at runtime there was a null pointer dereference. But we didn’t use a pointer, I hear you saying. We used a reference!

Then please explain this behavior to me. On a language level, sure, references are “names for objects.” But this does not change the fact that the implementation is done using memory addresses — which is fundamentally the same thing pointers do. This helps to explain why we see the behavior of this sample. As I mentioned in my last post, when you convert an expression to a reference type, it’s treated exactly as though you had converted it to a pointer type, with an implicit address-of operator (&). So we can rewrite this function:

int const& B() {
	return *b_ptr;
}

Like this:

int const* B() {
	return &*b_ptr;
}

And it becomes immediately clear why the segmentation fault did not occur here — taking the address of a dereference expression is the same thing as taking the original expression. The & and * cancel out during compilation, and we just return the pointer. Take a look at this example, which is identical to the above example, except that A() is gone, and B() now returns a pointer, with dereferences added in the appropriate places (see it run):

#include <iostream>

static int const* b_ptr = 0;

int const* B() {
	return &*b_ptr;
}

int main() {
	int const* b_ptr = B();
	
	std::cout << "Called B()" << std::endl;
	std::cout << "b_ref: " << *b_ptr << std::endl;
	
	return 0;
}

Identical behavior.

So you can throw the spec at me all you want, but every implementation I’ve tried uses pointer-with-automatic-dereference semantics — if you convert every reference to a pointer, add an address-of operator to every assignment to a reference, and a dereference operator to every use of a reference, you will see identical behavior.

To preempt the “but the compiler can optimize local references” argument, the compiler can do exactly the same with pointers.

// With a reference
int A() {
    int a = 5;
    int& b = a;
    return b;
}

// With a pointer
int B() {
    int a = 5;
    int* b = &a;
    return *b;
}

I’ve heard the argument that the compiler can eliminate the reference in A(). Well, it can also eliminate the pointer in B(). If a local pointer is set to point at another local and the compiler can prove that it will never change, it can optimize it away just as easily as it can optimize away a local reference to a local.

So, this supports my original argument that references store memory addresses in the same way that pointers do, only with automatic dereferencing. They are effectively nothing more than syntactic sugar, allowing you to forget that you’re operating on an object somewhere else in memory.

Database versioning and handling branching

It’s no secret to developers of database-driven applications that trying to version a database schema (and seed data) is a royal pain. Propagating changes from your development environment to a live environment is not something that most version control systems are well-equipped to do. This is further complicated by distributed VCSes, like Git — how do you prevent schema migration chaos when developers can branch on a whim?

I’ve been mulling this issue over for a few months now. There are several development projects I have where database versioning is necessary. Finally, I’ve come up with a process that is not only effective, but actually really simple!

Taking a clue from Git, we can use a digest of the current schema definition file (the SQL script that will create the database structure) as the schema identity. The schema requires a meta-table to describe the identity of the current schema — this will allow for automated schema migration, since the migration script can detect the current schema version. (I usually have a meta-table anyway, for storing site configuration in key/value pairs. The schema version fits in this table fairly well.)

So, let’s say we’re starting a new project. We design a schema, making sure to include this meta-table as well as any seed data required for the application to function. This excludes the “current database schema identity” row, since adding that row in the schema script will cause the identity to change! Then we write a migration script. This script has two functions: load the initial schema, and migrate between schema versions. When performing the initial load, it should follow this by inserting the database schema identity into the meta-table. The identity is of course obtained by digesting the schema file.

Now we are ready to make changes. Let’s say we want to add a column to the table. First, we note what the current schema’s identity is. Let’s call this $SCHEMA_ORIGINAL. We tweak the schema definition file to include this column, and then we obtain the digest of the schema file, calling this $SCHEMA_NEW. Now, we write two migration functions in the migration script: one that will migrate from $SCHEMA_ORIGINAL to $SCHEMA_NEW (an ALTER TABLE ADD COLUMN query) as well as one that will migrate in the opposite direction (ALTER TABLE DROP COLUMN). This will allow us to roll back the database if necessary.

Now, when you ask the migration script to upgrade the database schema, it only has to fetch the database’s current version, digest the schema definition file, and then find a path between those versions using a breadth-first search of the available migrations, starting at the current version.

This technique can even account for merges! When merging branches A and B together, you would resolve any conflicts that arise in the schema definition, and then construct four migration functions: A to merged, merged to A, B to merged, and merged to B. The breadth-first search during migrations means that if you are then switching from branch A prior to the merge to branch B prior to the merge, it may actually be faster to migrate the database through the merge instead of backing up until the point A and B diverged.

It may also be useful to provide a mechanism to tag certain revisions as causing data loss (such as rolling back a column addition). The migration script would then prefer longer migration paths that preserve data over shorter migration paths that destroy it.

There are some downsides to this approach. For one, migration functions will have names that provide little meaning to humans, something like migrate_0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33_to_62cdb7020ff920e5aa642c3d4066950dd1f01f4d(). And another is that the migration script will need to construct an in-memory graph of every migration so that it can perform its search. If the only possible migration path contains a migration that will cause data loss, the script will exhaustively search the rest of the graph looking for an alternative. There’s probably some room for optimization there.

I will likely be coding up an MIT-licensed Python script to implement this algorithm in the coming days, so stay tuned.

VPS.NET experiences

I’ve been using VPS.NET as my hosting provider for two years now (since October 2009). Here’s my experience, the good and the bad.

The building block of the VPS offering is a node: a discrete unit of CPU time, RAM, storage, and bandwidth. These nodes can be deployed as separate servers, or combined together to create a more powerful server. Servers can be upgraded or downgraded by attaching and detaching nodes from them. The CPU/bandwidth changes can be applied without rebooting, but to apply RAM and storage changes, the VPS needs to be rebooted so that the RAM reassignment can happen and so the VPS root volume can be rebuilt. You can bill nodes per month or per day. The daily option is more expensive, but allows you to throw an extra node or two at a server that’s being slashdotted until the traffic subsides — much cheaper in the long haul then monthly nodes. Daily nodes are also a good choice for a throwaway server you’re going to test something on and then scrap. The math suggests that if you’re going to be using the server for longer than two or three weeks, monthly nodes are a better deal.

To further sweeten the deal, nodes are portable across any of their data centers. While you can’t easily migrate deployed VPSes, you can destroy a server in one data center and create a new one in another using the same node. I look forward to the day when we’ll be able to live-migrate running VPSes across the country. Hopefully someone at VPS.NET is working on that…

Initially I had one node and this blog was about the only thing running on the server. Since then, I’ve expanded the services I host to my personal email server (hosting a few domains for friends too), a Git-based personal software forge, some services for Wikimedia, and a few odds and ends. While not a feature unique to VPS.NET, having root on your own server is really nice. Need some software? Install it. Want to tweak the firewall? Go ahead. I’m a hacker, so it’s nice to be able to use my server to just try stuff out and see what works.

Each VPS has CPU and network usage graphs so you can pinpoint busy times and optimize accordingly. There’s also a console for each VPS that connects directly to the offline OS terminal, allowing you to recover from a botched network or ssh config without involving support. While I’d like to claim that I’ve never had to make use of this feature, that would be a lie. I don’t often need it, but it’s there when I do.

Uptime has been pretty good all-around. I’ve had a few outages here and there: a SAN failure that the staff were not properly prepared for, causing a few hours of downtime; a bootloader issue with Debian caused by upgrading to squeeze (arguably not VPS.NET’s fault, as Debian decided to change the boot volume path); and most recently, volume corruption caused by upgrading my node, which the support staff remedied within minutes, but ideally should have never happened. (This makes me a bit reluctant to upgrade again without someone over there babysitting the box to make sure it boots up correctly.) However, these occurrences were not frequent. That’s three distinct issues I can recall since late 2009, two if you attribute the bootloader problem to Debian. This isn’t 100% uptime, but it’s pretty darn close to it, and support is usually quick to respond when problems happen.

I resell a few nodes to some friends and former business partners, and they’ve had very little issues getting started. It takes about 10 minutes to go from purchasing a node to having a running server. When a friend wants another server, I can deliver them their login credentials to a running server within about 15 minutes. That’s pretty cool.

There are a few network-level filtering things one should be aware of. VPS.NET monitors outgoing SMTP (port 25) connections and applies their own spam filtering, rejecting mail if it’s too spammy. Some people might like this since it can protect them from being listed on a spam blacklist if their server gets compromised. Since I run my own mail server and do my own filtering, I wasn’t pleased when this was deployed, especially since it was done so without any communication on their part. Once I figured out what was going on, I was able to opt-out of this filtering by filing a support ticket. I would have preferred more transparency about this change. Finally, as with most hosts, IRC ports are filtered bidirectionally. As I was wanting to use my VPS to run a service that collects data from Wikimedia’s real-time change notification IRC server (essentially just a data stream delivered via IRC) this was a bit of a bummer. They will not budge on this policy, which is disappointing.

Overall, I am a happy customer. Things don’t go wrong often, and when they do support is quick to respond. I’d suggest trying a daily node or two if you want to see if they are a good fit for your hosting needs; they’re only a dollar per day.

What I dislike about C++, part 1: References

I’ve started a new job, for those of you who didn’t know. I’m now coding C++ daily. My relationship with C++ has been distant, simply because I haven’t really ever had a need to use it. However, C and C# are both strong languages of mine, and C++ sits somewhere in the middle: C with classes, C# without garbage collection. (These are rough approximations, and not without caveats.)

There are bound to be things in every programming language that get in the way of productivity. In this post I’ll highlight one of those things in C++: references. References are obfuscating and mostly useless. But first, why would you use a reference?

A reference is effectively a pointer, but this is hidden by the language. You use it like it’s not a pointer, and the compiler turns direct accesses into indirect accesses. For example:

#include <iostream>

int main() {
	int a = 5;
	int &b = a;
	
	std::cout << "a is " << a << std::endl;
	
	b = 6;
	
	std::cout << "a is " << a << std::endl;
}

The output will be:

a is 5
a is 6

Note that I did not say *b = 6;. The reference is treated as though it were not a pointer. Cool… but what’s the benefit?

The solitary benefit I’ve heard from others is that references cannot be null. When you declare a function/method that accepts an argument typed as a reference, it’s not possible to make that reference equivalent to a null pointer.

Ok, so we trade a bit of clarity for a compile-time guarantee that we won’t be dereferencing a null pointer. That’s a good trade, right?

Maybe. Consider this excerpt:

class Foo;

void do_something() {
    Foo *foo = new Foo();
    use_foo(*foo);
    delete foo;
}

This is contrived, yes, and there’s a better way to write this code. But I’m illustrating something here. I’ve told you that Foo is a class, but I haven’t told you the prototype for use_foo(). That’s on purpose. Now, you tell me if the Foo instance is going to be copied. I’ll even tell you that Foo doesn’t overload operator*.

Do you have your answer yet? If you said yes — the logical choice — you’re wrong. If you said no, you’re also wrong. Well, actually, if you said yes or no, you might be wrong. It’s impossible to tell. If use_foo()‘s declaration is use_foo(Foo foo) then a copy will be made using Foo‘s copy constructor (if possible, otherwise it will be a compile-time error). But if the function’s prototype is use_foo(Foo &foo) then we are actually passing in the value stored in the foo variable — a memory address. The object will not be copied. In other words, while it looks like we are dereferencing foo, we are actually doing no such thing.

In C, you can tell pretty much everything you need to know from a call site. In C++, you must know how the function you’re calling is defined too, simply because you have to know when things are references and when they are not. The treatment of what is fundamentally a pointer type as a value type (at the language level) is what causes the uncertainty. You use value-type grammar around references, even though they are not really a value type.

If it weren’t for the existence references, the code above would be perfectly clear. (Well, unless you didn’t have me to tell you that Foo doesn’t implement operator*…)

Don’t get me wrong, C++ still has a lot of good features. It’s just a bit irritating that because of a bad feature (and some other features too, such as some forms of operator overloading), I must know the details of every type and function on a line of code to be absolutely sure what it’s doing.

Bitcoin for Humans series

Anyone who’s been around me recently knows that I love Bitcoin. It’s a really neat idea, but it’s also a very complicated thing. Most people that I talk to don’t really understand how it works and why it was designed the way it was, and therefore have some incorrect ideas and criticisms of Bitcoin. So I am going to do a series of blog posts about it that attempt to reduce its very intricate design to simple analogies and concepts that might not be completely accurate, but that will help the non-technical person understand it a bit more.

If you don’t even know what Bitcoin is, it is a distributed and decentralized currency that operates over the Internet. There is no central bank or issuer; there are no regulators. The network as a whole decides what the rules are, and the source code for the official Bitcoin client is open. You can transfer Bitcoins to anyone, anywhere in the world, in about 10 minutes — and usually for free.

The topics I will cover in my series are:

  • Basic overview: How to send and receive coins.
  • Addresses: What an address actually is, and why you should care.
  • The block chain: What the block chain is, and how this keeps the network secure.
  • Mining: How Bitcoins are created, and why this is not “generating free money.”

If you can think of any other topics you would like me to cover, let me know and I will extend the series. If you want a bit more information right now, this video is a pretty good summary of the project: