Three Months with Go

Why Go?

For the last couple of years I've been interested in finding a language that gives me many of the things I like about Python in a much more performant form. Go seemed like a plausible candidate, and I was interested in implementing a server that I could practice performance tuning on, so I gave it a try with at least one commit per day for three months.

This blog post covers the things I liked and didn't like after using Go for 10-20 hours per week and producing a ~8000 line application (blog post on that later!). It also draws some comparisons with Python, since the other language I'm using regularly (Scala) is a little too different to be an interesting comparison.

The project was a server implementing AMQP 0-9-1, the most notable competitor in that space being RabbitMQ. RabbitMQ is written in Erlang, which can double its memory usage during GC, so it seemed like if I could match its speed my server would have an advantage in the number of messages it could hold in memory.

If you're interested, the server is called dispatchd.

Goroutines and Coroutines

NOTE: If you're already familiar with Go's concurrency model and aren't interested in Python's coroutines, consider skipping to the performance section below.

The features of Go I like the most are channels and goroutines. In Python I've gotten very used to having many concurrent operations in flight using coroutines, where a function yields control until an async operation completes. A little detail on how this works in Python will help with the contrast:

import json
from tornado import gen, httpclient

@gen.coroutine
def process(url):
    client = httpclient.AsyncHTTPClient()
    resp1 = yield client.fetch(url)  # suspend until the fetch completes
    resp1_json = json.loads(resp1.body)
    resp2 = yield client.fetch(resp1_json['url'])

The yield statements suspend the current function's state and return control to the caller. The caller can use select(), epoll(), or another efficient method for listening to all of the outstanding requests. When one of them completes, the suspended function is resumed and the yield expression evaluates to the result of the async call.

This style of programming allows a ton of concurrency with only one thread. You can have thousands of suspended coroutines waiting on responses and as long as you aren't CPU bound it works great. Glyph from the Twisted project wrote a great post on why thread-based concurrency is bad, which is worth a read even if you want to do multi-threaded programming.

There's a lot of complexity in this model and a few problematic parts:

  • A decorator (function wrapper) is needed to interact with the event loop. You have to know if every function is a coroutine or not and make sure you use the correct calling convention (yield vs normal call).
  • Either every library you use has to be coroutine-aware or you have to resort to using thread pools. Doing part of your work in coroutines and part in threads isn't very clean, so a lot of time is spent either isolating that work or using less well-tested coroutine-aware libraries.
  • In Python 2.x, to return a value from a coroutine you have to use raise gen.Return(value), which becomes very annoying if you want to return inside of a catch-all try block.
  • Any function which blocks the CPU for a long time prevents all other coroutines from running. This is a big place where cooperative multitasking breaks down.

The Go model is much simpler from a user's perspective. There are many different goroutines—green threads, essentially—which are multiplexed across a much smaller number of operating system threads. When a goroutine that is running on an operating system thread reaches a point where it would block (mutexes, I/O, etc) it potentially gives control to one of the other goroutines.

The only real disadvantage compared to the single-threaded async model is that you have to worry about shared state. However, the advantages far outweigh the disadvantages:

  • You can call out to whatever Go code you want and the interleaving of processing happens automatically
  • Since you can have multiple operating system threads and don't have to explicitly relinquish control it's much harder for one bad actor to lock up your entire process
  • You are likely also using channels, which tend to encourage systems that are pipelined and where you wouldn't have two goroutines writing to the same objects (my server, sadly, was not one of these)

The knowledge that you aren't tying up an OS thread is really freeing. In my server I had many places where I needed a monitor function that periodically checked something. In Go that's as simple as:

func checker() {
  for { // loop forever, checking the condition and then sleeping
    checkCondition()
    time.Sleep(...)
  }
}

func main() {
  go checker() // run checker() in a different goroutine
  // ... the rest of the server keeps running here
}

Channels

The other important part of Go's concurrency model is channels. A channel is like a pipe: values are written to it by some goroutines and read by some (probably different) goroutines. Channels have an optional buffer so that writers can avoid blocking. They're also one of the only places in Go where you get type-safe generic containers (slices and maps being the others). Here are a few examples of how they work:

// Make a channel which sends/receives bools with a buffer size of 1
var someChan chan bool = make(chan bool, 1)

// write a value to someChan, block if the buffer is full and
// there is no goroutine waiting.
someChan <- true

// read a value from someChan into input, or block if there
// is no value available
var input = <-someChan

These are the basic operations you can do. As a concrete example, I used them in my server to send work to the goroutine loop which was responsible for framing data and writing it to the network.

// write to someChan if there is a goroutine waiting
// OR the buffer has room, otherwise do nothing
select {
case someChan <- true:
default:
}

Select allows you to run through a series of channel operations and perform the first one that can proceed (or the default case, if present, when none can). There are all sorts of uses for this, but the one I used most often was an optional write.

My server has a Consumer type which needs to check whenever there might be new messages in the queue it is bound to. It uses a channel with a single-item buffer to decide whether it needs to check for new items. Whenever an event occurs that might mean there is a new item, the pattern above is used to signal the Consumer. Since the buffer holds only one entry and we only write when it isn't full, the consumer never builds up a backlog of potentially spurious signals.

You may be wondering at this point: why am I using a boolean channel as a signal rather than putting the items into the channel itself? A server which needs to receive messages, route them to queues, and then consume those messages and send them to other servers seems at first glance like the perfect use-case for channels and goroutines. The difficulty is that I needed control over my queue of messages so that I could keep stats and save messages to disk if needed. A Go channel is completely opaque, so I couldn't risk leaving any of my messages waiting in one when there were no consumers.

The solution I ended up with was a linked list of queue messages that is queried by the Consumers when they receive a signal. The queue signals each Consumer when a new message is added, and a Consumer signals itself if the last time it tried to get a message it succeeded, as shown in the sketch below.
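
Here's a minimal sketch of that wake-up pattern. The names (Consumer, ping, maybeWakeUp) are invented for illustration, the delivery logic is elided, and real code would guard the list with a mutex:

import "container/list"

type Consumer struct {
  ping  chan bool  // one-slot buffer meaning "there may be new messages"
  queue *list.List // stand-in for the server's linked list of messages
}

// maybeWakeUp is the optional write from the select example above:
// signal the consumer unless a signal is already pending.
func (c *Consumer) maybeWakeUp() {
  select {
  case c.ping <- true:
  default: // buffer full; a wake-up is already queued
  }
}

func (c *Consumer) run() {
  for range c.ping {
    if front := c.queue.Front(); front != nil {
      c.queue.Remove(front)
      // ... frame front.Value and write it to the network ...
      c.maybeWakeUp() // last attempt succeeded, so check again
    }
  }
}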

Performance Tuning

One of the goals of my project was to spend some time doing performance tuning. The standard library had everything I needed for that, and more that I still haven't gotten to. Go includes a library, net/http/pprof, which installs HTTP handlers into your server that serve CPU, memory, and blocking profiles.
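
Hooking it up is just a blank import for its side effects. A minimal sketch, assuming you serve the profiles on a separate port (6060 here is the conventional example, not anything the server requires):

import (
  "net/http"
  _ "net/http/pprof" // registers handlers under /debug/pprof/ on the default mux
)

func main() {
  // profiling pages are now at http://localhost:6060/debug/pprof/
  go http.ListenAndServe("localhost:6060", nil)
  // ... start the real server ...
}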

Fairly early on in development I used pprof to render a web page with profiling information about my server. Here is the relevant section:

[pprof profile graph: time spent in read vs. write system calls]

It made sense that I would spend a lot of time in system calls since my server was mainly reading in messages and then writing them out again. What was interesting was that I was spending 3x as long on writes as I was on reads.

I discovered that I was writing to the network three times for every message, so instead of doing that I wrote the data to a buffer and then did a single write. The time spent in system calls stayed the same, but the read/write ratio improved and my server throughput doubled.
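
The fix amounts to assembling the whole frame in memory first. A sketch of the idea, with an invented writeFrame and a simplified three-part frame layout:

import (
  "bytes"
  "net"
)

// writeFrame collects the pieces of an outgoing message in a buffer
// and hits the network with one Write instead of three.
func writeFrame(conn net.Conn, header, body, frameEnd []byte) error {
  var buf bytes.Buffer
  buf.Write(header)
  buf.Write(body)
  buf.Write(frameEnd)
  _, err := conn.Write(buf.Bytes()) // one syscall per message
  return err
}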

The other big performance win came from the blocking pprof page. This page shows the amount of time spent blocked at the various places that can block: acquiring mutexes and waiting on channel writes, in my case.
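
One caveat if you try this yourself: the block profile is empty unless you opt in at startup. Something like:

import "runtime"

func init() {
  // record every blocking event so the /debug/pprof/block page has data;
  // a rate of 1 has some runtime cost, so tune it for production use
  runtime.SetBlockProfileRate(1)
}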

I was investigating a really confusing performance issue I'd had for more than a month: I could easily break 20k messages per second incoming, but I seemed to top out at 8k outgoing on my Mac Mini. There was no reason for this to be the case; the consumer side is a simpler and shorter code path than the producer side, since the producer has to do routing.

The blocking page told me that the consumer was blocked waiting for the signal that told it there might be new messages. A bit more digging found that I was only pinging the consumer when a new message was added. Since the Consumer's signal channel had a single-item buffer, if multiple messages were added to the queue at once it would only process one of them, unless more messages came in while it was processing.

I added code to have the consumer signal itself every time it successfully processed a message (the self-signal in the sketch above), and the consumer-side throughput doubled.

At this point I was generally beating RabbitMQ using Rabbit's own performance testing tool for transient messages, and I owe it largely to the tools in Go's standard library.

Misc Likes

  • The encoding/decoding support using annotations on structs is great. I was able to parse the AMQP XML spec into nice objects really quickly, and the resulting code was way cleaner than the DOM-querying approach I had been using in Python. Similarly, I had a fairly easy time producing custom JSON for my server's admin interface (see the sketches after this list)
  • The built-in text templating was pretty good: it was fairly easy to do what I needed and easy to understand. I had a problem with not being able to suppress newlines, but that will be fixed in Go 1.6.
  • Error handling was more verbose than I'd prefer, but I really liked knowing that I could reason about exactly what errors could happen and when. It's a feeling of freedom very similar to the freedom from shared state that you get in single-threaded programming. Also, since I use code coverage tools, having to handle each error means that my test coverage is much more thorough than if I were using catch-all error handlers.
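
To illustrate the struct-tag decoding: a minimal sketch with invented field names (the real AMQP spec structs have more fields), where the xml tags drive parsing and the same structs carry json tags for the admin interface:

import "encoding/xml"

type Method struct {
  Name  string `xml:"name,attr" json:"name"`
  Index int    `xml:"index,attr" json:"index"`
}

func parseSpec(specXML []byte) ([]Method, error) {
  var doc struct {
    Methods []Method `xml:"class>method"` // every <method> inside a <class>
  }
  if err := xml.Unmarshal(specXML, &doc); err != nil {
    return nil, err
  }
  return doc.Methods, nil
}

And a taste of the text templating, with a made-up stats template rather than anything from the server:

import (
  "os"
  "text/template"
)

var statsTmpl = template.Must(template.New("stats").Parse(
  "queue {{.Name}}: {{.Depth}} messages\n"))

func printStats() {
  statsTmpl.Execute(os.Stdout, struct {
    Name  string
    Depth int
  }{Name: "orders", Depth: 42})
}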

Misc Dislikes

  • Visibility is tied to packages. My server has a number of tightly coupled structs, and I really wanted certain fields to be visible only to the struct's methods so that I wouldn't accidentally access a field which needed a mutex without acquiring it. The only way to do this was to move each struct and its methods into a standalone package and use interfaces to break circular dependencies. I think my code is better for being broken up this way (especially for testing), but it really seems like overkill compared to having visibility modifiers on the fields.
  • Since slices are already special, it would be nice to have list-comprehension syntax to shorten map/filter style operations. I got used to the verbose error-checking syntax, but this continues to frustrate me (see the sketch after this list)
  • It seems like the vendoring story isn't completely settled yet, and I really prefer to have explicit dependency versions
  • Only slices, maps, and channels are generic. In my server I only had one untyped data structure (the linked list backing the queue), and it made me worry about what programming would be like if I had more. In principle this is the same as a dynamically typed language, so it isn't enough to stop me from using Go
  • The built-in coverage support is not super useful. You can only test a single package at a time while using coverage, so you have to write a script to test each package (which is slow) and then merge the resulting coverage files. I assume this will get fixed with time, and I'm considering contributing a change myself
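
For reference, here's the kind of loop I mean: a filter-and-map that a comprehension would make a one-liner (a made-up example, not code from the server):

// squareEvens keeps the even numbers from nums, squared.
// In Python: [n * n for n in nums if n % 2 == 0]
func squareEvens(nums []int) []int {
  out := make([]int, 0, len(nums))
  for _, n := range nums {
    if n%2 == 0 {
      out = append(out, n*n)
    }
  }
  return out
}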