Upgrading Sajari to Go 1.5
The majority of the Sajari tech stack is now written in Go, so we follow the language's progression pretty closely. This week was a big milestone (for many reasons) with the release of version 1.5. Every major release so far has been a step forward for us, but this one was set to be particularly important given the complete re-write of the garbage collector (GC pauses are why customers occasionally see a query take much longer than normal). As with previous releases, it's been a great step forward for us. Below are some of the key observations so far.
Build time
Build time (compiling our code) is slower with 1.5. Go 1.4 previously built our engine code in ~4.5 seconds; with 1.5 it's closer to 12 seconds. This was expected and will improve in future releases. It does not impact our work significantly.
Index load and recovery time
Sajari memory maps many GB of files into virtual memory (sometimes hundreds of GB) on restart. During this mapping phase we also scan some of the indexes and build in-memory hashes, etc. After switching to Go 1.5, a restart of a server managing ~100GB of indexes immediately showed a big change: the index load/scan time had dropped from 7 minutes to just 44 seconds! This is a great result for us, as the engine cannot serve requests during this reload time.
UPDATE 6th September - We jumped the gun slightly on the load improvements. Turns out this was mainly due to the page cache staying primed. We re-ran the numbers with and without a primed page cache and found the Go 1.5 improvement was closer to 5%, which is still a huge win for us!
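For context, memory mapping an index file in Go looks roughly like the sketch below (Unix-only, via the syscall package). This is a minimal illustration of the approach described above, not our actual loading code; the file name and the rebuild step are hypothetical.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"syscall"
)

// mapIndex memory-maps an index file read-only and returns the mapped bytes.
func mapIndex(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		return nil, err
	}

	// The kernel pages the data in on demand; the mapped region lives outside
	// the Go heap, so the garbage collector never scans it.
	return syscall.Mmap(int(f.Fd()), 0, int(fi.Size()), syscall.PROT_READ, syscall.MAP_SHARED)
}

func main() {
	data, err := mapIndex("index.dat") // hypothetical index file
	if err != nil {
		log.Fatal(err)
	}
	defer syscall.Munmap(data)

	// On restart we would scan parts of the mapped data here to rebuild the
	// in-memory hashes needed to serve queries.
	fmt.Println("mapped", len(data), "bytes")
}
```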
Garbage collection
Sajari has a sizeable memory footprint, so garbage collection has always been a key concern of ours. The bulk of the data in Sajari is stored in indexes that are memory mapped and therefore not subject to garbage collection, but we still use a lot of in-memory indexes and hashmaps, and we generate garbage while processing queries.
Prior to Go 1.5 the GC "stop the world" time could be very large, and the impact on Sajari customers was evident under high load, where the occasional query would randomly take much longer than expected. The 1.5 release has a completely redesigned garbage collection process that aims to "stop the world" for no more than 10 msec at a time and then does much of the heavy lifting for the collection in the background, in parallel with the user program (instead of holding up the user program for the entire process). Along with those welcome changes came other improvements for us, such as the collector no longer needing to scan maps whose keys and values contain no pointers, and much more.
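As a rough illustration of that last point (the types below are made up for this post, not Sajari's real index structures): a map whose key and value types contain no pointers can be skipped by the collector entirely, whereas string keys or pointer values force it to scan every entry.

```go
package main

// posting is a hypothetical pointer-free index entry.
type posting struct {
	docID uint64
	score float32
}

var (
	// No pointers in the key or value types, so the collector can skip
	// scanning this map's contents.
	pointerFree = map[uint64]posting{}

	// String keys and pointer values both contain pointers, so every entry
	// has to be scanned during a collection.
	mustBeScanned = map[string]*posting{}
)

func main() {
	pointerFree[1] = posting{docID: 1, score: 0.9}
	mustBeScanned["doc-1"] = &posting{docID: 1, score: 0.9}
}
```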
Measuring the GC pause
We took some initial test GC measurements to see how things had changed. The key thing we saw is that the majority of GC work now happens outside the stop-the-world (STW) part of the process. We still see some "forced" GC events that do STW for a few hundred milliseconds, but these are rare and so far we've only seen them during the loading phase (which makes sense, as this phase does a lot of data processing).
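For anyone wanting to reproduce this kind of measurement, runtime/debug.ReadGCStats exposes recent pause durations from inside the program (GODEBUG=gctrace=1 prints similar information to stderr). The snippet below is just a sketch; the workload is a placeholder.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// ... run a representative workload here ...

	var stats debug.GCStats
	debug.ReadGCStats(&stats)

	fmt.Println("collections:", stats.NumGC)
	fmt.Println("total pause:", stats.PauseTotal)

	// stats.Pause holds recent pause durations, most recent first.
	for i, p := range stats.Pause {
		if i >= 5 {
			break
		}
		fmt.Printf("pause[%d]: %v\n", i, p)
	}
}
```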
Impact on production
At Sajari we measure latency in two ways:
- Total request time (app engine start to finish; this is what the client experiences)
- Engine request time (query time inside the Sajari engine, e.g. on compute engine instances)
The engine time is measured in the HTTP handler and thus should be the total request time inside the Sajari instance, but we've always noticed it can underreport time compared to what an outside observer experiences. The timer is the first thing started when a new request is handled, so we can only assume the HTTP handler takes its time to actually get around to handling the request. Anyhow, the difference is interesting, because when measuring request latency, Go 1.5 has increased our engine request latency but reduced our total request time, particularly in the 90+ percentile range.
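To make that distinction concrete, the engine-side timer is roughly equivalent to the sketch below: the clock only starts once the handler is invoked, so anything the HTTP server does before that (connection handling, scheduling) never shows up in the engine number. The handler and route names here are hypothetical, not our actual code.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

// timed wraps a handler and records how long the engine spends on a request.
// The clock starts only once the handler is actually invoked.
func timed(name string, h http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		h(w, r)
		log.Printf("%s took %v", name, time.Since(start)) // engine request time
	}
}

func searchHandler(w http.ResponseWriter, r *http.Request) {
	// ... run the query against the engine ...
	fmt.Fprintln(w, "results")
}

func main() {
	http.HandleFunc("/search", timed("search", searchHandler))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```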
Below is the request latency for both measures (total and engine request time), before and after the 1.5 change, for one of our clients (~10 million requests / month).
The above table shows two things. Overall, the request time is pretty stable for the majority of requests; it may have increased slightly, but nothing too crazy. Interestingly, the 99th percentile seems to take noticeably longer inside the engine, but overall it has basically halved. As noted above, we don't believe the engine is actually slower; more likely, due to the way scheduling is handled, the timer is now closer to reality than it was previously.
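For reference, the percentiles quoted here are the usual nearest-rank kind, computed over logged request latencies roughly as sketched below; the sample values are made up and our real metrics pipeline is not shown.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the p-th percentile (0-100) of the sample values using a
// simple nearest-rank method. Illustrative only.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(p/100*float64(len(sorted))+0.5) - 1
	if rank < 0 {
		rank = 0
	}
	if rank >= len(sorted) {
		rank = len(sorted) - 1
	}
	return sorted[rank]
}

func main() {
	// Request latencies in milliseconds, including one GC-affected outlier.
	latencies := []float64{8, 9, 11, 12, 10, 9, 13, 95}
	fmt.Println("p50:", percentile(latencies, 50), "ms")
	fmt.Println("p99:", percentile(latencies, 99), "ms")
}
```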
Overall this is a big win for us. The 99th percentile is where the GC impact is most evident, and Go 1.5 has halved the latency impact on our customers. There are many extremely smart people working on Go, and the improvements in such a short time have been amazing. We'd like to thank them for all their hard work.