In part 1, I talked about the hardware that we used for the conference and our failover strategy. In this blog post, I’m going to talk about how we went from struggling to get 80 avatars into the keynote area for the conference to easily accommodating our planned capacity of 220.
In virtual environments that run the Second Life protocol, a large event is often held over a 4 region area in order to spread the processing load over 4 independent simulator instances rather than 1, or even over multiple different machines rather than a single server.
For the OpenSimulator Community Conference, every region would run on the same machine. However, there was still value in spreading users over multiple regions. Although an OpenSimulator instance launches threads with extreme enthusiasm for all sorts of different tasks, there are still a number of single-threaded tasks that can potentially act as bottlenecks. For example, there’s a single thread that processes incoming UDP messages from viewers, another that sends messages back out to viewers and a third that co-ordinates aspects of the scene itself, such as physics and avatar movements. There is room for improvement here (for instance, there has already been experimentation with processing physics on a separate thread) but such work is highly complex. At this point, it’s much easier to spread the load between different regions.
To further ease performance issues, we also prevented avatars from crossing between the regions, as region crossing is currently a heavyweight process and not always reliable, especially in situations where source and destination regions are highly loaded. We also instituted a scheme so that most conference attendees could only enter the keynote region to which they were assigned, partly in order to eliminate any extra load generated by users teleporting between them.
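As a rough illustration of the assignment idea, here is a minimal Python sketch that spreads attendees evenly over four keynote regions and checks whether an avatar may enter a given one. The region names, the round-robin rule and both function names are assumptions made up for this example; the post does not describe how the real scheme was implemented, and the exemptions implied by "most" attendees are left out.

```python
from itertools import cycle

# Hypothetical region names, used only for illustration.
KEYNOTE_REGIONS = ["Keynote 1", "Keynote 2", "Keynote 3", "Keynote 4"]


def assign_regions(attendees, regions=KEYNOTE_REGIONS):
    """Assign each attendee to one region in round-robin order."""
    return {name: region for name, region in zip(attendees, cycle(regions))}


def may_enter(assignments, attendee, region):
    """Allow entry only to the region the attendee was assigned to."""
    return assignments.get(attendee) == region


if __name__ == "__main__":
    attendees = [f"avatar{i}" for i in range(220)]
    assignments = assign_regions(attendees)
    for region in KEYNOTE_REGIONS:
        count = sum(1 for r in assignments.values() if r == region)
        print(region, count)
```

With the planned capacity of 220 avatars and four regions, an even split like this puts 55 avatars in each region.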
Even with all these measures, we were really struggling with performance in the beginning. Getting 80 avatars into the keynote regions sent CPU load skyrocketing, straining even our 24 core system. There was a real worry on my part that we would have to shard the keynotes (i.e. have two identical copies of the regions and relay the presentations from one region to another). Understandably, nobody was enthusiastic about that – it would have been a real ding on the sense of everybody being in a single virtual place, as well as causing some significant organizational difficulties.
So to tackle these performance issues we instituted weekly load tests from May right up until the conference itself. Anybody was invited to come along and help stress test the environment. Because the infrastructure to process avatar registrations was not yet in place, most people entered the regions via the Hypergrid.
Find, Fix, Stress
Over these weeks, there were three major activities that we had to carry out in order to improve performance. Firstly, we had to find the performance bottlenecks. Secondly, the performance improvements themselves had to be devised, implemented, debugged and then tested under load. Lastly, we had to extend existing test tools to create a suitable synthetic bot load on the system.
I’m going to say a little bit about each of these things in turn.
Very broadly speaking, there are two kinds of bugs. Firstly, there are the bugs suffered by a single user with a set of steps that will reproduce the problem every single time. These are not necessarily simple, but at least the developer can recreate them and sooner or later pin them down to a particular place in the code.
Then there are the bugs which only occur under certain conditions such as heavy user load, unanticipated combinations of client behaviour or unpredictable network response times. In this case, it’s often obvious to the user when there is a problem (e.g. my avatar keeps freezing) but often not at all obvious why that problem is occurring. Moreover, these problems are often extremely difficult to recreate outside of that particular combination of events.
It’s the second kind of bug which really challenged us on the technical side for the conference. You can get some traction on such issues with an expert knowledge of the system, and many fixes were made this way, particularly as we had the opportunity week by week to observe the effects of changes.
But it was also necessary to start measuring many new internal statistics (e.g. number of inbound UDP messages received per second, number of messages waiting to be handled by the system, number of different UDP messages sent by each connection). This is the kind of data that splashes out if you run the command “show stats all” on the simulator console. There is also an experimental feature to record statistical information every 5 seconds for later analysis (“debug stats record start|stop”).
This extra information helped us work out which aspects of the system were associated with performance problems and get a better grasp on system behaviour in general. However, even now, it’s still the case that much of this information is probably very difficult to interpret without a deep knowledge of the underlying mechanisms.
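To make the idea of periodic statistics recording a little more concrete, here is a minimal Python sketch of counters that are sampled at a fixed interval and turned into per-second rates. It is not the OpenSimulator implementation (which is written in C#); the class name, method names and the counter name are hypothetical, and the only detail taken from the text above is the 5 second interval.

```python
from collections import defaultdict


class StatsRecorder:
    """Accumulates named counters and records per-second rates per interval."""

    def __init__(self, interval_seconds=5.0):
        self.interval = interval_seconds
        self._counters = defaultdict(int)   # running totals
        self._last_snapshot = {}            # totals at the previous sample
        self.samples = []                   # (timestamp, {name: rate}) pairs

    def increment(self, name, amount=1):
        """Called from the code being measured, e.g. once per inbound packet."""
        self._counters[name] += amount

    def sample(self, timestamp):
        """Record the rate of each counter since the previous sample."""
        rates = {}
        for name, total in self._counters.items():
            previous = self._last_snapshot.get(name, 0)
            rates[name] = (total - previous) / self.interval
        self._last_snapshot = dict(self._counters)
        self.samples.append((timestamp, rates))
        return rates


if __name__ == "__main__":
    recorder = StatsRecorder()
    for _ in range(500):
        recorder.increment("inbound_udp_messages")
    print(recorder.sample(timestamp=5.0))
```

In a real server, sample() would be driven by a timer and the samples written to a file for later analysis; here the timestamp is passed in by hand so the example stays deterministic.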
Over the course of five months of load tests, we made many changes to OpenSimulator. These changes addressed both raw performance issues (e.g. handling more avatars per region) and issues that appeared only under heavy load (e.g. mesh sometimes not being received by avatars teleporting in when a large number of other people were already connected).
One issue in particular was the handling of incoming avatar movement messages. Most viewers (clients) connected to OpenSimulator will send through a constant stream of AgentUpdate UDP packets, approximately 10 every second. These transmit changes to the avatar’s body and head rotation, camera position, etc.
Many of these packets are identical or contain only very small changes (e.g. the avatar head rotation has changed by a fraction of a fraction of a degree). OpenSimulator was already discarding identical packets but only at a fairly late stage, and it was always processing packets where the changes were tiny compared to the last processed packet.
Hence, we started discarding packets at a much earlier stage, both those which were identical and those where the change from the last packet was so small that it was insignificant. This radically improved performance – we went from 80 avatars consuming more than half the available cycles of our 24 CPUs to those same 80 connections barely taking up 1 CPU.
This was the point at which we knew our server would be able to handle the planned conference load and it was a big relief! It also goes to show that in open-source, there’s nothing quite like making yourself “eat your own dogfood” – we had committed to put on a conference in OpenSimulator and so were highly motivated to spend the enormous time and effort necessary to get performance to where it needed to be.
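The early discarding of AgentUpdate packets described above can be sketched roughly as follows. This is a simplified Python illustration rather than the actual OpenSimulator C# code: the field names, the tolerance values and the UpdateFilter class are all assumptions made up for the example. The one idea taken from the post is that each incoming update is compared against the last update that was actually processed and dropped if it is identical or differs only insignificantly.

```python
from dataclasses import dataclass

# Illustrative thresholds only; the real values are not given in this post.
ROTATION_TOLERANCE = 0.01
POSITION_TOLERANCE = 0.05


@dataclass(frozen=True)
class AgentUpdate:
    body_rotation: tuple     # quaternion (x, y, z, w)
    head_rotation: tuple     # quaternion (x, y, z, w)
    camera_position: tuple   # (x, y, z)
    control_flags: int       # e.g. movement keys currently pressed


def _max_abs_diff(a, b):
    """Largest per-component difference between two equal-length tuples."""
    return max(abs(x - y) for x, y in zip(a, b))


def is_significant(new, last):
    """Return True if 'new' differs enough from 'last' to be worth processing."""
    if last is None:
        return True
    if new.control_flags != last.control_flags:
        return True
    if _max_abs_diff(new.body_rotation, last.body_rotation) > ROTATION_TOLERANCE:
        return True
    if _max_abs_diff(new.head_rotation, last.head_rotation) > ROTATION_TOLERANCE:
        return True
    if _max_abs_diff(new.camera_position, last.camera_position) > POSITION_TOLERANCE:
        return True
    return False


class UpdateFilter:
    """Drops identical or insignificant updates before any further processing."""

    def __init__(self):
        self._last_processed = {}  # agent id -> last update that was accepted

    def accept(self, agent_id, update):
        last = self._last_processed.get(agent_id)
        if not is_significant(update, last):
            return False  # discard early, nothing else needs to run
        self._last_processed[agent_id] = update
        return True
```

Comparing against the last processed update, rather than the last received one, means that a series of tiny changes still gets through once they add up to something noticeable.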
Having people come in week by week to stress test our changes was invaluable. There’s absolutely no substitute for having real people connecting to the simulator using all sorts of different networks to build confidence that everything was going to work in the conference itself.
However, even with a fixed time for the tests and week by week publicity, we couldn’t get anywhere near enough real connections to match our planned 220 avatar target.
Therefore, we had to turn to a synthetic load, both to supplement real connections at load tests and to allow individual developers to at least approximate a high load when few real people were available.
We already had a test tool bundled with OpenSimulator called pCampbot, which creates a number of libopenmetaverse external client connections to stress test various aspects of the simulator (e.g. you can make such bots continuously teleport around until a failure does (or doesn’t) happen).
However, the existing pCampbot code was very awkward to use in conjunction with real connections and in a situation where bots would have to be added and removed over a number of regions. Hence, we made a number of enhancements to this tool both to make it easier to manage bot connections and to introduce new types of behaviour (e.g. get all bots to occupy a sit target).
My hope is that this tool will be useful in the future for people to independently test their OpenSimulator installations. However, this does require me to seriously improve the documentation at the pCampbot OpenSimulator wiki page. Please feel free to tell me if that kind of thing would be useful – otherwise these things have a tendency to slip down the long priority list.
In the next post, I plan to move onto some of the organizational aspects of putting on a conference in OpenSimulator and virtual worlds in general, such as grid management, region layout, planning committees, the people you need, etc. Stay tuned!