OpenSimulator Load Test 2 Report

Hi folks.  As some of you will know, we’ve recently started doing OpenSimulator load tests for the second OpenSimulator Community Conference (OSCC 2014) taking place on the weekend of November 8th/9th.

In this context, a load test means getting as many people as possible to log on to the keynote regions (where the keynote speeches are given) so that we can see how the system performs under heavy load.  The keynote regions are

  • Keynote 1
  • Keynote 2
  • Keynote 3
  • Keynote 4

The tests take place every Tuesday at 8pm British Summer Time (12pm Pacific Time, 3pm US Eastern Time, 9pm Central European Summer Time) up until the conference. Getting real people to attend is very valuable since they (and their viewers) can behave in all sorts of ways that one doesn’t expect.  However, if there aren’t enough real people available then the load can be supplemented by bots.

This year, I hope to write up some reports of these tests when I get the time.  I think they show the challenges faced in reliably reaching higher avatar concurrency and the tools and code changes that evolve to meet these challenges.

One thing I wanted to do this year is to improve the data that we get from these tests for later analysis.  In a system as complex as OpenSimulator, identifying why something went wrong under load is as big a challenge as actually fixing the problem.

We already have experimental statistical logging in OpenSimulator that can capture a range of statistics (e.g. threads available in the threadpool, root agents logged in) at regular intervals.

However, the challenge is then to analyze this data.  To this end, I spent some time last week writing a set of Python scripts (known as Osta) that can produce summaries and graphs from statistical logs.  The tools are still very rough around the edges but they were enough to give some good insights as to the behaviour of the simulator during the load test.

In the graphs below, samples are plotted along the x axis; a sample was taken every 5 seconds, so the test period was just over an hour.
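As a rough illustration of the kind of processing involved (the Osta scripts themselves are not shown here, and the function names below are illustrative, not part of the actual tools), converting sample indices to elapsed time and summarising a logged stat series is straightforward given the 5-second interval:

```python
# Sketch: work with stat-log samples taken every 5 seconds, as described
# above. Names and data layout are illustrative assumptions, not the
# actual Osta API.

SAMPLE_INTERVAL_SECS = 5

def sample_to_elapsed(sample_index):
    """Return (minutes, seconds) elapsed at a given sample index."""
    return divmod(sample_index * SAMPLE_INTERVAL_SECS, 60)

def summarize(series):
    """Min/max/mean summary of one logged stat, e.g. root agent counts."""
    return {"min": min(series), "max": max(series),
            "mean": sum(series) / len(series)}

# A test period of "just over an hour" corresponds to roughly 720+ samples.
print(sample_to_elapsed(720))  # 720 samples * 5s = 3600s -> (60, 0)
```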

Turning to the analysis itself: our peak number of connected avatars for this test was 291.  The vast majority of these, over 200, were bots.

Connections were distributed fairly evenly, though practically all real users were on Keynote 1.

You can see from this and the previous graph that on occasion large numbers of avatars (around 50) disconnected simultaneously.  This was because instances of the bot test program (pCampbot in the OpenSimulator distribution) crashed at the Mono level, causing large numbers of bot connections to time out simultaneously.  This is one problem that needs to be tackled.
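Mass disconnection events like these can be picked out of the root agent series automatically; a minimal sketch, assuming a per-sample list of root agent counts (the threshold value is an illustrative choice):

```python
# Sketch: flag samples where the root agent count dropped sharply between
# consecutive 5-second samples, as happens when a pCampbot instance
# crashes and its bot connections time out together. Threshold and data
# layout are illustrative assumptions.

def mass_disconnects(root_agents, threshold=40):
    """Return (sample_index, drop) pairs where the count fell by >= threshold."""
    return [(i, prev - cur)
            for i, (prev, cur) in enumerate(zip(root_agents, root_agents[1:]), 1)
            if prev - cur >= threshold]

counts = [290, 291, 289, 240, 241, 190, 191]
print(mass_disconnects(counts))  # [(3, 49), (5, 51)]
```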

It’s interesting to correlate the root agents graph with the scene frame times for each keynote region.

The peaks represent points at which users would have experienced extreme movement lag because of the time that the scene loop was taking to complete each frame.  Judging by eye, there may be some correlation with the mass bot timeouts.  However, the analysis tools need to be improved to see if the high frame times occurred exactly when the disconnections happened or slightly before or after.
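The lead/lag question could be probed with a simple cross-correlation over the sampled series; a sketch, under the assumption that both stats share the same 5-second sample grid:

```python
# Sketch: find the sample lag at which two stat series (e.g. frame time
# and disconnection events) correlate most strongly. Pure Python, no
# external dependencies; the series used here are illustrative.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def best_lag(a, b, max_lag=5):
    """Lag (in samples) of b relative to a that maximises correlation."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: pearson(a[max(0, -lag):len(a) - max(0, lag)],
                                       b[max(0, lag):len(b) - max(0, -lag)]))

# b is a copy of a shifted two samples later, so the best lag is 2.
a = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
b = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
print(best_lag(a, b))  # 2
```

A positive result would suggest the frame-time peaks preceded the disconnections; a negative one, the reverse.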

Another interesting graph is one which shows how many inbound packets were being received from connected viewers.

As you can see, the number of inbound packets processed crashed to 0 at two points for Keynote 1.  Again, this is very strongly correlated with the peaks in frame time on that same region.  It’s also correlated with a peak in threads used from the SmartThreadPool for Keynote 1, as shown below.

In fact, the maximum number of threads allocated to this pool was 128 and we hit this limit at about sample 500.  This would stop many inbound UDP packets from being processed, leading to an extremely laggy region from the user’s point of view.
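Spotting that the pool was pinned at its ceiling can also be automated; a sketch, assuming a per-sample series of STP thread counts and the 128-thread limit mentioned above:

```python
# Sketch: find spans of samples during which the SmartThreadPool thread
# count sat at its configured ceiling (128 in this test). The data
# layout is an illustrative assumption.

def saturated_spans(thread_counts, ceiling=128):
    """Return (start, end) sample index pairs (end exclusive) at the ceiling."""
    spans, start = [], None
    for i, n in enumerate(thread_counts):
        if n >= ceiling and start is None:
            start = i
        elif n < ceiling and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(thread_counts)))
    return spans

print(saturated_spans([100, 128, 128, 90, 128]))  # [(1, 3), (4, 5)]
```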

Whilst this was certainly a problem and the number of STP threads has been increased for the next load test, it doesn’t really explain why there was a simultaneous crash in IncomingUDPReceivesCount.  This stat is collected at a very low level of inbound UDP processing and uses the runtime IOCP threadpool (which has no capacity problem) rather than the STP threads.  It will be interesting to see if this same problem occurs in today’s load test now that the number of STP threads has been increased.

12 comments for “OpenSimulator Load Test 2 Report”

  1. Matt Lehmann
    July 23, 2014 at 10:24 pm

    It’s not surprising to me to see this point of failure, given that a thread is allocated for each inbound UDP packet. Just by right-clicking on an object a client will create numerous threads in the server UDP handler.

  2. justincc
    July 24, 2014 at 6:02 pm

    Many messages are processed on their own threadpool thread but not all. As we’ve previously discussed elsewhere, one reason for this is to avoid holding up other packet processing if one task takes longer than expected.

    I think there may be room for better efficiency if we can guarantee that some messages will never take very long to process (in which case they can just be made async and run on the main inbound thread). However, I doubt this is the real problem and I think that the third load test (analysis to be published later) shows this, as the thread limit there was not hit but there were still issues.

    I suspect it comes down now to sheer volume of UDP traffic but I also think there are ways to reduce this.

  3. July 25, 2014 at 11:08 am

    Hi Justin,

    Recently, thanks to a tip you placed somewhere (probably on the opensim-users mailing list, but I don’t recall exactly where), I moved key simulators on my company’s grid to async_call_method = UnsafeQueueUserWorkItem (instead of the more usual SmartThreadPool). This seemed to make a huge difference, especially on a region which had a number of scripts firing HTTP requests every other second — I was quickly running out of network sockets otherwise, as well as threads (at some point I had over 150 active threads…).
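    For reference, the setting being discussed lives in OpenSim.ini; a sketch (option names and valid values should be checked against the OpenSim.ini.example shipped with your version):

```ini
; OpenSim.ini sketch -- verify against your OpenSim.ini.example
[Startup]
    ; How OpenSimulator runs asynchronous work items; values seen in this
    ; discussion include SmartThreadPool, Thread and UnsafeQueueUserWorkItem
    async_call_method = UnsafeQueueUserWorkItem
```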

    I wonder if as part of your tests you’re also going to experiment with UnsafeQueueUserWorkItem versus SmartThreadPool and publish the results.

    Thanks for doing this. Statistics, and the ability to process them, are quite important for making configuration decisions based on objective data, instead of ‘feelings’ 🙂

  4. Gavin Hird
    August 1, 2014 at 6:06 am

    I experimented with the effect the async_call_method setting had on my system (OS X 10.9.4 and mono 3.2.7) and found that the thread setting yielded the best result and most responsive sims (28 sims running on the same server on OSG, plus a mirror standalone on a second server).

    During the startup of the sims there is a phase where the SmartThreadPool setting logs a bunch of messages like “OpenSim.Framework.Util Run threadfunc 120 (Queued 4, Running 44)”, where the system seems to be both thread constrained and CPU constrained, as the mono-sgen process never gets higher than around 240% utilization on the i7 processor.

    With the thread setting, when reaching this phase, mono-sgen just allocates over 1900 threads and runs the i7 at around 730% for the duration of this phase (meaning all 8 cores are working on executing threads for mono-sgen).

    It would be interesting to run the load test on my standalone and see where it breaks. Are the load scripts available somewhere?


  5. Gavin Hird
    August 1, 2014 at 6:18 am

    On a note to the above, I believe the effect of setting async_call_method to thread on OS X is that you leave the threading to Grand Central Dispatch, which is built into the system and highly optimized.

    The code for GCD’s implementation in libdispatch is licensed by Apple under the Apache license.

  6. justincc
    August 1, 2014 at 6:19 pm

    Interesting. I don’t use OSX much at all so I hadn’t heard about Grand Central Dispatch. As you probably know, the primary aim of using the threadpool is to avoid the cost of setting up new threads so if OSX doesn’t suffer this then simple Thread could be a better choice (though the SmartThreadPool has been extended to provide some debugging information for threads that take too long to service their tasks, etc.).

    I generate synthetic load via the pCampbot utility that comes with OpenSimulator. The wiki page has more details about how to set it up. Unfortunately, it suffers from hard Mono crashes at the moment when many bots are run simultaneously (e.g. >= 50), apparently due to a race condition when reading data. I hope to do some work to get to the bottom of this very soon since it’s critical for load testing.

  7. Gavin Hird
    August 2, 2014 at 8:42 am

    I did some preliminary tests with the pCampbot utility on my standalone. With async_call_method set to thread I could log in 35 concurrent bots (some timeouts occurred during login), and with async_call_method set to SmartThreadPool I could log in 44 concurrent bots out of 50 generated, so some timeouts occurred at login here too. I could in addition log in my usual avatar. Perhaps spacing the bot logins better would make it easier to complete successful logins for all the bots?
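    The login-spacing idea amounts to something like the following (a purely illustrative sketch; `login` is a placeholder for whatever actually connects one bot, and the delay value is an assumption, not a pCampbot option):

```python
# Sketch: stagger N bot logins with a fixed delay between each, rather
# than connecting them all at once. `login` is a hypothetical callback
# that connects one bot; the delay value is illustrative.

import time

def staggered_start(n_bots, delay_secs, login):
    """Call login(i) for each bot index, sleeping delay_secs between calls."""
    for i in range(n_bots):
        login(i)
        if i < n_bots - 1:
            time.sleep(delay_secs)

started = []
staggered_start(5, 0.0, started.append)  # zero delay here, just for the demo
print(started)  # [0, 1, 2, 3, 4]
```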

    For both settings CPU load never exceeded 160% and process memory for the mono-sgen process was under 1600 MB. Spare time for the region with the bots logged in was between 12 and 14 ms.

    I could move my own avatar comfortably around (no lagging) and teleporting to other regions felt completely normal as did load times. For these sims the spare time typically was in the 18ms range which is normal.

    I did see quite a few more http time-outs with async_call_method set to thread, but apart from that everything seemed normal.

    The pCampbot console logged problems retrieving wearable assets and (partly) baking failures for the bots for both settings. It also logged giving up on region handshakes.

    I ran pCampbot in a VMware virtual machine running OS X 10.9.4 and mono 3.2.7 on a different computer from the server. I don’t think it was resource constrained, but I could be wrong on that.

    Anyway that was fun and useful on my part. Any hints on the http timeouts and adjustments that could be made on that?

  8. Gavin Hird
    August 6, 2014 at 9:05 am

    I found that when using async_call_method = thread the system was running out of open files per process, as a large number of threads accessed flotsam cache files concurrently. For any *nix you’d have to adjust the “ulimit -n” parameter.

    On OS X the /etc/launchd.conf file needs a line containing “limit maxfiles 2048 65536” where 2048 in this case is the number of allowed open files per process, and the higher number the open files per system. The file needs to be created if it does not exist.
    Alternatively the file can be created at ~/launchd.conf for the user running OpenSim.exe if you don’t want it as a systemwide setting.

  9. justincc
    August 9, 2014 at 1:22 am

    Interesting stuff, Gavin. I’ve included your quote at [1], hope you don’t mind. If you haven’t read that page already (I suspect you have) then it does contain a lot of other general performance information, though not much yet on http timeouts or similar.

    I hope to have more info and improvements in this area soon.


  10. Gavin Hird
    August 10, 2014 at 9:18 am

    Don’t mind at all. Just glad if my findings can help others. Yes, I stumbled across the Performance page and I will watch for updates. 🙂
