
Mapserver Host Hardware Changes



  • Homecoming Team

Hello! My name is Telephone, and you may remember me from such technically-focused posts as:

 

Server Architecture of City of Heroes

 

To begin with, a brief primer on the server architecture of City of Heroes. Homecoming has several main types of hosts used to run the game. We call these the 'dbserver hosts' (which, with the exception of Reunion, were virtualized back in June 2020 - see the above-mentioned post), the 'mapserver hosts', the 'services hosts', and the 'authentication hosts' (also now virtualized).

 

These are in addition to the various other hosts we have, which include our SQL servers, the forums (which have been virtualized since the beginning), our development infrastructure, backup and security infrastructure, and so on. And of course, the VM hosts themselves!

 

Hosts, What Do?

 

Services Hosts (and the Authentication Hosts)

 

These hosts run what we call 'global services', which are used by every shard. For example, the auction server and Architect Entertainment run on the current active services host (only one services host is active at a given time; the others are warm spares). Every shard connects to these services, and the services in turn manage database connections and so on.

 

Similarly, the authentication hosts handle authentication for Homecoming, and work the same way (but are run in isolation for security reasons).

 

DBServer Hosts

 

What we called the 'dbserver hosts' above would be far better named the 'shard hosts'. Every shard has exactly one of these, and it runs a process called (not surprisingly) the 'dbserver'. This process serves as an interface between every other process in the shard and the actual SQL database. In addition, this host runs all the other shard-specific services (such as the arena, petitions, the login queue, LFG, and so on).
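For those who like concrete examples, here's a minimal sketch of the pattern (in Python, with SQLite standing in for the real SQL server - the DBServer class and its API are invented for illustration and are not Homecoming's actual dbserver): every other process hands its queries to a single owner process instead of opening its own database connection.

```python
# Illustrative sketch only: shard processes never talk to SQL directly;
# they hand requests to one 'dbserver', which owns the database connection.
import queue
import sqlite3
import threading

class DBServer:
    """Single thread that owns the SQL connection for a shard."""
    def __init__(self, db_path: str):
        self.requests: queue.Queue = queue.Queue()
        self.conn = sqlite3.connect(db_path, check_same_thread=False)
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        # All SQL runs here, serialized through one connection.
        while True:
            sql, params, reply = self.requests.get()
            reply.put(self.conn.execute(sql, params).fetchall())

    def query(self, sql: str, params=()):
        # Called by other shard processes (mapservers, arena, etc.).
        reply: queue.Queue = queue.Queue()
        self.requests.put((sql, params, reply))
        return reply.get()

db = DBServer(":memory:")
db.query("CREATE TABLE characters (name TEXT)")
db.query("INSERT INTO characters VALUES (?)", ("Telephone",))
print(db.query("SELECT name FROM characters"))  # [('Telephone',)]
```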

 

Even given all the above, these hosts are quite lightweight. Homecoming actually runs no fewer than nine shards (5 Live, 2 Beta, and 2 Prerelease), and sometimes our developers spin up additional temporary private shards for testing. The load these place on our infrastructure is quite minimal.

 

Now, the primary load of City of Heroes comes from the final type of host ...

 

Mapserver Hosts

 

Everyone's favorite, the mapserver!

 

All shards in a given region and realm (realms being Live, Beta, Prerelease, and so on) share the same pool of mapserver hosts. Every single active map is its own process running on one of these hosts, and these processes are the primary load driver for Homecoming.
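As a rough illustration of that model (hypothetical host names and a deliberately naive placement strategy - this is not our actual launcher logic), starting a new map boils down to picking a host from the shared pool:

```python
# Illustrative sketch: every active map is one process, placed on some
# host in the shared pool. Here we just pick the least-loaded host.
from dataclasses import dataclass, field

@dataclass
class MapserverHost:
    name: str
    active_maps: list = field(default_factory=list)

def place_map(pool: list, map_name: str) -> MapserverHost:
    """Start a new map process on the least-loaded host in the pool."""
    host = min(pool, key=lambda h: len(h.active_maps))
    host.active_maps.append(map_name)
    return host

pool = [MapserverHost(f"na-maps-{i}") for i in range(1, 7)]  # six NA hosts
for m in ["Atlas Park (Excelsior)", "Pocket D (Everlasting)", "AE (Torchbearer)"]:
    chosen = place_map(pool, m)
    print(f"{m} -> {chosen.name} ({len(chosen.active_maps)} maps)")
```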

 

You may have seen people elsewhere ask 'why is Homecoming running so many shards when they could lower their costs by running fewer?'. Well, the number of shards actually has little bearing on our costs, because the mapserver hosts are shared by all shards (except Reunion, which has its own mapserver host).

 

Homecoming currently has six physical mapserver hosts in North America, and one physical mapserver host for Europe (for Reunion). For Beta, Prerelease, and development, we use virtual mapserver hosts, as the performance requirements are far lower.

 

Each physical mapserver host (with the current hardware) can handle a maximum of a little over 1,500 players, which is why you may have heard us mention that Homecoming's maximum player capacity is approximately 10,000 in North America and 1,500 in Europe. Of course, when we are at maximum capacity, lag is quite noticeable (though we have not been near maximum capacity in quite some time).
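The arithmetic behind those figures is straightforward (the 1,667 upper bound below is just our own reading of 'a little over 1,500' that makes the NA total land on ~10,000; the real per-host ceiling varies with what players are doing):

```python
# Back-of-the-envelope capacity check from the figures above.
PER_HOST_LOW, PER_HOST_HIGH = 1_500, 1_667  # assumed bounds per host
NA_HOSTS, EU_HOSTS = 6, 1

print(f"NA: {NA_HOSTS * PER_HOST_LOW:,} - {NA_HOSTS * PER_HOST_HIGH:,} players")  # ~9,000 - 10,000
print(f"EU: {EU_HOSTS * PER_HOST_LOW:,} - {EU_HOSTS * PER_HOST_HIGH:,} players")  # ~1,500 - 1,667
```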

 

So, why keep so many mapserver hosts in North America if we aren't using that capacity at all times? For several reasons:

 

  • These hosts need maintenance. We occasionally take one or two of the NA mapserver hosts out of service to perform maintenance on them (and then bring them back the next week). Of course, this currently can't be done in the EU, so maintenance there needs more care.
  • These hosts can fail. Some of you may remember the EU mapserver host failure a few months ago, where we had to perform an emergency redirection of Reunion to the NA mapserver pool. More about hardware failures below.
  • Having spare capacity available helps reduce lag spikes - if there were too few cores or hosts available, players could see their performance impacted by people playing on other maps or even other shards. Imagine if your mission were to lag because of a Hamidon raid on another shard.
  • We often use spare CPU capacity on these hosts to perform certain development tasks (configured to always yield to players, of course). When a map is created or modified by developers, an extremely CPU-intensive process called beaconizing may need to take place. Even with the CPU power available to us, beaconizing a Page release can take many hours.
  • Lastly, it can take days to order a new physical mapserver host (OVH's lead time on this class of host is usually between 3 and 10 days), and the actual installation process is also quite time-consuming.

 

What's Changing

 

Now, after the entire primer above, here's what we're planning to change over the next few months.

 

Our physical mapserver hosts are what OVH calls an 'Advance-5' (A5). These are the most powerful hosts available in their Advance line, and we pay about $360/mo for each one (our pricing was locked in when we set up these A5s back in 2019).

 

Recently, one of our A5s in North America suffered a very bad drive failure. During the reinstallation, we were reviewing OVH's offerings and noticed that they have refreshed their A5s, and that the new A5s are actually significantly better priced ($50-$100/mo cheaper, depending on various discounts).

 

So, we began discussions to see how we could take advantage of that pricing, and just to be sure, we decided to review all of OVH's updated hosting offerings. Upon reviewing them, we found that OVH actually has a new offering which is even more cost-effective than the A5 - enter the SCALE-3. Each SCALE-3 is actually slightly better than two A5s!

 

So, our plan going forward is to replace the six North American mapserver hosts with three SCALE-3 hosts. We've already cancelled the mapserver host with the drive failure and ordered a SCALE-3. We'll replace the remaining North American mapserver hosts in stages (we'll cancel two more next month when we order a second SCALE-3, and then the remaining three when we order the final SCALE-3). Although the immediate financial impact is somewhat painful ($1,100 to spin up a single SCALE-3: a $600 setup fee plus $500/mo for the SCALE-3 itself), the savings are so great that we will see that paid back in just a few months.

 

One minor negative is the lead time on SCALE-3s. Currently, we expect to wait 20 days for our first SCALE-3 to arrive. Of course, the monthly fee for the SCALE-3 only starts once it is actually installed, so even though we have paid for the first month up front, that month won't begin until mid-June.

 

Something else to note is that the SCALE-3s have ancillary storage on them, in addition to their normal storage. We've got plans to use this additional storage to improve our development and deployment architecture, further increasing our reliability and performance. Currently, the majority of our redundant bulk storage is hosted in the EU, so having access to it locally in NA will allow us to improve the efficiency of some of our operations.

 

Can We Save More Money?

 

OVH does allow us to order servers with a 12-month commitment, and doing so eliminates the setup cost and also gives a small discount. This would save quite a bit of extra money, but at this time we've decided not to pursue long commitments on hosting (especially since this is our first SCALE-3 - we need to make sure it has the performance we expect). We'll continue to evaluate this and may take advantage of it in future orders (including possibly the additional two SCALE-3s after this first one).

 

OVH also has a cheaper line of servers, called the Rise servers. Unfortunately, these servers lack a very important feature which OVH calls vRack. We use an OVH vRack to link the entire Homecoming cluster into a single private network; not having vRack would be a complete nonstarter for us.

 

Risks

 

The biggest risk is that reducing the number of physical servers means a failure could be much worse for us (we'd lose a third of our capacity rather than just a sixth). Given the reliability we've had in the past on our servers and our average player load, we feel this is an acceptable risk. Even so, we've made backup plans to mitigate potential risks (including the ability to quickly spin up backup mapserver hosts using our VM infrastructure or even in the cloud).

 

A smaller risk is that the SCALE-3 doesn't pan out to have the performance we expect. In this case, we would likely bite the bullet, accept the setup fee as a loss, and order new A5s to replace our existing A5s. There is a chance the new A5s might also fall short of the performance we expect, but our existing VM hosts are very similar systems (A4s rather than A5s), and we're comfortable based on their performance.

 

Europe

 

Of course, Europe doesn't need a SCALE-3. The existing hardware there is more than sufficient to handle the load, and doubling it would be a waste of money. But, we would very much like to refresh our hardware in Europe, and we'd like to increase our redundancy level there (after the unpleasant experience of the mapserver host there failing).

 

We're still researching the best way to do this, but no matter what we do there will likely be a slight cost increase. Currently, we're paying approximately $750/mo for the EU cluster (but some of that cost also supports NA, because we use EU for offsite backup and other processing).

 

The current possibility (subject to more research) is that we'll replace the existing EU cluster with an OVH INFRA-2 (which would run SQL, the Reunion dbserver, and the other EU services) and two OVH Advance-4s (as mapserver hosts for EU). This would increase our costs by approximately $100/mo, but would significantly increase our EU reliability. There are other possibilities as well which could be pursued without a cost increase, but we're not confident that they will have the player performance and experience we want to offer.

 

Expect a follow-up post on this when we get closer to making a decision on how to handle Europe.

 

TL;DR

 

We're planning to spend a total of approximately $1,800 on hardware upgrade setup fees over the next few months. These upgrades will reduce our costs by several hundred dollars a month. Because there will be some overlap during which we're paying for both the old and new servers, the actual payback time will be more than three months, but should be less than six.

 

Detailed Cost Calculations

  • Current Cost:
    • 6x Base A5s: $331.99 * 6 = $1,991.94
    • 6x Upgraded 1Gb vRack: $23.00 * 6 = $138.00
    • Total = $2,129.94
  • New Expected Cost:
    • 3x Base SCALE-3s: $513.99 * 3 = $1,541.97
    • Total = $1,541.97
  • Savings Per Month = $587.97
  • 3x Initial Setup Cost: $598 * 3 = $1,794
  • Months of Savings to Pay Off Setup Cost = 3.05 Months
    • (but see above as to why it will actually take longer - the rough model below sketches the overlap)
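Here's that rough month-by-month sketch (the staging schedule is an assumption based on the replacement plan described earlier; lead times and proration are ignored):

```python
# Rough payback model using the figures above. Assumed staging:
# month 1: 5 A5s + 1 SCALE-3, month 2: 3 A5s + 2 SCALE-3s, month 3+: 3 SCALE-3s.
A5_COST = 331.99 + 23.00   # base A5 + 1Gb vRack upgrade, per month
S3_COST = 513.99           # base SCALE-3, per month
SETUP = 598.00             # one-time setup fee per SCALE-3

baseline = 6 * A5_COST     # what we pay today
schedule = [(5, 1), (3, 2), (0, 3), (0, 3), (0, 3), (0, 3)]

balance = -3 * SETUP       # start in the hole by the setup fees
for month, (a5s, s3s) in enumerate(schedule, start=1):
    balance += baseline - (a5s * A5_COST + s3s * S3_COST)
    print(f"Month {month}: net {balance:+,.2f}")
# Crosses zero between months 5 and 6 - hence 'more than three, less than six'.
```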

"Hello! My name is Telephone, and you may remember me from such technically-focused posts as:"

 

What the???  Stealing my call out... that I properly stole myself.

 

J/K, Thanks for the info


Appreciate the transparency, as always.  Donated a bit more this time round to help with the new hardware.

 

Glad you didn't go with the Rise servers.  That sounds too much like something out of the Terminator films!


Could this "less lag" have a chance to finally and completely resolve the animation/cancelling that's spreading like a sickness or is that related to specific powers and therefore it's a bug client/related?

  • Homecoming Team

 

59 minutes ago, Anyad said:

"Hello! My name is Telephone, and you may remember me from such technically-focused posts as:"

 

What the???  Stealing my call out... that I properly stole myself.

 

Good artists create. Great artists steal.

 

  

41 minutes ago, ThunderCAP said:

Could this "less lag" have a chance to finally and completely resolve the animation/cancelling that's spreading like a sickness or is that related to specific powers and therefore it's a bug client/related?

 

I'd have to have more details to answer this question, but I don't think this would be related to the CPU on the servers. It would most likely be related to network issues, issues with individual powers, and so on.

 

If you mean true animation cancelling (where you can sneak a power in to break another power's animation, allowing you to execute more powers more quickly), this is generally considered a bug or balance issue and I believe @Captain Powerhouse has recently been extremely diligent in working to resolve these issues (but I am sure he will correct me if I am wrong).

 

If you mean powers are failing to execute, that would most likely be a network or client issue. If the hosts are particularly loaded or (more likely) you are in a raid or otherwise CPU-saturated map, this would be a case that could be improved by the faster hardware. One thing I forgot to mention above is that mapserver processes are not heavily multi-threaded, so each map is generally limited to the power available from a single core (I believe some processing can run on other cores, but the bulk happens on the main thread).
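
A toy demonstration of that constraint (purely illustrative - real mapserver internals are far more involved): with a single-threaded tick loop, a busier map simply takes longer per tick, no matter how many idle cores the host has.

```python
# Illustrative only: one map = one main loop = (roughly) one core.
import time

def map_tick(entities: int) -> None:
    # Stand-in for one frame of AI/physics/power processing on the main thread.
    total = 0
    for i in range(entities * 10_000):
        total += i

def run_map(entities: int, ticks: int = 30) -> float:
    start = time.perf_counter()
    for _ in range(ticks):
        map_tick(entities)
    return time.perf_counter() - start

print(f"quiet map: {run_map(50):.2f}s for 30 ticks")
print(f"raid map:  {run_map(500):.2f}s for 30 ticks")  # ~10x the work on the same one core
```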

 

If you merely mean visual jankiness but things are actually executing correctly, City of Heroes is an older game with a complex animation system and a lot of lag tolerance (when it was released, many people played on dial-up!), and these things can be expected to happen. This doesn't preclude the possibility of actual animation bugs, which we do correct when reported and when we can.

 

If you could provide more information (preferably in the Bug Reports forum, but please feel free to link your post in this thread), we may be able to dig into it further.


Always interesting to get some 'what's going on behind the curtain' information. I wanted to say thank you for the extra effort it clearly takes to maintain the Reunion server for us EU folks (technically I suppose I shouldn't refer to myself as an EU person anymore but don't get me started on that nonsense!).


I found this a very interesting post to read. Thanks for all the details.

I'm especially interested in what you'll do in the long run for Europe; I'm still trying to recruit more people to join Reunion!

 

How much bandwidth does a mapserver use?



So, hm, any of this related to the current outage?

