
Mapserver Host Hardware Changes


Telephone


  • City Council

Hello! My name is Telephone, and you may remember me from such technically-focused posts as:

 

Server Architecture of City of Heroes

 

To begin with, a brief primer on the server architecture of City of Heroes. Homecoming has several main types of hosts used to run the game. We call these the 'dbserver hosts' (which, with the exception of Reunion, were virtualized back in June 2020 - see the above-mentioned post), the 'mapserver hosts', the 'services hosts', and the 'authentication hosts' (also now virtualized).

 

These are in addition to the various other hosts we have, which include our SQL servers, the forums (which have been virtualized since the beginning), our development infrastructure, backup and security infrastructure, and so on. And of course, the VM hosts themselves!

 

Hosts, What Do?

 

Services Hosts (and the Authentication Hosts)

 

These hosts run what we call 'global services', which are used by every shard. For example, the auction server and Architect Entertainment run on the current active services host (only one services host is active at a given time; the others are warm spares). These services are connected to by every shard, and in turn the services manage database connections and so on.

 

Similarly, the authentication hosts handle authentication for Homecoming, and work the same way (but are run in isolation for security reasons).

 

DBServer Hosts

 

What we called the 'dbserver hosts' above would be far better named the 'shard hosts'. Every shard has exactly one of these, and it runs a process called (not surprisingly) the 'dbserver'. This process serves as an interface between every other process in the shard and the actual SQL database. In addition, this host runs all the other shard-specific services (such as the arena, petitions, the login queue, LFG, and so on).
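
To make that pattern concrete, here's a minimal sketch (in Python, with threads standing in for processes; all names are hypothetical and this is not Homecoming's actual code) of a single dbserver owning the SQL connection on behalf of everything else in the shard:

```python
# Minimal sketch of the dbserver pattern: one process owns the SQL
# connection for the whole shard, and every other process sends its
# queries there instead of talking to the database directly.
# Threads stand in for processes; all names are hypothetical.
import queue
import sqlite3  # stand-in for the real SQL backend
import threading

requests: queue.Queue = queue.Queue()  # mapservers, arena, LFG, etc. enqueue here

def dbserver_loop(db_path: str) -> None:
    """The single owner of the shard's database connection."""
    conn = sqlite3.connect(db_path)
    while True:
        sql, params, reply = requests.get()
        reply.put(conn.execute(sql, params).fetchall())

def shard_query(sql: str, params: tuple = ()):
    """What any other process in the shard would effectively do."""
    reply: queue.Queue = queue.Queue()
    requests.put((sql, params, reply))
    return reply.get()

threading.Thread(target=dbserver_loop, args=(":memory:",), daemon=True).start()
shard_query("CREATE TABLE heroes (name TEXT)")
shard_query("INSERT INTO heroes VALUES (?)", ("Telephone",))
print(shard_query("SELECT name FROM heroes"))  # [('Telephone',)]
```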

 

Even given all the above, these hosts are quite lightweight. Homecoming actually runs no fewer than nine shards (5 Live, 2 Beta, and 2 Prerelease), and sometimes our developers spin up additional temporary private shards for testing. The load these place on our infrastructure is quite minimal.

 

Now, the primary load of City of Heroes comes from the final type of host ...

 

Mapserver Hosts

 

Everyone's favorite, the mapserver!

 

All shards in a given region and realm (realms are things such as Live, Beta, Prerelease, etc.) share the same pool of mapserver hosts. Every single active map is its own process running on one of these hosts, and these processes are the primary load driver for Homecoming.
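
As a rough illustration (this is not Homecoming's actual scheduler; the least-loaded placement policy and all names are assumptions), spawning a map means starting a process on whichever host in the shared pool has the most headroom:

```python
# Illustrative sketch of map placement on a shared regional pool.
# The least-loaded policy and all names are assumptions, not the
# actual Homecoming scheduler.
from dataclasses import dataclass, field

@dataclass
class MapserverHost:
    name: str
    maps: list[tuple[str, int]] = field(default_factory=list)

    def load(self) -> int:
        return sum(players for _, players in self.maps)

def spawn_map(pool: list[MapserverHost], map_name: str, players: int) -> str:
    """Each active map is its own process; place it on the emptiest host."""
    host = min(pool, key=MapserverHost.load)
    host.maps.append((map_name, players))
    return host.name

# All shards in a region share the pool, which is why shard count has
# little bearing on cost -- only total player load matters.
na_pool = [MapserverHost(f"na-map-{i}") for i in range(1, 7)]
print(spawn_map(na_pool, "Atlas Park (Excelsior)", 120))  # na-map-1
```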

 

You may have heard people elsewhere ask 'why is Homecoming running so many shards when they could lower their costs by running fewer?'. Well, the number of shards actually has little bearing on our costs, because the mapserver hosts are shared by all shards (except Reunion, which has its own mapserver host).

 

Homecoming currently has six physical mapserver hosts in North America, and one physical mapserver host for Europe (for Reunion). For Beta, Prerelease, and development, we use virtual mapserver hosts, as the performance requirements are far less.

 

Each physical mapserver host (with the current hardware) can handle a maximum of a little over 1,500 players, which is why you may have heard us mention that Homecoming's maximum player capacity is approximately 10,000 in North America and 1,500 in Europe. Of course, when we are at maximum capacity, lag is quite noticeable (though we have not been near maximum capacity in quite some time).
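
As a quick back-of-envelope check of those figures (the per-host number below is my rough reading of 'a little over 1,500', not an official spec):

```python
# Back-of-envelope check of the capacity figures quoted above.
players_per_host = 1_667   # rough reading of 'a little over 1,500'
na_hosts, eu_hosts = 6, 1

print(na_hosts * players_per_host)  # 10_002 -> 'approximately 10,000' in NA
print(eu_hosts * players_per_host)  # 1_667  -> 'approximately 1,500' in EU
```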

 

So, why keep so many mapserver hosts in North America if we aren't using that capacity at all times? For several reasons:

 

  • These hosts need maintenance. We occasionally take one or two of the NA mapserver hosts out of service to perform maintenance on them (and then bring them back in the next week). Of course, this currently can't be done in the EU, so maintenance there needs more care.
  • These hosts can fail. Some of you may remember the EU mapserver host failure a few months ago, where we had to perform an emergency redirection of Reunion to the NA mapserver pool. More about hardware failures below.
  • Having spare capacity available helps reduce lag spikes - if there were too few cores or hosts available, players could see their performance impacted by people playing on other maps or even other shards. Imagine if your mission were to lag because of a Hamidon raid on another shard.
  • We often use spare CPU capacity on these hosts to perform certain development tasks (configured to always yield to players, of course). When a map is created or modified by developers, an extremely CPU-intensive process called beaconizing may need to take place. Even with the CPU power available to us, beaconizing a Page release can take many hours.
  • Lastly, it can take days for a new physical mapserver host to be delivered (OVH's lead time on this class of host is usually between 3 and 10 days), and the actual installation process is also quite time-consuming.

 

What's Changing

 

Now, after the entire primer above, here's what we're planning to change over the next few months.

 

Our physical mapserver hosts are what OVH calls an 'Advance-5' (A5). These are the most powerful hosts available in their Advance line, and we pay about $360/mo for each one (our pricing was locked in when we set up these A5s back in 2019).

 

Recently, we had a failure on one of our A5s in North America (a very bad drive failure). During the reinstallation, we were reviewing OVH's offerings and noticed that they have refreshed their A5s and that the new A5s are actually significantly better priced ($50-$100/mo cheaper, depending on various discounts).

 

So, we began discussions to see how we could take advantage of that pricing, and just to be sure, we decided to review all of OVH's updated hosting offerings. Upon reviewing them, we found that OVH actually has a new offering which is even more cost-effective than the A5 - enter the SCALE-3. Each SCALE-3 is actually slightly better than two A5s!

 

So, our plan going forward is to replace the six North American mapserver hosts with three SCALE-3 hosts. We've already cancelled the mapserver host with the drive failure and ordered a SCALE-3. We'll replace the North American mapserver hosts two at a time (so we'll cancel two more next month when we order a second SCALE-3, and then the remaining three when we order the final SCALE-3). Although the immediate financial impact is somewhat painful ($1,100 to spin up a single SCALE-3, due to a $600 setup fee and then paying $500/mo for the SCALE-3 itself), the savings are so great that we will see that paid back in just a few months.

 

One minor negative is the lead time on SCALE-3s. Currently, we expect to wait 20 days for our first SCALE-3 to arrive. Of course, the monthly fee for the SCALE-3 only applies once it is actually installed, so even though we have paid for the first month up front, that month won't start until mid-June.

 

Something else to note is that the SCALE-3s have ancillary storage on them, in addition to their normal storage. We've got plans to use this additional storage to improve our development and deployment architecture, further increasing our reliability and performance. Currently, the majority of our redundant bulk storage is hosted in the EU, so having access to it locally in NA will allow us to improve the efficiency of some of our operations.

 

Can We Save More Money?

 

OVH does allow us to order servers with a 12-month commitment, and doing so eliminates the setup cost and also gives a small discount. This would save quite a bit of extra money, but at this time we've decided not to pursue long commitments on hosting (especially since this is our first SCALE-3 - we need to make sure it has the performance we expect). We'll continue to evaluate this and may take advantage of it in future orders (including possibly the additional two SCALE-3s after this first one).

 

OVH also has a cheaper line of servers, called the Rise servers. Unfortunately, these servers lack a very important feature which OVH calls vRack. We use an OVH vRack to link the entire Homecoming cluster into a single private network; not having vRack would be a complete nonstarter for us.

 

Risks

 

The biggest risk is that reducing the number of physical servers means a failure could be much worse for us (we'd lose a third of our capacity rather than just a sixth). Given the reliability we've had in the past on our servers and our average player load, we feel this is an acceptable risk. Even so, we've made backup plans to mitigate potential risks (including the ability to quickly spin up backup mapserver hosts using our VM infrastructure or even in the cloud).

 

A smaller risk is that the SCALE-3 doesn't pan out to have the performance we expect. In this case, we would likely bite the bullet, accept the setup fee as a loss, and order new A5s to replace our existing A5s. There is a chance that the new A5s might also not have the performance we expect, but our existing VM hosts are very similar systems (though they are A4s rather than A5s) and we're comfortable based on their performance.

 

Europe

 

Of course, Europe doesn't need a SCALE-3. The existing hardware there is more than sufficient to handle the load, and doubling it would be a waste of money. But, we would very much like to refresh our hardware in Europe, and we'd like to increase our redundancy level there (after the unpleasant experience of the mapserver host there failing).

 

We're still researching the best way to do this, but no matter what we do there will likely be a slight cost increase. Currently, we're paying approximately $750/mo for the EU cluster (but some of that cost also supports NA, because we use EU for offsite backup and other processing).

 

The current possibility (subject to more research) is that we'll replace the existing EU cluster with an OVH INFRA-2 (which would run SQL, the Reunion dbserver, and the other EU services) and two OVH Advance-4s (as mapserver hosts for EU). This would increase our costs by approximately $100/mo, but would significantly increase our EU reliability. There are other possibilities as well which could be pursued without a cost increase, but we're not confident that they will have the player performance and experience we want to offer.

 

Expect a follow-up post on this when we get closer to making a decision on how to handle Europe.

 

TL;DR

 

We're planning to spend a total of approximately $1,800 on hardware upgrade setup fees over the next few months. These will reduce our costs by several hundred dollars a month. Because there will be some overlap in time during which we are paying for both the old and new servers, the actual payback time will be more than three months, but should be less than six months.

 

Detailed Cost Calculations

  • Current Cost:
    • 6x Base A5s: $331.99 * 6 = $1,991.94
    • 6x Upgraded 1 Gbps vRack: $23.00 * 6 = $138.00
    • Total = $2,129.94
  • New Expected Cost:
    • 3x Base SCALE-3s: $513.99 * 3 = $1,541.97
    • Total = $1,541.97
  • Savings Per Month = $587.97
  • 3x Initial Setup Cost: $598 * 3 = $1,794
  • Months of Savings to Pay Off Setup Cost = 3.05 months
    • (but see above as to why it will actually take longer; worked through in the sketch below)
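
Here's the same arithmetic in runnable form (a sketch using only the numbers above; the overlap note at the end is a simplification):

```python
# Worked version of the cost calculation above.
old_monthly = 6 * 331.99 + 6 * 23.00   # six A5s plus six vRack upgrades
new_monthly = 3 * 513.99               # three SCALE-3s
setup = 3 * 598.00                     # one-time setup fee per SCALE-3

savings = old_monthly - new_monthly
print(f"${savings:.2f}/mo saved")                    # $587.97/mo saved
print(f"{setup / savings:.2f} months of savings")    # 3.05 months

# The staged swap means some months are paid on both old and new hosts;
# each month of overlap on a pair of A5s adds roughly 2 * $354.99 of
# extra cost, which is why the real payback lands between three and
# six months rather than at the naive 3.05.
```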
  • Like 5
  • Thanks 17
  • Thumbs Up 4
  • Banjo 1

  • Cipher featured and pinned this topic

"Hello! My name is Telephone, and you may remember me from such technically-focused posts as:"

 

What the???  Stealing my call out... that I properly stole myself.

 

J/K, Thanks for the info

  • Thumbs Up 1

Appreciate the transparency, as always.  Donated a bit more this time round to help with the new hardware.

 

Glad you didn't go with the Rise servers.  That sounds too much like something out of the Terminator films!

  • Thumbs Up 2

  • City Council

 

59 minutes ago, Anyad said:

"Hello! My name is Telephone, and you may remember me from such technically-focused posts as:"

 

What the???  Stealing my call out... that I properly stole myself.

 

Good artists create. Great artists steal.

 

  

41 minutes ago, ThunderCAP said:

Could this "less lag" have a chance to finally and completely resolve the animation/cancelling that's spreading like a sickness or is that related to specific powers and therefore it's a bug client/related?

 

I'd have to have more details to answer this question, but I don't think this would be related to the CPU on the servers. It would most likely be related to network issues, issues with individual powers, and so on.

 

If you mean true animation cancelling (where you can sneak a power in to break another power's animation, allowing you to execute more powers more quickly), this is generally considered a bug or balance issue and I believe @Captain Powerhouse has recently been extremely diligent in working to resolve these issues (but I am sure he will correct me if I am wrong).

 

If you mean powers are failing to execute, that would most likely be a network or client issue. If the hosts are particularly loaded or (more likely) you are in a raid or otherwise CPU-saturated map, this would be a case that could be improved by the faster hardware. One thing I forgot to mention above is that mapserver processes are not heavily multi-threaded, so each map is generally limited to the power available from a single core (I believe some processing can run on other cores, but the bulk happens on the main thread).
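
To illustrate that last point, here's a simplified sketch (not the actual CoH server code): the simulation for a map runs as one main loop on one thread, so a single busy map can saturate one core no matter how many cores the host has.

```python
# Simplified sketch of a single-threaded map tick loop (not actual CoH
# code). The core simulation runs on one thread; only ancillary work
# can be pushed to other cores.
import concurrent.futures
import time

side_work = concurrent.futures.ThreadPoolExecutor()  # logging, etc.

class Entity:
    def update(self) -> None:
        pass  # powers, AI, movement resolution: all on the main thread

def main_loop(entities: list[Entity], hz: float = 30.0) -> None:
    period = 1.0 / hz
    while True:
        start = time.monotonic()
        for entity in entities:
            entity.update()
        side_work.submit(lambda: None)  # only minor work leaves the main thread
        # If the updates alone exceed the period, this map lags, and
        # extra cores on the host cannot help this particular map.
        time.sleep(max(0.0, period - (time.monotonic() - start)))
```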

 

If you merely mean visual jankiness but things are actually executing correctly, City of Heroes is an older game with a complex animation system and a lot of lag tolerance (when it was released, many people played on dial-up!), and these things can be expected to happen. This doesn't preclude the possibility of actual animation bugs, which we do correct when reported and when we can.

 

If you could provide more information (preferably in the Bug Reports forum, but please feel free to link your post in this thread), we may be able to tell you more.

  • Like 3
  • Thumbs Up 4

Always interesting to get some 'what's going on behind the curtain' information. I wanted to say thank you for the extra effort it clearly takes to maintain the Reunion server for us EU folks (technically I suppose I shouldn't refer to myself as an EU person anymore but don't get me started on that nonsense!).

  • Sad 1
  • Thumbs Up 4
Link to comment
Share on other sites

  • 2 weeks later

I found this a very interesting post to read. Thanks for all the details.

I'm especially interested in what you'll do in the long run for Europe; I'm still trying to recruit more people to join Reunion!

 

How much bandwidth does a mapserver use?

Edited by RogerWilco
  • Like 1
  • Thumbs Up 1

The adventurous Space Janitor reporting for duty. Cleaning the universe since 1992 and Paragon City, the Rogue Isles and Praetoria since 2011.



  • 2 weeks later

So, hm, any of this related to the current outage?

╔═══════════════════════════════════════════════════════════════════════════════════╗

Clave's Sure-Fire Secrets to Enjoying City Of Heroes
Ignore those farming chores, skip your market homework, play any power sets that you want, and ignore anyone who says otherwise.
This game isn't hard work, it's easy!
Go have fun!
╚═══════════════════════════════════════════════════════════════════════════════════╝

Before you go that route, you may want to compare price/performance with the link I've posted below. I've hosted many game servers over the years before I built my own network. I've used OVH for years and they're great at web/email/VM hosting, but they always fell short when I needed to get the most out of a game server. If you have any questions or comments, I'm happy to help; my contact is [email protected]

 

Dedicated Game Server Hosting - Amazon GameLift - Amazon Web Services

 

I should note that they will remove some limits when you call them for a custom setup.

Edited by IIIXJokerXIII

  • City Council
On 6/22/2021 at 1:35 AM, Clave Dark 5 said:

So, hm, any of this related to the current outage?

 

Not at all. The new SCALE-3 has not yet arrived, and the failed host was cancelled and removed from our cluster as of June 1st, several weeks ago.

 

15 hours ago, IIIXJokerXIII said:

Before you go that route, you may want to compare price/performance with the link I've posted below. I've hosted many game servers over the years before I built my own network. I've used OVH for years and they're great at web/email/VM hosting, but they always fell short when I needed to get the most out of a game server.

 

We've actually evaluated AWS before. Unfortunately, their pricing and price/performance is not competitive with what we can get from OVH. Another major issue that we would run into is that the architecture of CoH is not amenable to ramping up and down mapserver hosts (we do have some custom features for scaling and draining on Homecoming thanks to Number Six's awesome work, but we'd have to do a lot of operational work to be able to scale servers the way GameLift expects).

  • Like 2

1 hour ago, Telephone said:

We've actually evaluated AWS before. Unfortunately, their pricing and price/performance is not competitive with what we can get from OVH. Another major issue that we would run into is that the architecture of CoH is not amenable to ramping up and down mapserver hosts (we do have some custom features for scaling and draining on Homecoming thanks to Number Six's awesome work, but we'd have to do a lot of operational work to be able to scale servers the way GameLift expects).

 

If I may chime in on AWS, I believe that service should be avoided at all costs. I've never really been a fan of Amazon, and despite the problem that surfaced with OVH and the mystery packet loss that has somehow gone away (Metronome defenses kicked in? 😛 ), I can get behind the logic of going with OVH.

 

The second item I'd like to inquire about: has an offline backup of all data been securely made, just in case we have a real-world Rularuu, Rikti, or Nemesis type event? Making all that progress only to have it go poof would prove discouraging after almost 2 years.

  • Like 1

  • 1 month later

Only stumbling on this now - I find it very interesting, and it helps explain some of the choices made. Though I'd still argue there are currently too many shards, not because of load but simply because it fences players apart (I mainly play on Reunion, which is a bit special, but player 'dilution' over time is quite palpable, and I'd expect the same on some other shards given their player counts).

 

I had previously briefly used and considered OVH services at a smaller scale, and it never looked like quite a competitive option there (among a few other things). But at this angle, scale, and use case (clustering, multiple locations, etc.), it definitely seems to make a ton more sense as a middle ground between more bare-bones alternatives and overly expensive cloud-focused IaaS platforms like the one mentioned above.

 

I'm intrigued by the mention of 'beaconizing' here and what it entails; there is a page on OuroDev mentioning its usage, but that's it. Is it like a pre-release generation step, or server-side pre-caching? What's so important about it?

  • Confused 1

1 hour ago, Fira said:

Only stumbling on this now - I find it very interesting, and it helps explain some of the choices made. Though I'd still argue there are currently too many shards, not because of load but simply because it fences players apart (I mainly play on Reunion, which is a bit special, but player 'dilution' over time is quite palpable, and I'd expect the same on some other shards given their player counts).


If you dig back into the posts from the early days...  There was a lot of load on the earliest servers (low performance/inexpensive hosting plans?), and they were faced with a choice between opening more or having players stuck in queues forever and suffering performance issues once they did get to log in.  And once a shard is open, telling players they have to find a new home because their old one is going dark isn't going to make you popular.  (And name collisions are a thing.)

It's a classic rock-and-a-hard-place dilemma and there are no easy solutions.

  • Thumbs Up 1

Unofficial Homecoming Wiki - Paragon Wiki updated for Homecoming!  Your contributions are welcome!
(Not the owner/operator - just a fan who wants to spread the word.)


  • City Council
1 hour ago, Fira said:

I'm intrigued by the mention of 'beaconizing' here and what it entails; there is a page on OuroDev mentioning its usage, but that's it. Is it like a pre-release generation step, or server-side pre-caching? What's so important about it?

 

Beacons are the means by which AI can traverse maps efficiently. On live, there was actually a process called BeaconServer which worked to beaconize bases during play - which is one of the reasons live had such Draconian limits on base items, because generating beacons is a very expensive process (and the cost is not linear, especially when cramming many objects into a small area).

 

Homecoming does not run BeaconServer and allows much more extensive base-building; the cost for this is that there can be no real combat in bases.
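
To give a feel for why the cost is non-linear (a toy model with made-up numbers, not the actual BeaconServer algorithm): building the AI traversal graph means testing connections between beacons, and the candidate pairs grow roughly with the square of the beacon count, so cramming twice as many objects into an area roughly quadruples the work.

```python
# Toy model of why beacon generation cost is super-linear: the number
# of candidate beacon-to-beacon connections to test grows roughly with
# the square of the beacon count. (Made-up numbers, not BeaconServer.)
def candidate_pairs(n_beacons: int) -> int:
    return n_beacons * (n_beacons - 1) // 2

for n in (100, 200, 400, 800):
    print(f"{n:>4} beacons -> {candidate_pairs(n):>7,} pair checks")
# 100 beacons ->   4,950 pair checks
# 200 beacons ->  19,900 pair checks (2x beacons, ~4x work)
# 400 beacons ->  79,800 pair checks
# 800 beacons -> 319,600 pair checks
```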

 

 

32 minutes ago, Doc_Scorpion said:

If you dig back into the posts from the early days...  There was a lot of load on the earliest servers (low performance/inexpensive hosting plans?), and they were faced with a choice between opening more or having players stuck in queues forever and suffering performance issues once they did get to log in. 

 

In the earliest days of Homecoming, we had no issues spinning up plenty of hosts to handle mapserver load (we were using high-performance VMs until we moved to OVH). The problem we encountered is that player load on a shard is not linear - no matter how much power you throw at it, due to the fundamental design of CoH's database system, a shard quickly reaches a saturation point between 2,000 and 3,000 players.

 

Homecoming's shards can handle roughly 2,500 players per shard with some lag (Reunion only 1,500 with its current setup, because it has only one mapserver host); even allocating the largest VMs available (32 cores at the time), we were unable to push a shard to 3,000 players before the dbserver collapsed (just 8 cores on the dbserver host were able to handle 2,000 players; I can't recall if we bumped it more to reach 2,500 at the time).

 

Homecoming's peak was 9,998 active players over Memorial Day weekend 2019 (believe me, we wish we had hit that magic 10,000, just because).

  • Like 3
  • Thanks 3

Thanks! So basically pre-generating nav meshes for AI before release? I'd have thought this would have been a thing on live and in most games; pretty interesting if such a design choice was made primarily for base fighting.

 

(Also, to clarify per Doc's answer: I'm just reflecting on the current state. I realize it's historical, and obviously shard mergers would cause a lot of issues and aren't straightforward; I'm not particularly suggesting it should be done.)

 

Edit: Hell, double-thanks for fixing Reunion while I'm reading this

Edited by Fira
as above

  • City Council
1 hour ago, Bionic_Flea said:

Yay!  Other than savings over time, what else can players expect from the Scale-3s?  Faster load times; less lag; more responsive /AH?  Or something else, maybe?

 

A good question - unfortunately, I have to be the bearer of bad news with respect to most of what you mention:

 

Faster Load Times

 

Load times are almost entirely on the client and network side of things. The mapserver hosts actually keep the entirety of the City of Heroes dataset in memory (you may have heard occasional references to 'shared memory corruption', usually causing Ouro issues but also likely responsible for the recent unplanned restart of Reunion earlier this week, where Mercy Island was having issues loading).

 

Because of this, starting a new mapserver process on the server side is very fast (it might take a short time if a new instance of a large city zone map is being spun up, but mission maps are usually much smaller, and the city zones don't tend to start new processes often). The bulk of the time is spent loading the data on your PC and transferring the current state of the map over the network to you.
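
Here's a sketch of the shared-memory idea using Python's standard library (the real servers do this natively in their own code; the segment name and payload here are made up): the dataset is loaded once per host, and each new mapserver process merely attaches to it rather than reloading it.

```python
# Sketch of the shared-memory idea (names and payload made up; the
# real servers do this natively): load the dataset once per host, and
# let each new mapserver process attach instead of reloading it.
from multiprocessing import shared_memory

# Done once per host at startup:
dataset = b"...imagine gigabytes of parsed game data here..."
shm = shared_memory.SharedMemory(create=True, size=len(dataset), name="coh_data")
shm.buf[: len(dataset)] = dataset

# Done by each new mapserver process -- an attach, not a reload:
view = shared_memory.SharedMemory(name="coh_data")
print(bytes(view.buf[:10]))  # reads the shared copy directly

view.close()
shm.close()
shm.unlink()  # free the segment when the host shuts down
```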

 

There might be a tiny improvement from the faster CPU in initial map setup, but I don't think it will be measurable.

 

Less Lag

 

This is definitely something that could improve, but it depends heavily on the source of lag - the servers can't do anything about network lag. That said, sometimes very heavily loaded maps do peg out the CPU (a given map is mostly single-threaded on the servers), so during Hami raids or other extremely busy events, there might well be some improvement here.

 

We've also got some other development work in the pipeline which I am not at liberty to divulge, but which could improve lag as well! (And not just server lag - we've got some ideas on network lag which may bear fruit).

 

More Responsive /AH

 

Wentworth's has elected to spend their money on new monocles this year instead of improving the market experience.

 

More seriously, the /AH issues are fundamental architectural issues in how the entire system works. There's definitely a desire to improve the experience, but throwing hardware at it won't improve it much. At some point in the future the system will probably get an overhaul to move to a more modern database architecture and do a general cleanup.

 

We do occasionally discover low-hanging fruit in the AH system and when we do, we try to fix those issues to improve the experience, but the system is badly in need of an overhaul.

 

 

Other Notes

 

OVH decided to redesign how networking works on their high-performance servers. It actually took us a good part of today to get the public networking set up on the new host (our private vRack worked much more quickly, thankfully) and pass it through to the mapserver host VM within, but everything seems to be working now. It's very likely that at next week's restart we'll put one of the NA shards entirely on the new host to see how it works out.

  • Thanks 4
  • Thumbs Up 1

On 8/17/2021 at 9:44 PM, InfamousBrad said:

 

It also never worked. (Sauce: tried playing a mastermind during a base raid.)

I was part of several large hero-vs-villain base raids on Champion back in the day. Not saying it was perfect, but we did run some. So saying it never worked isn't accurate. It just didn't work well. And it's clearly not an option on Homecoming.

Edited by RageusQuitus2
  • Thumbs Up 1

  • Widower locked, unfeatured and unpinned this topic