City Council Telephone Posted January 23

Hello! I'm Telephone, and you may remember me from previous technical posts.

As many of you are likely aware, performance on Excelsior during peak times has been less than optimal for some time, and the increase in player count since the license announcement has not improved things. We've spent a lot of time over the last couple of years profiling the shards, and even more over the last couple of weeks, in an attempt to resolve the performance issues, but we're now at a point where we need to actually upgrade our hardware, or, to be more specific, our SQL host hardware.

The SQL Queue

The issues we are encountering are caused by what the City of Heroes™ server software calls the SQL Queue. Everything players do is ultimately committed to a back-end database. Under normal operation, most of these operations are both parallelized and asynchronous, and they commit quickly (within milliseconds), but some operations are much larger and take more time to perform. Certain operations also have stricter timing requirements to maintain database integrity; these often come with what is called a barrier. When a barrier occurs, the entire SQL queue must be drained before continuing, meaning our normal large pool of asynchronous operations has to stop and wait on the barrier.

One of the biggest barrier culprits (until a very recent fix by @Number Six) was the disbanding of a large league, which is why you may have noticed that at the end of a Hamidon raid the shard often seemed to lag for some time when the league was disbanded. While we were able to find a solution to this barrier issue, there are other operations where the barrier can't be removed without significant rearchitecting.
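As a rough illustration of why a single barrier stalls everything behind it, here is a toy model in Python. It is only a sketch under stated assumptions: the `Op` class, the `drain_time_ms` function, the four-worker pool, and all the timings are made up for this example and are not taken from the actual server code.

```python
"""Toy model of a parallel commit queue with barrier operations (illustrative only)."""
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    cost_ms: float          # hypothetical time the commit takes
    barrier: bool = False   # True if the queue must drain before it runs


def drain_time_ms(ops, workers=4):
    """Return the time at which the last operation finishes."""
    lanes = [0.0] * workers  # min-heap of times at which each worker is free
    for op in ops:
        if op.barrier:
            # Wait for every in-flight operation, run alone, then resume.
            end = max(lanes) + op.cost_ms
            lanes = [end] * workers
        else:
            # Hand the operation to whichever worker frees up first.
            free_at = heapq.heappop(lanes)
            heapq.heappush(lanes, free_at + op.cost_ms)
    return max(lanes)


def workload(barrier):
    # One slow commit, a few quick ones, a 2 ms operation, then more quick ones.
    return [Op(10), Op(1), Op(1), Op(1), Op(2, barrier=barrier)] + [Op(1)] * 4


print("without barrier:", drain_time_ms(workload(False)))
print("with barrier:   ", drain_time_ms(workload(True)))
```

In this made-up workload the 2 ms operation is harmless when it runs alongside everything else, but as a barrier it has to wait for the slow 10 ms commit, and every quick commit queued behind it waits too.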
When the shard is very busy and there are a lot of other large operations taking place, a barrier can cause the entire shard to lag for several seconds. This often turns into a vicious cycle: some of those other large operations may have their own barrier, or there may be database conflicts, in which case the entire SQL operation is rolled back, rebuilt by the server, and sent to the database again with another barrier. The queue does eventually drain, but it could take many seconds or even minutes until everything finally unwinds. If it gets particularly bad, the shard may enter what's called Overload Protection, where (among other load-shedding measures) new logins are temporarily forced to queue, even though the shard has not reached its player limit.

Throw More Hardware At It

Homecoming has been in operation for nearly five years now, and in that time the size of our databases and their indices has grown significantly. Our existing primary North America SQL hosts (of which we have two) are legacy OVH Advance-3 servers, with Xeon D-2141 CPUs (8 cores, 16 threads, 2.2-2.7 GHz), 64 GiB of RAM, and two NVMe drives (in a mirrored configuration). Excelsior's database alone is over 100 GB in size, and during peak time sees many thousands of transactions per second, so we have simply outgrown the hardware we have been running on.

We're planning to upgrade both of them to the newest iteration of the OVH Advance-3, which is a Ryzen 5900X (12 cores, 24 threads, 3.8-4.7 GHz), 128 GiB of RAM, and four NVMe drives (in a RAID10 configuration). The main benefits we expect are that the doubling of RAM will hold many more indices in memory, and moving to four drives instead of two will double our I/O capacity. We're upgrading the host that Excelsior (and Everlasting) are on first, and if that works well we will upgrade the other one (hosting Torchbearer, Indomitable, and global services) next month.

Power Underwhelming?
It's possible (but unlikely) that even this upgrade would not be enough to resolve the issues on Excelsior. The most likely cause of this would be insufficient RAM; the newest model of OVH Advance-3 has a maximum of 128 GiB of RAM, so we would need to go up to an Advance-4 to get more RAM (this would also increase the number of cores, but at a lower clock speed). This would be a somewhat significant cost increase, but if it becomes necessary we will explore it.

There's also the possibility that the issue can't be resolved by more hardware; some of the SQL Queue problems are fundamental to the system design. We haven't stopped looking at fixes from the software side, and while some of the barrier operations must remain barriers, there are potentially other fixes we can do to reduce load and database contention.

TL;DR

We believe our SQL hosts are no longer up to the task of handling our shards as they have grown, and we need to upgrade their hardware. We're planning to spend approximately $600 this month and $600 next month in one-time charges on upgrading our primary North America SQL hosts. Our ongoing costs will increase by about $250-$300 per month ($125-$150 per month per database host; the amount is a little difficult to calculate due to taxes and SQL licensing costs).
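For anyone who wants to check the numbers in the TL;DR, the arithmetic can be written out as a few lines of Python. The dollar figures are the approximate ones quoted in the post; the variable names are just illustrative.

```python
# Approximate figures quoted in the post; names are illustrative only.
ONE_TIME_USD_PER_MONTH = 600
UPGRADE_MONTHS = 2                          # this month and next month
DATABASE_HOSTS = 2                          # primary North America SQL hosts
PER_HOST_MONTHLY_INCREASE_USD = (125, 150)  # low and high estimate per host

one_time_total = ONE_TIME_USD_PER_MONTH * UPGRADE_MONTHS
monthly_low, monthly_high = (
    DATABASE_HOSTS * usd for usd in PER_HOST_MONTHLY_INCREASE_USD
)

print(f"one-time charges: about ${one_time_total}")
print(f"ongoing increase: about ${monthly_low}-${monthly_high} per month")
```

Two hosts at $125-$150 each gives the $250-$300 monthly range stated above, and the two one-time charges add up to roughly $1,200.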
EmperorSteele Posted January 23

... So what you're saying is we need to stop casting Barrier so much. OMG incarnate nerfs incoming!!

But no, for realsies, thanks for the info and heads up =)
Glacier Peak Posted January 24

That's news! Thanks for explaining it in detail.

I lead weekly Indom Badge Runs / A newer giant monster guide by Glacier Peak / A tour of Pocket D easter eggs! / Arena All-Star Accolade Guide! Best Post Ever....
carroto Posted January 24

Are we at a much higher online user count now than pre-shutdown? Surely hardware has improved significantly in the last 10+ years. How were they able to make all this go back in the day?

Make your own proc chance charts
Lunar Ronin Posted January 24

1 hour ago, carroto said:
Are we at a much higher online user count now than pre-shutdown? Surely hardware has improved significantly in the last 10+ years. How were they able to make all this go back in the day?

There weren't as many people nor as much activity back on the live servers. I've found that people seriously misremember how small the live servers were in comparison to Excelsior and even Everlasting, likely skewed by modern MMORPGs. I've seen a few people over the years say that Infinity was a small server back on live. It was the third most populated server back then, just behind Freedom and Virtue.
Abraxus Posted January 24

Hopefully, going from a RAID 1 to a RAID 10 array on the disks will help write speeds, and combined with the increase in processor power and the additional memory, the resulting performance increases should help. SQL will typically use every bit of resource you can throw at it.

What was no more, is REBORN!
PeregrineFalcon Posted January 24

1 hour ago, Lunar Ronin said:
There weren't as many people nor as much activity back on the live servers. I've found that people seriously misremember how small the live servers were in comparison to Excelsior and even Everlasting, likely skewed by modern MMORPGs.

How would you know how many people were logged onto the retail servers? I don't remember Cryptic or Paragon ever showing those numbers. Did you work for either company? If not then how do you know what the numbers were?

Being constantly offended doesn't mean you're right, it means you're too narcissistic to tolerate opinions different than your own.
General Idiot Posted January 24

Personally I distinctly remember getting lag spikes of the kind described here back on live. And given the discussion of large leagues I can't help but wonder if that barrier thing was part of what made incarnate trials lag so horribly when they first released too. BAF especially I remember having upwards of thirty seconds delay on some things because the server was just that far behind.

The server having to stop and empty the SQL queue strikes me as one of those things that doesn't cause a noticeable delay on a small test server, even the open betas, but then causes a significant one on a live server with many more players doing all sorts of other things that could also be causing barriers here and there. So it's possible this was an issue on live too and just not considered enough of a problem to invest time and resources into solving it for no direct return.

Remember, our team here isn't beholden to a publisher expecting return on any money they spend. So they can spend time fixing stuff like this where in a more commercial environment there'd be a bean counter somewhere saying no.

When life gives you lemonade, make lemons. Life will be all like "What?"
[Admin] Emperor Marcus Cole: STOP!
[Admin] Emperor Marcus Cole: WAIT ONE SECOND!
[Admin] Emperor Marcus Cole: WHAT IS A SEAGULL DOING ON MY THRONE!?!?
macskull Posted January 24

2 hours ago, Lunar Ronin said:
There weren't as many people nor as much activity back on the live servers. I've found that people seriously misremember how small the live servers were in comparison to Excelsior and even Everlasting, likely skewed by modern MMORPGs. I've seen a few people over the years say that Infinity was a small server back on live. It was the third most populated server back then, just behind Freedom and Virtue.

There were more people, but they were also spread out across more than a dozen servers. Before the current Homecoming population spike it wasn't uncommon for fully half the online players to be on a single server (Excelsior), which was never the case back on live. Even now when it's bumping up against 1500 players Excelsior still has about a third of the total online players.

"If you can read this, I've failed as a developer." -- Caretaker
Proc information and chance calculator spreadsheet (last updated 15APR24) / Player numbers graph (updated every 15 minutes) / Graph readme
@macskull/@Not Mac | Twitch | Youtube
Monty Haull Posted January 24

IRL, I do a lot of work with large data sets, so I actually understand this. Thanks for posting details on the inner workings.

Help control the Rikti population. Have your Rikti Monkey spayed or neutered.
Lunar Ronin Posted January 24

5 hours ago, macskull said:
There were more people, but they were also spread out across more than a dozen servers. Before the current Homecoming population spike it wasn't uncommon for fully half the online players to be on a single server (Excelsior), which was never the case back on live. Even now when it's bumping up against 1500 players Excelsior still has about a third of the total online players.

Yes, that's what I meant. None of the live servers were as busy as Excelsior is, and the majority weren't as busy as Everlasting is.
carroto Posted January 24

When I think of all the people crowding around the AH on Freedom back in the day it's hard to imagine that our population is that much higher on Excel. I know people don't do that anymore due to /ah, but I've not seen, say, a Rikti invasion on HC attended by anywhere near the number of people I remember from back then. Selective memory perhaps.
Doc_Scorpion Posted January 24

1 hour ago, carroto said:
When I think of all the people crowding around the AH on Freedom back in the day it's hard to imagine that our population is that much higher on Excel. I know people don't do that anymore due to /ah, but I've not seen say a Rikti invasion on HC attended by anywhere near the number of people I remember from back then.

Yeah, me too. I've seen nothing on Excelsior that even remotely resembles what I regularly saw on Freedumb back in the day in terms of players and activity. Not in invasions, not in GM hunts, not in the LFG channel, not anywhere.

The OP is correct that compared to the rest of the market CoX was a small population, niche game. But I'm not buying the claim that Homecoming even approaches anything resembling the numbers back in the OG days.

Unofficial Homecoming Wiki - Paragon Wiki updated for Homecoming! Your contributions are welcome! (Not the owner/operator - just a fan who wants to spread the word.)
Lunar Ronin Posted January 24

26 minutes ago, Doc_Scorpion said:
Yeah, me too. I've seen nothing on Excelsior that even remotely resembles what I regularly saw on Freedumb back in the day in terms of players and activity. Not in invasions, not in GM hunts, not in the LFG channel, not anywhere. The OP is correct that compared to the rest of the market CoX was a small population, niche game. But I'm not buying the claim that Homecoming even approaches anything resembling the numbers back in the OG days.

The servers back on live had a hard cap of 1,500 players. Excelsior has had more than that the past couple of weeks.
Senbonbanana Posted January 24

34 minutes ago, Lunar Ronin said:
The servers back on live had a hard cap of 1,500 players.

Where did you get this info? There were 11 servers hosted in the USA. A max player count of 16,500 for the entire United States does not sound right, at all.
Lunar Ronin Posted January 24

3 minutes ago, Senbonbanana said:
Where did you get this info? There were 11 servers hosted in the USA. A max player count of 16,500 for the entire United States does not sound right, at all.

From the Homecoming Team directly, here. It's been mentioned by other staff members on these forums before as well, and on the Homecoming Discord server over the years. Again, the live game was smaller than people remember.
City Council Number Six Posted January 24

1 hour ago, Lunar Ronin said:
The servers back on live had a hard cap of 1,500 players. Excelsior has had more than that the past couple of weeks.

Uh, what? Not sure where that came from but it's incorrect. Best guess based on supporting evidence is that the cap was set to 2,200 for most servers on live... though in the 2011-2012 time frame they never got close to that, except for the last month of course.

With the limitations due to the way the code works I'm not sure how stable they would have actually been in practice with 2,200 concurrent. Those limitations were not unknown to them, as there are even comments pointing out areas likely to become a performance problem, but they didn't have a clear solution to fix them. Their hardware was probably less consolidated than ours is, so it's possible they could have sustained that. Based on our experience I can't see it ever being stable for long with more than 2,000 or so on a single shard.

We lowered the cap on Excelsior to 1,500 prior to the license announcement as a precautionary measure, as the shard had already been having hiccups almost nightly whenever the big raids disbanded.
Senbonbanana Posted January 24

22 minutes ago, Lunar Ronin said:
From the Homecoming Team directly, here. It's been mentioned by other staff members on these forums before as well, and on the Homecoming Discord server over the years. Again, the live game was smaller than people remember.

According to Wikipedia, in September 2008 the game had just shy of 125,000 subscribers (CoX didn't go free to play until 2011) in the US and Europe (15 servers total; 11 USA and 4 Europe). ~125k scattered across 15 servers (with higher concentrations on some servers, lower on others) sounds a lot closer to what I remember things being like when CoX was in its prime.
City Council Number Six Posted January 24

125,000 subscribers does not equate to 125,000 concurrent by any stretch. People play at different times of the day and do not necessarily even log in every day.

For example, right now we're hitting about 4,200 CCUs on weeknights, but our MAU is in the 40,000-50,000 range. That's not an atypical ratio in the MMORPG industry.
City Council Number Six Posted January 24

Also, in 2008 it's quite likely that the difference in hardware capabilities was compensated for by much smaller character records at the time (less database load, much less complex queries hitting it), as well as a significant difference in playstyle: more people street sweeping open zones, fewer rapid-fire task forces, no league content, etc.
BurtHutt Posted January 24

13 hours ago, Lunar Ronin said:
There weren't as many people nor as much activity back on the live servers. I've found that people seriously misremember how small the live servers were in comparison to Excelsior and even Everlasting, likely skewed by modern MMORPGs. I've seen a few people over the years say that Infinity was a small server back on live. It was the third most populated server back then, just behind Freedom and Virtue.

Live did not have as many players? Uh...what....uh...WHAT?! How do you know this?! Do you have any facts?! HC is awesome at being transparent and shows the number of players on at any given time. Excelsior had approximately 1500 players on and it went to the queue system. So, are you trying to tell me Live had less people on it? Really? I don't have facts but will throw common sense at you (which is not common, clearly). If Live had 1500 players per shard and had 12 shards or so, they wouldn't survive. The sub fee from 1500 people per shard would not come close to covering their costs etc. The HC capacity is far different than Live.
Doc_Scorpion Posted January 24

31 minutes ago, Number Six said:
We lowered the cap on Excelsior to 1,500 prior to the license announcement as a precautionary measure, as the shard had already been having hiccups almost nightly whenever the big raids disbanded.

Just out of curiosity, do you recall the cap in the 2019 time frame? When I finally came back in Aug/Sept of that year, three red dots on Excelsior were pretty common. IIRC, a combination of natural erosion of the playerbase and upgrading the hardware brought an end to that. I don't recall seeing heavy loads after mid/late 2020 or so.
City Council Number Six Posted January 24

Slight update: I checked to verify, and we are actually at just over 64,000 active accounts over the last 30 days, with 36,000 active in the last week, which is also not an atypical ratio for us.

Concurrent users across all shards since the announcement peaked at about 4,950 a weekend ago, with the highest concurrency on weeknights running around 4,100 to 4,200.

Just for funsies, let's assume player habits haven't changed in the last 20 years, which is probably a bad assumption. Let's also assume that of 125,000 paying subscribers, every single one of them logs in at least once a month. Also a bad assumption; I know plenty of people who left their sub active even when not playing for a while, especially if they prepaid for the discount, but anyway this is for entertainment value only.

That means I would expect a peak concurrency of... 9,660. Spread across 13 shards.

That's fairly in line with both the recommended settings for player cap in the config files and the behavior that starts to show up once you go past 1,500 concurrent and becomes debilitating once you're past 2,000.
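The back-of-the-envelope estimate above can be reproduced in a few lines of Python. This is only a sketch of the scaling described in the post: it assumes the live game had the same ratio of peak concurrent users to monthly active accounts that Homecoming sees now, and the function name is made up for the example.

```python
def estimated_peak_ccu(monthly_active, observed_peak_ccu, observed_monthly_active):
    """Scale an observed peak-CCU-to-MAU ratio to a different monthly-active count."""
    return monthly_active * observed_peak_ccu / observed_monthly_active


# Figures quoted in the thread (all approximate).
LIVE_SUBSCRIBERS = 125_000   # assumed to all log in at least once a month
HC_PEAK_CCU = 4_950          # weekend peak across all shards
HC_MONTHLY_ACTIVE = 64_000   # "just over 64,000" active accounts in 30 days
SHARDS = 13                  # shard count used in the post

peak = estimated_peak_ccu(LIVE_SUBSCRIBERS, HC_PEAK_CCU, HC_MONTHLY_ACTIVE)
print(f"estimated peak concurrency: {peak:,.0f}")
print(f"average per shard: {peak / SHARDS:,.0f}")
```

With exactly 64,000 monthly active accounts this comes out slightly above the 9,660 quoted in the post; the real count was "just over" 64,000, which nudges the result down. The per-shard figure is only an even-split average, and busy shards would have carried well more than their share.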