Forums » General

Does VO have a 'Server Mesh', and does it need to?

Feb 08, 2022 TheRedSpy link
Much controversy has been made in the Star Citizen community about broken promises and timelines to deliver something called 'Server Meshing'.

In short, this is a technology that allows multiple servers to communicate with each other in a 'mesh' and spread the simulation load so it can be scaled up.

This is said to be a required technology to enable Star Citizen to have more than 50 players in a single instance, a feature Vendetta Online has had since 2002.

It's an indictment of Star Citizen's development, given that Vendetta Online almost certainly has never received, in its entire existence, the same level of funding as Star Citizen received in November of last year alone (some $USD 80M, and correct me if I'm wrong here). It also shows that the most important thing is to prioritise the features that actually matter, rather than making fancy walkabout shopping centres like they have done in Star Citizen.

I was prompted to post here after having this discussion: https://robertsspaceindustries.com/spectrum/community/SC/forum/3/thread/new-roadmap-for-cig/4875795

I found myself reflecting on the times we had large space battles in Vendetta Online, some of which did seem to reach the upper limit of the technology and netcode.

After that long-winded introduction, my questions then become:

1. does Vendetta Online use a server mesh, or is the netcode so good that it has this option in reserve (or something else)?

2. is it hard to make a server mesh?
Feb 09, 2022 womble link
1. Yes.
2. Yes.
Feb 12, 2022 incarnate link
This is a really complex topic, and one where I'm inclined to write massive text-walls that probably no one wants to read (aside from other network engineers), so I'll try to condense this as much as I can, which is still super long..

1) The accurate answer to most generalized engineering questions is "It Depends".

This is because most engineering decisions are use-case specific. Without a lot of deep knowledge on the parameters of the problem being solved, it's hard to come up with an optimal solution. Different games may have very different technical challenges, even if the context seems similar.

2) Good technical game-development is about Heuristics and not Simulations.

"Heuristics" are all about the optimal hack that makes something "seem" real, and run at a playable framerate. "Simulations" are about "ground-truth", making something absolutely accurate and precise.

Comparing a real-time game engine with an offline renderer is an example.. the game engine is doing degrees of "fake" lighting to maintain a playable framerate, where the offline renderer is trying to get something more "physically perfect" and might be 10,000 times slower.

3) There is no unified language or reference terminology for high-end game server constructs.

So, I don't know exactly what they call "Server Meshing", or what that means, or exactly how that intersects with their game's mechanics. But, I can try to make some educated guesses.

All That Being Said, here are some thoughts..

Obviously, everyone has "a game server" that is actually made up of a cluster of individual server-nodes connected together (usually in an "elastic" arrangement that scales in size based on current player load). That's the basics, which everyone does, and I suspect that isn't actually what they're talking about here..

Based on your description of "more than 50 players in a single instance", I would guess this is something equivalent to our "sector" construct (one piece of our overall "game server"). Basically a daemon that is sometimes registered to handle a particular "geographical" region, and manage the gameplay that occurs within that region, probably including some collision detection and physics simulation. In old EverQuest parlance it would have been a "zone".

From there, I would guess that what they mean by "Server Meshing" is basically the ability to utilize multiple independent server nodes (unique cloud nodes, or physical servers, or whatever) to maintain a lockstep simulation of the same virtual-location (from a physics standpoint), and theoretically share the load over a network.

It may also operate completely independently of "virtual location", which has been a popular direction of late. There have been a lot of different architectures around this, and various attempts at this, with varying degrees of success.
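
To make that guess a bit more concrete, here's a rough Python sketch of the general idea (purely illustrative, hypothetical names, and obviously not anyone's actual code): several nodes each "own" a slice of the entities in the same region, simulate only their own slice, and exchange state with their peers every tick.

```python
# Purely illustrative sketch of one possible "server meshing" arrangement:
# several nodes share authority over the *same* region, each simulating the
# entities it owns and exchanging state with its peers every tick.
# (Hypothetical names throughout; not SC's or VO's actual code.)

class MeshNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers        # other nodes simulating the same region
        self.owned = {}           # entity_id -> state this node is authoritative for
        self.replicated = {}      # entity_id -> last-known state received from peers

    def tick(self, dt):
        # 1) simulate only the entities this node owns
        for state in self.owned.values():
            state["pos"] = [p + v * dt for p, v in zip(state["pos"], state["vel"])]
        # 2) "mesh out" the owned state to every peer
        for peer in self.peers:
            peer.receive(self.node_id, dict(self.owned))

    def receive(self, sender_id, states):
        # 3) merge peer state so physics/visibility can consider *all* entities,
        #    even the ones this node doesn't simulate itself
        self.replicated.update(states)
```

The potential win is step 1 (the simulation load is split); the cost is step 2 (everything has to be exchanged), which is where the trade-offs below come from.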

However, this kind of solution is usually approached when one is trying to have a single concurrent scene with thousands, or even tens-of-thousands of players. Not like.. a few hundred? Without implying any critique of the SC team (I know nothing about their requirements or system), that seems a little unusual.

There are some inherent trade-offs in pursuing this kind of solution. It's more bandwidth intensive within the given datacenter (you have to mesh out the activity times the number of nodes, or do some kind of multicast thing, or have some kind of smart packet-routing.. different tradeoffs), and it can be fairly complex (which can be more involved to debug and maintain). You may also get less server performance per CPU-cycle, but at (hopefully) an offset of greater overall scale.
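
As a toy illustration of that first trade-off (invented numbers, just to show the shape of the curve): if every node has to forward its activity to every other node, intra-datacenter traffic grows with N * (N - 1), not with N.

```python
# Toy numbers only: naive full-mesh replication means every node forwards its
# activity to every other node, so traffic grows with N * (N - 1).
def full_mesh_traffic(nodes, updates_per_sec, bytes_per_update):
    return nodes * (nodes - 1) * updates_per_sec * bytes_per_update

for n in (2, 4, 8, 16):
    mb_s = full_mesh_traffic(n, updates_per_sec=30, bytes_per_update=2000) / 1e6
    print(f"{n:2d} nodes -> ~{mb_s:.1f} MB/s inside the datacenter")
```

Multicast or smarter packet-routing changes the constants, but not the underlying pressure.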

But, again, I have no idea what SC is doing, or what their technical criteria or requirements may be.

We've considered a lot of options and ideas in this area, over the years, for "extreme concurrency" scenarios. But, we don't presently do anything like this, and probably won't in the near future; in large part because CPU core-count has escalated so much, and become so widely available.

- Our per-sector capacity is currently limited to 500 players each, above-which a new instance will be created (basically an emergency option, also if it's running really slowly). For most cases, that player-limit may be a bit conservative. That's on a single mid-range CPU core. (Our overall efficiency per-sector is not a "network code" thing, it's an "everything" thing; more about Point #2 above, careful usage of heuristics).

- Sector Daemon resource usage can vary widely based on types of activity, but our sectors are usually pretty fast. We have a lot of varied tests of different situations, which sometimes show up on social media. There is some variance based on sector content, but obviously we have sectors with hundreds of thousands of objects, and we're moving towards physically interactive sectors (movable asteroids, debris), so this isn't an assessment based on some kind of empty-sector. Again, Point #2 from earlier: performance is all about the heuristics.

- Sector Daemons can also each scale "times N CPU-cores", using our threaded back-end, to handle higher player-counts (over 500) or intensive scenarios. Not all tasks will scale equally, but since primary load tends to be collisions / physics, and that scales very well, it gives us a lot of "headroom" by simply rolling server nodes with more cores if needed (up to 32-cores or more, from common cloud providers). We keep a few higher-core nodes online for the "high usage" cluster, which handles specially flagged sectors, and we can spin up more (and larger) if needed. (There's a rough sketch of this threading idea just after this list.)

- Some wildly optimistic person might take our 500-user sector limit, and multiply it times our 8,000 game sectors, and say "wow, they can support 4 million concurrent players without instancing!", but that would be BS. Some other internal mechanism would explode long before then.
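
To illustrate the per-sector threading mentioned a couple of bullets up, here's a very rough structural sketch (in Python just for readability; the real engine is native code with real threads, so ignore the GIL here): the per-tick physics work splits into independent chunks and fans out across however many cores the node has.

```python
# Rough structural sketch of "a sector scales across N cores": split the
# per-tick physics/collision work into chunks and farm them out to a pool
# sized to the host's core count. Illustrative only -- not actual engine code.
from concurrent.futures import ThreadPoolExecutor
import os

def simulate_chunk(entities, dt):
    # integrate (and, in a real engine, collide) one chunk of the sector's objects
    for e in entities:
        e["pos"] = [p + v * dt for p, v in zip(e["pos"], e["vel"])]
    return entities

def sector_tick(all_entities, dt, workers=None):
    workers = workers or os.cpu_count() or 1
    chunk = max(1, len(all_entities) // workers)
    chunks = [all_entities[i:i + chunk] for i in range(0, len(all_entities), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda c: simulate_chunk(c, dt), chunks)
    return [e for c in results for e in c]
```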

Real-world scalability is never linear, or precisely predictable, which is why we spend a lot of time worrying about new testing-models and ways of simulating load. The correct "answer" to "how many players can your server handle" is: "well, what are the players doing?"

In other words, a bunch of people sitting in stations chatting is glorified "IRC"; but a bunch of massive cap-battles in huge and dense asteroid fields, with dynamic physics debris flying around? That's.. different.

Even with testing, you never know exactly what your limitations and problems are.. until the next time a large number of people show up, and then you may find out in a hurry (and painfully). It's part of the tightrope-walk of operating this particular type of game/service, and especially when your game is constantly adding new features (and bugs).

Beyond Capacity

There are reasons to pursue architectural solutions, like "server meshing", that have little to do with absolute capacity. Transparent service migration between back-end server nodes, for instance. But, for us, and most others that I know of, the complexity has not outweighed the benefits. And you specifically asked about capacity, so I'm focusing on that..

VO and SC may not be directly comparable in architecture.

It's very easy to set a gameplay goal that makes a server architecture wildly different, or more complex. This is where I really don't know what they're trying to do, which makes me unqualified to render much commentary on their solution. I know that's kind of an unsatisfying answer, but it is a real one.

In closing, to answer your two questions:

- No, we don't do something like "server meshing". At least, based on what I think they mean.

- Yes, it's probably pretty complicated to implement. Because you have to implement the idealistic version that you think will work, and then you have to beat the hell out of it with capacity-tests and actual players and fix all the stuff that goes awry. It is not a system that can be developed more quickly by spending more money, it takes time, testing and iteration. Adding more developers won't really make it come together any faster; it just isn't that kind of problem.

Building any MMORPG server is not for the faint of heart, if one has any aspirations of "scale". The real-time thing just makes it harder still.
Feb 12, 2022 TheRedSpy link
Thanks for weighing in!

They have released fairly technical videos (in the sense that they're full of jargon and hard for ordinary players to understand; I'm not sure how technical they actually are) explaining the overall architecture, this one being the latest: https://www.youtube.com/watch?v=TSzUWl4r2rU. I appreciate you'll probably be too busy to look into it further.

From what has been explained by the development team in public videos, SC inherited a codebase that was designed purely for a single game server to serve clients and handle all components of the game simulation.

They then started building game assets, mechanics and art around that, and have since had to literally retrofit the server code into the existing codebase so that it can scale up in gameplay and concurrent users. But 10 years later that work is not complete, and the game universe is still only practically playable with 50 people in one instance of the full universe (which consists of just one star system, itself a giant - I mean, really big - game level).

VO and SC couldn't be comparable in architecture, because they are so different in design choices. Sectors are a prime example, Star Citizen is quite literally going for a 'no loading screen' experience for space travel between locations, which works nicely today, subject to the 50-player limit.

If I hadn't been following VO development for over 11 years now, I'd say I'm surprised you haven't taken the opportunity to look through the game as it stands, out of professional curiosity. It's been fascinating comparing their design goals to VO's.

If I had to summarise, SC is on a path of trying to implement many of VO's core gameplay mechanics, but with extreme levels of fidelity and immersion (much of which is unnecessary for a great player experience, a fact I am only aware of because of my experience of VO's design choices). For instance, like VO, they still have a static economy, and are still not close to doing any sort of dynamic economy testing in their test environment, although it has been teased at a high level. But doing a trade involves literally walking through a marketplace-as-a-game-level (only to then use a VO-style terminal), getting into a physical cockpit, flying off a planet and going to a trading outpost, only to then do the same thing again.

My observation has been that most fans of the genre express a desire for a more immersive, first-person style EVE online game. Massive scale in terms of economy, but still with those little experiences of fighting battles or working out the details of a profitable route, avoiding (or being) pirates etc.

SC was supposed to be the great hope for that, but the frustration of many backers is clearly that its developers have prioritized immersion detail down to an insane level, ignoring what makes the experience truly exceptional - the sense of scale. Your comments about scalability being something that cannot be implemented faster, even with more money or developers, corroborate what I suspected: that all the immersion-layer features can be (and likely have been) built in tandem with the work on scaling up concurrent player counts - and it really is just that hard to do.

Thanks and unless you object I'll do a write up for the SC forums with your comments and a brief profile of the game. I think you'll find some players would be interested to come and see how VO works.
Feb 14, 2022 incarnate link
VO and SC couldn't be comparable in architecture, because they are so different in design choices. Sectors are a prime example, Star Citizen is quite literally going for a 'no loading screen' experience for space travel between locations, which works nicely today, subject to the 50-player limit.

That particular example is a bit of a misconception. I also wanted a "no loading screen" experience, but obviously that was more involved to deliver back in 2002. The limitation was more on the client-side than the server, as you have to be able to background-load content from disk, into RAM, and then directly into the GPU, without impacting the framerate of content already rendering.

VO actually has this now, and we could also have "no loading screens". The server-side has never really been a barrier at all; sectors are a virtualized construct, and we already pre-load sectors based on predicted player activity, so they're already "online before you get there". Handoff time, between sectors, can be very brief and transparent.
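
For anyone curious what "pre-loading based on predicted player activity" means in the abstract, here's a toy sketch (hypothetical names and numbers, not our actual server code): project the player's motion forward, see which neighbouring region they're headed into, and make sure that sector's process is already warm before they arrive.

```python
# Toy sketch of predictive sector pre-loading: project the player's motion
# forward, see which region boundary they'll cross, and warm that sector's
# process before they arrive. Hypothetical names; not actual VO code.
SECTOR_SIZE = 10_000.0   # made-up region edge length, in world units

def sector_of(pos):
    return (int(pos[0] // SECTOR_SIZE), int(pos[1] // SECTOR_SIZE))

def predict_next_sector(pos, vel, horizon=5.0):
    # where will the player be ~horizon seconds from now?
    future = (pos[0] + vel[0] * horizon, pos[1] + vel[1] * horizon)
    return sector_of(future)

def maybe_preload(player, running_sectors, spawn_sector):
    nxt = predict_next_sector(player["pos"], player["vel"])
    if nxt != sector_of(player["pos"]) and nxt not in running_sectors:
        running_sectors[nxt] = spawn_sector(nxt)   # "online before you get there"
```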

My comment about comparable architecture was more directed at technical goals that SC might have, which could be unusual or different from what we want to do. Although I referenced "gameplay" in my comment, these might not be player-visible content goals, but rather expectations around particular implementations, which mandate particular technology roadmaps. That's where it gets hard to comment on the engineering choices that someone else is making, when you don't really know the complete story of "why" (which is not something one is usually going to learn from a public, marketing-type description).

There are Many Ways to Do The Same Thing.

I briefly watched the first half of the video you linked, and I think I can illustrate a little of the difference around the two approaches.

I wrote above about how VO's "sector" construct is a process that runs on a server, which in our case is responsible for a particular region of geography, and what happens within it. Physics, gameplay, data changes, etc.

- SC essentially has a "sector" for each individual player, which follows them around, dynamically loading content data based on expected player visibility and relevance.

- VO uses locally static, fixed "regions", on the server side, served on a per-process basis; but then uses analysis of player movement to spin up "the next" location in a transparent way.

(We also have a separate persistent mechanism, which uses lightweight Erlang threads to continuously track all player and NPC activity, independently of the "physical world" simulation provided by the "sector").

So, you could imagine that while SC has a "sphere of informational availability" that surrounds you everywhere you go, VO transitions you from "area to area" across a more fixed-grid of static, subdivided regions.
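
In code terms, the difference between those two mental models is roughly this (purely illustrative sketch; neither is anyone's actual implementation):

```python
# Purely illustrative: two ways of deciding "what does this player need to
# know about?" Hypothetical names and numbers.

# SC-style (as I understand the video): a per-player "sphere of availability".
def relevant_by_radius(player_pos, entities, radius=15_000.0):
    r2 = radius * radius
    return [e for e in entities
            if sum((a - b) ** 2 for a, b in zip(e["pos"], player_pos)) <= r2]

# VO-style: players are handed from fixed region to fixed region, and the
# region's process already holds everything relevant to that region.
def region_key(pos, cell=10_000.0):
    return tuple(int(c // cell) for c in pos)

def relevant_by_region(player_pos, entities_by_region):
    return entities_by_region.get(region_key(player_pos), [])
```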

Generally speaking, the two approaches are effectively player-identical. Either one could be used to achieve the same gameplay goals, and the difference to the user would be invisible. The client-side experience, in the context of the features you mention, would be feasible with either implementation (particularly at "nearby player" counts of only a few hundred).

The issues are more on the back-end, in the development and operational mechanics.

Basic differences..

- SC's is a more "general purpose" design. Their server data-streaming seemingly maps 1:1 with their game client data, which is probably why they're taking that route. It probably gives them a lot of flexibility with their existing management tools, modification of the universe, etc. This implementation is very "dynamic", any player can be in any location and it can be kind of a unique locational snowflake in terms of content loaded, memory footprint, etc.

- VO's architecture intentionally uses more of a static (and regional) process construct, to give us a fixed and predictable data size on the server. We know how much memory every sector will require before we load it (which remains true, even with player-manipulated sectors); this information helps with scaling and the "elastic" nature of the server load-balancing and distribution. We can easily optimize "heuristics" around more fixed and granular data structures, and we also get to efficiently share player-usage by guaranteeing that players near the same content are generally served by the same process.
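
The practical payoff of that predictability is something like this toy sketch (hypothetical sector names and sizes, not our actual tooling): if you know each sector's footprint before you load it, placing a sector process onto a VM is a simple capacity check rather than a guess.

```python
# Toy sketch: when every sector's memory footprint is known in advance,
# placing sector processes onto VMs is a simple capacity check.
# Hypothetical data; not actual VO tooling.
SECTOR_FOOTPRINT_MB = {"dau_l10": 180, "sedina_b8": 240, "newbie_tutorial": 90}

def pick_vm(sector, vms):
    need = SECTOR_FOOTPRINT_MB[sector]
    # prefer the VM with the most spare RAM that can still fit the sector
    candidates = [vm for vm in vms if vm["free_mb"] >= need]
    if not candidates:
        return None                       # time to grow the cluster
    best = max(candidates, key=lambda vm: vm["free_mb"])
    best["free_mb"] -= need
    return best["name"]
```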

There are limitations on either side..

- SC probably has a process-per-player, which they then have to connect to other nearby-players to exchange data at high speeds ("meshing"), or they have to dynamically merge (and split) processes based on player proximity or player-count, load, etc. Unrelated, but I suspect they also have a substantial per-process memory footprint, if they're really loading a 1:1 of the client data. (Also, if they stick solely to "meshing", then extreme single-region concurrency, like maximum battle scale, may be a long-term challenge).

- VO has a process-per-region, which can handle a "lot" of nearby-players (up to 500, by default, with an expandable limit based on more CPU core availability) and exchange their data very easily, because it's all shared in a single process. If the sector's player-limit is exceeded for a particular region, we're more likely to live-migrate the process to a larger cluster server (and increase core and thread count), while also trying to keep our "standard" level of scale high enough that that's an infrequent occurrence. Also, our per-process memory footprint is pretty small, and doesn't expand much during runtime, because there isn't much dynamic allocation.

- SC's solution makes it simple to handle "anything can happen anywhere, and all data is geographical and generic". If you drop a piece of cargo somewhere, it will just "be" there, where it is, there's no concept of authority for that region of space.

- VO's current "sectors" have boundaries and edges of regional authority. However, in practice, the granularity of VO's space and sector assignments is virtual, and datasets could be overlapping, or be re-shuffled pretty simply. It's all very flexible, since it's invisible to the player anyhow. It doesn't "need" to be a 16x16 grid overlaid on a star-system, or even be correlated to navigation at all; we can have an infinity of fractal complexity if required (so you can.. visit the sub-atomic scale? I don't know). The point of our architecture is just that the data-set is fairly quantized and knowable-in-advance, so we know how much information we're likely to move around at any given time (which makes other aspects of the server easier to predict, design and scale).
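
To make "the grid is virtual" concrete: sector assignment is just a quantization function over coordinates, so changing the granularity (or making it hierarchical) is a matter of changing the function. A toy 2D version, illustrative only:

```python
# Illustrative only: region assignment as a quantization function. The "grid"
# doesn't have to be 16x16, or even flat; a quadtree-style key gives you as
# much (or as little) subdivision as a given volume of space needs.
def quad_key(pos, system_size=160_000.0, depth=4):
    key = []
    size = system_size
    x, y = pos
    for _ in range(depth):
        size /= 2.0
        qx, qy = int(x >= size), int(y >= size)
        key.append((qx, qy))
        x -= qx * size
        y -= qy * size
    return tuple(key)

# depth=4 over a 160k-unit system behaves like a 16x16 grid; crank depth up
# for "fractal" subdivision of a busy area, without the player ever noticing.
```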

Six of one, Half dozen of the other..

Again, either one of these solutions will work fine from the player perspective, as long as they're implemented well, and stable.

There's a bit of "the devil's in the details" in terms of what challenges will come out of any particular implementation and architecture, and time will tell on that (which is, frankly, true for everyone; ourselves included).
Feb 14, 2022 Anewold link
I wouldn't be surprised if SC includes some mesh server component that runs on the gamer's side, in the game, to increase server counts, getting each player to run a "thread" of a larger non-localized cluster.

In terms of VO and SC, I see VO being a lot less hardware intensive (which it is currently, and would need to be for cross-play) and SC being very intensive. At the moment SC reportedly hits 100% CPU usage on an i7 7820X, 22GB of RAM usage, and 100% on a 1080 Ti GPU (of course). Meanwhile VO uses way less.

I find how servers work interesting, though I have about 0.1% knowledge of them in terms of game usage.

Also, a question mainly for Inc: if VO is ~400-500MB, why does preload take 1.5GB of RAM? I'd assume some assets are streamed from the server? But if that's the case, why require three times the amount of what the application actually is in size?

Also, knowing the loading speed when jumping sectors, is there any other benefit to preloading?
Feb 14, 2022 incarnate link
I wouldn't be surprised if SC includes some mesh server component that runs on the gamer's side, in the game, to increase server counts, getting each player to run a "thread" of a larger non-localized cluster.

Aside from potential issues like cheating and content authority, doing peer-to-peer server clustering adds a lot of variable-latency problems, particularly for a real-time game. I would not want to "serve" from my players.. not in a persistent world, anyway (a minimal "instanced dungeon" scenario, for a small group, might be okay).

Also, a question mainly for Inc: if VO is ~400-500MB, why does preload take 1.5GB of RAM?

VO's assets are compressed on disk, to reduce download time and install space. This is different from concepts like "texture compression", which we also do, which reduces runtime memory footprint. We do both.. analogous to "zipping" a bunch of runtime-optimized files. Some assets compress-on-disk very well, like object geometry, but they cannot be loaded that way into memory. So operational usage requires unpacking the data into a functional data-structure in memory.
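
A generic illustration of the difference (not our actual asset pipeline): data compressed on disk has to be fully unpacked into a usable structure in RAM, so the runtime footprint ends up larger than the install size; texture compression is the opposite case, because the GPU can sample it while it's still compressed.

```python
# Generic illustration (not VO's actual asset pipeline): data that's
# zlib-compressed on disk must be unpacked into a usable structure in RAM,
# so the runtime footprint is larger than the download/install size.
import json, zlib

def save_asset(path, mesh):
    with open(path, "wb") as f:
        f.write(zlib.compress(json.dumps(mesh).encode()))   # small on disk

def load_asset(path):
    with open(path, "rb") as f:
        blob = f.read()
    return json.loads(zlib.decompress(blob))                # bigger in memory

# GPU texture compression (DXT/ETC/ASTC-style) is different: it stays
# compressed at runtime, so it reduces memory footprint, not just install size.
```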

No, I don't think there's much value to preloading, beyond client side sector load times, which will practically go away once the newer background-I/O and compression stuff is released beyond the Android version. Only people with really old systems or very slow disks (but lots of ram) will benefit from preload, at that point.
Apr 07, 2022 MonkRX link
Hey Incarnate,

I'm not a developer, but I have some background working with people in web development. In terms of scalability, how does the server scale up or down? Is it spawning additional threads? Additional processes? The more direct question I have is: do you host or break down the application into "microservices"? In web, it's extremely popular to develop and spawn separate applications for separate features on your website (let's say, a comments section) so each can scale independently from the rest of the site (such as a ton of people commenting on a recently dropped YouTube video).

However, this runs into the issue of scaling in time for the load. Scaling up microservices can mean spawning a container (e.g. via Docker or Kubernetes) or an entire virtual machine, and this requires warm-up time. So the SLA for a new instance to handle more load is typically in the seconds to minutes, as opposed to (I'm assuming) nanoseconds to microseconds for spawning a new thread. The difference here is that you gain physical infrastructure and processing power with a container or VM, but with a new thread you don't.

How does VO's engine handle scaling? Do you guys have any dynamic infrastructure scaling, and if so, how is the application written to handle it?
Apr 13, 2022 incarnate link
In terms of scalability, how does the server scale up or down? Is it spawning additional threads? Additional processes?

The game has always been based around the idea of starting processes, which serve "regions of space" on specific back-end clusters, which are groups of VMs. The clusters are respectively balanced by resource utilization (measured across the component VMs), and some other heuristic factors.

Those back-end clusters are elastic, so they could have any number of VMs within them (or CPU/RAM resources), and the elasticity of the given cluster can be determined by the amount of "spare capacity", and heuristics of expectations of maximum peak short-term increase in player activity.

The individual VMs can also host quite a few processes, which provides an inherent "buffer" of additional immediate capacity that can be served, at any given time, from the unused capacity in the "existing cluster", without increasing the number of VMs (and CPUs / RAM).
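
A toy version of that "spare capacity versus expected peak" heuristic (invented numbers and thresholds, not our real balancer):

```python
# Toy version of the elasticity heuristic described above: keep enough spare
# capacity in the cluster to absorb the expected short-term peak, and add
# VMs when the buffer drops below that. Invented numbers, not the real balancer.
def vms_to_add(vm_count, used_capacity, capacity_per_vm, expected_peak_increase):
    total = vm_count * capacity_per_vm
    spare = total - used_capacity
    if spare >= expected_peak_increase:
        return 0
    deficit = expected_peak_increase - spare
    return -(-deficit // capacity_per_vm)   # ceiling division

# e.g. 6 VMs of 100 "units" each, 480 in use, expecting a surge of 200 units:
print(vms_to_add(6, 480, 100, 200))   # -> 1 more VM
```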

Additionally, there are different classes of processes, which can be served by entirely different clusters. So, for instance, if your biggest "transient user load" exposure is in new-user onboarding (a likely concern for any game with a mobile port), then a specialty class of cluster can be established just for sectors that are encountered by new players. In that way, you can sustain a fairly intense number of new players, and potentially maximize CPU availability and VM usage, thanks to economy-of-scale of relatively lightweight newbie sectors. And, at the same time, an overwhelming instantaneous surge in new-users will not "break" the rest of the game, because it's allocated to a separate cluster. (I'm big on the concept of "elegant failure"; failure can sometimes be unavoidable, so at least try to do it as elegantly as possible).
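
In config terms, the "different classes of processes" idea can be as simple as something like this (made-up names, just to show the routing concept):

```python
# Made-up example of routing sector classes to separate clusters, so a surge
# of new players lands on the "newbie" cluster without touching the rest of
# the game. Hypothetical names.
CLUSTER_FOR_CLASS = {
    "newbie":    "cluster-onboarding",
    "high_load": "cluster-big-nodes",    # flagged event / capital-battle sectors
    "default":   "cluster-general",
}

def cluster_for_sector(sector_flags):
    for flag in ("newbie", "high_load"):
        if flag in sector_flags:
            return CLUSTER_FOR_CLASS[flag]
    return CLUSTER_FOR_CLASS["default"]
```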

The processes themselves also have threads, which are used to allow the processes to scale themselves to handle unexpectedly large concurrencies in a single region. This can be useful in a dynamic scenario, where the organic startup / shutdown of processes can result in a single sector running on a given VM, because its process CPU utilization is sufficiently high (and can use multiple cores). But, it similarly can be useful in an "expected load" scenario, where say a major Event Sector might be spun up (in advance) on a particular dedicated node or specialty cluster-assignment, to give it a maximum of resources during gameplay (maybe more cores, etc).

So, you could say that it runs the full gamut, we do spin up and down VMs themselves. This, as you say, has some time associated with their provisioning.. as well as some uncertainty in that time and availability, depending on who else is serving from the same datacenter (like other companies) and what their activity may be like. We do spin up processes on those VMs, and those processes also utilize threads for a number of their internal activities.

Interestingly, the game has always worked this way. We originally owned all our own physical hardware, prior to the cloud era, but the architecture was still based on the idea of provisioning more physical hardware as quickly as we could, to support possible "sudden" increases in userbase. Ie, quickly installing more physical machines into a rack to linearly scale our capacity. "Back in the day" this was achieved with solutions like "pxe boot" with our whole cluster booting FreeBSD over the network and loading off of a shared NFS drive. These days, with cloud architectures, it's much easier.