April 12, 2018
We recently found a huge leak in the UE4 garbage collector, particularly rearing its head on dedicated servers for large open-world games. Servers for arena type games with short play times (<1 hour) seem to be little affected, and clients don’t see the problem much either.
“Huge” is not hyperbole here – we’re talking the garbage collector alone using up to 10 GB of ram in under 24 hours of server operation!
Our “Lizard” memory profiler was a huge help here – we enabled it on one of the live servers we were looking at and collected a summary of allocations over about 12 hours:
~5 GB coming directly from the garbage collector thread tasks? That doesn’t look good.
After some digging, it turns out every one of these outstanding allocations were coming from arrays used with the GC Array Pool (FGCArrayPool). This intentionally keeps arrays around from one GC to another to reduce the number of allocations which happen (memory allocations can be slow and GC is time-critical). This pool isn’t expected to be nearly that large, and it was growing rapidly. It also wasn’t getting cleared out – two issues to dig into.
GC Array Pool Growing
There’s a console command (gc.DumpPoolStats), and a matching native function (FGCArrayPool::DumpStats), to write out some statistics about the GC array pool to the log. To gather some more information about the problem, we added a call to that at the end of CollectGarbageInternal, and then left a debug server running overnight…
On a completely idle server, overnight, we go from this:
GCPoolStats: 5365 Pools totaling 398,015 KB. Avg: Objs=9495, Size=74 KB. 2916352 (6 Items @ 22784 KB = 136,704 KB) 163840 (31 Items @ 1280 KB = 39,680 KB) 57344 (265 Items @ 448 KB = 118,720 KB) 24576 (183 Items @ 192 KB = 35,136 KB) 16384 (46 Items @ 128 KB = 5,888 KB) 2730 (19 Items @ 21 KB = 399 KB) 2046 (3 Items @ 15 KB = 45 KB) 1638 (15 Items @ 12 KB = 180 KB) 1364 (7 Items @ 10 KB = 70 KB) 1170 (107 Items @ 9 KB = 963 KB) 1022 (4 Items @ 7 KB = 28 KB) 910 (4 Items @ 7 KB = 28 KB) 744 (26 Items @ 5 KB = 130 KB) 682 (51 Items @ 5 KB = 255 KB) 584 (5 Items @ 4 KB = 20 KB) 510 (110 Items @ 3 KB = 330 KB) 408 (85 Items @ 3 KB = 255 KB) 340 (628 Items @ 2 KB = 1,256 KB) 254 (4 Items @ 1 KB = 4 KB) 204 (3766 Items @ 1 KB = 3,766 KB)
GCPoolStats: 5365 Pools totaling 934,702 KB. Avg: Objs=22300, Size=174 KB. 2916352 (7 Items @ 22784 KB = 159,488 KB) 163840 (120 Items @ 1280 KB = 153,600 KB) 57344 (862 Items @ 448 KB = 386,176 KB) 24576 (471 Items @ 192 KB = 90,432 KB) 16384 (79 Items @ 128 KB = 10,112 KB) 8192 (1 Items @ 64 KB = 64 KB) 2730 (43 Items @ 21 KB = 903 KB) 2046 (3 Items @ 15 KB = 45 KB) 1638 (19 Items @ 12 KB = 228 KB) 1364 (12 Items @ 10 KB = 120 KB) 1170 (174 Items @ 9 KB = 1,566 KB) 1022 (12 Items @ 7 KB = 84 KB) 910 (3 Items @ 7 KB = 21 KB) 744 (15 Items @ 5 KB = 75 KB) 682 (59 Items @ 5 KB = 295 KB) 584 (2 Items @ 4 KB = 8 KB) 510 (175 Items @ 3 KB = 525 KB) 408 (38 Items @ 3 KB = 114 KB) 340 (991 Items @ 2 KB = 1,982 KB) 254 (21 Items @ 1 KB = 21 KB) 204 (2258 Items @ 1 KB = 2,258 KB)
The exact same number of arrays in the pool, but a lot of them have moved from the bottom size bucket (1KB) to the high ones (192 KB, 448 KB, and 1280 KB). Total memory use increased by 500MB as a result (this is a lot less than the memory profile from the live server, but the garbage collection was a lot more stable due to less object churn as there were no players on the server to actually do anything).
Some digging around later revealed the reason – the pool is a simple LIFO list, used across multiple threads – the non-deterministic execution order of the threaded GC results in it occasionally pulling a small array when it needs a big one, and vice-versa, even with two otherwise identical runs. If a large array is taken from the pool when it only needs a small one little harm is done – it only uses as much as it needs and returns it to the pool. But if a small array is taken from the pool when it needs a large one, it will resize it to the larger size before using it, and return it to the pool as the new larger size.
Over time, this results in arrays being slowly pulled out of the small size category and resized to larger sizes. Eventually, we surmise, nearly every array in the pool would be in the largest size categories! With over 5000 arrays, that explains the ridiculous memory use.
GC Array Pool not being cleaned up
We weren’t seeing the same issue on the client, but why?
It turns out, the GC array pool is completely emptied and all its memory freed when a “full purge” happens on the GC. This seems to occur in two main situations – changing maps, and unloading of streaming levels. The server didn’t change maps (and the client does only when entering or leaving a game), but the more important is the 2nd case – the servers don’t stream anything out! As a client moves around the world, level tiles are constantly loading and unloading, which results in the GC array pool being cleared.
Once we knew what we were looking for, we could actually replicate the issue on the client as well – a client standing still, with the game paused (no movement, no streaming) resulted in exactly the same kind of leak – much smaller, due to the client having less loaded, so the arrays were smaller – but still there.
Unfortunately, we don’t have a true fix for this one, as that would require a rewrite of most of the Garbage Collector’s use of arrays. But we do have a workaround: clear the GC pool after each invocation of the GC. It’s not ideal, but it preserves the benefit of the pool during the GC (unlike disabling it completely), without the leak over time.
Simply add the following line to CollectGarbageInternal() (approx line 1450 in GarbageCollection.cpp, as of 4.19) .. add it after the call to “ClearWeakReferences()”:
Credit(s): Gareth Martin (Coconut Lizard)
Status: Currently unfixed in UE4.19