Introducing WinApi: Comparing GC pressure and performance with WinForms

Saturday, 29th Oct 2016

GitHub: WinApi

TL;DR - Performance stats

Direct message loop performance: 20-35% faster.
Heap allocation: 0 MB vs. roughly 0.75 GB per 100k messages.
Memory page faults (soft): about 0.5% of the WinForms count, a mere 5k vs. roughly 1 million faults per 100k messages.

WinApi's primary objective is to provide access to the native layers of the Windows API from the CLR. However, even at first look it should be clear that the WinApi.Windows namespace encroaches on WinForms territory, even though it's a tiny fraction of the size of WinForms. Over the years WinForms has been optimized well enough to be decent. It's not the most efficient beast, but for common programs it probably takes up less than 2-5% of your application's time, so it doesn't matter on modern hardware; or so goes the general line of thought. What one cannot refute, however, is that it was never in the same league as, say, ATL/WTL in C++ or direct Win32 programming when it comes to handling message-loop-heavy applications or high-performance games.

The WinApi.Windows advantage

The assembly code generated by the JIT is directly comparable to ATL/WTL or a Win32 application written by hand.
The message loop is completely GC-allocation free.
You have complete control over how messages are processed. You can entirely short-circuit, or manually extend connection points into the message loop logic.
It's the perfect replacement for WinForms when you want to handle the GUI logic your own way. It has no inherent GUI subclassing or drawing functionality, so you're free to use any kind of external GUI logic and any drawing library: one powered by DirectX (as WPF and WinRT XAML are), Cairo, Skia, or even the de facto System.Drawing that uses GDI+ underneath.

Performance analysis

Setting the stage

Measuring GUI performance is, in general, very tricky. Instead of following the traditional benchmarking models, I'm going to do a little trick for a very quick and practically accurate analysis.

The reason I'm doing this is not just to get a really quick estimate; traditional models would also be unfair to WinForms. Why? Because WinApi.Windows is a very efficient, light-weight wrapper with no GC allocations during the message loop. A hefty framework like WinForms is always going to lose in micro-performance tests, and the results will be misleading.

Rather, what I'm going to do is open up a simple window and give it a ton of messages to chew on. The idea is that these messages have to go through the entire processing loop without generating additional calls into other areas such as the graphics API, since that would end up benchmarking the 2D and 3D libraries.

What's the simplest way to do this? Well, resize the window! As you resize, Win32 sends WM_WINDOWPOSCHANGING and WM_WINDOWPOSCHANGED, and DefWindowProc in turn generates WM_SIZE and WM_MOVE. This also ends up generating WM_NCPAINT and WM_PAINT messages as long as the CS_HREDRAW and CS_VREDRAW class styles are set. Perfect, except for the painting part. I do want WM_PAINT to be generated since it's one of the high-frequency messages, but I just don't want any of the graphics API calls.

The key here is that we don't care too much about how quickly the program runs to completion.

Wait. What? Isn't that the whole point of this? Not at all. Why? Because no developer in their right mind would make a program that just goes on resizing itself and call it a useful piece of software. It's way too simple to emulate the conditions of a real-life program, so it can't provide any useful timing comparisons. However, the interesting part is that since we trigger the whole layout-paint cycle, we can indeed collect useful information on memory allocations, memory faults, garbage collections and a few more things, which in turn translate to very significant performance aspects in real-life programs.

The trick is to use this otherwise useless program to give us useful data that's practically applicable.

Note: Most of the allocations for a program as simple as this will likely end up in Generation 0, which commonly misleads developers into thinking it's okay. Gen-0 is, after all, the most efficient, isn't it? However, in most practical scenarios objects get bumped up the generations since a lot of them survive the entire course of the event, and the havoc of fragmentation starts to show earlier than most would expect. Large objects are a whole other story: they almost never get defragmented during the course of the program.
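To make the promotion point concrete, here is a minimal sketch (not part of either test program) that watches a single surviving object move through the generations. The class and method names are made up for illustration; only GC.GetGeneration, GC.Collect and GC.KeepAlive are real framework calls.

```csharp
using System;

static class GenerationSketch
{
    // Samples the generation of one surviving object: when new,
    // after one forced collection, and after a second one.
    public static int[] ObserveGenerations()
    {
        var survivor = new byte[1024];   // small object, starts in the youngest generation
        var seen = new int[3];

        seen[0] = GC.GetGeneration(survivor);
        GC.Collect();                    // the object is still referenced, so it survives
        seen[1] = GC.GetGeneration(survivor);
        GC.Collect();                    // and survives again
        seen[2] = GC.GetGeneration(survivor);

        GC.KeepAlive(survivor);          // keep the reference alive up to this point
        return seen;
    }

    static void Main()
    {
        var g = ObserveGenerations();
        Console.WriteLine($"new: gen {g[0]}, after 1 GC: gen {g[1]}, after 2 GCs: gen {g[2]}");
    }
}
```

On a typical desktop runtime the object tends to climb one generation per collection it survives, which is the same thing that happens to short-lived message objects that stay referenced while an event is still being processed.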

The high-level test program

Create a simple window with nothing but 2 labels that are auto-stretched to the entire window.
The controls should always react to changes, resizing themselves and triggering the whole layout-paint cycle, but never painting anything other than their default background until the very end.
Create one new thread, and keep triggering resize as fast as it can process the messages!
Stop. Measure. Repeat.

I'm going to set the thread to send off 100,000 resize messages, and then stop. 100k may be on the high side for a simple application, but it's not uncommon for high-performance realtime applications.

Note: This is likely to vary quite a bit depending on the background CPU usage. This particular type of program will also be heavily affected by dynamic CPU clocks, since it may well complete within its quantum and then wait for the next message, which in turn affects CPU sleep states. So it's important to use High-performance mode or a constant CPU clock speed. The rest is okay, though, since we don't care too much about how quickly it finishes.
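The listings below take their timestamps with DateTime.Now, which is fine for a rough figure. If you want to rule out wall-clock adjustments as one more source of noise, a possible alternative is System.Diagnostics.Stopwatch. This is only a sketch: TimingSketch and Measure are hypothetical names, and the workload is a stand-in for the resize loop.

```csharp
using System;
using System.Diagnostics;

static class TimingSketch
{
    // Runs 'work' once and returns the elapsed time reported by Stopwatch.
    public static TimeSpan Measure(Action work)
    {
        var sw = Stopwatch.StartNew();
        work();
        sw.Stop();
        return sw.Elapsed;
    }

    static void Main()
    {
        // Stand-in workload; in the test programs this would bracket the resize loop.
        var elapsed = Measure(() =>
        {
            long sum = 0;
            for (int i = 0; i < 100000; i++) sum += i;
            Console.WriteLine($"checksum: {sum}");
        });

        Console.WriteLine($"Total Time: {elapsed}");
        // Stopwatch is only backed by the high-resolution counter when this is true.
        Console.WriteLine($"High-resolution timer: {Stopwatch.IsHighResolution}");
    }
}
```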

The WinForms program

public partial class Form1 : Form
{
    public Form1()
    {
        InitializeComponent();
    }

    private int m_times;
    private bool m_done;
    private DateTime m_startTime;
    private DateTime m_endTime;
    private const int Iterations = 100000;
    private Task m_task;

    protected override void OnLoad(EventArgs e)
    {
        base.OnLoad(e);
        var r = new Random();

        // Background thread: request resizes as fast as the window can process them.
        m_task = Task.Run(() =>
        {
            while (m_times < Iterations)
            {
                m_times++;
                this.SetBounds(50, 50, 1200 - r.Next(0, 1100), 900 - r.Next(0, 800));
            }
            m_endTime = DateTime.Now;
            m_done = true;
            this.SetBounds(50, 50, 700, 500);
        });
        m_startTime = DateTime.Now;
    }

    protected override void OnSizeChanged(EventArgs e)
    {
        base.OnSizeChanged(e);

        // Paint only after everything's done to show
        // the result.
        if (!m_done) return;

        var str = $"\r\n{DateTime.Now}: No. of changes done: {m_times}";
        textBox1.Text = str;

        var sb = new StringBuilder();

        sb.AppendLine($"Start Time: {m_startTime}");
        sb.AppendLine($"End Time: {m_endTime}");
        sb.AppendLine();

        if (m_endTime != DateTime.MinValue)
            sb.AppendLine($"Total Time: {m_endTime - m_startTime}");
        textBox2.Text = sb.ToString();
    }
}

That's it. Very simple! Now, let's run it and wait until it ends.

[Image]

So, on my i7 machine, it took about 4 minutes and 34 seconds.
Okay, now, let's look at the more interesting data.

[Image]

The key data of interest are:

Gen 0 Collections: 370
Gen 1 Collections: 186
Finalization Survivors: 1279
Total Bytes Allocated: 749.18MB

Yikes! That's a lot of allocations. Now, 186 Gen-1 collections is no small matter. In fact, even if we ignore the fact that they're Gen-1 and count all of them as Gen-0, that's 370+186=556 collections, or one GC every 180 messages! In total it allocates three-quarters of a gigabyte of memory for nothing but processing the messages. Add your application logic on top of that, which brings not just more allocations but also more GCs and, more importantly, the likelihood that more of those Gen-0 objects get promoted to Gen 1.
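The per-message figures can be reproduced from the numbers reported above with a few lines of arithmetic. This sketch assumes the tool's "MB" means 2^20 bytes; the helper names are made up for illustration.

```csharp
using System;

static class GcPressureMath
{
    // Average number of messages processed between two collections.
    public static double MessagesPerCollection(int messages, int gen0, int gen1)
    {
        return (double)messages / (gen0 + gen1);
    }

    // Average bytes allocated per message (assumes 1 MB = 1024 * 1024 bytes).
    public static double BytesPerMessage(double totalMegabytes, int messages)
    {
        return totalMegabytes * 1024 * 1024 / messages;
    }

    static void Main()
    {
        // 100,000 messages, 370 Gen 0 and 186 Gen 1 collections: roughly 180
        Console.WriteLine(MessagesPerCollection(100000, 370, 186));

        // 749.18 MB over 100,000 messages: a little under 8 KB per message
        Console.WriteLine(BytesPerMessage(749.18, 100000));
    }
}
```

So every single message costs a little under 8 KB of managed allocations on average, before the application has done any work of its own.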

Clearly, that's a lot of stuff that's going on. Let's just look at a little more data.

[Image]

Cycles: 514,070,543,274
Kernel time: 2m:48s
User time: 0m:46s
Total time: 3m:34s
Page faults: 1,119,365

Note that while the page faults here are just soft faults, it still means certain CPU caches will have to be invalidated a lot more often. We'll just leave it here. That's sufficient information for making a rough estimate. Let's move on.

The WinApi.Windows program

Now the same thing with WinApi:

class Program
{
    static int Main(string[] args)
    {
        ApplicationHelpers.SetupDefaultExceptionHandlers();
        try
        {
            var factory = WindowFactory.Create();
            using (var win = factory.CreateWindow(() => new MainWindow(),
                "Hello", constructionParams: 
                    new FrameWindowConstructionParams()))
            {
                win.Show();
                return new EventLoop().Run(win);
            }
        }
        catch (Exception ex) {
            ApplicationHelpers.ShowCriticalError(ex);
        }
        return 0;
    }

    public sealed class MainWindow : EventedWindowCore
    {
        private const int Iterations = 100000;

        private readonly HorizontalStretchLayout m_layout = 
                new HorizontalStretchLayout();
        private bool m_done;
        private DateTime m_endTime;
        private DateTime m_startTime;
        private Task m_task;
        private StaticBox m_textBox1;
        private NativeWindow m_textBox2;
        private int m_times;

        protected override void OnCreate(ref CreateWindowPacket packet)
        {

            this.m_textBox1 = StaticBox.Create(hParent: this.Handle,
                styles: WindowStyles.WS_CHILD | WindowStyles.WS_VISIBLE, 
                exStyles: 0);

            // You can use this to create the static box like this as well. 
            // But there's rarely any performance benefit in doing so, and
            // this doesn't have a WindowProc that's connected.
            this.m_textBox2 = WindowFactory.CreateExternalWindow("static",
                hParent: this.Handle,
                styles: WindowStyles.WS_CHILD | WindowStyles.WS_VISIBLE,
                exStyles: 0);

            this.m_layout.ClientArea = this.GetClientRect();
            this.m_layout.Margin = new Rectangle(10, 10, 10, 10);
            this.m_layout.Children.Add(this.m_textBox1);
            this.m_layout.Children.Add(this.m_textBox2);
            this.m_layout.PerformLayout();

            var r = new Random();

            // Background thread: request resizes as fast as the window can process them.
            this.m_task = Task.Run(() =>
            {
                while (this.m_times < Iterations)
                {
                    this.m_times++;
                    this.SetPosition(50, 50,
                        1200 - r.Next(0, 1100),
                        900 - r.Next(0, 800));
                }
                this.m_endTime = DateTime.Now;
                this.m_done = true;
                this.SetPosition(50, 50, 700, 500);
            });
            this.m_startTime = DateTime.Now;
            base.OnCreate(ref packet);
        }

        protected override void OnSize(ref SizePacket packet)
        {
            var size = packet.Size;
            this.m_layout.SetSize(ref size);

            base.OnSize(ref packet);

            if (!this.m_done) return;

            var str = $"\r\n{DateTime.Now}: No. of changes done: {this.m_times}";
            this.m_textBox1.SetText(str);

            var sb = new StringBuilder();

            sb.AppendLine($"Start Time: {this.m_startTime}");
            sb.AppendLine($"End Time: {this.m_endTime}");
            sb.AppendLine();

            if (this.m_endTime != DateTime.MinValue) 
                sb.AppendLine($"Total Time: {this.m_endTime - this.m_startTime}");
            this.m_textBox2.SetText(sb.ToString());
        }
    }
}

There we go. It does exactly what the WinForms application does. Now, let's run this and see the results.

[Image]

It took about 3 minutes and 16 seconds. That's over a minute faster!
So, something clearly is more efficient. But why? Let's take a look at the memory stats.

[Image]

Gen 0 Collections: 0
Gen 1 Collections: 0
Finalization Survivors: 0
Total Bytes Allocated: 0

What!? To be candid, this is quite misleading, though it's kind of true. I use the excellent Process Hacker to collect the CLR details. However, the way this works is that it needs a GC to take place in order to get these statistics, and what happened here is that a GC never took place. 100k trips through the message loop, and not even a single GC has happened! That's really cool. Now we're barking up the C/C++ tree, and in style :)
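Since an external tool only has something to show once a collection has occurred, one possible cross-check is to sample the runtime's own counters from inside the process. The sketch below is not how the numbers in this post were collected; GcCounterSketch and Gen0CollectionsDuring are hypothetical names, while GC.CollectionCount is the real framework call.

```csharp
using System;

static class GcCounterSketch
{
    // Returns how many Gen 0 collections the runtime reports while 'work' runs.
    public static int Gen0CollectionsDuring(Action work)
    {
        int before = GC.CollectionCount(0);
        work();
        return GC.CollectionCount(0) - before;
    }

    static void Main()
    {
        int collections = Gen0CollectionsDuring(() =>
        {
            // Stand-in for a message burst that allocates a small buffer per message.
            for (int i = 0; i < 100000; i++)
            {
                var garbage = new byte[8 * 1024];
                garbage[0] = 1;
            }
        });

        Console.WriteLine($"Gen 0 collections during burst: {collections}");
    }
}
```

Wrapping the resize loop of either test program in the same way would show a difference of zero only if the loop really is allocation-free.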

Let's look at the other bits of data.

[Image]

Cycles: 328,556,899,144
Kernel time: 1m:58s
User time: 0m:15s
Total time: 2m:14s
Page faults: 5494

Most of this, including the cycle count, doesn't interest us much, since we already know it from the timing. While these numbers aren't very accurate (the application may have kept running a little beyond the lifetime of the test), that isn't going to skew our results in any way. Evidently, WinApi has been highly efficient. But why? Look at the page faults. It's a mere 5.5k as opposed to the roughly 1.12 million that happened with Windows Forms. That should give a clue. By using the stack for almost everything and leaving the GC entirely to your application, WinApi takes C#/.NET Win32 desktop applications right into the C++ performance arena.
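The "stack for almost everything" idea is visible in the OnSize(ref SizePacket packet) signature used in the listing above. The following is a minimal sketch of the general pattern, contrasted with a class-based event-args object. All types here are hypothetical stand-ins, not WinApi or WinForms types, and it makes no claim about how either library is implemented internally.

```csharp
using System;

// Hypothetical value-type packet: lives on the caller's stack.
struct SizePacketSketch
{
    public int Width;
    public int Height;
}

// Hypothetical class-based equivalent: every instance is a heap allocation.
sealed class SizeEventArgsSketch : EventArgs
{
    public int Width;
    public int Height;
}

static class PacketSketch
{
    // Handler that receives the packet by reference: no copy, no allocation.
    static long HandleSize(ref SizePacketSketch packet)
    {
        return (long)packet.Width * packet.Height;
    }

    // Handler that receives a heap-allocated event-args object.
    static long HandleSize(SizeEventArgsSketch e)
    {
        return (long)e.Width * e.Height;
    }

    // Dispatches 'messages' size notifications without touching the GC heap.
    public static long PumpWithStructs(int messages)
    {
        long total = 0;
        for (int i = 0; i < messages; i++)
        {
            var packet = new SizePacketSketch { Width = 640, Height = 480 };
            total += HandleSize(ref packet);
        }
        return total;
    }

    // Same work, but allocates one short-lived object per message.
    public static long PumpWithClasses(int messages)
    {
        long total = 0;
        for (int i = 0; i < messages; i++)
        {
            var e = new SizeEventArgsSketch { Width = 640, Height = 480 };
            total += HandleSize(e);
        }
        return total;
    }

    static void Main()
    {
        Console.WriteLine(PumpWithStructs(100000));
        Console.WriteLine(PumpWithClasses(100000));
    }
}
```

Both loops compute the same result; the difference is that the second one hands the GC 100,000 small objects to clean up, which is the kind of per-message garbage the numbers above are measuring.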

What WinApi.Windows doesn't do

While you can do everything right on top of WinApi, you'll have to do the layout and control subclassing all by yourself. Actually, that's the only major feature set WinForms has that's missing from WinApi.

Providing a set of base interfaces and base classes, even at parity with WinForms's IControl, shouldn't be too time-consuming. However, the reason I haven't written them is that I have no intention of even trying to provide a replacement for WinForms, which already does what it's designed to do quite well. My intent with this project is not a GUI library, but rather a lean, light and efficient native interop layer.

If you want to build your own modern toolkit with technologies like Direct2D, Skia, etc., this should serve as the perfect foundation on Windows, and as a low-level way to interact with the OS.

WinApi.Windows.Controls provides some basic controls like Button, Edit and Static, but at this stage it's more of a sample library on how to write GDI-based controls than an actual usable library. Today, we tend to build with more modern libraries like Direct2D and Skia (WPF and Windows Runtime XAML, for example, both sit on top of DirectX), not the aged GDI.