Lua coroutines, which we discussed in earlier tutorials, offer a nice way of spreading long calculations over many frames, so your program does not appear stuck or frozen while you wait for those functions to return. Unfortunately, coroutines do not offer true multithreading, which means that your expensive multicore CPU goes largely underutilized. Thanks to ShiVa's underlying C++ core, Lua coroutines are not the only way to design parallel operations: using the power of the C++11 STL, you can achieve true concurrency via std::thread. Part 1 this week goes into the basics and shows you how to create threads without data races.
Setup
This tutorial requires a C++11 compliant compiler. For the majority of readers, this will be Visual Studio 2015 or above. Since we are exclusively using the STL to manage our threads, there is no need for you to write any hardware-dependent code by hand. The same source should therefore compile in GCC 4.7+ and Clang 3.1+ (Xcode 4.5+) without any changes.
When you compile with Clang or GCC, make sure you add the directive
-std=c++11
to your makefile's compiler flags, or change the C++ language dialect in Xcode accordingly.
Lua timer
For this tutorial, I am also using a Lua timer function that ticks forward every second and writes its state to the log, which makes it easy to spot when the application stalls or something interferes with logging.
local cft = this.curFrametime ( )
local dt  = application.getLastFrameTime ( )
local sum = (cft + dt)
if sum > 1 then
    local s = this.secs ( ) + 1
    log.message ( "Time: " ..s )
    this.secs ( s )
    this.curFrametime ( sum - 1 )
else
    this.curFrametime ( sum )
end
Plugin
For ease of use, the thread code will live inside a C++ plugin, so we do not have to export a native project every time we make a small change to the application inside ShiVa. Create a new plugin in either ShiVa 1.9.2 or 2.0 (either will work) and give it two basic functions:
bOK = thr.init(sAI) --init plugin with target AI name
bOK = thr.launchTimerThreads(nThreads) --create a number of threads
The .init function merely checks the AI name string and stores it, so we can use it later in the tutorial when we need to send data back.
// stored copy of the target AI name; a std::string copy avoids keeping
// a pointer into a temporary AIVariable
std::string _sAI = "test_threading";

int Callback_thr_init ( int _iInCount, const S3DX::AIVariable *_pIn, S3DX::AIVariable *_pOut )
{
    // [...]
    // Output Parameters
    S3DX::AIVariable bOK ;
    {
        const char *in = sAI.GetStringValue();
        if (in == nullptr || in[0] == '\0')
        {
            bOK.SetBooleanValue(false);
        }
        else
        {
            _sAI = in;
            bOK.SetBooleanValue(true);
        }
    }
    // [...]
}
You also need to include some additional STL headers on top of your MultiThreadingPlugin.cpp file:
#include <thread>
#include <atomic>
#include <chrono>
#include <mutex>
#include <sstream> // std::stringstream, used below to convert the thread id for logging
#include <string>
Anatomy of a Thread
There are a number of ways to create threads; the two most common are calls to std::thread() and std::async(). Since async() lives in the <future> header and is built around returning results through futures, we will stick with std::thread() for this tutorial:
std::thread t1(someThread);     // valid if someThread() takes no arguments
std::thread t2(someThread, 12); // passes 12 to someThread(int)
Generally speaking, a thread calls a function and optionally passes arguments to that function. The thread then runs until the function returns. If the std::thread object is destroyed while it is still joinable (neither joined nor detached), std::terminate() is called, which takes down your entire process (= game). To prevent that, threads need to be either joined or detached. The .join() function blocks the caller until the thread has finished, while detach() returns immediately and lets the thread continue in the background as a daemon-like worker. For our use in ShiVa, detached threads will be our main focus. Joined threads can still be used to further parallelize long calculations, however the go-to in ShiVa will be detach().
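To make the difference concrete, here is a minimal, self-contained sketch; the worker function and the 3-second sleep are placeholders invented purely for this illustration:
#include <chrono>
#include <thread>

// placeholder worker: just sleeps to simulate a long calculation
void worker(int id)
{
    std::this_thread::sleep_for(std::chrono::seconds(3));
}

void example()
{
    std::thread tJoined(worker, 1);
    std::thread tDetached(worker, 2);

    tDetached.detach(); // returns immediately; the thread keeps running in the background
    tJoined.join();     // blocks here until worker(1) has finished

    // neither thread object is joinable anymore, so destroying them is safe
}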
For instance, suppose your program needs to find all primes from 2 to 4 million. You would first detach() a thread that manages the prime calculation. Inside that detached manager thread, you would then spawn four threads which each calculate a portion of the problem, like 2..1M, 1M..2M, 2M..3M and 3M..4M, and join these sub-threads, since inside the manager you can afford to wait. The main ShiVa game thread is never blocked, because the parent of these joined threads is itself detached.
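A sketch of that pattern, with made-up function names and a placeholder instead of a real prime sieve, could look like this:
#include <thread>
#include <vector>

// placeholder: a real prime sieve over [from, to) would go here
void countPrimesInRange(unsigned from, unsigned to)
{
    (void)from; (void)to;
}

// manager thread: runs detached so ShiVa's main thread is never blocked
void primeManager()
{
    std::vector<std::thread> workers;
    workers.emplace_back(countPrimesInRange, 2u,       1000000u);
    workers.emplace_back(countPrimesInRange, 1000000u, 2000000u);
    workers.emplace_back(countPrimesInRange, 2000000u, 3000000u);
    workers.emplace_back(countPrimesInRange, 3000000u, 4000000u);

    // inside the detached manager we can afford to wait, so we join the workers
    for (auto &w : workers)
        if (w.joinable())
            w.join();
}

// called from the main (game) thread
void launchPrimeJob()
{
    std::thread manager(primeManager);
    manager.detach(); // returns immediately, manager and workers run in the background
}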
You can call join() and detach() only once for any given thread. To prevent errors, always test .joinable() before calling join(). A thread that has been detached cannot be joined afterwards.
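In practice, a guarded join (assuming t is a std::thread) looks like this:
if (t.joinable())
    t.join(); // safe: t has neither been joined nor detached yet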
For an in-depth introduction to C++11 threads, please read bogotobogo.com/cplusplus/multithreaded
Your first Thread
For our first test, we will create 4 independently running threads which take 3 seconds each to complete, and then output a short message to the log as soon as a thread finishes. The Lua call is of course trivial:
thr.launchTimerThreads ( 4 )
The corresponding C++ function makes use of STL threads for the first time:
int Callback_thr_launchTimerThreads ( int _iInCount, const S3DX::AIVariable *_pIn, S3DX::AIVariable *_pOut )
{
    S3DX_API_PROFILING_START( "thr.launchTimerThreads" ) ;
    // [...]
    // Output Parameters
    S3DX::AIVariable bOK ;
    for (int i = 0; i < nThreads.GetNumberValue(); ++i)
    {
        std::thread t(timerThread, i);
        t.detach();
    }
    bOK.SetBooleanValue(true);
    // [...]
}
As you can see, the function spawns threads in a for-loop until nThreads is reached. These threads call a function named "timerThread" and pass a single argument "i" along, which identifies each thread in the log. After creation, the threads are immediately detached. This effectively puts them into the background, so the callback can return without waiting for them to finish.
The thread function itself is equally simple. First, the thread's own ID (from std::this_thread::get_id()) is converted into a const char* so ShiVa can log it, then a handful of status messages follow. The function then sleeps for three seconds while keeping track of time, and logs again when it wakes up.
void timerThread(int threadID)
{
    // convert the internal thread id into a C string for ShiVa's log
    std::stringstream sstr;
    sstr << std::this_thread::get_id();
    const std::string tids = sstr.str();
    const char *tid = tids.c_str();

    S3DX::log.message("T", threadID, " (", tid, ") started");
    S3DX::log.message("T", threadID, " (", tid, ") will start a clock");
    S3DX::log.message("T", threadID, " (", tid, ") waits for 3s now");

    auto start = std::chrono::high_resolution_clock::now();
    std::this_thread::sleep_for(std::chrono::seconds(3)); // C++11 compatible, no chrono literals needed
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<float, std::milli> elapsed = end - start;

    S3DX::log.message("T", threadID, " (", tid, ") ended");
    S3DX::log.message("T", threadID, " (", tid, ") ran for ", elapsed.count(), "ms");
    S3DX::log.message("T", threadID, " (", tid, ") will now return");
    return;
}
Welcome to the races
The threads are launched one after the other in a for-loop, so naively we assume they also start and end in sequence: launch in order, run in parallel, finish in order, producing the following output:
T0 (#tid) started
T0 (#tid) will start a clock
T0 (#tid) waits for 3s now
-- same for T1, T2, T3
-- now some waiting messages from the ShiVa timer
-- then the end messages:
T0 (#tid) ended
T0 (#tid) ran for 3000 ms
T0 (#tid) will now return
-- same for T1, T2, T3
But this is not what we get. Not only do the threads not start in sequence T0-T3, the message order also changes with every run (T1 T0 T2 T3 on one launch, T3 T2 T1 T0 on the next, and so on). On top of that, there are genuine mistakes in the output, such as a doubled log line for T0. Times are never exactly 3 seconds either. In short, the log looks nothing like we expected.
What we have inadvertently created is known as a data race. A data race occurs when:
1. two or more threads (= our timerThread) in a single process (= ShiVa application) access
the same memory location concurrently, and
2. at least one of the accesses is for writing, and
3. the threads are not using any exclusive locks to control their accesses to that
memory.
The shared resource our threads are trying to write to (1.) is the string buffer behind S3DX::log.message. Since this is the only logging interface we have, we cannot change anything about (1.). Neither can we change (2.), since logging requires writing to that buffer, and all threads happen to finish at roughly the same time. Our only option is (3.): we need to manage write access to the log by introducing some sort of locking mechanism.
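Outside of ShiVa, the same problem can be demonstrated with a few lines of plain C++. In the following stand-alone sketch (unrelated to the plugin code), four threads increment one shared integer without any lock; because the unsynchronized increments overlap, the printed total usually falls short of the expected 400000:
#include <iostream>
#include <thread>
#include <vector>

int main()
{
    int counter = 0; // shared memory location, no lock, no atomic

    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([&counter]() {
            for (int i = 0; i < 100000; ++i)
                ++counter; // unsynchronized read-modify-write: a data race
        });

    for (auto &t : threads)
        t.join();

    // expected 400000, but the actual value is usually lower (undefined behaviour)
    std::cout << counter << std::endl;
    return 0;
}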
A simple locking mechanism
The two most important locking/synchronization mechanisms for multithreaded code are mutexes and atomics. Mutexes are generally easier to use, but have the unfortunate side effect of stalling other threads that wait on the same lock, while atomics allow for lock-free concurrency but are harder to understand and to program correctly. We will focus on mutexes for now.
A calling thread owns a mutex from the time that it successfully calls either lock or try_lock until it calls unlock. While a thread owns a mutex, all other threads that attempt to claim ownership will block (for calls to lock). In our case, locking a mutex that is shared across our timerThreads at the right time ensures that only one of these threads can write to the log at any given moment. It is therefore imperative that you keep your mutex-protected code sections as short as possible and unlock the mutex as soon as you are done.
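In its rawest form, which we will immediately replace with a safer RAII wrapper below, manually locking and unlocking such a shared mutex could look like this (mu_log and logSomething are illustrative names, not part of the plugin code):
#include <mutex>

std::mutex mu_log; // shared between all threads that want to log

void logSomething()
{
    mu_log.lock();   // blocks until this thread owns the mutex
    // ... critical section: keep it as short as possible ...
    mu_log.unlock(); // release ownership so waiting threads can proceed
}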
Let's fix our actual tutorial code by introducing such a mutex, this time managed by a lock_guard:
std::mutex mu_timer;

void timerThread(int threadID)
{
    // [...]
    {
        std::lock_guard<std::mutex> mylock(mu_timer);
        S3DX::log.message("T", threadID, " (", tid, ") started");
        S3DX::log.message("T", threadID, " (", tid, ") will start a clock");
        S3DX::log.message("T", threadID, " (", tid, ") waits for 3s now");
    }
    // [...]
}
First, we declare the mutex "mu_timer" outside the thread context. We then hand it to a "std::lock_guard", which provides a convenient RAII-style mechanism for owning a mutex for the duration of a scoped block: when a lock_guard object is created, it attempts to take ownership of the mutex it is given, and when control leaves the scope in which the lock_guard was created, the lock_guard is destroyed and the mutex is released. The protected section of the code must therefore be wrapped in its own scope {}.
I only wrapped the thread start messages in a lock guard, so we can see the difference between the "mutexed" start and the "unmutexed" end messages: even though the thread launch now appears in order, we still cannot make any assumptions about the end messages of the threads, which change order every time the program runs.
Summary
C++11 gave C++ multithreading support on the language level, without the need for platform- or OS-specific threading code on the user level. The most use ShiVa developers will get out of std::thread is through detach(), which gives you the freedom to offload long calculations onto otherwise underutilized CPU cores and compute the result without stalling the main game thread. As soon as multiple threads read from and write to the same resource(s), you open yourself up to data races, which must be avoided at all cost. The easiest way to avoid them is a mutex wrapped in a lock_guard.
Next week, we will look at how to retrieve the results from threaded calculations, explore how user AIs can offload long calculations into ShiVa AIModel states, and write a custom spinlock with atomics.