Detailed Design of a Many-core System

Now, we will look into the detailed design of the solution to the problems we have while implementing a key-value store on a many-core system. We'll be discussing the solution to these three problems:

  1. Memory limitation of TilePro64

  2. Limitations of a multi-threaded Memcached

  3. Allocation of tasks to cores

Solution: Limitations of memory and Memcached

Our first challenge is that TilePro64 has a 32-bit instruction set, and we cannot assign more than 4GB of virtual address space to a single process. The second challenge identified how the use of global locks hurt the performance of our key-value store. Both problems can be solved by implementing a version of Memcached that supports multiple processes:

  1. Use multiple processes to access the key-value store in their own address space and overcome the memory limitation.

  2. Use data shardingBreaking data into smaller chunks. (a direct consequence of using multiple processes) to parallelize data access and stop the use of locks within a shard.

Use multiple processes

When using domain-specific processors, we run into unique problems like the TilePro64 having only a 32-bit virtual address space. So, rather than using a single process that uses multiple threads, we will use those threads to communicate with independent processes that contain the key-value data shard in their own dedicated address space. This will allow us to overcome the limitation of the 32-bit virtual address space, and each process will run its operations in serial while also allowing us to avoid inter-thread synchronizations.

Modifying Memcached

We have a multithreaded key-value store which unfortunately needs locks. Locks limit parallel processing in our system. As we allow each process to have its own dedicated address space to overcome the 4GB memory limit, it also allows us to stop using any locking protection. This is because all requests directed to a hashtable are handled by a single process, which is already serialized.

Flow of requests

After implementing the new multi-process version of Memcached, the following is the new flow of requests:

  1. A network core receives a request and passes it to a worker thread.

  2. The worker thread uses modulus arithmetic to identify the relevant shard process.

  3. The worker thread writes the request's information into a shared memory, which is both accessible by the thread and the process. Shared memory is used to store request data until its process becomes available to process it.

  4. Next, the worker thread notifies the shard process to STORE or GET request.

  5. The shard process can then directly send the reply back to the client.

Level up your interview prep. Join Educative to access 70+ hands-on prep courses.