Creating a New (Complex) Algorithm

In this section, we will explain the steps for creating a more complicated algorithm (for vertically partitioned data) in Python.

In the previous section, we saw how to create a simple algorithm in Vantage. However, it only covered the essentials; real algorithms tend to be more complex. In this section, we will explain the steps to create a more complicated algorithm in Python, one which includes an iterative process (e.g., a for loop) between the central server and the nodes.

Scenario

For this example, we will consider the scenario in which we wish to analyse vertically-partitioned data from two different parties (Fig. 1). Namely, we want to obtain a regression model's β coefficients. However, adapting the steps presented here for the horizontally-partitioned case is straightforward.

Fig. 1. Vertically-partitioned data. In this case, parties have different features from the same instances (i.e., patients).

Assumptions

We will assume the following:

  1. It is possible to mathematically decompose the algorithm in a distributed fashion.

  2. Data across parties is aligned (i.e., patient 1 is in row 1 of all data partitions, patient 2 is in row 2 of all data partitions, and so on). This assumption is crucial for vertically-partitioned data.

  3. Data have been pre-processed appropriately (i.e., data are clean and column names match).

Standardizing IO

The algorithm will receive its input parameters in a txt file. Furthermore, it will write its output to a txt file too. Both of these files will be located in the /app/ directory (i.e., /app/input.txt and /app/output.txt). In order to keep maximum flexibility, JSON must be used as the input/output format for all functions.

Environment variables

The following environment variables need to be available to the algorithm:

Variable name    Description
---------------  ----------------------------------------------------------------
DATABASE_URI     Path to the data file (e.g., a csv file). Of course, this path
                 will change for each party (i.e., node).
HOST             Host name and protocol (http/https) of the central server.
API_PATH         API path of the central server.
PORT             Port to which the central server listens.

From version 0.3.0 onwards, the same environment variables are used; however, HOST, API_PATH, and PORT point towards a local proxy server, as the algorithm container no longer has internet access.

The algorithm

We will need to create at least six different files (of course, in your case you could have more [supporting] files):

  1. main.py

  2. master.py

  3. algorithm_central.py

  4. algorithm_nodes.py

  5. Dockerfile

  6. requirements.txt

Let's go through each of these in detail: what they are supposed to do and how they are supposed to be structured.

main.py

This file is the main entry point of the Docker container: it is what triggers the execution of the algorithm inside it. Let's look at it in detail.

You can think of a Docker image as a template/blueprint, and of a Docker container as an instance of such a template.

First, some standard imports:
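For example (a minimal sketch; your algorithm may need more):

```python
# main.py
import os      # environment variables and file paths
import json    # the input/output files contain JSON
import time    # time-stamping log messages
```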

Then, we need to import the functions that will be triggered in the master container and in the nodes.
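Assuming the file names listed above, and the node functions that we will define later (count_patients, compute_conjugate_matrix, and update_beta), this could look as follows:

```python
# The master function orchestrates the whole workflow (defined in master.py).
from master import master

# The node functions are executed at each party (defined in algorithm_nodes.py).
from algorithm_nodes import count_patients, compute_conjugate_matrix, update_beta
```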

Notice that although there are more functions that are called in the master container, these are all called by the master function, which is the one that is triggered. A master container is an instance of a docker image which triggers the master function.

For convenience, we will also quickly define a couple of functions to display information to the console and to store in the log-file. These functions are particularly useful when debugging your code:
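A simple sketch of such helpers (feel free to make them as elaborate as you need):

```python
def info(msg):
    # Docker captures stdout, so these messages also end up in the container logs.
    print(f"{time.strftime('%H:%M:%S')} - INFO - {msg}", flush=True)


def warn(msg):
    print(f"{time.strftime('%H:%M:%S')} - WARN - {msg}", flush=True)
```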

Afterwards, we will read the input to the algorithm from the file input.txt (as defined previously):
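For example, assuming (purely for illustration) that the input specifies the method to run plus its arguments:

```python
info("Reading input.txt")
with open("/app/input.txt") as input_file:
    input_ = json.load(input_file)

# Example content: {"method": "master", "args": [], "kwargs": {...}}
method_name = input_["method"]
args = input_.get("args", [])
kwargs = input_.get("kwargs", {})
```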

In the next step, we need to define which function will be triggered. When the user makes a request to run the algorithm, it will trigger the master function. When the algorithm is running, it will trigger the corresponding node functions. This looks as follows:
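A possible way to do this is with a small dictionary that maps the method name from input.txt to the actual Python function:

```python
# Map the method name (a string) to the function that should be executed.
methods = {
    "master": master,
    "count_patients": count_patients,
    "compute_conjugate_matrix": compute_conjugate_matrix,
    "update_beta": update_beta,
}
method = methods[method_name]
```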

The convention for defining the keys in the method dictionary is to use snake_case (i.e., lowercase_separated_by_underscores). Please avoid any other casing!

Then, we will actually call the function. Notice how the prototype for all functions is the same (method(*args, **kwargs)), except for the master function, which requires a token as an input too.
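For instance (here we assume, purely for illustration, that the token is made available to the master container as an environment variable; how this happens depends on your Vantage version):

```python
info(f"Calling method '{method_name}'")
if method_name == "master":
    # The master function additionally needs a token to authenticate with the
    # central server (assumption: available as the TOKEN environment variable).
    token = os.environ["TOKEN"]
    output = method(token, *args, **kwargs)
else:
    output = method(*args, **kwargs)
```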

Finally, we will write the output as a txt-file:
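For example:

```python
info("Writing output.txt")
with open("/app/output.txt", "w") as output_file:
    json.dump(output, output_file)

info("Done!")
```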

master.py

This file will do the heavy lifting. It will coordinate the flow of the algorithm between the central server and the nodes, just as a researcher would. Let's take a closer look.

First, the imports:
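A minimal sketch (the client import is an assumption; the exact package and class depend on your Vantage version):

```python
# master.py
import os
import json
import time

import numpy as np

# Function(s) executed inside the master container itself.
from algorithm_central import compute_global_conjugate_matrix

# Client used to communicate with the central server (version-dependent).
from vantage6.client import ContainerClient
```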

Then, we will define the parameters of the algorithm execution:
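For our example these could be (WAIT is an extra helper constant, used further below when polling for results):

```python
# Parameters that control the execution of the algorithm.
ITERATIONS = 100   # maximum number of iterations of the update loop
TOLERANCE = 1e-8   # convergence threshold for the beta coefficients
WAIT = 10          # seconds between polls while waiting for node results
```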

Just like in main.py, we will define a couple of functions for printing to the console.

Afterwards, we will define a master function, which will contain the workflow of the algorithm. Its prototype will be master(token, organization1_id, organization2_id), which can be extended to as many organizations (i.e., parties, nodes) as necessary. In this case, for the sake of simplicity, we will only consider two.

Inside master, the first thing is to set up the connection client. We can do so like this:
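Assuming the ContainerClient imported above (the constructor arguments are an assumption; check the client of your version):

```python
def master(token, organization1_id, organization2_id):
    info("Setting up the connection to the central server")
    client = ContainerClient(
        token=token,
        host=os.environ["HOST"],
        port=os.environ["PORT"],
        path=os.environ["API_PATH"],
    )
```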

For convenience, we will define the names of the columns of interest of each party (node).
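For instance (the column names are purely illustrative):

```python
    # Columns of interest at each party: in this example, party 1 holds the
    # outcome and party 2 holds the predictors.
    columns_party1 = ["outcome"]
    columns_party2 = ["age", "bmi", "smoking"]
```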

In the code block above, the column names are hard-coded. This is not ideal, as you would have to rebuild the entire image whenever different column names are used. A better solution would be to supply these as input to the master method.

Suppose that our algorithm requires counting the number of patients at each node. After that, we need to compute the conjugate of the feature matrix at each node, too. Every time that we want to execute a function in a node, we need to create a task and wait for its results. This would result in the following code:
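A sketch of this (create_new_task and its arguments are assumptions based on a vantage6-style client):

```python
    # Ask every node for its number of patients, and wait for the results.
    info("Creating task: count_patients")
    task_count = client.create_new_task(
        input_={"method": "count_patients", "kwargs": {}},
        organization_ids=[organization1_id, organization2_id],
    )
    n_patients = wait_for_results(client, task_count)[0]  # rows are aligned across parties

    # Ask each node for the conjugate of its feature matrix, and wait again.
    info("Creating tasks: compute_conjugate_matrix")
    task_conj1 = client.create_new_task(
        input_={"method": "compute_conjugate_matrix", "kwargs": {"columns": columns_party1}},
        organization_ids=[organization1_id],
    )
    task_conj2 = client.create_new_task(
        input_={"method": "compute_conjugate_matrix", "kwargs": {"columns": columns_party2}},
        organization_ids=[organization2_id],
    )
    conjugate_matrices = wait_for_results(client, task_conj1) + wait_for_results(client, task_conj2)
```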

The function wait_for_results is defined outside master and looks like this:
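A possible implementation is sketched below; the polling endpoint and the way results are decoded are assumptions (here we assume each node's result is the JSON it wrote to its output.txt):

```python
def wait_for_results(client, task):
    # Poll the central server until all nodes have finished the given task,
    # then collect and return their (JSON-decoded) results, one entry per node.
    task_id = task.get("id")
    task = client.request(f"task/{task_id}")
    while not task.get("complete"):
        info(f"Waiting for results of task {task_id}...")
        time.sleep(WAIT)
        task = client.request(f"task/{task_id}")

    info(f"Obtaining results of task {task_id}")
    results = client.get_results(task_id=task_id)
    return [json.loads(result["result"]) for result in results]
```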

It is worth mentioning something important here. The way the code is now, we would first ask for the number of patients, wait for the results, ask for the conjugate matrices, and wait for the results. However, these two processes are independent of each other. In other words, we don't need the number of patients to compute the conjugate matrices. Therefore, we can improve the performance of our code by doing a little bit of reshuffling:
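Reshuffled, the same steps could look like this: all tasks are created first, and only then do we wait for their results, so the nodes can work on them simultaneously:

```python
    # Create all tasks first...
    task_count = client.create_new_task(
        input_={"method": "count_patients", "kwargs": {}},
        organization_ids=[organization1_id, organization2_id],
    )
    task_conj1 = client.create_new_task(
        input_={"method": "compute_conjugate_matrix", "kwargs": {"columns": columns_party1}},
        organization_ids=[organization1_id],
    )
    task_conj2 = client.create_new_task(
        input_={"method": "compute_conjugate_matrix", "kwargs": {"columns": columns_party2}},
        organization_ids=[organization2_id],
    )

    # ...and only then wait for the results.
    n_patients = wait_for_results(client, task_count)[0]
    conjugate_matrices = wait_for_results(client, task_conj1) + wait_for_results(client, task_conj2)
```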

When implementing your algorithm, keep in mind which steps need to be done sequentially and which steps can be done in parallel and order your code accordingly.

Now, let's suppose that our algorithm needs to obtain the sum of the conjugate matrices. This operation is done in the master container as follows:
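In the master container this is simply a local function call (the function itself is defined in algorithm_central.py, shown further below):

```python
    info("Computing the global conjugate matrix")
    global_conjugate_matrix = compute_global_conjugate_matrix(conjugate_matrices)
```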

Finally, let's suppose that the algorithm makes an initial guess of the β coefficients and updates this value iteratively depending on the global conjugate matrix and the data at each node. It will stop updating after a certain number of iterations has been reached (defined earlier as ITERATIONS) or when the change between the coefficients is less than TOLERANCE (i.e., when the β coefficients have converged). We can do this using the following code:
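A sketch of such a loop; the way the partial updates are combined (here a simple average) is purely illustrative and depends on the mathematics of your algorithm:

```python
    # Initial guess for the beta coefficients (one per feature).
    n_features = len(columns_party1) + len(columns_party2)
    beta = np.zeros(n_features)

    for iteration in range(ITERATIONS):
        # Ask each node to compute its partial update using its own data.
        task_update1 = client.create_new_task(
            input_={"method": "update_beta",
                    "kwargs": {"beta": beta.tolist(),
                               "global_conjugate_matrix": global_conjugate_matrix,
                               "columns": columns_party1}},
            organization_ids=[organization1_id],
        )
        task_update2 = client.create_new_task(
            input_={"method": "update_beta",
                    "kwargs": {"beta": beta.tolist(),
                               "global_conjugate_matrix": global_conjugate_matrix,
                               "columns": columns_party2}},
            organization_ids=[organization2_id],
        )
        updates = wait_for_results(client, task_update1) + wait_for_results(client, task_update2)

        # Combine the partial updates into the new coefficients (illustrative: an average).
        new_beta = np.mean([np.asarray(update) for update in updates], axis=0)

        # Stop when the coefficients have converged.
        if np.max(np.abs(new_beta - beta)) < TOLERANCE:
            info(f"Converged after {iteration + 1} iterations")
            beta = new_beta
            break
        beta = new_beta
    else:
        warn("Maximum number of iterations reached without convergence")

    return {"beta": beta.tolist(), "n_patients": n_patients}
```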

algorithm_central.py

In this file, we will define the (mathematical) functions that will be executed by the master container. For our example, this file would look something like this.

We will start with the same basics as before: imports and a couple of functions for printing to the console.

Afterwards, we will define all the actual functions. In our example, this was the function compute_global_conjugate_matrix:
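A sketch; the element-wise sum follows the description above and assumes the partial matrices all have the same shape (e.g., patients by patients):

```python
# algorithm_central.py
import numpy as np


def compute_global_conjugate_matrix(conjugate_matrices):
    # Combine the partial conjugate matrices received from the nodes by
    # summing them element-wise.
    global_matrix = np.sum([np.asarray(m) for m in conjugate_matrices], axis=0)

    # Return a plain (JSON-serializable) nested list.
    return global_matrix.tolist()
```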

algorithm_nodes.py

In this file, we will define the (mathematical) functions that will be executed by the nodes. For our example, this file would look something like this.

We will start with the same basics as before: imports and a couple of functions for printing to the console.

Afterwards, we will define all the actual functions. In our example, these were the functions count_patients, compute_conjugate_matrix, and update_beta:
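A sketch of these functions; the actual computations (especially update_beta) are placeholders, since they depend entirely on the mathematics of your algorithm. Note how each node reads its own data through the DATABASE_URI environment variable:

```python
# algorithm_nodes.py
import os

import numpy as np
import pandas as pd


def _load_data(columns=None):
    # Each node reads its own data partition; the path is provided by the
    # node through the DATABASE_URI environment variable.
    data = pd.read_csv(os.environ["DATABASE_URI"])
    return data[columns] if columns else data


def count_patients():
    # Number of rows (patients) in this party's data partition.
    return len(_load_data())


def compute_conjugate_matrix(columns):
    # Illustrative: X X^T for this party's features. This yields a
    # patients-by-patients matrix, so the partial matrices of the parties
    # can be summed in the master container.
    X = _load_data(columns).to_numpy()
    return (X @ X.T).tolist()


def update_beta(beta, global_conjugate_matrix, columns):
    # Placeholder: compute this party's contribution to the updated beta
    # coefficients from its own data and the global conjugate matrix.
    X = _load_data(columns).to_numpy()
    beta = np.asarray(beta, dtype=float)
    # ... the actual update depends on your algorithm ...
    return beta.tolist()
```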

Dockerizing the algorithm

Finally, just like in the previous example, we will dockerize the algorithm and push it to a registry from which the nodes can retrieve the algorithms. To do so, we need to generate a requirements.txt where we specify which packages are needed:
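For the sketches above, requirements.txt could look something like this (adjust it to whatever your algorithm actually imports; the client package name depends on your Vantage version):

```text
numpy
pandas
vantage6-client
```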

Then, we will generate a Dockerfile, which dictates how the Docker image is built.
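A minimal sketch of such a Dockerfile (the base image and paths are illustrative):

```dockerfile
FROM python:3.10-slim

# Install the dependencies first, so this layer is cached between builds.
COPY requirements.txt /app/requirements.txt
RUN pip install -r /app/requirements.txt

# Only then copy the actual algorithm code.
COPY . /app
WORKDIR /app

# main.py is the entry point of the container.
CMD ["python", "main.py"]
```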

Note that we first copy requirements.txt and install the dependencies, and only then copy all other project files. This way, if we make an update to the code, the entire Docker image does not need to be rebuilt, which saves us a lot of time when developing.

Now a Docker image can be created from our project using this Dockerfile, and we can push it to the registry:
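For example (the registry and image name are placeholders; use your own):

```bash
docker build -t registry.example.com/my-complex-algorithm .
docker push registry.example.com/my-complex-algorithm
```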

Final remarks

You can see the final versions of each file below:

Some important considerations:

In this section, we showed how to create an iterative distributed learning algorithm (for vertically-partitioned data). We showed the required files (and their structure), workflow, and considerations in a hypothetical case. If you want to see a real-life example, take a look at our Github repository for the VERTIGO algorithm.
