Creating a New (Complex) Algorithm
In this section, we will explain the steps for creating a more complicated algorithm (for vertically partitioned data) in Python.
In the previous section, we saw how to create a simple algorithm in Vantage. However, it only covered the essentials; real algorithms tend to be more complex. In this section, we will explain the steps to create a more complicated algorithm in Python, one which includes an iterative process (e.g., a loop) between the central server and the nodes.
Scenario
For this example, we will consider the scenario in which we wish to analyse vertically-partitioned data from two different parties (Fig. 1). Specifically, we want to obtain the coefficients of a regression model. However, adapting the steps presented here to the horizontally-partitioned case is straightforward.
Fig. 1. Vertically-partitioned data. In this case, parties have different features from the same instances (i.e., patients).
Assumptions
We will assume the following:
It is possible to mathematically decompose the algorithm in a distributed fashion.
Data across parties is aligned (i.e., patient 1 is in row 1 of all data partitions, patient 2 is in row 2 of all data partitions, and so on). This assumption is crucial for vertically-partitioned data.
Data have been pre-processed appropriately (i.e., data are clean and column names match).
Standardizing IO
The algorithm will receive its input parameters in a txt file. Furthermore, it will write its output to a txt file too. Both of these files will be located in the /app/ directory (i.e., /app/input.txt and /app/output.txt). In order to keep maximum flexibility, JSON will be used as the input/output format for all functions.
When passing strings in a JSON, use double quotes ("string") and not single quotes ('string'), since the latter are not JSON compliant.
You cannot send numpy arrays in JSON. Thus, when you need to send one, be sure to:
1. Convert it to a list (e.g., numpy_array.tolist())
2. Pack it into the JSON
3. Send it to a function
4. Receive it in the function
5. Convert it back to a numpy array (e.g., np.array(a_list))
Take algorithm_nodes as an example.
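As a minimal, self-contained sketch of that round trip (the array values and the "matrix" key are, of course, hypothetical):

```python
import json

import numpy as np

# Sender: convert the numpy array to a plain list before packing it into JSON
numpy_array = np.array([[1.0, 2.0], [3.0, 4.0]])
payload = json.dumps({"matrix": numpy_array.tolist()})

# Receiver: unpack the JSON and convert the list back to a numpy array
a_list = json.loads(payload)["matrix"]
matrix = np.array(a_list)
```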
The txt-files are likely to be replaced by a more robust system in the future.
Environment variables
The following environment variables need to be available to the algorithm:
| Variable name | Description |
| --- | --- |
| DATABASE_URI | Path to the data file (e.g., a csv file). Of course, this path will change for each party (i.e., node). |
| HOST | Host name and protocol (http/https) of the central server |
| API_PATH | API path of the central server |
| PORT | Port to which the central server listens |
The algorithm
We will need to create at least six different files (of course, in your case you could have more [supporting] files):
main.py
master.py
algorithm_central.py
algorithm_nodes.py
Dockerfile
requirements.txt
Let's go through each of these in detail: what they are supposed to do and how they are supposed to be structured.
main.py
This file is the main entry point for the Docker container; it triggers the execution of the algorithm inside it. Let's look at it in detail.
First, some standard imports:
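For instance (the exact set depends on your algorithm; these are the ones used in this example):

```python
import os
import sys
import json
```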
Then, we need to import the functions that will be triggered in the master container and in the nodes.
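Assuming the file and function names used throughout this example:

```python
# The master function coordinates the whole algorithm...
from master import master

# ...and the node functions do the local computations
from algorithm_nodes import count_patients, compute_conjugate_matrix, update_beta
```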
Notice that although more functions are called in the master container, these are all called by the master function, which is the one that is triggered. A master container is an instance of a Docker image which triggers the master function.
For convenience, we will also quickly define a couple of functions to display information to the console and to store in the log-file. These functions are particularly useful when debugging your code:
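A minimal version of such helpers could look like this:

```python
def info(msg):
    # Print a message to the console; Docker captures this in the container's logs
    sys.stdout.write(f"info > {msg}\n")

def warn(msg):
    sys.stdout.write(f"warn > {msg}\n")
```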
Afterwards, we will read the input to the algorithm from the file input.txt (as defined previously):
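Following the IO conventions defined above (the file contains JSON):

```python
info("Reading input.txt")
with open("/app/input.txt") as fp:
    input_ = json.load(fp)
```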
In the next step, we need to define which function will be triggered. When the user makes a request to run the algorithm, it will trigger the master function. When the algorithm is running, it will trigger the corresponding node functions. This looks as follows:
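A sketch of this dispatch, assuming the input JSON carries the name of the requested function under the key "method":

```python
# Map the method name from the input to the actual function object
method_name = input_.get("method")
if method_name == "master":
    method = master
else:
    method = {
        "count_patients": count_patients,
        "compute_conjugate_matrix": compute_conjugate_matrix,
        "update_beta": update_beta,
    }[method_name]
```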
Then, we will actually call the function. Notice how the prototype for all functions is the same (method(*args, **kwargs)), except for the master function, which requires a token as an input too.
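Assuming the arguments travel in the input JSON under "args" and "kwargs", and that the server supplies the master's token in the same file (how the token reaches the container may differ per version):

```python
args = input_.get("args", [])
kwargs = input_.get("kwargs", {})

if method_name == "master":
    # Only the master needs a token: it authenticates against the central server
    output = method(input_["token"], *args, **kwargs)
else:
    output = method(*args, **kwargs)
```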
Finally, we will write the output as a txt-file:
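Again following the IO conventions defined earlier:

```python
info("Writing output.txt")
with open("/app/output.txt", "w") as fp:
    fp.write(json.dumps(output))
```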
master.py
This file will do the heavy lifting. It will coordinate the flow of the algorithm between the central server and the nodes as a researcher would do. Let's look at it a little bit closer.
First, the imports:
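Something along these lines (ClientContainerProtocol is the client class referred to in the note below; compute_global_conjugate_matrix is used later on):

```python
import os
import sys
import json
import time

import numpy as np

# Client for talking to the central server; installed in the Docker container
from pytaskmanager.node.FlaskIO import ClientContainerProtocol

# The central (master-side) mathematics of our example
from algorithm_central import compute_global_conjugate_matrix
```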
For now, don't worry about pytaskmanager.node.FlaskIO. It will be installed in the Docker container later on. The import pytaskmanager will change in version 0.3.0.
Then, we will define the parameters of the algorithm execution:
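For our iterative example, two constants suffice (the values here are arbitrary):

```python
# Maximum number of iterations before giving up on convergence
ITERATIONS = 10

# Stop early once the coefficients change less than this between iterations
TOLERANCE = 1e-8
```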
Just like in main.py, we will define a couple of functions for printing in the console.
Afterwards, we will define a master function, which will contain the workflow of the algorithm. Its prototype will be master(token, organization1_id, organization2_id), which can be extended to as many organizations (i.e., parties, nodes) as necessary. In this case, for the sake of simplicity, we will only consider two.
Inside master, the first thing is to set up the connection client. We can do so like this:
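A sketch, assuming the client's constructor takes the token plus the connection details from the environment variables listed earlier:

```python
def master(token, organization1_id, organization2_id):
    info("Setting up connection with the central server")
    client = ClientContainerProtocol(
        token=token,
        host=os.environ["HOST"],
        port=os.environ["PORT"],
        path=os.environ["API_PATH"],
    )
```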
For convenience, we will define the names of the columns of interest of each party (node).
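These column names are, of course, hypothetical:

```python
    # Features held by each party; the data are vertically partitioned
    columns_party1 = ["age", "weight"]
    columns_party2 = ["blood_pressure", "outcome"]
```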
Suppose that our algorithm requires counting the number of patients at each node. After that, we need to compute the conjugate of the feature matrix at each node, too. Every time that we want to execute a function in a node, we need to create a task and wait for its results. This would result in the following code:
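Continuing inside master, and assuming create_new_task takes the (JSON-serializable) input plus the IDs of the organizations that should run it:

```python
    # Task 1: count the number of patients at each node
    task = client.create_new_task(
        input_={"method": "count_patients"},
        organization_ids=[organization1_id, organization2_id],
    )
    counts = wait_for_results(client, task)

    # Task 2: compute the conjugate of each party's feature matrix
    task = client.create_new_task(
        input_={"method": "compute_conjugate_matrix",
                "kwargs": {"columns": columns_party1}},
        organization_ids=[organization1_id],
    )
    conjugate1 = wait_for_results(client, task)

    task = client.create_new_task(
        input_={"method": "compute_conjugate_matrix",
                "kwargs": {"columns": columns_party2}},
        organization_ids=[organization2_id],
    )
    conjugate2 = wait_for_results(client, task)
```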
The function wait_for_results is defined outside master and looks like this:
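A sketch, assuming the client exposes request (to poll a task's status) and get_results (to fetch its JSON-encoded results):

```python
def wait_for_results(client, task):
    # Poll the central server until the task is marked as complete
    task_id = task.get("id")
    while not task.get("complete"):
        info("Waiting for results...")
        time.sleep(1)
        task = client.request(f"task/{task_id}")

    # Fetch the results and decode the JSON that each node wrote
    results = client.get_results(task_id=task_id)
    return [json.loads(result.get("result")) for result in results]
```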
It is worth mentioning something important here. The way the code is now, we would first ask for the number of patients, wait for the results, ask for the conjugate matrices, and wait for the results. However, these two processes are independent of each other. In other words, we don't need the number of patients to compute the conjugate matrices. Therefore, we can improve the performance of our code by doing a little bit of reshuffling:
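Same tasks as before, but now we post them all first and only then start waiting, so the nodes work in parallel:

```python
    # Post all independent tasks up front...
    task_counts = client.create_new_task(
        input_={"method": "count_patients"},
        organization_ids=[organization1_id, organization2_id],
    )
    task_conjugate1 = client.create_new_task(
        input_={"method": "compute_conjugate_matrix",
                "kwargs": {"columns": columns_party1}},
        organization_ids=[organization1_id],
    )
    task_conjugate2 = client.create_new_task(
        input_={"method": "compute_conjugate_matrix",
                "kwargs": {"columns": columns_party2}},
        organization_ids=[organization2_id],
    )

    # ...and only then wait for each of them
    counts = wait_for_results(client, task_counts)
    conjugate1 = wait_for_results(client, task_conjugate1)
    conjugate2 = wait_for_results(client, task_conjugate2)
```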
Now, let's suppose that our algorithm needs to obtain the sum of the conjugate matrices. This operation is done in the master container as follows:
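Since this step only touches intermediate results (never raw data), it can safely be done centrally; compute_global_conjugate_matrix is defined in algorithm_central.py further below:

```python
    # Rebuild numpy arrays from the JSON-decoded lists and combine them
    global_conjugate = compute_global_conjugate_matrix(
        np.array(conjugate1[0]), np.array(conjugate2[0])
    )
```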
Finally, let's suppose that the algorithm will make an initial guess of the coefficients and will update this value iteratively depending on the global conjugate matrix and the data at each node. It will stop updating after a certain number of iterations has been reached (defined earlier as ITERATIONS) or when the change between the coefficients is less than TOLERANCE (i.e., when the coefficients have converged). We can do this using the following code:
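A sketch of such a loop; how the nodes' partial updates are combined (here, simply summed) depends entirely on your algorithm's mathematics:

```python
    # Initial guess for the coefficients
    beta = np.zeros(len(columns_party1) + len(columns_party2))

    for i in range(ITERATIONS):
        task = client.create_new_task(
            input_={"method": "update_beta",
                    "kwargs": {"beta": beta.tolist(),
                               "global_conjugate": global_conjugate.tolist()}},
            organization_ids=[organization1_id, organization2_id],
        )
        results = wait_for_results(client, task)

        # Combine the nodes' partial updates (a plain sum, as an assumption)
        new_beta = np.sum([np.array(result) for result in results], axis=0)

        # Stop once the coefficients have converged
        if np.max(np.abs(new_beta - beta)) < TOLERANCE:
            info(f"Converged after {i + 1} iterations")
            beta = new_beta
            break
        beta = new_beta

    return {"beta": beta.tolist()}
```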
algorithm_central.py
In this file, we will define the (mathematical) functions that will be executed by the master container. For our example, this file would look something like this.
We will start with the same basics as before: imports and a couple of functions for printing to the console.
Afterwards, we will define all the actual functions. In our example, this was the function compute_global_conjugate_matrix:
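Remember that the "conjugate" mathematics is only a stand-in for whatever your algorithm actually needs; a sketch:

```python
import numpy as np

def compute_global_conjugate_matrix(*conjugate_matrices):
    # Combine the per-node conjugate matrices into a single global one
    # (here, an element-wise sum; dimensions must match across parties)
    return np.sum(conjugate_matrices, axis=0)
```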
algorithm_nodes.py
In this file, we will define the (mathematical) functions that will be executed by the nodes. For our example, this file would look something like this.
We will start with the same basics as before: imports and a couple of functions for printing to the console.
Afterwards, we will define all the actual functions. In our example, these were the functions count_patients, compute_conjugate_matrix, and update_beta:
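A sketch of the node side. Each node reads its own data partition from the path in DATABASE_URI; the body of update_beta is left as a comment because it depends on your algorithm's mathematics:

```python
import os

import numpy as np
import pandas as pd

def _load_data(columns=None):
    # Every node reads its own partition from the path in DATABASE_URI
    data = pd.read_csv(os.environ["DATABASE_URI"])
    return data[columns] if columns else data

def count_patients():
    # One row per patient; rows are assumed to be aligned across parties
    return _load_data().shape[0]

def compute_conjugate_matrix(columns):
    X = _load_data(columns).values
    # Return a list, not a numpy array, since the result travels as JSON
    return np.conjugate(X).tolist()

def update_beta(beta, global_conjugate):
    beta = np.array(beta)
    global_conjugate = np.array(global_conjugate)
    # ... update beta using the global conjugate matrix and the local data ...
    return beta.tolist()
```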
Dockerizing the algorithm
Finally, just like in the previous example, we will dockerize the algorithm and push it to a registry from which the nodes can retrieve it. To do so, we need to generate a requirements.txt where we specify which packages are needed:
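For our example, something like this (pin versions as needed; the exact package that provides the client may differ per infrastructure version):

```
numpy
pandas
pytaskmanager
```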
Then, we will generate a Dockerfile, which dictates how the Docker image is built:
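A minimal sketch of such a Dockerfile:

```
FROM python:3

# Install the algorithm's dependencies
COPY requirements.txt /app/requirements.txt
RUN pip install -r /app/requirements.txt

# Copy the algorithm's files into the image
COPY *.py /app/

WORKDIR /app

# main.py is the entry point that dispatches to the master/node functions
CMD ["python", "main.py"]
```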
Now a Docker image can be created from our project using this recipe, and we can push it to the registry:
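For example (the registry and image names are, of course, placeholders):

```
docker build -t registry.example.com/my-vertical-algorithm .
docker push registry.example.com/my-vertical-algorithm
```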
Final remarks
You can see the final versions of each file below:
Some important considerations:
You will probably be performing lots of matrix operations. Watch out for your matrices' dimensionality!
If this is the first time you implement your algorithm, make sure you have validated it beforehand outside of the infrastructure (ideally comparing its performance against its centralized counterpart).
When communicating between the central server and the nodes, make sure you don't send raw data!
In this section, we showed how to create an iterative distributed learning algorithm (for vertically-partitioned data). We showed the required files (and their structure), workflow, and considerations in a hypothetical case. If you want to see a real-life example, take a look at our GitHub repository for the VERTIGO algorithm.