Creating a New Algorithm
In this section the basic steps for creating an algorithm for horizontally partitioned data are explained.
It is assumed that the algorithm you want to use is mathematically possible to compute on a partitioned dataset. In the following sections we use a simple average algorithm to explain the different steps to create an algorithm for Vantage. The following steps should be taken to create a Vantage-ready algorithm:
Mathematically decompose the model
Implement and test locally using your preferred language
Standardize I/O
Dockerize and push to a registry
The mathematical problem
We want to know the average of a column X from a dataset Q which contains $n$ samples. Dataset Q is horizontally partitioned into dataset $A = [a_1, a_2, \ldots, a_j]$ and dataset $B = [b_1, b_2, \ldots, b_k]$. The average of dataset Q is computed as:

$$\bar{Q} = \frac{x_1 + x_2 + \ldots + x_n}{n}$$

Now we would like to compute $\bar{Q}$ from datasets A and B. This could be computed as:

$$\bar{Q} = \frac{(a_1 + a_2 + \ldots + a_j) + (b_1 + b_2 + \ldots + b_k)}{j + k}$$
Thus we need to count the number of samples in each dataset, and we need the total sum of each dataset. Only then can we compute the global average of datasets A and B.
We cannot simply compute the average on each node and combine them, as this would be mathematically incorrect; that would only be correct if datasets A and B contain the same number of samples. For example, if $A = [1, 2]$ and $B = [3, 4, 5]$, averaging the local averages gives $(1.5 + 4)/2 = 2.75$, while the true mean is $15/5 = 3$.
The algorithm
The algorithm consists of two separate parts: 1) the central part of the algorithm, and 2) the part that is executed at the nodes. In the case of the average, the nodes compute the sum and the count of the samples in their dataset, and the central part of the algorithm combines these into a global average.
In these examples we use Python, however you are free to choose any language. The only requirements are: 1) it needs to be able to create HTTP requests, and 2) it needs to be able to read and write to files.
The node algorithm
The node that runs this part contains a CSV file with one column, numbers, which we want to use to compute the global mean. We assume that this column has no NaN values.
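A minimal sketch of what this node part could look like (the function name node_algorithm and the use of pandas are our own choices; only the column name numbers comes from the text above):

```python
import pandas as pd

def node_algorithm(csv_path, column_name="numbers"):
    """Compute the local sum and sample count for a single column."""
    numbers = pd.read_csv(csv_path)[column_name]
    # only these two aggregates leave the node, never the raw data
    return {"sum": float(numbers.sum()), "count": int(numbers.count())}
```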
The central algorithm
The central algorithm receives the sums and counts from all sites and combines these to a global mean.
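Continuing the sketch above, the central part only needs the node outputs:

```python
def central_algorithm(node_outputs):
    """Combine the per-node sums and counts into a global average."""
    total_sum = sum(output["sum"] for output in node_outputs)
    total_count = sum(output["count"] for output in node_outputs)
    return {"average": total_sum / total_count}
```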
Local testing
To test, simply create two datasets A and B, both having a numerical column numbers. Then run the following:
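Something along these lines, assuming the two sketches above live in a module average.py and the CSV file names are illustrative:

```python
from average import node_algorithm, central_algorithm

# simulate what each node would compute on its own dataset
results = [
    node_algorithm("a.csv"),
    node_algorithm("b.csv"),
]

# combine the partial results, as the central part would
print(central_algorithm(results))
```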
Standardizing I/O
The algorithm receives its parameter input through a txt file, and also writes its output back to a txt file. The path to the database is also available through an environment variable.
IO files and variables
/app/input.txt
The recommended format (to keep maximum flexibility) is a JSON file containing three keys: method, args and kwargs, in which method is the method name, and args and kwargs are the Python-style input for this method. In the case of the node algorithm in Python:
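For example (the method name comes from the sketches above; the kwargs are illustrative):

```json
{
    "method": "node_algorithm",
    "args": [],
    "kwargs": {"column_name": "numbers"}
}
```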
/app/output.txt
This file contains the output of the method that was triggered in the Docker image. If possible, use JSON as the output format.
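For the central part of our average example, the output file could simply contain:

```json
{"average": 3.0}
```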
Environment variables
The following environment variables are available to the algorithm:
DATABASE_URI: contains the path to the database file
HOST: contains the host name and protocol (http/https) of the central server
API_PATH: contains the API path of the central server
PORT: contains the port to which the central server listens
HOST, PORT and API_PATH are going to change in a future release of Vantage, as the containers will lose their direct internet connection; they will then communicate with the central server through a proxy.
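In Python, an algorithm could pick these up as follows (a sketch; only the variable names come from the list above):

```python
import os

import pandas as pd

# path to the local data file, provided by the node
database_uri = os.environ["DATABASE_URI"]

# location of the central server, used when posting tasks back
server_url = f"{os.environ['HOST']}:{os.environ['PORT']}{os.environ['API_PATH']}"

dataframe = pd.read_csv(database_uri)
```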
Algorithms
In the previous sections we created the node and central algorithms. In order to read the input and write the output, we need another piece of code. This can be seen as the main entry point for both algorithms. It should handle the following:
Read /app/input.txt and extract the method, args and kwargs.
In the case of the central algorithm, /app/token.txt needs to be read as well. This allows the central algorithm to post tasks back to the server. Note: this is subject to change.
Execute the method (using the parameters) specified in /app/input.txt.
Write the output of the method to /app/output.txt.
This main entry point script could be very similar for different algorithms.
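A sketch of such an entry point, reusing the functions from our hypothetical average.py module (the dispatch logic is our own; the file paths come from the list above):

```python
import json
import os

# the two algorithm parts from the hypothetical average.py module above
from average import node_algorithm, central_algorithm

# read /app/input.txt and extract method, args and kwargs
with open("/app/input.txt") as fp:
    input_ = json.load(fp)

methods = {"node_algorithm": node_algorithm,
           "central_algorithm": central_algorithm}
args = input_.get("args", [])
kwargs = input_.get("kwargs", {})

# the node algorithm reads its data from the path in DATABASE_URI
if input_["method"] == "node_algorithm":
    args = [os.environ["DATABASE_URI"]] + list(args)

# execute the requested method and write its output to /app/output.txt
result = methods[input_["method"]](*args, **kwargs)
with open("/app/output.txt", "w") as fp:
    json.dump(result, fp)
```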
Then we need one final piece of code. The master_algorithm is responsible for creating the (node_algorithm) tasks at the server and for retrieving the results of these tasks. After this has been done, the central_algorithm method can run to compute the global mean.
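A rough sketch of such a master algorithm. Note that the task endpoints, payloads and polling logic below are assumptions made for illustration, not the documented server API; only the environment variables and /app/token.txt come from the sections above:

```python
import json
import os
import time

import requests

from average import central_algorithm  # hypothetical module from earlier


def master_algorithm(column_name="numbers"):
    server = f"{os.environ['HOST']}:{os.environ['PORT']}{os.environ['API_PATH']}"

    # the token allows the master to post tasks back to the server
    with open("/app/token.txt") as fp:
        headers = {"Authorization": f"Bearer {fp.read().strip()}"}

    # create a node_algorithm task at the server (endpoint is an assumption)
    task = requests.post(f"{server}/task", headers=headers, json={
        "method": "node_algorithm",
        "args": [],
        "kwargs": {"column_name": column_name},
    }).json()

    # wait until all nodes have finished (simplified polling loop)
    while True:
        results = requests.get(f"{server}/task/{task['id']}/result",
                               headers=headers).json()
        if all(result["finished"] for result in results):
            break
        time.sleep(10)

    # combine the partial results into the global mean
    node_outputs = [json.loads(result["result"]) for result in results]
    return central_algorithm(node_outputs)
```

With this in place, the project might now look something like this (file names are our own):

```
average.py    # node_algorithm, central_algorithm and master_algorithm
main.py       # the main entry point described above
```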
Dockerize distributed algorithm
In this final step, the algorithm is dockerized and pushed to a registry from which the nodes can retrieve the algorithms. Add the following file to the project:
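A minimal Dockerfile could look like this (the base image and dependency list are assumptions based on our Python sketches):

```docker
FROM python:3

# install the dependencies used in the sketches above
RUN pip install pandas requests

# the I/O convention expects the algorithm files under /app
COPY . /app
WORKDIR /app

# run the main entry point when the container starts
CMD ["python", "main.py"]
```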
Now a docker-image can be created from our project using this docker-recipe, and we can push it to the registry.
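For example (the registry host and image name are placeholders):

```bash
docker build -t registry.example.com/average .
docker push registry.example.com/average
```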