Creating a New Algorithm
This section explains the basic steps for creating an algorithm for horizontally partitioned data.
The final code of this tutorial is published on GitHub. The algorithm is also published in our Docker registry: harbor2.vantage6.ai/demo/average
It is assumed that it is mathematically possible to create a federated version of the algorithm you want to use. In the following sections we create a federated algorithm to compute the average of a distributed dataset. An overview of the steps that we are going through:
Mathematically decompose the model
Implement and test locally
Vantage6 algorithm wrapper
Dockerize and push to a registry
📝 The mathematical problem
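We want to know the average of column X from a dataset Q which contains n samples. Dataset Q is horizontally partitioned into datasets A and B. Writing Q_i for the value of column X in sample i, the average of dataset Q is computed as:

$$Q_{mean} = \frac{1}{n}\sum_{i=1}^{n} Q_i = \frac{Q_1 + Q_2 + \dots + Q_n}{n}$$

Now we would like to compute $Q_{mean}$ from datasets A and B. If A contains j samples and B contains k samples, so that n = j + k, this can be computed as:

$$Q_{mean} = \frac{(A_1 + \dots + A_j) + (B_1 + \dots + B_k)}{j + k} = \frac{\sum_{i=1}^{j} A_i + \sum_{i=1}^{k} B_i}{j + k}$$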
We therefore need both the number of samples and the total sum from each dataset. From these we can compute the global average of datasets A and B.
We cannot simply compute the average on each node and then average those values, as this would be mathematically incorrect. It only gives the right answer if datasets A and B contain the same number of samples.
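The difference between the two approaches can be checked numerically. The sketch below uses two small, made-up partitions of unequal size (the values are purely illustrative) and compares the pooled mean, the mean built from sums and counts, and the naive average of averages.

```python
# Hypothetical example data: two partitions of unequal size.
dataset_a = [1.0, 2.0, 3.0]
dataset_b = [10.0, 20.0]


def mean(values):
    """Plain arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Reference value: pool all samples as if the data were in one place.
pooled_mean = mean(dataset_a + dataset_b)

# Federated approach: each partition only shares its sum and its count.
federated_mean = (sum(dataset_a) + sum(dataset_b)) / (len(dataset_a) + len(dataset_b))

# Naive approach: average the per-partition averages.
mean_of_means = (mean(dataset_a) + mean(dataset_b)) / 2

print(pooled_mean, federated_mean, mean_of_means)  # 7.2 7.2 8.5
```

The naive result is pulled towards the smaller partition because both partitions are weighted equally regardless of their size.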
👨💻 Implementation
In this example we use Python, however you are free to choose any language. The only requirements are: 1) it needs to be able to make HTTP requests, and 2) it needs to be able to read from and write to files.
However, if you use a different language you cannot use our wrapper. Reach us on Discord to discuss how this works.
Now that we have figured out the maths, we can translate it to an implementation. A federated algorithm consists of two parts:
Central part of the algorithm, which is responsible for combining the partial results from the data stations. In our case that would be dividing the sum of the totals by the total number of observations.
Federated part of the algorithm, which is responsible for creating the partial results. In our case this would be computing the total (=sum) and the number of observations.
💕 Federated Part
The node that runs this part has a CSV file with a column named numbers, which we want to use to compute the global mean. We assume that this column has no NaN values.
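A minimal sketch of the federated part, assuming the CSV file has a header row and using only the standard library. The function name federated_part and the keys "sum" and "count" are illustrative choices, not names prescribed by vantage6.

```python
import csv


def federated_part(path, column_name="numbers"):
    """Compute the local sum and number of observations of one CSV column.

    Assumes the file has a header row and that the column holds only
    numeric values (no NaN or empty cells).
    """
    with open(path, newline="") as handle:
        # DictReader uses the first row of the file as column names.
        values = [float(row[column_name]) for row in csv.DictReader(handle)]

    # Only these two aggregates leave the node, never the raw rows.
    return {"sum": sum(values), "count": len(values)}
```

Returning a plain dictionary keeps the partial result easy to serialize and easy to combine on the central side.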
❤ The central algorithm
The central algorithm receives the sums and counts from all sites and combines these into a global mean. The partial results may come from one or more sites.
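A minimal sketch of this combination step, assuming each site reports a dictionary with the keys "sum" and "count". Those key names and the function name central_part are illustrative.

```python
def central_part(node_outputs):
    """Combine per-site {"sum": ..., "count": ...} dicts into a global mean."""
    global_sum = sum(output["sum"] for output in node_outputs)
    global_count = sum(output["count"] for output in node_outputs)

    # Guard against dividing by zero when no site has any observations.
    if global_count == 0:
        raise ValueError("cannot compute a mean over zero observations")

    return {"average": global_sum / global_count}


# Example with the partial results of two hypothetical sites.
combined = central_part([{"sum": 6.0, "count": 3}, {"sum": 30.0, "count": 2}])
print(combined)  # {'average': 7.2}
```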
🧪 Local testing
To test, simply create two datasets A and B, both having a numerical column numbers. Then run the following:
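One possible test script is sketched below. It writes two small, made-up datasets to temporary CSV files, runs the federated part on each and feeds the outputs to the central part. Compact versions of both parts are defined inline so that the script runs on its own. The file names and the helper write_csv are assumptions made for this example.

```python
import csv
import os
import tempfile


def federated_part(path, column_name="numbers"):
    """Local sum and count of one CSV column."""
    with open(path, newline="") as handle:
        values = [float(row[column_name]) for row in csv.DictReader(handle)]
    return {"sum": sum(values), "count": len(values)}


def central_part(node_outputs):
    """Global mean from the per-site sums and counts."""
    total = sum(output["sum"] for output in node_outputs)
    count = sum(output["count"] for output in node_outputs)
    return {"average": total / count}


def write_csv(path, values, column_name="numbers"):
    """Hypothetical helper: write a single-column CSV with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([column_name])
        writer.writerows([value] for value in values)


with tempfile.TemporaryDirectory() as folder:
    path_a = os.path.join(folder, "dataset_a.csv")
    path_b = os.path.join(folder, "dataset_b.csv")
    write_csv(path_a, [1, 2, 3])
    write_csv(path_b, [10, 20])

    # Run the federated part once per dataset, then combine centrally.
    outputs = [federated_part(path) for path in (path_a, path_b)]
    result = central_part(outputs)

print(result)
```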
🌯 Algorithm wrapper
A good starting point would be to use the boilerplate from our GitHub. This section explains the steps needed to arrive at this boilerplate and provides some background information.
Now that we have a federated implementation of our algorithm we need to incorporate it in the vantage6 infrastructure. The infrastructure handles the communication with the server and provides data access to the algorithm.
The algorithm consumes a file containing the input. This file contains both the name of the method to be triggered and the arguments provided to that method. The algorithm also has access to a CSV file (in the future this could also be a database) on which it can run. Finally, when the algorithm is finished, it writes its output to a different file.
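The flow described above can be illustrated with a simplified stand-in for a wrapper. This is not the vantage6 wrapper itself: the JSON input format, the function run_algorithm, the METHODS lookup table and the example method count_rows are all assumptions made for this sketch.

```python
import csv
import json


def count_rows(data):
    """Hypothetical algorithm method: number of rows available at this node."""
    return {"count": len(data)}


# Methods that may be triggered by name from the input file.
METHODS = {"count_rows": count_rows}


def run_algorithm(input_path, data_path, output_path):
    # 1. Read the method name and its arguments from the input file.
    with open(input_path) as handle:
        task_input = json.load(handle)
    method = METHODS[task_input["method"]]

    # 2. Load the local data on which the method runs.
    with open(data_path, newline="") as handle:
        data = list(csv.DictReader(handle))

    # 3. Call the method and write its result to the output file.
    result = method(
        data, *task_input.get("args", []), **task_input.get("kwargs", {})
    )
    with open(output_path, "w") as handle:
        json.dump(result, handle)
```

Looking methods up in an explicit table means that only functions the algorithm author chose to expose can be triggered from the input file.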
The central part of the algorithm needs to be able to create (sub)tasks. These subtasks are responsible for executing the federated part of the algorithm. The central part of the algorithm can be executed either on the machine of the researcher or on one of the nodes in the vantage6 network. In this example we only show the case in which one of the nodes executes the central part of the algorithm. The node provides the algorithm with a JWT token so that the central part of the algorithm has access to the server to post these subtasks.
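📂 Package Structure
The algorithm needs to be structured as a Python package, so that it can be installed within the Docker image. The minimal file structure is a project folder (project_folder) containing a setup.py, a Dockerfile, and a package directory that holds an __init__.py. Each of these files is described below.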
We also recommend adding a README.md, LICENSE and requirements.txt to the project_folder.
setup.py
Contains the setup method to create a package from your algorithm code. Here you specify some details about your package and the dependencies it requires.
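A minimal setup.py could look like the sketch below. The version number, description, Python version and the listed dependencies are assumptions for illustration. Check the boilerplate repository for the values that match your vantage6 version.

```python
from setuptools import setup, find_packages

setup(
    # Package name; the Dockerfile's PKG_NAME argument must match this.
    name="v6-average-py",
    version="1.0.0",                      # placeholder version
    description="Federated average for vantage6",
    # Picks up every directory that contains an __init__.py.
    packages=find_packages(),
    python_requires=">=3.6",              # assumption; adjust to your environment
    install_requires=[
        "vantage6-client",                # assumption: provides the algorithm wrapper
        "pandas",                         # assumption: used to handle the CSV data
    ],
)
```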
🐳 Dockerfile
Contains the recipe for building the Docker image. Typically you only have to change the argument PKG_NAME to the name of your package. This name should be the same as the name you specified in setup.py. In our case that would be v6-average-py.
__init__.py
This file contains the code for your algorithm. It is possible to split the code over multiple files, however the methods that should be available to the researcher must be in this file. You can achieve that by simply importing them into this file (e.g. from .average import my_nested_method).
We can distinguish two types of methods that a user can trigger:
| name | description | prefix | arguments |
| --- | --- | --- | --- |
| master | Central part of the algorithm. Receives a client as argument, which provides an interface to the central server. This way the master can create tasks and collect their results. | (none) | (client, data, *args, **kwargs) |
| Remote procedure call | Consumes the data at the node to compute the partial. | RPC_ | (data, *args, **kwargs) |
For our average algorithm the implementation will look as follows:
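The original listing is not reproduced here. The following is a minimal sketch under stated assumptions: the client methods get_organizations_in_my_collaboration, create_new_task, get_task and get_results are modelled on the client interface described above and should be verified against your installed vantage6 version, and the wrapper is assumed to add the RPC_ prefix to the method name given in the task input. The data argument is assumed to support data[column_name], as a pandas DataFrame does.

```python
import time


def RPC_average_partial(data, column_name="numbers", *args, **kwargs):
    """Federated part: local sum and number of observations of one column."""
    values = data[column_name]
    return {"sum": float(sum(values)), "count": len(values)}


def master(client, data, column_name="numbers", *args, **kwargs):
    """Central part: create subtasks, wait for them, combine the partials."""
    # Ask every organization in the collaboration to run the partial.
    organizations = client.get_organizations_in_my_collaboration()
    ids = [organization["id"] for organization in organizations]
    task = client.create_new_task(
        input_={
            "method": "average_partial",  # assumed to map to RPC_average_partial
            "kwargs": {"column_name": column_name},
        },
        organization_ids=ids,
    )

    # Poll the server until all nodes have reported back.
    task_id = task["id"]
    while not client.get_task(task_id)["complete"]:
        time.sleep(1)

    # Combine the sums and counts into the global average.
    results = client.get_results(task_id=task_id)
    global_sum = sum(result["sum"] for result in results)
    global_count = sum(result["count"] for result in results)
    return {"average": global_sum / global_count}
```

The master function never touches raw data from other nodes. It only sees the sums and counts returned by the partial.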
🏡 Local testing
Now that we have a vantage6 implementation of the algorithm it is time to test it. Before we run it in a vantage6 setup we can test it locally by using the ClientMockProtocol, which simulates the communication with the central server.
Before we can test it locally, we need to install the algorithm package in editable mode so that the mock client can use it. Move to the root directory of your algorithm package (the one containing the setup.py file) and run the following:
pip install -e .
Then create a script to test the algorithm:
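The actual script uses ClientMockProtocol from the vantage6 tooling, whose import path and constructor depend on the installed version and are not reproduced here. As a rough illustration of what such a mock does, the sketch below uses a hypothetical stand-in class, FakeClient, which is not part of vantage6. It runs the RPC_ method directly on in-memory datasets instead of sending a task to a server.

```python
class FakeClient:
    """Hypothetical stand-in for a mock client; not part of vantage6."""

    def __init__(self, datasets, methods):
        self.datasets = datasets  # one in-memory dataset per simulated node
        self.methods = methods    # mapping of function name to function
        self.results = {}

    def get_organizations_in_my_collaboration(self):
        return [{"id": index} for index in range(len(self.datasets))]

    def create_new_task(self, input_, organization_ids):
        # Run the RPC_ method locally, once per simulated organization.
        method = self.methods["RPC_" + input_["method"]]
        task_id = len(self.results)
        self.results[task_id] = [
            method(self.datasets[org_id], **input_.get("kwargs", {}))
            for org_id in organization_ids
        ]
        return {"id": task_id, "complete": True}

    def get_results(self, task_id):
        return self.results[task_id]


def RPC_average_partial(data, column_name="numbers"):
    """Federated part under test: local sum and count."""
    values = data[column_name]
    return {"sum": sum(values), "count": len(values)}


client = FakeClient(
    datasets=[{"numbers": [1, 2, 3]}, {"numbers": [10, 20]}],
    methods={"RPC_average_partial": RPC_average_partial},
)
ids = [org["id"] for org in client.get_organizations_in_my_collaboration()]
task = client.create_new_task(
    input_={"method": "average_partial", "kwargs": {"column_name": "numbers"}},
    organization_ids=ids,
)
partials = client.get_results(task["id"])
print(partials)
```

Testing the partial this way catches errors in the algorithm logic before any Docker image is built.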
🏗️ Building and 🚛 Distributing
Now that we have a fully tested algorithm for the vantage6 infrastructure, we need to package it so that it can be distributed to the data stations/nodes. Algorithms are delivered as Docker images, which is what the Dockerfile is for. To build an image from our algorithm (make sure Docker is installed and running), run the following command from the root directory of your algorithm project:
docker build -t harbor2.vantage6.ai/demo/average .
The option -t specifies the (unique) identifier that the researcher uses to refer to this algorithm. Usually this includes the registry address (harbor2.vantage6.ai) and the project name (demo). Once built, the image can be pushed to the registry with docker push harbor2.vantage6.ai/demo/average.
🤞 Cross-language serialization
It is possible that a vantage6 algorithm is developed in one programming language, but you would like to run the task from another language. For these kinds of use cases the Python algorithm wrapper and client support cross-language serialization. By default, input to the algorithms and output back to the client are serialized using pickle. However, it is possible to define a different serialization format.
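The reason a different format is needed can be seen by serializing the same result both ways. The sketch below uses only the Python standard library and a made-up result dictionary. It does not show how the format is selected in vantage6 itself.

```python
import json
import pickle

# Hypothetical algorithm output.
result = {"average": 7.2}

# pickle produces a Python-specific binary format that other
# languages generally cannot read.
pickled = pickle.dumps(result)

# JSON is plain text that virtually every language can parse.
as_json = json.dumps(result).encode("utf-8")

# Both formats round-trip the result within Python.
assert pickle.loads(pickled) == result
assert json.loads(as_json) == result

print(as_json)  # b'{"average": 7.2}'
```

JSON only covers basic types such as numbers, strings, lists and dictionaries, so algorithm output has to be restricted to those when a cross-language format is used.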
Input and output serialization can be specified as follows: