Creating a New Algorithm

In this section the basic steps for creating an algorithm for horizontally partitioned data are explained.

It is assumed that it is mathematically possible to create a federated version of the algorithm you want to use. In the following sections we create a federated algorithm to compute the average of a distributed dataset. An overview of the steps that we are going through:

  1. Mathematically decompose the model

  2. Implement and test locally

  3. Vantage6 algorithm wrapper

  4. Dockerize and push to a registry

📝 The mathematical problem

We want to know the average of column X from a dataset Q which contains n samples. Dataset Q is horizontally partitioned into datasets $A = [a_1, a_2 ... a_j] = [q_1, q_2 ... q_j]$ and $B = [b_{1}, b_{2} ... b_k] = [q_{j+1}, q_{j+2}...q_{n}]$, so that $n = j + k$. The average of dataset Q is computed as:

$$Q_{mean} = \frac{1}{n} \sum \limits_{i=1}^{n} {q_i} = \frac{q_1 + q_2 + ... + q_n}{n}$$

Now we would like to compute $Q_{mean}$ from datasets A and B. This can be computed as:

$$Q_{mean} = \frac{(a_1+a_2+...+a_j) + (b_1+b_2+...+b_k)}{j+k} = \frac{\sum A + \sum B }{j+k}$$

We need both the number of samples in each dataset and the total sum of each dataset. From these we can compute the global average over datasets A and B.
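As a small worked instance of this decomposition, assume (purely for illustration, these values are not from any real dataset) that $A = [1, 2, 3]$ and $B = [4, 5]$, so $j = 3$ and $k = 2$:

$$\sum A = 6, \quad \sum B = 9, \quad Q_{mean} = \frac{\sum A + \sum B}{j+k} = \frac{6 + 9}{3 + 2} = \frac{15}{5} = 3$$

This matches the average computed directly on the pooled data, $\frac{1+2+3+4+5}{5} = 3$, while each site only had to share its sum and its count.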

👨💻 Implementation

Now that we have figured out the maths, we can translate it to an implementation. A federated algorithm consists of two parts:

  1. Central part of the algorithm, which is responsible for combining the partial results from the data stations. In our case that means dividing the sum of the totals by the sum of the numbers of observations.

  2. Federated part of the algorithm, which is responsible for creating the partial results. In our case this means computing the total (=sum) and the number of observations.

The central part of the algorithm can be run either on the researcher's own machine or in a master container that runs on a node. The latter is the preferred method.

If researchers run this part themselves, they need a proper setup to do so (i.e. Python 3.5+ and the necessary dependencies). This is useful when developing new algorithms.

💕 Federated Part

The node that runs this part contains a CSV file with a single column, numbers, which we want to use to compute the global mean. We assume that this column has no NaN values.
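A minimal sketch of what the federated part could look like, using only the Python standard library. The function name federated_part and its csv_path parameter are hypothetical names chosen for this illustration; only the column name numbers comes from the description above.

```python
import csv


def federated_part(csv_path, column_name="numbers"):
    """Compute the partial result (sum and count) for one data station.

    Assumes the column holds only numeric values (no NaN or empty cells).
    """
    with open(csv_path, newline="") as handle:
        values = [float(row[column_name]) for row in csv.DictReader(handle)]

    # Only these two aggregates leave the data station, never single rows.
    return {"sum": sum(values), "count": len(values)}
```

Returning a small dictionary keeps the partial result easy to serialize and makes explicit which two numbers the central part needs.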

❤ The central algorithm

The central algorithm receives the sums and counts from all sites and combines them into a global mean. The partial results can come from one or more sites.
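A possible sketch of this combination step. The function name central_part and the dictionary keys sum, count and average are assumptions made for this illustration.

```python
def central_part(partial_results):
    """Combine per-site partial results into the global mean.

    Each item is expected to look like {"sum": <float>, "count": <int>}.
    """
    global_sum = sum(partial["sum"] for partial in partial_results)
    global_count = sum(partial["count"] for partial in partial_results)

    # Guard against the degenerate case in which no site holds any rows.
    if global_count == 0:
        raise ValueError("cannot compute a mean over zero observations")

    return {"average": global_sum / global_count}
```

For example, central_part([{"sum": 6.0, "count": 3}, {"sum": 9.0, "count": 2}]) returns {"average": 3.0}.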

🧪 Local testing

To test, simply create two datasets A and B, both having a numerical column numbers. Then run the following:
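One way such a local test could be written is sketched here. It creates two small CSV files in a temporary folder and runs both parts on them. All function names, file names and sample values are hypothetical, and the two parts are restated compactly so that the script runs on its own.

```python
import csv
import os
import tempfile


def federated_part(csv_path, column_name="numbers"):
    """Partial result for one site: sum and count of the column."""
    with open(csv_path, newline="") as handle:
        values = [float(row[column_name]) for row in csv.DictReader(handle)]
    return {"sum": sum(values), "count": len(values)}


def central_part(partial_results):
    """Global mean computed from the per-site sums and counts."""
    total = sum(partial["sum"] for partial in partial_results)
    count = sum(partial["count"] for partial in partial_results)
    return {"average": total / count}


def write_dataset(path, numbers):
    """Write a one-column CSV file with the header 'numbers'."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["numbers"])
        writer.writerows([number] for number in numbers)


with tempfile.TemporaryDirectory() as folder:
    path_a = os.path.join(folder, "dataset_a.csv")
    path_b = os.path.join(folder, "dataset_b.csv")
    write_dataset(path_a, [1, 2, 3])  # dataset A
    write_dataset(path_b, [4, 5])     # dataset B

    # Each "site" computes its partial, then the central part combines them.
    partials = [federated_part(path_a), federated_part(path_b)]
    result = central_part(partials)

print(result)  # {'average': 3.0}
```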

🌯 Algorithm wrapper

Now that we have a federated implementation of our algorithm we need to incorporate it in the vantage6 infrastructure. The infrastructure handles the communication with the server and provides data access to the algorithm.

The algorithm consumes a file containing the input. This file contains both the name of the method to be triggered and the arguments provided to that method. The algorithm also has access to a CSV file (in the future this could also be a database) on which it can run. Finally, when the algorithm is finished, it writes its output to a different file.

The central part of the algorithm needs to be able to create (sub)tasks. These subtasks are responsible for executing the federated part of the algorithm. The central part can be executed either on the researcher's machine or on one of the nodes in the vantage6 network. In this example we only show the case in which one of the nodes executes the central part. The node provides the algorithm with a JWT token so that the central part has access to the server to post these subtasks.

In this example the node uses a CSV-file as database 📔. There are implementations that use traditional databases and triple stores. We expect to support these use cases better in the future.

📂Package Structure

The algorithm needs to be structured as a package. This way the algorithm can be installed within the Docker image. The minimal file structure would be:
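A layout consistent with the three files described in this section might look like the sketch below. The folder name algorithm_pkg is a hypothetical placeholder for your own package name.

```text
project_folder
├── Dockerfile
├── setup.py
└── algorithm_pkg
    └── __init__.py
```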

We also recommend adding a README.md, LICENSE and requirements.txt to the project_folder.

setup.py

Contains the setup method to create a package from your algorithm code. Here you specify some details about your package and the dependencies it requires.
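A minimal sketch of such a setup.py, using setuptools. The version number and description are hypothetical, and the dependency list is left for you to fill in. The package name matches the one used for this example algorithm.

```python
from setuptools import find_packages, setup

setup(
    name="v6-average-py",        # should match PKG_NAME in the Dockerfile
    version="1.0.0",             # hypothetical version number
    description="Federated average for vantage6",  # hypothetical text
    packages=find_packages(),    # picks up the folder with __init__.py
    python_requires=">=3.5",
    install_requires=[
        # List the packages your algorithm imports here.
    ],
)
```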

A basic setup.py is sufficient in most cases. However, if you want to do something more advanced (like adding static data or a CLI), you can use the extra arguments of setup.

🐳 Dockerfile

Contains the recipe for building the Docker image. Typically you only have to change the argument PKG_NAME to the name of your package. This name should be the same as the name you specified in setup.py. In our case that would be v6-average-py.

__init__.py

This contains the code for your algorithm. It is possible to split this into multiple files; however, the methods that should be available to the researcher should be in this file. You can do that by simply importing them into this file (e.g. from .average import my_nested_method).

We can distinguish two types of methods that a user can trigger:

| name | description | prefix | arguments |
|---|---|---|---|
| master | Central part of the algorithm. Receives a client as argument, which provides an interface to the central server. This way the master can create tasks and collect their results. | (none) | (client, data, *args, **kwargs) |
| Remote procedure call | Consumes the data at the node to compute the partial. | RPC_ | (data, *args, **kwargs) |

The client the master method receives is a ContainerClient which is different than the client you use as a user.

Everything that a method returns is sent back to the central server. The return value should never contain any privacy-sensitive information.

For our average algorithm the implementation will look as follows:
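A sketch of what such an implementation might look like, following the two method types in the table above. The client method names used here (get_organizations_in_my_collaboration, create_new_task, get_task, get_results), the complete flag, and the convention that the method name is passed without its RPC_ prefix are assumptions about the ContainerClient interface, so check them against the vantage6 version you use. The data argument is assumed to be a pandas DataFrame or a similar object that supports column lookup.

```python
import time


def master(client, data, column_name):
    """Central part: create subtasks, wait for them, combine the partials."""
    # Assumed client method: lists the organizations in the collaboration.
    organizations = client.get_organizations_in_my_collaboration()
    ids = [organization.get("id") for organization in organizations]

    # Ask every organization to run the RPC_ method below on its own data.
    task = client.create_new_task(
        input_={
            "method": "average_partial",
            "kwargs": {"column_name": column_name},
        },
        organization_ids=ids,
    )

    # Poll until all nodes have reported back (assumed "complete" flag).
    task_id = task.get("id")
    while not client.get_task(task_id).get("complete"):
        time.sleep(1)

    # Combine the per-node sums and counts into the global average.
    results = client.get_results(task_id=task_id)
    global_sum = sum(partial["sum"] for partial in results)
    global_count = sum(partial["count"] for partial in results)
    return {"average": global_sum / global_count}


def RPC_average_partial(data, column_name):
    """Federated part: runs at a node and returns only aggregates."""
    column = data[column_name]
    return {"sum": float(sum(column)), "count": len(column)}
```

Note that the master never touches the data argument itself. It only orchestrates the subtasks, so only the sums and counts travel back through the server.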

🏡 Local testing

Now that we have a vantage6 implementation of the algorithm it is time to test it. Before we run it in a vantage6 setup we can test it locally by using the ClientMockProtocol, which simulates the communication with the central server.

Before we can test it locally we need to install the algorithm package in editable mode so that the mock client can use it. Move to the root directory of your algorithm package (the one containing the setup.py file) and run pip install -e . (the trailing dot refers to the current directory).

Then create a script to test the algorithm:

🏗️ Building and 🚛 Distributing

Now that we have a fully tested algorithm for the vantage6 infrastructure, we need to package it so that it can be distributed to the data stations/nodes. Algorithms are delivered as Docker images, which is what the Dockerfile is for. To build an image from our algorithm (make sure Docker is installed and running), run a command of the form docker build -t <registry>/<project>/<image-name> . from the root directory of your algorithm project.

The option -t specifies the (unique) identifier used by the researcher to use this algorithm. Usually this includes the registry address (harbor2.vantage6.ai) and the project name (demo).

In case you are using docker hub as registry, you do not have to specify the registry or project as these are set by default to the Docker hub and your docker hub username.

Reach out to us on Discord if you want to use our registries (harbor.vantage6.ai and harbor2.vantage6.ai).

🤞 Cross-language serialization

It is possible that a vantage6 algorithm is developed in one programming language, but you would like to run the task from another language. For these kinds of use cases the Python algorithm wrapper and client support cross-language serialization. By default, input to the algorithms and output back to the client are serialized using pickle. However, it is possible to define a different serialization format.
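To illustrate why the format matters, the sketch below serializes the same task input once with pickle and once with JSON, using only the Python standard library. It does not call any vantage6 API. The input dictionary and its keys are hypothetical.

```python
import json
import pickle

# Hypothetical task input: a method name plus its keyword arguments.
task_input = {
    "method": "average_partial",
    "kwargs": {"column_name": "numbers"},
}

# pickle yields a Python-specific byte stream; other languages generally
# cannot read it without a dedicated pickle parser.
as_pickle = pickle.dumps(task_input)

# JSON yields plain text that virtually every language can parse.
as_json = json.dumps(task_input).encode("utf-8")

# Both formats round-trip to the same Python object.
restored_from_pickle = pickle.loads(as_pickle)
restored_from_json = json.loads(as_json.decode("utf-8"))
```

A client written in another language could produce or consume the JSON variant directly, which is the situation a non-default serialization format is meant for.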

Input and output serialization can be specified as follows:
