Classic Tutorial
In this section the basic steps for creating an algorithm for horizontally partitioned data are explained.
The final code of this tutorial is published on . The algorithm is also published in our Docker registry: harbor2.vantage6.ai/demo/average
It is assumed that it is mathematically possible to create a federated version of the algorithm you want to use. In the following sections we create a federated algorithm to compute the average of a distributed dataset. An overview of the steps that we are going through:
Mathematically decompose the model
Federated implementation and local testing
Vantage6 algorithm wrapper
Dockerize and push to a registry
This tutorial shows you how to create a federated mean algorithm.
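The mean of a dataset $Q = [q_1, q_2, \ldots, q_n]$ is computed as:
$$Q_{mean} = \frac{1}{n} \sum_{i=1}^{n} q_i = \frac{q_1 + q_2 + \ldots + q_n}{n}$$
When dataset $Q$ is horizontally partitioned into datasets $A = [a_1, \ldots, a_j]$ and $B = [b_1, \ldots, b_k]$, with $j + k = n$, we would like to compute $Q_{mean}$ from datasets A and B. This could be computed as:
$$Q_{mean} = \frac{(a_1 + \ldots + a_j) + (b_1 + \ldots + b_k)}{j + k} = \frac{\sum A + \sum B}{j + k}$$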
We cannot simply compute the average on each node and then average those values, as this is mathematically incorrect in general. It only works if datasets A and B contain exactly the same number of samples.
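To make the difference concrete, here is a small sketch with two made-up partitions of unequal size. The values and variable names are purely illustrative.

```python
# Two hypothetical partitions of one dataset, with unequal sizes.
dataset_a = [1.0, 2.0, 3.0]  # 3 samples, local mean 2.0
dataset_b = [10.0]           # 1 sample, local mean 10.0


def mean(values):
    """Plain arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Naive approach: average the per-node averages (incorrect in general).
mean_of_means = mean([mean(dataset_a), mean(dataset_b)])

# Correct approach: pool the sums and the counts before dividing.
global_mean = (sum(dataset_a) + sum(dataset_b)) / (len(dataset_a) + len(dataset_b))

print(mean_of_means)  # 6.0
print(global_mean)    # 4.0
```

The naive result (6.0) gives the single sample in B the same weight as all three samples in A, which is why the sums and counts have to be shared instead of the local means.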
In this example we use Python; however, you are free to use any language. The only requirements are that the language 1) can make HTTP requests, and 2) can read from and write to files.
A federated algorithm consists of two parts:
A federated part of the algorithm which is responsible for creating the partial results. In our case this would be computing (1) the sum of the observations, and (2) the number of observations.
A central part of the algorithm which is responsible for combining the partial results from the nodes. In the case of the federated mean that would be dividing the total sum of the observations by the total number of observations.
The node that runs this part contains a CSV-file with one column (specified by the argument column_name) which we want to use to compute the global mean. We assume that this column has no NaN values.
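A minimal sketch of this federated part is shown below. The function name federated_part and the default column name are assumptions for illustration, and the standard-library csv module is used here only to keep the example free of dependencies.

```python
import csv


def federated_part(path, column_name="numbers"):
    """Compute the partial result for one node: the sum and the count."""
    with open(path, newline="") as handle:
        # DictReader uses the first row of the CSV file as column names.
        values = [float(row[column_name]) for row in csv.DictReader(handle)]
    # Only these two aggregates leave the node, never the individual rows.
    return {"sum": sum(values), "count": len(values)}
```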
The central algorithm receives the sums and counts from all sites and combines these to a global mean. This could be from one or more sites.
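The combination step could be sketched as follows. The function name central_part and the dictionary keys "sum", "count" and "average" are assumptions chosen for this illustration.

```python
def central_part(node_outputs):
    """Combine the partial results from all nodes into the global mean."""
    global_sum = 0
    global_count = 0
    for output in node_outputs:
        global_sum += output["sum"]
        global_count += output["count"]
    return {"average": global_sum / global_count}
```

For example, central_part([{"sum": 6, "count": 3}, {"sum": 10, "count": 1}]) returns {"average": 4.0}. A single site works as well, since the loop simply runs once.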
To test, simply create two datasets A and B, both having a numerical column numbers. Then run the following:
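A local test might look like the sketch below. It writes two small, made-up datasets to a temporary folder and includes compact versions of the two parts described above so that the script runs on its own. All function names and values are illustrative assumptions.

```python
import csv
import os
import tempfile


def federated_part(path, column_name="numbers"):
    """Partial result of one node: sum and count of the column."""
    with open(path, newline="") as handle:
        values = [float(row[column_name]) for row in csv.DictReader(handle)]
    return {"sum": sum(values), "count": len(values)}


def central_part(node_outputs):
    """Combine the partial results into the global mean."""
    total = sum(output["sum"] for output in node_outputs)
    count = sum(output["count"] for output in node_outputs)
    return {"average": total / count}


def write_dataset(path, numbers):
    """Write a one-column CSV file with the header 'numbers'."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["numbers"])
        writer.writerows([[number] for number in numbers])


with tempfile.TemporaryDirectory() as folder:
    path_a = os.path.join(folder, "A.csv")
    path_b = os.path.join(folder, "B.csv")
    write_dataset(path_a, [1, 2, 3])
    write_dataset(path_b, [10])
    outputs = [federated_part(path_a), federated_part(path_b)]

average = central_part(outputs)["average"]
print(f"global average = {average}")
```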
Now that we have a federated implementation of our algorithm we need to make it compatible with the vantage6 infrastructure. The infrastructure handles the communication with the server and provides data access to the algorithm.
The algorithm consumes a file containing the input. This contains both the method name to be triggered as well as the arguments provided to the method. The algorithm also has access to a CSV file (in the future this could also be a database) on which the algorithm can run. When the algorithm is finished, it writes back the output to a different file.
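The file-based flow can be pictured with the rough sketch below. It is not the actual vantage6 wrapper: the function names, the METHODS lookup table, the count_rows method and the use of JSON for the input and output files are all assumptions made for illustration.

```python
import csv
import json


def load_column(csv_path, column_name):
    """Read one numerical column from the node's CSV file."""
    with open(csv_path, newline="") as handle:
        return [float(row[column_name]) for row in csv.DictReader(handle)]


def count_rows(csv_path, column_name):
    """Hypothetical example method: number of values in a column."""
    return {"count": len(load_column(csv_path, column_name))}


# Methods that may be triggered by name from the input file (illustrative).
METHODS = {"count_rows": count_rows}


def run(input_path, csv_path, output_path):
    """Read the input file, run the requested method, write the output file."""
    with open(input_path) as handle:
        request = json.load(handle)  # contains the method name and its arguments
    method = METHODS[request["method"]]
    result = method(csv_path, *request.get("args", []), **request.get("kwargs", {}))
    with open(output_path, "w") as handle:
        json.dump(result, handle)
```

Keeping the method lookup explicit means that only functions listed in METHODS can be triggered from an input file.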
The central part of the algorithm has to be able to create (sub)tasks. These subtasks are responsible for executing the federated part of the algorithm. The central part of the algorithm can either be executed on one of the nodes in the vantage6 network or on the machine of a researcher. In this example we only show the case in which one of the nodes executes the central part of the algorithm. The node provides the algorithm with a JWT token so that the central part of the algorithm has access to the server to post these subtasks.
We also recommend adding a README.md, LICENSE and requirements.txt to the project_folder.
setup.py
Contains the setup method to create a package from your algorithm code. Here you specify some details about your package and the dependencies it requires.
Dockerfile
Contains the recipe for building the Docker image. Typically you only have to change the argument PKG_NAME to the name of your package. This name should be the same as the name you specified in setup.py. In our case that would be v6-average-py.
__init__.py
This contains the code for your algorithm. It is possible to split the code over multiple files, but the methods that should be available to the researcher must be accessible from this file. You can do that by simply importing them into this file (e.g. from .average import my_nested_method).
We can distinguish two types of methods that a user can trigger:
| name | description | prefix | arguments |
| --- | --- | --- | --- |
| master | Central part of the algorithm. Receives a client as argument, which provides an interface to the central server. This way the master can create tasks and collect their results. | (none) | (client, data, *args, **kwargs) |
| Remote procedure call | Consumes the data at the node to compute the partial. | RPC_ | (data, *args, **kwargs) |
The client the master method receives is a ContainerClient, which is different from the client you use as a user.
Everything that is returned by the return statement is sent back to the central vantage6 server. This should never contain any privacy-sensitive information.
For our average algorithm the implementation will look as follows:
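The sketch below follows the master / RPC_ convention from the table above. The client method names (get_organizations_in_my_collaboration, create_new_task, get_task, get_results), the "complete" field and the shape of the task input are assumptions about the ContainerClient interface. Check them against the vantage6 version you are using.

```python
import time


def master(client, data, column_name):
    """Central part: create subtasks, wait for them, combine the partials."""
    # Find the organizations that take part in this collaboration.
    organizations = client.get_organizations_in_my_collaboration()
    ids = [organization.get("id") for organization in organizations]

    # Ask every organization to run the federated part. The method is
    # referred to without its RPC_ prefix (assumed wrapper behaviour).
    task = client.create_new_task(
        input_={"method": "average_partial",
                "kwargs": {"column_name": column_name}},
        organization_ids=ids,
    )

    # Poll the server until all subtasks are complete.
    task_id = task.get("id")
    while not client.get_task(task_id).get("complete"):
        time.sleep(1)

    # Combine the partial results into the global mean.
    results = client.get_results(task_id=task_id)
    global_sum = sum(result["sum"] for result in results)
    global_count = sum(result["count"] for result in results)
    return {"average": global_sum / global_count}


def RPC_average_partial(data, column_name):
    """Federated part: only the sum and the count leave the node."""
    values = data[column_name]
    return {"sum": float(sum(values)), "count": len(values)}
```

Note that master ignores its own data argument in this sketch: the node running the central part contributes its data through the subtask, like every other node.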
Now that we have a vantage6 implementation of the algorithm it is time to test it. Before we run it in a vantage6 setup we can test it locally by using the ClientMockProtocol, which simulates the communication with the central server.
Before we can test it locally, we need to install the algorithm package in editable mode so that the mock client can use it. Go to the root directory of your algorithm package (the one containing the setup.py file) and run:
pip install -e .
Then create a script to test the algorithm:
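A possible shape for such a script is sketched below. It is written as a function that accepts any client object, so the logic can be read independently of how the mock client is constructed. The client method names, the "master" flag in the task input, the import path and the constructor arguments shown in the comments are assumptions that should be verified against your vantage6 version.

```python
def run_local_test(client, column_name="numbers"):
    """Trigger the federated part on all nodes, then the master on one node."""
    organizations = client.get_organizations_in_my_collaboration()
    org_ids = [organization["id"] for organization in organizations]

    # Step 1: run the federated part on every (mock) node.
    task = client.create_new_task(
        input_={"method": "average_partial",
                "kwargs": {"column_name": column_name}},
        organization_ids=org_ids,
    )
    partials = client.get_results(task.get("id"))

    # Step 2: run the central part on a single node.
    master_task = client.create_new_task(
        input_={"master": 1, "method": "master",
                "kwargs": {"column_name": column_name}},
        organization_ids=[org_ids[0]],
    )
    final = client.get_results(master_task.get("id"))
    return partials, final


# Assumed usage with vantage6 installed (paths and names are placeholders):
# from vantage6.tools.mock_client import ClientMockProtocol
# client = ClientMockProtocol(datasets=["./local/data.csv", "./local/data.csv"],
#                             module="<your algorithm package>")
# print(run_local_test(client))
```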
Now that we have a fully tested algorithm for the vantage6 infrastructure, we need to package it so that it can be distributed to the data stations/nodes. Algorithms are delivered as Docker images, which is what the Dockerfile is for. To build an image from our algorithm (make sure Docker is installed and running), run the following command from the root directory of your algorithm project:
docker build -t harbor2.vantage6.ai/demo/average .
The option -t specifies the (unique) identifier used by the researcher to use this algorithm. Usually this includes the registry address (harbor2.vantage6.ai) and the project name (demo).
It is possible that a vantage6 algorithm is developed in one programming language, but you would like to run the task from another language. For these use-cases, the Python algorithm wrapper and client support cross-language serialization. By default, input to the algorithms and output back to the client are serialized using pickle. However, it is possible to define a different serialization format.
Input and output serialization can be specified as follows:
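The exact client parameters for selecting a format depend on the vantage6 version and are not reproduced here. As a rough illustration of why the choice matters, the sketch below serializes the same hypothetical task input with pickle (Python-specific) and with JSON (readable from most languages). The payload keys are assumptions for this example.

```python
import json
import pickle

# Hypothetical task input: a method name plus its arguments.
payload = {
    "method": "average_partial",
    "args": [],
    "kwargs": {"column_name": "numbers"},
}

# pickle produces a Python-specific byte stream.
as_pickle = pickle.dumps(payload)

# JSON produces plain text that non-Python algorithms can parse as well.
as_json = json.dumps(payload).encode("utf-8")

# Both formats round-trip this simple payload without loss.
restored_from_pickle = pickle.loads(as_pickle)
restored_from_json = json.loads(as_json.decode("utf-8"))
```

JSON only covers basic types such as numbers, strings, lists and dictionaries, so results like data frames would need to be converted before they can be sent in that format.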
Both the number of samples and the total sum of each dataset are needed. With these we can compute the global average of datasets A and B.
However, if you use a different language you are not able to use our wrapper. Reach out to us on to discuss how this works.
A good starting point would be to use the boilerplate code from our . This section outlines the steps needed to get to this boilerplate but also provides some background information.
The algorithm needs to be structured as a Python package. This way the algorithm can be installed within the Docker image. The minimal file structure would be:
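One way to picture this layout is the small scaffold sketch below, which creates empty placeholder files for a Dockerfile, a setup.py and a package folder containing __init__.py. The helper name create_skeleton and the folder name algorithm_pkg are hypothetical. Only the three file names come from this tutorial.

```python
from pathlib import Path


def create_skeleton(root, package_dir="algorithm_pkg"):
    """Create empty placeholder files for a minimal algorithm project."""
    root = Path(root)
    files = [
        root / "Dockerfile",                 # recipe for the Docker image
        root / "setup.py",                   # package details and dependencies
        root / package_dir / "__init__.py",  # the algorithm code
    ]
    for path in files:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    return files
```

Calling create_skeleton("project_folder") would create the three files under project_folder, ready to be filled in.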
The setup.py above is sufficient in most cases. However, if you want to do more advanced things (like adding static data or a CLI), you can use the other options that setup offers.
Reach out to us on if you want to use our registries (harbor.vantage6.ai and harbor2.vantage6.ai).