
3 | Petronas

Welcome

Good that you are here!

Check out our new documentation

This documentation space is no longer maintained. Please find the latest documentation at https://docs.vantage6.ai!

What is vantage6? 🚆

Vantage6 stands for privacy preserving infrastructure for secure insight exchange.

The project is inspired by the Personal Health Train (PHT) concept. In this analogy, vantage6 provides the tracks and stations, compatible algorithms are the trains, and computation tasks are the journeys.

vantage6 is here for:

delivering algorithms to data stations and collecting their results
managing users, organizations, collaborations, computation tasks and their results
providing control (security) at the data-stations to their owners

vantage6 is not (yet):

formatting the data at the data station
aligning data across the data stations
a finished/polished product

vantage6 is designed with three fundamental functional aspects of federated learning in mind.

Autonomy. All involved parties should remain independent and autonomous.
Heterogeneity. Parties should be allowed to have differences in hardware and operating systems.
Flexibility. Related to the latter, a federated learning infrastructure should not limit the use of relevant data.

Resources 🏭

Documentation

-> this documentation
-> unfinished technical documentation
-> general vantage6 website

Source code

-> contains all components (and the python-client).
-> contains all features, bugfixes and feature requests we are working on. To submit one yourself, you can create a .

The old/previous (separate) repositories can still be found at the IKNL GitHub in archived form:

-> contains all other repositories, used for synchronization and releasing

Community

-> discussion platform, ask anything here
-> for if you prefer a quick chat with the developers

🔍 Contents

This documentation space is intended for users of the vantage6 solution. You will find information on how to set up your own federated learning network, and how to maintain and interact with it.

Here you will not find:

in depth technical documentation
background on federated learning

🤝 Community

Vantage6 is completely open source under the Apache License.

If you want to join, find us on our Discord channel.

Background

Architecture

An overview of the vantage6 infrastructure and its components

Overview

Vantage6 uses both a client-server and peer-to-peer model. In the figure below the client can pose a question to the server, the question is then delivered as an algorithm to the node. When the algorithm completes, the results are sent back to the client via the server. An algorithm can communicate directly with other algorithms that run on other nodes if required.

The server is in charge of processing the tasks as well as of handling administrative functions such as authentication and authorization. Conceptually, vantage6 consists of the following parts:

Partners

Contact us via the Discourse forum!

Anja van Gestel
Bart van Beusekom
Frank Martin
Hasan Alradhi
Gijs Geleijnse
Melle Sieswerda

Djura Smits
Lourens Veen

Johan van Soest

Would you like to contribute? Check out our !

How to contribute

Developing new features, fixing known issues, adding documentation, adding new tests, or reporting issues

👩‍💻 If you're a developer

Join our , and get in touch with the infrastructure developers.

Install

Requirements

vantage6 consists of several components that can be installed. Which component(s) you need depends on your use case. The requirements also differ per component.

Client

You can interact with the server via the API. You can explore the server API at https://<serverdomain>/apidocs (e.g. https://petronas.vantage6.ai/apidocs for Petronas).

You can use any language to interact with the server as long as it supports HTTP requests. For Python and R we have written wrappers to simplify the interaction with the server: see the Client installation section for more details on how to install these.

Python

There is a snake in my boot

Python 3.7.x

Installation of any of the vantage6 packages requires Python 3.7. For installation instructions, see python.org, anaconda.com or use the package manager native to your OS and/or distribution (e.g. apt for Debian or Ubuntu, yum for Fedora, or yast for SuSE).

We recommend you install vantage6 in a new, clean environment.

Other versions of Python >= 3.6 will most likely also work, but may give issues with installing dependencies. For now, we test vantage6 on version 3.7, so that is a safe choice.

Docker

Required for both the node and server

Docker facilitates encapsulation of applications and their dependencies in packages that can be easily distributed to diverse systems. Algorithms are stored in Docker images which nodes can download and execute. Besides the algorithms, both the node and server are also running from a docker container themselves.

Please refer to this page on how to install Docker. To verify that Docker is installed and running you can run the hello-world example from Docker.

docker run hello-world

🐳 Always make sure that Docker is running while using vantage6!

🐳 We recommend always using the latest version of Docker.

Note that for Linux, some post-installation steps may be required. Vantage6 needs to be able to run Docker without sudo, and these steps ensure just that.

Client

We provide four ways in which you can interact with the server to manage your vantage6 resources: the user interface (UI), the Python client, the R client, and the server API.

What you need to install depends on which interface you choose. In order to use the UI or the server API, you usually don't need to install anything: the UI is a website, and the API can be called via HTTP requests from a programming language of your choice. For the UI, you only need to set it up in case you are setting up your own server (see User Interface for instructions).

Installation instructions for the Python client and R client are below. For most use cases, we recommend using the UI (for anything except creating tasks) and/or the Python client (which covers the server API functionality completely).

Python client library

Before you install the Python client, we recommend checking the version of the server you are going to interact with. The easiest way to do that is to check the /version endpoint of that server:

Retrieve version information of the server

GET https://SERVER[/api_path]/version
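As a rough illustration, the version check can also be scripted. The sketch below uses only the Python standard library; the helper names are hypothetical, and it assumes that the endpoint answers with a JSON document, so treat it as a starting point rather than as the official way to query the server.

```python
import json
from urllib.request import urlopen


def build_version_url(server: str, api_path: str = "") -> str:
    """Return the URL of the /version endpoint (hypothetical helper)."""
    # Strip a trailing slash so that we never produce "//version".
    return f"{server.rstrip('/')}{api_path}/version"


def fetch_server_version(server: str, api_path: str = "") -> dict:
    """Request the version endpoint; assumes the response body is JSON."""
    with urlopen(build_version_url(server, api_path)) as response:
        return json.load(response)
```

For example, `build_version_url("https://petronas.vantage6.ai")` yields `https://petronas.vantage6.ai/version`, which `fetch_server_version` would then request.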

Then you can install the vantage6-client with:

where you add the version you want to install. You may also leave out the version to install the most recent version.

R client library

The R client currently only supports creating tasks and retrieving their results. It cannot (yet) be used to manage resources, such as creating and deleting users and organizations.

You can install the R client by running:

Node

To install the vantage6-node, make sure you have met the requirements. Then install the latest version:

Server

A (central) server allows parties to connect and exchange data.

To install the vantage6-server, make sure you have met the requirements. Then install the latest version:

This command will install the vantage6 command line interface (CLI), from which you can create new servers (see Use).

Optional components

There are several optional components that you can set up apart from the vantage6-server itself.

You can set up a User Interface, which is a web application that will allow your users to communicate more easily with your vantage6 server.

User Interface

The User Interface (UI) is a web application that aims to make it easy to interact with the server. It allows you to manage all your resources (such as creating collaborations, editing users, or viewing tasks), except for creating new tasks. We aim to incorporate this functionality in the near future.

If you plan on deploying your own server and want to use the UI, follow the installation instructions on the UI Github page. The UI is an Angular application and as such, you may be required to install Node.js. Once you have deployed the UI to the internet, any user that is registered on your vantage6 server will be able to use it.

The UI is not compatible with older versions (<3.3) of vantage6.

If you plan on using the existing Petronas server, you can simply go to https://portal.petronas.vantage6.ai and log in with your user account.

RabbitMQ

Horizontal scaling for servers with high workloads

Please note that RabbitMQ is an optional component. It enables the server to handle multiple requests at the same time. This is important if a server has a high workload.

There are several options to host your own RabbitMQ server. You can run or host . When you have set up your RabbitMQ service, you can connect the server to it by adding the following to the server configuration:

Be sure that the user and vhost that you specify exist! Otherwise, you can add them via the .

Note that the RabbitMQ currently (vantage6 version 3.2) does not yet work if you start your server via

Docker registry

A registry (repository) provides storage and versioning for Docker images. Installing a (private) Docker registry can be useful if you want to securely host your own algorithms.

Docker registry

Docker provides a registry as a turn-key solution on Docker Hub. Instructions for setting it up can be found here: https://hub.docker.com/_/registry.

Harbor

Harbor is another option for running a registry. Harbor provides access control, a user interface and automated scanning for vulnerabilities.

Use

Preliminaries

These apply to all components

Concepts

There are several entities in vantage6, such as users, organizations, tasks, etc. The following statements should help you understand the relationships.

A collaboration is a collection of one or more organizations.
For each collaboration, each participating organization needs a node to compute tasks.
Each organization can have users who can perform certain actions.
The permissions of the user are defined by the assigned rules.
It is possible to collect multiple rules into a role, which can also be assigned to a user.
Users can create tasks for one or more organizations within a collaboration.
A task should produce a result for each organization involved in the task.
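The relationships above can be made concrete with a small data model. This is only an illustrative sketch: the class and attribute names are hypothetical and much simpler than the real database schema, and users, roles and rules are left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Organization:
    id: int
    name: str


@dataclass
class Collaboration:
    id: int
    name: str
    # A collaboration is a collection of one or more organizations.
    organizations: List[Organization] = field(default_factory=list)


@dataclass
class Task:
    name: str
    collaboration: Collaboration
    # The organizations (within the collaboration) that the task is sent to.
    organizations: List[Organization]

    def validate(self) -> None:
        """Raise if the task targets an organization outside its collaboration."""
        member_ids = {org.id for org in self.collaboration.organizations}
        for org in self.organizations:
            if org.id not in member_ids:
                raise ValueError(
                    f"{org.name} is not part of {self.collaboration.name}"
                )

    def expected_result_count(self) -> int:
        """One result is expected for each organization involved in the task."""
        return len(self.organizations)
```

A task that targets two organizations in a collaboration therefore expects two results, one per participating node.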

The following schema is a simplified version of the database:

End to end encryption

Encryption in vantage6 is handled at organization level. Whether encryption is used or not, is set at collaboration level. All the nodes in the collaboration need to agree on this setting. You can enable or disable encryption in the node configuration file, see .

The encryption module encrypts data so that the server is unable to read communication between users and nodes. The only messages that go from one organization to another through the server are computation requests and their results. Only the algorithm input and output are encrypted. Other metadata (e.g. time started, finished, etc), can be read by the server.

The encryption module uses RSA keys. The public key is uploaded to the vantage6-server. Tasks and other users can use this public key (this is automatically handled by the python-client and R-client) to send messages to the other parties.

The RSA key is used to create a shared secret which is used for encryption and decryption of the payload

When the node starts, it checks that the public key stored at the server is derived from the local private key. If this is not the case, the node will replace the public key at the server.

If an organization has multiple nodes and/or users, they must use the same private key.

In case you want to generate a new private key, you can use the command vnode create-private-key. If a key already exists at the local system, the existing key is reused (unless you use the --force flag). This way, it is easy to configure multiple nodes to use the same key.

It is also possible to generate the key yourself and upload it by using the following endpoint:

Update the public key

PATCH https://SERVER[/api_path]/organization/<ID>
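A minimal sketch of how such a request could be prepared with the Python standard library is shown below. Several details are assumptions that are not confirmed here: that the key is sent in a JSON field called `public_key` (by analogy with the `public_key` argument of the Python client), and that the server expects a bearer token in the `Authorization` header. Check the API documentation of your server before relying on it.

```python
import json
from urllib.request import Request


def build_public_key_update(server, organization_id, public_key, token, api_path=""):
    """Prepare (but do not send) a PATCH request for an organization.

    The "public_key" field name and the bearer-token header are assumptions.
    """
    url = f"{server.rstrip('/')}{api_path}/organization/{organization_id}"
    body = json.dumps({"public_key": public_key}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

The returned object can be passed to `urllib.request.urlopen(...)` to actually send the request.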

Client

We provide four ways in which you can interact with the server to manage your vantage6 resources:

User Interface (UI)
Python client
R client
Server API

The UI and the clients make it much easier to interact with the server than directly interacting with the server API through HTTP requests, especially as data is serialized and encrypted automatically. For most use cases, we recommend using the UI and/or the Python client.

The R client is only suitable for creating tasks and retrieving their results. With the Python client it is possible to use the entire API.

Permissions

Note that whenever you interact with the server, you are limited by your permissions. For instance, if you try to create another user but do not have permission to do so, you will receive an error message. All permissions are described by rules, which can be aggregated in roles. Contact your administrator if you find your permissions are inappropriate.

There are predefined roles such as 'Researcher' and 'Organization Admin' that are automatically created by the server. These can be assigned to any new user by the administrator that is creating the user.

User Interface

The User Interface (UI) is a web application that aims to make it easy to interact with the server. At present, it provides all functionality except for creating tasks. We aim to incorporate this functionality in the near future.

Using the UI should be relatively straightforward. There are buttons that should help you e.g. create a collaboration or delete a user. If anything is unclear, please contact us via .

Python client

A Python client to interact with the vantage6 server

It is assumed you installed the vantage6-client. The Python client aims to completely cover the vantage6-server communication possibilities. It can create computation tasks and collect their results, manage organizations, collaborations, users, etc. The server hosts an API which the client uses for this purpose.

For tutorials on how to use the clients, please visit our discourse pages: .

We only show a few examples here. The methods in the library are all documented in their docstring, you can view them using help(...) , e.g. help(client.user.create) will show you the parameters needed to create a new user. We also have more extensive tutorials on how to use the clients available on our discourse pages: and in the and subsequent pages, which follow after introducing our R client.

The following groups (related to the ) of methods are available, most of them have a list(), create()

Authentication

How to authenticate a client with the vantage6 server

This page and the following pages introduce some minimal examples for administrative tasks that you can perform with our Python client. We start by authenticating.

To authenticate, we create a config file to store our login information. We do this so we do not have to define the server_url, server_port and so on every time we want to use the client. Moreover, it enables us to separate the sensitive information (login details, organization key) that you do not want to make publicly available, from other parts of the code you might write later (e.g. on submitting particular tasks) that you might want to share publicly.

# config.py

server_url = "https://MY VANTAGE6 SERVER" # e.g. https://petronas.vantage6.ai or 
                                          # http://localhost for a local dev server
server_port = 443 # This is specified when you first created the server
server_api = "" # This is specified when you first created the server

username = "MY USERNAME"
password = "MY PASSWORD"

organization_key = "FILEPATH TO MY PRIVATE KEY" # This can be empty if you do not want to set up encryption

Note that the organization_key should be a filepath that points to the private key that was generated when the organization to which your login belongs was first created (see Creating an organization).

Then, we connect to the vantage6 server by initializing a Client object, and authenticating:

from vantage6.client import Client
# Note: we assume here the config.py you just created is in the current directory.
# If it is not, then you need to make sure it can be found on your PYTHONPATH
import config

# Initialize the client object, and run the authentication
client = Client(config.server_url, config.server_port, config.server_api, verbose=True)
client.authenticate(config.username, config.password)

# Optional: setup the encryption, if you have an organization_key
client.setup_encryption(config.organization_key)

Above, we have added verbose=True as an additional argument when creating the Client(...) object. This will print much more information, which can be used to debug any issues you run into.

Creating an organization

In this section, you will learn how to use the client to create a new organization on the server.

Here, we assume that you have a Python session with an authenticated Client object, as created in Authentication. We also assume that you have a login on the Vantage6 server that has the permissions to create a new organization (regular end-users typically do not have these permissions, this is typically only for administrators).

The first (optional, but recommended) step is to create an RSA keypair. A keypair, consisting of a private and a public key, can be used to encrypt data transfers. Users from the organization you are about to create will only be able to use encryption if such a keypair has been set up and if they have access to the private key.

from pathlib import Path

from vantage6.common import bytes_to_base64s
from vantage6.client.encryption import RSACryptor

# Generate a new private key
# Note that the file below doesn't exist yet: you will create it
private_key_filepath = r'/path/to/private/key'
private_key = RSACryptor.create_new_rsa_key(Path(private_key_filepath))

# Generate the public key based on the private one
public_key_bytes = RSACryptor.create_public_key_bytes(private_key)
public_key = bytes_to_base64s(public_key_bytes)

Now, we can create an organization

client.organization.create(
    name = 'The_Shire',
    address1 = '501 Buckland Road',
    address2 = 'Matamata',
    zipcode = '3472',
    country = 'New Zealand',
    domain = 'the_shire.org',
    public_key = public_key
)

You can use public_key = None if you haven't set up encryption

Users can now be created for this organization. Any users that are created and who have access to the private key we generated above can now use encryption by running client.setup_encryption(...) with the path to that private key after they authenticate.

Creating a collaboration

In this section, you will learn how to use the client to create a new collaboration on the server.

Here, we assume that you have a Python session with an authenticated Client object, as created in Authentication. We also assume that you have a login on the Vantage6 server that has the permissions to create a new collaboration (regular end-users typically do not have these permissions; this is typically only for administrators).

A collaboration is an association of multiple organizations that want to run analyses together. First, you will need to find the organization IDs of the organizations you want to be part of the collaboration.

Once you know the IDs of the organizations you want in the collaboration (e.g. 1 and 2), you can create the collaboration:
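A minimal sketch of this step is given below. It assumes that the client exposes `collaboration.create` with `name`, `organizations` and `encrypted` arguments, by analogy with `client.organization.create`; these argument names are assumptions, so confirm them with `help(client.collaboration.create)`. The wrapper function itself is hypothetical and only exists to keep the example self-contained.

```python
def create_collaboration(client, name, organization_ids, encrypted=True):
    """Create a collaboration for the given organization IDs.

    `client` is expected to be an authenticated vantage6 Client object.
    The keyword names passed to create() are assumptions, not confirmed API.
    """
    return client.collaboration.create(
        name=name,
        organizations=organization_ids,
        encrypted=encrypted,
    )
```

With an authenticated client, this could be called as `create_collaboration(client, "my_collaboration", [1, 2])`.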

Note that a collaboration can require participating organizations to use encryption, by passing the encrypted = True argument (as we did above) when creating the collaboration. It is recommended to do so, but this requires that a keypair was created when creating the organization and that each user of that organization has access to the private key so that they can run the client.setup_encryption(...) command after authenticating.

Registering a node

In this section, you will learn how to use the client to register a new node with the server.

Here, we assume that you have a Python session with an authenticated Client object, as created in Authentication. We also assume that you have a login on the Vantage6 server that has the permissions to create a new node (regular end-users typically do not have these permissions; this is typically only for administrators).

A node is associated with both a collaboration and an organization (see Concepts). You will need to find the collaboration and organization IDs for the node you want to register:

Then, we register a node with the desired organization and collaboration. In this example, we create a node for the organization with id 1 and collaboration with id 1.
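As a hedged sketch of this step: the function below assumes that the client exposes `node.create` with `collaboration` and `organization` arguments, and that the response is a dictionary containing an `api_key` entry. Both are assumptions; verify them with `help(client.node.create)`. The wrapper function is hypothetical.

```python
def register_node(client, collaboration_id, organization_id):
    """Register a node and return its API key.

    `client` is expected to be an authenticated vantage6 Client object.
    The keyword names and the "api_key" response field are assumptions.
    """
    node = client.node.create(
        collaboration=collaboration_id,
        organization=organization_id,
    )
    return node["api_key"]
```

For the example in the text, `api_key = register_node(client, 1, 1)` would register the node for organization 1 in collaboration 1.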

Remember to save the api_key that is returned here, since you will need it when configuring the node.

Creating a task

In this section, you will learn how to create a task from a client.

Preliminaries

Here we assume that

you have a Python session with an authenticated Client object, as created in Authentication

R Client

It is assumed you installed the vantage6-client. The R client can create tasks and retrieve their results. If you want to do more administrative tasks, either use the API directly or use the Python client.

Initialization of the R client can be done by:

setup.client <- function() {
  # Username/password should be provided by the administrator of
  # the server.
  username <- "MY USERNAME"
  password <- "password"
  
  host <- 'https://petronas.vantage6.ai:443'
  api_path <- ''
  
  # Create the client & authenticate
  client <- vtg::Client$new(host, api_path=api_path)
  client$authenticate(username, password)

  return(client)
}

# Create a client
client <- setup.client()

Then this client can be used for the different algorithms. Refer to the README in the repository on how to call the algorithm. Usually this includes installing some additional client-side packages for the specific algorithm you are using.

The R client is subject to change. We aim to make it more similar to the Python client.

Example

First you need to install the client side of the algorithm by:

This is the code to run the coxph:

Server API

The server API is documented on the URL:

View the API documentation of the server you are using

GET https://SERVER[/api_path]/apidocs

For Petronas, the API docs can thus be found at https://petronas.vantage6.ai/apidocs. This page will show you which API endpoints exist and how you can use them. All endpoints communicate via HTTP requests, so you can communicate with them using any platform or programming language that supports HTTP requests.

Node

The node runs algorithms requested by clients

It is assumed you have successfully installed vantage6-node. To verify this you can run the command vnode --help. If that prints a list of commands, the installation is completed. Also, make sure that Docker is running.

An organization runs a node for each of the collaborations it participates in

Quick start

To create a new node, run the command below. A menu will be started that allows you to set up a node configuration file. For more details, check out the Configure page.

To run a node, execute the command below. The --attach flag will cause log output to be printed to the console.

Finally, a node can be stopped again with:

Available commands

Below is a list of all commands you can run for your node(s). To see all available options per command use the --help flag, e.g. vnode start --help.

See the following sections on how to configure and maintain a vantage6-node instance:

Security

Probably important

As a data owner it is important that you take the necessary steps to protect your data. Vantage6 allows algorithms to run on your data and share the results with other parties. It is important that you review the algorithms before allowing them to run on your data.

Once you have approved the algorithm, it is important that you can verify that the approved algorithm is the algorithm that runs on your data. There are two important steps to be taken to accomplish this:

Set the (optional) allowed_images option in the node-configuration file. You can specify a regular expression here. For example
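To get a feeling for how a regular expression restricts image names, consider the sketch below. It is not the node's actual implementation: the helper is hypothetical, the pattern is an invented example, and whether the node anchors the match at the start of the image name is an assumption.

```python
import re


def is_image_allowed(image, allowed_patterns):
    """Return True if the image name matches at least one allowed pattern.

    Illustrative only; the matching rules of the real node may differ.
    """
    return any(re.match(pattern, image) for pattern in allowed_patterns)


# Hypothetical pattern: only allow images from one project in one registry.
ALLOWED_IMAGES = [r"^harbor2\.vantage6\.ai/demo/.*"]
```

With this pattern, `harbor2.vantage6.ai/demo/average:latest` is accepted, while an image from any other registry or project is rejected. Note that the dots are escaped so that they only match a literal dot.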

Server

The server manages users, organizations, collaborations, tasks and results. In this section we will explain how to configure and manage a server.

It is assumed that you successfully installed vantage6-server. To verify this, you can run the command vserver --help. If that prints a list of commands, your installation is successful. Also, make sure that Docker is running.

Quick start

To create a new server, run the command below. A menu will be started that allows you to set up a server configuration file. For more details, check out the Configure page.

To run a server, execute the command below. The --attach flag will cause log output to be printed to the console.

When the server is run for the first time, a user is created:

username: root
password: root

Finally, a server can be stopped again with:

Available commands

The following commands are available in your environment. To see all the options that are available per command use the --help flag, e.g. vserver start --help.

The following sections explain how to use these commands to configure and maintain a vantage6-server instance:

Deployment

In this section we'll explain how to deploy a vantage6 server.

vantage6 uses Flask as backbone, together with flask-socketio for websocket support. The server runs as a standalone process (listening on its own ip address/port).

From version 3.2+ it is possible to horizontally scale the server (This upgrade is also made available to version 2.3.4)

Documentation on how to deploy it will be shared here. Reach out to us on Discord for now.

~~Because there is no message broker used for the websocket channel, it is currently not possible to horizontally scale the vantage6-server~~

There are many deployment options, so these examples are neither complete nor exhaustive.

NGINX

Below is a basic setup. Note that SSL is not configured in this example.

When you configure the server, make sure to include the /subpath that has been set in the NGINX configuration in the api_path setting (e.g. api_path: /subpath/api).

Azure app service

TODO

Logging

Logging is enabled by default. To configure the logger, look at the logging section of the server configuration.

Useful commands:

vserver files: shows you where the log file is stored
vserver attach: show live logs of a running server in your current console. This can also be achieved when starting the server with vserver start --attach

Algorithms

Concepts

Algorithms are executed at the vantage6-node. The node receives a computation task from the vantage6-server. The node will then retrieve the algorithm, execute it and return the results to the server.

Algorithms are shared using Docker images, which are stored in a Docker registry that is accessible to the nodes. In the following sections we explain the fundamentals of algorithm containers.

Interface between the node and algorithm container

Library to simplify and standardize the node-algorithm input and output

Creating subtasks from an algorithm container

Communicate with other algorithm containers and the vantage6-server

Cross language data serialization

Wrappers

The algorithm wrapper simplifies and standardizes the interaction between algorithm and node. The client libraries and the algorithm wrapper are tied together and use the same standards. The algorithm wrapper:

reads the environment variables and file mounts and supplies these to your algorithm.
provides an entrypoint for the docker container
allows you to write a single algorithm for multiple types of data sources

The wrapper is language specific and currently we support Python and R. Extending this concept to other languages is not so complex.

Federated functions

The signature of your function has to contain data as the first argument. The function name should have an RPC_ prefix. Everything that is returned by the function will be written to the output file.
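A minimal sketch of a federated function that follows these rules is shown below. The function name and the `column_name` parameter are hypothetical. The sketch assumes that `data` supports column access by name, as a pandas DataFrame or a plain dictionary of lists does.

```python
def RPC_column_sum(data, column_name):
    """Compute the partial sum and row count for one column of the local data.

    `data` is the first argument, as required; the RPC_ prefix marks the
    function as callable through the wrapper.
    """
    values = list(data[column_name])
    # Only aggregated values are returned, never the individual records.
    return {"sum": sum(values), "count": len(values)}
```

The returned dictionary is what ends up in the output file, so it should contain only values that are safe to share with the other parties.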

Central functions

It is quite common to have a central part of your federated analysis which orchestrates the algorithm and combines the partial results. A common pattern for a central function would be:

Request partial models from all participants
Obtain the partial models
Combine the partial models to a global model
(optional) Repeat steps 1-3 until the model converges
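The pattern above can be sketched independently of vantage6 as a plain loop. All names below are hypothetical, and the three callables stand in for whatever your algorithm uses to request partial models, combine them and test for convergence.

```python
def run_central_loop(request_partials, combine, has_converged,
                     initial_model, max_rounds=10):
    """Generic orchestration loop for the central part of an analysis."""
    model = initial_model
    for _ in range(max_rounds):
        # Steps 1 and 2: request and obtain the partial models.
        partials = request_partials(model)
        # Step 3: combine the partial models into a global model.
        new_model = combine(partials)
        # Step 4 (optional): stop as soon as the model no longer changes.
        if has_converged(model, new_model):
            return new_model
        model = new_model
    # Give up after max_rounds and return the latest model.
    return model
```

The `max_rounds` limit is a design choice of this sketch: it prevents an analysis that never converges from creating tasks indefinitely.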

It is possible to run the central part of the analysis on your own machine, but it is also possible to let vantage6 handle the central part. There are several advantages to letting vantage6 handle this:

You don't have to keep your machine running during the analysis
You don't need to use the same programming language as the algorithm in case a language specific serialization is used in the algorithm

Note that central functions also run at a node and not at the server.

In contrast to the federated functions, central functions are not prefixed. The first argument needs to be client and the second argument needs to be data. The data argument contains the local data and the client argument provides an interface to the vantage6-server.
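The sketch below shows what a central function with this signature could look like. It assumes a hypothetical federated function that returns a `sum` and a `count` per node, invoked here under the method name `column_sum`. The client method names used (`get_organizations_in_my_collaboration`, `create_new_task`, `get_task`, `get_results`) and the shape of their return values are assumptions about the client supplied by the Python wrapper; verify them against the wrapper version you use.

```python
import time


def central_average(client, data, column_name, poll_interval=1.0):
    """Central function: `client` first, `data` second, no RPC_ prefix."""
    # Find all organizations in the collaboration (assumed client method).
    organizations = client.get_organizations_in_my_collaboration()
    ids = [organization["id"] for organization in organizations]

    # Ask every node to run the (hypothetical) federated function.
    task = client.create_new_task(
        input_={"method": "column_sum", "kwargs": {"column_name": column_name}},
        organization_ids=ids,
    )
    task_id = task["id"]

    # Wait until all nodes have finished (assumed "complete" field).
    while not client.get_task(task_id)["complete"]:
        time.sleep(poll_interval)

    # Combine the partial results into a global average.
    partials = client.get_results(task_id=task_id)
    total = sum(partial["sum"] for partial in partials)
    count = sum(partial["count"] for partial in partials)
    return {"average": total / count}
```

The local `data` argument is unused here because this central function only orchestrates; it is still part of the signature.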

The argument data is not present in the R-wrapper. This is a consistency issue which will be solved in a future release.

Different wrappers

The Docker wrappers read the local data source and supply it to the functions in your algorithm. Currently, CSV and SPARQL wrappers are supported for Python, and a CSV wrapper is supported for R. Since the wrapper handles the reading of the data, you need to rebuild your algorithm with a different wrapper to make it compatible with a different type of data source. You do this by updating the CMD directive in the Dockerfile.

Data serialization

TODO

Mock client

TODO

Child containers

When a user creates a task, one or more nodes spawn an algorithm container. These algorithm containers can create new tasks themselves.

Every algorithm is supplied with a JWT token. This token can be used to communicate with the vantage6-server. If you use an algorithm wrapper, you can simply use the Client that is supplied to your central function.

A child container can be a parent container itself. There is no limit to the number of task layers that can be created. It is common to have only a single parent container which handles many child containers.

The token to which the containers have access is limited. The token can only be used to create a task in the same collaboration and using the same image.

Networking

The algorithm container is deployed in an isolated network to prevent it from reaching unwanted destinations. There are two exceptions:

When the VPN feature is enabled on the server, all algorithm containers are able to reach each other using an IP address and port.
The central server is reachable through a local proxy service. In the algorithm you can use the HOST, PORT and API_PATH environment variables to find the address of the server.
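As an illustration of the second exception, the address of the proxy could be assembled inside an algorithm as sketched below. The sketch assumes that the variables are named HOST, PORT and API_PATH, that HOST already includes the scheme, and that the three parts can simply be concatenated; the helper name is hypothetical.

```python
import os


def server_address_from_environment(environ=None):
    """Build the address of the (proxied) server from environment variables."""
    environ = os.environ if environ is None else environ
    host = environ["HOST"]
    port = environ["PORT"]
    # Treat a missing API path as empty.
    api_path = environ.get("API_PATH", "")
    return f"{host}:{port}{api_path}"
```

When you use one of the algorithm wrappers, you normally do not need this, because the supplied client already knows where to find the server.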

We are currently working on a whitelisting feature which allows a node to configure addresses that the algorithm container is able to reach.

VPN connection

Algorithm containers can expose one or more ports. These ports can then be used by other algorithm containers to exchange data. The infrastructure uses the Dockerfile from which the algorithm has been built to determine which ports are used by the algorithm. This is done by using the EXPOSE and LABEL directives.

For example, suppose an algorithm uses two ports: one port for communication, labelled com, and one port for data exchange, labelled data. The following block should then be added to your algorithm Dockerfile:

Ports 8888 and 8889 are the internal ports on which the algorithm container listens. When another container wants to communicate with this container, it can retrieve the IP address and external port from the central server by using the result_id and the label of the port you want to use (com or data in this case).

Cross language

Because algorithms are exchanged through Docker images they can be written in any language. This is an advantage as developers can use their preferred language for the problem they need to solve.

The wrappers are only available for R and Python, so when you use a different language you need to handle the IO yourself. Consult the section on what the node supplies to your algorithm container.

When data is exchanged between the user and the algorithm they both need to be able to read the data. When the algorithm uses a language specific serialization (e.g. a pickle in the case of Python or RData in the case of R) the user needs to use the same language to read the results. A better solution would be to use a type of serialization that is not specific to a language. For our wrappers we use JSON for this purpose.
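The difference can be illustrated with a short sketch. The helper names are hypothetical and this is not the wrappers' actual serialization code; it only shows that a JSON payload consists of plain text that any language with a JSON parser can read, in contrast to a pickle.

```python
import json


def serialize_result(result):
    """Encode an algorithm result as UTF-8 JSON bytes."""
    return json.dumps(result).encode("utf-8")


def deserialize_result(payload):
    """Decode UTF-8 JSON bytes back into Python objects."""
    return json.loads(payload.decode("utf-8"))
```

The trade-off is that JSON only supports basic types (numbers, strings, booleans, lists and dictionaries), so richer objects such as data frames or fitted models have to be converted to these types first.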

Package & distribute

Once the algorithm is completed it needs to be packaged and made available for retrieval by the nodes. The algorithm is packaged in a Docker image. A Docker image is created from a Dockerfile, which acts as a blueprint. Once the Docker image is created it needs to be uploaded to a registry so that nodes can retrieve it.

Dockerfile

A minimal Dockerfile should specify a base image, copy your algorithm into the image and define the command that executes your algorithm. For example:

# python3 image as base
FROM python:3

# copy your algorithm in the container
COPY . /app

# maybe your algorithm is installable.
RUN pip install /app

# execute your application
CMD python /app/app.py

When using the Python wrapper the Dockerfile needs to follow a certain format. You should only change the PKG_NAME value to the Python package name of your algorithm.

When using the Python wrapper your algorithm needs to be installable. See for more information on how to create a Python package.

When using the R wrapper the Dockerfile needs to follow a certain format. You should only change the PKG_NAME value to the R package name of your algorithm.

Additional Docker directives are needed when using direct communication between different algorithm containers, see for more information on this.

Build & upload

If you are in the folder containing the Dockerfile, you can build the project as follows:

The -t option specifies the name of your image. This name is also used as the reference to where the image is located on the internet. If you use Docker Hub to store your images, you only specify your username as repo, followed by your image name and tag: USERNAME/IMAGE_NAME:IMAGE_TAG. When using a private registry, repo should also contain the URL of the registry, e.g. harbor2.vantage6.ai/PROJECT/IMAGE_NAME:TAG.

Then you can push your image:

Now that it has been uploaded, it is available for nodes to retrieve when they need it.

Signed images

It is possible to use the Docker framework to create signed images. When using signed images, the node can verify the author of the algorithm image, which adds an additional layer of protection.

Tutorial

TODO

Introduction

TODO

References

Glossary

Here we will provide definitions of all the important concepts used in vantage6 (and Federated Learning).

📝 Currently, we are working on a paper where most of these concepts are explained in a more cohesive, well-structured manner, together with how vantage6 works. As soon as it is ready, we will post it on our website.

Autonomy: the ability of a party to be in charge of the control and management of its own data.

Classic Tutorial

In this section the basic steps for creating an algorithm for horizontally partitioned data are explained.

The final code of this tutorial is published on GitHub. The algorithm is also published in our Docker registry: harbor2.vantage6.ai/demo/average

It is assumed that it is mathematically possible to create a federated version of the algorithm you want to use. In the following sections we create a federated algorithm to compute the average of a distributed dataset. An overview of the steps that we are going through:

Mathematically decompose the model
Federated implementation and local testing
Vantage6 algorithm wrapper
Dockerize and push to a registry

This tutorial shows you how to create a federated mean algorithm.

Mathematical decomposition

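The mean of $Q = [q_1, q_2, \ldots, q_n]$ is computed as:

$$Q_{mean} = \frac{1}{n} \sum_{i=1}^{n} q_i = \frac{q_1 + q_2 + \ldots + q_n}{n}$$

When dataset $Q$ is horizontally partitioned in dataset $A = [a_1, \ldots, a_j]$ and $B = [b_1, \ldots, b_k]$, with $j + k = n$, we would like to compute $Q_{mean}$ from dataset $A$ and $B$. This could be computed as:

$$Q_{mean} = \frac{(a_1 + \ldots + a_j) + (b_1 + \ldots + b_k)}{j + k} = \frac{\sum A + \sum B}{j + k}$$

Both the number of samples in each dataset and the total sum of each dataset are needed. Then we can compute the global average of dataset $A$ and $B$.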

We cannot simply compute the average on each node and combine them, as this would be mathematically incorrect. This would only work if dataset A and B contain the exact same number of samples.
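A small numerical check makes this concrete. The sketch below uses two made-up datasets of unequal size and compares the naive average of the local averages with the pooled mean computed from the sums and counts.

```python
# Two made-up datasets of unequal size (illustrative values only).
dataset_a = [1.0, 2.0, 3.0]
dataset_b = [10.0, 20.0]


def mean(values):
    """Plain arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Naive approach: average the two local averages (incorrect for unequal sizes).
naive = mean([mean(dataset_a), mean(dataset_b)])

# Federated approach: combine the local sums and the local counts.
pooled = (sum(dataset_a) + sum(dataset_b)) / (len(dataset_a) + len(dataset_b))

# Reference: the mean of the combined dataset.
true_mean = mean(dataset_a + dataset_b)

print(naive)      # 8.5
print(pooled)     # 7.2
print(true_mean)  # 7.2
```

The naive value differs from the true mean because it weighs the two observations in dataset B as heavily as the three observations in dataset A.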

Federated implementation

In this example we use Python, but you are free to use any language. The only requirements are that the language 1) is able to create HTTP requests, and 2) is able to read from and write to files.

However, if you use a different language you are not able to use our wrapper. Reach out to us to discuss how this works.

A federated algorithm consists of two parts:

A federated part of the algorithm which is responsible for creating the partial results. In our case this would be computing (1) the sum of the observations, and (2) the number of observations.
A central part of the algorithm which is responsible for combining the partial results from the nodes. In the case of the federated mean that would be dividing the total sum of the observations by the total number of observations.

The central part of the algorithm can either be run on the machine of the researcher or in a master container which runs on a node. The latter is the preferred method.

If the researcher runs this part, they need a proper setup to do so (i.e. Python 3.5+ and the necessary dependencies). This can be useful when developing new algorithms.

1 Federated part

The node that runs this part contains a CSV-file with one column (specified by the argument column_name) which we want to use to compute the global mean. We assume that this column has no NaN values.
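A minimal sketch of this federated part is shown below. It is not the tutorial's published code: the function name federated_part is hypothetical, and the sketch reads the CSV file with Python's standard csv module to stay dependency-free.

```python
import csv


def federated_part(csv_path, column_name):
    """Compute the partial result (sum and count) of one column in a local CSV file.

    Assumes the column contains only numeric values and no NaN values.
    """
    total = 0.0
    count = 0
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            total += float(row[column_name])
            count += 1
    # Only aggregated values are returned, never individual records.
    return {"sum": total, "count": count}
```

For a file with a column numbers holding the values 1, 2 and 3, calling federated_part(path, "numbers") returns a sum of 6.0 and a count of 3.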

2 Central part

The central algorithm receives the sums and counts from all sites and combines these to a global mean. This could be from one or more sites.
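The combination step can be sketched as follows. The function name central_part and the dictionary keys sum and count are assumptions made for this illustration; the partial results are expected to be one dictionary per site.

```python
def central_part(partial_results):
    """Combine the partial results (sum and count per site) into a global mean."""
    total = sum(partial["sum"] for partial in partial_results)
    count = sum(partial["count"] for partial in partial_results)
    if count == 0:
        # Guard against dividing by zero when no site reported any observations.
        raise ValueError("no observations received from any site")
    return total / count


# Made-up partial results from two sites.
partials = [{"sum": 6.0, "count": 3}, {"sum": 30.0, "count": 2}]
print(central_part(partials))  # 7.2
```

The same function works for a single site, in which case the global mean equals the local mean.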

Local testing

To test, simply create two datasets A and B, both having a numerical column numbers. Then run the following:

Vantage6 integration

A good starting point would be to use the boilerplate code from our . This section outlines the steps needed to get to this boilerplate but also provides some background information.

In this example we use a csv-file. It is also possible to use other types of data-sources. This tutorial makes use of our algorithm wrapper which is currently only available for csv and SPARQL.

Other wrappers like SQL, OMOP, etc. are under consideration. Let us know if you want to use one of these or other data sources.

Now that we have a federated implementation of our algorithm we need to make it compatible with the vantage6 infrastructure. The infrastructure handles the communication with the server and provides data access to the algorithm.

The algorithm consumes a file containing the input. This contains both the method name to be triggered as well as the arguments provided to the method. The algorithm also has access to a CSV file (in the future this could also be a database) on which the algorithm can run. When the algorithm is finished, it writes back the output to a different file.
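The flow described above can be sketched as follows. This is an illustration of the idea, not the actual wrapper: the function run_algorithm, the JSON layout of the input file (keys method, args and kwargs) and the explicit file paths are assumptions made for this example.

```python
import csv
import json


def average_partial(rows, column_name):
    """Example method: sum and count of one column of the local data."""
    values = [float(row[column_name]) for row in rows]
    return {"sum": sum(values), "count": len(values)}


# Methods that may be triggered through the input file (hypothetical registry).
METHODS = {"average_partial": average_partial}


def run_algorithm(input_path, data_path, output_path, methods=METHODS):
    """Read the input file, run the requested method on the CSV data, write the output."""
    # 1. The input file names the method and carries its arguments.
    with open(input_path) as handle:
        task = json.load(handle)

    # 2. The local data is read from the CSV file supplied by the node.
    with open(data_path, newline="") as handle:
        rows = list(csv.DictReader(handle))

    # 3. Dispatch to the requested method.
    method = methods[task["method"]]
    result = method(rows, *task.get("args", []), **task.get("kwargs", {}))

    # 4. The result is written to a separate output file.
    with open(output_path, "w") as handle:
        json.dump(result, handle)
    return result
```

Keeping the file handling in one place means the algorithm methods themselves only deal with data and arguments, which also makes them easy to test locally.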

The central part of the algorithm has to be able to create (sub)tasks. These subtasks are responsible for executing the federated part of the algorithm. The central part of the algorithm can either be executed on one of the nodes in the vantage6 network or on the machine of a researcher. In this example we only show the case in which one of the nodes executes the central part of the algorithm. The node provides the algorithm with a JWT token so that the central part of the algorithm has access to the server to post these subtasks.

📂Algorithm Structure

The algorithm needs to be structured as a Python package. This way the algorithm can be installed within the Docker image. The minimal file structure would be:

We also recommend adding a README.md, LICENSE and requirements.txt to the project_folder.

setup.py

Contains the setup method to create a package from your algorithm code. Here you specify some details about your package and the dependencies it requires.

The setup.py above is sufficient in most cases. However if you want to do more advanced stuff (like adding static data, or a CLI) you can use the from setup.

🐳 Dockerfile

Contains the recipe for building the Docker image. Typically you only have to change the argument PKG_NAME to the name of your package. This name should be the same as the name you specified in the setup.py. In our case that would be v6-average-py.

`__init__.py`

This contains the code for your algorithm. It is possible to split this into multiple files, however the methods that should be available to the researcher should be in this file. You can do that by simply importing them into this file (e.g. from .average import my_nested_method)

We can distinguish two types of methods that a user can trigger:

The client the master method receives is a ContainerClient, which is different from the client you use as a user.

Everything that is returned by the return statement is sent back to the central vantage6-server. This should never contain any privacy-sensitive information.

For our average algorithm the implementation will look as follows:
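The block below is a hedged sketch of such an implementation, not the published tutorial code. It assumes that a master method receives the client as its first argument and that node-side methods are exposed with an RPC_ prefix. The client method names get_organizations_in_my_collaboration and get_results, the layout of the task input, and the StubClient class (which only exists to make the sketch runnable without a vantage6 network) are assumptions for this illustration.

```python
def RPC_average_partial(data, column_name):
    """Federated part: executed at each node on its local data."""
    column = data[column_name]
    return {"sum": float(sum(column)), "count": len(column)}


def master(client, data, column_name):
    """Central part: creates the subtasks and combines their results."""
    organizations = client.get_organizations_in_my_collaboration()
    organization_ids = [organization["id"] for organization in organizations]

    # Ask every organization to run the federated part.
    task = client.create_new_task(
        input_={"method": "average_partial", "kwargs": {"column_name": column_name}},
        organization_ids=organization_ids,
    )
    results = client.get_results(task_id=task["id"])

    global_sum = sum(result["sum"] for result in results)
    global_count = sum(result["count"] for result in results)
    # Only the aggregate is returned, so no record-level data leaves the nodes.
    return {"average": global_sum / global_count}


class StubClient:
    """Hypothetical stand-in for the client; runs the subtasks in-process."""

    def __init__(self, datasets, methods):
        self.datasets = datasets  # one dataset per organization
        self.methods = methods    # method name -> function
        self.results = {}

    def get_organizations_in_my_collaboration(self):
        return [{"id": index} for index in range(len(self.datasets))]

    def create_new_task(self, input_, organization_ids):
        method = self.methods[input_["method"]]
        task_id = len(self.results)
        self.results[task_id] = [
            method(self.datasets[organization_id], **input_["kwargs"])
            for organization_id in organization_ids
        ]
        return {"id": task_id}

    def get_results(self, task_id):
        return self.results[task_id]


client = StubClient(
    datasets=[{"numbers": [1, 2, 3]}, {"numbers": [10, 20]}],
    methods={"average_partial": RPC_average_partial},
)
print(master(client, data=None, column_name="numbers"))  # {'average': 7.2}
```

In a real network the subtasks run asynchronously on the nodes, so the master method would have to wait for the task to complete before collecting the results; the stub skips this by running everything immediately.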

Local testing

Now that we have a vantage6 implementation of the algorithm it is time to test it. Before we run it in a vantage6 setup we can test it locally by using the ClientMockProtocol which simulates the communication with the central server.

Before we can locally test it we need to (editable) install the algorithm package so that the Mock client can use it. Simply go to the root directory of your algorithm package (with the setup.py file) and run the following:

Then create a script to test the algorithm:

Building and Distributing

Now that we have a fully tested algorithm for the vantage6 infrastructure, we need to package it so that it can be distributed to the data stations/nodes. Algorithms are delivered in Docker images, which is what the Dockerfile is for. To build an image from your algorithm (make sure Docker is installed and running), run the following command from the root directory of your algorithm project.

The option -t specifies the (unique) identifier used by the researcher to use this algorithm. Usually this includes the registry address (harbor2.vantage6.ai) and the project name (demo).

If you are using Docker Hub as registry, you do not have to specify the registry or project, as these default to Docker Hub and your Docker Hub username.

Reach out to us if you want to use our registries (harbor.vantage6.ai and harbor2.vantage6.ai).

Cross-language serialization

It is possible that a vantage6 algorithm is developed in one programming language, but you would like to run the task from another language. For these use-cases, the Python algorithm wrapper and client support cross-language serialization. By default, input to the algorithms and output back to the client are serialized using pickle. However, it is possible to define a different serialization format.

Input and output serialization can be specified as follows:

Release notes

3.5.2

30 November 2022

Bugfix

- Fix for automatic addition of column. This failed in some SQL dialects because reserved keywords (e.g. 'user' for PostgreSQL) were not escaped (PR#415)
- Correct installation order for uWSGI in node and server docker file ()

3.5.1

30 November 2022

Bugfix

- Backwards compatibility for which organization initiated a task between v3.0-3.4 and v3.5 ()
- Fixed VPN client container. Entry script was not executable in GitHub pipelines ()

3.5.0

30 November 2022

When upgrading to 3.5.0, you might need to add the otp_secret column to the user table manually in the database. This may be avoided by upgrading to 3.5.2.

Feature
- TOTP Multi-Factor-Authenticator has been added. Admins can enforce that all users enable MFA (, ).
- The server support email is now settable in the configuration file; it used to be a fixed address (, ).

3.4.2

3 November 2022

Bugfix
- Fixed a bug in the local proxy server which made algorithm containers crash in case the client.create_new_task method was used ().
- Fixed a bug that crashed the node when a non-existing image was sent in a task ().

3.4.0 & 3.4.1

25 October 2022

Feature
- Add columns to the SQL database on startup (, ). This simplifies the upgrade process when a new column is added in a new release, as you no longer need to add columns manually. When downgrading, the columns will not be deleted.
- Docker wrapper for Parquet files (, ). Parquet provides a way to store tabular data with the datatypes included which is an advantage over CSV.

This version is also the first version to be released with container images for both ARM and x86 architecture.

3.4.1 is a rebuild of 3.4.0 in which all dependencies are fixed, as the 3.4.0 build led to a broken server image.

3.3.7

Bugfix
- The function client.util.change_my_password() was updated ()

3.3.6

Bugfix
- Temporary fix for a bug that prevents the master container from creating tasks in an encrypted collaboration. This temporary fix disables the parallel encryption module in the local proxy. This functionality will be restored in a future release.

This version is also the first version where the User Interface is available in the right version. From this point onwards, the user interface changes will also be part of the release notes.

3.3.5

Feature
- The release pipeline has been expanded to automatically push new Docker images of node/server to the harbor2 service.
Bugfix

Note that 3.3.4 was only released on PyPi and that version is identical to 3.3.5. That version was otherwise skipped due to a temporary mistake in the release pipeline.

3.3.3

Bugfix
- Token refresh was broken for both users and nodes. (, )
- Local proxy encryption was broken. This prevented algorithms from creating subtasks when encryption was enabled. (, )

3.3.2

Bugfix
- vpn_client_image and network_config_image are settable through the node configuration file. (, )
- The option --all

3.3.1

Bugfix
- Fixed faulty error status codes from the /collaboration endpoint ().
- Default roles are always returned from the /role endpoint. This fixes the error when a user was assigned a default role but could not reach anything (as it could not view its own role) (

3.3.0

Feature
- Login requirements have been updated. Passwords are now required to have sufficient complexity (8+ characters, and at least 1 uppercase, 1 lowercase, 1 digit, 1 special character). Also, after 5 failed login attempts, a user account is blocked for 15 minutes (these defaults can be changed in a server config file).
- Added endpoint /password/change to allow users to change their password using their current password as authentication. It is no longer possible to change passwords via client.user.update()

3.2.0

Feature
- Horizontal scaling for the vantage6-server instance by adding support for RabbitMQ.
- It is now possible to connect other docker containers to the private algorithm network. This enables you to attach services to the algorithm network using the docker_services setting.

3.1.0

Feature
- Algorithm-to-algorithm communication can now take place over multiple ports, which the algorithm developer can specify in the Dockerfile. Labels can be assigned to each port, facilitating communication over multiple channels.
- Multi-database support for nodes. It is now also possible to assign multiple data sources to a single node in Petronas; this was already available in Harukas 2.2.0. The user can request a specific data source by supplying the database argument when creating a task.

3.0.0

Feature
- Direct algorithm-to-algorithm communication has been added. Via a VPN connection, algorithms can exchange information with one another.
- Pagination is added. Metadata is provided in the headers by default. It is also possible to include them in the output body by supplying an additional parameter include=metadata. Parameters page and per_page

2.3.5

31 October 2022

Bugfix
- Encryption module in the local proxy server has been fixed

2.3.0 - 2.3.4

Feature
- Allows for horizontal scaling of the server instance by adding support for RabbitMQ. Note that this has not been released for version 3(!)
Bugfix

2.2.0

Feature
- Multi-database support for nodes. It is now possible to assign multiple data sources to a single node. The user can request a specific data source by supplying the database argument when creating a task.
- The mailserver now supports TLS and SSL options

2.1.2 and 2.1.3

Bugfix
- Changes to the way the application interacts with the database. Solves the issue of unexpected disconnects from the DB and thereby freezing the application.

2.1.1

Bugfix
- Updating the country field in an organization works again
- The client.result.list(...) broke when it was not able to deserialize one of the in- or outputs.

2.1.0

Feature
- Custom algorithm environment variables can be set using the algorithm_env key in the configuration file.
- Support for non-file-based databases on the node.

2.0.0.post1

Bugfix
- Fixed a bug that prevented the usage of secured registry algorithms

2.0.0

Feature
- Role/rule based access control
  - Roles consist of a bundle of rules. Rules provide access to certain API endpoints at the server.

1.2.3

Feature
- The node is now compatible with the Harbor v2.0 API

1.2.2

Bug fixes
- Fixed a bug that ignored the --system flag from vnode start
- Logging output muted when the --config option is used in vnode start

1.2.1

Bug fixes
- Starting the server for the first time resulted in a crash as the root user was not supplied with an email address.
- Algorithm containers could still access the internet through their host. This has been patched.

1.2.0

Features
- Cross language serialization. Enabling algorithm developers to write algorithms that are not language dependent.
- Reset password is added to the API. For this purpose two endpoints have been added: /recover/lost and /recover/reset. The server config file needs to be extended to connect to a mail server in order to make this work.

1.1.0

Features
- New command vnode clean to clean up temporary Docker volumes that are no longer used
- Versions of the individual packages are printed in the console on startup

1.0.0

Updated Command Line Interface (CLI)
- The commands vnode list, vnode start and the new command vnode attach are aimed to work with multiple nodes on a single machine.
- System and user-directories can be used to store configurations by using the