Contact us via the Discourse forum!
Anja van Gestel
Bart van Beusekom
Frank Martin
Hasan Alradhi
Gijs Geleijnse
Melle Sieswerda
Djura Smits
Lourens Veen
Johan van Soest
Would you like to contribute? Check out our !
A registry (repository) provides storage and versioning for Docker images. Installing a (private) Docker registry can be useful if you want to securely host your own algorithms.
Docker provides a registry as a turn-key solution on Docker Hub. Instructions for setting it up can be found here: https://hub.docker.com/_/registry.
Harbor is another option for running a registry. Harbor provides access control, a user interface and automated vulnerability scanning.
Installation of any of the vantage6 packages requires Python 3.7. For installation instructions, see python.org, anaconda.com or use the package manager native to your OS and/or distribution (e.g. apt for Debian or Ubuntu, yum for Fedora, or yast for SuSE).
vantage6 consists of several components that can be installed. Which component(s) you need depends on your use case. The requirements also differ per component.
You can interact with the server via the API. You can explore the server API on https://<serverdomain>/apidocs (e.g. https://petronas.vantage6.ai/apidocs for Petronas).
You can use any language to interact with the server as long as it supports HTTP requests. For Python and R we have written wrappers to simplify the interaction with the server: see the client installation instructions for more details on how to install these.
A (central) server allows parties to connect and exchange data.
To install the vantage6-server, make sure you have met the requirements. Then install the latest version:
This command will install the vantage6 command line interface (CLI), from which you can create new servers (see Use).
There are several optional components that you can set up apart from the vantage6-server itself.
You can set up a User Interface (UI), which is a web application that will allow your users to communicate more easily with your vantage6 server.
An overview of the vantage6 infrastructure and its components
Vantage6 uses both a client-server and peer-to-peer model. In the figure below the client can pose a question to the server, the question is then delivered as an algorithm to the node. When the algorithm completes, the results are sent back to the client via the server. An algorithm can communicate directly with other algorithms that run on other nodes if required.
The server is in charge of processing the tasks as well as of handling administrative functions such as authentication and authorization. Conceptually, vantage6 consists of the following parts:
A Docker registry can be used to store algorithms, but it is also possible to use Docker Hub for this. For instructions on how to install your own Docker registry see Docker registry.
If you want algorithm containers that are running on different nodes to communicate directly with one another, you require a VPN server. Refer to EduVPN on how to install the VPN server.
If you have a server with a high workload whose performance you want to improve, you may want to set up a RabbitMQ service which enables horizontal scaling of the Vantage6 server. See RabbitMQ on how to set this up.
pip install vantage6
Depending on your algorithm it may be required to use a specific language to retrieve the results. This could happen when the output of an algorithm contains a language-specific datatype and/or serialization.
E.g. when the algorithm is written in R and the output is written back in RDS (which is specific to R), you would also need R to read the final output.
Please consult the developer of your algorithm if this is the case.
The (minimal) requirements of the node and server are similar. Note that not all of these are hard requirements: it could well be that it also works on other hardware, operating systems, versions of Python etc.
Hardware
x86 CPU architecture + virtualization enabled
1 GB memory
50GB+ storage
Stable and fast (1 Mbps+) internet connection
Public IP address
Software
Operating system
Ubuntu 18.04+, or
macOS Big Sur+, or
Windows 10
The hardware requirements of the node also depend on the algorithms that the node will run. For example, you need a lot less compute power for a descriptive statistical algorithm than for a machine learning model.
A (central) server that coordinates communication with clients and nodes
One or more node(s) that have access to data and execute algorithms
Organizations that are interested in collaborating
Users (i.e. researchers or other applications) that request computations from the nodes
A Docker registry that functions as a database of algorithms
In this section we explain each of the individual components that are part of the vantage6 network.
Here, when we refer to the server, this includes not only the vantage6-server, but also other components that the vantage6-server uses.
The server is responsible for coordinating all communication in the vantage6 network. It consists of several components:
vantage6-server
Docker registry
VPN server (optional)
The vantage6-server contains the users, organizations, collaborations, tasks and their results. It handles authentication and authorization to the system and is the central point of contact for clients and nodes. The Docker registry contains algorithms which can be used by clients to request a computation. The VPN server is required if algorithms need to be able to engage in peer-to-peer communication.
The node is responsible for executing the algorithms on the local data. It protects the data by allowing only specified algorithms to be executed after verifying their origin. The vantage6-node is responsible for picking up the task, executing the algorithm and sending the results back to the server. The node needs access to local data. This data can either be a file (e.g. csv) or a service (e.g. a database).
The client provides an interface to the server. This allows users and applications to create tasks and retrieve their results. The client also enables you to manage entities at the server (i.e. creating users, organizations and collaborations). Note that the client can interact with the server directly through the API or by using one of our client libraries (e.g. the Python client).
Good that you are here!
Check out our new documentation! This documentation space is no longer maintained. Please find the latest documentation at https://docs.vantage6.ai!
Vantage6 stands for privacy preserving federated learning infrastructure for secure insight exchange.
The project is inspired by the Personal Health Train (PHT) concept. In this analogy vantage6 is the tracks and stations. Compatible algorithms are the trains, and computation tasks are the journey.
vantage6 is here for:
delivering algorithms to data stations and collecting their results
managing users, organizations, collaborations, computation tasks and their results
providing control (security) at the data-stations to their owners
vantage6 is not (yet):
formatting the data at the data station
aligning data across the data stations
a finished/polished product
vantage6 is designed around three fundamental functional aspects of federated learning:
Autonomy. All involved parties should remain independent and autonomous.
Heterogeneity. Parties should be allowed to have differences in hardware and operating systems.
Flexibility. Related to the latter, a federated learning infrastructure should not limit the use of relevant data.
Documentation
-> this documentation
-> unfinished technical documentation
-> general vantage6 website
Source code
-> contains all components (and the python-client).
-> contains all features, bugfixes and feature requests we are working on. To submit one yourself, you can create a new issue.
The old/previous (separated) repositories can still be found at the IKNL GitHub in archived form:
-> contains all other repositories, used for synchronization and releasing
Community
-> discussion platform, ask anything here
-> for if you prefer a quick chat with the developers
This documentation space is intended for users of the vantage6 solution. You will find information on how to set up your own federated learning network, and how to maintain and interact with it.
Here you will not find:
in-depth technical documentation
background on federated learning
Vantage6 is completely open source under the .
If you want to join, find us on our Discord channel.
We provide four ways in which you can interact with the server to manage your vantage6 resources: the user interface (UI), the Python client, the R client, and the server API.
What you need to install depends on which interface you choose. In order to use the UI or the server API, you usually don't need to install anything: the UI is a website, and the API can be called via HTTP requests from a programming language of your choice. For the UI, you only need to set it up in case you are setting up your own server (see User Interface for instructions).
Installation instructions for the Python client and R client are below. For most use cases, we recommend using the UI (for anything except creating tasks) and/or the Python client (which covers server API functionality completely).
Before you install the Python client, we recommend checking the version of the server you are going to interact with. The easiest way of doing that is checking the /version endpoint of the server you are going to use:
GET https://SERVER[/api_path]/version
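As a rough sketch, the same check could be scripted in Python with only the standard library. The helper names version_url and fetch_server_version are made up for this example, and it is an assumption that the endpoint answers with a JSON body.

```python
import json
from urllib.request import urlopen


def version_url(server: str, api_path: str = "") -> str:
    """Build the /version URL from the server address and optional API path."""
    base = server.rstrip("/")
    path = api_path.strip("/")
    return f"{base}/{path}/version" if path else f"{base}/version"


def fetch_server_version(server: str, api_path: str = "", timeout: float = 10.0) -> dict:
    """Call the version endpoint (network access required).

    Assumption: the server replies with a JSON document.
    """
    with urlopen(version_url(server, api_path), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Only the URLs are built here; no request is sent.
print(version_url("https://petronas.vantage6.ai"))
print(version_url("https://example.org:5000/", "/api"))
```

Calling fetch_server_version(...) against your own server should then return whatever version information that server reports.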
Then you can install the vantage6-client with:
pip install vantage6==VERSION
where VERSION is the version you want to install. You may also leave out the version to install the most recent version.
R client library
The R client currently only supports creating tasks and retrieving their results. It cannot (yet) be used to manage resources, such as creating and deleting users and organizations.
You can install the R client by running:
Required for both the node and server
Docker facilitates encapsulation of applications and their dependencies in packages that can be easily distributed to diverse systems. Algorithms are stored in Docker images which nodes can download and execute. Besides the algorithms, both the node and server are also running from a docker container themselves.
Please refer to this page on how to install Docker. To verify that Docker is installed and running you can run the hello-world example from Docker.
docker run hello-world
Note that for Linux, some post-installation steps may be required. Vantage6 needs to be able to run docker without sudo, and these steps ensure just that.
The User Interface (UI) is a web application that aims to make it easy to interact with the server. It allows you to manage all your resources (such as creating collaborations, editing users, or viewing tasks), except for creating new tasks. We aim to incorporate this functionality in the near future.
If you plan on deploying your own server and want to use the UI, follow the installation instructions on the UI Github page. The UI is an Angular application and as such, you may be required to install Node.js. Once you have deployed the UI to the internet, any user that is registered on your vantage6 server will be able to use it.
The UI is not compatible with older versions (<3.3) of vantage6.
If you plan on using the existing Petronas server, you can simply go to https://portal.petronas.vantage6.ai and login with your user account.
Horizontal scaling for servers with high workloads
Please note that RabbitMQ is an optional component. It enables the server to handle multiple requests at the same time. This is important if a server has a high workload.
There are several options to host your own RabbitMQ server. You can run or host . When you have set up your RabbitMQ service, you can connect the server to it by adding the following to the server configuration:
Be sure that the user and vhost that you specify exist! Otherwise, you can add them via the RabbitMQ management console.
Note that RabbitMQ currently (vantage6 version 3.2) does not yet work if you start your server via vserver start.

pip install vantage6==VERSION
devtools::install_github('IKNL/vtg', subdir='src')
vserver start
rabbitmq_uri: amqp://<username>:<password>@<hostname>:5672/<vhost>
vantage6-server -> server source code
vantage6-client -> (python) client source code
vantage6-common -> common functionality
These apply to all components
There are several entities in vantage6, such as users, organizations, tasks, etc. The following statements should help you understand the relationships.
A collaboration is a collection of one or more organizations.
For each collaboration, each participating organization needs a node to compute tasks.
Each organization can have users who can perform certain actions.
The permissions of the user are defined by the assigned rules.
It is possible to collect multiple rules into a role, which can also be assigned to a user.
Users can create tasks for one or more organizations within a collaboration.
A task should produce a result for each organization involved in the task.
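The relationships above can be sketched as a handful of Python data classes. This is only an illustration of the statements in the list: the class and field names are chosen for this example and are not the actual database schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Collaboration:
    id: int
    name: str
    organization_ids: List[int] = field(default_factory=list)


@dataclass
class Node:
    # One node per (collaboration, organization) pair
    collaboration_id: int
    organization_id: int


@dataclass
class Task:
    id: int
    collaboration_id: int
    organization_ids: List[int]


@dataclass
class Result:
    task_id: int
    organization_id: int


def required_nodes(collaboration: Collaboration) -> List[Node]:
    """Each participating organization needs its own node in the collaboration."""
    return [Node(collaboration.id, org_id) for org_id in collaboration.organization_ids]


def expected_results(task: Task) -> List[Result]:
    """A task should produce one result per organization involved in it."""
    return [Result(task.id, org_id) for org_id in task.organization_ids]


collab = Collaboration(id=1, name="fictional_collab", organization_ids=[1, 2])
task = Task(id=10, collaboration_id=collab.id, organization_ids=[1, 2])
print(required_nodes(collab))
print(expected_results(task))
```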
The following schema is a simplified version of the database:
Encryption in vantage6 is handled at the organization level. Whether encryption is used or not is set at the collaboration level. All the nodes in the collaboration need to agree on this setting. You can enable or disable encryption in the node configuration file.
The encryption module encrypts data so that the server is unable to read communication between users and nodes. The only messages that go from one organization to another through the server are computation requests and their results. Only the algorithm input and output are encrypted. Other metadata (e.g. time started, finished, etc), can be read by the server.
The encryption module uses RSA keys. The public key is uploaded to the vantage6-server. Tasks and other users can use this public key (this is automatically handled by the python-client and R-client) to send messages to the other parties.
The RSA key is used to create a shared secret which is used for encryption and decryption of the payload
When the node starts, it checks that the public key stored at the server is derived from the local private key. If this is not the case, the node will replace the public key at the server.
If an organization has multiple nodes and/or users, they must use the same private key.
In case you want to generate a new private key, you can use the command vnode create-private-key. If a key already exists at the local system, the existing key is reused (unless you use the --force flag). This way, it is easy to configure multiple nodes to use the same key.
It is also possible to generate the key yourself and upload it by using the following endpoint:
PATCH https://SERVER[/api_path]/organization/<ID>
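A minimal sketch of such a request, built with Python's standard library, is shown below. Several details are assumptions rather than documented behaviour: that the body is JSON with a public_key field (mirroring the public_key argument of client.organization.create), and that the server expects a bearer token in the Authorization header. The function name build_public_key_update is hypothetical.

```python
import json
from urllib.request import Request


def build_public_key_update(server: str, organization_id: int, public_key: str,
                            token: str, api_path: str = "") -> Request:
    """Prepare (but do not send) a PATCH request for an organization's public key."""
    base = server.rstrip("/")
    path = api_path.strip("/")
    prefix = f"{base}/{path}" if path else base
    # Assumption: the endpoint accepts a JSON body with a `public_key` field
    body = json.dumps({"public_key": public_key}).encode("utf-8")
    return Request(
        f"{prefix}/organization/{organization_id}",
        data=body,
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            # Assumption: authentication uses a bearer token
            "Authorization": f"Bearer {token}",
        },
    )


request = build_public_key_update("https://SERVER", 1, "BASE64_PUBLIC_KEY", "ACCESS_TOKEN")
print(request.get_method(), request.full_url)
# Sending it would be done with urllib.request.urlopen(request)
```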
When a user creates a task, one or more nodes spawn an algorithm container. These algorithm containers can create new tasks themselves.
Every algorithm is supplied with a JWT token. This token can be used to communicate with the vantage6-server. If you use an algorithm wrapper, you can simply use the supplied Client in a central function.
A child container can be a parent container itself. There is no limit to the number of task layers that can be created. It is common to have only a single parent container which handles many child containers.
The token to which the containers have access is limited. The token can only be used to create a task in the same collaboration and using the same image.
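The restriction can be pictured as a simple check. The sketch below is not the server's implementation: the ContainerTokenClaims class and the may_create_subtask function are hypothetical names, and it assumes the token carries the collaboration id and image of the parent task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerTokenClaims:
    """Hypothetical claims carried by the token handed to an algorithm container."""
    collaboration_id: int
    image: str


def may_create_subtask(claims: ContainerTokenClaims, collaboration_id: int, image: str) -> bool:
    """A subtask is only allowed in the same collaboration and with the same image."""
    return claims.collaboration_id == collaboration_id and claims.image == image


claims = ContainerTokenClaims(collaboration_id=1, image="harbor2.vantage6.ai/algorithms/glm")
print(may_create_subtask(claims, 1, "harbor2.vantage6.ai/algorithms/glm"))
print(may_create_subtask(claims, 2, "harbor2.vantage6.ai/algorithms/glm"))
print(may_create_subtask(claims, 1, "harbor2.vantage6.ai/algorithms/other"))
```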
To install the vantage6-node, make sure you have met the requirements. Then install the latest version:
Algorithms are executed at the vantage6-node. The node receives a computation task from the vantage6-server. The node will then retrieve the algorithm, execute it and return the results to the server.
Algorithms are shared using Docker images which are stored in a Docker registry which is accessible to the nodes. In the following sections we explain the fundamentals of algorithm containers.
Interface between the node and algorithm container
Library to simplify and standardize the node-algorithm input and output
Creating subtasks from an algorithm container
Communicate with other algorithm containers and the vantage6-server
Cross language data serialization
The User Interface (UI) is a web application that aims to make it easy to interact with the server. At present, it provides all functionality except for creating tasks. We aim to incorporate this functionality in the near future.
Using the UI should be relatively straightforward. There are buttons that should help you e.g. create a collaboration or delete a user. If anything is unclear, please contact us via .
The server API is documented on the URL:
GET https://SERVER[/api_path]/apidocs
For Petronas, the API docs can thus be found at https://petronas.vantage6.ai/apidocs. This page will show you which API endpoints exist and how you can use them. All endpoints communicate via HTTP requests, so you can communicate with them using any platform or programming language that supports HTTP requests.




pip install vantage6
We provide four ways in which you can interact with the server to manage your vantage6 resources:
User Interface (UI)
Python client
R client
Server API
The UI and the clients make it much easier to interact with the server than directly interacting with the server API through HTTP requests, especially as data is serialized and encrypted automatically. For most use cases, we recommend using the UI and/or the Python client.
Note that whenever you interact with the server, you are limited by your permissions. For instance, if you try to create another user but do not have permission to do so, you will receive an error message. All permissions are described by rules, which can be aggregated in roles. Contact your administrator if you find your permissions are inappropriate.
There are predefined roles such as 'Researcher' and 'Organization Admin' that are automatically created by the server. These can be assigned to any new user by the administrator that is creating the user.
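To illustrate how rules and roles combine, the sketch below computes a user's effective permissions as the union of directly assigned rules and the rules of all assigned roles. The rule names (such as user:create) and the contents of each role are invented for this example; only the role names 'Researcher' and 'Organization Admin' come from the text above.

```python
from typing import Dict, Iterable, Set

# Hypothetical rule names and role contents, for illustration only
ROLES: Dict[str, Set[str]] = {
    "Researcher": {"task:view", "task:create"},
    "Organization Admin": {"user:view", "user:create", "user:delete"},
}


def effective_rules(direct_rules: Iterable[str], role_names: Iterable[str]) -> Set[str]:
    """Union of the rules assigned directly and the rules bundled in the assigned roles."""
    rules = set(direct_rules)
    for role_name in role_names:
        rules |= ROLES.get(role_name, set())
    return rules


def is_allowed(required_rule: str, direct_rules: Iterable[str], role_names: Iterable[str]) -> bool:
    """An action is permitted only if its rule is among the user's effective rules."""
    return required_rule in effective_rules(direct_rules, role_names)


# A researcher without extra rules cannot create users ...
print(is_allowed("user:create", [], ["Researcher"]))
# ... unless the rule is assigned directly or through another role.
print(is_allowed("user:create", ["user:create"], ["Researcher"]))
print(is_allowed("user:create", [], ["Researcher", "Organization Admin"]))
```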
In this section, you will learn how to use the client to create a new organization on the server.
Here, we assume that you have a Python session with an authenticated Client object, as created in Authentication. We also assume that you have a login on the Vantage6 server that has the permissions to create a new organization (regular end-users typically do not have these permissions; they are usually reserved for administrators).
The first (optional, but recommended) step is to create an RSA keypair. A keypair, consisting of a private and a public key, can be used to encrypt data transfers. Users from the organization you are about to create will only be able to use encryption if such a keypair has been set up and if they have access to the private key.
from pathlib import Path
from vantage6.common import bytes_to_base64s
from vantage6.client.encryption import RSACryptor
# Generate a new private key
# Note that the file below doesn't exist yet: you will create it
private_key_filepath = r'/path/to/private/key'
private_key = RSACryptor.create_new_rsa_key(Path(private_key_filepath))
# Generate the public key based on the private one
public_key_bytes = RSACryptor.create_public_key_bytes(private_key)
public_key = bytes_to_base64s(public_key_bytes)
Now, we can create an organization:
client.organization.create(
    name='The_Shire',
    address1='501 Buckland Road',
    address2='Matamata',
    zipcode='3472',
    country='New Zealand',
    domain='the_shire.org',
    public_key=public_key
)
Users can now be created for this organization. Any users that are created and who have access to the private key we generated above can now use encryption by running
client.setup_encryption('/path/to/private/key')
after they authenticate.
Logging is enabled by default. To configure the logger, look at the logging section of the configuration.
Useful commands:
vserver files: shows you where the log file is stored
vserver attach: shows live logs of a running server in your current console. This can also be achieved by starting the server with vserver start --attach
In this section, you will learn how to use the client to create a new collaboration on the server.
Here, we assume that you have a Python session with an authenticated Client object, as created in Authentication. We also assume that you have a login on the Vantage6 server that has the permissions to create a new collaboration (regular end-users typically do not have these permissions; they are usually reserved for administrators).
A collaboration is an association of multiple organizations that want to run analyses together. First, you will need to find the organization IDs of the organizations you want to be part of the collaboration.
Once you know the IDs of the organizations you want in the collaboration (e.g. 1 and 2), you can create the collaboration:
Note that a collaboration can require participating organizations to use encryption, by passing the encrypted = True argument when creating the collaboration. It is recommended to do so, but this requires that a keypair was created when creating the organization and that each user of that organization has access to the private key so that they can run the client.setup_encryption(...) command after authenticating.
In this section, you will learn how to use the client to register a new node with the server.
Here, we assume that you have a Python session with an authenticated Client object, as created in Authentication. We also assume that you have a login on the Vantage6 server that has the permissions to create a new node (regular end-users typically do not have these permissions; they are usually reserved for administrators).
A node is associated with both a collaboration and an organization. You will need to find the collaboration and organization IDs for the node you want to register:
Then, we register a node with the desired organization and collaboration. In this example, we create a node for the organization with id 1 and collaboration with id 1.
Remember to save the api_key that is returned here, since you will need it when you configure the node.
Probably important
As a data owner it is important that you take the necessary steps to protect your data. Vantage6 allows algorithms to run on your data and share the results with other parties. It is important that you review the algorithms before allowing them to run on your data.
Once you have approved the algorithm, it is important that you can verify that the approved algorithm is the algorithm that runs on your data. There are two important steps to be taken to accomplish this:
Set the (optional) allowed_images option in the node-configuration file. You can specify a regular expression here. For example:
Because algorithms are exchanged through Docker images they can be written in any language. This is an advantage as developers can use their preferred language for the problem they need to solve.
The wrappers are only available for R and Python, so when you use a different language you need to handle the IO yourself. Consult the section on what the node supplies to your algorithm container.
When data is exchanged between the user and the algorithm they both need to be able to read the data. When the algorithm uses a language specific serialization (e.g. a pickle in the case of Python or RData in the case of R) the user needs to use the same language to read the results. A better solution would be to use a type of serialization that is not specific to a language. For our wrappers we use JSON for this purpose.
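The difference can be seen in a few lines of Python. The result dictionary below is made up for the example; the point is only that the JSON form is plain text that any language can parse, while the pickle form is a Python-specific byte stream.

```python
import json
import pickle

# Hypothetical algorithm output, for illustration only
result = {"coefficients": [0.12, -1.5], "n": 120}

# Language-neutral: plain text that R, Python or any other language can read
as_json = json.dumps(result)

# Language-specific: a byte stream that only Python can sensibly load
as_pickle = pickle.dumps(result)

print(as_json)
print(type(as_pickle).__name__, len(as_pickle) > 0)

# Both round-trip within Python, but only the JSON form is portable
restored = json.loads(as_json)
print(restored == result)
```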
^harbor2.vantage6.ai/[a-zA-Z]+/[a-zA-Z]+: allows all images from the vantage6 registry
^harbor2.vantage6.ai/algorithms/glm: only allows this specific image, but all builds of this image
^harbor2.vantage6.ai/algorithms/glm@sha256:82becede498899ec668628e7cb0ad87b6e1c371cb8a1e597d83a47fac21d6af3: allows only this specific build of the GLM image to run on your data
Enable DOCKER_CONTENT_TRUST to verify the origin of the image. For more details see the documentation from Docker.
By enabling DOCKER_CONTENT_TRUST you might not be able to use certain algorithms. You can check this by verifying that the images you want to be used are signed.
In case you are using our Docker repository you need to use harbor2.vantage6.ai as harbor.vantage6.ai does not have a notary.
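To get a feeling for what the allowed_images patterns accept, the sketch below checks a few image names against two of the example expressions with Python's re.match. It assumes that a pattern is applied from the start of the image name; that matching rule and the helper is_image_allowed are assumptions for illustration, not the node's actual code.

```python
import re
from typing import Iterable

# Example patterns taken from the allowed_images examples
REGISTRY_PATTERN = r"^harbor2.vantage6.ai/[a-zA-Z]+/[a-zA-Z]+"
IMAGE_PATTERN = r"^harbor2.vantage6.ai/algorithms/glm"


def is_image_allowed(image: str, patterns: Iterable[str]) -> bool:
    """Return True if the image name matches at least one allowed pattern."""
    return any(re.match(pattern, image) is not None for pattern in patterns)


candidates = [
    "harbor2.vantage6.ai/algorithms/glm:latest",
    "harbor2.vantage6.ai/algorithms/coxph",
    "docker.io/someone/glm",
]
for image in candidates:
    print(image,
          is_image_allowed(image, [REGISTRY_PATTERN]),
          is_image_allowed(image, [IMAGE_PATTERN]))
```

Two details are worth keeping in mind when writing your own pattern: an unescaped dot matches any character, and a pattern without an end anchor also accepts longer names that share the same prefix (for example an image called glmnet).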
Communication between algorithm containers can use language specific serialization as long as the different parts of the algorithm use the same language.

client.setup_encryption('/path/to/private/key')
client.organization.list(fields=['id', 'name'])
collaboration_name = "fictional_collab"
organization_ids = [1, 2]  # the IDs of the respective organizations
client.collaboration.create(name=collaboration_name,
                            organizations=organization_ids,
                            encrypted=True)
client.organization.list(fields=['id', 'name'])
client.collaboration.list(fields=['id', 'name'])
# A node is associated with both a collaboration and an organization
organization_id = 1
collaboration_id = 1
api_key = client.node.create(collaboration=collaboration_id, organization=organization_id)
print(f"Registered a node for collaboration with id {collaboration_id}, organization with id {organization_id}. The API key that was generated for this node was {api_key}")
How to authenticate a client with the vantage6 server
This page and the following pages introduce some minimal examples for administrative tasks that you can perform with our Python client. We start by authenticating.
To authenticate, we create a config file to store our login information. We do this so we do not have to define the server_url, server_port and so on every time we want to use the client. Moreover, it enables us to separate the sensitive information (login details, organization key) that you do not want to make publicly available, from other parts of the code you might write later (e.g. on submitting particular tasks) that you might want to share publicly.
# config.py
server_url = "https://MY VANTAGE6 SERVER" # e.g. https://petronas.vantage6.ai or
# http://localhost for a local dev server
server_port = 443 # This is specified when you first created the server
server_api = "" # This is specified when you first created the server
username = "MY USERNAME"
password = "MY PASSWORD"
organization_key = "FILEPATH TO MY PRIVATE KEY" # This can be empty if you do not want to set up encryption
Note that the organization_key should be a filepath that points to the private key that was generated when the organization to which your login belongs was first created (see Creating an organization).
Then, we connect to the vantage6 server by initializing a Client object and authenticating:
from vantage6.client import Client
# Note: we assume here the config.py you just created is in the current directory.
# If it is not, then you need to make sure it can be found on your PYTHONPATH
import config
# Initialize the client object, and run the authentication
client = Client(config.server_url, config.server_port, config.server_api, verbose=True)
client.authenticate(config.username, config.password)
# Optional: setup the encryption, if you have an organization_key
client.setup_encryption(config.organization_key)
Above, we have added verbose=True as an additional argument when creating the Client(...) object. This prints much more information, which can be used to debug issues.
Found a security issue?
Please see our SECURITY policy.
Whether you found an issue and wrote a fix, or you just want to help improve the code, we're happy that you want to contribute to the code base. The process to submit a fix is as follows:
Create an issue on the GitHub page.
Create a branch starting with /bugfix in the case of a bug, or with /feature in the case of a new feature.
Implement your fix.
Push the branch and open a pull request.
Then it is out of your hands for now. A reviewer will look at the issue and code and will request changes if needed. Once the code is finalized it will be released as soon as possible (depending on the implications this could be released as a patch version).
A great way to learn about the source is writing unit-tests to test existing components. This helps us get our coverage up.
Maintaining good code quality is important, so doing some housekeeping, refactoring and cleaning really helps us write good code.
Writing documentation, both in the code and on this Documentation page.
Join our Discord channel, and get into the Development section.
Feel free to install our development version of vantage6 and try to break it.
Issues can be reported on our Github page.
One of the top priorities of vantage6 is being secure in what it does. Therefore having people reporting possible security issues to us is immensely helpful. If you have security concerns please report them directly to [email protected].
It is assumed you installed the vantage6-client. The R client can create tasks and retrieve their results. If you want to do more administrative tasks, either use the API directly or use the Python client.
Initialization of the R client can be done by:
setup.client <- function() {
  # Username/password should be provided by the administrator of
  # the server.
  username <- "[email protected]"
  password <- "password"
  host <- 'https://petronas.vantage6.ai:443'
  api_path <- ''
  # Create the client & authenticate
  client <- vtg::Client$new(host, api_path=api_path)
  client$authenticate(username, password)
  return(client)
}
# Create a client
client <- setup.client()
Then this client can be used for the different algorithms. Refer to the README in the repository on how to call the algorithm. Usually this includes installing some additional client-side packages for the specific algorithm you are using.
The R client is subject to change. We aim to make it more similar to the Python client.
First you need to install the client side of the algorithm by:
This is the code to run the coxph:
The algorithm container is deployed in an isolated network to prevent it from reaching unwanted destinations. There are two exceptions:
When the VPN feature is enabled on the server, all algorithm containers are able to reach each other using an IP address and port.
The central server is reachable through a local proxy service. In the algorithm you can use the HOST, PORT and API_PATH to find the address of the server.
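As a minimal sketch of how an algorithm could assemble the address of the central server from these three values, the snippet below assumes that they are exposed to the container as environment variables named HOST, PORT and API_PATH, and that HOST already includes the scheme. Both points are assumptions made for this example, as is the function name server_address.

```python
import os
from typing import Mapping


def server_address(environ: Mapping[str, str] = os.environ) -> str:
    """Combine HOST, PORT and API_PATH into a single base URL.

    Assumption: HOST includes the scheme (e.g. http://...) and API_PATH may be empty.
    """
    host = environ["HOST"].rstrip("/")
    port = environ["PORT"]
    api_path = environ.get("API_PATH", "")
    if api_path and not api_path.startswith("/"):
        api_path = "/" + api_path
    return f"{host}:{port}{api_path}"


# A plain dictionary stands in for the container environment here
example_env = {"HOST": "http://proxy", "PORT": "8080", "API_PATH": "/api"}
print(server_address(example_env))
```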
Algorithm containers can expose one or more ports. These ports can then be used by other algorithm containers to exchange data. The infrastructure uses the Dockerfile from which the algorithm has been built to determine which ports are used by the algorithm. This is done by using the EXPOSE and LABEL directives.
For example, suppose an algorithm uses two ports: one port for communication (com) and one port for data exchange (data). The following block should then be added to your algorithm Dockerfile:
Ports 8888 and 8889 are the internal ports on which the algorithm container listens. When another container wants to communicate with this container, it can retrieve the IP and external port from the central server by using the result_id and the label of the port you want to use (com or data in this case).
A Python client to interact with the vantage6 server
It is assumed you installed the vantage6-client. The Python client aims to completely cover the vantage6-server communication possibilities. It can create computation tasks and collect their results, manage organizations, collaborations, users, etc. The server hosts an API which the client uses for this purpose.
For tutorials on how to use the clients, please visit our discourse pages: .
We only show a few examples here. The methods in the library are all documented in their docstrings; you can view them using help(...), e.g. help(client.user.create) will show you the parameters needed to create a new user. More extensive tutorials on how to use the clients are available on our Discourse pages and in the subsequent pages, which follow after introducing our R client.
The following groups of methods are available; most of them have a list(), create(), delete() and get() method:
Here we will provide definitions of all the important concepts used in VANTAGE6 (and Federated Learning).
📝 Currently, we are working on a paper where most of these concepts are explained in a more cohesive, well-structured manner, together with how vantage6 works. As soon as it is ready, we will post it on our website.
A
Autonomy: the ability of a party to be in charge of the control and management of its own data.
C
devtools::install_github('iknl/vtg.coxph', subdir="src")
print( client$getCollaborations() )
# Should output something like this:
# id name
# 1 1 ZEPPELIN
# 2 2 PIPELINE
# Select a collaboration
client$setCollaborationId(1)
# Define explanatory variables, time column and censor column
expl_vars <- c("Age","Race2","Race3","Mar2","Mar3","Mar4","Mar5","Mar9",
"Hist8520","hist8522","hist8480","hist8501","hist8201",
"hist8211","grade","ts","nne","npn","er2","er4")
time_col <- "Time"
censor_col <- "Censor"
# vtg.coxph contains the function `dcoxph`.
result <- vtg.coxph::dcoxph(client, expl_vars, time_col, censor_col)
# port 8888 is used by the algorithm for communication purposes
EXPOSE 8888
LABEL p8888="com"
# port 8889 is used by the algorithm for data-exchange
EXPOSE 8889
LABEL p8889="data"
delete()
get()
client.user
client.organization
client.rule
client.role
client.collaboration
client.task
client.result
client.util
client.node
help(client.task.create)
#Create a new task
#
# Parameters
# ----------
# collaboration : int
# Id of the collaboration to which this task belongs
# organizations : list
# Organization ids (within the collaboration) which need
# to execute this task
# name : str
# Human readable name
# image : str
# Docker image name which contains the algorithm
# description : str
# Human readable description
# input : dict
# Algorithm input
# data_format : str, optional
# IO data format used, by default LEGACY
# database: str, optional
# Name of the database to use. This should match the key
# in the node configuration files. If not specified the
# default database will be tried.
#
# Returns
# -------
# dict
# Containing the task information
D
Distributed learning: see Federated Learning
Docker: a platform that uses operating system virtualization to deliver software in packages called containers. It is worth noting that although they are often confused, Docker containers are not virtual machines.
F
FAIR data: data that are Findable, Accessible, Interoperable, and Reusable. For more information, see the original paper.
Federated learning: a novel approach for analyzing data that are spread across different parties. Its main idea is that parties run computations on their local data, yielding either aggregated parameters or encrypted values. These are then shared to generate a global (statistical) model. In other words, instead of bringing the data to the algorithms, federated learning brings the algorithms to the data. This way, patient-sensitive information is not disclosed. Federated learning is sometimes known as distributed learning. However, we try to avoid this term, since it can be confused with distributed computing, where different computers share their processing power to solve very complex calculations.
H
Heterogeneity: the condition in a federated learning scenario in which parties are allowed to differ in hardware and software (e.g., operating systems).
Horizontally-partitioned data: data spread across different parties where the latter have the same features of different instances (i.e., patients). See also vertically-partitioned data.
M
Multiparty computation: an approach to perform analyses across different parties by performing operations on encrypted data.
P
Party: an entity that takes part in one (or more) collaborations
Python: a high-level general purpose programming language. It aims to help programmers write clear, logical code. vantage6 is written in Python.
S
Secure multiparty computation: see Multiparty computation
V
vantage6: priVAcy preserviNg federaTed leArninG infrastructurE for Secure Insight eXchange. In short, vantage6 is an infrastructure for executing federated learning analyses. However, it can also be used as a FAIR data station and as a model repository.
Vertically-partitioned data: data spread across different parties where the latter have different features of the same instances (i.e., patients). See also horizontally-partitioned data.
The algorithm wrapper simplifies and standardizes the interaction between algorithm and node. The client libraries and the algorithm wrapper are tied together and use the same standards. The algorithm wrapper:
reads the environment variables and file mounts and supplies these to your algorithm.
provides an entrypoint for the docker container
allows you to write a single algorithm for multiple types of data sources
The wrapper is language specific and currently we support Python and R. Extending this concept to other languages is not so complex.
The signature of your function has to contain data as the first argument. The method name should have a RPC_ prefix. Everything that is returned by the function will be written to the output file.
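To make these conventions concrete, here is a minimal sketch of a federated function in Python. The function name `RPC_average_partial`, the `column_name` parameter and the example data are illustrative assumptions; the sketch only relies on `data[column_name]` yielding the values of a column, which holds for a pandas DataFrame as well as for a plain dict of lists.

```python
def RPC_average_partial(data, column_name):
    """Federated (node-local) part: summarise one column of the local data.

    `data` is supplied by the wrapper as the first argument. Everything
    returned here is what ends up in the output file.
    """
    values = list(data[column_name])
    return {"sum": sum(values), "count": len(values)}


# Local try-out with a stand-in for the data at a node (made-up values)
print(RPC_average_partial({"age": [40, 50, 60]}, "age"))
```

Returning only a sum and a count, rather than the values themselves, keeps record-level data at the node.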
It is quite common to have a central part of your federated analysis which orchestrates the algorithm and combines the partial results. A common pattern for a central function would be:
Request partial models from all participants
Obtain the partial models
Combine the partial models to a global model
(optional) Repeat step 1-3 until the model converges
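The combining step of this pattern can be sketched as follows for a simple average. The function name `combine_partials` is hypothetical, and the partial results are assumed to be dictionaries with `sum` and `count` keys, in the shape used by the averaging example elsewhere in this documentation.

```python
def combine_partials(partials):
    """Combine per-node {'sum', 'count'} results into a global average."""
    total_sum = sum(p["sum"] for p in partials)
    total_count = sum(p["count"] for p in partials)
    if total_count == 0:
        raise ValueError("no observations in any partial result")
    return {"average": total_sum / total_count}


# Partial results as two nodes might return them
partials = [{"sum": 253, "count": 4}, {"sum": 173, "count": 4}]
print(combine_partials(partials))  # {'average': 53.25}
```

Weighting by the counts, instead of averaging the per-node averages, gives the correct global mean when nodes hold different numbers of records.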
It is possible to run the central part of the analysis on your own machine, but it is also possible to let vantage6 handle the central part. There are several advantages to letting vantage6 handle this:
You don't have to keep your machine running during the analysis
You don't need to use the same programming language as the algorithm in case a language specific serialization is used in the algorithm
Note that central functions also run at a node and not at the server.
In contrast to the federated functions, central functions are not prefixed. The first argument needs to be client and the second argument needs to be data. The data argument contains the local data and the client argument provides an interface to the vantage6-server.
The argument data is not present in the R-wrapper. This is a consistency issue which will be solved in a future release.
The Docker wrappers read the local data source and supply it to the functions in your algorithm. Currently, CSV and SPARQL wrappers are supported for Python, and a CSV wrapper is supported for R. Since the wrapper handles the reading of the data, you need to rebuild your algorithm with a different wrapper to make it compatible with a different type of data source. You do this by updating the CMD directive in the Dockerfile.
TODO
Once the algorithm is completed it needs to be packaged and made available for retrieval by the nodes. The algorithm is packaged in a Docker image. A Docker image is created from a Dockerfile, which acts as blue-print. Once the Docker image is created it needs to be uploaded to a registry so that nodes can retrieve it.
A minimal Dockerfile should specify a base image, copy your algorithm into the image, and define the command that executes your algorithm. For example:
# python3 image as base
FROM python:3
# copy your algorithm in the container
COPY . /app
# maybe your algorithm is installable.
RUN pip install /app
# execute your application
CMD python /app/app.py
When using the Python wrapper, the Dockerfile needs to follow a certain format. You should only change the PKG_NAME value to the Python package name of your algorithm.
When using the R wrapper, the Dockerfile needs to follow a certain format. You should only change the PKG_NAME value to the R package name of your algorithm.
Additional Docker directives are needed when using direct communication between different algorithm containers, see for more information on this.
If you are in the folder containing the Dockerfile, you can build the project as follows:
The -t flag indicates the name of your image. This name is also used as a reference to where the image is located on the internet. If you use Docker Hub to store your images, you only specify your username as repo, followed by your image name and tag: USERNAME/IMAGE_NAME:IMAGE_TAG. When using a private registry, repo should also contain the URL of the registry, e.g. harbor2.vantage6.ai/PROJECT/IMAGE_NAME:TAG.
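The naming scheme can be summarised in a few lines of Python. This is only a sketch of the convention described above; the helper `image_reference` and its parameter names are assumptions made for illustration and do not exist in vantage6 or Docker.

```python
def image_reference(image_name, tag, namespace, registry=None):
    """Compose the name passed to `docker build -t` and `docker push`.

    `namespace` is the Docker Hub username or the project in a private
    registry; `registry` is the registry address, omitted for Docker Hub.
    """
    parts = [namespace, image_name]
    if registry:
        parts.insert(0, registry.rstrip("/"))
    return "/".join(parts) + ":" + tag


# Docker Hub style: USERNAME/IMAGE_NAME:IMAGE_TAG (made-up names)
print(image_reference("my-algorithm", "1.0", "alice"))
# Private registry style: REGISTRY/PROJECT/IMAGE_NAME:TAG
print(image_reference("average", "latest", "demo", registry="harbor2.vantage6.ai"))
```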
Then you can push your image:
Now that it has been uploaded, it is available for nodes to retrieve when they need it.
It is possible to use the Docker framework to create signed images. When using signed images, the node can verify the author of the algorithm image, which adds an additional layer of protection.
Dockerfile
Build project
CMD
Expose
Harbor or Docker hub or whatever
public vs private
signed
In this section we'll explain how to deploy a vantage6 server.
vantage6 uses Flask as backbone, together with flask-socketio for websocket support. The server runs as a standalone process (listening on its own ip address/port).
There are many deployment options, so these examples are not complete and exhaustive.
...
Below a basic setup. Note that SSL is not configured in this example.
TODO
The server manages users, organizations, collaborations, tasks and results. In this section we will explain how to configure and manage a server.
It is assumed that you successfully installed vantage6-server. To verify this, you can run the command vserver --help . If that prints a list of commands, your installation is successful. Also, make sure that Docker is running.
To create a new server, run the command below. A menu will be started that allows you to set up a server configuration file. For more details, check out the Configure page.
To run a server, execute the command below. The --attach flag will cause log output to be printed to the console.
When the server is run for the first time, a user is created:
username: root
password: root
Finally, a server can be stopped again with:
The following commands are available in your environment. To see all the options that are available per command use the --help flag, e.g. vserver start --help.
The following sections explain how to use these commands to configure and maintain a vantage6-server instance:
The node runs algorithms requested by clients
It is assumed you have successfully installed vantage6-node. To verify this you can run the command vnode --help. If that prints a list of commands, the installation is completed. Also, make sure that Docker is running.
An organization runs a node for each of the collaborations it participates in
To create a new node, run the command below. A menu will be started that allows you to set up a node configuration file. For more details, check out the page.
To run a node, execute the command below. The --attach flag will cause log output to be printed to the console.
Finally, a node can be stopped again with:
Below is a list of all commands you can run for your node(s). To see all available options per command use the --help flag, e.g. vnode start --help.
See the following sections on how to configure and maintain a vantage6-node instance:
vserver new


# python vantage6 algorithm base image
FROM harbor.vantage6.ai/algorithms/algorithm-base
# this should reflect the python package name
ARG PKG_NAME="v6-summary-py"
# install federated algorithm
COPY . /app
RUN pip install /app
ENV PKG_NAME=${PKG_NAME}
# Tell docker to execute `docker_wrapper()` when the image is run.
CMD python -c "from vantage6.tools.docker_wrapper import docker_wrapper; docker_wrapper('${PKG_NAME}')"
Command           Description
vserver new       Create a new server configuration file
vserver start     Start a server
vserver stop      Stop a server
vserver files     List the files that a server is using
vserver attach    Show a server's logs in the current terminal
vserver list      List the available server instances
vserver shell     Run a server instance python shell
vserver import    Import server entities as a batch
vserver version   Shows the versions of all the components of the running server
Command                    Description
vnode new                  Create a new node configuration file
vnode start                Start a node
vnode stop                 Stop one or all nodes
vnode files                List the files of a node
vnode attach               Print the node logs to the console
vnode list                 List all available nodes
vnode create-private-key   Create and upload a new public key for your organization
you already have the algorithm you want to run available as a container in a docker registry (see here for more details on developing your own algorithm)
the nodes are configured to look at the right database
In this manual, we'll use the averaging algorithm from harbor2.vantage6.ai/demo/average, so the second requirement is met. This container assumes a comma-separated (*.csv) file as input, and will compute the average over one of the named columns. We'll assume the nodes in your collaboration have been configured to look at a comma-separated database, i.e. their config contains something like
so that the third requirement is also met. As an end-user running the algorithm, you'll need to align with the node owner about which database name is used for the database you are interested in. For more info on configuring the nodes, see configuring the node.
First, you'll want to determine which collaboration to submit this task to, and which organizations from this collaboration you want to be involved in the analysis
In this example, we see that the collaboration called 'example_collab1' has three organizations associated with it, of which the organization ids are 2, 3 and 4. To figure out the names of these organizations, we run:
i.e. this collaboration consists of the organizations example_org1 (with id 2), example_org2 (with id 3) and example_org3 (with id 4).
Now, we have two options: create a task that will run the master algorithm, or create a task that will (only) run the RPC methods. Typically, the RPC methods only run the node-local analysis (e.g. compute the averages per node), whereas the master algorithm performs aggregation of those results as well (e.g. starts the node-local analyses and then also computes the overall average). First, let us create a task that runs the master algorithm of the harbor2.vantage6.ai/demo/average container:
Note that the kwargs we specified in the input_ are specific to this algorithm: this algorithm expects an argument column_name to be defined, and will compute the average over the column with that name. Furthermore, note that here we created a task for collaboration with id 1 (i.e. our example_collab1) and the organizations with id 2 and 3 (i.e. example_org1 and example_org2). I.e. the algorithm need not necessarily be run on all the organizations involved in the collaboration. Finally, note that client.task.create() has an optional argument called database. Suppose that we would have wanted to run this analysis on the database called my_other_database instead of the default database, we could have specified an additional database = 'my_other_database' argument. Check help(client.task.create) for more information.
You might be interested to know the output of the RPC algorithm (in this example: the averages for the 'age' column for each node). In that case, you can run only the RPC algorithm, omitting the aggregation that the master algorithm will normally do:
Of course, it will take a little while to run your algorithm. You can use the following code snippet to run a loop that checks the server every 3 seconds to see if the task has been completed:
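A variant of this polling loop that also gives up after a timeout can be sketched as follows. The helper `wait_for_task` and its parameters are hypothetical; it only assumes a callable that returns the task as a dictionary with a `complete` key, which is how `client.task.get` is used in this manual.

```python
import time


def wait_for_task(get_task, task_id, interval=3.0, timeout=600.0, sleep=time.sleep):
    """Poll until the task reports completion or the timeout has passed.

    `get_task` is any callable that returns the task as a dict with a
    "complete" key, for example `client.task.get`.
    """
    waited = 0.0
    task_info = get_task(task_id)
    while not task_info.get("complete"):
        if waited >= timeout:
            raise TimeoutError(f"task {task_id} not complete after {timeout} s")
        sleep(interval)
        waited += interval
        task_info = get_task(task_id)
    return task_info


# Stand-in for the server: reports completion on the third request
_answers = iter([{"complete": False}, {"complete": False}, {"complete": True, "id": 7}])
info = wait_for_task(lambda _id: next(_answers), 7, sleep=lambda seconds: None)
print(info["id"])
```

With a real client this would be called as `wait_for_task(client.task.get, average_task['id'])`.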
When the results are in, you can get the result_id from the task object:
and then retrieve the results
The number of results may be different depending on what you run, but for the master algorithm in this example, we can retrieve it using:
while for the RPC algorithm, dispatched to two nodes, we can retrieve it using
# The Dockerfile tells Docker how to construct the image with your algorithm.
# Once pushed to a repository, images can be downloaded and executed by the
# network hubs.
FROM harbor2.vantage6.ai/base/custom-r-base
# this should reflect the R package name
ARG PKG_NAME='vtg.package'
LABEL maintainer="Main Tainer <[email protected]>"
# Install federated glm package
COPY . /usr/local/R/${PKG_NAME}/
WORKDIR /usr/local/R/${PKG_NAME}
RUN Rscript -e 'library(devtools)' -e 'install_deps(".")'
RUN R CMD INSTALL --no-multiarch --with-keep.source .
# Tell docker to execute `docker.wrapper()` when the image is run.
ENV PKG_NAME=${PKG_NAME}
CMD Rscript -e "vtg::docker.wrapper('$PKG_NAME')"
docker build -t repo/image:tag .
docker push repo/image:tag
server {
# Public port
listen 80;
server_name _;
# vantage6-server. In the case you use a sub-path here, make sure
# to also forward it to the proxy_pass
location /subpath {
include proxy_params;
# internal ip and port
proxy_pass http://127.0.0.1:5000/subpath;
}
# Allow the websocket traffic
location /socket.io {
include proxy_params;
proxy_http_version 1.1;
proxy_buffering off;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "Upgrade";
proxy_pass http://127.0.0.1:5000/socket.io;
}
}
vserver start --name <your_server> --attach
vserver stop --name <your_server>
vnode new
vnode start --name <your_node> --attach
vnode stop --name <your_node>
databases:
  default: /path/to/my/example.csv
  my_other_database: /path/to/my/example2.csv
>>> client.collaboration.list(fields=['id', 'name', 'organizations'])
[{'id': 1, 'name': 'example_collab1', 'organizations': [{'id': 2, 'link': '/api/organization/2', 'methods': ['GET', 'PATCH']}, {'id': 3, 'link': '/api/organization/3', 'methods': ['GET', 'PATCH']}, {'id': 4, 'link': '/api/organization/4', 'methods': ['GET', 'PATCH']}]}]
>>> client.organization.list(fields=['id', 'name'])
[{'id': 1, 'name': 'root'}, {'id': 2, 'name': 'example_org1'}, {'id': 3, 'name': 'example_org2'}, {'id': 4, 'name': 'example_org3'}]
input_ = {'method': 'master',
          'kwargs': {'column_name': 'age'},
          'master': True}
average_task = client.task.create(collaboration=1,
                                  organizations=[2,3],
                                  name="an-awesome-task",
                                  image="harbor2.vantage6.ai/demo/average",
                                  description='',
                                  input=input_,
                                  data_format='json')
input_ = {'method': 'average_partial',
          'kwargs': {'column_name': 'age'},
          'master': False}
average_task = client.task.create(collaboration=1,
                                  organizations=[2,3],
                                  name="an-awesome-task",
                                  image="harbor2.vantage6.ai/demo/average",
                                  description='',
                                  input=input_,
                                  data_format='json')
import time

print("Waiting for results")
task_id = average_task['id']
task_info = client.task.get(task_id)
while not task_info.get("complete"):
    task_info = client.task.get(task_id, include_results=True)
    print("Waiting for results")
    time.sleep(3)
print("Results are ready!")
result_id = task_info['id']
result_info = client.result.list(task=result_id)
>>> result_info['data'][0]['result']
{'average': 53.25}
>>> result_info['data'][0]['result']
{'sum': 253, 'count': 4}
>>> result_info['data'][1]['result']
{'sum': 173, 'count': 4}
Logging is enabled by default. To configure the logger, look at the logging section of the configuration file.
Useful commands:
vnode files: shows you where the log file is stored
vnode attach: shows live logs of a running node in your current console. This can also be achieved by starting the node with vnode start --attach
How to batch-import organizations, users, collaborations, etc.
All users that are imported using vserver import receive the superuser role. We are looking into ways to also be able to import roles. For more background info refer to this .
To batch import users, organizations and collaborations you can use the vserver import /path/to/file.yaml command. The yaml file is expected to have the following format:
def RPC_my_algorithm(data, *args, **kwargs):
    pass
RPC_my_algorithm <- function(data, ...) {
}
def main(client, data, *args, **kwargs):
    # Run a federated function. Note that we omit the
    # RPC_ prefix. This prefix is added automatically
    # by the infrastructure
    task = client.create_new_task(
        {
            "method": "my_algorithm",
            "args": [],
            "kwargs": {}
        },
        organization_ids=[...]
    )
    # wait for the federated part to complete
    # and return
    results = wait_and_collect(task)
    return results
main <- function(client, ...) {
    # Run a federated function. Note that we omit the
    # RPC_ prefix. This prefix is added automatically
    # by the infrastructure
    result <- client$call("my_algorithm", ...)
    # Optionally do something with the results
    # return the results
    return(result)
}
...
CMD python -c "from vantage6.tools.docker_wrapper import sparql_wrapper; sparql_wrapper('${PKG_NAME}')"
...
CMD python -c "from vantage6.tools.docker_wrapper import docker_wrapper; docker_wrapper('${PKG_NAME}')"
...
CMD Rscript -e "vtg::docker.wrapper('$PKG_NAME')"
organizations:
- name: IKNL
domain: iknl.nl
address1: Godebaldkwartier 419
address2:
zipcode: 3511DT
country: Netherlands
users:
- username: admin
firstname: admin
lastname: robot
password: password
- username: [email protected]
firstname: Frank
lastname: Martin
password: password
- username: [email protected]
firstname: Melle
lastname: Sieswerda
password: password
public_key: LS0tLS1CRUdJTiBQVUJMSUMgS0VZLS0tLS0KTUlJQ0lqQU5CZ2txaGtpRzl3MEJBUUVGQUFPQ0FnOEFNSUlDQ2dLQ0FnRUF2eU4wWVZhWWVZcHVWRVlpaDJjeQphTjdxQndCUnB5bVVibnRQNmw2Vk9OOGE1eGwxMmJPTlQyQ1hwSEVGUFhZQTFFZThQRFZwYnNQcVVKbUlseWpRCkgyN0NhZTlIL2lJbUNVNnViUXlnTzFsbG1KRTJQWDlTNXVxendVV3BXMmRxRGZFSHJLZTErUUlDRGtGSldmSEIKWkJkczRXMTBsMWlxK252dkZ4OWY3dk8xRWlLcVcvTGhQUS83Mm52YlZLMG9nRFNaUy9Jc1NnUlk5ZnJVU1FZUApFbGVZWUgwYmI5VUdlNUlYSHRMQjBkdVBjZUV4dXkzRFF5bXh2WTg3bTlkelJsN1NqaFBqWEszdUplSDAwSndjCk80TzJ0WDVod0lLL1hEQ3h4eCt4b3cxSDdqUWdXQ0FybHpodmdzUkdYUC9wQzEvL1hXaVZSbTJWZ3ZqaXNNaisKS2VTNWNaWWpkUkMvWkRNRW1QU29rS2Y4UnBZUk1lZk0xMWtETTVmaWZIQTlPcmY2UXEyTS9SMy90Mk92VDRlRgorUzVJeTd1QWk1N0ROUkFhejVWRHNZbFFxTU5QcUpKYlRtcGlYRWFpUHVLQitZVEdDSC90TXlrRG1JK1dpejNRCjh6SVo1bk1IUnhySFNqSWdWSFdwYnZlTnVaL1Q1aE95aE1uZHU0c3NpRkJyUXN5ZGc1RlVxR3lkdE1JMFJEVHcKSDVBc1ovaFlLeHdiUm1xTXhNcjFMaDFBaDB5SUlsZDZKREY5MkF1UlNTeDl0djNaVWRndEp5VVlYN29VZS9GKwpoUHVwVU4rdWVTUndGQjBiVTYwRXZQWTdVU2RIR1diVVIrRDRzTVQ4Wjk0UVl2S2ZCanU3ZXVKWSs0Mmd2Wm9jCitEWU9ZS05qNXFER2V5azErOE9aTXZNQ0F3RUFBUT09Ci0tLS0tRU5EIFBVQkxJQyBLRVktLS0tLQo=
- name: Small Organization
domain: small-organization.example
address1: Big Ambitions Drive 4
address2:
zipcode: 1234AB
country: Nowhereland
users:
- username: [email protected]
firstname: admin
lastname: robot
password: password
- username: stan
firstname: Stan
lastname: the man
password: password
public_key: LS0tLS1CRUdJTiBQVUJMSUMgS0VZLS0tLS0KTUlJQ0lqQU5CZ2txaGtpRzl3MEJBUUVGQUFPQ0FnOEFNSUlDQ2dLQ0FnRUF2eU4wWVZhWWVZcHVWRVlpaDJjeQphTjdxQndCUnB5bVVibnRQNmw2Vk9OOGE1eGwxMmJPTlQyQ1hwSEVGUFhZQTFFZThQRFZwYnNQcVVKbUlseWpRCkgyN0NhZTlIL2lJbUNVNnViUXlnTzFsbG1KRTJQWDlTNXVxendVV3BXMmRxRGZFSHJLZTErUUlDRGtGSldmSEIKWkJkczRXMTBsMWlxK252dkZ4OWY3dk8xRWlLcVcvTGhQUS83Mm52YlZLMG9nRFNaUy9Jc1NnUlk5ZnJVU1FZUApFbGVZWUgwYmI5VUdlNUlYSHRMQjBkdVBjZUV4dXkzRFF5bXh2WTg3bTlkelJsN1NqaFBqWEszdUplSDAwSndjCk80TzJ0WDVod0lLL1hEQ3h4eCt4b3cxSDdqUWdXQ0FybHpodmdzUkdYUC9wQzEvL1hXaVZSbTJWZ3ZqaXNNaisKS2VTNWNaWWpkUkMvWkRNRW1QU29rS2Y4UnBZUk1lZk0xMWtETTVmaWZIQTlPcmY2UXEyTS9SMy90Mk92VDRlRgorUzVJeTd1QWk1N0ROUkFhejVWRHNZbFFxTU5QcUpKYlRtcGlYRWFpUHVLQitZVEdDSC90TXlrRG1JK1dpejNRCjh6SVo1bk1IUnhySFNqSWdWSFdwYnZlTnVaL1Q1aE95aE1uZHU0c3NpRkJyUXN5ZGc1RlVxR3lkdE1JMFJEVHcKSDVBc1ovaFlLeHdiUm1xTXhNcjFMaDFBaDB5SUlsZDZKREY5MkF1UlNTeDl0djNaVWRndEp5VVlYN29VZS9GKwpoUHVwVU4rdWVTUndGQjBiVTYwRXZQWTdVU2RIR1diVVIrRDRzTVQ4Wjk0UVl2S2ZCanU3ZXVKWSs0Mmd2Wm9jCitEWU9ZS05qNXFER2V5azErOE9aTXZNQ0F3RUFBUT09Ci0tLS0tRU5EIFBVQkxJQyBLRVktLS0tLQo=
- name: Big Organization
domain: big-organization.example
address1: Offshore Accounting Drive 19
address2:
zipcode: 54331
country: Nowhereland
users:
- username: [email protected]
firstname: admin
lastname: robot
password: password
public_key: LS0tLS1CRUdJTiBQVUJMSUMgS0VZLS0tLS0KTUlJQ0lqQU5CZ2txaGtpRzl3MEJBUUVGQUFPQ0FnOEFNSUlDQ2dLQ0FnRUF2eU4wWVZhWWVZcHVWRVlpaDJjeQphTjdxQndCUnB5bVVibnRQNmw2Vk9OOGE1eGwxMmJPTlQyQ1hwSEVGUFhZQTFFZThQRFZwYnNQcVVKbUlseWpRCkgyN0NhZTlIL2lJbUNVNnViUXlnTzFsbG1KRTJQWDlTNXVxendVV3BXMmRxRGZFSHJLZTErUUlDRGtGSldmSEIKWkJkczRXMTBsMWlxK252dkZ4OWY3dk8xRWlLcVcvTGhQUS83Mm52YlZLMG9nRFNaUy9Jc1NnUlk5ZnJVU1FZUApFbGVZWUgwYmI5VUdlNUlYSHRMQjBkdVBjZUV4dXkzRFF5bXh2WTg3bTlkelJsN1NqaFBqWEszdUplSDAwSndjCk80TzJ0WDVod0lLL1hEQ3h4eCt4b3cxSDdqUWdXQ0FybHpodmdzUkdYUC9wQzEvL1hXaVZSbTJWZ3ZqaXNNaisKS2VTNWNaWWpkUkMvWkRNRW1QU29rS2Y4UnBZUk1lZk0xMWtETTVmaWZIQTlPcmY2UXEyTS9SMy90Mk92VDRlRgorUzVJeTd1QWk1N0ROUkFhejVWRHNZbFFxTU5QcUpKYlRtcGlYRWFpUHVLQitZVEdDSC90TXlrRG1JK1dpejNRCjh6SVo1bk1IUnhySFNqSWdWSFdwYnZlTnVaL1Q1aE95aE1uZHU0c3NpRkJyUXN5ZGc1RlVxR3lkdE1JMFJEVHcKSDVBc1ovaFlLeHdiUm1xTXhNcjFMaDFBaDB5SUlsZDZKREY5MkF1UlNTeDl0djNaVWRndEp5VVlYN29VZS9GKwpoUHVwVU4rdWVTUndGQjBiVTYwRXZQWTdVU2RIR1diVVIrRDRzTVQ4Wjk0UVl2S2ZCanU3ZXVKWSs0Mmd2Wm9jCitEWU9ZS05qNXFER2V5azErOE9aTXZNQ0F3RUFBUT09Ci0tLS0tRU5EIFBVQkxJQyBLRVktLS0tLQo=
collaborations:
- name: ZEPPELIN
participants:
- name: IKNL
api_key: 123e4567-e89b-12d3-a456-426614174001
- name: Small Organization
api_key: 123e4567-e89b-12d3-a456-426614174002
- name: Big Organization
api_key: 123e4567-e89b-12d3-a456-426614174003
tasks: ["hello-world"]
encrypted: false
- name: PIPELINE
participants:
- name: IKNL
api_key: 123e4567-e89b-12d3-a456-426614174004
- name: Big Organization
api_key: 123e4567-e89b-12d3-a456-426614174005
tasks: ["hello-world"]
encrypted: false
- name: SLIPPERS
participants:
- name: Small Organization
api_key: 123e4567-e89b-12d3-a456-426614174006
- name: Big Organization
api_key: 123e4567-e89b-12d3-a456-426614174007
tasks: ["hello-world", "hello-world"]
encrypted: false
In this section you will learn how to (re)configure a server
The vantage6-server requires a configuration file to run. This is a yaml file with specific contents. You can create and edit this file manually. To create an initial configuration file you can also use the configuration wizard: vserver new.
The directory in which the configuration file is stored depends on your operating system (OS). It is possible to store the configuration file at system or at user level. By default, server configuration files are stored at system level. The default directories per OS are as follows:
OS
System
User
The command vserver looks in certain directories by default. It is possible to use any directory and specify the location with the --config flag. However, note that using a different directory requires you to specify the --config flag every time!
Each server instance (configuration) can have multiple environments. You can specify these under the key environments, which allows four types: dev, test, acc and prod. If you do not want to specify any environment, you should only specify the key application (not within environments).
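The relation between the application key and the environments key can be illustrated with a small lookup over an already parsed configuration. This is an illustration only, not the loader that vantage6 itself uses; the function name `select_settings` and the example settings are assumptions.

```python
ENVIRONMENT_TYPES = ("dev", "test", "acc", "prod")


def select_settings(config, environment=None):
    """Return the settings block for one environment.

    Without an environment, the top-level `application` key is used.
    """
    if environment is None:
        return config["application"]
    if environment not in ENVIRONMENT_TYPES:
        raise ValueError(f"unknown environment: {environment!r}")
    return config["environments"][environment]


# A parsed configuration with made-up values
config = {
    "application": {"port": 5000},
    "environments": {"test": {"port": 5001}},
}
print(select_settings(config, "test")["port"])
```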
We use . In short:
dev Development environment. It is ok to break things here
The most straightforward way of creating a new server configuration is using the command vserver new which allows you to configure most settings. See the what each setting represents.
By default, the configuration is stored at system level, which makes this configuration available for all users. In case you want to use a user directory you can add the --user flag when invoking the vserver new command.
To update a configuration you need to modify the created yaml file. To see where this file is located you can use the command vserver files . Do not forget to specify the flag --system in case of a system-wide configuration or the flag --user in case of a user-level configuration.
If the nodes and the server run at the same machine, you have to make sure that the node can reach the server.
Windows and (intel) Mac
Setting the server IP to 0.0.0.0 makes the server reachable at your localhost (this is also the case when the dockerized version is used). In order for the node to reach this server, set the server_url setting to host.docker.internal.
⚠️ On the M1 mac the local server might not be reachable from your nodes as host.docker.internal does not seem to refer to the host machine. Reach out to us on Discourse for a solution if you need this!
Linux
You should bind the server to 0.0.0.0. In the node configuration files, you can then use 172.17.0.1, assuming you use the default docker network settings.
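The per-OS advice above can be condensed into a small helper. The function `node_server_host` is hypothetical, assumes the default Docker network settings on Linux, and does not handle the M1 caveat mentioned above.

```python
import platform


def node_server_host(system=None):
    """Suggest the host a node can use to reach a server on the same machine."""
    system = platform.system() if system is None else system
    if system in ("Windows", "Darwin"):
        # Docker Desktop resolves this name to the host machine
        return "host.docker.internal"
    if system == "Linux":
        # Address of the host on the default Docker bridge network
        return "172.17.0.1"
    raise ValueError(f"no suggestion for operating system {system!r}")


print(node_server_host("Linux"))
```

Calling it without arguments uses the operating system reported by `platform.system()`.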

test Testing environment. Here, you can verify that everything works as expected. This environment should resemble the target environment where the final solution will be deployed as much as possible.
acc Acceptance environment. If the tests were successful, you can try this environment, where the final user will test his/her analysis to verify if everything meets his/her expectations.
prod Production environment. The version of the proposed solution where the final analyses are executed.
Windows
System: C:\ProgramData\vantage6\server
User: C:\Users\<user>\AppData\Local\vantage6\server\
MacOS
System: /Library/Application Support/vantage6/server/
User: /Users/<user>/Library/Application Support/vantage6/server/
Ubuntu
System: /etc/xdg/vantage6/server/
User: ~/.config/vantage6/server/
vserver new is invoked.application:
...
environments:
test:
# Human readable description of the server instance. This is to help
# your peers to identify the server
description: Test
# Should be prod, acc, test or dev. In case the type is set to test
# the JWT-tokens expiration is set to 1 day (default is 6 hours). The
# other types can be used in future releases of vantage6
type: test
# IP adress to which the server binds. In case you specify 0.0.0.0
# the server listens on all interfaces
ip: 0.0.0.0
# Port to which the server binds
port: 5000
# API path prefix. (i.e. https://yourdomain.org/api_path/<endpoint>). In the
# case you use a referse proxy and use a subpath, make sure to include it
# here also.
api_path: /api
# The URI to the server database. This should be a valid SQLAlchemy URI,
# e.g. for an Sqlite database: sqlite:///database-name.sqlite,
# or Postgres: postgresql://username:[email protected]/database).
uri: sqlite:///test.sqlite
# This should be set to false in production as this allows to completely
# wipe the database in a single command. Useful to set to true when
# testing/developing.
allow_drop_all: True
# The secret key used to generate JWT authorization tokens. This should
# be kept secret as others are able to generate access tokens if they
# know this secret. This parameter is optional. In case it is not
# provided in the configuration it is generated each time the server
# starts. Thereby invalidating all previous distributed keys.
# OPTIONAL
jwt_secret_key: super-secret-key! # recommended but optional
# Settings for the logger
logging:
# Controls the logging output level. Could be one of the following
# levels: CRITICAL, ERROR, WARNING, INFO, DEBUG, NOTSET
level: DEBUG
# Filename of the log-file, used by RotatingFileHandler
file: test.log
# Whether the output is shown in the console or not
use_console: True
# The number of log files that are kept, used by RotatingFileHandler
backup_count: 5
# Size in kB of a single log file, used by RotatingFileHandler
max_size: 1024
# format: input for logging.Formatter,
format: "%(asctime)s - %(name)-14s - %(levelname)-8s - %(message)s"
datefmt: "%Y-%m-%d %H:%M:%S"
# Configure a smtp mail server for the server to use for administrative
# purposes. e.g. allowing users to reset their password.
# OPTIONAL
smtp:
port: 587
server: smtp.yourmailserver.com
username: your-username
password: super-secret-password
# set how long reset token provided via email are valid (default 1 hour)
email_token_validity_minutes: 60
# Set an email address you want to direct your users to for support
support_email: [email protected]
# If algorithm containers need direct communication between each other
# the server also requires a VPN server. (!) This must be a EduVPN
# instance as vantage6 makes use of their API (!)
# OPTIONAL
vpn_server:
# the URL of your VPN server
url: https://your-vpn-server.ext
# OATH2 settings, make sure these are the same as in the
# configuration file of your EduVPN instance
redirect_url: http://localhost
client_id: your_VPN_client_user_name
client_secret: your_VPN_client_user_password
# Username and password to access the EduVPN portal
portal_username: your_eduvpn_portal_user_name
portal_userpass: your_eduvpn_portal_user_password
prod:
...
The only VPN that is currently compatible
Please note that EduVPN is an optional component. It enables the use of advanced algorithms that require node-to-node communication.
EduVPN provides an API for the OpenVPN server, which is required for automated certificate retrieval by the nodes. Like vantage6, it is an open source platform.
The following documentation shows you how to install EduVPN:
After the installation is done, you need to configure the server to:
Enable client-to-client communication. This can be achieved in the configuration file by the clientToClient setting (see ).
Do not block LAN communication (set blockLan to false). This allows your docker subnetworks to continue to communicate, which is required for vantage6 to function normally.
Enable
EduVPN can listen on multiple protocols (UDP/TCP) and ports at the same time. Be aware that all nodes need to connect using the same protocol and port in order to communicate with each other.
This server listens to TCP/443 only. Make sure you set clientToClient to true and blockLan to false. The range needs to be supplied to the node configuration files. Also note that the server configured below uses .
We need to add an API user. The username, vantage6-user in this case, and the client_secret have to be added to the vantage6-server configuration file.
This section explains the node resources that the algorithm container has access to.
The algorithm has access to several file mounts:
The paths to these files and directories are stored in the environment variables, which we will explain now.
The contain the file paths to the file-mounts. The following environment variables are available:
tcp/443
Create an application account.
// /etc/vpn-server-api/config.php
<?php
return [
// List of VPN profiles
'vpnProfiles' => [
'internet' => [
// The number of this profile, every profile per instance has a
// unique number
// REQUIRED
'profileNumber' => 1,
// The name of the profile as shown in the user and admin portals
// REQUIRED
'displayName' => 'vantage6 :: vpn service',
// The IPv4 range of the network that will be assigned to clients
// REQUIRED
'range' => '10.76.0.0/16',
// The IPv6 range of the network that will be assigned to clients
// REQUIRED
'range6' => 'fd58:63db:3245:d20d::/64',
// The hostname the VPN client(s) will connect to
// REQUIRED
'hostName' => 'eduvpn.vantage6.ai',
// The address the OpenVPN processes will listen on
// DEFAULT = '::'
'listen' => '::',
// The IP address to use for connecting to OpenVPN processes
// DEFAULT = '127.0.0.1'
'managementIp' => '127.0.0.1',
// Whether or not to route all traffic from the client over the VPN
// DEFAULT = false
'defaultGateway' => true,
// Block access to local LAN when VPN is active
// DEFAULT = false
'blockLan' => false,
// IPv4 and IPv6 routes to push to the client, only used when
// defaultGateway is false
// DEFAULT = []
'routes' => [],
// IPv4 and IPv6 address of DNS server(s) to push to the client
// DEFAULT = []
// Quad9 (https://www.quad9.net)
'dns' => ['9.9.9.9', '2620:fe::fe'],
// Whether or not to allow client-to-client traffic
// DEFAULT = false
'clientToClient' => true,
// Whether or not to enable OpenVPN logging
// DEFAULT = false
'enableLog' => false,
// Whether or not to enable ACLs for controlling who can connect
// DEFAULT = false
'enableAcl' => false,
// The list of permissions to allow access, requires enableAcl to
// be true
// DEFAULT = []
'aclPermissionList' => [],
// The protocols and ports the OpenVPN processes should use, MUST
// be either 1, 2, 4, 8 or 16 proto/port combinations
// DEFAULT = ['udp/1194', 'tcp/1194']
'vpnProtoPorts' => [
'tcp/1195',
],
// List the protocols and ports exposed to the VPN clients. Useful
// for OpenVPN port sharing. When empty (or missing), uses list
// from vpnProtoPorts
// DEFAULT = []
'exposedVpnProtoPorts' => [
'tcp/443',
],
// Hide the profile from the user portal, i.e. do not allow the
// user to choose it
// DEFAULT = false
'hideProfile' => false,
// Protect the TLS control channel with PSK
// DEFAULT = tls-crypt
'tlsProtection' => 'tls-crypt',
//'tlsProtection' => false,
],
],
// API consumers & credentials
'apiConsumers' => [
'vpn-user-portal' => '***',
'vpn-server-node' => '***',
],
];

...
'Api' => [
'consumerList' => [
'vantage6-user' => [
'redirect_uri_list' => [
'http://localhost',
],
'display_name' => 'vantage6',
'require_approval' => false,
'client_secret' => '***'
]
]
...

run_id. i.o. if a user creates a task a new run_id is assigned.
PORT
Contains the port on which the vantage6-server listens. It is used in combination with HOST and API_PATH.
API_PATH
Contains the API path of the vantage6-server.
*_DATABASE_URI
Contains the URI of the local database. The * is replaced by the key specified in the file.
INPUT_FILE
Path to the input file. The input file contains the user-defined input for the algorithm.
OUTPUT_FILE
Path to the output file. The contents of the output file are sent back to the vantage6-server when the algorithm container exits.
TOKEN_FILE
Path to the token file. The token file contains a JWT token which can be used to access the vantage6-server. This way the algorithm container is able to post new tasks and retrieve results.
TEMPORARY_FOLDER
Path to the temporary folder. This folder can be used to store intermediate results. These intermediate results are shared between all containers that have the same run_id. Algorithm containers which are created from an algorithm container themselves share the same run_id.
HOST
Contains the URL to the vantage6-server.
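As a rough sketch of how an algorithm could collect these variables, the snippet below reads them from a mapping such as os.environ. The helper name load_node_resources, the DEFAULT database label and the way HOST, PORT and API_PATH are joined are illustrative assumptions; only the variable names come from the list above.

```python
import os


def load_node_resources(env=None, database_label="DEFAULT"):
    """Collect the file mounts and server address exposed to the container.

    `load_node_resources` is a hypothetical helper, not part of vantage6.
    """
    if env is None:
        env = os.environ
    return {
        "input_file": env["INPUT_FILE"],
        "output_file": env["OUTPUT_FILE"],
        "token_file": env["TOKEN_FILE"],
        "temporary_folder": env["TEMPORARY_FOLDER"],
        # The `*` in *_DATABASE_URI is replaced by the database key.
        "database_uri": env[f"{database_label}_DATABASE_URI"],
        # Assumption: HOST, PORT and API_PATH are simply concatenated.
        "server_url": f"{env['HOST']}:{env['PORT']}{env['API_PATH']}",
    }
```

Passing a plain dictionary instead of os.environ makes it easy to try the helper outside a container.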
Managing the database of the vantage6 server can be done by an administrator through the shell.
Through the shell it is possible to manage all server entities. To start the shell, use vserver shell [options].
In the next sections the different database models that are available are explained. You can retrieve any record and edit any property of it. Don't forget to call .save() once you are done editing.
Organizations --> db.Organization
Users --> db.User
Roles and Rules --> db.Role and db.Rule
Every db. object has a help() method which prints some info on what data is stored in it (e.g. db.Organization.help()).
To store an organization you can use the db.Organization model:
Retrieving organizations from the database:
A lot of entities (e.g. users) at the server are connected to an organization. E.g. you can see which (computation) tasks are issued by the organization or see which collaborations it is participating in.
A user can have multiple roles and rules assigned to them. These are used to determine if the user has permission to view, edit, create or delete certain resources using the API. A role is a collection of rules.
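The relation between users, roles and rules can be pictured with a small stand-alone model. This is only a simplified sketch: the classes below are hypothetical stand-ins for the real db.User, db.Role and db.Rule models, and the resource and operation names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    resource: str   # e.g. "node"
    operation: str  # e.g. "view", "edit", "create" or "delete"


@dataclass
class Role:
    name: str
    rules: set = field(default_factory=set)  # a role is a collection of rules


@dataclass
class User:
    username: str
    roles: list = field(default_factory=list)
    rules: set = field(default_factory=set)

    def is_allowed(self, resource, operation):
        """True if a directly assigned rule or any of the roles grants it."""
        wanted = Rule(resource, operation)
        if wanted in self.rules:
            return True
        return any(wanted in role.rules for role in self.roles)
```

In this model a permission check succeeds when the requested rule is attached to the user directly or is part of at least one of the user's roles.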
Users belong to an organization, so if you have not created an organization yet, do that first. To create a user you can use the db.User model:
You can retrieve users in the following ways:
To modify a user, simply adjust the properties and save the object.
A collaboration consists of one or more organizations. To create a collaboration you need at least one organization in your database. To create a collaboration you can use the db.Collaboration model:
Tasks, nodes and organizations are directly related to collaborations. We can obtain these by:
Setting the encryption to False at the server does not mean that the nodes will send unencrypted results. This is only the case if the nodes also agree on this setting.
Before nodes can login, they need to exist in the server's database. A new node can be created as follows:
API keys are hashed before being stored in the database. Therefore you need to save the API key immediately. If you lose it, you can reset the API key later via the shell or via the API.
Tasks(/results) created from the shell are not picked up by nodes that are already running. The signal to notify them of a new task cannot be emitted this way.
A task is intended for one or more organizations. For each organization the task is intended for, a corresponding (initially empty) result should be created. Each task can have multiple results, for example a result from each organization.
Tasks can have a child/parent relationship. Note that parent and child tasks share the same run_id.
Tasks that share a run_id have access to the same temporary folder at the node. This allows for multi-stage algorithms.
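A multi-stage algorithm could use the shared temporary folder roughly as sketched below. The helper names and the choice of JSON files are assumptions made for illustration; the only thing taken from the text is that containers with the same run_id see the same folder.

```python
import json
import os


def save_intermediate(temporary_folder, name, value):
    """Write a JSON-serializable intermediate result to the shared folder."""
    path = os.path.join(temporary_folder, f"{name}.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(value, handle)
    return path


def load_intermediate(temporary_folder, name):
    """Read an intermediate result written by an earlier stage."""
    path = os.path.join(temporary_folder, f"{name}.json")
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A first stage would call save_intermediate with the temporary folder path, and a later stage with the same run_id would call load_intermediate with the same name.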
Obtaining results:

Collaborations --> db.Collaboration
Nodes --> db.Node
Tasks --> db.Task
Results --> db.Result
In this section you will learn how to (re)configure nodes.
The vantage6-node requires a configuration file to run. This is a YAML file with a specific format. To create an initial configuration file, start the configuration wizard via: vnode new. You can also create and/or edit this file manually.
The directory where the configuration file is stored depends on your operating system (OS). It is possible to store the configuration file at system or at user level. By default, node configuration files are stored at user level. The default directories per OS are as follows:
# create new organization
organization = db.Organization(
name="IKNL",
domain="iknl.nl",
address1="Zernikestraat 29",
address2="Eindhoven",
zipcode="5612HZ",
country="Netherlands"
)
# store organization in the database
organization.save()

# get all organizations in the database
organizations = db.Organization.get()
# get organization by its unique id
organization = db.Organization.get(1)
# get organization by its name
organization = db.Organization.get_by_name("IKNL")

# retrieve organization from which we want to know more
organization = db.Organization.get_by_name("IKNL")
# get all collaborations in which the organization participates
collaborations = organization.collaborations
# get all users from the organization
users = organization.users
# get all created tasks (from all users)
tasks = organization.created_tasks
# get the results of all these tasks
results = organization.results
# get all nodes of this organization (for each collaboration
# an organization participates in, it needs a node)
nodes = organization.nodes

# display all available rules
db.Rule.get()
# display rule 1
db.Rule.get(1)
# display all available roles
db.Role.get()
# display role 3
db.Role.get(3)
# show all rules that belong to role 3
db.Role.get(3).rules
# retrieve a certain rule from the DB
rule = db.Rule.get_by_("node", Scope, Operation)
# create a new role
role = db.Role(name="role-name", rules=[rule])
role.save()
# or assign the rule directly to the user
user = db.User.get_by_username("some-existing-username")
user.rules.append(rule)
user.save()

# first obtain the organization to which the new user belongs
org = db.Organization.get_by_name("IKNL")
# obtain role 3 to assign to the new user
role_3 = db.Role.get(3)
# create the new users, see section Roles and Rules on how to
# deal with permissions
new_user = db.User(
username="root",
password="super-secret",
firstname="John",
lastname="Doe",
roles=[role_3],
rules=[],
organization=org
)
# store the user in the database
new_user.save()

# get all users
db.User.get()
# get user 1
db.User.get(1)
# get user by username
db.User.get_by_username("root")
# get all users from the organization IKNL
db.Organization.get_by_name("IKNL").users

user = db.User.get_by_username("some-existing-username")
# update the firstname
user.firstname = "Brandnew"
# update the password; it is automatically hashed.
user.password = "something-new"
# store the updated user in the database
user.save()

# create a second organization to collaborate with
other_organization = db.Organization(
name="IKNL",
domain="iknl.nl",
address1="Zernikestraat 29",
address2="Eindhoven",
zipcode="5612HZ",
country="Netherlands"
)
other_organization.save()
# get organization we have created earlier
iknl = db.Organization.get_by_name("IKNL")
# create the collaboration
collaboration = db.Collaboration(
name="collaboration-name",
encrypted=False,
organizations=[iknl, other_organization]
)
# store the collaboration in the database
collaboration.save()

# obtain a collaboration which we like to inspect
collaboration = db.Collaboration.get(1)
# get all nodes
collaboration.nodes
# get all tasks issued for this collaboration
collaboration.tasks
# get all organizations
collaboration.organizations

# we'll use a uuid as the API key, but you can use anything as the
# API key
from uuid import uuid1
# nodes always belong to an organization *and* a collaboration,
# this combination needs to be unique!
iknl = db.Organization.get_by_name("IKNL")
collab = iknl.collaborations[0]
# generate and save
api_key = str(uuid1())
print(api_key)
node = db.Node(
name = f"IKNL Node - Collaboration {collab.name}",
organization = iknl,
collaboration = collab,
api_key = api_key
)
# save the new node to the database
node.save()

# obtain organization from which this task is posted
iknl = db.Organization.get_by_name("IKNL")
# obtain collaboration for which we want to create a task
collaboration = db.Collaboration.get(1)
# obtain the next run_id. Tasks sharing the same run_id
# can share the temporary volumes at the nodes. Usually this
# run_id is assigned through the API (as the user is not allowed
# to do so). All tasks from a master-container share the
# same run_id
run_id = db.Task.next_run_id()
task = db.Task(
name="some-name",
description="some human readable description",
image="docker-registry.org/image-name",
collaboration=collaboration,
run_id=run_id,
database="default",
initiator=iknl,
)
task.save()
# the input that the algorithm container
# (docker-registry.org/image-name) expects
input_ = {
}
import datetime
# now create a result model for each organization within the
# collaboration. This could also be a subset
for org in collaboration.organizations:
    res = db.Result(
        input=input_,
        organization=org,
        task=task,
        assigned_at=datetime.datetime.now()
    )
    res.save()

# get a task for which we want to create some
# child tasks
parent_task = db.Task.get(1)
child_task = db.Task(
name="some-name",
description="some human readable description",
image="docker-registry.org/image-name",
collaboration=collaboration,
run_id=parent_task.run_id,
database="default",
initiator=iknl,
parent=parent_task
)
child_task.save()

# obtain all Results
db.Result.get()
# obtain only completed results
[result for result in db.Result.get() if result.complete]
# obtain result by its unique id
db.Result.get(1)
Operating System | System-folder | User-folder
Windows | C:\ProgramData\vantage\node | C:\Users\<user>\AppData\Local\vantage\node
MacOS | /Library/Application Support/vantage6/node | /Users/<user>/Library/Application Support/vantage6/node
Linux | /etc/vantage6/node | /home/<user>/.config/vantage6/node
The command vnode looks in certain directories by default. It is possible to use any directory and specify the location with the --config flag. However, note that using a different directory requires you to specify the --config flag every time!
Each node instance (configuration) can have multiple environments. You can specify these under the key environments, which allows four types: dev, test, acc and prod. If you do not want to specify any environment, specify only the key application (not within environments).
We use DTAP for key environments. In short:
dev Development environment. It is ok to break things here
test Testing environment. Here, you can verify that everything works as expected. This environment should resemble the target environment where the final solution will be deployed as much as possible.
acc Acceptance environment. If the tests were successful, you can try this environment, where the final user will test his/her analysis to verify if everything meets his/her expectations.
prod Production environment. The version of the proposed solution where the final analyses are executed.
The most straightforward way of creating a new node configuration is using the command vnode new, which allows you to configure the most basic settings.
By default, the configuration is stored at user level, which makes this configuration available only for your user. In case you want to use a system directory you can add the --system flag when invoking the vnode new command.
To update a configuration you need to modify the created YAML file. To see where this file is located, you can use the command vnode files. Do not forget to specify the flag --system in case of a system-wide configuration or the --user flag in case of a user-level configuration.
Refer to here if you want to run both the node and server on the same machine.
Operating System
System-folder
User-folder
Windows
application:
  # API key used to authenticate at the server.
  api_key: ***
  # URL of the vantage6 server
  server_url: https://petronas.vantage6.ai
  # port the server listens to
  port: 443
  # API path prefix that the server uses. Usually '/api' or an empty string
  api_path: ''
  # subnet of the VPN server
  vpn_subnet: 10.76.0.0/16
  # add additional environment variables to the algorithm containers.
  # this could be useful for passwords or other things that algorithms
  # need to know about the node it is running on
  # OPTIONAL
  algorithm_env:
    # in this example the environment variable 'player' has
    # the value 'Alice' inside the algorithm container
    player: Alice
  # specify custom Docker images to use for starting the different
  # components.
  # OPTIONAL
  images:
    node: harbor2.vantage6.ai/infrastructure/node:petronas
    alpine: harbor2.vantage6.ai/infrastructure/alpine
    vpn_client: harbor2.vantage6.ai/infrastructure/vpn_client
    network_config: harbor2.vantage6.ai/infrastructure/vpn_network
  # path or endpoint to the local data source. The client can request a
  # certain database to be used if it is specified here. They are
  # specified as label:local_path pairs.
  databases:
    default: D:\data\datafile.csv
  # end-to-end encryption settings
  encryption:
    # whether encryption is enabled or not. This should be the same
    # as the `encrypted` setting of the collaboration to which this
    # node belongs.
    enabled: false
    # location of the private key file
    private_key: /path/to/private_key.pem
  # To control which algorithms are allowed at the node you can set
  # the allowed_images key. Each entry is expected to be a valid regular
  # expression
  allowed_images:
    - ^harbor.vantage6.ai/[a-zA-Z]+/[a-zA-Z]+
  # credentials used to login to private Docker registries
  docker_registries:
    - registry: docker-registry.org
      username: docker-registry-user
      password: docker-registry-password
  # Settings for the logger
  logging:
    # Controls the logging output level. Could be one of the following
    # levels: CRITICAL, ERROR, WARNING, INFO, DEBUG, NOTSET
    level: DEBUG
    # Filename of the log-file, used by RotatingFileHandler
    file: my_node.log
    # whether the output needs to be shown in the console
    use_console: True
    # The number of log files that are kept, used by RotatingFileHandler
    backup_count: 5
    # Size in kB of a single log file, used by RotatingFileHandler
    max_size: 1024
    # format: input for logging.Formatter,
    format: "%(asctime)s - %(name)-14s - %(levelname)-8s - %(message)s"
    datefmt: "%Y-%m-%d %H:%M:%S"
  # directory where local task files (input/output) are stored
  task_dir: C:\Users\<your-user>\AppData\Local\vantage6\node\tno1
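To get a feeling for how a pattern under allowed_images behaves, the sketch below checks image names against the example pattern from the configuration. It assumes that the pattern is applied with Python's re.match (anchored at the start of the image name); how the node actually applies it is not shown in this section, and is_image_allowed is a hypothetical helper.

```python
import re


def is_image_allowed(image, allowed_patterns):
    """Return True if the image name matches at least one pattern."""
    return any(re.match(pattern, image) for pattern in allowed_patterns)


# Pattern copied from the example configuration.
patterns = [r"^harbor.vantage6.ai/[a-zA-Z]+/[a-zA-Z]+"]

print(is_image_allowed("harbor.vantage6.ai/demo/average", patterns))
print(is_image_allowed("docker-registry.org/image-name", patterns))
```

Note that an unescaped dot in a regular expression matches any character, so a stricter pattern would escape the dots in the registry name (for example harbor\.vantage6\.ai).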
In this section the basic steps for creating an algorithm for horizontal partitioned data are explained.
The final code of this tutorial is published on GitHub. The algorithm is also published in our Docker registry: harbor2.vantage6.ai/demo/average
It is assumed that it is mathematically possible to create a federated version of the algorithm you want to use. In the following sections we create a federated algorithm to compute the average of a distributed dataset. An overview of the steps that we are going through:
Mathematically decompose the model
Federated implementation and local testing
Vantage6 algorithm wrapper
Dockerize and push to a registry
This tutorial shows you how to create a federated mean algorithm.
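The mean of $Q = [q_1, q_2, \dots, q_n]$ is computed as:
$$Q_{mean} = \frac{1}{n} \sum_{i=1}^{n} q_i = \frac{q_1 + q_2 + \dots + q_n}{n}$$
When dataset $Q$ is horizontally partitioned in dataset $A = [a_1, \dots, a_j]$ and $B = [b_1, \dots, b_k]$ (with $j + k = n$), we would like to compute $Q_{mean}$ from dataset A and B. This could be computed as:
$$Q_{mean} = \frac{(a_1 + \dots + a_j) + (b_1 + \dots + b_k)}{j + k} = \frac{\sum A + \sum B}{j + k}$$
Both the number of samples in each dataset and the total sum of each dataset are needed. Then we can compute the global average of dataset $A$ and $B$.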
We cannot simply compute the average on each node and combine them, as this would be mathematically incorrect. This would only work if dataset A and B contain the exact same number of samples.
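A small numerical check makes this concrete. The two toy datasets below are made up for illustration and deliberately have different sizes; the function names are hypothetical.

```python
def federated_mean(partials):
    """Combine per-dataset sums and counts into the global mean."""
    total_sum = sum(p["sum"] for p in partials)
    total_count = sum(p["count"] for p in partials)
    return total_sum / total_count


def mean_of_means(partials):
    """Naive (incorrect) approach: average the local averages."""
    local_means = [p["sum"] / p["count"] for p in partials]
    return sum(local_means) / len(local_means)


dataset_a = [1, 2, 3]  # toy values, 3 samples
dataset_b = [10]       # toy values, 1 sample
partials = [{"sum": sum(d), "count": len(d)} for d in (dataset_a, dataset_b)]

print(federated_mean(partials))  # 4.0, equal to the mean of [1, 2, 3, 10]
print(mean_of_means(partials))   # 6.0, biased towards the smaller dataset
```

Only when both datasets contain the same number of samples do the two functions return the same value.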
In this example we use Python, but you are free to use any language. The only requirements are that it 1) can make HTTP requests, and 2) can read from and write to files.
However, if you use a different language you are not able to use our wrapper. Reach out to us on to discuss how this works.
A federated algorithm consists of two parts:
A federated part of the algorithm which is responsible for creating the partial results. In our case this would be computing (1) the sum of the observations, and (2) the number of observations.
A central part of the algorithm which is responsible for combining the partial results from the nodes. In the case of the federated mean that would be dividing the total sum of the observations by the total number of observations.
The node that runs this part contains a CSV-file with one column (specified by the argument column_name) which we want to use to compute the global mean. We assume that this column has no NaN values.
The central algorithm receives the sums and counts from all sites and combines these to a global mean. This could be from one or more sites.
To test, simply create two datasets A and B, both having a numerical column numbers. Then run the following:
A good starting point would be to use the boilerplate code from our . This section outlines the steps needed to get to this boilerplate but also provides some background information.
Now that we have a federated implementation of our algorithm we need to make it compatible with the vantage6 infrastructure. The infrastructure handles the communication with the server and provides data access to the algorithm.
The algorithm consumes a file containing the input. This contains both the method name to be triggered as well as the arguments provided to the method. The algorithm also has access to a CSV file (in the future this could also be a database) on which the algorithm can run. When the algorithm is finished, it writes back the output to a different file.
The central part of the algorithm has to be able to create (sub)tasks. These subtasks are responsible for executing the federated part of the algorithm. The central part of the algorithm can either be executed on one of the nodes in the vantage6 network or on the machine of a researcher. In this example we only show the case in which one of the nodes executes the central part of the algorithm. The node provides the algorithm with a JWT token so that the central part of the algorithm has access to the server to post these subtasks.
The algorithm needs to be structured as a Python package. This way the algorithm can be installed within the Docker image. The minimal file-structure would be:
We also recommend adding a README.md, LICENSE and requirements.txt to the project_folder.
setup.py
Contains the setup method to create a package from your algorithm code. Here you specify some details about your package and the dependencies it requires.
Dockerfile
Contains the recipe for building the Docker image. Typically you only have to change the argument PKG_NAME to the name of your package. This name should be the same as the name you specified in setup.py. In our case that would be v6-average-py.
__init__.py
This contains the code for your algorithm. It is possible to split this into multiple files; however, the methods that should be available to the researcher must be in this file. You can do that by simply importing them into this file (e.g. from .average import my_nested_method).
We can distinguish two types of methods that a user can trigger:
The client the master method receives is a ContainerClient which is different than the client you use as a user.
Everything that is returned by the return statement is sent back to the central vantage6-server. This should never contain any privacy-sensitive information.
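The naming convention for the two method types can be illustrated with a toy dispatcher. This is not the actual vantage6 wrapper: dispatch, the example algorithm namespace and its functions are hypothetical, and only the input keys (master, method, args, kwargs) and the RPC_ prefix are taken from this tutorial.

```python
from types import SimpleNamespace


def dispatch(module, input_, client=None, data=None):
    """Call the method named in `input_` (hypothetical dispatcher)."""
    method_name = input_["method"]
    args = input_.get("args", [])
    kwargs = input_.get("kwargs", {})
    if input_.get("master"):
        # The central part receives a client as its first argument.
        method = getattr(module, method_name)
        return method(client, data, *args, **kwargs)
    # Users omit the RPC_ prefix; it is added when looking up the method.
    method = getattr(module, f"RPC_{method_name}")
    return method(data, *args, **kwargs)


def master(client, data, column_name):
    return f"master called for {column_name}"


def RPC_count(data, column_name):
    return len(data[column_name])


# Stand-in for an algorithm package exposing both method types.
algorithm = SimpleNamespace(master=master, RPC_count=RPC_count)

print(dispatch(algorithm, {"method": "count", "kwargs": {"column_name": "age"}},
               data={"age": [1, 2, 3]}))
```

The call above requests the method count, which the dispatcher resolves to RPC_count and runs on the local data.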
For our average algorithm the implementation will look as follows:
Now that we have a vantage6 implementation of the algorithm it is time to test it. Before we run it in a vantage6 setup we can test it locally by using the ClientMockProtocol which simulates the communication with the central server.
Before we can test it locally, we need to install the algorithm package in editable mode so that the mock client can use it. Simply go to the root directory of your algorithm package (with the setup.py file) and run the following:
Then create a script to test the algorithm:
Now that we have a fully tested algorithm for the vantage6 infrastructure, we need to package it so that it can be distributed to the data stations/nodes. Algorithms are delivered in Docker images, which is what the Dockerfile is for. To build an image from our algorithm (make sure you have Docker installed and running), run the following command from the root directory of your algorithm project.
The option -t specifies the (unique) identifier used by the researcher to use this algorithm. Usually this includes the registry address (harbor2.vantage6.ai) and the project name (demo).
It is possible that a vantage6 algorithm is developed in one programming language, but you would like to run the task from another language. For these use-cases, the Python algorithm wrapper and client support cross-language serialization. By default, input to the algorithms and output back to the client are serialized using pickle. However, it is possible to define a different serialization format.
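The difference between the two formats can be seen in a minimal sketch. The serialize and deserialize helpers are hypothetical and only illustrate the idea; they are not the functions used by the vantage6 wrapper or client.

```python
import json
import pickle


def serialize(obj, data_format="pickle"):
    """Turn a Python object into bytes in the requested format."""
    if data_format == "json":
        return json.dumps(obj).encode("utf-8")
    if data_format == "pickle":
        return pickle.dumps(obj)
    raise ValueError(f"unsupported data format: {data_format}")


def deserialize(payload, data_format="pickle"):
    """Inverse of `serialize`."""
    if data_format == "json":
        return json.loads(payload.decode("utf-8"))
    if data_format == "pickle":
        return pickle.loads(payload)
    raise ValueError(f"unsupported data format: {data_format}")


input_ = {"method": "column_names", "kwargs": {"data_format": "json"}}
print(serialize(input_, "json"))
```

The JSON payload is plain text that any language can parse, whereas pickle output is specific to Python.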
Input and output serialization can be specified as follows:
name | description | prefix | arguments
master | Central part of the algorithm. Receives a client as argument which provides an interface to the central server. This way the master can create tasks and collect their results. | (none) | (client, data, *args, **kwargs)
Remote procedure call | Consumes the data at the node to compute the partial. | RPC_ | (data, *args, **kwargs)
import pandas

def federated_part(path, column_name="numbers"):
    """Compute the sum and number of observations of a column"""
    # extract the column numbers from the CSV
    numbers = pandas.read_csv(path)[column_name]
    # compute the sum, and count number of rows
    local_sum = numbers.sum()
    local_count = len(numbers)
    # return the values as a dict
    return {
        "sum": local_sum,
        "count": local_count
    }

def central_part(node_outputs):
    """Combine the partial results to a global average"""
    global_sum = 0
    global_count = 0
    for output in node_outputs:
        global_sum += output["sum"]
        global_count += output["count"]
    return {"average": global_sum / global_count}

outputs = [
    federated_part("path/to/dataset/A"),
    federated_part("path/to/dataset/B")
]
Q_average = central_part(outputs)["average"]
print(f"global average = {Q_average}.")

project_folder
├── Dockerfile
├── setup.py
└── algorithm_pkg
    └── __init__.py

from os import path
from codecs import open
from setuptools import setup, find_packages
# we're using a README.md, if you do not have this in your folder, simply
# replace this with a string.
here = path.abspath(path.dirname(__file__))
with open(path.join(here, 'README.md'), encoding='utf-8') as f:
    long_description = f.read()
# Here you specify the meta-data of your package. The `name` argument is
# needed in some other steps.
setup(
name='v6-average-py',
version="1.0.0",
description='vantage6 average',
long_description=long_description,
long_description_content_type='text/markdown',
url='https://github.com/IKNL/v6-average-py',
packages=find_packages(),
python_requires='>=3.6',
install_requires=[
'vantage6-client',
# list your dependencies here:
# pandas, ...
]
)

# This specifies our base image. This base image contains some commonly used
# dependencies and an install of all vantage6 packages. You can specify a
# different image here (e.g. python:3). In that case it is important that
# `vantage6-client` is a dependency of your project as this contains the wrapper
# we are using in this example.
FROM harbor.vantage6.ai/algorithms/algorithm-base
# Change this to the package name of your project. This needs to be the same
# as what you specified for the name in the `setup.py`.
ARG PKG_NAME="v6-average-py"
# This will install your algorithm into this image.
COPY . /app
RUN pip install /app
# This will run your algorithm when the Docker container is started. The
# wrapper takes care of the IO handling (communication between node and
# algorithm). You don't need to change anything here.
ENV PKG_NAME=${PKG_NAME}
CMD python -c "from vantage6.tools.docker_wrapper import docker_wrapper; docker_wrapper('${PKG_NAME}')"

import time
from vantage6.tools.util import info

def master(client, data, column_name):
    """Combine partials to global model

    First we collect the parties that participate in the collaboration.
    Then we send a task to all the parties to compute their partial (the
    row count and the column sum). Then we wait for the results to be
    ready. Finally when the results are ready, we combine them to a
    global average.

    Note that the master method also receives the (local) data of the
    node. In most use cases this data argument is not used.

    The client, provided in the first argument, gives an interface to
    the central server. This is needed to create tasks (for the partial
    results) and collect their results later on. Note that this client
    is a different client than the client you use as a user.
    """
    # Info messages can help you when an algorithm crashes. These info
    # messages are stored in a log file which is sent to the server when
    # a task either finishes or crashes.
    info('Collecting participating organizations')
    # Collect all organizations that participate in this collaboration.
    # These organizations will receive the task to compute the partial.
    organizations = client.get_organizations_in_my_collaboration()
    ids = [organization.get("id") for organization in organizations]
    # Request all participating parties to compute their partial. This
    # will create a new task at the central server for them to pick up.
    # We've used a kwarg but it is also possible to use `args`. Although
    # we prefer kwargs as it is clearer.
    info('Requesting partial computation')
    task = client.create_new_task(
        input_={
            'method': 'average_partial',
            'kwargs': {
                'column_name': column_name
            }
        },
        organization_ids=ids
    )
    # Now we need to wait until all organizations(/nodes) have finished
    # their partial. We do this by polling the server for results. It is
    # also possible to subscribe to a websocket channel to get status
    # updates.
    info("Waiting for results")
    task_id = task.get("id")
    task = client.get_task(task_id)
    while not task.get("complete"):
        task = client.get_task(task_id)
        info("Waiting for results")
        time.sleep(1)
    # Once we know the partials are complete, we can collect them.
    info("Obtaining results")
    results = client.get_results(task_id=task.get("id"))
    # Now we can combine the partials to a global average.
    global_sum = 0
    global_count = 0
    for result in results:
        global_sum += result["sum"]
        global_count += result["count"]
    return {"average": global_sum / global_count}

def RPC_average_partial(data, column_name):
    """Compute the average partial

    The data argument contains a pandas-dataframe containing the local
    data from the node.
    """
    # extract the column_name from the dataframe.
    info(f'Extracting column {column_name}')
    numbers = data[column_name]
    # compute the sum, and count number of rows
    info('Computing partials')
    local_sum = numbers.sum()
    local_count = len(numbers)
    # return the values as a dict
    return {
        "sum": local_sum,
        "count": local_count
    }

pip install -e .

from vantage6.tools.mock_client import ClientMockProtocol
# Initialize the mock server. The datasets simulate the local datasets from
# the node. In this case we have two parties having two different datasets:
# a.csv and b.csv. The module name needs to be the name of your algorithm
# package. This is the name you specified in `setup.py`, in our case that
# would be v6-average-py.
client = ClientMockProtocol(
    datasets=["local/a.csv", "local/b.csv"],
    module="v6-average-py"
)
# to inspect which organizations are in your mock client, you can run the
# following
organizations = client.get_organizations_in_my_collaboration()
org_ids = ids = [organization["id"] for organization in organizations]
# we can either test an RPC method or the master method (which will also
# trigger the RPC methods). Let's start by triggering an RPC method and see if
# that works. Note that we do *not* specify the RPC_ prefix for the method! In
# this example we assume that both a.csv and b.csv contain a numerical column
# `age`.
average_partial_task = client.create_new_task(
    input_={
        'method': 'average_partial',
        'kwargs': {
            'column_name': 'age'
        }
    },
    organization_ids=org_ids
)
# You can directly obtain the result (we don't have to wait for nodes to
# complete the tasks)
results = client.get_results(average_partial_task.get("id"))
print(results)
# To trigger the master method you also need to supply the `master`-flag
# to the input. Also note that we only supply the task to a single organization
# as we only want to execute the central part of the algorithm once. The master
# task takes care of the distribution to the other parties.
average_task = client.create_new_task(
    input_={
        'master': 1,
        'method': 'master',
        'kwargs': {
            'column_name': 'age'
        }
    },
    organization_ids=[org_ids[0]]
)
results = client.get_results(average_task.get("id"))
print(results)

docker build -t harbor2.vantage6.ai/demo/average .

docker push harbor2.vantage6.ai/demo/average

client.post_task(
    name='mytask',
    image='harbor2.vantage6.ai/testing/v6-test-py',
    collaboration_id=COLLABORATION_ID,
    organization_ids=ORGANIZATION_IDS,
    data_format='json',  # Specify input format to the algorithm
    input_={
        'method': 'column_names',
        'kwargs': {'data_format': 'json'},  # Specify output format
    }
)

30 November 2022
Bugfix
Fix for automatic addition of column. This failed in some SQL dialects because reserved keywords (e.g. 'user' for PostgreSQL) were not escaped (PR#415)
Correct installation order for uWSGI in node and server docker file ()
30 November 2022
Bugfix
Backwards compatibility for which organization initiated a task between v3.0-3.4 and v3.5 ()
Fixed VPN client container. Entry script was not executable in Github pipelines ()
30 November 2022
When upgrading to 3.5.0, you might need to add the otp_secret column to the user table manually in the database. This may be avoided by upgrading to 3.5.2.
Feature
TOTP Multi-Factor-Authenticator has been added. Admins can enforce that all users enable MFA (, ).
The server support email can now be set in the configuration file; it used to be fixed at [email protected] (, ).
3 November 2022
Bugfix
Fixed a bug in the local proxy server which made algorithm containers crash in case the client.create_new_task method was used ().
Fixed a bug that crashed the node when a non-existing image was sent in a task ().
25 October 2022
Feature
Add columns to the SQL database on startup (, ). This simplifies the upgrade process when a new column is added in a new release, as you no longer need to add columns manually. When downgrading, the columns will not be deleted.
Docker wrapper for Parquet files (, ). Parquet provides a way to store tabular data with the datatypes included which is an advantage over CSV.
Bugfix
The function client.util.change_my_password() was updated ()
Bugfix
Temporary fix for a bug that prevents the master container from creating tasks in an encrypted collaboration. This temporary fix disables the parallel encryption module in the local proxy. This functionality will be restored in a future release.
Feature
The release pipeline has been expanded to automatically push new Docker images of node/server to the harbor2 service.
Bugfix
Note that 3.3.4 was only released on PyPI and that version is identical to 3.3.5. That version was otherwise skipped due to a temporary mistake in the release pipeline.
Bugfix
Token refresh was broken for both users and nodes. (, )
Local proxy encryption was broken. This prevented algorithms from creating subtasks when encryption was enabled. (, )
Bugfix
vpn_client_image and network_config_image are settable through the node configuration file. (, )
The option --all
Bugfix
Fixed faulty error status codes from the /collaboration endpoint ().
Default roles are always returned from the /role endpoint. This fixes the error when a user was assigned a default role but could not reach anything (as it could not view its own role) (
Feature
Login requirements have been updated. Passwords are now required to have sufficient complexity (8+ characters, and at least 1 uppercase, 1 lowercase, 1 digit, 1 special character). Also, after 5 failed login attempts, a user account is blocked for 15 minutes (these defaults can be changed in a server config file).
Added endpoint /password/change to allow users to change their password using their current password as authentication. It is no longer possible to change passwords via client.user.update()
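The default password complexity rule from this release can be illustrated with a small check. This is only a sketch of the rule as described in these notes, not the server's actual implementation; the function name is hypothetical, and treating "special character" as ASCII punctuation is an assumption.

```python
import string


def is_complex_enough(password: str, min_length: int = 8) -> bool:
    """Check the default rule: 8+ characters with at least one uppercase
    letter, one lowercase letter, one digit and one special character.

    Assumption: "special character" means ASCII punctuation.
    """
    return (
        len(password) >= min_length
        and any(c.isupper() for c in password)
        and any(c.islower() for c in password)
        and any(c.isdigit() for c in password)
        and any(c in string.punctuation for c in password)
    )


print(is_complex_enough("Welcome123!"))  # satisfies every condition
print(is_complex_enough("welcome"))      # too short, no uppercase/digit/special
```

A check like this can be useful on the client side to give feedback before a request is sent to the server.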
Feature
Horizontal scaling for the vantage6-server instance by adding support for RabbitMQ.
It is now possible to connect other docker containers to the private algorithm network. This enables you to attach services to the algorithm network using the docker_services setting.
Feature
Algorithm-to-algorithm communication can now take place over multiple ports, which the algorithm developer can specify in the Dockerfile. Labels can be assigned to each port, facilitating communication over multiple channels.
Multi-database support for nodes. It is now also possible to assign multiple data sources to a single node in Petronas; this was already available in Harukas 2.2.0. The user can request a specific data source by supplying the database argument when creating a task.
Feature
Direct algorithm-to-algorithm communication has been added. Via a VPN connection, algorithms can exchange information with one another.
Pagination is added. Metadata is provided in the headers by default. It is also possible to include them in the output body by supplying an additional parameter include=metadata. Parameters page and per_page
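As an illustration of the pagination parameters, the sketch below builds a request URL with page, per_page and include=metadata. The helper name and the server address are hypothetical, and the /task endpoint is used only as an example of a list endpoint.

```python
from urllib.parse import urlencode


def paginated_url(base_url, endpoint, page=1, per_page=10,
                  include_metadata=False):
    """Build a URL that requests a single page of results."""
    params = {"page": page, "per_page": per_page}
    if include_metadata:
        # ask the server to put the pagination metadata in the body
        params["include"] = "metadata"
    return f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}?{urlencode(params)}"


# hypothetical server address
print(paginated_url("https://server.example/api", "/task", page=2,
                    per_page=20, include_metadata=True))
```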
31 October 2022
Bugfix
Encryption module in the local proxy server has been fixed
Feature
Allows for horizontal scaling of the server instance by adding support for RabbitMQ. Note that this has not been released for version 3(!)
Bugfix
Feature
Multi-database support for nodes. It is now possible to assign multiple data sources to a single node. The user can request a specific data source by supplying the database argument when creating a task.
The mailserver now supports TLS and SSL options
Bugfix
Changes to the way the application interacts with the database. Solves the issue of unexpected disconnects from the DB and thereby freezing the application.
Bugfix
Updating the country field in an organization works again
The client.result.list(...) broke when it was not able to deserialize one of the in- or outputs.
Feature
Custom algorithm environment variables can be set using the algorithm_env key in the configuration file.
Support for non-file-based databases on the node.
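Inside an algorithm, a custom environment variable set through algorithm_env could be read as in the sketch below. The variable name MY_ALGORITHM_SETTING and the helper are hypothetical; the assignment to os.environ only simulates what the node would provide.

```python
import os


def read_setting(name: str, default: str = "") -> str:
    """Read a custom environment variable, falling back to a default."""
    return os.environ.get(name, default)


# Simulate a variable that the node would pass to the algorithm container
# (hypothetical name and value).
os.environ["MY_ALGORITHM_SETTING"] = "some-value"

print(read_setting("MY_ALGORITHM_SETTING"))
print(read_setting("NOT_CONFIGURED", default="fallback"))
```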
Bugfix
Fixed a bug that prevented the usage of secured registry algorithms
Feature
Role/rule based access control
Roles consist of a bundle of rules. Rules provide access to certain API endpoints at the server.
Feature
The node is now compatible with the Harbor v2.0 API
Bug fixes
Fixed a bug that ignored the --system flag from vnode start
Logging output muted when the --config option is used in vnode start
Bug fixes
starting the server for the first time resulted in a crash as the root user was not supplied with an email address.
Algorithm containers could still access the internet through their host. This has been patched.
Features
Cross language serialization. Enabling algorithm developers to write algorithms that are not language dependent.
Reset password is added to the API. For this purpose two endpoints have been added: /recover/lost and /recover/reset. The server config file needs to be extended to connect to a mail server in order to make this work.
Features
new command vnode clean to clean up temporary docker volumes that are no longer used
Version of the individual packages are printed in the console on startup
Updated Command Line Interface (CLI)
The commands vnode list, vnode start and the new command vnode attach are aimed to work with multiple nodes on a single machine.
System and user-directories can be used to store configurations by using the
/node endpoint, while this functionality is required to use the VPN
vnode stop
--force
Performance upgrade in the /organization endpoint. Retrieving organization information previously caused long delays when the organization had many tasks (PR#288).
Organization admins are no longer allowed to create and delete nodes as these should be managed at collaboration level. Therefore, the collaboration admin rules have been extended to include create and delete nodes rules (PR#289).
Fixed some issues that made 3.3.0 incompatible with 3.3.1 (Issue#285).
/user/{id}
Added the default roles 'viewer', 'researcher', 'organization admin' and 'collaboration admin' to newly created servers. These roles may be assigned to users of any organization, and should help users with proper permission assignment.
Added option to get all roles for a specific user id in the GET /role endpoint.
RabbitMQ has support for multiple servers when using vserver start. It already had support for multiple servers when deploying via a Docker compose file.
When exiting server logs or node logs with Ctrl+C, there is now an additional message alerting the user that the server/node is still running in the background and how they may stop them.
Change
Node proxy server has been updated
Updated PyJWT and related dependencies for improved JWT security.
When nodes are trying to use a wrong API key to authenticate, they now receive a clear message in the node logs and the node exits immediately.
When using vserver import, API keys must now be provided for the nodes you create.
Moved all swagger API docs from YAML files into the code. Also, corrected errors in them.
API keys are created with UUID4 instead of UUID1. This prevents UUIDs created milliseconds apart from being too similar.
Rules for users to edit tasks were never used and have therefore been deleted.
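The difference between UUID1 and UUID4 mentioned in the changes above can be seen with Python's standard uuid module. The helper names are hypothetical and this is not the server's key-generation code; it only sketches why random (version 4) UUIDs are harder to guess than time-based (version 1) ones.

```python
import uuid


def new_api_key() -> str:
    """Generate a random (version 4) UUID string, e.g. for use as a key."""
    return str(uuid.uuid4())


def shared_positions(x: uuid.UUID, y: uuid.UUID) -> int:
    """Count the hex positions at which two UUIDs carry the same digit."""
    return sum(p == q for p, q in zip(x.hex, y.hex))


# Two time-based UUIDs generated back to back usually share most digits,
# while two random UUIDs usually share only a few.
print(shared_positions(uuid.uuid1(), uuid.uuid1()))
print(shared_positions(uuid.uuid4(), uuid.uuid4()))
print(new_api_key())
```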
Bugfix
In the Python client, client.organization.list() now shows pagination metadata by default, which is consistent with all other list() statements.
When not providing an API key in vnode new, there used to be an unclear error message. Now, we allow specifying an API key later and provide a clearer error message for any other keys with inadequate values.
It is now possible to provide a name when creating a name, both via the Python client and via the server.
A GET /role request crashed if parameter organization_id was defined but not include_root. This has been resolved.
Users received an 'unexpected error' when performing a GET /collaboration?organization_id=<id> request and they didn't have global collaboration view permission. This was fixed.
GET /role/<id> didn't give an error if a role didn't exist. Now it does.
Many additional select and filter options on API endpoints, see swagger docs endpoint (/apidocs). The new options have also been added to the Python client.
Users are now always able to view their own data
Usernames can be changed through the API
Bugfix
(Confusing) SQL errors are no longer returned from the API.
Clearer error message when an organization has multiple nodes for a single collaboration.
Node no longer tries to connect to the VPN if it has no vpn_subnet setting in its configuration file.
Fix the VPN configuration file renewal
Superusers are no longer able to post tasks to collaborations their organization does not participate in. Note that superusers were never able to view the results of such tasks.
It is no longer possible to post tasks to organizations that do not have a registered node attached to the collaboration.
The vnode create-private-key command no longer crashes if the ssh directory does not exist.
The client no longer logs the password
The version of the alpine docker image (that is used to set up algorithm runs with VPN) was fixed. This prevents the node from downloading many versions of this image.
Improved reading of username and password from docker registry, which can be capitalized differently depending on the docker version.
Fix error with multiple-database feature, where default is now used if specific database is not found
The CLI commands vserver new and vnode new have been extended to facilitate configuration of the VPN server.
Filter options for the client have been extended.
Roles can no longer be used across organizations (except for roles in the default organization)
Added vnode remove command to uninstall a node. The command removes the resources attached to a node installation (configuration files, log files, docker volumes etc).
Added option to specify configuration file path when running vnode create-private-key.
Bugfix
Fixed swagger docs
Improved error message if docker is not running when a node is started
Improved error message for vserver version and vnode version if no servers or nodes are running
Patching user failed if users had zero roles - this has been fixed.
Creating roles was not possible for a user who had permission to create roles only for their own organization - this has been corrected.
GET /result
GET /collaboration
GET /collaboration/{id}/organization
GET /collaboration/{id}/node
GET /collaboration/{id}/task
GET /organization
GET /role
GET /role/{id}/rule
GET /rule
GET /task
GET /task/{id}/result
GET /node
API keys are encrypted in the database
Users cannot shrink their own permissions by accident
Give node permission to update public key
Dependency updates
Bugfix
Fixed database connection issues
Don't allow users to be assigned to non-existing organizations by root
Fix node status when node is stopped and immediately started up
Check if node names are allowed docker names
Performance improvements on the /organization endpoint
Nodes are now disconnected more gracefully. This fixes the issue that nodes appear offline while they are in fact online
Fixed a bug that prevented deleting a node from the collaboration
A role is now allowed to have zero rules
Some http error messages have improved
Organization fields can now be set to an empty string
Added flag --attach to the vserver start and vnode start command. This directly attaches the log to the console.
Auto updating the node and server instance is now limited to the major version. See this Github issue.
e.g. if you've installed the Trolltunga version of the CLI you will always get the Trolltunga version of the node and server.
Infrastructure images are now tagged using their major version (e.g. trolltunga or harukas).
It is still possible to use intermediate versions by specifying the --image option when starting the node or server. (e.g. vserver start --image harbor.vantage6.ai/infrastructure/server:2.0.0.post1 )
Bugfix
Fixed issue where node crashed if the database did not exist on startup. See this Github issue.
Major update on the python-client. The client also contains management tools for the server (i.e. for creating users and organizations and managing permissions). The client can be imported with from vantage6.client import Client.
You can use the argument verbose on the client to output status messages. This is useful, for example, when working with Jupyter notebooks.
Added CLI vserver version, vnode version, vserver-local version and vnode-local version commands to report the version of the node or server they are running
The logging contains more information about the current setup, and refers to this documentation and our Discord channel
Bugfix
Issue with the DB connection. Session management is updated. Error still occurs from time to time but can be reset by using the endpoint /health/fix . This will be patched in a newer version.
Fixed config folder mounting point when the --config option is used in vnode start
User table in the database is extended to contain an email address which is mandatory.
Bug fixes
Collaboration name needs to be unique
API consistency and bug fixes:
GET /organization was missing domain key
PATCH /organization could not patch domain
GET /collaboration/{id}/node has been made consistent with /node
GET /collaboration/{id}/organization has been made consistent with /organization
PATCH /user root-user was not able to update users
DELETE /user root-user was not able to delete users
GET /task null values are now consistent: [] is replaced by null
POST, PATCH, DELETE /node root-user was not able to perform these actions
GET /node/{id}/task output is made consistent with the
other
questionary dependency is updated to 1.5.2
vantage6-toolkit repository has been merged with the vantage6-client as they were very tightly coupled.
Custom task and log directories can be set in the configuration file
Improved CLI messages
Docker images are only pulled if the remote version is newer. This applies both to the node/server image and the algorithm images
Client class names have been simplified (UserClientProtocol -> Client)
Bug fixes
Removed defective websocket watchdog. There still might be disconnection issues from time to time.
--user/--system
Current status (online/offline) of the nodes can be seen using vnode list, which also reports which environments are available per configuration.
Developer container has been added which can inject the container with the source: vnode start --develop [source]. Note that this Docker image needs to be built in advance from the development.Dockerfile and tagged devcon.
vnode config_file has been replaced by vnode files which not only outputs the config file location but also the database and log file location.
New database model
Improved relations between models. Thereby updating the Python API, see here.
Input for the tasks is now stored in the result table. This was required as the input is encrypted individually for each organization (end-to-end encryption (E2EE) between organizations).
The Organization model has been extended with the public_key (String) field. This field contains the public key from each organization, which is used by the E2EE module.
The Collaboration model has been extended with the encrypted (Boolean) field which keeps track if all messages (tasks, results) need to be E2EE for this specific collaboration.
The Task model keeps track of the initiator (organization) of the task. This is required to encrypt the results for the initiator.
End to end encryption
All messages between all organizations are encrypted by default.
Each node requires the private key of the organization as it needs to be able to decrypt incoming messages. The private key should be specified in the configuration file using the private_key label.
In case no private key is specified, the node generates a new key and uploads the public key to the server.
If a node starts (using vnode start), it always checks if the public_key on the server matches the private key the node is currently using.
In case your organization has multiple nodes running they should all point to the same private key.
Users have to encrypt the input and decrypt the output, which can be simplified by using our client vantage6.client.Client for Python or vtg::Client for R.
Algorithms are not concerned about encryption as this is handled at node level.
Algorithm container isolation
Containers no longer have an internet connection, but are connected to a private docker network.
Master containers can access the central server through a local proxy server, which is connected to both the private docker network and the outside world. This proxy server also takes care of the encryption of the messages from the algorithms for the intended receiving organization.
In case a single machine hosts multiple nodes, each node is attached to its own private Docker network.
Temporary Volumes
Each algorithm mounts a temporary volume, which is linked to the node and the run_id of the task
The mounting target is specified in an environment variable TEMPORARY_FOLDER. The algorithm can write anything to this directory.
These volumes need to be cleaned manually (docker volume rm VOLUME_NAME).
Successive algorithms only have access to the volume if they share the same run_id . Each time a user creates a task, a new run_id is issued. If you need to share information between containers, you need to do this through a master container. If a master container creates a task, all slave tasks will obtain the same run_id.
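An algorithm could use the temporary volume roughly as sketched below. Only the TEMPORARY_FOLDER environment variable comes from the notes above; the helper names and the file name are hypothetical, and the fallback to a local temporary directory is there only so the sketch also runs outside a node.

```python
import os
import tempfile
from pathlib import Path

# Outside a node TEMPORARY_FOLDER is not set, so fall back to a local
# temporary directory for demonstration purposes only.
os.environ.setdefault("TEMPORARY_FOLDER", tempfile.mkdtemp())


def temporary_file(filename: str) -> Path:
    """Return a path inside the folder named by TEMPORARY_FOLDER."""
    return Path(os.environ["TEMPORARY_FOLDER"]) / filename


def save_intermediate(filename: str, text: str) -> Path:
    """Write an intermediate result to the temporary volume."""
    path = temporary_file(filename)
    path.write_text(text)
    return path


def load_intermediate(filename: str) -> str:
    """Read back a result written by an earlier container of the same run."""
    return temporary_file(filename).read_text()


save_intermediate("partial.txt", "sum=10;count=4")
print(load_intermediate("partial.txt"))
```

Because the volume is tied to the run_id, a later container would only find partial.txt if it belongs to the same run.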
RESTful API
All RESTful API output is HATEOAS formatted. (wiki)
Local Proxy Server
Algorithm containers no longer receive an internet connection. They can only communicate with the central server through a local proxy service.
It handles encryption for certain endpoints (i.e. the input for /task and the results for /result)
Dockerized the Node
All node code is run from a Docker container. Build versions can be found at our Docker repository: harbor.distributedlearning.ai/infrastructure/node. Specific versions can be pulled using tags.
For each running node, a Docker volume is created in which the data, input and output are stored. The name of the Docker volume is: vantage-NODE_NAME-vol. This volume is shared with all incoming algorithm containers.
Each node is attached to the public network and a private network: vantage-NODE_NAME-net.
