The Remote Backend Compiler (RBC) package implements the OmniSciDB client support for defining so-called Runtime UDFs. That is, while the OmniSciDB server is running, one can register new SQL functions with the OmniSci Calcite server as well as provide their implementations in LLVM IR string form. The RBC package supports creating Runtime UDFs from Python functions.
A User-Defined Function (UDF) adds new SQL functionality that is applied row by row. The figure below illustrates how a UDF works:
First, we need to connect RBC to the OmniSci server using the `RemoteOmnisci` class.
One can define UDFs using `omnisci` as a decorator:
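A minimal sketch of such a definition follows. The signature string `'double(double)'`, the function name `incr`, and the helpers `define_incr` and `apply_rowwise` are illustrative assumptions; `omnisci` is taken to be an already connected `RemoteOmnisci` object supplied by the caller. The `apply_rowwise` helper only mimics, in plain Python, the row-by-row evaluation described above.

```python
def incr(x):
    """Plain Python body of the UDF: it sees one row value at a time."""
    return x + 1.0


def define_incr(omnisci):
    """Use the connection object as a decorator on `incr`.

    Assumption: `omnisci` is a connected RemoteOmnisci instance and the
    signature string 'double(double)' is accepted by it.
    """
    return omnisci('double(double)')(incr)


def apply_rowwise(func, column):
    """Hypothetical helper: evaluate `func` once per row of a column."""
    return [func(value) for value in column]


print(apply_rowwise(incr, [1.0, 2.5, 4.0]))  # [2.0, 3.5, 5.0]
```

Calling `define_incr(omnisci)` with a live connection is what would make the function known to the server; the rest of the block runs without any server.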
Gitbook View, Documentation Sources, Project homepage.
The purpose of the RBC project is to implement the concept of a Remote Backend Compiler (RBC). The idea is to split the compilation of user-provided program source code into machine-executable instructions between two different computer systems, an RBC client and a JIT server, using the following workflow:
In the RBC client, the user-provided source code of a program is compiled to an LLVM IR string.
The LLVM IR string together with the program metadata is sent to a server where it will be registered and made available for execution in the JIT server.
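The two steps above can be mimicked with the Python standard library alone. The sketch below is an analogy, not the RBC implementation: Python bytecode serialized with `marshal` stands in for the LLVM IR string, and the names `client_compile` and `ToyServer` are hypothetical.

```python
import marshal


def client_compile(source, name):
    """Client side: compile source text to a portable intermediate form.

    Python bytecode plays the role that the LLVM IR string plays in RBC.
    """
    code = compile(source, filename=f"<{name}>", mode="exec")
    return {"name": name, "payload": marshal.dumps(code)}


class ToyServer:
    """Server side: register received programs and run them on request."""

    def __init__(self):
        self.registry = {}

    def register(self, message):
        # Finish "compilation" by loading the payload into a namespace.
        # Never do this with untrusted input; it is only an illustration.
        namespace = {}
        exec(marshal.loads(message["payload"]), namespace)
        self.registry[message["name"]] = namespace[message["name"]]

    def call(self, name, *args):
        return self.registry[name](*args)


server = ToyServer()
server.register(client_compile("def incr(x):\n    return x + 1\n", "incr"))
print(server.call("incr", 41))  # 42
```

The point of the split is that only the compact intermediate form and its metadata cross the network, while execution happens next to the data.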
The RBC concept can be applied in various situations. For example, RBC lets client programs analyze or process big data stored on a remote server when retrieving the data over the network would be infeasible or too inefficient because of its size.
LLVM IR is an intermediate representation of a compiled program used in the LLVM compiler toolchain. The low-level LLVM IR language is based on Static Single Assignment (SSA) representation that many high-level languages can be compiled into. The LLVM IR can be an input to a Just-in-time (JIT) compiler which will complete the compilation process resulting in a machine-executable program.
In the RBC project, the client software is implemented in Python and uses Numba for compiling Python functions into LLVM IR. In addition, the RBC client software can use Clang compilers for compiling C/C++ functions into LLVM IR as well. The RBC project provides a Python/Numba based JIT server as a prototype of the RBC concept.
As an application, the RBC client software can be used in connection with OmniSciDB - an analytical database and SQL engine - for run-time registration of custom SQL functions: User-Defined Functions (UDFs) and User-Defined Table Functions (UDTFs). OmniSciDB uses JIT technology that enables compiling SQL queries into machine-executable programs to be run on modern CPU and GPU hardware.
In the following, we'll describe how to install the RBC and related software.
Because the RBC project is under active development, the fastest way to get the latest updates and bug fixes is to use the approach described in Developers Corner.
It is recommended to install the rbc and omniscidb conda packages to separate conda environments.
To check that the omniscidb package is installed successfully, make sure that the omniscidb environment is activated and run:
To start the omniscidb server with runtime UDF/UDTF support enabled, run the `omnisci_server` command with the `--enable-runtime-udf` and `--enable-table-functions` flags:
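As a sketch, the command line described above can be assembled as an argument list. The helper name `server_command` and its `extra_args` parameter are hypothetical; any further arguments a particular installation needs (for example a data directory) are not covered here.

```python
import shlex


def server_command(extra_args=()):
    """Build the omnisci_server command with runtime UDF/UDTF support.

    `extra_args` is a hypothetical hook for installation-specific options.
    """
    return [
        "omnisci_server",
        "--enable-runtime-udf",
        "--enable-table-functions",
        *extra_args,
    ]


# Show the command as it would be typed in a shell.
print(shlex.join(server_command()))
```

The resulting list can be passed to `subprocess.Popen` from a script that manages the server process.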
Not supported
ColumnList support requires OmniSciDB 5.6 or newer
In OmniSciDB, an array has the following internal representation:
Notice that returning an empty array is an invalid operation that might crash the server.
You can check either whether an array is null or whether an individual array value is null.
OmniSciDB supports many data types, but not all of them are supported by the Remote Backend Compiler.
One can also use one of the array creation functions `full`, `full_like`, `zeros`, `zeros_like`, `ones`, and `ones_like`:
| Datatype | Size (bytes) | Notes |
| --- | --- | --- |
| BOOLEAN | 1 | TRUE: `'true'`, `'1'`, `'t'`. FALSE: `'false'`, `'0'`, `'f'`. Text values are not case-sensitive. |
| TINYINT | 1 | Minimum value: `-127`; maximum value: `127`. |
| SMALLINT | 2 | Minimum value: `-32,767`; maximum value: `32,767`. |
| INT | 4 | Minimum value: `-2,147,483,647`; maximum value: `2,147,483,647`. |
| BIGINT | 8 | Minimum value: `-9,223,372,036,854,775,807`; maximum value: `9,223,372,036,854,775,807`. |
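The integer limits listed above all follow the pattern ±(2^(8n-1) - 1) for a type of n bytes, that is, a symmetric range that is one value narrower on the negative side than the usual two's-complement range. The short check below reproduces the listed limits from the sizes; the helper name `signed_range` is hypothetical.

```python
def signed_range(size_bytes):
    """Return the symmetric (min, max) pair for an integer of given size."""
    limit = 2 ** (8 * size_bytes - 1) - 1
    return -limit, limit


for name, size in [("TINYINT", 1), ("SMALLINT", 2), ("INT", 4), ("BIGINT", 8)]:
    low, high = signed_range(size)
    # The ',' format specifier inserts thousands separators.
    print(f"{name:8s} {low:,} .. {high:,}")
```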
We develop and test the RBC software under Linux using the Conda packaging system, as it conveniently provides all the necessary dependencies for the RBC package as well as for the OmniSciDB software.
In the following, we explain how to set up Conda environments for developing RBC as well as OmniSciDB software, how to get the software, and how to run and test the software.
To create the needed development environments, follow the instructions below.
Install and set up conda (unless it is already installed):
Create the `rbc-dev` environment:
Create the `omniscidb-dev` environment:
It is recommended to keep the rbc and omniscidb development environments separate to minimize the risk of software version conflicts.
OmniSciDB can be built in CPU-only or CUDA-enabled mode.
For CUDA-enabled OmniSciDB development, create the `omniscidb-cuda-dev` environment:
In addition, make sure that your system has the CUDA Toolkit installed and that the CUDA driver is functional (run `nvidia-smi` to verify). The highest CUDA Toolkit version that OmniSciDB supports is currently 11.0.
The following instructions install CUDA Toolkit version 11.0.3:
Check out the `rbc` sources:
Check out the `omniscidb` sources:
Although the `omniscidb-dev` environment contains all the prerequisites for building and running the OmniSciDB server within the conda environment, we slightly adjust the conda environment to make it easier to manage the build process for different build targets. For that, use the following script for activating the `omniscidb-dev` environment:
The script `activate-omniscidb-internal-dev.sh` must be sourced (not run via `bash` or another shell) in the existing terminal session. The activate script will show various information about how to develop the OmniSciDB software within a Conda environment.
The activate script above uses a custom script `/usr/local/cuda/env.sh` for setting CUDA environment variables for the conda environment. Use the following commands to install the `env.sh` script:
The basic workflow for developing and testing the RBC package consists of the following commands:
The omniscidb-related rbc tests are run only when the omniscidb server is running. The server must be started with the options `--enable-runtime-udf --enable-table-functions`; see the OmniSciDB development section below.
By default, the omniscidb server is assumed to be accessible at localhost port 6274. Only if the server is running elsewhere does one need to configure the rbc testing environment as follows:
Create a configuration file with the following content (update credentials as needed):
The default location for the configuration file depends on the system, but one can set an environment variable `OMNISCI_CLIENT_CONF` that should contain a full path to the configuration file. Default locations of the configuration file are shown below:
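The lookup order described above can be sketched as follows. The helper `client_conf_path` is hypothetical and is not part of rbc; the system-specific default is passed in by the caller rather than guessed, and the path used in the demonstration is a placeholder.

```python
import os


def client_conf_path(default_path, environ=None):
    """Return the configuration file path to use.

    OMNISCI_CLIENT_CONF, when set to a non-empty value, takes precedence
    over the system-dependent default supplied by the caller.
    """
    environ = os.environ if environ is None else environ
    return environ.get("OMNISCI_CLIENT_CONF") or default_path


# Demonstration with an explicit environment mapping (placeholder paths).
print(client_conf_path("/placeholder/default.conf",
                       {"OMNISCI_CLIENT_CONF": "/tmp/client.conf"}))
```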
The basic workflow for developing and testing OmniSciDB software in the context of RBC development is as follows:
that will build a CUDA-enabled omniscidb server. See the instructions printed by the activate script for how to build a CPU-only omniscidb server, for instance.
To execute the omniscidb test-suite, make sure that the current working directory is the build directory and run:
To start the omniscidb server with runtime UDF/UDTF support enabled, run:
The RBC package implements support for defining and registering user-defined functions for the OmniSciDB SQL engine. The following types of user-defined functions are supported:
UDFs that are applied to a DB table row-wise,
UDTFs (table functions) that are applied to the DB table columns.
In the following, we explain how to use the RBC Python package `rbc` for connecting to an OmniSciDB server, defining custom UDFs and a UDTF, registering them with the OmniSciDB server, and finally, how to use the user-defined functions in a SQL query, using `rbc` tools as an example.
Assuming that an OmniSciDB server is running, registering a new user-defined function requires establishing a connection to the server. This can be done directly:
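A minimal sketch of a direct connection follows. The keyword names (`user`, `password`, `host`, `port`, `dbname`) and the credential values are assumptions that should be checked against the installed rbc version and the server setup; only the port 6274 is taken from this document. The helpers `connection_settings` and `connect` are hypothetical, and `rbc` is imported lazily so that the settings can be inspected without the package installed.

```python
def connection_settings(**overrides):
    """Collect connection parameters; values here are assumed defaults."""
    settings = dict(
        user="admin",                # assumption: default administrator account
        password="HyperInteractive", # assumption: default password
        host="localhost",
        port=6274,                   # default port mentioned in this document
        dbname="omnisci",            # assumption: default database name
    )
    settings.update(overrides)
    return settings


def connect(**overrides):
    """Create the connection object (requires the rbc package)."""
    from rbc.omniscidb import RemoteOmnisci
    return RemoteOmnisci(**connection_settings(**overrides))


print(connection_settings(host="db.example.org")["host"])  # db.example.org
```

With rbc installed and a server running, `omnisci = connect()` would give the object used in the rest of this section.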
or using an existing connection session id (not implemented, see rbc issue 180):
For the sake of having a complete example, let's create a sample table:
We create two new UDFs that increment row values by 1, one UDF for FLOAT columns and another for INT columns:
Notice that the two UDFs can be defined as a single Python function `myincr` because RBC/OmniSciDB supports overloading user-defined function names.
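A sketch of how such an overloaded definition might look. It assumes that the connection object accepts several signature strings at once and that `'float(float)'` and `'int(int)'` are valid spellings; `define_myincr` is a hypothetical helper that takes the connected `omnisci` object as an argument.

```python
def myincr(x):
    """One Python body serving both the FLOAT and the INT overload."""
    return x + 1


def define_myincr(omnisci):
    """Attach both signatures to `myincr` in a single decorator call.

    Assumption: `omnisci` is a connected RemoteOmnisci instance that
    accepts multiple signature strings as positional arguments.
    """
    return omnisci('float(float)', 'int(int)')(myincr)


# The plain Python function already behaves the same for both types.
print(myincr(1), myincr(1.5))  # 2 2.5
```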
To register these UDFs with OmniSciDB, one can call
but when the `omnisci` object is used for making SQL queries, the registration of any new UDFs is triggered automatically.
That's it! Now anyone connected to the OmniSciDB server can use the SQL function `myincr` in their queries.
For example, one can use the RBC-provided `omnisci` object to send queries:
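A sketch of such a query, under several assumptions: the table name `test_table` and the column names `i` (INT) and `x` (FLOAT) are placeholders for the sample table, and `sql_execute` is assumed to return a pair of a description and an iterable of result rows. The helper `select_myincr` is hypothetical.

```python
def select_myincr(omnisci, table="test_table"):
    """Apply myincr to an INT column `i` and a FLOAT column `x`.

    Assumptions: table and column names are placeholders, and
    `omnisci.sql_execute` returns (description, iterable of rows).
    """
    query = f"SELECT i, myincr(i), x, myincr(x) FROM {table}"
    descr, result = omnisci.sql_execute(query)
    return list(result)
```

With a live connection, `for row in select_myincr(omnisci): print(row)` would print one tuple per table row.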
that will output:
Table functions act on database table columns and their results are stored in so-called output columns of temporary tables. Let's implement a new SQL table function that computes a new table with all columns incremented by a user-specified value:
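A sketch of what such a table function might look like for a single input column. The signature string is an assumption modelled on the type names used in this section (`Cursor`, `RowMultiplier`, and an output column type written here as `OutputColumn<double>`), and `define_my_table_incr` is a hypothetical helper taking the connected `omnisci` object. The body is plain Python and can be exercised with lists.

```python
# Assumed signature: input cursor, increment, sizer, output column.
UDTF_SIGNATURE = (
    "int32(Cursor<double>, double, RowMultiplier, OutputColumn<double>)"
)


def my_table_incr(x, dx, m, y):
    """Write x[i] + dx into the output column y; return the output length."""
    for i in range(len(x)):
        y[i] = x[i] + dx
    return len(x)


def define_my_table_incr(omnisci):
    """Attach the assumed signature using the connection as a decorator."""
    return omnisci(UDTF_SIGNATURE)(my_table_incr)


# Exercise the body locally with lists standing in for columns.
x = [1.0, 2.0, 3.0]
y = [0.0] * len(x)
print(my_table_incr(x, 0.5, 1, y), y)  # 3 [1.5, 2.5, 3.5]
```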
Before trying it out, let's explain some of the details here:
Return value - The return value of a UDTF definition defines the length of the output columns. Memory for the output column arguments is pre-allocated for the size `m * len(x)`, where the column sizer parameter `m` is a literal constant specified by the user in a SQL query and `len(x)` is the size of the first input column. If the UDTF definition returns an output column size smaller than `m * len(x)`, the memory of the output columns is re-allocated accordingly. The return type of a UDTF definition must be a 32-bit integer, and the type of the column sizer parameter can be `RowMultiplier`, `ConstantParameter`, or `Constant`.
Cursor - `Cursor<...>` represents the cursor over input table columns. For instance, `Cursor<float, int>` would correspond to two arguments of the UDTF definition, one being the input column containing float values and the other being the input column containing int values.
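The sizing rule for output columns can be simulated in plain Python. The sketch below is a toy model, not the server's implementation: `run_table_function` is a hypothetical stand-in for the engine that pre-allocates `m * len(x)` slots and then shrinks the output to the returned size, and `incr_positive` is an illustrative table function that emits fewer rows than it was given.

```python
def run_table_function(func, x, dx, m):
    """Toy host: pre-allocate m * len(x) slots, then keep `size` of them."""
    capacity = m * len(x)
    y = [None] * capacity
    size = func(x, dx, m, y)
    if not 0 <= size <= capacity:
        raise ValueError("returned size exceeds pre-allocated output")
    return y[:size]  # models re-allocating the output to the final size


def incr_positive(x, dx, m, y):
    """Increment only the positive inputs; return how many rows were written."""
    n = 0
    for value in x:
        if value > 0:
            y[n] = value + dx
            n += 1
    return n


print(run_table_function(incr_positive, [1.0, -2.0, 3.0], 0.5, 1))  # [1.5, 3.5]
```

Here three slots are reserved (`m = 1`, `len(x) = 3`) but only two are kept, because the function returns 2.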
One can call the new table function to increment the FLOAT column `x` by value `2.3` from a SQL query as follows:
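A sketch of how such a query string might be put together. The function name `my_table_incr`, the table name `test_table`, the sizer value `1`, and the `TABLE(... CURSOR(...))` form are assumptions to be checked against the OmniSciDB version in use; `table_incr_query` is a hypothetical helper.

```python
def table_incr_query(table="test_table", column="x", dx=2.3, m=1):
    """Build a SQL query that calls the table function on one column.

    All names and the query form are assumptions for illustration.
    """
    return (
        "SELECT * FROM TABLE(my_table_incr("
        f"CURSOR(SELECT {column} FROM {table}), {dx}, {m}))"
    )


print(table_incr_query())
```

The resulting string would be passed to the connection object in the same way as any other SQL query.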
that will output