
RBC reference docs

Setup development environments

We develop and test the RBC software under Linux using the Conda packaging system, which conveniently provides all the necessary dependencies for the RBC package as well as for the OmniSciDB software.

In the following, we explain how to set up Conda environments for developing RBC and OmniSciDB, how to get the software, and how to run and test it.

Setting up development environments

To create the needed development environments, follow the instructions below.

  • Install and set up conda (unless it is already installed):

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
source $HOME/miniconda3/bin/activate
conda init bash  # or zsh when on macOS Catalina
  • Create rbc-dev environment:

conda env create -n rbc-dev \
  -f https://raw.githubusercontent.com/Quansight/pearu-sandbox/master/conda-envs/rbc-dev.yaml
  • Create omniscidb-dev environment:

conda env create -n omniscidb-dev \
  -f https://raw.githubusercontent.com/Quansight/pearu-sandbox/master/conda-envs/omniscidb-dev.yaml

It is recommended to keep the rbc and omniscidb development environments separate to minimize the risk of software version conflicts.

CUDA

OmniSciDB can be built for CPU-only or CUDA-enabled mode.

For CUDA-enabled OmniSciDB development, create omniscidb-cuda-dev environment:

conda env create -n omniscidb-cuda-dev \
  -f https://raw.githubusercontent.com/Quansight/pearu-sandbox/master/conda-envs/omniscidb-dev.yaml

In addition, make sure that your system has the CUDA Toolkit installed and that the CUDA driver is functional (run nvidia-smi to verify). The highest CUDA Toolkit version that OmniSciDB currently supports is 11.0.

Here follow the instructions for installing the CUDA Toolkit version 11.0.3:

sudo mkdir -p /usr/local/src/cuda-installers
cd /usr/local/src/cuda-installers
sudo wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda_11.0.3_450.51.06_linux.run
sudo bash cuda_11.0.3_450.51.06_linux.run --toolkit --toolkitpath=/usr/local/cuda-11.0.3/ --installpath=/usr/local/cuda-11.0.3/ --override --no-opengl-libs --no-man-page --no-drm --silent

sudo sh /usr/local/src/cuda-installers/cuda_11.0.3_450.51.06_linux.run
┌──────────────────────────────────────────────────────────────────────────────┐
│ CUDA Installer                                                               │
│ - [X] Driver                                                                 │
│      [X] 450.51.06                                                           │
│ + [ ] CUDA Toolkit 11.0                                                      │
│   [ ] CUDA Samples 11.0                                                      │
│   [ ] CUDA Demo Suite 11.0                                                   │
│   [ ] CUDA Documentation 11.0                                                │
│   Options                                                                    │
│   Install                                                                    │
...
#   ^--- X-select only Driver and PRESS Install

Getting software

  • Check out rbc sources:

mkdir -p ~/git/xnd-project
cd ~/git/xnd-project

# If you are a member of xnd-project organization, use

git clone git@github.com:xnd-project/rbc.git

# else fork https://github.com/xnd-project/rbc and clone your rbc fork,
# or use

git clone https://github.com/xnd-project/rbc.git
  • Check out omniscidb sources:

mkdir -p ~/git/omnisci
cd ~/git/omnisci

# If you are a member of omnisci organization, use

git clone git@github.com:omnisci/omniscidb-internal.git

# Otherwise, fork https://github.com/omnisci/omniscidb and clone the fork.
#
# Or use

git clone https://github.com/omnisci/omniscidb.git
  • Although the omniscidb-dev environment contains all the prerequisites for building and running OmniSciDB server within the conda environment, we slightly adjust the conda environment to make the management of the building process for different build targets easier. For that, use the following script for activating the omniscidb-dev environment:

cd ~/git/omnisci/
wget https://raw.githubusercontent.com/Quansight/pearu-sandbox/master/working-envs/activate-omniscidb-internal-dev.sh

The script activate-omniscidb-internal-dev.sh must be sourced in the current terminal session (not run via bash or another shell). When sourced, the script prints information about how to develop the OmniSciDB software within a Conda environment.

  • The activate script above uses a custom script /usr/local/cuda/env.sh for setting CUDA environment variables for the conda environment. Use the following commands to install the env.sh script:

cd ~/git/omnisci/
wget https://raw.githubusercontent.com/Quansight/pearu-sandbox/master/set_cuda_env.sh
sudo ln -s ~/git/omnisci/set_cuda_env.sh /usr/local/cuda/env.sh

Development

RBC

The basic workflow for developing and testing the RBC package consists of the following commands:

cd ~/git/xnd-project/rbc
conda activate rbc-dev
python setup.py develop

# Possible commands for running rbc tests

pytest -sv -r A rbc/tests
pytest -sv -r A rbc/tests -x -k <test function>

The omniscidb-related rbc tests are run only when the omniscidb server is running. The server must be started with the options --enable-runtime-udf --enable-table-functions, see the OmniSciDB development section below.
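Because the omniscidb-related tests depend on a reachable server, it can be useful to check connectivity before running pytest. The following is a minimal sketch using only the Python standard library. The helper name server_is_reachable is hypothetical and not part of rbc, and a successful TCP connection only shows that something is listening on the port, not that it is an OmniSciDB server started with the required options.

```python
import socket


def server_is_reachable(host="localhost", port=6274, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; not part of rbc.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host could not be resolved.
        return False


if __name__ == "__main__":
    print("server reachable:", server_is_reachable())
```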

By default, it is assumed that the omniscidb server is accessible at localhost, port 6274. Only if the server is running elsewhere does one need to configure the rbc testing environment, as follows:

  • Create a configuration file with the following content (update credentials as needed):

# File: client.conf

[user]
# OmniSciDB user name
name: admin
# OmniSciDB user password
password: HyperInteractive

[server]
# OmniSciDB server host name or IP
host: localhost
# OmniSciDB server port
port: 6274
  • The default location of the configuration file depends on the system. To use a different location, set the environment variable OMNISCI_CLIENT_CONF to the full path of the configuration file. The default locations are shown below:

# Linux, macOS:
OMNISCI_CLIENT_CONF=$HOME/.config/omnisci/client.conf
# Windows:
OMNISCI_CLIENT_CONF=%UserProfile%/.config/omnisci/client.conf
OMNISCI_CLIENT_CONF=%AllUsersProfile%/.config/omnisci/client.conf
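To illustrate the file layout and the role of OMNISCI_CLIENT_CONF, here is a minimal sketch that reads such a file with the standard configparser module. The function load_client_conf is a hypothetical helper, not the loader that rbc itself uses, and the fallback path covers only the Linux/macOS default shown above.

```python
import configparser
import os


def load_client_conf(path=None):
    """Read connection settings from a client.conf-style file.

    Hypothetical helper for illustration; not part of rbc.
    """
    if path is None:
        default = os.path.join(
            os.path.expanduser("~"), ".config", "omnisci", "client.conf")
        # The environment variable, if set, takes precedence over the default.
        path = os.environ.get("OMNISCI_CLIENT_CONF", default)
    parser = configparser.ConfigParser()
    if not parser.read(path):  # read() returns the list of files it parsed
        raise FileNotFoundError(path)
    return {
        "user": parser.get("user", "name"),
        "password": parser.get("user", "password"),
        "host": parser.get("server", "host"),
        "port": parser.getint("server", "port"),
    }
```

The resulting dictionary has the same keys as the user, password, host, and port settings in the file, so it could be passed on to whatever code opens the connection.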

OmniSciDB

The basic workflow for developing and testing OmniSciDB software in the context of RBC development is as follows:

cd ~/git/omnisci/omniscidb-internal  # or git/omnisci/omniscidb
source ~/git/omnisci/activate-omniscidb-internal-dev.sh

mkdir -p build && cd build
cmake -Wno-dev $CMAKE_OPTIONS_CUDA ..
make -j $NCORES

This builds a CUDA-enabled omniscidb server. See the instructions printed by the activate script for other build targets, for instance, a CPU-only omniscidb server.

To execute the omniscidb test-suite, make sure that the current working directory is the build directory and run:

mkdir tmp && bin/initdb tmp
make sanity_tests

To start omniscidb server with runtime UDF/UDTFs support enabled, run:

mkdir data && bin/initdb data
bin/omnisci_server --enable-runtime-udf --enable-table-functions

Compiling code with RBC decorators

RBC Documentation

Gitbook View, Documentation Sources, Project homepage.

Overview

The purpose of the RBC project is to implement the concept of a Remote Backend Compiler (RBC). The RBC concept is about splitting the compilation of user-provided program source code into machine-executable instructions between two different computer systems - an RBC client and a JIT server - using the following workflow:

  • In the RBC client, the user-provided source code of a program is compiled to an LLVM IR string.

  • The LLVM IR string together with the program metadata is sent to a server where it will be registered and made available for execution in the JIT server.

The RBC concept can be applied in various situations. For example, RBC enables executing client programs that analyze or process big data stored on a remote server when retrieving the data over the network would be infeasible because of its size, or too inefficient.

  • LLVM IR is an intermediate representation of a compiled program used in the LLVM compiler toolchain. The low-level LLVM IR language is based on Static Single Assignment (SSA) representation that many high-level languages can be compiled into. The LLVM IR can be an input to a Just-in-time (JIT) compiler which will complete the compilation process resulting in a machine-executable program.
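The two-step workflow above can be sketched in plain Python. This is a toy model under stated assumptions: the LLVM IR string is written by hand (in RBC it would be produced by Numba or Clang), the JSON payload format and the names make_payload and ToyServer are hypothetical, and the toy server only stores the IR instead of JIT-compiling it.

```python
import json

# Hand-written LLVM IR for a function that adds 1 to a 32-bit integer.
INCR_IR = """\
define i32 @incr(i32 %x) {
entry:
  %r = add i32 %x, 1
  ret i32 %r
}
"""


def make_payload(name, signature, llvm_ir):
    """Client side: bundle the IR string with metadata (hypothetical format)."""
    return json.dumps({"name": name, "signature": signature, "ir": llvm_ir})


class ToyServer:
    """Server side: registers received IR by name; a real server would JIT it."""

    def __init__(self):
        self.registry = {}

    def register(self, payload):
        entry = json.loads(payload)
        self.registry[entry["name"]] = entry
        return entry["name"]


server = ToyServer()
registered = server.register(make_payload("incr", "int32(int32)", INCR_IR))
print(registered, "->", server.registry[registered]["signature"])
```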

In the RBC project, the client software is implemented in Python and uses Numba for compiling Python functions into LLVM IR. In addition, the RBC client software can use Clang compilers for compiling C/C++ functions into LLVM IR as well. The RBC project provides a Python/Numba based JIT server as a prototype of the RBC concept.

As an application, the RBC client software can be used in connection with OmniSciDB - an analytical database and SQL engine - for run-time registration of custom SQL functions: User-Defined Functions (UDFs) and User-Defined Table Functions (UDTFs). OmniSciDB uses JIT technology that enables compiling SQL queries into machine-executable programs to be run on modern CPU and GPU hardware.

Structure of RBC documentation

  • Getting Started

  • Developers Corner

Simple example (OmniSciDB)

The RBC package implements support for defining and registering user-defined functions for the OmniSciDB SQL engine. The following types of user-defined functions are supported:

  • UDFs that are applied to a DB table row-wise,

  • UDTFs (table functions) that are applied to the DB table columns.

In the following, we explain how to use the RBC Python package rbc for connecting to an OmniSciDB server, defining custom UDFs and a UDTF, registering these to the OmniSciDB server, and finally, how to use the user-defined functions in a SQL query using rbc tools as an example.

Connecting to OmniSciDB server

Assuming that an OmniSciDB server is running, registering a new user-defined function requires establishing a connection to the server. This can be done directly:

from rbc.omniscidb import RemoteOmnisci
omnisci = RemoteOmnisci(user='admin', password='HyperInteractive',
                        host='127.0.0.1', port=6274, dbname='omnisci')

or using an existing connection session id (not implemented, see rbc issue 180):

omnisci = RemoteOmnisci(connection=con)

For the sake of having a complete example, let's create a sample table:

omnisci.sql_execute('drop table if exists simple_table')
omnisci.sql_execute('create table if not exists simple_table (x FLOAT, i INT);')
omnisci.load_table_columnar('simple_table',
                            x=[1.1, 1.2, 1.3, 1.4, 1.5],
                            i=[0, 1, 2, 3, 4])

Defining UDFs using Python functions

We create two new UDFs that increment row values by 1, one UDF for FLOAT columns and another for INT columns:

@omnisci('float(float)', 'int(int)')
def myincr(x):
    return x + 1

Notice that the two UDFs can be defined as a single Python function myincr because RBC/OmniSciDB supports overloading user-defined function names.

Registering UDFs - row-wise function

To register these UDFs to OmniSciDB, one can call

omnisci.register()

but when using the omnisci object for making SQL queries then the registration of any new UDFs is triggered automatically.

That's it! Now anyone connected to the OmniSciDB server can use the SQL function myincr in their queries.

SQL query using omnisci.sql_execute

For example, one can use the RBC provided omnisci object to send queries:

descr, result = omnisci.sql_execute('SELECT x, myincr(x) FROM simple_table')
for x, x1 in result:
    print(f'x={x:.4}, x1={x1:.4}')

that will output:

x=1.1, x1=2.1
x=1.2, x1=2.2
x=1.3, x1=2.3
x=1.4, x1=2.4
x=1.5, x1=2.5
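The row-wise semantics of a UDF can be emulated locally without a server. The sketch below applies the same myincr body to the sample data in plain Python. It is only an illustration: Python floats are 64-bit whereas the FLOAT column is 32-bit, and no compilation or registration takes place.

```python
def myincr(x):
    return x + 1


# Same sample data as loaded into simple_table.
x_column = [1.1, 1.2, 1.3, 1.4, 1.5]
i_column = [0, 1, 2, 3, 4]

# A row-wise UDF is evaluated once per row, much like a list comprehension.
x1_column = [myincr(x) for x in x_column]
i1_column = [myincr(i) for i in i_column]

for x, x1 in zip(x_column, x1_column):
    print(f'x={x:.4}, x1={x1:.4}')
```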

Defining UDTFs - table functions

Table functions act on database table columns, and their results are stored in so-called output columns of temporary tables. Let's implement a new SQL table function that computes a new table with all columns incremented by a user-specified value:

@omnisci('int(Cursor<float>, float, RowMultiplier, OutputColumn<float>)')
def incrby(x, dx, m, y):
    for i in range(len(x)):
        y[i] = x[i] + dx
    return len(x)

omnisci.register()

Before trying it out, let's explain some of the details here:

  • Return value - The return value of a UDTF definition defines the length of the output columns. The memory of the output column arguments is pre-allocated for m * len(x) rows, where the column sizer parameter m is a literal constant specified by the user in the SQL query and len(x) is the size of the first input column. If the UDTF returns a value smaller than m * len(x), the memory of the output columns is re-allocated accordingly. The return type of a UDTF definition must be a 32-bit integer, and the type of a column sizer parameter can be RowMultiplier, ConstantParameter, or Constant.

  • Cursor - The Cursor<...> represents the cursor over input table columns. For instance, Cursor<float, int> would correspond to two arguments of the UDTF definition, one being the input column containing float values and another being input column containing int values.
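The output sizing rule described above can be illustrated with a plain-Python emulation. The driver call_table_function is hypothetical and only mimics the described behaviour (pre-allocate m * len(x) rows, then shrink to the returned row count); it is not how OmniSciDB is implemented.

```python
def incrby(x, dx, m, y):
    # Same body as the UDTF definition; y is the pre-allocated output column.
    for i in range(len(x)):
        y[i] = x[i] + dx
    return len(x)


def call_table_function(func, x, dx, m):
    """Hypothetical driver mimicking the output column sizing rule."""
    y = [None] * (m * len(x))  # pre-allocate m * len(x) output rows
    n = func(x, dx, m, y)      # the return value is the actual row count
    return y[:n]               # shrink the output column to n rows


x = [1.1, 1.2, 1.3, 1.4, 1.5]
result = call_table_function(incrby, x, 2.3, 1)
print([f'{v:.4}' for v in result])
```

With m set to 2, ten rows would be pre-allocated, but the result would still contain only the five rows reported by the return value.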

One can call the new table function to increment the FLOAT column x by value 2.3 from a SQL query as follows:

descr, result = omnisci.sql_execute('''
  SELECT * FROM TABLE(INCRBY(CURSOR(SELECT x FROM simple_table),
                             CAST(2.3 AS FLOAT), 1))
''')
for y, in result:
    print(f'y={y:.4}')

that will output

y=3.4
y=3.5
y=3.6
y=3.7
y=3.8

Signature Specification


Calling External Functions

User Defined Aggregate Function (UDAF)

Not supported

The Signature Class

User Defined Functions (UDF)

Runtime UDF Support

The Remote Backend Compiler (RBC) package implements the OmniSciDB client support for defining so-called Runtime UDFs. That is, while OmniSciDB server is running, one can register new SQL functions to Omnisci Calcite server as well as provide their implementations in LLVM IR string form. The RBC package supports creating Runtime UDFs from Python functions.

A User-Defined Function makes it possible to define new SQL functionality that works in a row-wise manner. The figure below illustrates how a UDF works:

Figure: the function add1 is called for every row and produces a new row.

Example

First, we need to connect RBC to the OmniSciDB server using the RemoteOmnisci class.

from rbc.omniscidb import RemoteOmnisci
omnisci = RemoteOmnisci(user='admin', password='HyperInteractive',
                        host='127.0.0.1', port=6274, dbname='omnisci')

One can define UDF functions using omnisci as a decorator:

@omnisci('int32(int32)')
def incr(i):
    return i + 1

Getting started

In this section we cover the following topics:

  • How to install RBC and OmniSciDB software to Conda environments

  • How to define UDFs and UDTFs and register these to the OmniSciDB server

User Defined Table Functions (UDTF)

ColumnList

ColumnList support requires OmniSciDB 5.6 or newer.

Cursor

Column

Installation

In the following, we'll describe how to install the RBC and related software.

  • The RBC project is under active development. To get the latest updates and bug fixes as quickly as possible, use the approach described in Developers Corner.

Install RBC using conda

conda install -c conda-forge rbc

# or to install rbc to a new environment, run
conda create -n rbc -c conda-forge rbc
conda activate rbc

# check if rbc installed successfully
python -c 'import rbc; print(rbc.__version__)'

Install RBC using pip (alternative)

pip install rbc-project
# check if rbc installed successfully
python -c 'import rbc; print(rbc.__version__)'
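The same check can be done from a Python script without raising an ImportError when the package is missing. This is a minimal sketch using the standard importlib module; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("rbc"):
    import rbc
    print("rbc version:", rbc.__version__)
else:
    print("rbc is not installed in this environment")
```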

Install OmniSciDB using conda

  • It is recommended to install the rbc and omniscidb conda packages to separate conda environments.

conda create -n omniscidb omniscidb
conda activate omniscidb

# or for CUDA enabled omniscidb, use

conda create -n omniscidb-cuda omniscidb=*_cuda
conda activate omniscidb-cuda

To check that the omniscidb package is installed successfully, make sure that the omniscidb environment is activated and run:

omnisci_server --version

To start omniscidb server with runtime UDF/UDTFs support enabled, run the omnisci_server command with --enable-runtime-udf and --enable-table-functions flags:

# Create DB, run only once
mkdir -p omnisci_data
omnisci_initdb omnisci_data

# Start server:
omnisci_server --data=omnisci_data --enable-runtime-udf --enable-table-functions

Install OmniSciDB using docker


# CPU version
# https://hub.docker.com/r/omnisci/core-os-cpu
docker run \
  -d \
  --name omnisci \
  -p 6274:6274 \
  -v /home/username/omnisci-storage:/omnisci-storage \
  omnisci/core-os-cpu
  
# GPU version
# https://hub.docker.com/r/omnisci/core-os-cuda
docker run \
  --runtime=nvidia \
  -d \
  --name omnisci \
  -p 6274:6274 \
  -v /home/username/omnisci-storage:/omnisci-storage \
  omnisci/core-os-cuda

Supported Types and Data Structures

OmniSciDB supports many data types but not all of them are supported by the Remote Backend Compiler.

Scalar types

Datatype   Size (bytes)   Notes
BOOLEAN    1              TRUE: 'true', '1', 't'. FALSE: 'false', '0', 'f'. Text values are not case-sensitive.
TINYINT    1              Minimum value: -127; maximum value: 127.
SMALLINT   2              Minimum value: -32,767; maximum value: 32,767.
INT        4              Minimum value: -2,147,483,647; maximum value: 2,147,483,647.
BIGINT     8              Minimum value: -9,223,372,036,854,775,807; maximum value: 9,223,372,036,854,775,807.
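The integer ranges listed above follow one pattern: for a size of n bytes, the maximum is 2^(8n-1) - 1 and the minimum is its negation, so the lowest two's complement value is not available to user data. The sketch below reproduces the listed ranges from the sizes. Why that lowest value is excluded is not stated here, so the code treats it simply as an assumption taken from the listed numbers.

```python
def int_range(size_bytes):
    """Return (min, max) of a signed integer type of the given size.

    Assumes, as in the listed ranges, that the lowest two's complement
    value is not available to user data.
    """
    max_value = 2 ** (8 * size_bytes - 1) - 1
    return -max_value, max_value


for name, size in [("TINYINT", 1), ("SMALLINT", 2), ("INT", 4), ("BIGINT", 8)]:
    low, high = int_range(size)
    print(f"{name}: {low:,} .. {high:,}")
```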

Array

RBC also supports Array<T>, where T is one of the scalar types listed above. Both fixed-length and variable-length arrays are supported in RBC. For more information, see the Array page.

Column

ColumnList

Cursor

Developing RBC and OmniSciDB

In this section we cover the following topics:

  • Getting software sources and setting up development environments

Array

In OmniSciDB, an array has the following internal representation:

typedef struct {
    T* data;  // contiguous memory block
    int64_t size;
    int8_t is_null;  // boolean values in omniscidb are represented as int8 variables
} Array;
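To make the layout concrete, the struct can be mirrored with the standard ctypes module for the case T = double. This is an illustration only: the class name DoubleArray is hypothetical, and RBC does not require users to build such structures by hand.

```python
import ctypes


class DoubleArray(ctypes.Structure):
    """ctypes mirror of the Array struct with T = double (illustration only)."""
    _fields_ = [
        ("data", ctypes.POINTER(ctypes.c_double)),  # contiguous memory block
        ("size", ctypes.c_int64),
        ("is_null", ctypes.c_int8),                 # boolean stored as int8
    ]


# Keep a reference to the buffer so the pointer stays valid.
values = (ctypes.c_double * 3)(1.0, 2.0, 3.0)
arr = DoubleArray(ctypes.cast(values, ctypes.POINTER(ctypes.c_double)),
                  len(values), 0)

print(arr.size, arr.data[1], bool(arr.is_null))
```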

Creating an Array programmatically

from numba import types as nb_types
from rbc.omnisci_backend import Array

@omnisci('double[](int64)')
def create_array(size):
    array = Array(size, nb_types.double)
    for i in range(size):
        array[i] = nb_types.double(i)
    return array

Note that returning an empty array is an invalid operation that might crash the server.

One can also use one of the array creation functions specified in the Python array API standard: full, full_like, zeros, zeros_like, ones, and ones_like:

from numba import types as nb_types
import rbc.omnisci_backend as omni

@omnisci('double[](int64)')
def zero_array(size):
    return omni.zeros(size, nb_types.double)

Accessing a pointer member

@omnisci('double(double[], int64)')
def get_array_member(array, idx):
    return array[idx]

Getting the size of an array

@omnisci('int64(double[])')
def get_array_size(array):
    return len(array)

Checking for null

You can check either whether an array is null or whether an array value is null:

@omnisci('int8(double[])')
def is_array_null(array):
    return array.is_null()

@omnisci('int8(double[], int64)')
def is_array_value_null(array, idx):
    return array.is_null(idx)