chDB in Y minutes
chDB is an embeddable, in-process SQL OLAP engine powered by ClickHouse. It's as if SQLite and ClickHouse had an offspring (no offence to either party). chDB takes up ≈100 MB of disk space, runs on smaller machines (even in a 64 MB RAM container), and provides language bindings for Python, Node.js, Go, Rust and C/C++.
Using chDB · SQL dialect · Reading data · Writing data · User-defined functions · Python DB API · Further reading
✨ This is an open source guide. Feel free to improve it!
Using chDB
It's as simple as pip install chdb and then:
import chdb
res = chdb.query("select 42")
print(res, end="")
Using a database engine to select the number 42 is probably not very exciting, but bear with me.
SQL dialect
chDB is a wrapper around ClickHouse, so it supports exactly the same SQL syntax, including joins, CTEs, set operations, aggregations and window functions.
For example, let's create a sampled table of 10000 random numbers and calculate the mean and 95th percentile:
from chdb.session import Session
db = Session()
db.query("create database db")
db.query("use db")
db.query("""
create table data (id UInt32, x UInt32)
engine MergeTree order by id sample by id
as
select number+1 as id, randUniform(1, 100) as x
from numbers(10000);
""")
query_sql = """
select
avg(x) as "avg",
round(quantile(0.95)(x), 2) as p95
from data
sample 0.1
settings output_format_pretty_row_numbers = 0;
"""
res = db.query(query_sql, "PrettyCompactNoEscapes")
print(res, end="")
Note a couple of things here:
- Session provides a stateful database connection (the data is stored in the temporary folder and discarded when the connection is closed).
- The second argument to the query method specifies the output format. There are many supported formats such as CSV, SQLInsert, JSON and XML (try changing the format in the above example and re-running the code). The default one is CSV.
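Since CTEs and window functions were mentioned above but the example only exercises aggregation and sampling, here is a minimal sketch combining both, reusing the db session and the data table created earlier and switching the output format to JSON:
query_sql = """
with ranked as (
    select x, row_number() over (order by x desc) as rn
    from data
)
select x, rn
from ranked
where rn <= 3;
"""
# top 3 values of x, returned as a JSON document
res = db.query(query_sql, "JSON")
print(res, end="")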
Reading data
As with output formats, chDB supports any input format supported by ClickHouse.
For example, we can read a dataset from CSV:
query_sql = "select * from 'employees.csv'"
res = chdb.query(query_sql, "PrettyCompactNoEscapes")
print(
    f"{res.rows_read()} rows | "
    f"{res.bytes_read()} bytes | "
    f"{res.elapsed()} seconds"
)
Or work with an external dataset as if it were a database table:
query_sql = """
select distinct city
from 'employees.csv'
"""
res = chdb.query(query_sql, "CSV")
print(res, end="")
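We can also name the input format explicitly with ClickHouse's file() table function instead of relying on autodetection. A minimal sketch, assuming employees.csv has a header row (hence the CSVWithNames format):
# count the rows, reading the file with an explicit input format
query_sql = "select count(*) as employees from file('employees.csv', CSVWithNames)"
res = chdb.query(query_sql, "CSV")
print(res, end="")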
We can even query Pandas dataframes as if they were tables:
import chdb.dataframe as cdf
import pandas as pd
employees = pd.read_csv("employees.csv")
departments = pd.read_csv("departments.csv")
query_sql = """
select
emp_id, first_name,
dep.name as dep_name,
salary
from __emp__ as emp
join __dep__ as dep using(dep_id)
order by salary desc;
"""
res = cdf.query(sql=query_sql, emp=employees, dep=departments)
print(res, end="")
Writing data
The easiest way to export data is to use the output format (the second parameter in the query method), and then write the data to disk:
from pathlib import Path
query_sql = "select * from 'employees.csv'"
res = chdb.query(query_sql, "Parquet")
# export to Parquet
path = Path("/tmp/employees.parquet")
path.write_bytes(res.bytes())
# import from Parquet
query_sql = """
select * from '/tmp/employees.parquet' limit 5
settings output_format_pretty_row_numbers = 0
"""
res = chdb.query(query_sql, "PrettyCompactNoEscapes")
print(res, end="")
We can also easily convert the chDB result object into a PyArrow table:
query_sql = "select * from 'employees.csv'"
res = chdb.query(query_sql, "Arrow")
table = chdb.to_arrowTable(res)
print(table.schema)
Or Pandas dataframe:
query_sql = "select * from 'employees.csv'"
res = chdb.query(query_sql, "Arrow")
frame = chdb.to_df(res)
frame.info()
To persist a chDB session to a specific folder on disk, use the path constructor parameter. This way you can restore the session later:
from chdb.session import Session
# create a persistent session
db = Session(path="/tmp/employees")
# create a database and a table
db.query("create database db")
db.query("""
create table db.employees (
emp_id UInt32 primary key,
first_name String, last_name String,
birth_dt Date, hire_dt Date,
dep_id String, city String,
salary UInt32
) engine MergeTree;
""")
# load data into the table
db.query("""
insert into db.employees
select * from 'employees.csv'
""")
# ...
# restore the session later
db = Session(path="/tmp/employees")
# query the data
res = db.query("select count(*) from db.employees")
print(res, end="")
User-defined functions
We can define a function in Python and use it in chDB SQL queries.
Here is a split_part function that splits a string on the given separator and returns the field at the given index (counting from one):
from chdb.udf import chdb_udf
@chdb_udf()
def split_part(s, sep, idx):
    idx = int(idx) - 1
    return s.split(sep)[idx]
second = chdb.query("select split_part('a;b;c', ';', 2)")
print(second, end="")
And here is a sumn function that calculates a sum from 1 to N:
from chdb.udf import chdb_udf
@chdb_udf(return_type="Int32")
def sumn(n):
    n = int(n)
    return n*(n+1)//2
sum20 = chdb.query("select sumn(20)")
print(sum20, end="")
Currently chDB only supports scalar functions, and every parameter is passed to the function as a string. If the function returns a type other than string, we should declare it with the return_type argument of the chdb_udf decorator.
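A UDF should also work on column values, not just constants. Here is a small sketch (assuming the split_part function from above is already registered in the same Python process and is applied row by row):
res = chdb.query("""
select split_part(line, ',', 2) as second_field
from (select arrayJoin(['a,b,c', 'x,y,z']) as line)
""")
print(res, end="")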
Python Database API
The chDB Python package adheres to the Python DB API (PEP 249), so you can use it just like you'd use stdlib's sqlite3 module:
from contextlib import closing
from chdb import dbapi
print(f"chdb version: {dbapi.get_client_info()}")
with closing(dbapi.connect()) as conn:
    with closing(conn.cursor()) as cur:
        cur.execute("select version()")
        print("description:", cur.description)
        print("data:", cur.fetchone())
Further reading
See the chDB documentation for details on using chDB with other programming languages, sample Jupyter notebooks, and SQL reference.
Anton Zhiyanov · original · CC-BY-SA-4.0 · 2023-12-11