Getting Started with chDB: Your In-Process SQL OLAP Engine

Feb 5, 2023 | Programming

Welcome to the exciting world of chDB! If you’ve been looking for a powerful and efficient way to run SQL operations on your data without the need to install ClickHouse, you’ve landed in the right place. Let’s dive into how you can set up and utilize chDB to work with various data formats easily.

Why chDB?

ChDB is an in-process SQL OLAP engine powered by ClickHouse. It boasts several features, including:

  • In-process SQL OLAP Engine — no installation required
  • Minimized data copying using Python’s memoryview
  • Support for numerous formats such as Parquet, CSV, JSON, Arrow, ORC, and many more
  • Compatible with Python DB API 2.0

Installation

Before getting your hands dirty with the actual usage, you need to set up chDB on your machine. Currently, it supports Python version 3.8+ on both macOS and Linux (x86_64 and ARM64).

pip install chdb

Usage

Once installed, you’re ready to run SQL queries through chDB. Here’s how:

Command Line Execution

python3 -m chdb SQL [OutputFormat]

For example:

python3 -m chdb "SELECT 1, 'abc' Pretty"

Accessing Data Formats

You can carry out queries on various file types like Parquet, CSV, and JSON. Below is a basic rundown:

import chdb
res = chdb.query("SELECT version(), Pretty")
print(res)

Using DataFrames

Imagine treading through a garden where flowers (data) are spread across multiple beds (DataFrames). ChDB allows us to pluck these flowers seamlessly through SQL queries. For example, if you have two DataFrames:

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ['one', 'two', 'three']})
df2 = pd.DataFrame({'c': [1, 2, 3], 'd': [1, 2, 3]})

ret_tbl = cdf.query("SELECT * FROM __tbl1__ t1 JOIN __tbl2__ t2 ON t1.a = t2.c", tbl1=df1, tbl2=df2)
print(ret_tbl)

In this analogy, each DataFrame is a flower bed, and the SQL query is your garden tool, helping you gather data from them into a single bouquet (the result).

Advanced Features

chDB also supports creating stateful sessions, user-defined functions (UDFs), and deeper integration with Python types.

For instance, creating a stateful session looks something like this:

from chdb import session as chs
sess = chs.Session()
sess.query("CREATE DATABASE IF NOT EXISTS db_xxx ENGINE = Atomic")
sess.query("CREATE TABLE IF NOT EXISTS db_xxx.log_table_xxx (x String, y Int) ENGINE = Log;")
sess.query("INSERT INTO db_xxx.log_table_xxx VALUES ('a', 1), ('b', 3), ('c', 2), ('d', 5);")
print(sess.query("SELECT * FROM db_xxx.log_table_xxx, Pretty"))

Troubleshooting

If you encounter issues while using chDB, here are some troubleshooting tips:

  • Ensure you’re using Python 3.8+.
  • Check for new updates or patches in case of bugs.
  • Consult the project documentation for any missed steps.
  • If the error persists, consider reaching out to the community via Discord or refer to the chDB documentation.
  • For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Final Thoughts

With its diverse capabilities, chDB can significantly improve how we interact with data. Whether you’re querying a file or working with complex data structures, the options are nearly endless. Go ahead and explore what chDB can do for your projects!

Stay Informed with the Newest F(x) Insights and Blogs

Tech News and Blog Highlights, Straight to Your Inbox