Hans-Juergen Schoenig: Data locality: Scaling PostgreSQL with Citus intelligently

原文英文,约1300词,阅读约需5分钟。发表于:

While sharding is often advertised as “THE solution to PostgreSQL scalability”, it is necessary to keep some technical aspects in consideration in terms of performance. The rule is: Sharding should not be used without a deeper awareness of what it is you are actually doing to the data. It’s important to keep in mind that sharding has to be applied in a clever and thoughtful manner. One of the most common mistakes is to ignore the concept of “data locality”. It’s important for many IT problems, but crucial in the context of database sharding. Citus is one of the most sophisticated sharding solutions in the PostgreSQL world. It can help you to achieve maximum scalability and allows for efficient analytics as well as OLTP. Citus is available on-premise or as part of the Microsoft Azure cloud. What is data locality? Let’s take a look together. Preparing data for sharding To demonstrate the concept, we first have to create two tables. For the sake of simplicity, we’ll use customers and sales: postgres=# CREATE TABLE t_customer ( id int, name text ); CREATE TABLE postgres=# CREATE TABLE t_sales ( id int, customer_id int, whatever text ); CREATE TABLE The data model is really straightforward. In this scenario, the typical way to analyse the data is to join the customer with the sales table. Why is this relevant? To understand it, first let’s distribute the table and add some data: postgres=# SELECT create_distributed_table('t_customer', 'id'); create_distributed_table -------------------------- (1 row) postgres=# SELECT create_distributed_table('t_sales', 'id'); create_distributed_table -------------------------- (1 row) Note that the data is sharded using the “id” which is not the join criteria. In the next step, we can load some data: postgres=# INSERT INTO t_customer SELECT *, 'dummy' FROM generate_series(1, 1000000); INSERT 0 1000000 postgres=# INSERT INTO t_sales SELECT id, random()*100000, 'dummy' FROM generate_series(1, 10[...]

本文介绍了使用分片技术时需要考虑的技术细节,特别是数据本地性的概念。如果不考虑数据本地性,分片可能会导致性能下降。Citus是PostgreSQL中最复杂的分片解决方案之一,可实现最大的可扩展性,并允许高效的分析和OLTP。

Hans-Juergen Schoenig: Data locality: Scaling PostgreSQL with Citus intelligently
相关推荐 去reddit讨论