How reading of the connector works?

Push Down

Spark supports push down the processing of queries, or parts of queries, into the connected data source. This means that a specific predicate, aggregation function, or other operation, could be passed through to ClickHouse for processing.

The results of this push down can include the following benefits:

Improved overall query performance
Reduced network traffic between Spark and ClickHouse
Reduced load on ClickHouse

These benefits often result in significant cost reduction.

The connector implements most push down interfaces defined by DataSource V2, such as SupportsPushDownLimit, SupportsPushDownFilters, SupportsPushDownAggregates, SupportsPushDownRequiredColumns.

The below example shows how SupportsPushDownAggregates and SupportsPushDownRequiredColumns work.

Bucket Join

Sort merge join is a general solution for two large table inner join, it requires two table shuffle by join key first, then do local sort by join key in each data partition, finally do stream-stream like look up to get the final result.

In some cases, the tables store collocated by join keys, w/ Storage-Partitioned Join(or V2 Bucket Join), Spark could leverage the existing ClickHouse table layout to eliminate the expensive shuffle and sort operations.