This file implements the DPhyp algorithm for enumerating connected subgraphs of hypergraphs (see hypergraph.h for a hypergraph definition). More...

#include <assert.h>
#include <string>
#include "sql/join_optimizer/bit_utils.h"
#include "sql/join_optimizer/hypergraph.h"

Classes
class	hypergraph::NeighborhoodCache

Namespaces
namespace	hypergraph

Macros
#define	DEBUGGING_DPHYP 0

#define	HYPERGRAPH_PRINTF(...)

Functions
template<class Receiver >
bool	hypergraph::EnumerateAllConnectedPartitions (Receiver *receiver)

std::string	hypergraph::PrintSet (NodeMap x)

NodeMap	hypergraph::FindNeighborhood (const Hypergraph &g, NodeMap subgraph, NodeMap forbidden, NodeMap just_grown_by, NeighborhoodCache cache, NodeMap full_neighborhood_arg)
	Find the neighborhood of the given subgraph (S); informally, the set of nodes immediately reachable from that subgraph. More...

template<class Receiver >
bool	hypergraph::EnumerateComplementsTo (const Hypergraph &g, size_t lowest_node_idx, NodeMap subgraph, NodeMap full_neighborhood, NodeMap neighborhood, Receiver *receiver)

template<class Receiver >
bool	hypergraph::ExpandSubgraph (const Hypergraph &g, size_t lowest_node_idx, NodeMap subgraph, NodeMap full_neighborhood, NodeMap neighborhood, NodeMap forbidden, Receiver *receiver)

template<class Receiver >
bool	hypergraph::TryConnecting (const Hypergraph &g, NodeMap subgraph, NodeMap subgraph_full_neighborhood, NodeMap complement, Receiver *receiver)

template<class Receiver >
bool	hypergraph::ExpandComplement (const Hypergraph &g, size_t lowest_node_idx, NodeMap subgraph, NodeMap subgraph_full_neighborhood, NodeMap complement, NodeMap neighborhood, NodeMap forbidden, Receiver *receiver)

template<class Receiver >
bool	hypergraph::EnumerateAllConnectedPartitions (const Hypergraph &g, Receiver *receiver)

Detailed Description

This file implements the DPhyp algorithm for enumerating connected subgraphs of hypergraphs (see hypergraph.h for a hypergraph definition).

The core idea of the algorithm is that if the join structure of a query is expressed as a hypergraph, where the relations are nodes and the join predicates are hyperedges, one can efficiently find all legal join orders without Cartesian products by finding all possible subpartitions of the hypergraph. (Simple inner joins will have regular edges, but outer joins, antijoins etc., can be encoded as hyperedges to constrain the allowed join orderings, so that we do not join e.g. an inner and outer table together before said inner table has been joined to the entire set. Also, hyper-predicates such as t1.a + t2.b = t3.c will naturally give rise to hyperedges.)

The algorithm is described in the paper “Dynamic Programming Strikes Back” by Neumann and Moerkotte. There is a somewhat extended version of the paper (that also contains a few corrections) in Moerkotte's treatise “Building Query Compilers”. Some critical details are still missing, which we've had to fill in ourselves. We don't currently implement the extension to generalized hypergraphs, but it should be fairly straightforward to do later. The algorithm is simple in concept but hard to grasp; we will only give a very rough outline here:

Pick a seed node of the graph.
Grow that seed along hyperedges, taking care never to make an unconnected graph or seeing the same subgraph twice.
For each connected subgraph (csg): Repeat steps 1–2 independently to create a separate connected subgraph (the so-called complement, cmp), and try to connect the subgraph and its complement to create a larger graph (a so-called csg-cmp-pair).
When such a csg-cmp-pair is found, call the receiver back with the csg and cmp. This is a valid subjoin that can be costed.

The entry point for doing this is EnumerateAllConnectedPartitions().

For complex joins, we may have to run DPhyp multiple times in a mode where we just count the number of partitions over various constrained graphs, and this will be a critical part of query planning time. Thus, it is coded as a template over a receiver type that gets callbacks for each partition. If the receiver just interested in counting, this saves a significant amount of call overhead. The templatization also allows the microbenchmarks to more accurately measure changes in the algorithm itself without having to benchmark the receiver.

Classes

Namespaces

Macros

Functions

Detailed Description

Macro Definition Documentation

◆ DEBUGGING_DPHYP

◆ HYPERGRAPH_PRINTF