Find the neighborhood of the given subgraph (S); informally, the set of nodes immediately reachable from that subgraph.
There's an additional constraint in that the edges used to do so must not touch the forbidden set of nodes (X). The DPhyp paper calls this function N(S, X) (with a calligraphic N).
How to calculate the neighborhood efficiently is one of the least explicitly described parts of the paper. The definition goes about as follows:
- Find E↓'(S,X), the set of “interesting hypernodes” (outgoing edge destinations from S). These are the (endpoints of) edges that have one side entirely within S, that have the other side entirely outside S, and none of the sides touch the forbidden set X.
- Minimize E↓'(S,X) by removing all “subsumed hypernodes”, giving E↓(S,X). u subsumes v if it is a proper subset; if so, we can never go to where v points to before we've been at u, so it's pointless to keep v.
- For each hypernode in E↓(S,X), pick node with lowest index as a representative, because our subset enumeration algorithms cannot enumerate subsets of hypernodes, only subsets of normal nodes. (Actually, any node that's part of the hypernode would do; it does not even need to be consistent.) These nodes together constitute the neighborhood.
There are a couple of points to note here:
First, adding more nodes than needed to the neighborhood does not affect correctness of the algorithm, only speed. We try all combinations of included/excluded for the neighborhood (2^N in the number of nodes), so this covers all potential subgraphs; in theory, we could even just choose all non-forbidden nodes and reduce to the algorithm known as DPhyp, it just wouldn't be very efficient.
Second, step 3 means that we may very well end up with a non-connected subgraph. This is harmless; we may eventually grow it to a connected one or we may not, we just won't start looking for any complements until we have a connected one (and we know whether it's connected or not based on whether we saw it as a csg-cmp pair in the algorithm earlier).
Third, due to the way we grow our subgraph, only the nodes that we have just grown by can contribute to the E↓'(S,X). The reason is simple; every node from the previous neighborhood will have been added either to S or to X, and both exclude them from the new neighborhood. (Step 2 doesn't affect this, as any hypernode that was subsumed would also have to touch S or X. But there's an exception in that occasionally, we can remove nodes from X; see ExpandSubgraph().)
Fourth, perfect minimization seems to be impossible to actually implement efficiently. This is known as the minimum set problem, and the best known algorithms to do this are in O(n² / log n) time (see e.g. Pritchard: ”An Old Sub-Quadratic Algorithm for Finding Extremal Sets”), which can be quite a lot when there are lots of edges. (The trivial O(n²) algorithm is to just test every set against every smaller set, and throw it out if it's a superset.) Since loops in our hypergraphs are presumed to be fairly rare, it would not seem worth it to do full minimization.
Instead, we pick the low-hanging fruit only: Every simple edge is trivial to test against. We just collect the simple edges into a mask, and any (complex) hyperedge that overlaps with that bitmap can immediately be discarded. Even more, since we don't have to pick min(S) but can pick something arbitrary, we can let {R2,R3} (which gets R2 as its representative node) subsume {R1,R2}, even though it's not an actual subset, by pretending we picked R2 as the representative node for the latter! This is similar to what Moerkotte describes in his “Building Query Compilers” document, which seems to contain a slightly extended version of the DPhyp paper (under a different name). We could have collected all the simple edges in a separate pass first, but the microbenchmarks show that the added loop overhead isn't worth it.
Note that we also keep E↓'(S,X), the set of interesting hypernodes; we bitwise-or it into “full_neighborhood”. This is useful later when searching for edges to connect the connected subgraph and its complement; we know that only edges into “full_neighborhood” can connect the two.
This function accounts for roughly 20–70% of the total DPhyp running time, depending on the shape of the graph (~40% average across the microbenchmarks). It is fairly big to inline, but it helps speed significantly, probably due to the large amount of parameters to be passed back and forth.