-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Algorithm for counting the number of consistent extensions (size of an MEC given a CPDAG) #113
base: master
Are you sure you want to change the base?
Conversation
Ah, that's good timing, you exactly can use my Dor and Tasi algorithm for the (Link is down: Here the arxiv version https://arxiv.org/abs/1209.5860) |
Yep, that fits well, I'm very happy with the general direction (also adding GES etc). Btw you can also check whether a graph is a CPDAG in linear-time (which you wouldn't get with Dor-Tarsi) as it is possible to construct a consistent extension of a CPDAG in linear-time (if the input graph is not a CPDAG this algorithm will just output a garbage DAG and DAG-to-CPDAG will find some other graph not equal to the input one, so the check still works). Maybe I will add this as well... Regarding sanity checks: Testing whether the output of sample PC is a CPDAG sounds fine and is a safe choice. It will likely (at least for moderately large graphs) almost always find that the graph is not a CPDAG. Well, let's just see how this develops... Generally, want to keep expectations about the timeline of this PR low, have to finish my PhD thesis 😅
That's very interesting! Now that you mention sampling, I totally forgot to add above that I also have code to uniformly sample a member of an MEC efficiently (it is based on the counting method and is an exact sampler) 😅 My prior research was kind of orthogonal to (counting or) sampling CPDAGs (focussing on the combinatorial and algorithmic aspects of a single MEC not the space of MECs). At some point I skimmed the paper https://projecteuclid.org/journals/annals-of-statistics/volume-41/issue-4/Reversible-MCMC-on-Markov-equivalence-classes-of-sparse-directed-acyclic/10.1214/13-AOS1125.full and I also thought about some related problems, but never really got far. |
Strange, 10.1214/13-AOS1125 do nothing like that, see algorithm 1, and therefore sample proportional to the number of CPDAG neighbours the CPDAG has. |
Correction: They do count or estimate the number of moves for each CPDAG |
What I would find interesting in addition to the settings in the paper would be to sample MECs/CPDAGs with some control over the number of v-structures. E.g., similar to parameter I find this rather natural because MECs are determined by skeleton and v-structures, why not play around with the latter as well. Also this is in contrast to what is currently done e.g. in benchmarking causal discovery algorithms where usually random DAGs are sampled which by construction have an extremely large number of v-structures (provided they are not too sparse). In the random DAG model typically used, if you have a collider a -> b <- c (the number of which grows quadratically with the indegree) the probability of it being shielded is basically the edge probability |
Let's have a chat on https://julialang.zulipchat.com about it if you like. |
Do you want to pick this up at some point? |
I don't think so at the moment, we can close it for now. |
Next step I think is adding the following methods, most of which are already implemented, but I'm still thinking about the proper way to integrate them (and need some time to polish them):
First some explanation, what is the difference between PDAGs and CPDAGs (how I use the terms) and why it is relevant:
A CPDAG represents a Markov equivalence class (MEC) and has the same skeleton and v-structures as the DAGs in the MEC plus additional orient edges that follow from the v-structures by the Meek rules. Hence, for MEC {a -> b -> c, a <- b <- c, a <- b -> c} the CPDAG would be a - b - c.
In contrast a <- b - c is not a CPDAG (even if it is fully oriented by the Meek rules) because it does not represent (via the def. of consistent extensions) a full Markov equivalence class, but only the two DAGs {a <- b -> c and a <- b <- c} which form a subclass of the one above. (Such graphs (which are completed under the Meek rules but non necessarily CPDAGs) are also called MPDAGs.)
The first method for counting the number of consistent extensions only works for CPDAGs (and some generalizations), but not for PDAGs or MPDAGs in general. In particular, applying it to a graph which is not a CPDAG could simply give the wrong result (e.g., there are x consistent extensions but the algorithm will spit out the number y). The reason for this is that CPDAGs have a number of nice properties that the algorithm relies on (e.g. that a - b - c implies that there cannot be a -> c, which we use in GES; this holds neither for PDAGs nor for MPDAGs).
This is all fine for the GES algorithm, which is guaranteed to output a CPDAG, but the PC algorithm might not. The PC algorithm could (when mistakes happen during independence testing) output graphs which are not CPDAGs. Actually, this happens extremely often (from my experience) and the graph obtained by learning the skeleton and the v-structures moreover often has no consistent extension (i.e., Dor-Tarsi would output an error for such an input in the current implementation). It has to be said that applying the Meek rules to a PDAG, which has no consistent extension, might make the graph "extendable" again: E.g. after skeleton and v-structure phase one might have (due to errors in independence testing) a -> b <- c and g -> f <- h and b - d - e - f. Then, applying the Meek rules will fully orient the chain b - d - e - f (in which way depends on the implementation and vertex numbering). The graph is then a DAG and thus trivially has itself as consistent extension (what happened here is that the Meek rules introduced a new v-structure; this can happen if the given PDAG is not extendable).
The most unopinionated way of including the methods is to just give proper warning about the output of the PC algorithm. One could also include a check into the functions that expect a CPDAG, i.e., they first test whether the graph is a CPDAG/fulfills the necessary conditions for the algorithm to output the correct result (but that would of course make the methods slower).
Generally, I think that something along those lines would be fine (and in any case, I'm probably going to polish the algorithms itself for a while before we have to settle for a solution to this), but I wanted to draw attention to this problem.
There are a lot of practical methods, which are build on top of observational causal discovery, which to a higher or lesser degree demand the output graph to be a CPDAG and it's not great that the sample PC algorithm does not provide one. I guess the reason why people still often prefer standard PC is that every orientation in the output graph comes (more or less) directly from conditional independencies which via the Markov property have a rather clear causal interpretation (though even here applying the Meek rules might introduce (arbitrary) new v-structures in case of errors). If one were to e.g. try to augment the output of PC to be a CPDAG, the causal interpretation is certainly less clear.