Home location and causal modeling

This wonderful visual presentation of machine learning got me thinking. I know the example is meant only as an illustration of things like classification, statistical association, measuring performance, and the difference between train and test error, but I think the example is also illustrative of some of the problems I have been exploring in music content analysis research. (And this gives me the opportunity to apply and probably abuse some of the things I have been reading from Pearl’s Causality. I may thus edit the below as my understanding grows.)

In the presentation above, we have a dataset posing a classification problem: label each data point in the factor C with levels {SF, NYC}. For simplicity, let’s consider each data point to have just two dimensions: elevation (E) and price in $/ft^2 (P). Using E alone gives a train error of 0.18 with the decision boundary at 28 feet. Using E and P reduces the train error by a few more points. We thus see a statistical association between the factors C, E and P. If you give me a level of E and of P (for some data point with a level of C), then I expect my prediction of the corresponding level of C will be correct with a probability better than random guessing (or than always selecting NYC, which has a train error of 0.44). This means that the factors C, E and P are not independent of one another, even conditionally. Formally, denote the probability measure P(p,e,c) over the sample space of $P \cup E \cup C$. Then, from the results above: $P(c|e,p) \ne P(c|e)$ and $P(c|e,p) \ne P(c|p)$; $P(e|c,p) \ne P(e|c)$ and $P(e|c,p) \ne P(e|p)$; $P(p|e,c) \ne P(p|e)$ and $P(p|e,c) \ne P(p|c)$. In other words, E brings unique information about C that P does not bring, and so on for the other permutations.
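To make the decision-boundary idea concrete, here is a minimal sketch in Python. The ten data points are invented for illustration (they are not the dataset from the presentation), and the helper names `stump_error`, `best_threshold` and `majority_error` are hypothetical. It only shows how a single boundary on E is scored against the "always guess the most common city" baseline.

```python
def stump_error(elevations, cities, threshold):
    """Train error of the rule: predict SF above the threshold, NYC otherwise."""
    wrong = sum(("SF" if e > threshold else "NYC") != c
                for e, c in zip(elevations, cities))
    return wrong / len(cities)


def best_threshold(elevations, cities):
    """Scan midpoints between sorted distinct elevations; return (threshold, error)."""
    xs = sorted(set(elevations))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    scored = ((t, stump_error(elevations, cities, t)) for t in candidates)
    return min(scored, key=lambda pair: pair[1])


def majority_error(cities):
    """Train error of always predicting the most common city."""
    most_common = max(set(cities), key=cities.count)
    return sum(c != most_common for c in cities) / len(cities)


# Invented toy data: elevation in feet and the city of each home.
elev = [5, 10, 12, 20, 25, 30, 40, 60, 80, 15]
city = ["NYC", "NYC", "NYC", "NYC", "SF", "NYC", "SF", "SF", "SF", "NYC"]

print(best_threshold(elev, city))  # (22.5, 0.1)
print(majority_error(city))        # 0.4
```

On this toy data the single boundary beats the baseline, which is all that "statistical association between C and E" means here; nothing in the computation says which factor causes which.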

Now, knowing these relationships, we might posit the simple causal diagram below to explain the associations we find in our dataset.
This directed acyclic graph (DAG), E ← C → P, encodes directions of causality and dependencies between factors that one believes exist in the dataset sampled from the real world. It explicitly says the following: 1) C is the common cause (parent) of P and E; and 2) P and E are conditionally independent given C. We can thereby decompose the joint probability distribution:

P(p,e,c) = P(c)P(e|c)P(p|c)

defined over $P \cup E \cup C$.

By Bayes, we can express this equivalently as P(p,e,c) = P(e)P(c|e)P(p|c). This changes the DAG above to the chain E → C → P.
This second DAG encodes the idea that elevation causes the city, and the city causes the house price. Though P(p,e,c) = P(c)P(e|c)P(p|c) = P(e)P(c|e)P(p|c), the difference between the DAGs appears when we perform an intervention. For instance, if we force E=e, the first DAG shows P is unaffected, while in the second DAG it is affected. In the first DAG, this intervention gives

P_{E=e}(p,c) = P(c)P(p|c),

while in the second DAG it produces

P_{E=e}(p,c) = P(c|E=e)P(p|c).

These are equivalent only when C is independent of E, i.e., the link between E and C does not exist.
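The contrast between the two DAGs can be checked numerically. The sketch below uses made-up conditional tables over two-level factors (the levels "low"/"high" and "cheap"/"dear" and all the numbers are assumptions for illustration). It verifies that both factorizations give the same observational joint, and then applies the two intervention formulas above.

```python
from fractions import Fraction as F
from itertools import product

CITIES, ELEVATIONS, PRICES = ("SF", "NYC"), ("low", "high"), ("cheap", "dear")

# Hypothetical conditional tables for the first DAG (E <- C -> P).
p_c = {"SF": F(1, 2), "NYC": F(1, 2)}
p_e_given_c = {"SF": {"low": F(1, 4), "high": F(3, 4)},
               "NYC": {"low": F(3, 4), "high": F(1, 4)}}
p_p_given_c = {"SF": {"cheap": F(1, 5), "dear": F(4, 5)},
               "NYC": {"cheap": F(3, 5), "dear": F(2, 5)}}

# Bayes' rule gives the tables needed by the second DAG (E -> C -> P).
p_e = {e: sum(p_c[c] * p_e_given_c[c][e] for c in CITIES) for e in ELEVATIONS}
p_c_given_e = {e: {c: p_c[c] * p_e_given_c[c][e] / p_e[e] for c in CITIES}
               for e in ELEVATIONS}

joint_dag1 = {(p, e, c): p_c[c] * p_e_given_c[c][e] * p_p_given_c[c][p]
              for p, e, c in product(PRICES, ELEVATIONS, CITIES)}
joint_dag2 = {(p, e, c): p_e[e] * p_c_given_e[e][c] * p_p_given_c[c][p]
              for p, e, c in product(PRICES, ELEVATIONS, CITIES)}


def do_e_dag1(e):
    """P_{E=e}(p,c) = P(c)P(p|c): forcing E removes its only arrow (from C)."""
    return {(p, c): p_c[c] * p_p_given_c[c][p]
            for p, c in product(PRICES, CITIES)}


def do_e_dag2(e):
    """P_{E=e}(p,c) = P(c|E=e)P(p|c): E is a root, so forcing it shifts C and P."""
    return {(p, c): p_c_given_e[e][c] * p_p_given_c[c][p]
            for p, c in product(PRICES, CITIES)}


def price_marginal(dist, price):
    """Sum an intervened distribution over C to get the probability of a price level."""
    return sum(w for (p, c), w in dist.items() if p == price)


print(joint_dag1 == joint_dag2)                      # True
print(price_marginal(do_e_dag1("high"), "dear"))     # 3/5
print(price_marginal(do_e_dag2("high"), "dear"))     # 7/10
```

The two models are indistinguishable from observational data alone, yet they disagree about what happens to prices when elevation is forced, which is the point of the paragraph above.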

So, from our knowledge of how things work in the real world, and the fact that our prediction of C from E and P is not perfect, there have to be other factors involved. For instance, we know that E and P are not independent: $P(e|p) \ne P(e)$ and $P(p|e) \ne P(p)$. We also expect this dependence to persist within a single city, where C cannot account for it: a home could be in the flood plain, or could have city views. This suggests an arrow from E to P. Furthermore, we know P is caused by many more things than E and C, for instance, the condition of the home, the state of the US economy, the ability of the agent, the saturation of the market, the season, and so on. These are “exogenous” factors. Hence, let’s pose the following more complex causal model.

The factor P is caused by C, E and the exogenous factor U (which encompasses all other factors not in C and E). Now, we can decompose the joint probability by:

P(p,e,c,u) = P(c)P(e|c)P(p|e,c,u)P(u)

defined over $P \cup E \cup C \cup U$. What is funny here is that though the graph lacks a direct line between U and C, they become statistically dependent when we condition on P (or P and E). In other words, $P(c|p,u) \ne P(c|p)$. In the terminology of Pearl, P acts as a collider, and links together C and U. If we know something about P, then information about C tells us something about U and vice versa. Furthermore, since this model says P is not entirely caused by C and E, it can thus explain our imperfect prediction of C from E and P.
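The collider effect can be seen by brute-force enumeration. In this sketch every factor is binary (0/1), P depends on E, C and U as the text describes, and all the probabilities are invented for illustration; `joint` and `prob` are hypothetical helper names.

```python
from fractions import Fraction as F
from itertools import product

NAMES = ("c", "e", "u", "p")


def joint():
    """Enumerate P(p,e,c,u) = P(c)P(e|c)P(u)P(p|e,c,u) over binary levels."""
    table = {}
    for c, e, u, p in product((0, 1), repeat=4):
        pc = F(1, 2)
        pe = F(3, 4) if e == c else F(1, 4)       # E tends to follow C
        pu = F(1, 2)                              # U is exogenous
        p_dear = F(1 + 2 * c + e + 2 * u, 8)      # made-up P(p=1 | e, c, u)
        pp = p_dear if p == 1 else 1 - p_dear
        table[(c, e, u, p)] = pc * pe * pu * pp
    return table


def prob(table, query, given=None):
    """Conditional probability P(query | given); both are dicts like {'c': 1}."""
    given = given or {}

    def matches(outcome, cond):
        return all(outcome[NAMES.index(k)] == v for k, v in cond.items())

    den = sum(w for o, w in table.items() if matches(o, given))
    num = sum(w for o, w in table.items()
              if matches(o, given) and matches(o, query))
    return num / den


t = joint()
print(prob(t, {"c": 1}), prob(t, {"c": 1}, {"u": 1}))   # 1/2 1/2
print(prob(t, {"c": 1}, {"p": 1}))                      # 19/28
print(prob(t, {"c": 1}, {"p": 1, "u": 0}))              # 3/4
print(prob(t, {"c": 1}, {"p": 1, "u": 1}))              # 23/36
```

Without conditioning on P, learning U tells us nothing about C. Once P is known, the answer for C changes with U, even though no arrow joins them.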

This is all well and good, but we know in the real world that cities have many elevations. Thus, there cannot exist a map from C to E. There does, however, exist a map from position to elevation (no spot of land has two elevations). So, let us define a new causal factor L (latitude/longitude) to be the common cause of C and E. In other words, we say that there exists a function $f: L \to E$, a function $g: L \to C$, and consequently there exists no direct connection between E and C. You tell me L, and I will tell you the C and the E (e.g., by consulting a map). We thus posit the following causal graph.

Now, knowing L tells us C, and E adds nothing. Knowing L also tells us E, and C adds nothing. Define the compatible probability measure P(p,e,c,l,u) over the sample space of $P \cup E \cup C \cup L \cup U$, which by the graph must be

P(p,e,c,l,u) = P(u)P(l)P(e|l)P(c|l)P(p|e,c,u)

over the levels of the factors.

This graph says that the outcome of C is entirely caused by the outcome of L (L is mapped to C). It also implies that the outcome of E is entirely caused by the outcome of L (specifically, the topography of the earth is a function of the level of L). From its directed arrows, this graph is saying that an outcome of L is not caused by an outcome of C, or by an outcome of E. Of course, the arbitrary system of latitude/longitude does not cause the earth to rise and fall at particular positions, and vice versa; but the direction of the arrows means that L is mapped to E, but not vice versa; and L is mapped to C, but not vice versa. The graph also says C is conditionally independent of E given L, or P(c|e,l) = P(c|l). (L “d-separates” C from E.) However, what is weird is that conditioning on L and P can make E and C conditionally dependent, but conditioning on L and U will not: $P(c|e,l,p) \ne P(c|l,p)$, but P(c|e,l,u) = P(c|l). (While L alone, or L together with U, “d-separates” C from E, L together with P “d-connects” C and E.)
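These d-separation claims can be checked mechanically. The sketch below is one standard way to test d-separation (restrict to the ancestors of the factors involved, "moralize" by linking parents that share a child, drop the conditioning set, and look for a remaining path). It is not taken from the original post, and the function names are hypothetical. The graph is the one implied by the factorization above: L → E, L → C, and E, C, U → P.

```python
def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents.get(n, ()))
    return seen


def d_separated(parents, xs, ys, zs):
    """True if every path between xs and ys is blocked by the set zs."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors(parents, xs | ys | zs)

    # Build the moral graph of the ancestral subgraph (undirected).
    nbrs = {n: set() for n in keep}
    for child in keep:
        pa = list(parents.get(child, ()))
        for p in pa:                      # parent-child edges
            nbrs[child].add(p)
            nbrs[p].add(child)
        for i, a in enumerate(pa):        # "marry" co-parents
            for b in pa[i + 1:]:
                nbrs[a].add(b)
                nbrs[b].add(a)

    # Search from xs, never stepping onto a conditioned node.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        if n in ys:
            return False
        stack.extend(m for m in nbrs[n] if m not in zs and m not in seen)
    return True


# Each factor mapped to its parents.
GRAPH = {"L": [], "U": [], "E": ["L"], "C": ["L"], "P": ["E", "C", "U"]}

print(d_separated(GRAPH, {"C"}, {"E"}, {"L"}))        # True
print(d_separated(GRAPH, {"C"}, {"E"}, {"L", "U"}))   # True
print(d_separated(GRAPH, {"C"}, {"E"}, {"L", "P"}))   # False
```

Conditioning on the collider P is what opens the path between C and E; conditioning on U, which is only a parent of P, does not.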

Going further, we know from the real world that there are many $l \in L$ for a given $c \in C$, but only one $c \in C$ for a given $l \in L$ (no city is a singularity). Hence, L encompasses all the information that is in E and C and in fact contributes more to the cause of P. This suggests that we can see E and C as caused by L, but not causes of P. So, we redraw our causal diagram:


where the factor Z accounts for the fact that we know the price of a home is not caused by the home’s earthly position (for instance, a change in the coordinate system on the earth should not affect prices) but by factors relatable to it (proximity to good schools, shops, population in the area, crime statistics, etc.). This model now implies that the factors E and C are actually conditionally independent of P given L, i.e., P(e,c|l,p) = P(e,c|l). (L d-separates E and C from P.)
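The independence claim can be derived from the factorization this graph implies. The derivation below assumes the redrawn diagram has the arrows L → E, L → C, L → Z, Z → P and U → P; that reading is inferred from the surrounding text, since the figure itself is not reproduced here.

```latex
\begin{align*}
P(p,e,c,l,z,u) &= P(u)\,P(l)\,P(e|l)\,P(c|l)\,P(z|l)\,P(p|z,u) \\
P(p,e,c,l) &= P(l)\,P(e|l)\,P(c|l) \sum_{z,u} P(u)\,P(z|l)\,P(p|z,u)
            = P(l)\,P(e|l)\,P(c|l)\,P(p|l) \\
P(e,c|l,p) &= \frac{P(p,e,c,l)}{P(l)\,P(p|l)} = P(e|l)\,P(c|l) = P(e,c|l)
\end{align*}
```

Summing out Z and U leaves a term that depends only on p and l, so once L is known, P carries no further information about E or C.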

This model may or may not be correct, but it encodes our beliefs about the real-world generating process of the data we collect. Pearl calls causal modeling “an induction game scientists play against Nature.”

Returning to the fact that in the machine learning post we see how one can predict C given E and P, our final causal model above posits that this is possible only through the factor L, a dimension that our dataset does not include. Knowing the levels of E and P, but not their causal relationship to L, we can still make an inference on C irrespective of the causal relationships (or lack thereof) between the factors. This is not to say that using E and P to predict C instead of L is “incorrect.” We may not know the causal relationship between them. Or, if we did, it might be prohibitively expensive to observe the common cause factor.

What is not correct is to conclude from our ability to predict C from E and/or P in our dataset that C is caused by E and P, or that E is caused by C and P, or that P is caused by E and C. (To test causal relationships, one must employ interventions, experimental design and analysis, etc.) Likewise, it is not correct to conclude that E and P are “relevant” or “important” to predicting C, or that E is “more relevant” or “more important” than P to predicting C. The “relevance” could be spurious, confounded with the effects of an unknown or unmeasured factor. Given the unknown provenance of the data, one may not even conclude that the test errors are “good estimates” or “generally good estimates” or “reliable indicators” of the true errors.
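A toy version of this warning, as a minimal sketch: eight invented locations stand in for L, with hypothetical maps `city` (playing the role of g) and `elevation` (playing the role of f). The 28-foot boundary is borrowed from the presentation; every other number is made up.

```python
LOCATIONS = range(8)                            # stand-in levels of L
ELEVATION_FT = [3, 8, 5, 40, 30, 60, 10, 90]    # invented f: L -> E


def city(l):
    """Invented g: L -> C. The city depends only on the location."""
    return "NYC" if l < 4 else "SF"


def elevation(l):
    """Observed elevation at a location."""
    return ELEVATION_FT[l]


def predict_city(e, boundary=28):
    """The learned rule: predict SF above the boundary, NYC otherwise."""
    return "SF" if e > boundary else "NYC"


def error_rate(elevation_of):
    """Error of the rule when elevations are produced by `elevation_of`."""
    wrong = sum(predict_city(elevation_of(l)) != city(l) for l in LOCATIONS)
    return wrong / len(LOCATIONS)


observed = error_rate(elevation)           # elevations as nature made them
intervened = error_rate(lambda l: 100)     # force E = 100 ft everywhere

print(observed)     # 0.25
print(intervened)   # 0.5
```

In the observed data, E predicts C well. Forcing every home to 100 feet moves no home to San Francisco: the cities stay where L puts them, and the rule collapses to the 0.5 error of guessing. The predictive success came from the shared cause L, not from E causing C.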

