Break time! It's been a long week, still a day to go for most of us. So how about some applied math? This was actually inspired by a discussion of financial modeling in Meteor Blade's On Economists Getting It Wrong (well worth reading, BTW), and originally posted in somewhat more finance-oriented form on the Economic Populist sometime back. But the concepts go far beyond finance modeling.
Let's say we're interested in estimating the likelihood of some event x happening. x might be a loan default, upcoming regulations, getting hit by lightning, whatever. If we happen to know something about x, we can assign some probability P(x), play the numbers, and improve our chances of a good outcome. That's a big "if", and unless P = 1 we can still (and quite possibly will) lose; still, knowledge improves our decision making, and that's the best we can do.
But what if x depends on some other event? Your chances of getting hit by lightning are different on a dry clear day in a cave vs standing on an aluminum stepladder holding up a golf club outside in a thunderstorm. That is, what is P(x|y), the probability P of x given some circumstance y? What we know about y informs what we know about x. This is the field of conditional probability, first formalized by Thomas Bayes, an 18th century English clergyman who gave his name to a whole field of study in Information Science.
In this post I'm interested in a particular subset of that same question. What is the likelihood of x at some point t occurring given x at some previous point? That is, what is the behavior of x in time? What is the trajectory of x? This is a very common problem in conditional probability and comes up all the time when trying to model probabilistic processes, such as energy transfer at the molecular level, bond yields in finance, genetic drift in populations, whatever.
If we extend the history of x all the way to the origin we need to assess the following conditional probability
P(x(t_n) | x(t_{n-1}), x(t_{n-2}), ..., x(t_1))
That is, what is the probability of finding x at t_n given x at t_{n-1}? In principle that depends on x at t_{n-2}, which in principle depends on x at t_{n-3}, which depends on ... x at the origin, t_1. It becomes helpful to classify four cases of this conditional probability.
Purely random process
For a purely random process the previous history is completely irrelevant. We can then write
P(x(t_n) | x(t_{n-1}), x(t_{n-2}), ..., x(t_1)) = P(x(t_n))
An example of this is the coin toss. You have an equal likelihood of getting heads or tails on the next toss, and it doesn't matter whether you just got 15 heads in the previous 15 tosses: the odds of getting heads on the next toss are still 0.5 (although after some point you might wonder whether the coin or the toss is rigged). The chances of getting 15 straight heads are extremely small, but given that they just happened, your chances of coming up heads on the next toss are still 0.5.
If each trial has two possible outcomes and the trials are independent (like coin tosses, or true/false), the counts follow a binomial distribution, which in the limit of a large number of events converges to a Gaussian. 15 heads in a row is in the far wing of that distribution. But your chance of getting another head is still 50%.
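If you want to see this in silicon, here's a quick Python sketch (the streak length and trial count are arbitrary choices of mine) that estimates the probability of heads immediately after a run of heads -- it comes out ~0.5, streak or no streak:

```python
import random

random.seed(42)

def next_toss_after_streak(streak=5, trials=200_000):
    """Estimate P(heads | the previous `streak` tosses were all heads).

    For a fair coin the history is irrelevant, so this should come out
    ~0.5 no matter how long the streak is."""
    hits = total = run = 0
    for _ in range(trials):
        toss = random.random() < 0.5   # True = heads
        if run >= streak:              # previous `streak` tosses were all heads
            total += 1
            hits += toss
        run = run + 1 if toss else 0
    return hits / total

print(next_toss_after_streak())        # ~0.5
```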
Pure randomness underlies the Gaussian distribution, which is generally assumed in essentially all stochastic treatments, whether the entire process is purely random or not. It is frequently a good assumption, and it is (almost) always a necessary one.
The purely random process is at worst the second easiest case to deal with, because the conditional probability chain is so short, and sometimes the easiest.
Deterministic process
For a deterministic process, the likelihood of finding x at t_n is exactly determined by x at t_{n-1}. Since that is true for all t, it follows then that x at t_{n-1} is exactly determined by x at t_{n-2}. Following that back we find that the initial condition x(t_1) determines the trajectory of x. In other words, we cannot truncate the conditional probability:
P(x(t_n) | x(t_{n-1}), x(t_{n-2}), ..., x(t_1)) = P(x(t_n) | x(t_{n-1}), x(t_{n-2}), ..., x(t_1))
If we also know the initial conditions of everything else in the space that x inhabits, we know in principle the future history of everything in the space forever and ever amen. It might get complicated, in fact it might get chaotic and if we have to use numerical techniques the finite computational word size will ultimately limit our ability to predict the future of the system -- weather prediction is Exhibit A -- so not all is roses here. And as far as we know nothing is truly purely deterministic, although the various Copenhagen nonbelievers -- most famously Einstein and Schrodinger, more recently Hawking and Weinberg -- might argue otherwise. But to a really good approximation many things are: you can hit a battleship from a pitching, moving platform 15 miles away; if you drop a pencil it will fall; in 23000 years the North Pole will point to the star Thuban in the constellation Draco; etc.
This is the easiest of the four cases to deal with, unless it goes chaotic.
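For flavor, here's a tiny Python illustration of both halves of that -- determinism and its chaotic failure mode -- using the logistic map as the deterministic rule (my choice of toy system, not anything from the post):

```python
def logistic(x0, r=3.9, n=50):
    """Iterate the logistic map x -> r*x*(1-x), a fully deterministic rule."""
    x, traj = x0, []
    for _ in range(n):
        x = r * x * (1 - x)
        traj.append(x)
    return traj

# Same initial condition => exactly the same trajectory, every time.
assert logistic(0.2) == logistic(0.2)

# But in the chaotic regime (r = 3.9) a tiny error in the initial condition
# (here 1e-7) gets amplified until the two trajectories bear no resemblance
# to each other -- the finite-precision problem behind weather prediction.
a = logistic(0.2, n=100)
b = logistic(0.2 + 1e-7, n=100)
```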
Markov process
Now it gets interesting. Do you ever have the feeling that a chance process kind of depends on where you are or where you've been (which would defy the expectation for a purely random process)? You might be right. The first two cases are the extrema of stochastic processes (processes characterized by one or more random variables, where the "randomness" in the deterministic case = 0).
A Markov process is one step back from purely random:
P(x(t_n) | x(t_{n-1}), x(t_{n-2}), ..., x(t_1)) = P(x(t_n) | x(t_{n-1}))
This is also known as a Markov chain, after Andrey Markov, a Russian mathematician of the late 19th and early 20th centuries. Where x goes next depends only on where x is at the present. Note that that does not mean determined by the present -- if that were the case we would be able to follow its trajectory back to the beginning -- it means that the likelihood of finding x at t_n depends only on where x is at t_{n-1}. Because of this Markov processes are said to be memoryless. A true random walk is an example of a Markov process. Brownian motion is a Markov process.
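A random walk is easy to play with in code. A minimal Python sketch (the step and walk counts are arbitrary) showing the classic diffusive behavior -- mean displacement ~0, RMS displacement growing like the square root of the number of steps:

```python
import random

random.seed(1)

def random_walk(steps):
    """1-D random walk: each step is +/-1 with equal probability.
    Where the walk goes next depends only on where it is now --
    P(x_n | whole history) = P(x_n | x_{n-1}) -- so it is memoryless."""
    x, path = 0, [0]
    for _ in range(steps):
        x += random.choice((-1, 1))
        path.append(x)
    return path

# Over many walks: mean endpoint ~ 0, mean squared endpoint ~ number of steps.
walks = [random_walk(100)[-1] for _ in range(2000)]
mean_sq = sum(w * w for w in walks) / len(walks)
```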
If the random variables characterizing x are both Markovian and Gaussian (that is, they are normally distributed) then the temporal autocorrelation (see below -- the convolution link has a nice animation) of x is given by an exponential decay. This is the signature of a Markov process. If you are looking in the frequency domain that will show up as a Lorentzian spectral density. Nature is full of exponential decays -- Markov processes are quite common.
Note the constraints placed on the random variable(s) of x. Let's use Brownian motion as an example. Its equation of motion (the equation that describes its behavior in time) is the Langevin equation
F = ma = m d^2x/dt^2 = m dv/dt = - Df v + A(t)
Here F = force on the particle, m = mass, a = acceleration, v = velocity, x = spatial parameter (not generally 1-dimensional). Physically, as a particle moves through the fluid it feels a drag force Df v (it takes work to push through the molecules making up the fluid, hence the negative sign) and a stochastic "bumping" force A(t) from collisions with surrounding molecules (I'm guessing the analogue in finance is buying and selling of a bond, security, whatever). The stochastic term A(t) has to be rapid in time (more precisely, have ~delta function autocorrelation on the timescale of observation -- see below for the definition of a delta function) and small in amplitude. A big bump might suddenly send the particle into a very non-Brownian jump and be felt some time down the line, which violates the memoryless condition. A long slow bump is like a directed push rather than a bump, and if it's long enough and slow enough it too violates the memoryless condition (x(t_n) comes to depend on x(t_{n-2}) or even further back).

Langevin and Langevin-like equations turn out to describe a huge number of phenomena, and Brownian motion is used to model all sorts of stochastic motion, from radioactive decay to laser lineshapes to thermal diffusion to finance and econometrics. Entire books have been written about them.
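To make that concrete, here's a rough Python sketch that integrates a Langevin equation numerically (Euler-Maruyama scheme; all parameter values are stand-ins I picked, nothing physical) and checks for the exponential-autocorrelation signature of a Markov process:

```python
import math
import random

random.seed(7)

# A minimal sketch of m dv/dt = -Df*v + A(t), stepped forward in time.
# Illustrative parameters only: unit mass, unit drag, small timestep.
m, Df, dt, steps = 1.0, 1.0, 0.01, 200_000
kick = math.sqrt(dt)          # amplitude of the rapid, small stochastic "bumps"

v, vs = 0.0, []
for _ in range(steps):
    v += (-Df / m) * v * dt + (kick / m) * random.gauss(0.0, 1.0)
    vs.append(v)

def autocorr(x, lag):
    """Sample autocorrelation at a given lag, normalized so C(0) = 1."""
    n = len(x) - lag
    c0 = sum(xi * xi for xi in x) / len(x)
    return sum(x[i] * x[i + lag] for i in range(n)) / n / c0

# Markovian signature: C(tau) ~ exp(-(Df/m)*tau), so at tau = m/Df the
# autocorrelation should have decayed to roughly 1/e ~ 0.37.
tau_lag = int((m / Df) / dt)
c_tau = autocorr(vs, tau_lag)
```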
The concept of relative timescale is an important one. Consider radioactive decay. The timescale of observation is generally comparable to or at least some fraction of the half-life of the isotope, which might be thousands or millions of years. The timescale of a specific heavy atom decaying into lighter daughter atoms is ~instantaneous. On the timescale of observation the timescale of the stochasticity is essentially a delta function. This is what characterizes a Markov process, and why radioactive decay takes exponential form. Now if it took hundreds or thousands of years for a specific atom to decay very very gradually the population decay would not be exponential and the process would not be Markovian.
Note that observing the lifetime of radioactive decay is mapping out a temporal autocorrelation.
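A sketch of this in Python, treating each atom as memoryless (the per-step decay probability and population size are arbitrary stand-ins): a constant decay probability per atom per step yields exponential population decay, with the half-life falling out of the math.

```python
import math
import random

random.seed(3)

def decay(n0=20_000, p=0.01, steps=150):
    """Memoryless decay: every surviving atom decays with the same
    probability p per step, no matter how old it is. The population
    therefore falls exponentially: N(t) ~ N0 * (1 - p)^t."""
    n, pop = n0, []
    for _ in range(steps):
        # Count the survivors of this step.
        n = sum(1 for _ in range(n) if random.random() > p)
        pop.append(n)
    return pop

pop = decay()
half_life = math.log(2) / -math.log(1 - 0.01)   # ~69 steps for p = 0.01
frac_left = pop[int(half_life)] / 20_000        # should be ~0.5
```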
Here's a finance paper (PDF alert) that explicitly uses Langevin equations to model bond yields. It's a little difficult to read because the author does not define his notation (! - leaves it in a reference) but the solution is a product of exponential decays – as it had to be because of the Markovian assumption implicit in the Langevin equation. It is also difficult to assess how good a model this is because the author does not validate it against data (!!). But we can invert that: if an observed autocorrelation does NOT exhibit exponential decay, the underlying process is not Markovian, and we'll need a different model. It also means that if the fluctuations in your system are not small and rapid, you cannot use a Markov model; if you do use one, you've made a statement about those fluctuations. A quick web trawl brings up any number of Markovian finance and econometric models.
Take home messages about Markov processes: memoryless, exponential autocorrelation, rapid and small-amplitude stochastic term, well understood.
Non-Markov processes
Now it gets messy, because a non-Markovian process is characterized by a conditional probability that is not Markovian (and, by convention, not purely random or deterministic). It is like saying a system is nonlinear -- you've made a statement about what something isn't, not about what something is. The conditional probability chain may be truncated at any point, and the actual probabilities and weighting of events at various t_i may be quite different and may be conditionally different (that is, it depends on circumstances). There are an infinite number of possibilities here, and you have to come up with the one that accurately describes the dynamics of your system. In certain fields of study, where the variables can be assumed to be Gaussian (at least to a good approximation) and where the process is stationary over the timescale of interest, you can ignore the mechanism and use the observed autocorrelation or spectral density. The alternative is to come up with a mechanism (conditional probability), find an observable -- the autocorrelation or spectral density -- and show it fits the data to at least the noise level. In general non-Markovian problems are Very Difficult and attempts to address them frequently devolve to speculative flailing. The difficulty is not in coming up with a specific conditional probability -- that's easy: coming up with one that is realistic is the problem.
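As a toy example of a non-Markovian mechanism, here's a Python sketch of a process with two-step memory (an AR(2) rule -- my choice for illustration, not drawn from any real system). Its autocorrelation oscillates and goes negative, which an exponential decay can never do, so one look at the autocorrelation rules out a Markov model:

```python
import random

random.seed(5)

# Two-step memory: x_n depends on BOTH x_{n-1} and x_{n-2}, so the
# process is not Markovian in x alone. Coefficients chosen (by me)
# to give a stable, oscillatory memory.
a1, a2 = 1.6, -0.9
x2 = x1 = 0.0
xs = []
for _ in range(100_000):
    x = a1 * x1 + a2 * x2 + random.gauss(0.0, 1.0)
    x2, x1 = x1, x
    xs.append(x)

def autocorr(x, lag):
    """Normalized sample autocorrelation at a given lag."""
    n = len(x) - lag
    c0 = sum(v * v for v in x) / len(x)
    return sum(x[i] * x[i + lag] for i in range(n)) / n / c0

# Strongly positive at lag 1, already negative by lag 4: non-exponential,
# hence non-Markovian.
rho1, rho4 = autocorr(xs, 1), autocorr(xs, 4)
```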
It may not be obvious, but if the system is stationary for long enough, it will start to look Markovian as you expand the timescale of observation. That's because fluctuations that are non-Markovian on one timescale start looking more and more delta-function-like as the timescale expands. Stochastic processes converge to exponentials if you can wait long enough (which is, admittedly, a Very Huge If, but handy when modeling -- sometimes too handy). As for the market in finance modeling, that's not stationary (for one thing it keeps growing), but over certain time domains it may look stationary, though that might take some creative detrending.
There's another wrinkle here as far as the market goes. The market is self-aware – it is conditionally autoregressive, sometimes in strange ways. Sometimes everybody's buying because, well, everybody's buying; sometimes it gets spooked for no very good reason. Obviously neither behavior is Markovian, at least on those timescales. It can be willfully amnesiac (leverage post-LTCM, the current resurrection of CDSs and CDO-like securities) or it can have a very long memory (Great Depression -- obviously a hugely non-Markovian event) depending on whether times feel good or not. The conditional autoregressiveness can tap events from well back in history like the Great Depression, which makes for interesting model challenges.
Take home messages about non-Markovian systems: messy, memory effects, realistic, not a specific case so no general solutions, best to stick to observables (autocorrelation or spectral density), timescales important.
A couple of closing thoughts. When looking at a Markovian stochastic model (easily recognizable by its exponential autocorrelation, or because it says Langevin equation or random walk), ask whether the constraints on the fluctuations make sense within the limits of the model. If the model is non-Markovian, how realistic is the mechanism (conditional probability), and -- critically -- how well do the predictions match the data? What sort of assumptions had to be made to get the model to match the data?
And finally: most real systems of interest are messy, non-Markovian, non-stationary, and Really Really Hard. Quite frequently Intractably Hard. Simplifications are necessary, so it becomes crucially important to validate assumptions, domains of applicability, and results.
If you're curious about autocorrelations: The temporal autocorrelation of a stochastic process describes how it "decorrelates" with itself as time progresses, both in timescale and form. Formally it is a convolution of a (generalized) function with itself, lagged in time, where the integral is taken over the time delay. You can think of it as how a function looks compared to itself at different times. It turns out to be a powerful analytical tool: from it you can infer dynamics of the underlying stochastic process, specifically timescales and relative timescales. It also characterizes the nature of the process, and there are three cases here: a delta function, for purely random processes; an exponential decay, for Markovian processes; and everything else (as long as the integral -- an autocorrelation is an integral, remember -- doesn't blow up), for non-Markovian processes. For a stationary process, it approaches a constant as you approach the deterministic limit (which approaches meaninglessness in this limit, so deterministic processes are treated differently). At least in science and engineering, this analysis is often done in the frequency domain: the Fourier transform of the autocorrelation function is its spectral density or sometimes power spectrum, and the analysis is called spectral analysis.
A delta function (or Dirac delta) is a (generalized) function that is nonzero at a single value, zero everywhere else, and has an integral equal to one. It is the continuum analog of the Kronecker delta. A delta function autocorrelation means that a function is completely correlated with itself when it is perfectly overlapped with itself (as it must be) but loses all correlation as soon as the overlap shifts.
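And to close the loop in code: a Python sketch showing that purely random (white) noise has exactly this delta-like autocorrelation -- full correlation at zero lag, essentially none anywhere else:

```python
import random

random.seed(0)

def autocorr(x, lag):
    """Normalized sample autocorrelation: C(0) = 1 by construction."""
    n = len(x) - lag
    c0 = sum(v * v for v in x) / len(x)
    return sum(x[i] * x[i + lag] for i in range(n)) / n / c0

# White noise: the discrete stand-in for a delta-function autocorrelation.
noise = [random.gauss(0.0, 1.0) for _ in range(50_000)]
c0, c1 = autocorr(noise, 0), autocorr(noise, 1)
print(c0)        # 1.0 at zero lag
print(abs(c1))   # ~0 at any nonzero lag
```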