Stock Prediction

In this problem we will predict stock values using a hidden Markov model (HMM).

Part 1: Training an HMM

We have previously studied HMMs in class. Consider an HMM with $K$ states. The parameters of the HMM are the transition probabilities $P(i|j)$, the initial state probabilities $\pi(i)$, and the state output distributions $P(x|i)$. We will assume that $P(x|i)$ is a Gaussian with mean $\mu_i$ and covariance matrix $\Phi_i$: $$P(x|i) = \frac{1}{\sqrt{(2\pi)^d |\Phi_i|}}\exp\left( -\frac{1}{2}(x-\mu_i)^T\Phi_i^{-1}(x-\mu_i)\right)$$

where $d$ is the dimension of data vector $x$.
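For concreteness, here is a minimal numpy sketch of this density, evaluated in the log domain to avoid underflow (the function and variable names are our own):

```python
import numpy as np

def log_gaussian(x, mu, Phi):
    """log P(x|i) for a Gaussian with mean mu and covariance matrix Phi."""
    d = x.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(Phi)           # log |Phi|
    quad = diff @ np.linalg.solve(Phi, diff)     # (x-mu)^T Phi^{-1} (x-mu)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)
```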

The parameters of the HMM can be learned from training data using the well-known Baum-Welch algorithm.

Given training data ${\bf X}$, which is a sequence of vectors $x_1, x_2, \ldots, x_T$, the Baum-Welch algorithm proceeds as follows:

Forward probabilities: The forward probability $\alpha(t,i) = P(x_{1:t}, state(t)=i)$ ($x_{1:t}$ is shorthand for $x_1,x_2,\cdots,x_t$) can be computed recursively as follows: $$\alpha(1,i) = \pi(i) P(x_1|i)$$ $$\alpha(t,i) = P(x_t|i)\sum_j \alpha(t-1,j)P(i|j) ~~~ \forall~ t > 1$$

Often the above recursion will result in underflow or overflow. To avoid this, a scaling is used: $$S_\alpha(t) = \frac{1}{\max_i \alpha(t,i)}$$ $$\alpha(t,i) = S_\alpha(t)\alpha(t,i)$$

The scaling ensures that at least one of the $\alpha$ terms is 1.0 and that no $\alpha$ term exceeds 1.0, thereby avoiding both underflow and overflow. The scaling terms $S_\alpha(t)$ must be stored to compute the total data likelihood later. The scaled $\alpha(t,i)$ probabilities are used to compute $\alpha(t+1,i)$.

So the overall procedure is: Compute $\alpha$ terms at time $t$, then compute $S_\alpha(t)$, then scale $\alpha$ terms, then proceed to $t+1$.
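A sketch of the scaled forward pass in numpy, following the procedure above. We assume a precomputed matrix B with B[t, i] = $P(x_t|i)$ and a transition matrix A with A[i, j] = $P(j|i)$; these names and conventions are our own:

```python
import numpy as np

def forward(B, A, pi):
    """Scaled forward pass.
    B:  T x K, B[t, i] = P(x_t | i)
    A:  K x K, A[i, j] = P(j | i), i.e., row i holds transitions out of state i
    pi: length-K vector of initial state probabilities
    Returns the scaled alphas (T x K) and the scale factors S_alpha (length T).
    """
    T, K = B.shape
    alpha = np.empty((T, K))
    S = np.empty(T)
    alpha[0] = pi * B[0]
    for t in range(T):
        if t > 0:
            alpha[t] = B[t] * (alpha[t - 1] @ A)   # sum_j alpha(t-1, j) P(i | j)
        S[t] = 1.0 / alpha[t].max()                # scale so the largest alpha is 1.0
        alpha[t] *= S[t]
    return alpha, S
```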

Backward probabilities: The backward probability $\beta(t,i) = P(x_{t+1:T}|state(t)=i)$ can be computed recursively as follows: $$\beta(T,i) = 1.0 ~~\forall~i$$ $$\beta(t,i) = \sum_j \beta(t+1,j)P(x_{t+1}|j)P(j|i) ~~~ \forall~ t < T$$

The backward probabilities too must be scaled to avoid over/underflow. This is performed as follows: $$S_\beta(t) = \frac{1}{\max_i \beta(t,i)}$$ $$\beta(t,i) = S_\beta(t)\beta(t,i)$$

Here the $\beta(t,i)$ terms must first be scaled prior to computing the $\beta(t-1,i)$ terms. So the overall process is: Compute $\beta(t,i)$ for all $i$, compute $S_\beta(t)$, scale $\beta(t,i)$ for all $i$, then step backward to $t-1$.
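The corresponding scaled backward pass, with the same B and A conventions as the forward sketch above:

```python
def backward(B, A):
    """Scaled backward pass (B and A as in forward above)."""
    T, K = B.shape
    beta = np.empty((T, K))
    S = np.ones(T)
    beta[T - 1] = 1.0                              # beta(T, i) = 1.0 for all i
    for t in range(T - 2, -1, -1):
        # sum_j beta(t+1, j) P(x_{t+1} | j) P(j | i)
        beta[t] = A @ (B[t + 1] * beta[t + 1])
        S[t] = 1.0 / beta[t].max()
        beta[t] *= S[t]                            # scale before stepping back
    return beta, S
```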

Gamma terms: The a posteriori probability of a state $i$ at time $t$ is given by $$\gamma(t,i) = \frac{\alpha(t,i)\beta(t,i)}{\sum_j \alpha(t,j)\beta(t,j)}$$

Similarly, the a posteriori probability of the state sequence $i,j$ occurring at time $t,t+1$ is given by $$\gamma(t,t+1,i,j) = \frac{\alpha(t,i)P(j|i)\beta(t+1,j)P(x_{t+1}|j)}{\sum_{k,l} \alpha(t,k)P(l|k)\beta(t+1,l)P(x_{t+1}|l)}$$

Note that in both $\gamma$ terms the denominator is merely a normalizing constant that ensures that the $\gamma$ terms sum to 1.0.
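Because the denominators normalize over all states (or state pairs), the per-time scale factors on $\alpha$ and $\beta$ cancel, so the scaled quantities can be used directly. A numpy sketch (names ours):

```python
def posteriors(alpha, beta, B, A):
    """gamma(t, i) and gamma(t, t+1, i, j) from the scaled alphas/betas."""
    T, K = alpha.shape
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)      # normalize over states
    xi = np.empty((T - 1, K, K))                   # xi[t, i, j] = gamma(t, t+1, i, j)
    for t in range(T - 1):
        m = alpha[t][:, None] * A * (B[t + 1] * beta[t + 1])[None, :]
        xi[t] = m / m.sum()                        # normalize over state pairs
    return gamma, xi
```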

Now the update rules for the parameters of the HMM can be computed as follows.

The total data log-likelihood

The total data log-likelihood can be computed as $$L = \log\left(\sum_i \alpha(T,i)\right) - \sum_t \log S_\alpha(t)$$

Note that we are accounting for the scaling factors in the above equation.
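In code, using the scaled alphas and scale factors returned by the forward sketch above:

```python
def log_likelihood(alpha, S_alpha):
    """Total data log-likelihood, undoing the forward scaling."""
    return np.log(alpha[-1].sum()) - np.log(S_alpha).sum()
```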

The update formulae

The initial state probabilities: In order to estimate the initial state probabilities well, one needs multiple training utterances. Otherwise, it is probably best to keep them uniform over the allowed first states of the HMM. In any case, the update formula is below. Here I am explicitly including a subscript ${\bf X}$ to indicate terms obtained from the training instance ${\bf X}$ (which, you will recall, is actually a sequence of vectors). In the update rules for the remaining parameters I am not including this subscript, but it is nevertheless implied. $$\pi(i) = \frac{\sum_{\bf X} \gamma_{\bf X}(1,i)}{\sum_{\bf X} 1}$$ The denominator above is simply the number of training instances.

Transition probabilities: $$P(j|i) = \frac{\sum_{t=1}^{T-1} \gamma(t,t+1,i,j)}{\sum_{t=1}^{T-1} \gamma(t,i)}$$

Note the two different $\gamma$ terms in the numerator and denominator above: the numerator uses the four-argument transition $\gamma$, while the denominator uses the two-argument state $\gamma$.

State output distribution parameters: The updated mean and covariance for the state output distribution of state $i$ are given by: $$\mu_i = \frac{\sum_t \gamma(t,i)x_t}{\sum_t \gamma(t,i)}$$ $$\Phi_i = \frac{\sum_t \gamma(t,i)(x_t - \mu_i)(x_t-\mu_i)^T}{\sum_t \gamma(t,i)}$$
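A sketch of the full M-step for a single training sequence X (a T x d array), using the gamma and xi arrays from the sketches above; with multiple sequences, accumulate the numerators and denominators across sequences before dividing:

```python
def reestimate(X, gamma, xi):
    """One Baum-Welch M-step for a single training sequence."""
    T, K = gamma.shape
    d = X.shape[1]
    pi = gamma[0]                              # average gamma(1, i) over sequences in general
    A = xi.sum(axis=0)
    A /= gamma[:-1].sum(axis=0)[:, None]       # sum_{t=1}^{T-1} gamma(t, i)
    denom = gamma.sum(axis=0)[:, None]
    mu = (gamma.T @ X) / denom                 # gamma-weighted mean per state
    Phi = np.empty((K, d, d))
    for i in range(K):
        diff = X - mu[i]
        Phi[i] = (gamma[:, i, None] * diff).T @ diff / denom[i]
    return pi, A, mu, Phi
```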

PART A: Training an HMM on stock data

The data here is a .mat file of stock and index values downloaded over the period 1 May 2007 to 18 Nov 2012. Five different stocks are represented, as are five stock indices. Each index is a MATLAB structure of its own with many entries, whereas the stocks are in one big structure; this is just so that you don't confuse the two. For this homework, we will focus on the closing value of the stock/index at the end of each trading day. You can safely disregard all the other information.
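If you work in Python, scipy can read the .mat file. The file name and field names below are hypothetical placeholders; inspect the loaded dictionary to find the real ones:

```python
from scipy.io import loadmat

# 'stocks.mat' and the field names below are placeholders, not the actual names.
data = loadmat('stocks.mat', squeeze_me=True, struct_as_record=False)
print([k for k in data if not k.startswith('__')])   # variables stored in the file
# If an index structure exposes a field of daily closing values, e.g.:
# closes = data['some_index'].Close
```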

Using the equations given earlier, train HMMs with 4, 8, 16 and 32 states. You must turn in

PART B: Predicting the future

At each $t$, using a trained HMM, compute the forward probabilities $\alpha(t,i)$. The a posteriori probabilities of the states, given all observations up to time $t$, are given by $$P(i|x_{1:t}) = \frac{\alpha(t,i)}{\sum_j \alpha(t,j)}$$

The predicted probability distribution for the $(t+1)$-th observation, $x_{t+1}$, given all observations up to time $t$ is given by: $$P(x_{t+1}|x_{1:t}) = \sum_j P(x_{t+1}|j) \sum_i P(i|x_{1:t}) P(j|i)$$

Explain how the above equation was arrived at.

PART C: Making money with the HMM

We now wish to predict stock values. At each time $t$, we would like to guess the stock values at $t+1$ so that the appropriate amount of money may be invested in them. A variety of rules may be employed to predict the future; some will work better than others. A simple estimate is the MMSE estimate.

The MMSE (minimum mean squared error) estimate of a variable is the expected value of the variable. For our stock prediction, we would like to estimate $x_{t+1}$ given $x_{1:t}$. This is given by: $$\hat{x}_{t+1} = E[x_{t+1} | x_{1:t}]$$

Using the above equation for $P(x_{t+1} | x_{1:t})$, we get $$\hat{x}_{t+1} = \sum_i P(i|x_{1:t}) \sum_j P(j|i) \mu_j$$

where $\mu_j = E[x | j]$ is the mean of the state output distribution (Gaussian) of the $j$-th state. You should be able to convince yourself that the above equation is correct.
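A sketch of the per-day prediction, reusing the forward pass sketched in Part 1; note that `post @ A` computes $\sum_i P(i|x_{1:t})P(j|i)$ for every $j$ at once:

```python
def predict_next(B_upto_t, A, pi, mu):
    """MMSE estimate of x_{t+1} given x_1..x_t (mu is the K x d matrix of state means)."""
    alpha, _ = forward(B_upto_t, A, pi)      # scaled forward pass over x_1..x_t
    post = alpha[-1] / alpha[-1].sum()       # P(i | x_{1:t}); the scale factors cancel
    return (post @ A) @ mu                   # sum_i P(i|x_{1:t}) sum_j P(j|i) mu_j
```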

Using the above equation, on each day predict all stock indices for the next day. Plot the error between the predicted and true values for each day as a function of time. Do this with each of the HMMs with 4, 8, 16, and 32 states. Also report the total squared error of prediction for each of the stock indices.

Note that the predicted value of a stock index at any time $t$ is simply the corresponding component of $\hat{x}_t$.
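One way to produce the requested plots and totals, assuming you have stacked the true values and the predictions into matching T x d arrays (names ours):

```python
import numpy as np
import matplotlib.pyplot as plt

def report_errors(X_true, X_pred, label):
    """Plot the per-day squared prediction error and print its total."""
    err = np.sum((X_pred - X_true) ** 2, axis=1)   # squared error on each day
    plt.plot(err, label=label)
    plt.xlabel('trading day')
    plt.ylabel('squared prediction error')
    plt.legend()
    print(label, 'total squared error:', err.sum())
```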