OBJECTIVE

The objective of this project was to develop a model that, for a specific observation of current indicators of the Dow Jones Industrial Average (DJIA) index:

Predicts the next period’s return for DJIA via Linear Regression
Recommends whether to Buy / Sell DJIA index via Logistic Regression

Principal Component Analysis (PCA) was used for dimension reduction.

Motivation: Utilize these models to inform investment decisions regarding the DJIA.

BACKGROUND AND DATA

Inputs and Outputs

Both technical and macroeconomic indicators were used as inputs to the model.

Inputs – Technical indicators

Exponential moving averages (EMAs) over different rolling periods. These help smooth the signal as well as capture the temporal nature of the point-in-time observation.
Slopes of exponential moving averages. These also help capture the temporal nature of the point-in-time observation.
Ratios of these moving averages (i.e. is the 21 day EMA higher than the 126 day EMA). The significance of these ratios is explained in the next section.
Annualized volatilities: account for the dispersion of the rate of change; in times of stress, volatility tends to increase.
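
As a rough illustration of the annualized volatility input, the sketch below takes a rolling standard deviation of daily log returns and scales it by the square root of the number of trading days in a year. The helper name `annualized_vol`, the use of log returns, the 252-day convention, and the synthetic price series are all assumptions made for illustration; the original calculation may have differed.

```r
# Hypothetical helper: rolling annualized volatility from daily closing levels.
# Assumes log returns and 252 trading days per year (neither is confirmed by the source).
annualized_vol <- function(price, window = 21, days_per_year = 252) {
  ret <- diff(log(price))                      # daily log returns
  n <- length(ret)
  vol <- rep(NA_real_, n)
  if (n >= window) {
    for (t in window:n) {
      vol[t] <- sd(ret[(t - window + 1):t]) * sqrt(days_per_year)
    }
  }
  c(NA_real_, vol)                             # pad so the result aligns with `price`
}

set.seed(1)
price <- 100 * exp(cumsum(rnorm(300, mean = 0, sd = 0.01)))  # synthetic index levels
vol21 <- annualized_vol(price, window = 21)
vol63 <- annualized_vol(price, window = 63)
tail(cbind(vol21, vol63), 3)
```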

Inputs – Macro indicators

Fed Funds rate changes: Negative changes tend to indicate easing or a view that the Fed feels the economy is slowing down or in recession.
Unemployment level changes
GDP growth.

Outputs

Expected forward returns calculated.
Long (Buy) / Cash (Sell) indicator

Technical Indicators – Background

Below are the conditions found in different types of markets

Bear Market

Long-term moving averages higher than short-term moving averages
Moving averages higher than index
Negative slopes

Bull Market

Long-term moving averages lower than short-term moving averages
Positive slopes
Moving averages lower than index

Inflection

Order of moving averages reverses
Sign of slopes reverse

Based on this, the following technical indicators were used:

Ratio of 6 month EMA to DJIA Index: EMA126_I
Ratio of 3 month EMA to 6 month EMA: EMA63_R
Ratio of 1 month EMA to 3 month EMA: EMA21_R
Slope of 6 month EMA as a ratio to DJIA index
Slope of 3 month EMA as a ratio to DJIA index
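
The indicators listed above could be derived along the following lines. This is only a sketch: the `ema` helper, its 2 / (n + 1) smoothing factor, the one-day difference used as the slope, the column names `EMA126_Sl` and `EMA63_Sl`, and the synthetic index series are assumptions rather than the calculations actually used in the project.

```r
# Hypothetical helper: exponential moving average with smoothing 2 / (n + 1),
# seeded with the first observation (a common convention, assumed here).
ema <- function(x, n) {
  alpha <- 2 / (n + 1)
  out <- numeric(length(x))
  out[1] <- x[1]
  for (t in seq_along(x)[-1]) {
    out[t] <- alpha * x[t] + (1 - alpha) * out[t - 1]
  }
  out
}

set.seed(2)
index <- 100 * exp(cumsum(rnorm(500, mean = 0.0003, sd = 0.01)))  # synthetic levels

ema21  <- ema(index, 21)    # roughly 1 month of trading days
ema63  <- ema(index, 63)    # roughly 3 months
ema126 <- ema(index, 126)   # roughly 6 months

ind <- data.frame(
  EMA126_I  = ema126 / index,               # 6 month EMA relative to the index
  EMA63_R   = ema63 / ema126,               # 3 month EMA relative to 6 month EMA
  EMA21_R   = ema21 / ema63,                # 1 month EMA relative to 3 month EMA
  EMA126_Sl = c(NA, diff(ema126)) / index,  # 6 month EMA slope as a ratio to the index
  EMA63_Sl  = c(NA, diff(ema63)) / index    # 3 month EMA slope as a ratio to the index
)
tail(ind, 3)
```

In a bull market as described above, the first ratio would tend to sit below 1 and the slopes would be positive; in a bear market the pattern reverses.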

Data Sourcing, Training, and Validation

Data was sourced from:

DJIA levels: finance.yahoo.com
Slopes and Ratios: Calculated
Volatilities: Calculated
Macro Economic Data: government sites including Fed

Daily observations were collected, technical indicators calculated, and macroeconomic data overlaid. Then weekly observations were extracted:

3037 weekly observations from 1954 through 2013
2986 post removal of outliers
Training Data – Selected a random 75% of observations
Validation Data – Utilized remaining 25% of observations
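
A random 75% / 25% split of this kind can be sketched as follows. The data frame `djia_demo`, the seed, and the rounding of the training size are stand-ins chosen for illustration; the project itself flagged rows through a `TrainSet` column, whose construction is not shown in this report.

```r
set.seed(42)                                   # assumed seed, for reproducibility only
djia_demo <- data.frame(obs = 1:2986)          # stand-in for the 2986 cleaned weekly rows

n_train   <- round(0.75 * nrow(djia_demo))
train_idx <- sample(nrow(djia_demo), n_train)  # random 75%, drawn without replacement

train_demo <- djia_demo[train_idx, , drop = FALSE]
val_demo   <- djia_demo[-train_idx, , drop = FALSE]

c(train = nrow(train_demo), validation = nrow(val_demo))
```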

Data Preparation

The distributions of each input and continuous output variable were analyzed
Different transformations were attempted, where required, to achieve normality
The data was then centered and standardized by converting each variable to a Z score.
Below the 21 day volatility transformation example is illustrated …

Data Fields

Inputs (15 variables) and Target (3)

Data Summary Post Centering and Normalizing

The table below shows the correlation matrix of the normalized, centered, continuous variables

As expected, there is a high absolute correlation among the different moving averages, slopes, and ratios.
Similarly, there is a correlation between the 21 and 63 day volatilities.
Lastly, the 1 month and 1 quarter changes in the Fed Funds rate also show a level of correlation.

One of the key decisions was which time frames to include or exclude. Instead of making that decision a priori, I used PCA to reduce dimensionality and remove multicollinearity.

PRINCIPAL COMPONENT ANALYSIS

PCA was applied to the input variables for the reasons listed in the prior section.

The first four PCs were selected because:

Their eigenvalues are greater than or close to 1.
Their cumulative variance accounts for 83% of the total.

This reduced the problem from a 15-variable model to a four-variable one.
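
The selection rule can be illustrated on synthetic data. The sketch below is not the project's data or result: the six made-up variables, the 0.9 threshold standing in for "close to 1", and the variable names are assumptions. It only shows how the eigenvalues and cumulative variance are read off a `prcomp` fit.

```r
set.seed(7)
n  <- 500
f1 <- rnorm(n); f2 <- rnorm(n)                 # two hidden common factors (synthetic)
X <- cbind(
  a1 = f1 + 0.3 * rnorm(n), a2 = f1 + 0.3 * rnorm(n), a3 = f1 + 0.3 * rnorm(n),
  b1 = f2 + 0.3 * rnorm(n), b2 = f2 + 0.3 * rnorm(n),
  c1 = rnorm(n)
)

pca     <- prcomp(X, center = TRUE, scale. = TRUE)
eig     <- pca$sdev^2                          # eigenvalues of the correlation matrix
cum_var <- cumsum(eig) / sum(eig)              # cumulative share of total variance

data.frame(PC = seq_along(eig), eigenvalue = round(eig, 2), cum_var = round(cum_var, 2))

keep   <- which(eig >= 0.9)                    # 0.9 is an assumed reading of "close to 1"
scores <- pca$x[, keep, drop = FALSE]          # reduced inputs for the regressions
```

Because the inputs are standardized, the eigenvalues sum to the number of variables, so an eigenvalue near 1 means the component explains about as much variance as one original variable.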

Then the rotation eigenvectors were used to interpret the meaning of the PCs:

Based on the above, the following interpretations can be made:

LINEAR REGRESSION

Linear regression was applied to try to predict the forward monthly return of an observation:

FwdReturn: a linear function of (PC1, PC2, PC3, PC4)
In R the formula pattern is: train$FwdMReturnCtr ~ train$PC1 + train$PC2 + train$PC3 + train$PC4

The coefficients and best-fit description were as follows:

Observations & Conclusions

The Pr(>|t|) values mean we fail to reject the null hypothesis that the intercept and the coefficients highlighted in red are equal to 0. In other words, the intercept and those coefficients cannot be relied on.
Adjusted R squared is exceedingly low at 0.0125.

Similar results were seen when the actual centered and normalized 15 variables were used.

This leads to the following conclusions –

The behavior of forward returns as a function of the inputs cannot be modeled linearly over this extensive time period (1954 to 2013).
It is possible that a better fit could be obtained for specific, shorter time periods, e.g. for specific types of cyclical markets. I observed this in a prior class, where a decision tree or neural network derived from the training data was more accurate for short cyclical periods than when applied to the entire timeline. This was not attempted here, though I still suspect the relationship is unlikely to be linear.

LOGISTIC REGRESSION

A logistic regression was applied as follows –

log(odds(LongPosition))= a linear function of (PC1,PC2,PC3,PC4)
In R: mylogit2 <- glm2(train$CycLC ~ train$PC1 + train$PC2 + train$PC3 + train$PC4, family = "binomial")
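
With the binomial family's default logit link, the fitted linear function is the natural logarithm of the odds, so a probability is recovered with the inverse logit exp(x) / (1 + exp(x)). The sketch below applies this to the coefficients that appear in the appendix snippet; the helper name `prob_long` and the example PC scores are illustrative assumptions.

```r
# Coefficients as they appear in the appendix snippet for the fitted logit model.
b <- c(intercept = 1.58972, PC1 = -0.66736, PC2 = 0.11113, PC3 = 0.11268, PC4 = -0.29539)

# Hypothetical helper: probability of a Long position for given PC scores.
prob_long <- function(pc1, pc2, pc3, pc4) {
  log_odds <- b["intercept"] + b["PC1"] * pc1 + b["PC2"] * pc2 +
    b["PC3"] * pc3 + b["PC4"] * pc4
  unname(plogis(log_odds))      # inverse logit: exp(x) / (1 + exp(x)), natural log base
}

prob_long(0, 0, 0, 0)           # all PCs at their mean
prob_long(2, 0, 0, 0)           # a higher PC1 score lowers the probability (negative coefficient)
```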

Observations on Logistic Model

The Pr(>|z|) values lead us to reject the null hypothesis that the intercept and the coefficients highlighted in green are equal to 0. In other words, the intercept and these coefficients are significant at a confidence level of 96% or better.
So why did linear regression fail whereas logistic regression (at least at this point) appeared to do better? My interpretation is that when the ‘crowd’ decides to enter or exit the market, it is ‘taking a bet’ based on a conscious or unconscious view of the odds of that bet. That odds decision appears to fit a linear model (i.e. a logistic regression, which is linear in the log-odds).

Validation of Model

The model was applied to the Validation data to assess accuracy of results …

The area under the ROC curve is 0.7999, which indicates reasonably good predictions.
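
For intuition, the area under the ROC curve equals the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one. The sketch below computes it from ranks in base R. It is a swapped-in illustration, not the `roc` call used in the appendix, and the helper name `auc_rank` and the synthetic labels and scores are assumptions.

```r
# Hypothetical helper: AUC as the probability that a random positive outscores
# a random negative (ties count one half), via the rank-sum identity.
auc_rank <- function(label, score) {
  r     <- rank(score)                         # average ranks handle ties
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(3)
label <- rbinom(700, 1, 0.79)                  # synthetic outcomes, mostly positives
score <- plogis(rnorm(700, mean = ifelse(label == 1, 1.2, 0)))  # synthetic scores
auc_rank(label, score)
```

A value of 0.5 corresponds to scores that carry no information, and 1 to perfect separation.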

The percentage of correct predictions is highest at a cutoff between 30% and 60% probability.

For a Move into Market signal (i.e. a move from Cash to Long), you want to minimize false positives, which yields a cutoff at 30%.

At this level, 89 out of 703 observations are False Positives = 13%.
Another measure is the proportion of False Positives / Actual Positives = 89/(517+34) = 16%.
This compares to total actual positives of 79%.
Hence the logit model is a definite improvement over a ‘random draw’ from the sample distribution.

For a Move out of Market signal, you want to minimize false negatives, which yields a cutoff at 60%.

At this level, 16 out of 703 observations are False Negatives = 2%.
Another measure is the proportion of False Negatives / Actual Negatives = 16/(63+89) = 11%.
This compares to total actual negatives of 21%.
Hence the logit model is a good improvement over a ‘random draw’.
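
The ratios quoted for the two cutoffs can be re-derived from the counts given in the text. In the sketch below, reading 517 as true positives and 34 as false negatives at the 30% cutoff is my inference from the expression 89/(517+34); the variable names are illustrative.

```r
# Counts at the 30% cutoff, read off the ratios quoted in the text.
# Treating 517 as true positives and 34 as false negatives is an inference.
tp <- 517; fn <- 34; tn <- 63; fp <- 89
total      <- tp + fn + tn + fp                # validation observations
actual_pos <- tp + fn
actual_neg <- tn + fp

fn_60 <- 16                                    # false negatives quoted for the 60% cutoff

round(100 * c(
  fp_share_of_all    = fp / total,             # 89 / 703
  fp_over_actual_pos = fp / actual_pos,        # 89 / 551
  fn_share_of_all    = fn_60 / total,          # 16 / 703
  fn_over_actual_neg = fn_60 / actual_neg      # 16 / 152
), 1)
```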

CONCLUSIONS

PCA yielded significant dimension reduction and intuitive principal components.
Linear regression was unable to yield a good model: the relationships among the variables are not linear, at least over the length of period explored.
Logistic regression resulted in a reasonable level of accuracy.

NEXT STEPS

While the logistic model shows promise, the ultimate test is to backtest it: simulate Buys and Sells over a historical period, then compare the resulting returns with the actual DJIA return (i.e. Hold) and with leading actively managed broad-index mutual funds under different market conditions.
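
A minimal sketch of what such a backtest could look like is given below. It was not run as part of this project: the weekly returns and model probabilities are synthetic, the single 0.5 cutoff is hypothetical, Cash is assumed to earn nothing, and transaction costs are ignored.

```r
set.seed(11)
n          <- 520                                  # ten years of synthetic weekly data
weekly_ret <- rnorm(n, mean = 0.0015, sd = 0.02)   # synthetic index returns
prob_long  <- runif(n)                             # stand-in for model probabilities

cutoff   <- 0.5                                    # hypothetical single cutoff
signal   <- as.integer(prob_long >= cutoff)        # 1 = Long, 0 = Cash
position <- c(0, head(signal, -1))                 # act on last week's signal (no look-ahead)

strategy_ret <- position * weekly_ret              # Cash is assumed to earn nothing
growth <- c(hold = prod(1 + weekly_ret), strategy = prod(1 + strategy_ret))
growth                                             # growth of one unit invested
```

Lagging the signal by one period matters: using the same week's signal and return would leak future information into the simulated trades.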

============

APPENDIX

Sample R snippets

Outliers Removal

myOutlierIndexEMA21_Sl <- which(djia$EMA21_SlZ_NEW<(-3) | djia$EMA21_SlZ_NEW>(3))

myOutlierIndexEMA63_Sl <- which(djia$EMA63_SlZ_NEW<(-3) | djia$EMA63_SlZ_NEW>(3))

myOutlierIndex<-union(myOutlierIndexEMA21_Sl, myOutlierIndexEMA63_Sl)

myOutlierIndexEMA126_Sl <- which(djia$EMA126_SlZ_NEW<(-3) | djia$EMA126_SlZ_NEW>(3))

myOutlierIndex<-union(myOutlierIndex, myOutlierIndexEMA126_Sl)

myOutlierIndexEMA126_Sl_R <- which(djia$EMA126_SlZ_R<(-3) | djia$EMA126_SlZ_R>(3))

myOutlierIndex<-union(myOutlierIndex, myOutlierIndexEMA126_Sl_R)

myOutlierIndexEMA63_Sl_R <- which(djia$EMA63_SlZ_R<(-3) | djia$EMA63_SlZ_R>(3))

myOutlierIndex<-union(myOutlierIndex, myOutlierIndexEMA63_Sl_R)

myOutlierIndexFFM<- which(djia$FedFund1MDeltaZ<(-4) | djia$FedFund1MDeltaZ>(4))

myOutlierIndex<-union(myOutlierIndex, myOutlierIndexFFM)

myOutlierIndexFFQ<- which(djia$FedFund1QDeltaZ<(-4) | djia$FedFund1QDeltaZ>(4))

myOutlierIndex<-union(myOutlierIndex, myOutlierIndexFFQ)

length(myOutlierIndex)

djia$Outlier<-0

djia$Outlier[myOutlierIndex]<-1

Transformations

Min=min(djia[,i])

if (Min<0){adj = -2*Min}else {adj = 0}

d = skewness(djia[,i])[1]

hist(djia[,i],breaks=20,main=colnames(djia)[i],col='blue')

qqnorm(djia[,i],main=paste("skew=",toString(d)))

L=log(djia[,i]+adj)

dL = skewness(L)[1]

hist(L,breaks=20,main=paste(colnames(djia)[i],"Log"),col='blue')

qqnorm(L,main=paste("skew=",toString(dL)))

S=(djia[,i]+adj)^2

dS = skewness(S)[1]

hist(S,breaks=20,main=paste(colnames(djia)[i],"Square"),col='blue')

qqnorm(S,main=paste("skew=",toString(dS)))

SR=(djia[,i]+adj)^(1/2)  # exponent parenthesized: ^1/2 would halve the value instead of taking the square root

dSR = skewness(SR)[1]

hist(SR,breaks=20,main=paste(colnames(djia)[i],"Sqrt"),col='blue')

qqnorm(SR,main=paste("skew=",toString(dSR)))

Centering

v<-djia$FwdMReturn

vC<-scale(v,center=TRUE)

djia$FwdMReturnCtr<-vC

uvC<-(djia$FwdMReturnCtr*sd(djia$FwdMReturn))+mean(djia$FwdMReturn)

PCA

pcaModel <- prcomp(djia[,c('EMA21_ICtr', 'EMA63_RCtr', 'EMA126_RLogCtr', 'EMA21_Sl_SQ_NO_Ctr', 'EMA63_Sl_SQ_NO_Ctr', 'EMA126_SlZ_NEW', 'EMA63_SlZ_R', 'EMA126_SlZ_R', 'Vol21DayAnnLogCtr', 'Vol63DayAnnLogCtr', 'UnemDeltaLogAdjCtr', 'FedFund1MDeltaZ', 'FedFund1QDeltaZ', 'GDPGrowthCtr')], center = FALSE, scale = FALSE)

print(pcaModel)

summary(pcaModel)

plot(pcaModel, type = "l")

Linear Regression

fit<-lm(train$FwdMReturnCtr ~ train$PC1 + train$PC2 + train$PC3 + train$PC4)

summary(fit)

Logistic Regression

train<-subset(djia,djia$TrainSet > 1)

mylogit2 <- glm2(train$CycLC ~ train$PC1 + train$PC2 + train$PC3 + train$PC4, family="binomial")

summary(mylogit2)

p<-predict(mylogit2, train, type="response")

train$LogitPred<-p

g <- roc(CycLC ~ LogitPred, data = train)

plot(g)

val<-subset(djia,djia$TrainSet == 1)

val$LogOddsCycleL = 1.58972 -0.66736*val$PC1 + 0.11113*val$PC2 + 0.11268*val$PC3 -0.29539*val$PC4

val$OddsCycleL = exp(val$LogOddsCycleL)  # the logit link uses natural-log odds, so invert with exp(), not 10^

val$ProbCycleL = (val$OddsCycleL / (1+val$OddsCycleL ))

g <- roc(CycLC ~ ProbCycleL, data = val)

plot(g)

if (val$CycLC[j] == 1 && val[j,115+i] == 1) {val[j,124+i]='TP'}

if (val$CycLC[j] == 0 && val[j,115+i] == 0) {val[j,124+i]='TN'}

if (val$CycLC[j] == 1 && val[j,115+i] == 0) {val[j,124+i]='FN'}

if (val$CycLC[j] == 0 && val[j,115+i] == 1) {val[j,124+i]='FP'}
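
The row-by-row labelling above can also be written in vectorized form. The sketch below uses two short made-up vectors in place of `val$CycLC` and one thresholded prediction column; the names `actual`, `pred`, and `label` are illustrative only.

```r
# Hypothetical stand-ins for val$CycLC (actual) and one thresholded prediction column.
actual <- c(1, 0, 1, 0, 1, 1)
pred   <- c(1, 0, 0, 1, 1, 0)

# Label every observation at once instead of looping over rows.
label <- ifelse(actual == 1 & pred == 1, "TP",
         ifelse(actual == 0 & pred == 0, "TN",
         ifelse(actual == 1 & pred == 0, "FN", "FP")))

table(label)                                   # counts per confusion-matrix cell
```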

Multivariate Analysis of the Dow Jones Industrial Average