Quantitative Analysis, Risk Management, Modelling, Algo Trading, and Big Data Analysis

Non-Linear Cross-Bicorrelations between the Oil Prices and Stock Fundamentals

When we talk about correlations in finance, by default we assume linear relationships between two time-series that "co-move". In other words, if one time-series changes its values over a given time period, we look for a tight, immediate reflection of that change in the other time-series. If we find it, we say they are correlated. But wait a minute! Is the game always about looking for an immediate feedback between two time-series? Is "correlated" always a synonym for a direct, i.e. linear, response?

Think for a second about oil prices. If they rise within a quarter, the effect is not the same for every company. Certain firms are tightly linked to oil/petroleum prices that affect their product value, distribution, sales, etc., while for other companies, e.g. online services, a change in the crude oil price has a negligible effect. Or does it?

As a data scientist, quantitative (business) analyst, or investment researcher you need to know that mathematics and statistics deliver ample tools you may use to examine how two (or more) financial assets can be related, correlated, or connected. The wording does not need to be precise to reflect the goal of the investigation: correlation, in general.

Correlation can be linear or non-linear. Non-linearity of correlation is somewhat counterintuitive. We are used to thinking in a linear way; that is why it is so hard to readjust our thinking to a non-linear domain. Changes in oil prices might have a non-negligible effect on airlines, causing air-ticket prices to rise or fall due to a recalculated oil/petroleum surcharge. The effect may not be immediate. A rise in the oil price today can be "detected" with a time delay of days, weeks, or even months, i.e. the series are non-linearly correlated.

In this post we analyse both linear and non-linear correlation methods that have been (or could be) applied in the research on oil price tectonics across the markets. In particular, we first discuss one-factor and multiple linear regression models and then turn toward non-linearity that can be captured by cross-correlation and cross-bicorrelation methods. Developing an extensive but easy-to-follow Python code, we analyse a Quandl.com database storing over 620 stock financial factors (stock fundamentals). Finally, we show how both non-linear mathematical tools can be applied to detect non-linear correlations for different financial assets.

1. Linear Correlations

1.1. Oil Prices versus Stock Markets

Correlations between increasing/decreasing oil prices and stock markets have been a subject of investigation for a number of years. The initial interest was in the search for linear correlations between raw price/return time-series of stocks and oil benchmarks (e.g. WPI Crude Oil). The justification behind such an approach to data analysis seemed to be driven by common sense: an impact of rising oil prices should affect companies dependent on petroleum-based products or services.

The usual presumption states that a decline in the oil price is good news for the economy, especially for net oil importers, e.g. the USA or China. An increase in oil prices usually raises input costs for most businesses and forces consumers to spend more money on gasoline, thereby reducing the corporate earnings of other businesses. The opposite should be true when oil prices fall.

Oil prices are determined by supply and demand for petroleum-based products. During economic expansion, prices might rise as a result of increased consumption; they might fall as a result of increased production. That pattern has been observed on many occasions so far.

A quantitative approach to measuring the strength of the linear response of the stock market to the oil price can be demonstrated using plain linear regression. If we agree that the S&P 500 Index is a proxy for the US stock market, we should be able to capture the relationship. In the following example, we compare WPI Crude Oil prices with the S&P 500 Index over the 2-year period between Nov 2014 and Nov 2016. First we plot both time-series, and next we apply a one-factor linear regression model,
$$
y(t) = \beta_1 x(t) + \beta_0 + \epsilon(t)
$$ in order to fit the data best:

import numpy as np
import pandas as pd
from scipy import stats
from matplotlib import pyplot as plt
from pandas_datareader import data as web
 
%matplotlib inline
 
grey = 0.7, 0.7, 0.7
 
import warnings
warnings.filterwarnings('ignore')
 
# downloading S&P 500 Index from Yahoo! Finance
sp500 = web.DataReader("^GSPC", data_source='yahoo',
                  start='2014-11-01', end='2016-11-01')['Adj Close']
 
# WPI Crude Oil price-series
# src: https://fred.stlouisfed.org/series/DCOILWTICO/downloaddata
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d')
wpi = pd.read_csv("DCOILWTICO.csv", parse_dates=['DATE'], date_parser=dateparse)
wpi = wpi[wpi.VALUE != "."]
wpi.VALUE = wpi.VALUE.astype(float)
wpi.index = wpi.DATE
wpi.drop('DATE', axis=1, inplace=True) 
 
# combine both time-series
data = pd.concat([wpi, sp500], axis=1).dropna()  # and remove NaNs, if any
data.columns = ["WPI", "S&P500"]
print(data.head())
              WPI       S&P500
2014-11-03  78.77  2017.810059
2014-11-04  77.15  2012.099976
2014-11-05  78.71  2023.569946
2014-11-06  77.87  2031.209961
2014-11-07  78.71  2031.920044
plt.figure(figsize=(8,8))
# plot time-series
ax1 = plt.subplot(2,1,1)
plt.plot(data.WPI, 'k', label="WPI Crude Oil")
plt.legend(loc=3)
plt.ylabel("WPI Crude Oil [U$/barrel]")
 
# fit one-factor linear model
slope, intercept, r2_value, p_value, std_err = \
                  stats.linregress(data.WPI, data["S&P500"])
 
# best fit: model
xline = np.linspace(np.min(data.WPI), np.max(data.WPI), 100)
line = slope*xline + intercept
 
# plot scatter plot and linear model
ax2 = ax1.twinx()
plt.plot(data["S&P500"], 'm', label="S&P500")
plt.ylabel("S&P 500 Index [U$]")
plt.legend(loc="best")
plt.subplot(2,1,2)
plt.plot(data.WPI, data["S&P500"],'.', color=grey)
plt.plot(xline, line,'--')
plt.xlabel("WPI Crude Oil [U$/barrel]")
plt.ylabel("S&P 500 Index [U$]")
plt.title("R$^2$ = %.2f" % r2_value, fontsize=11)

Our Python code generates:
[Figure: WPI Crude Oil and S&P 500 Index price-series (top) and their scatter plot with the fitted one-factor linear model and $R^2$ (bottom)]
where we can see that the linear correlation between the WPI Crude Oil price and the S&P 500 Index is weak, with $R^2 = 0.36$ only. This is a static picture.

As one might suspect, correlation is time-dependent. Therefore, it is wiser to check a rolling linear correlation for a given data window size. Below, we assume its length to be 50 days:

rollcorr = pd.rolling_corr(wpi, sp500, 50).dropna()
 
plt.figure(figsize=(10,3))
plt.plot(rollcorr)
plt.grid()
plt.ylabel("Linear Correlation (50 day rolling window)", fontsize=8)

[Figure: 50-day rolling linear correlation between WPI Crude Oil and the S&P 500 Index]
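
Note that pd.rolling_corr has been removed from more recent pandas releases. A minimal equivalent using the newer .rolling() API (a sketch, assuming the wpi and sp500 objects built above) is:

# the same 50-day rolling correlation with the newer pandas API
rollcorr = wpi.VALUE.rolling(window=50).corr(sp500).dropna()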
In 2008 Andrea Pescatori measured changes in the S&P 500 and oil prices in the same way as demonstrated above. He noted that the variables occasionally moved in the same direction at the same time, but even then, the relationship remained weak. He concluded that there was no correlation at the 95% confidence level.

The lack of correlation might be attributed to factors such as wages, interest rates, industrial metals, plastics, and computer technology that can offset changes in energy costs. On the other hand, corporations might have become increasingly sophisticated at reading the futures markets and much better able to anticipate shifts in factor prices (e.g. a company should be able to switch production processes to compensate for added fuel costs).

No matter what we analyse, either oil prices vs. market indexes or vs. individual stock trading records, the linear correlation method is only able to measure the immediate, 1-to-1 response of $y$ to $x$. The lower the correlation coefficient, the less valid the linear model is as a descriptor of the true events and mutual relationships under study.
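
A toy illustration of that limitation (hypothetical, randomly generated data, not market prices): two series related with a 5-period delay show an almost zero contemporaneous Pearson correlation, while the same coefficient computed at the proper lag is close to one:

import numpy as np
 
np.random.seed(1)
x = np.random.randn(500)                            # the "driver" series
y = 0.9*np.roll(x, 5) + 0.1*np.random.randn(500)    # responds to x with a 5-period lag
y[:5] = np.random.randn(5)                          # overwrite the wrap-around values
 
print(np.corrcoef(x, y)[0, 1])           # contemporaneous correlation: close to 0
print(np.corrcoef(x[:-5], y[5:])[0, 1])  # correlation at lag 5: close to 1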

1.2. Multiple Linear Regression Model for Oil Price Changes

In February 2016 Ben S. Bernanke noted that a positive correlation of stocks and oil might arise because both are responding to an underlying shift in global demand. Using a simple multiple linear regression model he attempted to explain the daily changes of oil prices and concluded that over 90% of them could be explained by changes in commodity prices (copper), the US dollar (spot price), and the 10-yr Treasury interest rate, and over 95% by extending the model with the daily changes in VIX:
$$
\Delta p_{oil, t} = \beta_1 \Delta p_{copp, t} + \beta_2 \Delta p_{10yr, t} + \beta_3 \Delta p_{USD, t} + \beta_4 \Delta p_{VIX, t} + \beta_0
$$ The premise is that commodity prices, long-term interest rate, and USD are likely to respond to investors’ perceptions of global and US demand for oil. Additionally, the presence of VIX is in line with the following idea: if investors retreat from commodities and/or stocks during periods of high uncertainty/risk aversion, then shocks to volatility may be another reason for the observed tendency of stocks and oil prices moving together.

The model, inspired by the work of James Hamilton (2014), aims at measuring the effect of demand shifts on the oil market. We can code it in Python as follows. First, we download all necessary price-series to files and process them within pandas DataFrames (more on the quantitative aspects of pandas for quants you will find in my upcoming book Python for Quants. Volume II.).

import numpy as np
import pandas as pd
from scipy import stats
import datetime
from datetime import datetime
from matplotlib import pyplot as plt
import statsmodels.api as sm
 
%matplotlib inline
 
grey = 0.7, 0.7, 0.7
 
import warnings
warnings.filterwarnings('ignore')
 
# Download WPI Crude Oil prices
# src: https://fred.stlouisfed.org/series/DCOILWTICO/downloaddata
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d')
wpi = pd.read_csv("DCOILWTICO.csv", parse_dates=['DATE'], date_parser=dateparse)
wpi = wpi[wpi.VALUE != "."]
wpi.VALUE = wpi.VALUE.astype(float)
wpi.index = wpi.DATE
wpi.drop('DATE', axis=1, inplace=True)
 
# Copper Futures 
# src: http://www.investing.com/commodities/copper-historical-data
cop = pd.read_csv("COPPER.csv", index_col=["Date"])
cop.index = pd.to_datetime(pd.Series(cop.index), format='%b %d, %Y')
 
# Nominal interest rate on 10-year Treasury bonds
# src: https://fred.stlouisfed.org/series/DGS10/
tb = pd.read_csv('DGS10.csv')
tb = tb[tb.DGS10 != "."]
tb.DGS10 = tb.DGS10.astype(float)
tb['DATE'] =  pd.to_datetime(tb['DATE'], format='%Y-%m-%d')
tb.index = tb.DATE
tb.drop('DATE', axis=1, inplace=True)
 
# Trade Weighted U.S. Dollar Index
# src: https://fred.stlouisfed.org/series/DTWEXM
usd = pd.read_csv('DTWEXM.csv')
usd = usd[usd.DTWEXM != "."]
usd.DTWEXM = usd.DTWEXM.astype(float)
usd['DATE'] =  pd.to_datetime(usd['DATE'], format='%Y-%m-%d')
usd.index = usd.DATE
usd.drop('DATE', axis=1, inplace=True)
 
# CBOE Volatility Index: VIX© (VIXCLS)
# src: https://fred.stlouisfed.org/series/VIXCLS/downloaddata
vix = pd.read_csv('VIXCLS.csv')
vix = vix[vix.VALUE != "."]
vix.VALUE = vix.VALUE.astype(float)
vix['DATE'] =  pd.to_datetime(vix['DATE'], format='%Y-%m-%d')
vix.index = vix.DATE
vix.drop('DATE', axis=1, inplace=True)

A plain use of the pd.read_csv function does not always work the way we expect, therefore some extra steps need to be implemented in order to convert the .csv file content into a DataFrame: correct parsing of the dates, conversion of the prices from string to float, and replacement of the default index with the data from the 'DATE' column followed by its removal.
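
Much of that boilerplate can also be pushed into pd.read_csv itself; a minimal sketch for the same DCOILWTICO.csv file (the only assumption being that "." is the file's marker for missing values):

# a more compact variant: parse dates, set the index and treat "." as NaN in one call
wpi = pd.read_csv("DCOILWTICO.csv", parse_dates=['DATE'], index_col='DATE',
                  na_values='.').dropna()
# the VALUE column is then already parsed as float, no manual conversion required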

As a side note, in order to download the copper data we need to manually select the whole table on the website, paste it into Excel or Mac's Numbers, trim it if required, and save it down to a .csv file. A problem might appear as the dates are given in the "Nov 26, 2016" format. For this case, the following trick in Python saves a lot of time:

from datetime import datetime
expr = 'Nov 26, 2016'
datetime.strptime(expr, '%b %d, %Y')
datetime.datetime(2016, 11, 26, 0, 0)

which justifies the use of the %b descriptor (see Python's datetime documentation for more on date formats and parsing).

pandas DataFrames are a fast and convenient way to concatenate multiple time-series with indexes set to calendar dates. The following step ensures that all five price-series have data points on the same dates and, if any data point (for any time-series) is missing (NaN), the whole row is removed from the resulting DataFrame df thanks to the .dropna() function:

df = pd.concat([wpi, cop, tb, usd, vix], join='outer', axis=1).dropna()
df.columns = ["WPI", "Copp", "TrB", "USD", "VIX"]
 
print(df.head(10))
              WPI   Copp   TrB      USD    VIX
2007-11-15  93.37  3.082  4.17  72.7522  28.06
2007-11-16  94.81  3.156  4.15  72.5811  25.49
2007-11-19  95.75  2.981  4.07  72.7241  26.01
2007-11-20  99.16  3.026  4.06  72.4344  24.88
2007-11-21  98.57  2.887  4.00  72.3345  26.84
2007-11-23  98.24  2.985  4.01  72.2719  25.61
2007-11-26  97.66  3.019  3.83  72.1312  28.91
2007-11-27  94.39  2.958  3.95  72.5061  26.28
2007-11-28  90.71  3.001  4.03  72.7000  24.11
2007-11-29  90.98  3.063  3.94  72.6812  23.97

Given all price-series and our model in mind, we are interested in running the regression on the daily changes in the natural logarithm of the crude oil price, copper price, 10-yr interest rate, USD spot price, and VIX. The use of log returns is a common practice in finance and economics. In contrast to the daily percentage change, we derive the log returns in Python in the following way:

np.random.seed(7)
df = pd.DataFrame(100 + np.random.randn(100).cumsum(), columns=['price'])
df['pct_change'] = df.price.pct_change()
df['log_ret'] = np.log(df.price) - np.log(df.price.shift(1))
 
print(df.head())
        price  pct_change   log_ret
0  101.690526         NaN       NaN
1  101.224588   -0.004582 -0.004592
2  101.257408    0.000324  0.000324
3  101.664925    0.004025  0.004016
4  100.876002   -0.007760 -0.007790

Having that, we continue coding our main model:

headers = df.columns
 
dlog = pd.DataFrame()  # log returns Dateframe
 
for j in range(df.shape[1]):
    price = df.iloc[:, j]
    dl = np.log(price) - np.log(price.shift(1))
    dlog[j] = dl
 
dlog.dropna(inplace=True)
dlog.columns = headers
print(dlog.head())
                 WPI      Copp       TrB       USD       VIX
2007-11-16  0.015305  0.023727 -0.004808 -0.002355 -0.096059
2007-11-19  0.009866 -0.057047 -0.019465  0.001968  0.020195
2007-11-20  0.034994  0.014983 -0.002460 -0.003992 -0.044417
2007-11-21 -0.005968 -0.047024 -0.014889 -0.001380  0.075829
2007-11-23 -0.003353  0.033382  0.002497 -0.000866 -0.046910

and run the multiple linear regression model on in-sample data, here selected as the 2-year in-sample period between Nov 2012 and Nov 2014:

dlog0 = dlog.copy()
 
# in-sample data selection
dlog = dlog[(dlog.index > "2012-11-01") & (dlog.index < "2014-11-01")]
 
# define input data
y = dlog.WPI.values
x = [dlog.Copp.values, dlog.TrB.values, dlog.USD.values, dlog.VIX.values ]
 
# Multiple Linear Regression using 'statsmodels' library
def reg_m(y, x):
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results
 
# display a summary of the regression
print(reg_m(y, x).summary())
                           OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.105
Model:                            OLS   Adj. R-squared:                  0.098
Method:                 Least Squares   F-statistic:                     14.49
Date:                Tue, 29 Nov 2016   Prob (F-statistic):           3.37e-11
Time:                        11:38:44   Log-Likelihood:                 1507.8
No. Observations:                 499   AIC:                            -3006.
Df Residuals:                     494   BIC:                            -2985.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            -0.0303      0.008     -3.800      0.000        -0.046    -0.015
x2            -0.3888      0.178     -2.180      0.030        -0.739    -0.038
x3             0.0365      0.032      1.146      0.252        -0.026     0.099
x4             0.2261      0.052      4.357      0.000         0.124     0.328
const      -3.394e-05      0.001     -0.064      0.949        -0.001     0.001
==============================================================================
Omnibus:                       35.051   Durbin-Watson:                   2.153
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               75.984
Skew:                          -0.392   Prob(JB):                     3.16e-17
Kurtosis:                       4.743   Cond. No.                         338.
==============================================================================
 
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is 
    correctly specified.

A complete model is given by the following parameters:

par = reg_m(y, x).params
print(par)
[ -3.02699345e-02  -3.88784262e-01   3.64964420e-02   2.26134337e-01
  -3.39418282e-05]

or more formally as:
$$
\Delta p_{oil, t} = 0.2261 \Delta p_{copp, t} + 0.0365 \Delta p_{10yr, t} -0.3888 \Delta p_{USD, t} -0.0303 \Delta p_{VIX, t}
$$ where the constant term is negligible. The adjusted $R^2$ is near zero and the p-value for $\beta_2$ (the 10-yr rate coefficient) exceeds 0.05, i.e. it is not statistically significant.
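
Since reg_m stacks the regressors in reverse order, mapping x1 to x4 back to the factor names in the summary is error-prone. A hedged alternative (a sketch only, assuming the in-sample dlog DataFrame from above) is to pass a named design matrix to statsmodels, so that the coefficients are reported under their own labels; the estimates should match those above, as the regressor set is identical:

# alternative: named regressors, so that OLS labels the coefficients directly
X = sm.add_constant(dlog[["Copp", "TrB", "USD", "VIX"]])
res = sm.OLS(dlog.WPI, X).fit()
print(res.params)   # e.g. res.params["USD"] is the beta of the dollar index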

In order to use our model for "future" oil price prediction based on price movements in the commodity (copper), the long-term interest rate, the USD spot price, and the VIX index, we generate an out-of-sample data set and construct the estimate in the following way:

# out-of-sample log return Dataframe
dlog_oos = dlog0[dlog0.index >= '2014-11-01']
 
# what is the last WPI Crude Oil price in in-sample
last = df[df.index == "2014-10-31"].values[0][0]  # 80.53 USD/barrel
 
oilreg = pd.DataFrame(columns=["Date", "OilPrice"])  # regressed oil prices
par = reg_m(y, x).params  # parameters from regression
 
for i in range(dlog_oos.shape[0]):
    x1 = dlog_oos.iloc[i]["VIX"]
    x2 = dlog_oos.iloc[i]["USD"]
    x3 = dlog_oos.iloc[i]["TrB"]
    x4 = dlog_oos.iloc[i]["Copp"]
    w = par[0]*x1 + par[1]*x2 + par[2]*x3 + par[3]*x4 + par[4]
    if(i==0):
        oilprice = np.exp(w) * last
    else:
        oilprice = np.exp(w) * oilprice
    oilreg.loc[len(oilreg)] = [dlog_oos.index[i], oilprice]
 
oilreg.index = oilreg.Date
oilreg.drop('Date', axis=1, inplace=True)

where the oilreg DataFrame stores dates and regressed oil prices. Visualisation of the results

plt.figure(figsize=(10,5))
plt.plot(df.WPI, color=grey, label="WPI Crude Oil")
plt.plot(oilreg, "r", label="Predicted Oil Price")
plt.xlim(['2012-01-01', '2016-11-01'])  # in-sample & out-of-sample
plt.ylim([20, 120])
plt.ylabel("USD/barrel")
plt.legend(loc=3)
plt.grid()

delivers
[Figure: WPI Crude Oil price (grey) and the model-predicted oil price (red) over the in-sample and out-of-sample period 2012-2016]
Bernanke's multiple regression model points at the following idea: when stock traders respond to a change in oil prices, they do so not necessarily because the oil movement is consequential in itself. Our finding (for different data sets than Bernanke's) points at a lack of strong dependence on VIX, and thus on traders' activity. Again, as in the case of the one-factor regression model discussed in Section 1.1, the strength of the underlying correlations with all four considered factors varies in time.

2. Non-Linear Correlations

So far we have understood how fragile the quantification and interpretation of the linear correlation between two time-series can be. We agreed on its time-dependent character and on how the data sample size may influence the derived results. Linear correlation is a tool that captures the 1-to-1 feedback between two data sets. What if the response is lagged? We need to look for other tools good enough for this challenge. In this Section we will analyse two of them.

2.1. Cross-Correlation and Cross-Bicorrelation

If a change in one quantity has a noticeable effect on another one, this can be observed with or without a time delay. In time-series analysis we talk about one time-series lagging (or leading) the other one. Since an existing lag (lead) is not observed directly as a linear feedback, it introduces a non-linear relationship between the two time-series.

A degree of non-linear correlation can be measured by the application of two independent methods, namely cross-correlation and cross-bicorrelation, designed to examine the data dependence between any two (three) values led/lagged by $r$ ($s$) periods. By a period one should understand the sampling frequency of the underlying time-series, for example a quarter, i.e. a 3-month period.

Let $x(t) \equiv \{ x_t \} = \{ x_1, x_2, \ldots, x_N \}$ for $t=1,\ldots,N$ represent, e.g., the WPI Crude Oil price-series, and $\{ y_t \}$ the time-series of a selected quantity. For two time-series, $x(t)$ and $y(t)$, the cross-correlation is defined as:
$$
C_{xy}(r) = (N-r)^{-1} \sum_{i=1}^{N-r} x(t_i) y(t_i + r)
$$ whereas the cross-bicorrelation is usually assumed as:
$$
C_{xyy}(r,s) = (N-m)^{-1} \sum_{i=1}^{N-m} x(t_i) y(t_i + r) y(t_i + s)
$$ where $m=\max(r,s)$ (Brooks & Hinich 1999). Both formulae measure the inner product between the individual values of two time-series. In case of cross-correlation $C_{xy}(r)$ it is given by $\langle x(t_i) y(t_i + r) \rangle$ where the value of $y(t)$ is lagged by $r$ periods. The summation, similarly as in the concept of the Fourier transform, picks up the maximum power for a unique set of lags $r$ or $(r,s)$ for $C_{xy}(r)$ and $C_{xyy}(r,s)$, respectively. The use of cross-correlation method should return the greatest value of $C_{xy}(r)$ among all examined $r$’s. Here, $1 \le r < N$ in a first approximation. It reveals a direct non-linear correlation between $x(t_i)$ and $y(t_i + r)$ $\forall i$.

The concept of cross-bicorrelation goes one step further and examines the relationship between $x(t_i)$ and both $y(t_i + r)$ and $y(t_i + s)$, e.g. how the current oil price at time $t_i$, $x(t_i)$, affects $y$ lagged by $r$ and $s$ independent time periods. Note that cross-bicorrelation also allows for the $C_{xxy}(r,s)$ form, i.e. if we wish to look for a relationship between the current and lagged (by $r$) values of the $x$ time-series and the $y$ time-series lagged by $s$.
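
For completeness, the $C_{xxy}(r,s)$ variant can be coded analogously to the $C_{xyy}$ function used in Section 2.3 below (a sketch only; it is not used in the analysis that follows):

# cross-bicorrelation of the C_xxy(r,s) form: x lagged by r, y lagged by s
def Cxxy(x, y, r, s, N):
    z = 0
    m = np.max([r, s])
    for i in range(0, N-m-1):
        z += x[i] * x[i+r] * y[i+s]
    z /= (N-m)
    return z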

2.2. Statistical Tests for Cross-Correlation and Cross-Bicorrelation

The application of both tools to small data samples ($N < 50$) does not guarantee statistically significant non-linear correlations every single time the computations are done. The need for statistical testing of the results returned by both cross-correlation and cross-bicorrelation methods has been addressed by Hinich (1996) and Brooks & Hinich (1999). For two time-series with weak stationarity, the null hypothesis $H_0$ of the test is that the two series are independent pure white noise processes, against the alternative hypothesis $H_1$ that some cross-covariances, $E[x(t_i) y(t_i + r)]$, or cross-bicovariances, $E[x(t_i) y(t_i + r) y(t_i + s)]$, are nonzero. The test statistics, respectively for cross-correlation and cross-bicorrelation, are given by:
$$
H_{xy}(N) = \sum_{r=1}^{L} (N-L) C_{xy}^2(r) = (N-L) \sum_{r=1}^{L} C_{xy}^2(r)
$$ and
$$
H_{xyy}(N) = {\sum_{s=-L}^{L}}' \sum_{r=1}^{L} (N-m)\, C_{xyy}^2(r,s) \ \ \ \ \left({}':\ s \ne -1, 0, 1 \right)
$$ where it has been shown by Hinich (1996) that:
$$
H_{xy}(N) \sim \chi^2_L
$$ and
$$
H_{xyy}(N) \sim \chi^2_{(L-1)(L/2)}
$$ where $\sim$ means asymptotically distributed with the corresponding degrees of freedom for $N \rightarrow \infty$, and $m = \max(r, s)$ as previously. For both the $H_{xy}(N)$ and $H_{xyy}(N)$ statistics, the number of lags $L$ is specified as $L = N^b$ where $0 < b < 0.5$. Based on the results of Monte-Carlo simulations, Hinich & Patterson (1995) recommended the use of $b=0.4$ in order to maximise the power of the test while ensuring a valid approximation to the asymptotic theory.

Further down, we compute the $H_{xy}(N)$ and $H_{xyy}(N)$ statistics for every cross-correlation and cross-bicorrelation and determine the corresponding significance, assuming a confidence level of 90% ($\alpha = 0.1$): a result is significant if $$ \mbox{p-value} < \alpha $$ where $$ \mbox{p-value} = 1 - \mbox{Pr} \left[ \chi^2_{\mbox{dof}} < H_{x(y)y} \,|\, H_0 \right] \ . $$ Given that, any resultant value of cross-correlation or cross-bicorrelation can be regarded as statistically significant if its p-value $< \alpha$. A recommended number of lags to investigate is $L = N^{0.4}$ where $N$ is the length of the $x(t)$ and $y(t)$ time-series. Employing floor rounding for $L$ we obtain the following relationship between $N$ and $L$ for sample sizes of $N \le 50$:

L = []
for N in range(1, 51):
    L.append(np.floor(N**0.4))
 
fig, ax = plt.subplots(figsize=(8,3))
plt.plot(np.arange(1, 51), L, '.')
plt.xlabel("N")
plt.ylabel("L")
ax.margins(x=0.025, y=0.05) # add extra padding
ax.tick_params(axis='y',which='minor',left='off') # without minor yticks
plt.grid(b=True, which='minor', color='k', linestyle=':', alpha=0.3)
plt.minorticks_on()

[Figure: the number of lags $L = \lfloor N^{0.4} \rfloor$ as a function of the sample size $N \le 50$]

2.3. Cross-Correlation and Cross-Bicorrelation in Python

Let's translate cross-correlation into Python and run a simple test on some sample time-series:

# cross-correlation
def Cxy(x, y, r, N):
    z = 0
    for i in range(0, N-r-1):
        z += x[i] * y[i+r]
    z /= (N-r)
    return z
 
# test statistic for cross-correlation
def Hxy(x, y, L, N):
    z = 0
    for r in range(1, L+1):
        z += Cxy(x, y, r, N)**2
    z *= (N - L)   
    return z

where we defined two functions deriving $C_{xy}(r)$ and $H_{xy}(N)$, respectively. Now, let's consider some simplified time-series of $N=16$ points, both normalised to zero mean and unit variance:

# sample data
y = np.array([0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0])
x = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
 
# normalise such that x, y ~ N(0,1)
x = (x - np.mean(x))/np.std(x, ddof=1)
y = (y - np.mean(y))/np.std(y, ddof=1)

As one can see, by construction we expect the strongest cross-correlation at a lag of $r = 2$, with a weaker one at $r = 1$ to be detected too. We verify the overall statistical significance of the cross-correlations for the given time-series in the following way:

import scipy.stats

N = len(x)
L = int(np.floor(N**0.4))
 
for r in range(1, L+1):
    print("r =%2g\t Cxy(r=%g) = %.4f" % (r, r, Cxy(x, y, r, N)))
 
dof = L    
pvalue = 1 - scipy.stats.chi2.cdf(Hxy(x, y, L, N), dof)
alpha = 0.1
 
print("Hxy(N) = %.4f" % Hxy(x, y, L, N))
print("p-value = %.4f" % pvalue)
print("H0 rejected in favour of H1 at %g%% c.l.: %s" % 
      (100*(1-alpha), pvalue<alpha))

confirming both our expectations and significance of detected cross-correlations at 90% confidence level:

r = 1	 Cxy(r=1) = 0.0578
r = 2	 Cxy(r=2) = 0.8358
r = 3	 Cxy(r=3) = -0.3200
Hxy(N) = 10.4553
p-value = 0.0151
H0 rejected in favour of H1 at 90% c.l.: True

For cross-bicorrelation we have:

# cross-bicorrelation
def Cxyy(x, y, r, s, N):
    z = 0
    m = np.max([r, s])
    for i in range(0, N-m-1):
        z += x[i] * y[i+r] * y[i+s]
    z /= (N-m)
    return z
 
# test statistic for cross-bicorrelation
def Hxyy(x, y, L, N):
    z = 0
    for s in range(2, L+1):
        for r in range(1, L+1):
            m = np.max([r, s])
            z += (N-m) * Cxyy(x, y, r, s, N)**2
    return z    
 
 
# sample data
y = np.array([0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0])
x = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0])
 
# normalise such that x, y ~ N(0,1)
x = (x - np.mean(x))/np.std(x, ddof=1)
y = (y - np.mean(y))/np.std(y, ddof=1)
 
N = len(x)  # len(x) should be equal to len(y)
L = int(np.floor(N**0.4))
 
for r in range(0, L+1):
    for s in range(r+1, L+1):
        print("r =%2g, s =%2g\t Cxyy(r=%g, s=%g) = %.5f" % 
             (r, s, r, s, Cxyy(x, y, r, s, N)))
 
dof = (L-1)*(L/2)
pvalue = 1 - scipy.stats.chi2.cdf(Hxyy(x, y, L, N), dof)
alpha = 0.1
 
print("Hxyy(N) = %.4f" % Hxyy(x, y, L, N))
print("p-value = %.4f" % pvalue)
print("H0 rejected in favour of H1 at %g%% c.l.: %s" % 
      (100*(1-alpha), pvalue<alpha))

where for the new sample data we expect to detect the two strongest cross-bicorrelations for the lag pairs $(r,s)$ equal to $(2,3)$ and $(1,2)$, respectively. By running the code we obtain indeed that:

r = 0, s = 1	 Cxyy(r=0, s=1) = -0.18168
r = 0, s = 2	 Cxyy(r=0, s=2) = -0.23209
r = 0, s = 3	 Cxyy(r=0, s=3) = 0.05375
r = 1, s = 2	 Cxyy(r=1, s=2) = 0.14724
r = 1, s = 3	 Cxyy(r=1, s=3) = -0.22576
r = 2, s = 3	 Cxyy(r=2, s=3) = 0.18276
Hxyy(N) = 5.3177
p-value = 0.0833
H0 rejected in favour of H1 at 90% c.l.: True

With these two examples we have shown both the cross-correlation and cross-bicorrelation methods in action. We have also pointed at the important dependency between the length of the time-series and the number of lags to be examined: the longer the time-series, the higher the number of lags and, therefore, degrees of freedom.

Given that state of play, there are two independent and appealing ways to apply cross-correlation and cross-bicorrelation in finance: (1) the first one regards the examination of time-series with a small number of data points (e.g. $N \le 50$); (2) alternatively, if we deal with long ($N \ge 1000$) time-series, the common practice is to divide them into $k$ chunks (non-overlapping windows) of length $N_i$ of at least 50 to 100 points, where $i=1,\ldots,k$, and compute the cross-correlation and cross-bicorrelation metrics for every data window separately, as sketched below. In consequence, within this approach, we gain an opportunity to report on time-period-dependent data segments with statistically significant (or not) non-linear correlations (see e.g. Coronado et al. 2015).
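
A minimal sketch of that second, windowed approach (assuming the Cxy and Hxy functions defined above, the numpy/scipy imports from the earlier listings, and an assumed chunk length of 50 points):

# split two long series into non-overlapping windows and test each window separately
def windowed_Hxy(x, y, chunk=50, alpha=0.1):
    results = []
    k = len(x) // chunk                    # number of complete windows
    for j in range(k):
        xw = x[j*chunk:(j+1)*chunk]
        yw = y[j*chunk:(j+1)*chunk]
        # standardise each window to zero mean and unit variance
        xw = (xw - np.mean(xw)) / np.std(xw, ddof=1)
        yw = (yw - np.mean(yw)) / np.std(yw, ddof=1)
        N = len(xw)
        L = int(np.floor(N**0.4))
        stat = Hxy(xw, yw, L, N)
        pval = 1 - scipy.stats.chi2.cdf(stat, L)
        results.append((j, stat, pval, pval < alpha))
    return results   # (window index, H statistic, p-value, significant?)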

In the Section that follows, we solely focus on the application of the first approach. It is rewarding in many aspects if you're open-minded.

2.4. Detecting Non-Linearity between Oil Prices and Stock Fundamentals

Since looking for a direct correlation between crude oil prices and stocks can be regarded as a first approach to the examination of their mutual relationships, it's not sexy; not any more. With the cross-correlation and cross-bicorrelation methods you can step up and break the boredom.

The stock fundamentals constitute a broad space of parameters that deserve to be cross-correlated with the oil prices. It is intuitive that any (if any) effect of rising or falling crude oil prices may affect different companies with various time lags. Moreover, for some businesses we might detect a non-linear relationship existing only between oil and, e.g., the company's revenues, whereas for others changing oil prices might affect more than one factor. If the pattern is indeed present, then with the application of the cross-correlation and cross-bicorrelation methods you gain a brand new tool for data analysis. Let's have a look at a practical example.

As previously, we will need both the WPI Crude Oil price-series (see Section 1.1) and access to a large database containing companies' fundamentals over a number of quarters or years. For the latter let me use a demo version of Quandl.com's Zacks Fundamentals Collection C (download the Zacks Fundamentals Expanded ZACKS-FE.csv file). It contains 852 rows of data for 30 selected companies traded on US stock markets (NYSE, NASDAQ, etc.). This sample collection covers data back to 2011, i.e. ca. 22 to 23 quarterly reported financials per stock. There are over 620 fundamentals to select from; their full list can be found in the ZFC Definitions csv file. Let's put it all in a new Python listing:

# Non-Linear Cross-Bicorrelations between the Oil Prices and Stock Fundamentals
# (c) 2016 QuantAtRisk.com, by Pawel Lachowicz
 
import numpy as np
import pandas as pd
import scipy.stats
import datetime
from matplotlib import pyplot as plt
 
%matplotlib inline
 
grey = 0.7, 0.7, 0.7
 
import warnings
warnings.filterwarnings('ignore')
 
 
# WTI Crude Oil price-series
# src: https://fred.stlouisfed.org/series/DCOILWTICO/downloaddata
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d')
wpi = pd.read_csv("DCOILWTICO.csv", parse_dates=['DATE'], date_parser=dateparse)
wpi = wpi[wpi.VALUE != "."]
wpi.VALUE = wpi.VALUE.astype(float)
 
# Zacks Fundamentals Collection C (Fundamentals Expanded)
# src: https://www.quandl.com/databases/ZFC
df = pd.read_csv('ZACKS-FE.csv')
 
col = ['Ticker', 'per_end_date', 'per_code', 'Revenue', "EBIT", 
          "GrossProfit", "CAPEX"]
data = pd.DataFrame(columns=col)  # create an empty DataFrame
 
tickers = ["XOM", "WMT", "BA", "AAPL", "JNJ"]
 
for i in range(df.shape[0]):
    ticker = df.iloc[i]["ticker"]
    pc = df.iloc[i]["per_code"]
    # extract only reported fundamentals for quarters "QR0 and all "QR-"
    # in total 22 to 23 rows of data per stock
    if((pc[0:3] == "QR-") | (pc[0:3] == "QR0")) and (ticker in tickers):
        data.loc[len(data)] = [df.iloc[i]["ticker"],
                               df.iloc[i]["per_end_date"],
                               df.iloc[i]["per_code"],
                               df.iloc[i]["tot_revnu"],
                               df.iloc[i]["ebit"],
                               df.iloc[i]["gross_profit"],
                               df.iloc[i]["cap_expense"]] 
 
print(data.head(25))

where we created a new DataFrame, data, storing the Total Revenue, EBIT, Gross Profit (revenue from operations less associated costs), and CAPEX (capital expenditure) for a sample of 5 stocks (XOM, WMT, BA, AAPL, and JNJ) out of the 30 available in the demo collection. Our goal here is just to show a path for a possible analysis of all those numbers, with an encouragement to build a bigger tool for stock screening based on the detection of non-linear correlations amongst selected fundamentals and oil prices.

The first 25 lines of the data DataFrame are:

   Ticker per_end_date per_code  Revenue     EBIT  GrossProfit    CAPEX
0    AAPL   2011-03-31    QR-22  24667.0   7874.0      10218.0  -1838.0
1    AAPL   2011-06-30    QR-21  28571.0   9379.0      11922.0  -2615.0
2    AAPL   2011-09-30    QR-20  28270.0   8710.0      11380.0  -4260.0
3    AAPL   2011-12-31    QR-19  46333.0  17340.0      20703.0  -1321.0
4    AAPL   2012-03-31    QR-18  39186.0  15384.0      18564.0  -2778.0
5    AAPL   2012-06-30    QR-17  35023.0  11573.0      14994.0  -4834.0
6    AAPL   2012-09-30    QR-16  35966.0  10944.0      14401.0  -8295.0
7    AAPL   2012-12-31    QR-15  54512.0  17210.0      21060.0  -2317.0
8    AAPL   2013-03-31    QR-14  43603.0  12558.0      16349.0  -4325.0
9    AAPL   2013-06-30    QR-13  35323.0   9201.0      13024.0  -6210.0
10   AAPL   2013-09-30    QR-12  37472.0  10030.0      13871.0  -8165.0
11   AAPL   2013-12-31    QR-11  57594.0  17463.0      21846.0  -1985.0
12   AAPL   2014-03-31    QR-10  45646.0  13593.0      17947.0  -3367.0
13   AAPL   2014-06-30     QR-9  37432.0  10282.0      14735.0  -5745.0
14   AAPL   2014-09-30     QR-8  42123.0  11165.0      16009.0  -9571.0
15   AAPL   2014-12-31     QR-7  74599.0  24246.0      29741.0  -3217.0
16   AAPL   2015-03-31     QR-6  58010.0  18278.0      23656.0  -5586.0
17   AAPL   2015-06-30     QR-5  49605.0  14083.0      19681.0  -7629.0
18   AAPL   2015-09-30     QR-4  51501.0  14623.0      20548.0 -11247.0
19   AAPL   2015-12-31     QR-3  75872.0  24171.0      30423.0  -3612.0
20   AAPL   2016-03-31     QR-2  50557.0  13987.0      19921.0  -5948.0
21   AAPL   2016-06-30     QR-1  42358.0  10105.0      16106.0  -8757.0
22   AAPL   2016-09-30      QR0  46852.0  11761.0      17813.0 -12734.0
23     BA   2011-03-31    QR-22  14910.0   1033.0       2894.0   -417.0
24     BA   2011-06-30    QR-21  16543.0   1563.0       3372.0   -762.0

We obtain an exemplary visualisation of both data sets for a selected company (here: Exxon Mobil Corporation, XOM, engaged in the exploration and production of crude oil and natural gas, the manufacturing of petroleum products, and the transportation and sale of crude oil, natural gas, and petroleum products) with:

fig, ax1 = plt.subplots(figsize=(10,5))
plt.plot(wpi.DATE, wpi.VALUE, color=grey, label="WPI Crude Oil (daily)")
plt.grid(True)
plt.legend(loc=1)
plt.ylabel("USD per barrel")
 
selection = ["XOM"]
 
ax2 = ax1.twinx()
plt.ylabel("1e6 USD")
for ticker in selection:
    tmp = data[data.Ticker == ticker]
    lab = ticker
    tmp['per_end_date'] =  pd.to_datetime(tmp['per_end_date'], 
                                                        format='%Y-%m-%d')
    plt.plot(tmp.per_end_date, tmp.Revenue, '.-', label=lab + ": Revenue")
    plt.plot(tmp.per_end_date, tmp.EBIT, '.-g', label=lab + ": EBIT")
    plt.plot(tmp.per_end_date, tmp.GrossProfit, '.-m', label=lab + ": Gross Profit")
    plt.plot(tmp.per_end_date, tmp.CAPEX, '.-r', label=lab + ": CAPEX")
    plt.legend(loc=3)

[Figure: daily WPI Crude Oil prices (grey, left axis) and XOM's quarterly Revenue, EBIT, Gross Profit, and CAPEX (right axis)]
Every company has different reporting dates (per_end_date) for its financials. That is why, when comparing selected fundamentals with oil prices, we need to make sure the oil price is picked up around the date defined by per_end_date. Let me use a $\pm 5$ day average around those points:

# cross-correlation
def Cxy(x, y, r, N):
    z = 0
    for i in range(0, N-r-1):
        z += x[i] * y[i+r]
    z /= (N-r)
    return z
 
# test statistic for cross-correlation
def Hxy(x, y, L, N):
    z = 0
    for r in range(1, L+1):
        z += Cxy(x, y, r, N)**2
    z *= (N - L)   
    return z
 
# cross-bicorrelation
def Cxyy(x, y, r, s, N):
    z = 0
    m = np.max([r, s])
    for i in range(0, N-m-1):
        z += x[i] * y[i+r] * y[i+s]
    z /= (N-m)
    return z
 
def Hxyy(x, y, L, N):
    z = 0
    for s in range(2, L+1):
        for r in range(1, L+1):
            m = np.max([r, s])
            z += (N-m) * Cxyy(x, y, r, s, N)**2
    return z
 
 
fig, ax1 = plt.subplots(figsize=(10,5))
plt.plot(wpi.DATE, wpi.VALUE, color=grey, label="WPI Crude Oil (daily)")
plt.grid(True)
plt.legend(loc=1)
plt.ylabel("USD per barrel")
 
selection = ["XOM"]
 
ax2 = ax1.twinx()
plt.ylabel("1e6 USD")
for ticker in selection:
    tmp = data[data.Ticker == ticker]
    lab = ticker
    tmp['per_end_date'] =  pd.to_datetime(tmp['per_end_date'], 
                                               format='%Y-%m-%d')
    plt.plot(tmp.per_end_date, tmp.Revenue, '.-', label=lab + ": Revenue")
    plt.plot(tmp.per_end_date, tmp.EBIT, '.-g', label=lab + ": EBIT")
    plt.plot(tmp.per_end_date, tmp.GrossProfit, '.-m', label=lab + ": Gross Profit")
    plt.plot(tmp.per_end_date, tmp.CAPEX, '.-r', label=lab + ": CAPEX")
    plt.legend(loc=3)
 
    col = ['per_end_date', 'avgWPI']
    wpi2 = pd.DataFrame(columns=col)  # create an empty DataFrame
 
    for i in range(tmp.shape[0]):
        date = tmp.iloc[i]["per_end_date"]
        date0 = date
        date1 = date + datetime.timedelta(days=-5)
        date2 = date + datetime.timedelta(days=+5)
        wpiavg = wpi[(wpi.DATE >= date1) & (wpi.DATE <= date2)]
        avg = np.mean(wpiavg.VALUE)
        wpi2.loc[len(wpi2)] = [date0, avg]
 
    # plot quarterly averaged oil price-series
    plt.sca(ax1)  # set ax1 axis for plotting
    plt.plot(wpi2.per_end_date, wpi2.avgWPI, '.-k')

[Figure: as above, with the $\pm 5$-day averaged WPI Crude Oil prices at XOM's reporting dates over-plotted as a black line]
where the black line denotes the quarterly-sampled WPI Crude Oil prices now stored in a new DataFrame, wpi2.

Let's keep only XOM in the selection list variable. At the end of the loop above, the DataFrame tmp stores XOM's selected fundamentals. In what follows we will use both DataFrames (tmp and wpi2) to look for non-linear correlations as the final goal of this investigation:

    # select some factors
    fundamentals = ["Revenue", "EBIT", "GrossProfit", "CAPEX"]
 
    # compute for them cross-correlations and cross-bicorrelations
    # and display final results
    #
    for f in fundamentals:
        print("%s: %s\n" % (ticker, f))
        # input data
        x = wpi2.avgWPI.values
        y = tmp[f].values
 
        # normalised time-series
        x = (x - np.mean(x))/np.std(x, ddof=1)
        y = (y - np.mean(y))/np.std(y, ddof=1)
 
        N = len(x)  # len(x) should be equal to len(y)
        L = int(np.floor(N**0.4))
 
        print("Cross-Correlation")
        for r in range(1, L+1):
            print("  r =%2g\t Cxy(r=%g) = %.4f" % 
                  (r, r, Cxy(x, y, r, N)))
 
        dof = L
        pvalue = 1 - scipy.stats.chi2.cdf(Hxy(x, y, L, N), dof)
        alpha = 0.1
 
        print("  Hxy(N) = %.4f" % Hxy(x, y, L, N))
        print("  p-value = %.4f" % pvalue)
        print("  H0 rejected in favour of H1 at %g%% c.l.: %s" % 
                                      (100*(1-alpha), pvalue<alpha))
 
        print("Cross-Bicorrelation")
        for r in range(0, L+1):
            for s in range(r+1, L+1):
                print("  r =%2g, s =%2g\t Cxyy(r=%g, s=%g) = %.5f" % 
                     (r, s, r, s, Cxyy(x, y, r, s, N)))
 
        dof = (L-1)*(L/2)
        pvalue = 1 - scipy.stats.chi2.cdf(Hxyy(x, y, L, N), dof)
 
        print("  Hxyy(N) = %.4f" % Hxyy(x, y, L, N))
        print("  p-value = %.4f" % pvalue)
        print("  H0 rejected in favour of H1 at %g%% c.l.: %s\n" % 
                                       (100*(1-alpha), pvalue<alpha))

Our Python code generates for XOM the following output:

XOM: Revenue
 
Cross-Correlation
  r = 1	 Cxy(r=1) = 0.7869
  r = 2	 Cxy(r=2) = 0.6239
  r = 3	 Cxy(r=3) = 0.4545
  Hxy(N) = 24.3006
  p-value = 0.0000
  H0 rejected in favour of H1 at 90% c.l.: True
Cross-Bicorrelation
  r = 0, s = 1	 Cxyy(r=0, s=1) = -0.44830
  r = 0, s = 2	 Cxyy(r=0, s=2) = -0.30186
  r = 0, s = 3	 Cxyy(r=0, s=3) = -0.19315
  r = 1, s = 2	 Cxyy(r=1, s=2) = -0.36678
  r = 1, s = 3	 Cxyy(r=1, s=3) = -0.22161
  r = 2, s = 3	 Cxyy(r=2, s=3) = -0.24446
  Hxyy(N) = 10.3489
  p-value = 0.0000
  H0 rejected in favour of H1 at 90% c.l.: True
 
XOM: EBIT
 
Cross-Correlation
  r = 1	 Cxy(r=1) = 0.7262
  r = 2	 Cxy(r=2) = 0.5881
  r = 3	 Cxy(r=3) = 0.3991
  Hxy(N) = 20.6504
  p-value = 0.0001
  H0 rejected in favour of H1 at 90% c.l.: True
Cross-Bicorrelation
  r = 0, s = 1	 Cxyy(r=0, s=1) = -0.39049
  r = 0, s = 2	 Cxyy(r=0, s=2) = -0.25963
  r = 0, s = 3	 Cxyy(r=0, s=3) = -0.17480
  r = 1, s = 2	 Cxyy(r=1, s=2) = -0.28698
  r = 1, s = 3	 Cxyy(r=1, s=3) = -0.18030
  r = 2, s = 3	 Cxyy(r=2, s=3) = -0.21284
  Hxyy(N) = 6.8759
  p-value = 0.0001
  H0 rejected in favour of H1 at 90% c.l.: True
 
XOM: GrossProfit
 
Cross-Correlation
  r = 1	 Cxy(r=1) = 0.7296
  r = 2	 Cxy(r=2) = 0.5893
  r = 3	 Cxy(r=3) = 0.3946
  Hxy(N) = 20.7065
  p-value = 0.0001
  H0 rejected in favour of H1 at 90% c.l.: True
Cross-Bicorrelation
  r = 0, s = 1	 Cxyy(r=0, s=1) = -0.38552
  r = 0, s = 2	 Cxyy(r=0, s=2) = -0.24803
  r = 0, s = 3	 Cxyy(r=0, s=3) = -0.17301
  r = 1, s = 2	 Cxyy(r=1, s=2) = -0.28103
  r = 1, s = 3	 Cxyy(r=1, s=3) = -0.17393
  r = 2, s = 3	 Cxyy(r=2, s=3) = -0.19780
  Hxyy(N) = 6.2095
  p-value = 0.0001
  H0 rejected in favour of H1 at 90% c.l.: True
 
XOM: CAPEX
 
Cross-Correlation
  r = 1	 Cxy(r=1) = -0.2693
  r = 2	 Cxy(r=2) = -0.3024
  r = 3	 Cxy(r=3) = -0.2333
  Hxy(N) = 4.3682
  p-value = 0.2244
  H0 rejected in favour of H1 at 90% c.l.: False
Cross-Bicorrelation
  r = 0, s = 1	 Cxyy(r=0, s=1) = 0.00844
  r = 0, s = 2	 Cxyy(r=0, s=2) = -0.10889
  r = 0, s = 3	 Cxyy(r=0, s=3) = -0.14787
  r = 1, s = 2	 Cxyy(r=1, s=2) = -0.10655
  r = 1, s = 3	 Cxyy(r=1, s=3) = -0.16628
  r = 2, s = 3	 Cxyy(r=2, s=3) = -0.08072
  Hxyy(N) = 6.4376
  p-value = 0.2244
  H0 rejected in favour of H1 at 90% c.l.: False

A quick analysis points at detection of significant non-linear correlations (at 90% confidence level) between WPI Crude Oil and Revenue, EBIT, and GrossProfit for lag $r=1$ (based on cross-correlations) and for lags $(r=0, s=1)$ (based on cross-bicorrelations). There is no significant non-linear relationship between XOM’s CAPEX and oil prices.

Interestingly, rerunning the code for Boeing Company (NYSE ticker: BA) reveals no significant cross-(bi)correlations between the same factors and oil. For example, Boeing’s revenue improves year-over-year,

[Figure: daily WPI Crude Oil prices vs. Boeing's (BA) quarterly fundamentals, with revenue improving year-over-year]

but it appears to have no significant non-linear link to the oil prices whatsoever:

BA: Revenue
 
Cross-Correlation
  r = 1	 Cxy(r=1) = -0.3811
  r = 2	 Cxy(r=2) = -0.3216
  r = 3	 Cxy(r=3) = -0.1494
  Hxy(N) = 5.4200
  p-value = 0.1435
  H0 rejected in favour of H1 at 90% c.l.: False
Cross-Bicorrelation
  r = 0, s = 1	 Cxyy(r=0, s=1) = 0.11202
  r = 0, s = 2	 Cxyy(r=0, s=2) = 0.03588
  r = 0, s = 3	 Cxyy(r=0, s=3) = -0.08083
  r = 1, s = 2	 Cxyy(r=1, s=2) = 0.01526
  r = 1, s = 3	 Cxyy(r=1, s=3) = 0.00616
  r = 2, s = 3	 Cxyy(r=2, s=3) = -0.07252
  Hxyy(N) = 0.3232
  p-value = 0.1435
  H0 rejected in favour of H1 at 90% c.l.: False

The most interesting aspect of the cross-correlation and cross-bicorrelation methods applied to oil vs. stock fundamentals research concerns new opportunities to discover correlations for stocks that so far have been completely ignored (or not suspected) to have anything in common with oil price movements in the markets. Dare to go down this path. I hope you will find new assets to invest in. The information is there. Now you are equipped with new tools. Use them wisely. Retire early.
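
As a starting point for such a screener, the hedged sketch below loops over all tickers available in the data DataFrame built earlier, repeats the $\pm 5$-day oil-price averaging around each reporting date, and collects the cross-correlation p-values per fundamental. It reuses the Cxy/Hxy functions and the wpi and data objects from the listings above; the column selection and the $\alpha=0.1$ threshold are assumptions:

# sketch of a simple screener: flag ticker/fundamental pairs whose cross-correlation
# with the quarterly-averaged oil price is significant at the assumed alpha
def screen(data, wpi, fundamentals=["Revenue", "EBIT", "GrossProfit", "CAPEX"],
           alpha=0.1):
    hits = []
    for ticker in data.Ticker.unique():
        tmp = data[data.Ticker == ticker].copy()
        tmp['per_end_date'] = pd.to_datetime(tmp['per_end_date'], format='%Y-%m-%d')
        # oil price averaged over +/- 5 days around each reporting date
        avg = []
        for date in tmp.per_end_date:
            mask = ((wpi.DATE >= date - datetime.timedelta(days=5)) &
                    (wpi.DATE <= date + datetime.timedelta(days=5)))
            avg.append(np.mean(wpi[mask].VALUE))
        x = np.array(avg)
        x = (x - np.mean(x)) / np.std(x, ddof=1)
        N = len(x)
        L = int(np.floor(N**0.4))
        for f in fundamentals:
            y = tmp[f].values.astype(float)
            y = (y - np.mean(y)) / np.std(y, ddof=1)
            pval = 1 - scipy.stats.chi2.cdf(Hxy(x, y, L, N), L)
            if pval < alpha:
                hits.append((ticker, f, pval))
    return hits
 
print(screen(data, wpi))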

REFERENCES
    Bernanke B. S., 2016, The relationship between stocks and oil prices
    Brooks C., Hinich M. J., 1999, Cross-correlations and cross-bicorrelations in Sterling
          exchange rates
    Coronado et al., 2015, A study of co-movements between USA and Latin American stock
          markets: a cross-bicorrelations perspective
    Hamilton J., 2014, Oil prices as an indicator of global economic conditions
    Hinich M. J., 1996, Testing for dependence in the input to a linear time series model,
          Journal of Nonparametric Statistics 6, 205–221
    Hinich M. J., Patterson D. M., 1995, Detecting epochs of transient dependence in white
          noise, Mimeo, University of Texas at Austin

DOWNLOADS
    COPPER.csv, DCOILWTICO.csv, DGS10.csv, DTWEXM.csv, VIXCLS.csv, ZACKS-FE.csv

Information Ratio and its Relative Strength for Portfolio Managers

Risk and return are the two most significant quantities investigated within investment performance. We expect the return to be highest at the minimum risk. In practice, a set of various combinations is always achieved, depending on a number of factors related to the investment strategies and market behaviours. The portfolio manager or investment analyst is interested in applying the most relevant tool in order to describe the overall performance. The arsenal is rich, but we need to know the character of the outcomes we wish to underline before fetching the proper instruments from the workshop.


In general there are two schools of investment performance analysis. The first one is strictly related to the analysis of time-series on a tick-by-tick basis. All I mean by a tick is the time scale of your data, e.g. minutes or months. The analyst is concerned about changes in investments and risk tracked continuously. Good examples of this sort of approach are the measurement of Value-at-Risk and Expected Shortfall, Below Target Risk, or the higher moments of Coskewness and Cokurtosis. It all supplements the n-Asset Portfolio Construction with an Efficient Frontier diagnostic.

The second school of investment analysis takes a macro-perspective. It includes the computation of $n$-period rolling median returns (or corresponding quantities), preferably annualized for the sake of comparison. This approach carries more meaning between the lines than a pure tick time-series analysis. For the former we track sliding changes in accumulation and performance, while for the latter we focus more closely on tick-by-tick risks and returns. For the sake of a global picture of performance (a market-oriented vantage point), a rolling investigation of risk and return reveals mutual tracking factors between the different scenarios considered.

In this article I scratch the surface of benchmark-oriented analysis with a special focus on the Information Ratio (IR) as an investment performance measure. Based on market data, I test the IR in action and introduce a new supplementary measure of Information Ratio Relative Strength, working as a second dimension for the IR results.

1. Benchmark-oriented Analysis

Let's assume you are a portfolio manager and have managed 11 investment strategies over the last 10 years. Each strategy is different and corresponds to different markets, styles of investing, and asset allocations. You wish to get a macro-picture of how well your investments have performed and whether they are beating the benchmark. You collect a set of 120 monthly returns for each investment strategy plus their benchmarks. The easiest way to kick off the comparison process is the calculation of $n$-period median (or mean) returns.

Annualized Returns

The rolling $n$Y ($n$-Year) median return is defined as:
$$
\mbox{rR}_{j}(n\mbox{Y}) = \left[ \prod_{i=1}^{12n} (r_{i,j}+1)^{1/n} \right] -1
\ \ \ \ \ \mbox{for}\ \ \ j=12n,\ldots,N\ \ (\mbox{at}\ \Delta j =1)
$$
where $r$ are the monthly median returns meeting the initial data selection criteria (if any) and $N$ denotes the total number of data available (in our case, $N=120$).
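
A hedged Python sketch of the rolling annualized return for a pandas Series r of 120 monthly returns (the series itself is hypothetical random data, used only to illustrate the formula above):

import numpy as np
import pandas as pd
 
def rolling_annualized_return(r, n=1):
    # r: pandas Series of monthly returns; returns the rolling nY annualized return
    window = 12 * n
    return (1.0 + r).rolling(window).apply(lambda w: np.prod(w)**(1.0/n) - 1.0)
 
# example with random monthly returns (hypothetical data)
r = pd.Series(np.random.normal(0.005, 0.02, 120))
rR_1Y = rolling_annualized_return(r, n=1).dropna()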

Any pre-selection (pre-filtering) of the input data becomes more insightful when contrasted with the benchmark performance. Hereinafter, by the benchmark we mean a relative measure we refer to (e.g. the S&P 500 Index for US stocks, etc.), calculated as the rolling $n$Y median. Therefore, for instance, the 1Y (rolling) returns can be compared against the corresponding benchmark (index). The selection of a good/proper benchmark is up to us.

Active Returns

We define the active returns on a monthly basis as the difference between the actual returns and the benchmark returns,
$$
a_i = r_i - b_i ,
$$
where $i$ denotes a continuous counting of months over the past 120 months.

Annualized Tracking Risk (Tracking Error)

We assume the widely accepted definition of tracking risk as the standard deviation of the active returns. For the sake of comparison, we transform it into the annualized tracking risk in the following way:
$$
\mbox{TR}_{j}(n\mbox{Y}) = \sqrt{12} \times \sqrt{ \frac{1}{12n-1} \sum_{i=1}^{12n} \left( a_{i,j}-\langle a_{i,j} \rangle \right)^2 }
$$
where the square root of 12 secures a proper normalization between calculations and $j=12n,…,N$.

Annualized Information Ratio (IR)

The information ratio is often referred to as a variation or generalised version of the Sharpe ratio. It tells us how much excess return ($a_i$) is generated from the amount of excess risk taken relative to the benchmark. We calculate the annualized information ratio by dividing the annualized active returns by the annualized tracking risk as follows:
$$
\mbox{IR}_j(n\mbox{Y}) = \frac{ \left[ \prod_{i=1}^{12n} (a_{i,j}+1)^{1/n} \right] -1 } { TR_j(n\mbox{Y}) }
\ \ \ \ \ \mbox{for}\ \ \ j=12n,\ldots,N
$$
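
Continuing the sketch above (with a hypothetical benchmark series b of monthly returns), the rolling tracking risk and information ratio follow directly from the definitions:

def rolling_tracking_risk(a, n=1):
    # a: pandas Series of monthly active returns a_i = r_i - b_i
    window = 12 * n
    return np.sqrt(12) * a.rolling(window).std(ddof=1)
 
def rolling_information_ratio(r, b, n=1):
    a = r - b                                     # active returns
    ann_active = rolling_annualized_return(a, n)  # annualized active return
    tr = rolling_tracking_risk(a, n)              # annualized tracking risk
    return ann_active / tr
 
# example (hypothetical benchmark returns)
b = pd.Series(np.random.normal(0.004, 0.02, 120))
IR_1Y = rolling_information_ratio(r, b, n=1).dropna()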

We assume the following definition of investment under-performance and out-performance, $P$, based on the IR:
$$
P_j =
\begin{cases}
P_j^{-} =\mbox{under-performance}, & \text{if } \mbox{IR}_j(n\mbox{Y}) \le 0 \\
P_j^{+} =\mbox{out-performance}, & \text{if } \mbox{IR}_j(n\mbox{Y}) > 0 .
\end{cases}
$$

Information Ratio Relative Strength of Performance

Since $P_j$ only marks in time those periods (months) where a given quantity under investigation achieved better results than the benchmark or not, it does not tell us anything about the importance (weight; strength) of this under/out-performance. It could be the case, for instance, that we observe out-performance for 65% of the time over 10 years while barely beating the benchmark in the same period. Therefore, we introduce an additional measure, namely, the IR relative strength defined as:
$$
\mbox{IRS} = \frac{ \int P^{+} dt } { \left| \int P^{-} dt \right| } .
$$
The IRS measures the area under the $\mbox{IR}_j(n\mbox{Y})$ function when $\mbox{IR}_j(n\mbox{Y})$ is positive (out-performance) relative to the area confined between the $\mbox{IR}_j(n\mbox{Y})$ function and $\mbox{IR}_j(n\mbox{Y})=0$ when it is negative. Thanks to that measure, we are able to estimate the relevance of the
$$
\frac { n(P^{+}) } { n(P^{-}) + n(P^{+}) }
$$
ratio, i.e. the fraction of time when the examined quantity displayed out-performance over a specified period of time (e.g. the 10 years of data within our analysis). Here, $n(P^{+})$ counts the number of $P_j^{+}$ instances.
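
A sketch of the IRS measure itself, integrating the positive and negative parts of the rolling IR with the trapezoidal rule (taking the monthly spacing as the unit of time, an assumption), together with the out-performance time fraction:

def information_ratio_relative_strength(ir):
    # ir: pandas Series of rolling IR values (NaNs already dropped)
    pos = np.trapz(np.where(ir.values > 0, ir.values, 0.0))   # area of out-performance
    neg = np.trapz(np.where(ir.values <= 0, ir.values, 0.0))  # area of under-performance
    return pos / abs(neg)
 
IRS = information_ratio_relative_strength(IR_1Y)
frac_out = np.mean(IR_1Y.values > 0)   # fraction of time with out-performance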

2. Case Study

Let's demonstrate the aforementioned theory in practice. Making use of real data, we perform a benchmark-oriented analysis for all 11 investment strategies. We derive 1Y rolling median returns supplemented by the corresponding 1Y tracking risk and information ratio measures. The following figure presents the results of the 1Y returns for the Balanced Options:

[Figure: 1Y rolling returns, tracking risk, and information ratio for the Balanced Options investment strategies vs. their benchmark]
It is interesting to notice that Investment Strategy-2 (Inv-2) displays an increasing tracking risk between mid-2007 and mid-2009, corresponding to the Global Financial Crisis (GFC). This consistent surge in TR is explained by worse-than-average performance of the individual options, most probably dictated by higher-than-expected volatility of the assets within the portfolio. The 1Y returns of Inv-2 were able to beat the benchmark over only 25% of the time, i.e. about 2.25 years since Jun 30, 2004. We draw that conclusion from a quantitative inspection of the Information Ratio plot (bottom panel).

In order to understand the overall performance of all eleven investment strategies, we compute the under-performance and out-performance ratios,
$$
\frac { n(P^{-}) } { n(P^{-}) + n(P^{+}) } \ \ \ \mbox{and} \ \ \ \frac { n(P^{+}) } { n(P^{-}) + n(P^{+}) } ,
$$
and plot these ratios in red and blue colour in the following figure, respectively:

[Figure: fractions of under-performance (red) and out-performance (blue) time for all eleven investment strategies]
Only two investment strategies, Inv-7 and Inv-11, managed to keep their out-performance time longer than their under-performance. Inv-8 denoted the highest ratio of under-performance to out-performance, but Inv-2 (considered above) performed equally badly.

The 3rd dimension of this picture is delivered by the investigation of the IR Relative Strength measure for all investment strategies. We derive and display the results in the next figure:
[Figure: IR Relative Strength vs. the fraction of out-performance time for all eleven investment strategies]
The plot reveals a close-to-linear correlation between the two considered quantities. The Inv-7 and Inv-11 strategies confirm their relative strength of out-performance, while Inv-2 and Inv-8 confirm their weakness in delivering strong returns over their short periods of out-performance.

It is of great interest to compare the Inv-3, Inv-4, and Inv-10 strategies. While the relative strength of out-performance of Inv-4 remains lower than that of the other two (and can be explained by the low volatility of this sort of investment strategy), Inv-10 denotes much firmer gains over the benchmark than Inv-3, despite the fact that their out-performance periods were nearly the same ($\sim$40%). Only an additional cross-correlation of the IR Relative Strength outcomes as a function of the average year-over-year return is able to address the drivers of this observed relation.


Create a Portfolio of Stocks based on Google Finance Data fed by Quandl

There is an old saying: Be careful what you wish for. Many quants and traders wished for a long time for better quality of publicly available data. It wasn’t an easy task to accomplish, but with a minimum of effort it has now become possible to get all you truly dreamt of. When I started my experience with daily backtesting of my algo strategies, I initially relied on Yahoo! data: easily accessible but of low quality. Since the inception of Google Finance, the same dream has been reborn: more precise data, covering more aspects of trading, platforms, markets, and instruments. But how to get them all?


Quandl.com provides us with enormous access to an enormous database. Just visit the webpage to find out what I mean by barely putting my thoughts together in one sentence. In fact, the portal’s accessibility via different programming languages (e.g. Matlab, C++, Python, etc.) makes that database more alive than ever. The space of parameters the quants get access to becomes astronomical in number!

Alleluja! Let’s ride this horse. Enough of watching the online charts changing the direction of our investing moods with a one-minute time resolution. We were born to be smarter than the stochastic processes taking control over our daily strategies. In this short post, making use of the Matlab R2013a platform, I want to show you how quickly you can construct any portfolio of US stocks taken from the Dow Jones Index universe (as an example). We will use the Google Finance stock data fed by Quandl.com.

Google Finance Stock Data for Model Feeding

Similar to Yahoo! Finance’s resources, Quandl.com has its own system of coding the names of all US stocks. Therefore, we need to know it before taking any further step. The .csv file containing all of the codes for Quandl’s stock data can be obtained via the API for Stock Data webpage. The file contains a header of the form:

Ticker      Stock Name      Price Code      Ratios Code      In Market?

linking the Google Finance ticker with Quandl’s code (Price Code). Doing some extra work on data filtering, I created a pre-processed set of all tickers, excluding those non-active in the markets and removing all invalid data. You can download it here as an Excel workbook: QuandlStockCodeListUS.xlsx.

The second list we need is a list of the potential stocks we want to deal with in our portfolio. As of today, you can download the current list of the Dow Jones Index’s components here: dowjones.lst. Let’s start with it:

% Create a Portfolio from Dow Jones Stocks based on Google Finance
%  Data fed by Quandl.com
%
% (c) 2013 QuantAtRisk.com, by Pawel Lachowicz
 
 
close all; clear all; clc;
 
% Read the list of Dow Jones components
fileID = fopen('dowjones.lst');
tmp = textscan(fileID, '%s');
fclose(fileID);
dowjc=tmp{1};  % a list as a cell array

The folks at MathWorks suggest making a new and good practice of using the textscan command (the fopen/textscan/fclose block above) instead of dataread for reading data from text files. The latter will be removed from a future Matlab release.

Next, we need to import the current list of tickers for US stocks and their corresponding codes in Quandl’s database. We will make use of my pre-processed data set QuandlStockCodeListUS.xlsx, mentioned earlier, in the following way:

% Read in the list of tickers and internal codes from Quandl.com
[ndata, text, alldata] = xlsread('QuandlStockCodeListUS.xlsx');
quandlc=text(:,1);    % tickers, again as a list in a cell array
quandlcode=text(:,3); % Quandl Price Codes (3rd column of the workbook); used later by Quandl.get

For the simplicity of our example, we will consider all 30 stocks as the complete sample set of stocks in our portfolio. A portfolio construction is not solely a selection of stocks; it usually also involves a time period of past performance of each individual stock. In the following move, we will collect daily stock returns over the last 365 calendar days (corresponding to about 252 trading days).

This framework is pretty handy. Why? If you build or run a model live, most probably you wish to update your database when the US markets are closed. You fetch the data from Quandl.com. Now, if your model employs risk management or portfolio optimisation as an integrated building block, it is very likely you are interested in risk(-return) estimation, e.g. by calculating Value-at-Risk (VaR) or similar risk measures. Of course, the choice of the time-frame is up to you, based on your strategy.
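As an aside, once a matrix of daily returns is in hand (see the ret matrix built later in this post), a one-day historical VaR per stock can be read off directly from the empirical distribution of returns. The sketch below is a minimal illustration of my own; the 95% confidence level, the placeholder random returns, and the sorted-sample quantile are my assumptions, not part of the original post.

% Minimal sketch: 1-day 95% historical VaR for each stock,
%  given a (days x stocks) matrix of daily returns 'ret'
ret = randn(250,30)*0.02;            % placeholder returns, for illustration only
alpha = 0.05;                        % 95% confidence level
rs = sort(ret,1,'ascend');           % sort each stock's returns
k  = max(1,floor(alpha*size(rs,1))); % index of the empirical 5th percentile
VaR95 = -rs(k,:)                     % positive number = potential 1-day loss per stock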

Having said that,

% fetch stock data for the last 365 days
date2=datestr(today,'yyyy-mm-dd')     % to   (end date)
date1=datestr(today-365,'yyyy-mm-dd') % from (start date)

which employs Matlab’s datestr command with the output format passed as a parameter. It is really useful when it comes to fetching data from Quandl.com, as the date format they like to see in your code is yyyy-mm-dd. The above two commands return the following values in Matlab:

date2 =
  2013-11-03
date1 =
  2012-11-03

as of the time of writing this article (both of string type).

Get an Access to Quandl’s Resources


Excellent! Let’s catch some fish in the pond! Before doing so, we need to make sure that Quandl’s software is installed and linked to our Matlab console. In order to download the Quandl Matlab package, just visit this website and follow the installation procedure at the top of the page. Add the proper path in the Matlab environment too. The best practice is to register yourself on Quandl. That allows you to raise the daily number of data calls from 50 to 500. If you think you will need more data requests per day, you can send a request from the same webpage.

When registered, in your Profile section on Quandl, you will gain access to your individual Authorization Token, for example:
[Figure qtoken: the Authorization Token as shown in your Quandl profile]
You will need to replace ‘yourauthenticationtoken’ in the ‘authcode’ parameter of the Quandl.get call in the code below with your token in order to get the data.

Fetching Stock Data

From this point on, the road stops winding. For all 30 stocks belonging to the current composition of the Dow Jones Index, we get their Open, High, Low, and Close prices (as retrieved from Google Finance) in the following way:

% create an empty array for storing stock data
stockd={};
% scan the Dow Jones tickers and fetch the data from Quandl.com
for i=1:length(dowjc)
    for j=1:length(quandlc)
        if(strcmp(dowjc{i},quandlc{j}))
            fprintf('%4.0f %s\n',i,quandlc{j});
            % fetch the data of the stock from Quandl
            % using the recommended Quandl command and
            % saving them directly into Matlab's FTS object (fts)
            fts=0;
            [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ...
                          'authcode','yourauthenticationtoken',...
                          'start_date',date1,'end_date',date2);
            stockd{i}=fts; % entire FTS object in an array's cell
        end
    end
end
% use 'save' command to save all actual variables from memory on the disc,
%  if you need it for multiple code running
% save data

The Quandl.get call above (together with the assignment of its output into the stockd cell array) is an example of how you call Quandl for data and fetch them into a FINTS object (the financial time series object from the Financial Toolbox in Matlab). So easy. If you wish to change the time frequency, apply a transformation, or get the data not as FINTS but in .csv format, all of that is available as parameters of the Quandl.get() command (explore all possibilities here).

Having done that, you can easily plot the Google Finance data, for instance of the IBM stock, straight away:

% display a High-Low chart for IBM
highlow(stockd{11})
title('IBM');
ylabel('Stock Price ($)');

to get:
[Figure quandl-fig01: High-Low chart of IBM based on the fetched Google Finance data]

or prepare a return-risk plot for your further portfolio optimisation

ret=[];
for i=1:length(stockd)
    % extract the Close Price
    tmp=fts2mat(stockd{i}.Close,0);
    % calculate a vector of daily returns
    r=tmp(2:end)./tmp(1:end-1)-1;
    ret=[ret r];
end
% calculate the annualized mean returns for 30 stocks
%  (the daily mean scales linearly with the ~250 trading days per year)
mret=mean(ret)*250;
% and corresponding standard deviations
msig=std(ret)*sqrt(250);
 
% plot return-risk diagram
p=100*[msig; mret]'
plot(p(:,1),p(:,2),'b.');
xlim([0 30]);
xlabel('Annualized Risk of Stock (%)')
ylabel('Annualized Mean Return (%)');

providing us with
[Figure quandl-fig02: annualized mean return versus annualized risk for the 30 Dow Jones stocks]


Anxiety Detection Model for Stock Traders based on Principal Component Analysis

Everybody would agree on one thing: nervousness among traders may lead to massive sell-offs of stocks, an avalanche of falling prices, and huge losses. We have witnessed this sort of behaviour many times. Anxiety is the feeling of uncertainty, a human inborn instinct that triggers a self-defending mechanism against high risks. The apex of anxiety is fear and panic. When fear spreads across the markets, everything goes south. It is a dream goal for all traders to capture the nervousness and fear just by looking at or studying the market behaviour, trading patterns, ask-bid spreads, or the flow of orders. The major problem is that we know very well how important a factor human behaviour is in this game, yet it is not observed directly. It has been puzzling me for a few years, ever since I finished reading Your Money and Your Brain: How the New Science of Neuroeconomics Can Help Make You Rich by Jason Zweig, sitting on the beach of one of the Gili Islands, Indonesia, in December of 2009. A perfect spot far away from the trading charts.

So, is there a way to disentangle the emotional part involved in trading from all the other factors (e.g. the application of technical analysis, the consequences of bad news, IPOs, etc.) which are somehow easier to deduce? In this post I will try to make a quantitative attempt towards solving this problem. Although the solution will not have a final, closed form, my goal is to deliver an inspiration for quants and traders interested in the subject by putting a simple idea into practice: the application of Principal Component Analysis.

1. Principal Component Analysis (PCA)

Called by many one of the most valuable results of applied linear algebra, Principal Component Analysis delivers a simple, non-parametric method of extracting relevant information from often confusing data sets. Real-world data usually hold some relationships among their variables and, as a good approximation, in the first instance we may suspect them to be of a linear (or close to linear) form. Linearity is one of the stringent but powerful assumptions standing behind PCA.

Imagine we observe the daily change of prices of $m$ stocks (being a part of your portfolio or a specific market index) over the last $n$ days. We collect the data in $\boldsymbol{X}$, an $m\times n$ matrix. Each of its $n$ columns is an $m$-dimensional vector lying in a vector space spanned by an orthonormal basis, therefore each is a linear combination of this set of unit-length basis vectors: $\boldsymbol{BX} = \boldsymbol{X}$ where the basis $\boldsymbol{B}$ is the identity matrix $\boldsymbol{I}$. Within the PCA approach we ask a simple question: is there another basis, a linear combination of the original basis, that better represents our data set? In other words, we look for a transformation matrix $\boldsymbol{P}$ acting on $\boldsymbol{X}$ in order to deliver its re-representation:
$$
\boldsymbol{PX} = \boldsymbol{Y} \ .
$$ The rows of $\boldsymbol{P}$ become a set of new basis vectors for expressing the columns of $\boldsymbol{X}$. With this change of basis, the row vectors of $\boldsymbol{P}$ are the principal components of $\boldsymbol{X}$. But how do we find a good $\boldsymbol{P}$?

Consider for a moment what we can do with a set of $m$ observables spanned over $n$ days. It is no mystery that many stocks co-vary over different periods of time, i.e. their price movements are closely correlated and follow the same direction. The statistical way to measure the mutual relationship among $m$ vectors (correlation) is the calculation of a covariance matrix. For our data set $\boldsymbol{X}$:
$$
\boldsymbol{X}_{m\times n} =
\left[
\begin{array}{c}
\boldsymbol{x_1} \\
\boldsymbol{x_2} \\
\vdots \\
\boldsymbol{x_m}
\end{array}
\right]
=
\left[
\begin{array}{cccc}
x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\
\vdots  & \vdots  & \ddots & \vdots  \\
x_{m,1} & x_{m,2} & \cdots & x_{m,n}
\end{array}
\right]
$$
the covariance matrix takes the following form:
$$
cov(\boldsymbol{X}) \equiv \frac{1}{n-1} \boldsymbol{X}\boldsymbol{X}^{T}
$$ where we multiply $\boldsymbol{X}$ by its transpose (assuming each row of $\boldsymbol{X}$ has had its mean subtracted) and the factor $(n-1)^{-1}$ keeps the variance estimate unbiased. The diagonal elements of $cov(\boldsymbol{X})$ are the variances corresponding to each row of $\boldsymbol{X}$, whereas the off-diagonal terms of $cov(\boldsymbol{X})$ represent the covariances between different rows (prices of different stocks). Please note that the above multiplication assures us that $cov(\boldsymbol{X})$ is a square symmetric $m\times m$ matrix.
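A quick numerical sanity check of this definition can be done in Matlab; note that Matlab's built-in cov treats columns as variables, so for row-wise time-series we pass the transpose. This is only an illustrative sketch with random placeholder data, not part of the original post:

% Minimal sketch: covariance of row-wise time-series vs Matlab's cov()
m=5; n=250;
X=randn(m,n);                     % m hypothetical time-series, n observations each
Xc=X-repmat(mean(X,2),[1 n]);     % subtract the mean of each row
C1=(Xc*Xc')/(n-1);                % covariance from the definition above
C2=cov(X');                       % Matlab's cov (columns = variables)
max(max(abs(C1-C2)))              % should be ~0, up to numerical precision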

All right, but what does it have in common with our PCA method? PCA looks for a way to optimise the matrix of $cov(\boldsymbol{X})$ by a reduction of redundancy. Sounds a bit enigmatic? I bet! Well, all we need to understand is that PCA wants to ‘force’ all off-diagonal elements of the covariance matrix to be zero (in the best possible way). The guys in the Department of Statistics will tell you the same as: removing redundancy diagonalises $cov(\boldsymbol{X})$. But how, how?!

Let’s come back to our previous notation of $\boldsymbol{PX}=\boldsymbol{Y}$. $\boldsymbol{P}$ transforms $\boldsymbol{X}$ into $\boldsymbol{Y}$. We also marked that:
$$
\boldsymbol{P} = [\boldsymbol{p_1},\boldsymbol{p_2},\ldots,\boldsymbol{p_m}]
$$ was the new basis we were looking for, with the $\boldsymbol{p_k}$ as its rows. PCA assumes that all basis vectors $\boldsymbol{p_k}$ are orthonormal, i.e. $\boldsymbol{p_i}\boldsymbol{p_j}^T=\delta_{ij}$, and that the directions with the largest variances are the most principal. So, PCA first selects a normalised direction in the $m$-dimensional space along which the variance in $\boldsymbol{X}$ is maximised. That is the first principal component $\boldsymbol{p_1}$. In the next step, PCA looks for another direction along which the variance is maximised; however, because of the orthonormality condition, it searches only among directions perpendicular to all previously found ones. In consequence, we obtain an orthonormal matrix $\boldsymbol{P}$. Good stuff, but does it still sound complicated?
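For completeness, the step "select the direction of maximal variance" can be written down explicitly. Under the zero-mean convention used above, the first principal component solves
$$
\boldsymbol{p_1} = \arg\max_{\|\boldsymbol{p}\|=1} \ \boldsymbol{p}\, cov(\boldsymbol{X})\, \boldsymbol{p}^T ,
$$
and the standard Lagrange-multiplier argument shows that the maximiser is the eigenvector of $cov(\boldsymbol{X})$ associated with the largest eigenvalue; each subsequent component repeats the same maximisation restricted to the orthogonal complement of the directions already found. This is exactly why the eigen-decomposition appears below.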

The goal of PCA is to find such a $\boldsymbol{P}$, with $\boldsymbol{Y}=\boldsymbol{PX}$, that $cov(\boldsymbol{Y})=(n-1)^{-1}\boldsymbol{YY}^T$ is diagonalised.

We can evolve the notation of the covariance matrix as follows:
$$
(n-1)cov(\boldsymbol{Y}) = \boldsymbol{YY}^T = \boldsymbol{(PX)(PX)}^T = \boldsymbol{PXX}^T\boldsymbol{P}^T = \boldsymbol{P}(\boldsymbol{XX}^T)\boldsymbol{P}^T = \boldsymbol{PAP}^T
$$ where we made a quick substitution of $\boldsymbol{A}=\boldsymbol{XX}^T$. It is easy to prove that $\boldsymbol{A}$ is symmetric. It takes a longer while to find a proof for the following two theorems: (1) a matrix is symmetric if and only if it is orthogonally diagonalisable; (2) a symmetric matrix is diagonalised by a matrix of its orthonormal eigenvectors. Just check your favourite algebra textbook. The second theorem provides us with a right to denote:
$$
\boldsymbol{A} = \boldsymbol{EDE}^T
$$ where $\boldsymbol{D}$ is a diagonal matrix and $\boldsymbol{E}$ is a matrix of eigenvectors of $\boldsymbol{A}$. That brings us to the end of the rainbow.

We select the matrix $\boldsymbol{P}$ to be one in which each row $\boldsymbol{p_i}$ is an eigenvector of $\boldsymbol{XX}^T$, therefore
$$
\boldsymbol{P} = \boldsymbol{E}^T .
$$

Given that, we see that $\boldsymbol{E}=\boldsymbol{P}^T$, thus we find $\boldsymbol{A}=\boldsymbol{EDE}^T = \boldsymbol{P}^T\boldsymbol{DP}$, which leads us to a magnificent relationship between $\boldsymbol{P}$ and the covariance matrix:
$$
(n-1)cov(\boldsymbol{Y}) = \boldsymbol{PAP}^T = \boldsymbol{P}(\boldsymbol{P}^T\boldsymbol{DP})\boldsymbol{P}^T
= (\boldsymbol{PP}^T)\boldsymbol{D}(\boldsymbol{PP}^T) =
(\boldsymbol{PP}^{-1})\boldsymbol{D}(\boldsymbol{PP}^{-1})
$$ or
$$
cov(\boldsymbol{Y}) = \frac{1}{n-1}\boldsymbol{D},
$$ i.e. the choice of $\boldsymbol{P}$ diagonalises $cov(\boldsymbol{Y})$, where we also silently used the matrix-algebra theorem saying that the inverse of an orthogonal matrix is its transpose ($\boldsymbol{P}^{-1}=\boldsymbol{P}^T$). Fascinating, right?! Let’s now see how one can use all that complicated machinery in the quest of looking for human emotions among the endless rivers of market numbers bombarding our sensors every day.
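Before moving to the market data, here is a tiny numerical sketch of my own (random placeholder data, base Matlab only) verifying that choosing the rows of $\boldsymbol{P}$ as the eigenvectors of the covariance matrix indeed diagonalises $cov(\boldsymbol{Y})$:

% Minimal sketch: P built from eigenvectors diagonalises cov(Y)
m=4; n=500;
X=randn(m,n); X=X-repmat(mean(X,2),[1 n]);    % zero-mean rows
C=(X*X')/(n-1);                               % cov(X)
[E,D]=eig(C);                                 % columns of E are eigenvectors of C
P=E';                                         % rows of P = eigenvectors
Y=P*X;                                        % re-representation of the data
covY=(Y*Y')/(n-1);                            % should be (close to) diagonal
offdiag=max(max(abs(covY-diag(diag(covY)))))  % ~0, up to numerical precision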

2. Covariances of NASDAQ, Eigenvalues of Anxiety


We will try to build a simple quantitative model for detection of the nervousness in the trading markets using PCA.

By its simplicity I mean the following model assumption: no matter what the data conceal, the 1st Principal Component (1-PC) of the PCA solution captures the complicated relationships among a subset of stocks triggered by a latent factor that we attribute to a common behaviour of traders (human and pre-programmed algos). It is a pretty reasonable assumption, much stronger than, for instance, the influence of Saturn’s gravity on the annual silver price fluctuations. Since PCA does not tell us what its 1-PC means in reality, it is our job to seek meaningful explanations. Therefore, a human factor fits the frame as a trial value very well.

Let’s consider the NASDAQ-100 index. It is composed of 100 technology stocks. You can find the most current list here: nasdaq100.lst, downloadable as a text file. As usual, we will perform all calculations using the Matlab environment. Let’s start with data collection and pre-processing:

% Anxiety Detection Model for Stock Traders
%  making use of the Principal Component Analysis (PCA)
%  and utilising publicly available Yahoo! stock data
%
% (c) 2013 QuantAtRisk.com, by Pawel Lachowicz
 
clear all; close all; clc;
 
 
% Reading a list of NASDAQ-100 components
nasdaq100=(dataread('file',['nasdaq100.lst'], '%s', 'delimiter', '\n'))';
 
% Time period we are interested in
d1=datenum('Jan 2 1998');
d2=datenum('Oct 11 2013');
 
% Check and download the stock data for a requested time period
stocks={};
for i=1:length(nasdaq100)
    try
        % Fetch the Yahoo! adjusted daily close prices between selected
        % days [d1;d2]
        tmp = fetch(yahoo,nasdaq100{i},'Adj Close',d1,d2,'d');
        stocks{i}=tmp;
        disp(i);
    catch err
        % no full history available for requested time period
    end
end

where, first, we check whether, for the given list of NASDAQ-100 components, the full data history (adjusted close prices) is available via the Yahoo! server (please refer to my previous post Yahoo! Stock Data in Matlab and a Model for Dividend Backtesting for more information on the connectivity).

The cell array stocks becomes populated with two-dimensional matrices: the time-series corresponding to stock prices (time, price). Since the Yahoo! database does not contain a full history for all stocks of our interest, we may expect their time spans to differ. For the purpose of demonstrating the PCA method, we apply additional screening of the downloaded data, i.e. we require the data to span exactly the period defined by the $d1$ and $d2$ variables and, additionally, to have the same (maximal available) number of data points (observations, trials). We achieve that by:

% Additional screening
d=[]; 
j=1; 
data={};
for i=1:length(nasdaq100)
    if (i<=length(stocks)) && ~isempty(stocks{i})
        d=[d; i min(stocks{i}(:,1)) max(stocks{i}(:,1)) size(stocks{i},1)];
    else
        d=[d; i NaN NaN 0];  % no data fetched for this ticker
    end
end
for i=1:length(nasdaq100)
    if(d(i,2)==d1) && (d(i,3)==d2) && (d(i,4)==max(d(:,4)))
        data{j}=sortrows(stocks{i},1);
        fprintf('%3i %1s\n',i,nasdaq100{i})
        j=j+1;
    end
end
m=length(data);

The temporary matrix $d$ holds, respectively: the index of the stock as read in from the nasdaq100.lst file, the first and the last day number (datenum) of the available data, and the total number of data points in the time-series:

>> d
d =
      1      729757      735518        3970
      2      729757      735518        3964
      3      729757      735518        3964
      4      729757      735518        3969
     ..          ..          ..          ..
     99      729757      735518        3970
    100      729757      735518        3970

Our screening method saves the $m=21$ selected stocks into the data cell array, corresponding to the following companies from our list:

  1 AAPL
  7 ALTR
  9 AMAT
 10 AMGN
 20 CERN
 21 CHKP
 25 COST
 26 CSCO
 30 DELL
 39 FAST
 51 INTC
 64 MSFT
 65 MU
 67 MYL
 74 PCAR
 82 SIAL
 84 SNDK
 88 SYMC
 96 WFM
 99 XRAY
100 YHOO

Okay, some people say that seeing is believing. All right, let’s see how it works. Recall the fact that we demanded our stock data to span the period between ‘Jan 2 1998’ and ‘Oct 11 2013’. We found 21 stocks meeting those criteria. Now, let’s pick a random date, say, Jul 2 2007, and extract for all 21 stocks their price history over the last 90 calendar days. We save their prices (skipping the time columns) into the $Z$ matrix as follows:

t=datenum('Jul 2 2007');
Z=[];
for i=1:m
    % take the trailing 90 calendar days of prices up to (and including) t
    [r,c,v]=find((data{i}(:,1)<=t) & (data{i}(:,1)>t-90));
    Z=[Z data{i}(r,2)];  % append the price column of stock i
end

and we plot them all together:

plot(Z)
xlim([1 length(Z)]);
ylabel('Stock price (US$)');
xlabel('T-90d');

[Figure anxiety-fig01: raw adjusted close prices of the 21 selected stocks over the 90 days preceding Jul 2 2007]
It is easy to deduce that the topmost line corresponds to the Apple, Inc. (AAPL) adjusted close prices.

A data-processing step not spelled out earlier is that we need to transform our time-series into a comparable form. We do it by subtracting from each series its mean and dividing it by its standard deviation. Why? Simply to put all series on an equal footing for mutual comparison. We call that step a normalisation or standardisation of the time-series under investigation:

[N,M]=size(Z);
X=(Z-repmat(mean(Z),[N 1]))./repmat(std(Z),[N 1]);

This represents the matrix $\boldsymbol{X}$ that I discussed in the theoretical part of this post. Note that the dimensions are reversed in Matlab (observations in rows, stocks in columns). The normalised time-series,

% Display normalized stock prices
plot(X)
xlim([1 length(Z)]);
ylabel('(Stock price-Mean)/StdDev');
xlabel('T-90d');

look like:
[Figure anxiety-fig02: the 21 normalised (standardised) price time-series]
For a given matrix of $\boldsymbol{X}$, its covariance matrix,

% Calculate the covariance matrix, cov(X)
CovX=cov(X);
imagesc(CovX);

as for the data spanning 90 calendar days back from Jul 2 2007, looks like:
[Figure anxiety-fig03: the covariance matrix of the normalised time-series, colour-coded from maximal (red) to minimal (blue) values]
where the colour coding goes from the maximal values (most reddish) down to the minimal values (most blueish). The diagonal of the covariance matrix simply confirms that, for normalised time-series, the variances are equal to 1, as expected.

Going one step further, based on the given covariance matrix, we look for the matrix $\boldsymbol{P}$ whose columns are the corresponding eigenvectors (note that Matlab’s eigs returns eigenvectors as columns, so this $\boldsymbol{P}$ corresponds to $\boldsymbol{E}=\boldsymbol{P}^T$ in the notation of the theoretical part):

% Find P
[P,~]=eigs(CovX,5);
imagesc(P);
set(gca,'xticklabel',{1,2,3,4,5},'xtick',[1 2 3 4 5]);
xlabel('Principal Component')
ylabel('Stock');
set(gca,'yticklabel',{'AAPL', 'ALTR', 'AMAT', 'AMGN', 'CERN', ...
 'CHKP', 'COST', 'CSCO', 'DELL', 'FAST', 'INTC', 'MSFT', 'MU', ...
 'MYL', 'PCAR', 'SIAL', 'SNDK', 'SYMC', 'WFM', 'XRAY', 'YHOO'}, ...
 'ytick',[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21]);

which results in $\boldsymbol{P}$ displayed as:
[Figure anxiety-fig04: colour-coded loadings of the first five principal components across the 21 stocks]
where we computed PCA for five principal components in order to illustrate the process. Since the colour coding is the same as in the previous figure, a visual inspection of the 1-PC indicates negative loadings (eigenvector components) for at least 16 out of the 21 stocks. That simply means that over the last 90 days the common dynamics of those stocks were directed south, in favour of traders holding short positions in those stocks.
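If you prefer to check that count directly rather than read it off the colour map, the sign of the 1-PC loadings can be inspected in one extra line (a small addition of mine, not part of the original listing):

% count how many of the 21 stocks carry a negative loading in the 1st PC
negcount = sum(P(:,1)<0)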

It is important to note at this very moment that 1-PC does not represent the ‘price momentum’ itself. That would be too easy. It represents the latent variable responsible for a common behaviour in the stock dynamics, whatever it is. Based on our model assumption (see above), we suspect it may indicate a human factor latent in the trading.

3. Game of Nerves

The last figure communicates an additional message. There is a remarkable coherence of the loadings for 1-PC and pretty random patterns for the remaining four principal components. One may check that, in the case of our data sample, this feature is maintained over many years. That allows us to limit our interest to 1-PC only.

It’s getting exciting, isn’t it? Let’s come back to our main code. Having now a pretty good grasp of the algebra of PCA at work, we may limit our investigation of 1-PC to any time period of interest, below defined by the $t1$ and $t2$ variables:

% Select time period of your interest
t1=datenum('July 1 2006');
t2=datenum('July 1 2010');
 
results=[];
for t=t1:t2
    tmp=[];
    A=[]; V=[];
    for i=1:m
        [r,c,v]=find((data{i}(:,1)<=t) & (data{i}(:,1)>t-60));
        A=[A data{i}(r,2)];
    end
    [N,M]=size(A);
    X=(A-repmat(mean(A),[N 1]))./repmat(std(A),[N 1]);
    CovX=cov(X);
    [V,D]=eigs(CovX,1);
    % Find all negative loadings (eigenvector components) of the 1st Principal Component
    [r,c,v]=find(V(:,1)<0);
    % Extract them into a new vector
    neg1PC=V(r,1);
    % Calculate the percentage of negative loadings relative
    % to all values available
    ratio=length(neg1PC)/m;
    % Build a new time-series of 'ratio' change over required
    % time period (spanned between t1 and t2)
    results=[results; t ratio];    
end

We build our anxiety-detection model on the change in the number of negative loadings of the 1st Principal Component (relative to their total number, here equal to 21). As a result, we generate a new time-series tracing this variable over the $[t1;t2]$ time period. We plot the results in one figure, contrasted with the NASDAQ-100 Index, in the following way:

% Fetch NASDAQ-100 Index from Yahoo! data-server
nasdaq = fetch(yahoo,'^ndx','Adj Close',t1,t2,'d');
% Plot it
subplot(2,1,1)
plot(nasdaq(:,1),nasdaq(:,2),'color',[0.6 0.6 0.6]);
ylabel('NASDAQ-100 Index');
% Add a plot corresponding to a new time-series we've generated
subplot(2,1,2)
plot(results(:,1),results(:,2),'color',[0.6 0.6 0.6])
% add overplot 30d moving average based on the same data
hold on; plot(results(:,1),moving(results(:,2),30),'b')

leading us to:
[Figure anxiety-fig05: NASDAQ-100 Index (top) and the fraction of negative 1-PC loadings with its 30-day moving average (bottom)]
I use a 30-day moving average (the solid blue line) in order to smooth the results (moving.m). Please note that in the data-extraction line inside the loop above (the find(...) call) I also replaced the earlier value of 90 days with 60 days. Somehow, it seems more reasonable to examine the market dynamics with PCA over the past two months than over longer periods (but that is a matter of taste and needs).
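The post relies on an external helper moving.m which is not listed here. If you don’t have it, a minimal stand-in of my own (assuming a trailing window of length w with NaNs for the first w-1 points, which may well differ from the original implementation) could look like the sketch below; save it as moving.m on your Matlab path.

function y = moving(x, w)
% MOVING  trailing moving average of vector x over a window of w samples
%         (stand-in sketch for the original moving.m; first w-1 values set to NaN)
    x = x(:);
    y = filter(ones(w,1)/w, 1, x);  % trailing mean computed via an FIR filter
    y(1:w-1) = NaN;                 % not enough history for a full window
end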

Eventually, we construct the model’s core element, namely: we detect nervousness among traders when the percentage of negative loadings of the 1st Principal Component increases over (at least) five consecutive days:

% Model Core
x1=results(:,1);
y1=moving(results(:,2),30);
tmp=[];
% Find moments in time where the percentage of negative 1-PC
% loadings increases over time (minimal requirement of
% five consecutive days)
for i=6:length(x1)
    if(y1(i)>y1(i-1))&&(y1(i-1)>y1(i-2))&&(y1(i-2)>y1(i-3))&& ...
      (y1(i-3)>y1(i-4))&&(y1(i-4)>y1(i-5))
        tmp=[tmp; x1(i)];
    end
end
% When found
z=[];
for i=1:length(tmp)
    for j=1:length(nasdaq)
        if(tmp(i)==nasdaq(j,1))
            z=[z; nasdaq(j,1) nasdaq(j,2)];
        end
    end
end
subplot(2,1,1); 
hold on; plot(z(:,1),z(:,2),'r.','markersize',7);

We over-plot the results of the model with red markers on top of the NASDAQ-100 Index:
[Figure anxiety-fig06: NASDAQ-100 Index with red markers at the detected periods of rising trader anxiety]
Our simple model takes us into a completely new territory: the unexplored space of latent variables. Firstly, it does not predict the future; that still (unfortunately) remains unknown. However, what it delivers is a fresh look at the past dynamics in the market. Secondly, it is easy to read from the plot that the results cluster into three subgroups.

The first subgroup corresponds to actions in stock trading with further negative consequences (see the events of 2007-2009 and the avalanche of falling prices); here the negative dynamics observed over the trailing 60 calendar days continued. The second subgroup covers those periods of time when anxiety led to negative dynamics among stock traders but, due to other factors (e.g. financial, global, political, etc.), the stocks surged, dragging the Index up. The third subgroup (less frequent) corresponds to instances of relatively flat changes of the Index, revealing a typical pattern of psychological hesitation about the trading direction.

No matter how we might interpret the results, the human factor in trading is evident. Hopefully, the PCA approach captures it. If not, all we are left with is our best friend: a trader’s intuition.

Acknowledgments

An article dedicated to Dr. Dariusz Grech of the Physics and Astronomy Department of the University of Wroclaw, Poland, for his superbly important and mind-blowing lectures on linear algebra in the 1998/99 academic year.
