Quantitative Analysis, Risk Management, Modelling, Algo Trading, and Big Data Analysis

## Non-Linear Cross-Bicorrelations between the Oil Prices and Stock Fundamentals

When we talk about correlations in finance, by default, we assume linear relationships between two time-series “co-moving”. In other words, if one time-series changes its values over a give time period, we seek for a tight correlation reflected within the other time-series. If found, we say they are correlated. But wait a minute! Is the game always about looking for an immediate feedback between two time-series? Is “correlated” always a synonymous of a direct, i.e. a linear response?

Think for a second about the oil prices. If they rise within a quarter, the effect on some companies is not the same. Certain firms can be tightly linked to oil/petroleum prices that affect their product value, distribution, sales, etc. while for other companies the change of crude oil price has a negligible effect, e.g. the online services. Is it so?

As a data scientist, quantitative (business) analysts, or investment researcher you need to know that mathematics and statistics deliver ample of tools you may use in order to derive the best possibilities for two (or more) financial assets that can be somehow related, correlated, or connected. The wording does not need to be precise to reflect the goal of an investigation: correlation, in general.

Correlation can be linear and non-linear. Non-linearity of correlation is somehow counterintuitive. We used to think in a linear way that is why it is so hard to readjust our thinking in a non-linear domain. The changes of the oil prices might have a non-negligble effect on the airlines, causing the air-ticket prices to rise or fall due to recalculated oil/petroleum surcharge. The effect can be not immediate. A rise of the oil prices today can be “detected” with a time delay of days, weeks, or even months, i.e. non-linearly correlated.

In this post analyse both linear and non-linear correlation methods that have been so far applied (and not) in the research over oil price tectonics across the markets. In particular, first we discuss one-factor and multiple linear regression models to turn, in consequence, toward non-linearity possible to be captured by cross-correlation and cross-bicorrelation methods. By developing an extensive but easy-to-follow Python code, we analyse a database of Quandl.com storing over 620 stocks’ financials factors (stock fundamentals). Finally, we show how both non-linear mathematical tools can be applied in order to detect non-linear correlations for different financial assets.

1. Linear Correlations

1.1. Oil Prices versus Stock Markets

Correlations between increasing/decreasing oil prices and stock markets have been a subject of investigation for a number of years. The main interest was initially in search for linear correlations between raw price/return time-series among the stocks and oil benchmarks (e.g. WPI Crude Oil). A justification standing behind such approach to data analysis seemed to be driven by a common sense: an impact of rising oil prices should affect companies dependent on petroleum-based products or services.

The usual presumption states that a decline in oil price is a good news for the economy, especially for net oil importers, e.g. USA or China. An increase in oil prices usually causes the raise of the input costs for most businesses and force consumers to spend more money on gasoline, thereby reducing the corporate earnings of other businesses. The opposite should be true too when the oil prices fall.

The oil prices are determined by the supply-and-damand for petroleum-based products. During economic expansion, prices might rise as a result of increased consumption. They might fall as a result of increased production. That pattern has been observed on many occasions so far.

A quantitative approach to measuring the strength of linear response of the stock market to oil price can be demonstrated using a plain linear regression modeling. If we agree that S&P 500 Index is a proxy for the US stock market, we should be able to capture the relationship. In the following example, we compare WPI Crude Oil prices with S&P 500 Index for 2-year period between Nov 2014 and Nov 2016. First we plot both time-series, and next we apply one-factor linear regression model,
$$y(t) = \beta x(t) + \epsilon(t) \ \,$$ in order to fit the data best:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 import numpy as np import pandas as pd from scipy import stats from matplotlib import pyplot as plt from pandas_datareader import data, wb   %matplotlib inline   grey = 0.7, 0.7, 0.7   import warnings warnings.filterwarnings('ignore')   # downloading S&P 500 Index from Yahoo! Finance sp500 = web.DataReader("^GSPC", data_source='yahoo', start='2014-11-01', end='2016-11-01')['Adj Close']   # WPI Crude Oil price-series # src: https://fred.stlouisfed.org/series/DCOILWTICO/downloaddata dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d') wpi = pd.read_csv("DCOILWTICO.csv", parse_dates=['DATE'], date_parser=dateparse) wpi = wpi[wpi.VALUE != "."] wpi.VALUE = wpi.VALUE.astype(float) wpi.index = wpi.DATE wpi.drop('DATE', axis=1, inplace=True)   # combine both time-series data = pd.concat([wpi, sp500], axis=1).dropna() # and remove NaNs, if any data.columns = ["WPI", "S&P500"] print(data.head())
 WPI S&P500 2014-11-03 78.77 2017.810059 2014-11-04 77.15 2012.099976 2014-11-05 78.71 2023.569946 2014-11-06 77.87 2031.209961 2014-11-07 78.71 2031.920044
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 plt.figure(figsize=(8,8)) # plot time-series ax1 = plt.subplot(2,1,1) plt.plot(data.WPI, 'k', label="WPI Crude Oil") plt.legend(loc=3) plt.ylabel("WPI Crude Oil [U$/barrel]") # fit one-factor linear model slope, intercept, r2_value, p_value, std_err = stats.linregress(data.WPI, data["S&P500"]) # best fit: model xline = np.linspace(np.min(data.WPI), np.max(data.WPI), 100) line = slope*xline + intercept # plot scatter plot and linear model ax2 = ax1.twinx() plt.plot(data["S&P500"], 'm', label="S&P500") plt.ylabel("S&P 500 Index [U$]") plt.legend(loc="best") plt.subplot(2,1,2) plt.plot(data.WPI, data["S&P500"],'.', color=grey) plt.plot(xline, line,'--') plt.xlabel("WPI Crude Oil [U$/barrel]") plt.ylabel("S&P 500 Index [U$]") plt.title("R$^2$ = %.2f" % r2_value, fontsize=11)

Our Python code generates:

where we can see that WPI Crude Oil price vs. S&P 500 linear correlation is weak, with $R^2 = 0.36$ only. This is a static picture.

As one might suspect, correlation is time-dependent. Therefore, it is wiser to check a rolling linear correlation for a given data window size. Below, we assume its length to be 50 days:

59 60 61 62 63 64 rollcorr = pd.rolling_corr(wpi, sp500, 50).dropna()   plt.figure(figsize=(10,3)) plt.plot(rollcorr) plt.grid() plt.ylabel("Linear Correlation (50 day rolling window)", fontsize=8)

In 2008 Andrea Pescatori measured changes in the S&P 500 and oil prices in the same way as demonstrated above. He noted that variables occasionally moved in the same direction at the same time, but even then, the relationship remained weak. He concluded on no correlation at the 95% confidence level.

The lack of correlation might be attributed to factors like: wages, interest rates, industrial metal, plastic, computer technology that can offset changes in energy costs. On the other side, corporations might have become increasingly sophisticated at reading the futures markets and are able, much better, to anticipate the shift in factor prices (e.g. a company should be able to switch production processes to compensate for added fuel costs).

No matter what we analyse, either oil prices vs. market indexes or vs. individual stock trading records, linear correlation method is solely able to measure 1-to-1 response of $y$ to $x$. The lower coefficient of correlation the less valid linear model as a descriptor of true events and mutual relationships under study.

1.2. Multiple Linear Regression Model for Oil Price Changes

In February 2016 Ben S. Bernanke found that a positive correlation of stocks and oil might arise because both are responding to the underlying shift in global demand. Using a simple multiple linear regression model he took and attempt in explaining the daily changes of the oil prices and concluded that in over 90% they could be driven by changes of commodities prices (copper), US dollar (spot price), 10-yr treasury interest rate, and in over 95% by extending the model with an inclusion of daily changes in VIX:
$$\Delta p_{oil, t} = \beta_1 \Delta p_{copp, t} + \beta_2 \Delta p_{10yr, t} + \beta_3 \Delta p_{USD, t} + \beta_4 \Delta p_{VIX, t} + \beta_0$$ The premise is that commodity prices, long-term interest rate, and USD are likely to respond to investors’ perceptions of global and US demand for oil. Additionally, the presence of VIX is in line with the following idea: if investors retreat from commodities and/or stocks during periods of high uncertainty/risk aversion, then shocks to volatility may be another reason for the observed tendency of stocks and oil prices moving together.

The model, inspired by James Hamilton (2014) work, aims at measuring the effect of demand shifts on the oil market. For the sake of elegance, we can code it in Python as follows. First, we download to the files all necessary price-series and process them within pandas Dataframes (more on Quantitative Aspects of pandas for Quants you will find in my upcoming book Python for Quants. Volume II.).

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 import numpy as np import pandas as pd from scipy import stats import datetime from datetime import datetime from matplotlib import pyplot as plt import statsmodels.api as sm   %matplotlib inline   grey = 0.7, 0.7, 0.7   import warnings warnings.filterwarnings('ignore')   # Download WPI Crude Oil prices # src: https://fred.stlouisfed.org/series/DCOILWTICO/downloaddata dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d') wpi = pd.read_csv("DCOILWTICO.csv", parse_dates=['DATE'], date_parser=dateparse) wpi = wpi[wpi.VALUE != "."] wpi.VALUE = wpi.VALUE.astype(float) wpi.index = wpi.DATE wpi.drop('DATE', axis=1, inplace=True)   # Copper Futures # src: http://www.investing.com/commodities/copper-historical-data cop = pd.read_csv("COPPER.csv", index_col=["Date"]) cop.index = pd.to_datetime(pd.Series(cop.index), format='%b %d, %Y')   # Nominal interest rate on 10-year Treasury bonds # src: https://fred.stlouisfed.org/series/DGS10/ tb = pd.read_csv('DGS10.csv') tb = tb[tb.DGS10 != "."] tb.DGS10 = tb.DGS10.astype(float) tb['DATE'] = pd.to_datetime(tb['DATE'], format='%Y-%m-%d') tb.index = tb.DATE tb.drop('DATE', axis=1, inplace=True)   # Trade Weighted U.S. Dollar Index # src: https://fred.stlouisfed.org/series/DTWEXM usd = pd.read_csv('DTWEXM.csv') usd = usd[usd.DTWEXM != "."] usd.DTWEXM = usd.DTWEXM.astype(float) usd['DATE'] = pd.to_datetime(usd['DATE'], format='%Y-%m-%d') usd.index = usd.DATE usd.drop('DATE', axis=1, inplace=True)   # CBOE Volatility Index: VIX© (VIXCLS) # src: https://fred.stlouisfed.org/series/VIXCLS/downloaddata vix = pd.read_csv('VIXCLS.csv') vix = vix[vix.VALUE != "."] vix.VALUE = vix.VALUE.astype(float) vix['DATE'] = pd.to_datetime(vix['DATE'], format='%Y-%m-%d') vix.index = vix.DATE vix.drop('DATE', axis=1, inplace=True)

Not every time a plain use of pd.read_csv function works in the way we expect from it, therefore some extra steps are needed to be implemented in order to convert .csv files content to Dataframe, e.g. a correct formatting of date (line #18-19 or #35, 44, 53), price conversion from string to float (lines #21, 34, 43, 52), replacement of index with data included in ‘Date’ column followed by its removal (e.g. lines #22-23).

As a side note, in order to download the copper data, at the website we need manually mark a whole table, paste it into Excel or Mac’s Numbers, trim if required, and save down to .csv file. The problem might appear as dates are given in “Nov 26, 2016″ format. For this case, the following trick in Python saves a lot of time:

from datetime import datetime expr = 'Nov 26, 2016' datetime.strptime(expr, '%b %d, %Y')
datetime.datetime(2016, 11, 26, 0, 0)

what justifies the use of %b descriptor (see more on Python’s datetime date formats and parsing here and here).

pandas’ Dataframes are fast and convenient way to concatenate multiple time-series with indexes set to calendar dates. The following step ensures all five price-series to have data points on the same dates (if given), and if any data point (for any time-series) is missing (NaN), the whole row is removed from the end Dataframe of ‘df’ thanks to action of .dropna() function:

57 58 59 60 df = pd.concat([wpi, cop, tb, usd, vix], join='outer', axis=1).dropna() df.columns = ["WPI", "Copp", "TrB", "USD", "VIX"]   print(df.head(10))
 WPI Copp TrB USD VIX 2007-11-15 93.37 3.082 4.17 72.7522 28.06 2007-11-16 94.81 3.156 4.15 72.5811 25.49 2007-11-19 95.75 2.981 4.07 72.7241 26.01 2007-11-20 99.16 3.026 4.06 72.4344 24.88 2007-11-21 98.57 2.887 4.00 72.3345 26.84 2007-11-23 98.24 2.985 4.01 72.2719 25.61 2007-11-26 97.66 3.019 3.83 72.1312 28.91 2007-11-27 94.39 2.958 3.95 72.5061 26.28 2007-11-28 90.71 3.001 4.03 72.7000 24.11 2007-11-29 90.98 3.063 3.94 72.6812 23.97

Given all price-series and our model in mind, we are more interested in running the regression using daily change in the natural logarithm of crude oil, copper price, 10-yr interest rate, USD spot price, and VIX. The use of log is a common practice in finance and economics (see here why). Contrary to the daily percent change, we derive the log returns in Python in the following way:

np.random.seed(7) df = pd.DataFrame(100 + np.random.randn(100).cumsum(), columns=['price']) df['pct_change'] = df.price.pct_change() df['log_ret'] = np.log(df.price) - np.log(df.price.shift(1))   print(df.head())
 price pct_change log_ret 0 101.690526 NaN NaN 1 101.224588 -0.004582 -0.004592 2 101.257408 0.000324 0.000324 3 101.664925 0.004025 0.004016 4 100.876002 -0.007760 -0.007790

Having that, we continue coding our main model:

62 63 64 65 66 67 68 69 70 71 72 73 headers = df.columns   dlog = pd.DataFrame() # log returns Dateframe   for j in range(df.shape[1]): price = df.ix[:,j] dl = np.log(price) - np.log(price.shift(1)) dlog.loc[:,j] = dl   dlog.dropna(inplace=True) dlog.columns = headers print(dlog.head())
 WPI Copp TrB USD VIX 2007-11-16 0.015305 0.023727 -0.004808 -0.002355 -0.096059 2007-11-19 0.009866 -0.057047 -0.019465 0.001968 0.020195 2007-11-20 0.034994 0.014983 -0.002460 -0.003992 -0.044417 2007-11-21 -0.005968 -0.047024 -0.014889 -0.001380 0.075829 2007-11-23 -0.003353 0.033382 0.002497 -0.000866 -0.046910

and run multiple linear regression model for in-sample data, here, selected for 2 year in-sample period between Nov 2012 and Nov 2014:

75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 dlog0 = dlog.copy()   # in-sample data selection dlog = dlog[(dlog.index > "2012-11-01") & (dlog.index < "2014-11-01")]   # define input data y = dlog.WPI.values x = [dlog.Copp.values, dlog.TrB.values, dlog.USD.values, dlog.VIX.values ]   # Multiple Linear Regression using 'statsmodels' library def reg_m(y, x): ones = np.ones(len(x[0])) X = sm.add_constant(np.column_stack((x[0], ones))) for ele in x[1:]: X = sm.add_constant(np.column_stack((ele, X))) results = sm.OLS(y, X).fit() return results   # display a summary of the regression print(reg_m(y, x).summary())
 OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.105 Model: OLS Adj. R-squared: 0.098 Method: Least Squares F-statistic: 14.49 Date: Tue, 29 Nov 2016 Prob (F-statistic): 3.37e-11 Time: 11:38:44 Log-Likelihood: 1507.8 No. Observations: 499 AIC: -3006. Df Residuals: 494 BIC: -2985. Df Model: 4 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [95.0% Conf. Int.] ------------------------------------------------------------------------------ x1 -0.0303 0.008 -3.800 0.000 -0.046 -0.015 x2 -0.3888 0.178 -2.180 0.030 -0.739 -0.038 x3 0.0365 0.032 1.146 0.252 -0.026 0.099 x4 0.2261 0.052 4.357 0.000 0.124 0.328 const -3.394e-05 0.001 -0.064 0.949 -0.001 0.001 ============================================================================== Omnibus: 35.051 Durbin-Watson: 2.153 Prob(Omnibus): 0.000 Jarque-Bera (JB): 75.984 Skew: -0.392 Prob(JB): 3.16e-17 Kurtosis: 4.743 Cond. No. 338. ==============================================================================   Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

A complete model is given by the following parameters:

96 97 par = reg_m(y, x).params print(par)
[ -3.02699345e-02 -3.88784262e-01 3.64964420e-02 2.26134337e-01 -3.39418282e-05]

or more formally as:
$$\Delta p_{oil, t} = 0.2261 \Delta p_{copp, t} + 0.0365 \Delta p_{10yr, t} -0.3888 \Delta p_{USD, t} -0.0303 \Delta p_{VIX, t}$$ The adjusted $R^2$ is near zero and statistical significance of $\beta_2$ is more than 5%.

In order to use our model for “future” oil price prediction based on price movements in commodity (copper), long-term interest rate, USD spot price and VIX index, we generate out-of-sample data set and construct the estimation in the following way:

99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 # out-of-sample log return Dataframe dlog_oos = dlog0[dlog0.index >= '2014-11-01']   # what is the last WPI Crude Oil price in in-sample last = df[df.index == "2014-10-31"].values[0][0] # 80.53 USD/barrel   oilreg = pd.DataFrame(columns=["Date", "OilPrice"]) # regressed oil prices par = reg_m(y, x).params # parameters from regression   for i in range(dlog_oos.shape[0]): x1 = dlog_oos.iloc[i]["VIX"] x2 = dlog_oos.iloc[i]["USD"] x3 = dlog_oos.iloc[i]["TrB"] x4 = dlog_oos.iloc[i]["Copp"] w = par[0]*x1 + par[1]*x2 + par[2]*x3 + par[3]*x4 + par[4] if(i==0): oilprice = np.exp(w) * last else: oilprice = np.exp(w) * oilprice oilreg.loc[len(oilreg)] = [dlog_oos.index[i], oilprice]   oilreg.index = oilreg.Date oilreg.drop('Date', axis=1, inplace=True)

where oilreg Dataframe stores time and regressed oil prices. Visualisation of the results

123 124 125 126 127 128 129 130 plt.figure(figsize=(10,5)) plt.plot(df.WPI, color=grey, label="WPI Crude Oil") plt.plot(oilreg, "r", label="Predicted Oil Price") plt.xlim(['2012-01-01', '2016-11-01']) # in-sample & out-of-sample plt.ylim([20, 120]) plt.ylabel("USD/barrel") plt.legend(loc=3) plt.grid()

delivers

Bernanke’s multiple regression model points at an idea saying: when stock traders respond to a change in oil prices, they do so not necessarily because the oil movement in consequential in itself. Our finding (for different data sets than Bernanke’s) points at lack of strong dependancy on VIX thus on traders activity. Again, as in case of one-factor regression model discussed in Section 1.1, the stability of underlying correlations with all four considered factors remains statistically variable in time.

2. Non-Linear Correlations

So far we have understood how fragile the quantification and interpretation of the linear correlation between two time-series can be. We agreed on its time-dependant character and how data sample size may influence derived results. Linear correlation is a tool that captures 1-to-1 feedback between two data sets. What if a response is lagged? We need to look for some other useful tools good enough for this challenge. In this Section we will analyse two of them.

2.1. Cross-Correlation and Cross-Bicorrelation

If a change of one quantity has a noticeable effect on another one, this can be observed with or without time delay. In the time-series analysis we talk about one time-series being lagged (or led) behind (before) the other one. Since an exisiting lag (lead) is not observed directly (a linear feedback) it introduces a non-linear relationship between two time-series.

A degree of non-linear correlation can be measured by the application of two independent methods, namely, cross-correlations and cross-bicorrelations designed to examine the data dependence between any two (three) values led/lagged by $r$ ($s$) periods. By period one should understand the sampling frequency of the underlying time-series, for example a quarter, i.e. 3 month period.

Let $x(t) \equiv \{ x_t \} = \{ x_1, x_2, …, x_N \}$ for $t=1,…,N$ represents, e.g. the WPI Crude Oil price-series, and $\{ y_t \}$ the time-series of a selected quantity. For two time-series, $x(t)$ and $y(t)$, the cross-correlation is defined as:
$$C_{xy}(r) = (N-r)^{-1} \sum_{i=1}^{N-r} x(t_i) y(t_i + r)$$ whereas the cross-bicorrelation is usually assumed as:
$$C_{xyy}(r,s) = (N-m)^{-1} \sum_{i=1}^{N-m} x(t_i) y(t_i + r) y(t_i + s)$$ where $m=\max(r,s)$ (Brooks & Hinich 1999). Both formulae measure the inner product between the individual values of two time-series. In case of cross-correlation $C_{xy}(r)$ it is given by $\langle x(t_i) y(t_i + r) \rangle$ where the value of $y(t)$ is lagged by $r$ periods. The summation, similarly as in the concept of the Fourier transform, picks up the maximum power for a unique set of lags $r$ or $(r,s)$ for $C_{xy}(r)$ and $C_{xyy}(r,s)$, respectively. The use of cross-correlation method should return the greatest value of $C_{xy}(r)$ among all examined $r$’s. Here, $1 \le r < N$ in a first approximation. It reveals a direct non-linear correlation between $x(t_i)$ and $y(t_i + r)$ $\forall i$.

The concept of cross-bicorrelation goes one step further and examines the relationship between $x(t_i)$ and $y(t_i + r)$ and $y(t_i + s)$, e.g. how the current oil price at time $t$, $x(t_i)$, affects $y$ lagged by $r$ and $s$ independent time periods. Note that cross-bicorrelation also allows for $C_{xxy}(r,s)$ form, i.e. if we require to look for relationship of the current and lagged (by $r$) values of $x$ time-series and period $y$ time-series lagged by $s$.

2.2. Statistical Tests for Cross-Correlation and Cross-Bicorrelation

The application of both tools for small data samples ($N < 50$) does not guarantee a statistically significant non-linear correlations every single time the computations are done. The need for statistical testing of the results returned by both cross-correlation and cross-bicorrelation methods has beed addressed by Hinich (1996) and Brook & Hinich (1999). For two time-series, with a weak stationarity, the null hypothesis $H_0$ for the test is that two series are independent pure white noise processes against the alternative hypothesis $H_1$ saying that some cross-covariances, $E[x(t_i) y(t_i + r)]$, or cross-bicovariances, $E[x(t_i) y(t_i + r) y(t_i + s)]$, are nonzero. The test statistics, respectively for cross-correlation and cross-bicorrelation, are given by:
$$H_{xy}(N) = \sum_{r=1}^{L} (N-L) C_{xy}^2(r) = (N-L) \sum_{r=1}^{L} C_{xy}^2(r)$$ and
$$H_{xyy}(N) = \sum_{s=-L}^{L} ‘ \sum_{r=1}^{L} (N-m) C_{xyy}^2(r,s) \ \ \ (‘-s \ne -1, 1, 0)$$ where it has been shown by Hinich (1996) that:
$$H_{xy}(N) \sim \chi^2_L$$ and
$$H_{xyy}(N) \sim \chi^2_{(L-1)(L/2)}$$ where $\sim$ means are asymptotically distributed with the corresponding degrees for $N \rightarrow \infty$ and $m = \max(r, s)$ as previously. For both $H_{xy}(N)$ and $H_{xyy}(N)$ statistics, the number of lags $L$ is specified as $L = N^b$ where $0 < b < 0.5$. Based on the results of the Monte-Carlo simulations, Hinich & Parrerson (1995) recommended the use of $b=0.4$ in order to maximise the power of test while ensuring a valid approximation to the asymptotic theory. Further down, we compute $H_{xy}(N)$ and $H_{xyy}(N)$ statistics for every cross-correlation and cross-bicorrelation and determine the corresponding significance assuming the confidence level of 90% ($\alpha = 0.1$) if: $$\mbox{p-value} < \alpha$$ where $$\mbox{p-value} = 1 - \mbox{Pr} \left[ \chi^2_{\mbox{dof}} < H_{x(y)y} | H_1 \right] \ .$$ Given that, any resultant values of cross-correlations and cross-bicorrelations one can assume as statistically significant if p-value $< \alpha$. A recommended number of lags to be investigated is $L = N^{0.4}$ where $N$ is the length of $x(t)$ and $y(t)$ time-series. Employing floor rounding for $L$ we obtain the following relationship between $N$ and $L$ for the sample sizes of $N \le 50$:

L = [] for N in range(1, 51): L.append(np.floor(N**0.4))   fig, ax = plt.subplots(figsize=(8,3)) plt.plot(np.arange(1, 51), L, '.') plt.xlabel("N") plt.ylabel("L") ax.margins(x=0.025, y=0.05) # add extra padding ax.tick_params(axis='y',which='minor',left='off') # without minor yticks plt.grid(b=True, which='minor', color='k', linestyle=':', alpha=0.3) plt.minorticks_on()

2.3. Cross-Correlation and Cross-Bicorrelation in Python

Let’s translate cross-correlation to Python language and run a simple test for a random time-series:

# cross-correlation def Cxy(x, y, r, N): z = 0 for i in range(0, N-r-1): z += x[i] * y[i+r] z /= (N-r) return z   # test statistic for cross-correlation def Hxy(x, y, L, N): z = 0 for r in range(1, L+1): z += Cxy(x, y, r, N)**2 z *= (N - L) return z

where we defined two functions deriving $C_{xy}(r)$ and $H_{xy}(N)$, respectively. Now, let’s consider some simplified time-series of $N=16$, both normalised to be $\sim N(0,1)$:

# sample data y = np.array([0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]) x = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])   # normalise such that x, y ~ N(0,1) x = (x - np.mean(x))/np.std(x, ddof=1) y = (y - np.mean(y))/np.std(y, ddof=1)

As one can see, there is expected strong cross-correlation for lag of $r = 2$ (strongest) with $r = 1$ to be detected too. We verify the overall statistical significance of cross-correlations for given time-series in the following way:

N = len(x) L = int(np.floor(N**0.4))   for r in range(1, L+1): print("r =%2g\t Cxy(r=%g) = %.4f" % (r, r, Cxy(x, y, r)))   dof = L pvalue = 1 - scipy.stats.chi2.cdf(Hxy(x, y, L, N), dof) alpha = 0.1   print("Hxy(N) = %.4f" % Hxy(x, y, L, N)) print("p-value = %.4f" % pvalue) print("H0 rejected in favour of H1 at %g%% c.l.: %s" % (100*(1-alpha), pvalue<alpha))

confirming both our expectations and significance of detected cross-correlations at 90% confidence level:

r = 1 Cxy(r=1) = 0.0578 r = 2 Cxy(r=2) = 0.8358 r = 3 Cxy(r=3) = -0.3200 Hxy(N) = 10.4553 p-value = 0.0151 H0 rejected in favour of H1 at 90% c.l.: True

For cross-bicorrelation we have:

# cross-bicorrelation def Cxyy(x, y, r, s, N): z = 0 m = np.max([r, s]) for i in range(0, N-m-1): z += x[i] * y[i+r] * y[i+s] z /= (N-m) return z   # test statistic for cross-bicorrelation def Hxyy(x, y, L, N): z = 0 for s in range(2, L+1): for r in range(1, L+1): m = np.max([r, s]) z += (N-m) * Cxyy(x, y, r, s, N)**2 return z     # sample data y = np.array([0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0]) x = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0])   # normalise such that x, y ~ N(0,1) x = (x - np.mean(x))/np.std(x, ddof=1) y = (y - np.mean(y))/np.std(y, ddof=1)   N = len(x) # len(x) should be equal to len(y) L = int(np.floor(N**0.4))   for r in range(0, L+1): for s in range(r+1, L+1): print("r =%2g, s =%2g\t Cxyy(r=%g, s=%g) = %.5f" % (r, s, r, s, Cxyy(x, y, r, s, N)))   dof = (L-1)*(L/2) pvalue = 1 - scipy.stats.chi2.cdf(Hxy(x, y, L, N), dof) alpha = 0.1   print("Hxyy(N) = %.4f" % Hxyy(x, y, L, N)) print("p-value = %.4f" % pvalue) print("H0 rejected in favour of H1 at %g%% c.l.: %s" % (100*(1-alpha), pvalue<alpha))

where for a new sample data we expect to detect two strongest cross-bicorrelations for pairs of lags $(r,s)$ equal $(2,3)$ and $(1,2)$, respectively. By running the code we obtain indeed that:

r = 0, s = 1 Cxyy(r=0, s=1) = -0.18168 r = 0, s = 2 Cxyy(r=0, s=2) = -0.23209 r = 0, s = 3 Cxyy(r=0, s=3) = 0.05375 r = 1, s = 2 Cxyy(r=1, s=2) = 0.14724 r = 1, s = 3 Cxyy(r=1, s=3) = -0.22576 r = 2, s = 3 Cxyy(r=2, s=3) = 0.18276 Hxyy(N) = 5.3177 p-value = 0.0833 H0 rejected in favour of H1 at 90% c.l.: True

With these two examples we showed both cross-correlation and cross-bicorrelation methods in action. We also pointed at the important fact of an existing dependency between the length of time-series and the expected number of lags to be examined. The longer time-series the higher number of degrees-of-freedom.

Given that state of the play, there exist two independent and appealing ways for the application of cross-correlation and cross-bicorrelation in finance: (1) the first one regards the examination of time-series with a small number of data points (e.g. $N \le 50$); (2) alternatively, if we deal with a long ($N \ge 1000$) time-series, the common practice is to divide them into $k$ chunks (non-overlapping windows) of length at least $N_i$ of $50$ to $100$ where $i=1,…,k$ and compute cross-correlation and cross-bicorrelation metrics for every data window separately. In consequence, within this approach, we gain an opportunity to report on time-period dependent data segment with statistically signifiant (or not) non-linear correlations! (see e.g. Coronado et alii 2015).

In the Section that follows, we solely focus on the application of the first approach. Rewarding in many aspects if you’re open-mined.

2.4. Detecting Non-Linearity between Oil Prices and Stock Fundamentals

Since looking for a direct correlation between crude oil prices and stocks can be regarded as the first approach to the examination of mutual relationships, it’s not sexy; not any more. With cross-correlation and cross-bicorrelation methods you can step up and break the boredom.

The stock fundamentals consist a broad space of parameters that actually deserve to be cross-correlated with the oil prices. It is much more intuitive that any (if any) effect of rising or falling crude oil prices may affect different companies with various time lags. Moreover, for some businesses we might detect a non-linear relationships existing only between oil and company’s e.g. revenues whereas for others, changing oil prices might affect more than one factor. If the patterns is present indeed, with the application of cross-correlation and cross-bicorrelation methods, you gain a brand new tool for data analysis. Let’s have a look at a practical example.

As previously, we will need both the WPI Crude Oil price-series (see Section 1.1) and access to a large database containing company’s fundamentals over a number of quarters or years. For the latter let me use a demo version of Quandl.com‘s Zacks Fundamentals Collection C available here (download Zacks Fundamentals Expanded ZACKS-FE.csv file). It contains 852 rows of data for 30 selected companies traded at US stock markets (NYSE, NASDAQ, etc.). This sample collection covers data back to 2011 i.e. ca. 22 to 23 quarterly reported financials. There is over 620 fundamentals to select from. Their full list you can find here (download the ZFC Definitions csv file). Let’s put it all in a new Python listing:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 # Non-Linear Cross-Bicorrelations between the Oil Prices and Stock Fundamentals # (c) 2016 QuantAtRisk.com, by Pawel Lachowicz   import numpy as np import pandas as pd import scipy.stats import datetime from matplotlib import pyplot as plt   %matplotlib inline   grey = 0.7, 0.7, 0.7   import warnings warnings.filterwarnings('ignore')     # WTI Crude Oil price-series # src: https://fred.stlouisfed.org/series/DCOILWTICO/downloaddata dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d') wpi = pd.read_csv("DCOILWTICO.csv", parse_dates=['DATE'], date_parser=dateparse) wpi = wpi[wpi.VALUE != "."] wpi.VALUE = wpi.VALUE.astype(float)   # Zacks Fundamentals Collection C (Fundamentals Expanded) # src: https://www.quandl.com/databases/ZFC df = pd.read_csv('ZACKS-FE.csv')   col = ['Ticker', 'per_end_date', 'per_code', 'Revenue', "EBIT", "GrossProfit", "CAPEX"] data = pd.DataFrame(columns=col) # create an empty DataFrame   tickers = ["XOM", "WMT", "BA", "AAPL", "JNJ"]   for i in range(df.shape[0]): ticker = df.iloc[i]["ticker"] pc = df.iloc[i]["per_code"] # extract only reported fundamentals for quarters "QR0 and all "QR-" # in total 22 to 23 rows of data per stock if((pc[0:3] == "QR-") | (pc[0:3] == "QR0")) and (ticker in tickers): data.loc[len(data)] = [df.iloc[i]["ticker"], df.iloc[i]["per_end_date"], df.iloc[i]["per_code"], df.iloc[i]["tot_revnu"], df.iloc[i]["ebit"], df.iloc[i]["gross_profit"], df.iloc[i]["cap_expense"]]   print(data.head(25))

where we created a new Dataframe, data, storing Total Revenue, EBIT, Gross Profit (revenue from operations less associated costs), and CAPEX (capital expenditure) for a sample of 5 stocks (XOM, WMT, BA, AAPL, and JNJ) out of 30 available in that demo collection. Our goal here is just to show a path for a possible analysis of all those numbers with, kept in mind, an encouragement to build a bigger tool for stock screening based on detection of non-linear correlations amongst selected fundamentals and oil prices.

First 25 lines of data Dataframe are:

 Ticker per_end_date per_code Revenue EBIT GrossProfit CAPEX 0 AAPL 2011-03-31 QR-22 24667.0 7874.0 10218.0 -1838.0 1 AAPL 2011-06-30 QR-21 28571.0 9379.0 11922.0 -2615.0 2 AAPL 2011-09-30 QR-20 28270.0 8710.0 11380.0 -4260.0 3 AAPL 2011-12-31 QR-19 46333.0 17340.0 20703.0 -1321.0 4 AAPL 2012-03-31 QR-18 39186.0 15384.0 18564.0 -2778.0 5 AAPL 2012-06-30 QR-17 35023.0 11573.0 14994.0 -4834.0 6 AAPL 2012-09-30 QR-16 35966.0 10944.0 14401.0 -8295.0 7 AAPL 2012-12-31 QR-15 54512.0 17210.0 21060.0 -2317.0 8 AAPL 2013-03-31 QR-14 43603.0 12558.0 16349.0 -4325.0 9 AAPL 2013-06-30 QR-13 35323.0 9201.0 13024.0 -6210.0 10 AAPL 2013-09-30 QR-12 37472.0 10030.0 13871.0 -8165.0 11 AAPL 2013-12-31 QR-11 57594.0 17463.0 21846.0 -1985.0 12 AAPL 2014-03-31 QR-10 45646.0 13593.0 17947.0 -3367.0 13 AAPL 2014-06-30 QR-9 37432.0 10282.0 14735.0 -5745.0 14 AAPL 2014-09-30 QR-8 42123.0 11165.0 16009.0 -9571.0 15 AAPL 2014-12-31 QR-7 74599.0 24246.0 29741.0 -3217.0 16 AAPL 2015-03-31 QR-6 58010.0 18278.0 23656.0 -5586.0 17 AAPL 2015-06-30 QR-5 49605.0 14083.0 19681.0 -7629.0 18 AAPL 2015-09-30 QR-4 51501.0 14623.0 20548.0 -11247.0 19 AAPL 2015-12-31 QR-3 75872.0 24171.0 30423.0 -3612.0 20 AAPL 2016-03-31 QR-2 50557.0 13987.0 19921.0 -5948.0 21 AAPL 2016-06-30 QR-1 42358.0 10105.0 16106.0 -8757.0 22 AAPL 2016-09-30 QR0 46852.0 11761.0 17813.0 -12734.0 23 BA 2011-03-31 QR-22 14910.0 1033.0 2894.0 -417.0 24 BA 2011-06-30 QR-21 16543.0 1563.0 3372.0 -762.0

An exemplary visualisation of both data sets for selected company (here: Exxon Mobil Corporation, XOM, engaged in the exploration and production of crude oil and natural gas, manufacturing of petroleum products, and transportation and sale of crude oil, natural gas and petroleum products) we obtain with:

49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 fig, ax1 = plt.subplots(figsize=(10,5)) plt.plot(wpi.DATE, wpi.VALUE, color=grey, label="WPI Crude Oil (daily)") plt.grid(True) plt.legend(loc=1) plt.ylabel("USD per barrel")   selection = ["XOM"]   ax2 = ax1.twinx() plt.ylabel("1e6 USD") for ticker in selection: tmp = data[data.Ticker == ticker] lab = ticker tmp['per_end_date'] = pd.to_datetime(tmp['per_end_date'], format='%Y-%m-%d') plt.plot(tmp.per_end_date, tmp.Revenue, '.-', label=lab + ": Revenue") plt.plot(tmp.per_end_date, tmp.EBIT, '.-g', label=lab + ": EBIT") plt.plot(tmp.per_end_date, tmp.GrossProfit, '.-m', label=lab + ": Gross Profit") plt.plot(tmp.per_end_date, tmp.CAPEX, '.-r', label=lab + ": CAPEX") plt.legend(loc=3)

Every company has different periods of time (per_end_date) when they report their financials. That is why we need to make sure that in the process of comparison of selected fundamentals with oil prices, the oil price is picked up around the date defined by per_end_date. Let me use $\pm 5$ day average around those points:

70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 # cross-correlation def Cxy(x, y, r, N): z = 0 for i in range(0, N-r-1): z += x[i] * y[i+r] z /= (N-r) return z   # test statistic for cross-correlation def Hxy(x, y, L, N): z = 0 for r in range(1, L+1): z += Cxy(x, y, r, N)**2 z *= (N - L) return z   # cross-bicorrelation def Cxyy(x, y, r, s, N): z = 0 m = np.max([r, s]) for i in range(0, N-m-1): z += x[i] * y[i+r] * y[i+s] z /= (N-m) return z   def Hxyy(x, y, L, N): z = 0 for s in range(2, L+1): for r in range(1, L+1): m = np.max([r, s]) z += (N-m) * Cxyy(x, y, r, s, N)**2 return z     fig, ax1 = plt.subplots(figsize=(10,5)) plt.plot(wpi.DATE, wpi.VALUE, color=grey, label="WPI Crude Oil (daily)") plt.grid(True) plt.legend(loc=1) plt.ylabel("USD per barrel")   selection = ["XOM"]   ax2 = ax1.twinx() plt.ylabel("1e6 USD") for ticker in selection: tmp = data[data.Ticker == ticker] lab = ticker tmp['per_end_date'] = pd.to_datetime(tmp['per_end_date'], format='%Y-%m-%d') plt.plot(tmp.per_end_date, tmp.Revenue, '.-', label=lab + ": Revenue") plt.plot(tmp.per_end_date, tmp.EBIT, '.-g', label=lab + ": EBIT") plt.plot(tmp.per_end_date, tmp.GrossProfit, '.-m', label=lab + ": Gross Profit") plt.plot(tmp.per_end_date, tmp.CAPEX, '.-r', label=lab + ": CAPEX") plt.legend(loc=3)   col = ['per_end_date', 'avgWPI'] wpi2 = pd.DataFrame(columns=col) # create an empty DataFrame   for i in range(tmp.shape[0]): date = tmp.iloc[i]["per_end_date"] date0 = date date1 = date + datetime.timedelta(days=-5) date2 = date + datetime.timedelta(days=+5) wpiavg = wpi[(wpi.DATE >= date1) & (wpi.DATE <= date2)] avg = np.mean(wpiavg.VALUE) wpi2.loc[len(wpi2)] = [date0, avg]   # plot quarterly averaged oil price-series plt.sca(ax1) # set ax1 axis for plotting plt.plot(wpi2.per_end_date, wpi2.avgWPI, '.-k')

where a black line denotes quarterly changing WPI Crude Oil prices stored now in a new Dataframe of wpi2.

Let’s leave only XOM in selection list variable. After line #140 the Dataframe of tmp stores XOM’s selected fundamentals data. In what follows we will use both Dateframes (tmp and wpi2) to look for non-linear correlations as the final goal of this investigation:

142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 # select some factors fundamentals = ["Revenue", "EBIT", "GrossProfit", "CAPEX"]   # compute for them cross-correlations and cross-bicorrelations # and display final results # for f in fundamentals: print("%s: %s\n" % (ticker, f)) # input data x = wpi2.avgWPI.values y = tmp[f].values   # normalised time-series x = (x - np.mean(x))/np.std(x, ddof=1) y = (y - np.mean(y))/np.std(y, ddof=1)   N = len(x) # len(x) should be equal to len(y) L = int(np.floor(N**0.4))   print("Cross-Correlation") for r in range(1, L+1): print(" r =%2g\t Cxy(r=%g) = %.4f" % (r, r, Cxy(x, y, r, N)))   dof = L pvalue = 1 - scipy.stats.chi2.cdf(Hxy(x, y, L, N), dof) alpha = 0.1   print(" Hxy(N) = %.4f" % Hxy(x, y, L, N)) print(" p-value = %.4f" % pvalue) print(" H0 rejected in favour of H1 at %g%% c.l.: %s" % (100*(1-alpha), pvalue<alpha))   print("Cross-Bicorrelation") for r in range(0, L+1): for s in range(r+1, L+1): print(" r =%2g, s =%2g\t Cxyy(r=%g, s=%g) = %.5f" % (r, s, r, s, Cxyy(x, y, r, s, N)))   dof = (L-1)*(L/2) pvalue = 1 - scipy.stats.chi2.cdf(Hxy(x, y, L, N), dof)   print(" Hxyy(N) = %.4f" % Hxyy(x, y, L, N)) print(" p-value = %.4f" % pvalue) print(" H0 rejected in favour of H1 at %g%% c.l.: %s\n" % (100*(1-alpha), pvalue<alpha))

Our Python code generates for XOM the following output:

XOM: Revenue   Cross-Corelation r = 1 Cxy(r=1) = 0.7869 r = 2 Cxy(r=2) = 0.6239 r = 3 Cxy(r=3) = 0.4545 Hxy(N) = 24.3006 p-value = 0.0000 H0 rejected in favour of H1 at 90% c.l.: True Cross-Bicorrelation r = 0, s = 1 Cxyy(r=0, s=1) = -0.44830 r = 0, s = 2 Cxyy(r=0, s=2) = -0.30186 r = 0, s = 3 Cxyy(r=0, s=3) = -0.19315 r = 1, s = 2 Cxyy(r=1, s=2) = -0.36678 r = 1, s = 3 Cxyy(r=1, s=3) = -0.22161 r = 2, s = 3 Cxyy(r=2, s=3) = -0.24446 Hxyy(N) = 10.3489 p-value = 0.0000 H0 rejected in favour of H1 at 90% c.l.: True   XOM: EBIT   Cross-Correlation r = 1 Cxy(r=1) = 0.7262 r = 2 Cxy(r=2) = 0.5881 r = 3 Cxy(r=3) = 0.3991 Hxy(N) = 20.6504 p-value = 0.0001 H0 rejected in favour of H1 at 90% c.l.: True Cross-Bicorrelation r = 0, s = 1 Cxyy(r=0, s=1) = -0.39049 r = 0, s = 2 Cxyy(r=0, s=2) = -0.25963 r = 0, s = 3 Cxyy(r=0, s=3) = -0.17480 r = 1, s = 2 Cxyy(r=1, s=2) = -0.28698 r = 1, s = 3 Cxyy(r=1, s=3) = -0.18030 r = 2, s = 3 Cxyy(r=2, s=3) = -0.21284 Hxyy(N) = 6.8759 p-value = 0.0001 H0 rejected in favour of H1 at 90% c.l.: True   XOM: GrossProfit   Cross-Correlation r = 1 Cxy(r=1) = 0.7296 r = 2 Cxy(r=2) = 0.5893 r = 3 Cxy(r=3) = 0.3946 Hxy(N) = 20.7065 p-value = 0.0001 H0 rejected in favour of H1 at 90% c.l.: True Cross-Bicorrelation r = 0, s = 1 Cxyy(r=0, s=1) = -0.38552 r = 0, s = 2 Cxyy(r=0, s=2) = -0.24803 r = 0, s = 3 Cxyy(r=0, s=3) = -0.17301 r = 1, s = 2 Cxyy(r=1, s=2) = -0.28103 r = 1, s = 3 Cxyy(r=1, s=3) = -0.17393 r = 2, s = 3 Cxyy(r=2, s=3) = -0.19780 Hxyy(N) = 6.2095 p-value = 0.0001 H0 rejected in favour of H1 at 90% c.l.: True   XOM: CAPEX   Cross-Correlation r = 1 Cxy(r=1) = -0.2693 r = 2 Cxy(r=2) = -0.3024 r = 3 Cxy(r=3) = -0.2333 Hxy(N) = 4.3682 p-value = 0.2244 H0 rejected in favour of H1 at 90% c.l.: False Cross-Bicorrelation r = 0, s = 1 Cxyy(r=0, s=1) = 0.00844 r = 0, s = 2 Cxyy(r=0, s=2) = -0.10889 r = 0, s = 3 Cxyy(r=0, s=3) = -0.14787 r = 1, s = 2 Cxyy(r=1, s=2) = -0.10655 r = 1, s = 3 Cxyy(r=1, s=3) = -0.16628 r = 2, s = 3 Cxyy(r=2, s=3) = -0.08072 Hxyy(N) = 6.4376 p-value = 0.2244 H0 rejected in favour of H1 at 90% c.l.: False

A quick analysis points at detection of significant non-linear correlations (at 90% confidence level) between WPI Crude Oil and Revenue, EBIT, and GrossProfit for lag $r=1$ (based on cross-correlations) and for lags $(r=0, s=1)$ (based on cross-bicorrelations). There is no significant non-linear relationship between XOM’s CAPEX and oil prices.

Interestingly, rerunning the code for Boeing Company (NYSE ticker: BA) reveals no significant cross-(bi)correlations between the same factors and oil. For example, Boeing’s revenue improves year-over-year,

but it appears to have any significant non-linear link to the oil prices whatsoever:

BA: Revenue   Cross-Correlation r = 1 Cxy(r=1) = -0.3811 r = 2 Cxy(r=2) = -0.3216 r = 3 Cxy(r=3) = -0.1494 Hxy(N) = 5.4200 p-value = 0.1435 H0 rejected in favour of H1 at 90% c.l.: False Cross-Bicorrelation r = 0, s = 1 Cxyy(r=0, s=1) = 0.11202 r = 0, s = 2 Cxyy(r=0, s=2) = 0.03588 r = 0, s = 3 Cxyy(r=0, s=3) = -0.08083 r = 1, s = 2 Cxyy(r=1, s=2) = 0.01526 r = 1, s = 3 Cxyy(r=1, s=3) = 0.00616 r = 2, s = 3 Cxyy(r=2, s=3) = -0.07252 Hxyy(N) = 0.3232 p-value = 0.1435 H0 rejected in favour of H1 at 90% c.l.: False

The most interesting aspect of cross-correlation and cross-bicorrelation methods applied in oil vs. stock fundamentals research regards new opportunities to discover correlations for stocks that so far have been completely ignored (or not suspected) to have anything in common with oil price movements in the markets. Dare to go this path. I hope you will find new assets to invest in. The information is there. Now you are equipped with new tools. Use them wisely. Retire early.

REFERENCES
Bernanke B. S., 2016, The relationship between stocks and oil prices
Brooks C., Hinich M. J., 1999, Cross-correlations and cross-bicorrelations in Sterling
exchange rates
(pdf)
Coronado et al., 2015, A study of co-movements between USA and Latin American stock
markets: a cross-bicorrelations perspective
(pdf)
Hamilton J., 2014, Oil prices as an indicator of global economic conditions
Hinich M.J., 1996. Testing for dependence in the input to a linear time series model.
Journal of Nonparametric Statistics 6, 205–221.
Hinich M.J., Patterson D.M., 1995, Detecting epochs of transient dependence in white
noise
, Mimeo, University of Texas at Austin

## Information Ratio and its Relative Strength for Portfolio Managers

The risk and return are the two most significant quantities investigated within the investment performance. We expect the return to be highest at the minimum risk. In practice, a set of various combinations is always achieved depending on a number of factors related to the investment strategies and market behaviours, respectively. The portfolio manager or investment analyst is interested in the application of most relevant tool in order to described the overall performance. The arsenal is rich but we need to know the character of outcomes we wish to underline prior to fetching the proper instruments from the workshop.

In general there are two schools of investment performance. The first one is strictly related to the analysis of time-series on the tick-by-tick basis. All I mean by tick is a time scale of your data, e.g. minutes or months. The analyst is concerned about changes in investments and risk tracked continuously. A good example of this sort of approach is measurement of Value-at-Risk and Expected Shortfall, Below Target Risk or higher-moments of Coskewness and Cokurtosis. It all supplements the n-Asset Portfolio Construction with Efficient Frontier diagnostic.

The second school in the investment analysis is a macro-perspective. It includes the computation on $n$-period rolling median returns (or corresponding quantities), preferably annualized for the sake of comparison. This approach carries more meaning behind the lines rather than a pure tick time-series analysis. For the former, we track a sliding changes in accumulation and performance while for the latter we focus more closely on tick-by-tick risks and returns. For the sake of global picture and performance (market-oriented vantage point), a rolling investigation of risk and return reveals mutual tracking factors between different considered scenarios.

In this article I scratch the surface of the benchmark-oriented analysis with a special focus on the Information Ratio (IR) as an investment performance-related measure. Based on the market data, I test the IR in action and introduce a new supplementary measure of Information Ratio Relative Strength working as a second dimension for IR results.

1. Benchmark-oriented Analysis

Let’s assume you a portfolio manager and you manage over last 10 years 11 investment strategies. Each strategy is different, corresponds to different markets, style of investing, and asset allocation. You wish to get a macro-picture how well your investment performed and are they beating the benchmark. You collect a set of 120 monthly returns for each investment strategy plus their benchmarks. The easiest way to kick off the comparison process is the calculation of $n$-period median (or mean) returns.

Annualized Returns

The rolling $n$Y ($n$-Year) median return is defined as:
$$\mbox{rR}_{j}(n\mbox{Y}) = \left[ \prod_{i=1}^{12n} (r_{i,j}+1)^{1/n} \right] -1 \ \ \ \ \ \mbox{for}\ \ \ j=12n,…,N\ \ (\mbox{at}\ \Delta j =1)$$
where $r$ are the monthly median returns meeting the initial data selection criteria (if any) and $N$ denotes the total number of data available (in our case, $N=120$).

Any pre-selection (pre-filtering) of the input data becomes more insightful when contrasted with the benchmark performance. Thereinafter, by the benchmark we can assume a relative measure we will refer to (e.g. S\&P500 Index for US stocks, etc.) calculated as the rolling $n$Y median. Therefore, for instance, the 1Y (rolling) returns can be compared against the corresponding benchmark (index). The selection of a good/proper benchmark belongs to us.

Active Returns

We define active returns on the monthly basis as the difference between actual returns and benchmark returns,
$$a_i = r_i – b_i ,$$
as available at $i$ where $i$ denotes a continuous counting of months over past 120 months.

Annualized Tracking Risk (Tracking Error)

We assume the widely accepted definition of the tracking risk as the standard deviation of the active returns. For the sake of comparison, we transform it the form of the annualized tracking risk in the following way:
$$\mbox{TR}_{j}(n\mbox{Y}) = \sqrt{12} \times \sqrt{ \frac{1}{12n-1} \sum_{i=1}^{12n} \left( a_{i,j}-\langle a_{i,j} \rangle \right)^2 }$$
where the square root of 12 secures a proper normalization between calculations and $j=12n,…,N$.

Annualized Information Ratio (IR)

The information ratio is often referred to as a variation or generalised version of Sharpe ratio. It tells us how much excess return ($a_i$) is generated from the amount of excess risk taken relative to the benchmark. We calculate the annualized information ratio by dividing the annualised active returns by the annualised tracking risk as follows:
$$\mbox{IR}_j(n\mbox{Y}) = \frac{ \left[ \prod_{i=1}^{12n} (a_{i,j}+1)^{1/n} \right] -1 } { TR_j(n\mbox{Y}) } \ \ \ \ \ \mbox{for}\ \ \ j=12n,…,N$$

We assume the following definition of the investment under-performance and out-performance, $P$, as defined based on IR:
$$P_j = \begin{cases} P_j^{-} =\mbox{under-performance}, & \text{if } \mbox{IR}_j(n\mbox{Y}) \le 0 \\ P_j^{+} =\mbox{out-performance}, & \text{if } \mbox{IR}_j(n\mbox{Y}) > 0 . \end{cases}$$

Information Ratio Relative Strength of Performance

Since $P_j$ marks in time only those periods of time (months) where a given quantity under investigation achieved better results than the benchmark or not, it does not tell us anything about the importance (weight; strength) of this under/over-performance. The case could be, for instance, to observe the out-performance for 65\% of time over 10 years but barely beating the benchmark in the same period. Therefore, we introduce an additional measure, namely, the IR relative strength defined as:
$$\mbox{IRS} = \frac{ \int P^{+} dt } { \left| \int P^{-} dt \right| } .$$
The IRS measures the area under $\mbox{IR}_j(n\mbox{Y})$ function when $\mbox{IR}_j(n\mbox{Y})$ is positive (out-performance) relative to the area cofined between $\mbox{IR}_j(n\mbox{Y})$ function and $\mbox{IR}_j(n\mbox{Y})=0$. Thanks to that measure, we are able to estimate the revelance of
$$\frac { n(P^{+}) } { n(P^{-}) + n(P^{+}) }$$
ratio, i.e. the fraction of time when the examined quantity displayed out-performance over specified period of time (e.g. 10 years of data within our analysis). Here, $n(P^{+})$ counts number of $P_j^{+}$ instances.

2. Case Study

Let’s demonstrate the aforementioned theory in practice. Making use of real data, we perform a benchmark-oriented analysis for all 11 investment strategies. We derive 1Y rolling median returns supplemented by the correspoding 1Y tracking risk and information ratio measures. The following figure presents the results for 1Y returns of for the Balanced Options:

Interestingly to notice, the Investment Strategy-2 (Inv-2) displays an increasing tracking risk between mid-2007 and mid-2009 corresponding to the Credit Global Financial Crisis (GFC). This consistent surge in TR is explained by worse than average performance of the individual options, most probably dictated by higher than expected volatility of assets within portfolio. The 1Y returns of Inv-2 were able to beat the benchmark over only 25% of time, i.e. about 2.25 years since Jun/30 2004. We derive that conclusion based on the quantitative inspection of the Information Ratio plot (bottom panel).

In order to understand the overall performance for all eleven investment strategies, we allow ourselves to compute the under-performance and out-performance ratios,
$$\frac { n(P^{-}) } { n(P^{-}) + n(P^{+}) } \ \ \ \mbox{and} \ \ \ \frac { n(P^{+}) } { n(P^{-}) + n(P^{+}) } ,$$
and plot these ratios in red and blue colour in the following figure, respectively:

Only two investment strategies, Inv-7 and Inv-11, managed to keep the out-performance time longer than their under-performance. Inv-8 denoted the highest ratio of under-performance to out-performance but also Inv-2 (consider above)performed equally badly.

The 3rd dimension to this picture delivers the investigation of the IR Relative Strength measure for all investment strategies. We derive and display the results in next figure:

The plot reveals close-to-linear correlation between two considered quantities. Inv-7 and Inv-11 strategies confirm their relative strength of out-performance while Inv-2 and Inv-8 their weakness in delivering strong returns over their short period of out-performance.

It is of great interest to compare Inv-3, Inv-4, and Inv-10 strategies. While relative strength of out-performance of Inv-4 remains lower as contrasted with the latter ones (and can be explained by low volatility within this sort of investment strategy), Inv-10 denotes much firmer gains over benchmark than Inv-3 despite the fact that their out-performance period was nearly the same ($\sim$40%). Only the additional cross-correlation of IR Relative Strength outcomes as a function of the average return year-over-year is able to address the drivers of this observed relation.

Discuss this topic on QaR Forum

You can discuss the current post with other quants within a new Quant At Risk Forum.

## Create a Portfolio of Stocks based on Google Finance Data fed by Quandl

There is an old saying Be careful what you wish for. Many quants and traders wished for a long time for a better quality of publicly available data. It wasn’t an easy task to accomplish but with a minimum of effort, it has become possible to get all what you truly dreamt of. When I started my experience with daily backtesting of my algo strategies I initially relied on Yahoo! data. Yeah, easily accessible but of low quality. Since the inception of Google Finance the same dream had been reborn. More precise data, covering more aspects of trading, platforms, markets, instruments. But how to get them all?

Quandl.com provides us with an enormous access to an enormous database. Just visit the webpage to find out what I mean by barley putting my thoughts all together in one sentence. In fact, the portal’s accessibility via different programming languages (e.g. Matlab, C++, Python, etc.) makes that database more alive than ever. The space of parameters the quants get an access to becomes astronomical in number!

Alleluja! Let’s ride this horse. Enough of watching the online charts changing a direction of our investing moods with a one-mintute time resolution. We were born to be smarter than stochastic processes taking control over our daily strategies. In this short post, making use of the Matlab R2013a platform, I want to show you how quickly you can construct any portfolio of US stocks taken from the Dow Jones Index universe (as an example). We will use the Google Finance stock data fed by Quandl.com.

Google Finance Stock Data for Model Feeding

Similar to Yahoo! Finance’s resources, Quandl.com has an own system of coding the names of all US stocks. Therefore, we need to know it before doing any further step. The .csv file containing all of the codes for Quandl’s stock data can be obtained via API for Stock Data webpage. The file contains the header of the form:

Ticker Stock Name Price Code Ratios Code In Market?

linking Google Finance ticker with Quandl’s code (Price Code). Doing some extra job with data filtering, I created the pre-processed set of all tickers excluding those non-active in the markets and removing all invalid data. You can download it here QuandlStockCodeListUS.xlsx as an Excel workbook.

The second list we need to have is a list of potential stock we want to deal with in our portfolio. As for today, the current list of Dow Jones Index’s components you can download here: dowjones.lst. Let’s start with it:

1 2 3 4 5 6 7 8 9 10 11 12 13 % Create a Portfolio from Dow Jones Stocks based on Google Finance % Data fed by Quandl.com % % (c) 2013 QuantAtRisk.com, by Pawel Lachowicz     close all; clear all; clc;   % Read the list of Dow Jones components fileID = fopen('dowjones.lst'); tmp = textscan(fileID, '%s'); fclose(fileID); dowjc=tmp{1}; % a list as a cell array

The guys from Mathworks suggest to make a new and good practice of using textscan command (lines 10-12) instead of dataread for reading data from the text files. The latter will be removed from the next Matlab release.

Next, we need to import the current list of tickers for US stocks and their corresponding codes in the Quandl’s database. We will make the use of my pre-processed data set of QuandlStockCodeListUS.xlsx as mentioned earlier in the following way:

15 16 17 % Read in the list of tickers and internal code from Quandl.com [ndata, text, alldata] = xlsread('QuandlStockCodeListUS.xlsx'); quandlc=text(:,1); % again, as a list in a cell array

For the simplicity of our example, we will be considering all 30 stocks as a complete sample set of stocks in our portfolio. A portfolio construction is not solely a selection of stocks. It usually links a time period of the past performance of each individual stock. In the following move, we will collect daily stock returns over last 365 calendar year (corresponding to about 252 trading days).

This framework is pretty handful. Why? If you build or run live a model, most probably your wish is to update your database when US markets are closed. You fetch Quandl.com for data. Now, if your model employs risk management or portfolio optimisation as an integrated building-block, it’s very likely, you are interested in risk(-return) estimation, e.g. by calculating Value-at-Risk (VaR) or similar risk measures. Of course, the choice of the time-frame is up to you based on your strategy.

Having that said,

19 20 21 % fetch stock data for last 365 days date2=datestr(today,'yyyy-mm-dd') % from date1=datestr(today-365,'yyyy-mm-dd') % to

what employs an improved Matlab’s command of datestr with formatting as a parameter. It’s really useful when it comes to data fetching from Quandl.com as the format of the date they like to see in your code is yyyy-mm-dd. The above two commands return in Matlab the following values:

date2 = 2013-11-03 date1 = 2012-11-03

Excellent! Let’s catch some fishes in the pond! Before doing it we need to make sure that Quandl’s software is installed and linked to our Matlab console. In order to download Quandl Matlab package just visit this website and follow the installation procedure at the top of the page. Add proper path in Matlab environment too. The best practice is to register yourself on Quandl. That allows you to rise the daily number of data call from 50 to 500. If you think you will need more data requests per day, from the same webpage you can send your request.

You will need to replace ‘yourauthenticationtoken’ in the Matlab code (see line 35) with your token in order to get the data.

Fetching Stock Data

From this point the road stops to wind. For all 30 stocks belonging to the current composition of Dow Jones Index, we get their Open, High, Low, Close prices (as retrieved from Google Finance) in the following way:

23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 % create an empty array for storing stock data stockd={}; % scan the Dow Jones tickers and fetch the data from Quandl.com for i=1:length(dowjc) for j=1:length(quandlc) if(strcmp(dowjc{i},quandlc{j})) fprintf('%4.0f %s\n',i,quandlc{j}); % fetch the data of the stock from Quandl % using recommanded Quandl's command and % saving them directly into Matlab's FTS object (fts) fts=0; [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ... 'authcode','yourauthenticationtoken',... 'start_date',date1,'end_date',date2); stockd{i}=fts; % entire FTS object in an array's cell end end end % use 'save' command to save all actual variables from memory on the disc, % if you need it for multiple code running % save data

A code in line 35 to 37 is an example how you call Quandi for data and fetch them into FINTS (financial time series object from the Financial Toolbox in Matlab). So easy. If you wish to change the time frequency, apply transformation or get the data not in FINTS but .csv format, all is available for you as a parameter at the level of calling Quandl.get() command (explore all possibilities here).

Having that, you can easily plot the Google Finance data, for instance, of IBM stock straight away:

% diplay High-Low Chart for IBM highlow(stockd{11}) title('IBM'); ylabel('Stock Price ($)'); to get: or prepare a return-risk plot for your further portfolio optimisation ret=[]; for i=1:length(stockd) % extract the Close Price tmp=fts2mat(stockd{i}.Close,0); % calculate a vector of daily returns r=tmp(2:end)./tmp(1:end-1)-1; ret=[ret r]; end % calculate the annualized mean returns for 30 stocks mret=mean(ret)*sqrt(250); % and corresponding standard deviations msig=std(ret)*sqrt(250); % plot return-risk diagram p=100*[msig; mret]' plot(p(:,1),p(:,2),'b.'); xlim([0 30]); xlabel('Annualized Risk of Stock (%)') ylabel('Annualized Mean Return (%)'); providing us with Forum: Further Discussions You can discuss the current post with other quants within a new Quant At Risk Forum (Data Handling and Pre-Processing section). ## Anxiety Detection Model for Stock Traders based on Principal Component Analysis Everybody would agree on one thing: the nervousness among traders may lead to massive sells of stocks, the avalanche of prices, and huge losses. We have witnessed this sort of behaviour many times. An anxiety is the feeling of uncertainty, a human inborn instinct that triggers a self-defending mechanism against high risks. The apex of anxiety is fear and panic. When the fear spreads among the markets, all goes south. It is a dream goal for all traders to capture the nervousness and fear just by looking at or studying the market behaviour, trading patterns, ask-bid spread, or flow of orders. The major problem is that we know very well how much a human behaviour is an important factor in the game but it is not observed directly. It has been puzzling me for a few years since I accomplished reading Your Money and Your Brain: How the New Science of Neuroeconomics Can Help Make You Rich by Jason Zweig sitting at the beach of one of the Gili Islands, Indonesia, in December of 2009. A perfect spot far away from the trading charts. So, is there a way to disentangle the emotional part involved in trading from all other factors (e.g. the application of technical analysis, bad news consequences, IPOs, etc.) which are somehow easier to deduce? In this post I will try to make a quantitative attempt towards solving this problem. Although the solution will not have the final and closed form, my goal is to deliver an inspiration for quants and traders interested in the subject by putting a simple idea into practice: the application of Principal Component Analysis. 1. Principal Component Analysis (PCA) Called by many as one of the most valuable results from applied linear algebra, the Principal Component Analysis, delivers a simple, non-parametric method of extracting relevant information from often confusing data sets. The real-world data usually hold some relationships among their variables and, as a good approximation, in the first instance we may suspect them to be of the linear (or close to linear) form. And the linearity is one of stringent but powerful assumptions standing behind PCA. Imagine we observe the daily change of prices of$m$stocks (being a part of your portfolio or a specific market index) over last$n$days. We collect the data in$\boldsymbol{X}$, the matrix$m\times n$. Each of$n$-long vectors lie in an$m$-dimensional vector space spanned by an orthonormal basis, therefore they are a linear combination of this set of unit length basic vectors:$ \boldsymbol{BX} = \boldsymbol{X}$where a basis$\boldsymbol{B}$is the identity matrix$\boldsymbol{I}$. Within PCA approach we ask a simple question: is there another basis which is a linear combination of the original basis that represents our data set? In other words, we look for a transformation matrix$\boldsymbol{P}$acting on$\boldsymbol{X}$in order to deliver its re-representation: $$\boldsymbol{PX} = \boldsymbol{Y} \ .$$ The rows of$\boldsymbol{P}$become a set of new basis vectors for expressing the columns of$\boldsymbol{X}$. This change of basis makes the row vectors of$\boldsymbol{P}$in this transformation the principal components of$\boldsymbol{X}$. But how to find a good$\boldsymbol{P}$? Consider for a moment what we can do with a set of$m$observables spanned over$n$days? It is not a mystery that many stocks over different periods of time co-vary, i.e. their price movements are closely correlated and follow the same direction. The statistical method to measure the mutual relationship among$m$vectors (correlation) is achieved by the calculation of a covariance matrix. For our data set of$\boldsymbol{X}$: $$\boldsymbol{X}_{m\times n} = \left[ \begin{array}{cccc} \boldsymbol{x_1} \\ \boldsymbol{x_2} \\ … \\ \boldsymbol{x_m} \end{array} \right] = \left[ \begin{array}{cccc} x_{1,1} & x_{1,2} & … & x_{1,n} \\ x_{2,1} & x_{2,2} & … & x_{2,n} \\ … & … & … & … \\ x_{m,1} & x_{m,2} & … & x_{m,n} \end{array} \right]$$ the covariance matrix takes the following form: $$cov(\boldsymbol{X}) \equiv \frac{1}{n-1} \boldsymbol{X}\boldsymbol{X}^{T}$$ where we multiply$\boldsymbol{X}$by its transposed version and$(n-1)^{-1}$helps to secure the variance to be unbiased. The diagonal elements of$cov(\boldsymbol{X})$are the variances corresponding to each row of$\boldsymbol{X}$whereas the off-diagonal terms of$cov(\boldsymbol{X})$represent the covariances between different rows (prices of the stocks). Please note that above multiplication assures us that$cov(\boldsymbol{X})$is a square symmetric matrix$m\times m$. All right, but what does it have in common with our PCA method? PCA looks for a way to optimise the matrix of$cov(\boldsymbol{X})$by a reduction of redundancy. Sounds a bit enigmatic? I bet! Well, all we need to understand is that PCA wants to ‘force’ all off-diagonal elements of the covariance matrix to be zero (in the best possible way). The guys in the Department of Statistics will tell you the same as: removing redundancy diagonalises$cov(\boldsymbol{X})$. But how, how?! Let’s come back to our previous notation of$\boldsymbol{PX}=\boldsymbol{Y}$.$\boldsymbol{P}$transforms$\boldsymbol{X}$into$\boldsymbol{Y}$. We also marked that: $$\boldsymbol{P} = [\boldsymbol{p_1},\boldsymbol{p_2},…,\boldsymbol{p_m}]$$ was a new basis we were looking for. PCA assumes that all basis vectors$\boldsymbol{p_k}$are orthonormal, i.e.$\boldsymbol{p_i}\boldsymbol{p_j}=\delta_{ij}$, and that the directions with the largest variances are the most principal. So, PCA first selects a normalised direction in$m$-dimensional space along which the variance in$\boldsymbol{X}$is maximised. That is first principal component$\boldsymbol{p_1}$. In the next step, PCA looks for another direction along which the variance is maximised. However, because of orthonormality condition, it looks only in all directions perpendicular to all previously found directions. In consequence, we obtain an orthonormal matrix of$\boldsymbol{P}$. Good stuff, but still sounds complicated? The goal of PCA is to find such$\boldsymbol{P}$where$\boldsymbol{Y}=\boldsymbol{PX}$such that$cov(\boldsymbol{Y})=(n-1)^{-1}\boldsymbol{XX}^T$is diagonalised. We can evolve the notation of the covariance matrix as follows: $$(n-1)cov(\boldsymbol{Y}) = \boldsymbol{YY}^T = \boldsymbol{(PX)(PX)}^T = \boldsymbol{PXX}^T\boldsymbol{P}^T = \boldsymbol{P}(\boldsymbol{XX}^T)\boldsymbol{P}^T = \boldsymbol{PAP}^T$$ where we made a quick substitution of$\boldsymbol{A}=\boldsymbol{XX}^T$. It is easy to prove that$\boldsymbol{A}$is symmetric. It takes a longer while to find a proof for the following two theorems: (1) a matrix is symmetric if and only if it is orthogonally diagonalisable; (2) a symmetric matrix is diagonalised by a matrix of its orthonormal eigenvectors. Just check your favourite algebra textbook. The second theorem provides us with a right to denote: $$\boldsymbol{A} = \boldsymbol{EDE}^T$$ where$\boldsymbol{D}$us a diagonal matrix and$\boldsymbol{E}$is a matrix of eigenvectors of$\boldsymbol{A}$. That brings us at the end of the rainbow. We select matrix$\boldsymbol{P}$to be a such where each row$\boldsymbol{p_1}$is an eigenvector of$\boldsymbol{XX}^T$, therefore $$\boldsymbol{P} = \boldsymbol{E}^T .$$ Given that, we see that$\boldsymbol{E}=\boldsymbol{P}^T$, thus we find$\boldsymbol{A}=\boldsymbol{EDE}^T = \boldsymbol{P}^T\boldsymbol{DP}$what leads us to a magnificent relationship between$\boldsymbol{P}$and the covariance matrix: $$(n-1)cov(\boldsymbol{Y}) = \boldsymbol{PAP}^T = \boldsymbol{P}(\boldsymbol{P}^T\boldsymbol{DP})\boldsymbol{P}^T = (\boldsymbol{PP}^T)\boldsymbol{D}(\boldsymbol{PP}^T) = (\boldsymbol{PP}^{-1})\boldsymbol{D}(\boldsymbol{PP}^{-1})$$ or $$cov(\boldsymbol{Y}) = \frac{1}{n-1}\boldsymbol{D},$$ i.e. the choice of$\boldsymbol{P}$diagonalises$cov(\boldsymbol{Y})$where silently we also used the matrix algebra theorem saying that the inverse of an orthogonal matrix is its transpose ($\boldsymbol{P^{-1}}=\boldsymbol{P}^T$). Fascinating, right?! Let’s see now how one can use all that complicated machinery in the quest of looking for human emotions among the endless rivers of market numbers bombarding our sensors every day. 2. Covariances of NASDAQ, Eigenvalues of Anxiety We will try to build a simple quantitative model for detection of the nervousness in the trading markets using PCA. By its simplicity I will understand the following model assumption: no matter what the data conceal, the 1st Principal Component (1-PC) of PCA solution links the complicated relationships among a subset of stocks triggered by a latent factor attributed by us to a common behaviour of traders (human and pre-programmed algos). It is a pretty reasonable assumption, much stronger than, for instance, the influence of Saturn’s gravity on the annual silver price fluctuations. Since PCA does not tell us what its 1-PC means in reality, this is our job to seek for meaningful explanations. Therefore, a human factor fits the frame as a trial value very well. Let’s consider the NASDAQ-100 index. It is composed of 100 technology stocks. The most current list you can find here: nasdaq100.lst downloadable as a text file. As usual, we will perform all calculations using Matlab environment. Let’s start with data collection and pre-processing: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 % Anxiety Detection Model for Stock Traders % making use of the Principal Component Analsis (PCA) % and utilising publicly available Yahoo! stock data % % (c) 2013 QuantAtRisk.com, by Pawel Lachowicz clear all; close all; clc; % Reading a list of NASDAQ-100 components nasdaq100=(dataread('file',['nasdaq100.lst'], '%s', 'delimiter', '\n'))'; % Time period we are interested in d1=datenum('Jan 2 1998'); d2=datenum('Oct 11 2013'); % Check and download the stock data for a requested time period stocks={}; for i=1:length(nasdaq100) try % Fetch the Yahoo! adjusted daily close prices between selected % days [d1;d2] tmp = fetch(yahoo,nasdaq100{i},'Adj Close',d1,d2,'d'); stocks{i}=tmp; disp(i); catch err % no full history available for requested time period end end where, first, we try to check whether for a given list of NASDAQ-100’s components the full data history (adjusted close prices) are available via Yahoo! server (please refer to my previous post of Yahoo! Stock Data in Matlab and a Model for Dividend Backtesting for more information on the connectivity). The cell array stocks becomes populated with two-dimensional matrixes: the time-series corresponding to stock prices (time,price). Since the Yahoo! database does not contain a full history for all stocks of our interest, we may expect their different time spans. For the purpose of demonstration of the PCA method, we apply additional screening of downloaded data, i.e. we require the data to be spanned between as defined by$d1$and$d2$variables and, additionally, having the same (maximal available) number of data points (observations, trials). We achieve that by: 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 % Additional screening d=[]; j=1; data={}; for i=1:length(nasdaq100) d=[d; i min(stocks{i}(:,1)) max(stocks{i}(:,1)) size(stocks{i},1)]; end for i=1:length(nasdaq100) if(d(i,2)==d1) && (d(i,3)==d2) && (d(i,4)==max(d(:,4))) data{j}=sortrows(stocks{i},1); fprintf('%3i %1s\n',i,nasdaq100{i}) j=j+1; end end m=length(data); The temporary matrix of$d$holds the index of stock as read in from nasdaq100.lst file, first and last day number of data available, and total number of data points in the time-series, respectively: >> d d = 1 729757 735518 3970 2 729757 735518 3964 3 729757 735518 3964 4 729757 735518 3969 .. .. .. .. 99 729757 735518 3970 100 729757 735518 3970 Our screening method saves$m=21$selected stock data into data cell array corresponding to the following companies from our list:  1 AAPL 7 ALTR 9 AMAT 10 AMGN 20 CERN 21 CHKP 25 COST 26 CSCO 30 DELL 39 FAST 51 INTC 64 MSFT 65 MU 67 MYL 74 PCAR 82 SIAL 84 SNDK 88 SYMC 96 WFM 99 XRAY 100 YHOO Okay, some people say that seeing is believing. All right. Let’s see how it works. Recall the fact that we demanded our stock data to be spanned between ‘Jan 2 1998′ and ‘Oct 11 2013′. We found 21 stocks meeting those criteria. Now, let’s assume we pick up a random date, say, Jul 2 2007 and we extract for all 21 stocks their price history over last 90 calendar days. We save their prices (skipping the time columns) into$Z$matrix as follows: t=datenum('Jul 2 2007'); Z=[]; for i=1:m [r,c,v]=find((data{i}(:,1)<=t) & (data{i}(:,1)>t-90)); Z=[Z data{i}(r,2)] end and we plot them all together: plot(Z) xlim([1 length(Z)]); ylabel('Stock price (US$)'); xlabel('T-90d');

It’s easy to deduct that the top one line corresponds to Apple, Inc. (AAPL) adjusted close prices.

The unspoken earlier data processing methodology is that we need to transform our time-series into the comparable form. We can do it by subtracting the average value and dividing each of them by their standard deviations. Why? For a simple reason of an equivalent way of their mutual comparison. We call that step a normalisation or standardisation of the time-series under investigation:

[N,M]=size(Z); X=(Z-repmat(mean(Z),[N 1]))./repmat(std(Z),[N 1]);

This represents the matrix $\boldsymbol{X}$ that I discussed in a theoretical part of this post. Note, that the dimensions are reversed in Matlab. Therefore, the normalised time-series,

% Display normalized stock prices plot(X) xlim([1 length(Z)]); ylabel('(Stock price-Mean)/StdDev'); xlabel('T-90d');

look like:

For a given matrix of $\boldsymbol{X}$, its covariance matrix,

% Calculate the covariance matrix, cov(X) CovX=cov(X); imagesc(CovX);

as for data spanned 90 calendar day back from Jul 2 2007, looks like:

where the colour coding goes from the maximal values (most reddish) down to the minimal values (most blueish). The diagonal of the covariance matrix simply tells us that for normalised time-series, their covariances are equal to the standard deviations (variances) of 1 as expected.

Going one step forward, based on the given covariance matrix, we look for the matrix of $\boldsymbol{P}$ whose columns are the corresponding eigenvectors:

% Find P [P,~]=eigs(CovX,5); imagesc(P); set(gca,'xticklabel',{1,2,3,4,5},'xtick',[1 2 3 4 5]); xlabel('Principal Component') ylabel('Stock'); set(gca,'yticklabel',{'AAPL', 'ALTR', 'AMAT', 'AMGN', 'CERN', ... 'CHKP', 'COST', 'CSCO', 'DELL', 'FAST', 'INTC', 'MSFT', 'MU', ... 'MYL', 'PCAR', 'SIAL', 'SNDK', 'SYMC', 'WFM', 'XRAY', 'YHOO'}, ... 'ytick',[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21]);

which results in $\boldsymbol{P}$ displayed as:

where we computed PCA for five principal components in order to illustrate the process. Since the colour coding is the same as in the previous figure, a visual inspection of of the 1-PC indicates on negative numbers for at least 16 out of 21 eigenvalues. That simply means that over last 90 days the global dynamics for those stocks were directed south, in favour of traders holding short-position in those stocks.

It is important to note in this very moment that 1-PC does not represent the ‘price momentum’ itself. It would be too easy. It represents the latent variable responsible for a common behaviour in the stock dynamics whatever it is. Based on our model assumption (see above) we suspect it may indicate a human factor latent in the trading.

3. Game of Nerves

The last figure communicates an additional message. There is a remarkable coherence of eigenvalues for 1-PC and pretty random patterns for the remaining four principal components. One may check that in the case of our data sample, this feature is maintained over many years. That allows us to limit our interest to 1-PC only.

It’s getting exciting, isn’t it? Let’s come back to our main code. Having now a pretty good grasp of the algebra of PCA at work, we may limit our investigation of 1-PC to any time period of our interest, below spanned between as defined by $t1$ and $t2$ variables:

47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 % Select time period of your interest t1=datenum('July 1 2006'); t2=datenum('July 1 2010');   results=[]; for t=t1:t2 tmp=[]; A=[]; V=[]; for i=1:m [r,c,v]=find((data{i}(:,1)<=t) & (data{i}(:,1)>t-60)); A=[A data{i}(r,2)]; end [N,M]=size(A); X=(A-repmat(mean(A),[N 1]))./repmat(std(A),[N 1]); CovX=cov(X); [V,D]=eigs(CovX,1); % Find all negative eigenvalues of the 1st Principal Component [r,c,v]=find(V(:,1)<0); % Extract them into a new vector neg1PC=V(r,1); % Calculate a percentage of negative eigenvalues relative % to all values available ratio=length(neg1PC)/m; % Build a new time-series of 'ratio' change over required % time period (spanned between t1 and t2) results=[results; t ratio]; end

We build our anxiety detection model based on the change of number of eigenvalues of the 1st Principal Component (relative to the total their numbers; here equal 21). As a result, we generate a new time-series tracing over $[t1;t2]$ time period this variable. We plot the results all in one plot contrasted with the NASDAQ-100 Index in the following way:

75 76 77 78 79 80 81 82 83 84 85 % Fetch NASDAQ-100 Index from Yahoo! data-server nasdaq = fetch(yahoo,'^ndx','Adj Close',t1,t2,'d'); % Plot it subplot(2,1,1) plot(nasdaq(:,1),nasdaq(:,2),'color',[0.6 0.6 0.6]); ylabel('NASDAQ-100 Index'); % Add a plot corresponding to a new time-series we've generated subplot(2,1,2) plot(results(:,1),results(:,2),'color',[0.6 0.6 0.6]) % add overplot 30d moving average based on the same data hold on; plot(results(:,1),moving(results(:,2),30),'b')

I use 30-day moving average (a solid blue line) in order to smooth the results (moving.m). Please note, that line in #56 I also replaced the earlier value of 90 days with 60 days. Somehow, it is more reasonable to examine with the PCA the market dynamics over past two months than for longer periods (but it’s a matter of taste and needs).

Eventually, we construct the core model’s element, namely, we detect nervousness among traders when the percentage of negative eigenvalues of the 1st Principal Component increases over (at least) five consecutive days:

87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 % Model Core x1=results(:,1); y1=moving(results(:,2),30); tmp=[]; % Find moments of time where the percetage of negative 1-PC % eigenvalues increases over time (minimal requirement of % five consecutive days for i=5:length(x1) if(y1(i)>y1(i-1))&&(y1(i-1)>y1(i-2))&&(y1(i-2)>y1(i-3))&& ... (y1(i-3)>y1(i-4))&&(y1(i-4)>y1(i-5)) tmp=[tmp; x1(i)]; end end % When found z=[]; for i=1:length(tmp) for j=1:length(nasdaq) if(tmp(i)==nasdaq(j,1)) z=[z; nasdaq(j,1) nasdaq(j,2)]; end end end subplot(2,1,1); hold on; plot(z(:,1),z(:,2),'r.','markersize',7);

The results of the model we over-plot with red markers on top of the NASDAQ-100 Index:

Our simple model takes us into a completely new territory of unexplored space of latent variables. Firstly, it does not predict the future. It still (unfortunately) remains unknown. However, what it delivers is a fresh look at the past dynamics in the market. Secondly, it is easily to read out from the plot that results cluster into three subgroups.

The first subgroup corresponds to actions in the stock trading having further negative consequences (see the events of 2007-2009 and the avalanche of prices). Here the dynamics over any 60 calendar days had been continued. The second subgroup are those periods of time when anxiety led to negative dynamics among stock traders but due to other factors (e.g. financial, global, political, etc.) the stocks surged dragging the Index up. The third subgroup (less frequent) corresponds to instances of relative flat changes of Index revealing a typical pattern of psychological hesitation about the trading direction.

No matter how we might interpret the results, the human factor in trading is evident. Hopefully, the PCA approach captures it. If not, all we are left with is our best friend: a trader’s intuition.

Acknowledgments

An article dedicated to Dr. Dariusz Grech of Physics and Astronomy Department of University of Wroclaw, Poland, for his superbly important! and mind-blowing lectures on linear algebra in the 1998/99 academic year.