Quantitative Analysis, Risk Management, Modelling, Algo Trading, and Big Data Analysis

Non-Linear Cross-Bicorrelations between the Oil Prices and Stock Fundamentals

When we talk about correlations in finance, by default, we assume linear relationships between two time-series “co-moving”. In other words, if one time-series changes its values over a give time period, we seek for a tight correlation reflected within the other time-series. If found, we say they are correlated. But wait a minute! Is the game always about looking for an immediate feedback between two time-series? Is “correlated” always a synonymous of a direct, i.e. a linear response?

Think for a second about the oil prices. If they rise within a quarter, the effect on some companies is not the same. Certain firms can be tightly linked to oil/petroleum prices that affect their product value, distribution, sales, etc. while for other companies the change of crude oil price has a negligible effect, e.g. the online services. Is it so?

As a data scientist, quantitative (business) analysts, or investment researcher you need to know that mathematics and statistics deliver ample of tools you may use in order to derive the best possibilities for two (or more) financial assets that can be somehow related, correlated, or connected. The wording does not need to be precise to reflect the goal of an investigation: correlation, in general.

Correlation can be linear and non-linear. Non-linearity of correlation is somehow counterintuitive. We used to think in a linear way that is why it is so hard to readjust our thinking in a non-linear domain. The changes of the oil prices might have a non-negligble effect on the airlines, causing the air-ticket prices to rise or fall due to recalculated oil/petroleum surcharge. The effect can be not immediate. A rise of the oil prices today can be “detected” with a time delay of days, weeks, or even months, i.e. non-linearly correlated.

In this post analyse both linear and non-linear correlation methods that have been so far applied (and not) in the research over oil price tectonics across the markets. In particular, first we discuss one-factor and multiple linear regression models to turn, in consequence, toward non-linearity possible to be captured by cross-correlation and cross-bicorrelation methods. By developing an extensive but easy-to-follow Python code, we analyse a database of Quandl.com storing over 620 stocks’ financials factors (stock fundamentals). Finally, we show how both non-linear mathematical tools can be applied in order to detect non-linear correlations for different financial assets.

1. Linear Correlations

1.1. Oil Prices versus Stock Markets

Correlations between increasing/decreasing oil prices and stock markets have been a subject of investigation for a number of years. The main interest was initially in search for linear correlations between raw price/return time-series among the stocks and oil benchmarks (e.g. WPI Crude Oil). A justification standing behind such approach to data analysis seemed to be driven by a common sense: an impact of rising oil prices should affect companies dependent on petroleum-based products or services.

The usual presumption states that a decline in oil price is a good news for the economy, especially for net oil importers, e.g. USA or China. An increase in oil prices usually causes the raise of the input costs for most businesses and force consumers to spend more money on gasoline, thereby reducing the corporate earnings of other businesses. The opposite should be true too when the oil prices fall.

The oil prices are determined by the supply-and-damand for petroleum-based products. During economic expansion, prices might rise as a result of increased consumption. They might fall as a result of increased production. That pattern has been observed on many occasions so far.

A quantitative approach to measuring the strength of linear response of the stock market to oil price can be demonstrated using a plain linear regression modeling. If we agree that S&P 500 Index is a proxy for the US stock market, we should be able to capture the relationship. In the following example, we compare WPI Crude Oil prices with S&P 500 Index for 2-year period between Nov 2014 and Nov 2016. First we plot both time-series, and next we apply one-factor linear regression model,
$$
y(t) = \beta x(t) + \epsilon(t) \ \,
$$ in order to fit the data best:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
import numpy as np
import pandas as pd
from scipy import stats
from matplotlib import pyplot as plt
from pandas_datareader import data, wb
 
%matplotlib inline
 
grey = 0.7, 0.7, 0.7
 
import warnings
warnings.filterwarnings('ignore')
 
# downloading S&P 500 Index from Yahoo! Finance
sp500 = web.DataReader("^GSPC", data_source='yahoo',
                  start='2014-11-01', end='2016-11-01')['Adj Close']
 
# WPI Crude Oil price-series
# src: https://fred.stlouisfed.org/series/DCOILWTICO/downloaddata
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d')
wpi = pd.read_csv("DCOILWTICO.csv", parse_dates=['DATE'], date_parser=dateparse)
wpi = wpi[wpi.VALUE != "."]
wpi.VALUE = wpi.VALUE.astype(float)
wpi.index = wpi.DATE
wpi.drop('DATE', axis=1, inplace=True) 
 
# combine both time-series
data = pd.concat([wpi, sp500], axis=1).dropna()  # and remove NaNs, if any
data.columns = ["WPI", "S&P500"]
print(data.head())
              WPI       S&P500
2014-11-03  78.77  2017.810059
2014-11-04  77.15  2012.099976
2014-11-05  78.71  2023.569946
2014-11-06  77.87  2031.209961
2014-11-07  78.71  2031.920044
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
plt.figure(figsize=(8,8))
# plot time-series
ax1 = plt.subplot(2,1,1)
plt.plot(data.WPI, 'k', label="WPI Crude Oil")
plt.legend(loc=3)
plt.ylabel("WPI Crude Oil [U$/barrel]")
 
# fit one-factor linear model
slope, intercept, r2_value, p_value, std_err = 
                  stats.linregress(data.WPI, data["S&P500"])
 
# best fit: model
xline = np.linspace(np.min(data.WPI), np.max(data.WPI), 100)
line = slope*xline + intercept
 
# plot scatter plot and linear model
ax2 = ax1.twinx()
plt.plot(data["S&P500"], 'm', label="S&P500")
plt.ylabel("S&P 500 Index [U$]")
plt.legend(loc="best")
plt.subplot(2,1,2)
plt.plot(data.WPI, data["S&P500"],'.', color=grey)
plt.plot(xline, line,'--')
plt.xlabel("WPI Crude Oil [U$/barrel]")
plt.ylabel("S&P 500 Index [U$]")
plt.title("R$^2$ = %.2f" % r2_value, fontsize=11)

Our Python code generates:
unknown
where we can see that WPI Crude Oil price vs. S&P 500 linear correlation is weak, with $R^2 = 0.36$ only. This is a static picture.

As one might suspect, correlation is time-dependent. Therefore, it is wiser to check a rolling linear correlation for a given data window size. Below, we assume its length to be 50 days:

59
60
61
62
63
64
rollcorr = pd.rolling_corr(wpi, sp500, 50).dropna()
 
plt.figure(figsize=(10,3))
plt.plot(rollcorr)
plt.grid()
plt.ylabel("Linear Correlation (50 day rolling window)", fontsize=8)

unknown-1
In 2008 Andrea Pescatori measured changes in the S&P 500 and oil prices in the same way as demonstrated above. He noted that variables occasionally moved in the same direction at the same time, but even then, the relationship remained weak. He concluded on no correlation at the 95% confidence level.

The lack of correlation might be attributed to factors like: wages, interest rates, industrial metal, plastic, computer technology that can offset changes in energy costs. On the other side, corporations might have become increasingly sophisticated at reading the futures markets and are able, much better, to anticipate the shift in factor prices (e.g. a company should be able to switch production processes to compensate for added fuel costs).

No matter what we analyse, either oil prices vs. market indexes or vs. individual stock trading records, linear correlation method is solely able to measure 1-to-1 response of $y$ to $x$. The lower coefficient of correlation the less valid linear model as a descriptor of true events and mutual relationships under study.

1.2. Multiple Linear Regression Model for Oil Price Changes

In February 2016 Ben S. Bernanke found that a positive correlation of stocks and oil might arise because both are responding to the underlying shift in global demand. Using a simple multiple linear regression model he took and attempt in explaining the daily changes of the oil prices and concluded that in over 90% they could be driven by changes of commodities prices (copper), US dollar (spot price), 10-yr treasury interest rate, and in over 95% by extending the model with an inclusion of daily changes in VIX:
$$
\Delta p_{oil, t} = \beta_1 \Delta p_{copp, t} + \beta_2 \Delta p_{10yr, t} + \beta_3 \Delta p_{USD, t} + \beta_4 \Delta p_{VIX, t} + \beta_0
$$ The premise is that commodity prices, long-term interest rate, and USD are likely to respond to investors’ perceptions of global and US demand for oil. Additionally, the presence of VIX is in line with the following idea: if investors retreat from commodities and/or stocks during periods of high uncertainty/risk aversion, then shocks to volatility may be another reason for the observed tendency of stocks and oil prices moving together.

The model, inspired by James Hamilton (2014) work, aims at measuring the effect of demand shifts on the oil market. For the sake of elegance, we can code it in Python as follows. First, we download to the files all necessary price-series and process them within pandas Dataframes (more on Quantitative Aspects of pandas for Quants you will find in my upcoming book Python for Quants. Volume II.).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
import numpy as np
import pandas as pd
from scipy import stats
import datetime
from datetime import datetime
from matplotlib import pyplot as plt
import statsmodels.api as sm
 
%matplotlib inline
 
grey = 0.7, 0.7, 0.7
 
import warnings
warnings.filterwarnings('ignore')
 
# Download WPI Crude Oil prices
# src: https://fred.stlouisfed.org/series/DCOILWTICO/downloaddata
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d')
wpi = pd.read_csv("DCOILWTICO.csv", parse_dates=['DATE'], date_parser=dateparse)
wpi = wpi[wpi.VALUE != "."]
wpi.VALUE = wpi.VALUE.astype(float)
wpi.index = wpi.DATE
wpi.drop('DATE', axis=1, inplace=True)
 
# Copper Futures 
# src: http://www.investing.com/commodities/copper-historical-data
cop = pd.read_csv("COPPER.csv", index_col=["Date"])
cop.index = pd.to_datetime(pd.Series(cop.index), format='%b %d, %Y')
 
# Nominal interest rate on 10-year Treasury bonds
# src: https://fred.stlouisfed.org/series/DGS10/
tb = pd.read_csv('DGS10.csv')
tb = tb[tb.DGS10 != "."]
tb.DGS10 = tb.DGS10.astype(float)
tb['DATE'] =  pd.to_datetime(tb['DATE'], format='%Y-%m-%d')
tb.index = tb.DATE
tb.drop('DATE', axis=1, inplace=True)
 
# Trade Weighted U.S. Dollar Index
# src: https://fred.stlouisfed.org/series/DTWEXM
usd = pd.read_csv('DTWEXM.csv')
usd = usd[usd.DTWEXM != "."]
usd.DTWEXM = usd.DTWEXM.astype(float)
usd['DATE'] =  pd.to_datetime(usd['DATE'], format='%Y-%m-%d')
usd.index = usd.DATE
usd.drop('DATE', axis=1, inplace=True)
 
# CBOE Volatility Index: VIX© (VIXCLS)
# src: https://fred.stlouisfed.org/series/VIXCLS/downloaddata
vix = pd.read_csv('VIXCLS.csv')
vix = vix[vix.VALUE != "."]
vix.VALUE = vix.VALUE.astype(float)
vix['DATE'] =  pd.to_datetime(vix['DATE'], format='%Y-%m-%d')
vix.index = vix.DATE
vix.drop('DATE', axis=1, inplace=True)

Not every time a plain use of pd.read_csv function works in the way we expect from it, therefore some extra steps are needed to be implemented in order to convert .csv files content to Dataframe, e.g. a correct formatting of date (line #18-19 or #35, 44, 53), price conversion from string to float (lines #21, 34, 43, 52), replacement of index with data included in ‘Date’ column followed by its removal (e.g. lines #22-23).

As a side note, in order to download the copper data, at the website we need manually mark a whole table, paste it into Excel or Mac’s Numbers, trim if required, and save down to .csv file. The problem might appear as dates are given in “Nov 26, 2016″ format. For this case, the following trick in Python saves a lot of time:

from datetime import datetime
expr = 'Nov 26, 2016'
datetime.strptime(expr, '%b %d, %Y')
datetime.datetime(2016, 11, 26, 0, 0)

what justifies the use of %b descriptor (see more on Python’s datetime date formats and parsing here and here).

pandas’ Dataframes are fast and convenient way to concatenate multiple time-series with indexes set to calendar dates. The following step ensures all five price-series to have data points on the same dates (if given), and if any data point (for any time-series) is missing (NaN), the whole row is removed from the end Dataframe of ‘df’ thanks to action of .dropna() function:

57
58
59
60
df = pd.concat([wpi, cop, tb, usd, vix], join='outer', axis=1).dropna()
df.columns = ["WPI", "Copp", "TrB", "USD", "VIX"]
 
print(df.head(10))
              WPI   Copp   TrB      USD    VIX
2007-11-15  93.37  3.082  4.17  72.7522  28.06
2007-11-16  94.81  3.156  4.15  72.5811  25.49
2007-11-19  95.75  2.981  4.07  72.7241  26.01
2007-11-20  99.16  3.026  4.06  72.4344  24.88
2007-11-21  98.57  2.887  4.00  72.3345  26.84
2007-11-23  98.24  2.985  4.01  72.2719  25.61
2007-11-26  97.66  3.019  3.83  72.1312  28.91
2007-11-27  94.39  2.958  3.95  72.5061  26.28
2007-11-28  90.71  3.001  4.03  72.7000  24.11
2007-11-29  90.98  3.063  3.94  72.6812  23.97

Given all price-series and our model in mind, we are more interested in running the regression using daily change in the natural logarithm of crude oil, copper price, 10-yr interest rate, USD spot price, and VIX. The use of log is a common practice in finance and economics (see here why). Contrary to the daily percent change, we derive the log returns in Python in the following way:

np.random.seed(7)
df = pd.DataFrame(100 + np.random.randn(100).cumsum(), columns=['price'])
df['pct_change'] = df.price.pct_change()
df['log_ret'] = np.log(df.price) - np.log(df.price.shift(1))
 
print(df.head())
        price  pct_change   log_ret
0  101.690526         NaN       NaN
1  101.224588   -0.004582 -0.004592
2  101.257408    0.000324  0.000324
3  101.664925    0.004025  0.004016
4  100.876002   -0.007760 -0.007790

Having that, we continue coding our main model:

62
63
64
65
66
67
68
69
70
71
72
73
headers = df.columns
 
dlog = pd.DataFrame()  # log returns Dateframe
 
for j in range(df.shape[1]):
    price = df.ix[:,j]
    dl = np.log(price) - np.log(price.shift(1))
    dlog.loc[:,j] = dl
 
dlog.dropna(inplace=True)
dlog.columns = headers
print(dlog.head())
                 WPI      Copp       TrB       USD       VIX
2007-11-16  0.015305  0.023727 -0.004808 -0.002355 -0.096059
2007-11-19  0.009866 -0.057047 -0.019465  0.001968  0.020195
2007-11-20  0.034994  0.014983 -0.002460 -0.003992 -0.044417
2007-11-21 -0.005968 -0.047024 -0.014889 -0.001380  0.075829
2007-11-23 -0.003353  0.033382  0.002497 -0.000866 -0.046910

and run multiple linear regression model for in-sample data, here, selected for 2 year in-sample period between Nov 2012 and Nov 2014:

75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
dlog0 = dlog.copy()
 
# in-sample data selection
dlog = dlog[(dlog.index > "2012-11-01") & (dlog.index < "2014-11-01")]
 
# define input data
y = dlog.WPI.values
x = [dlog.Copp.values, dlog.TrB.values, dlog.USD.values, dlog.VIX.values ]
 
# Multiple Linear Regression using 'statsmodels' library
def reg_m(y, x):
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results
 
# display a summary of the regression
print(reg_m(y, x).summary())
                           OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.105
Model:                            OLS   Adj. R-squared:                  0.098
Method:                 Least Squares   F-statistic:                     14.49
Date:                Tue, 29 Nov 2016   Prob (F-statistic):           3.37e-11
Time:                        11:38:44   Log-Likelihood:                 1507.8
No. Observations:                 499   AIC:                            -3006.
Df Residuals:                     494   BIC:                            -2985.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            -0.0303      0.008     -3.800      0.000        -0.046    -0.015
x2            -0.3888      0.178     -2.180      0.030        -0.739    -0.038
x3             0.0365      0.032      1.146      0.252        -0.026     0.099
x4             0.2261      0.052      4.357      0.000         0.124     0.328
const      -3.394e-05      0.001     -0.064      0.949        -0.001     0.001
==============================================================================
Omnibus:                       35.051   Durbin-Watson:                   2.153
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               75.984
Skew:                          -0.392   Prob(JB):                     3.16e-17
Kurtosis:                       4.743   Cond. No.                         338.
==============================================================================
 
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is 
    correctly specified.

A complete model is given by the following parameters:

96
97
par = reg_m(y, x).params
print(par)
[ -3.02699345e-02  -3.88784262e-01   3.64964420e-02   2.26134337e-01
  -3.39418282e-05]

or more formally as:
$$
\Delta p_{oil, t} = 0.2261 \Delta p_{copp, t} + 0.0365 \Delta p_{10yr, t} -0.3888 \Delta p_{USD, t} -0.0303 \Delta p_{VIX, t}
$$ The adjusted $R^2$ is near zero and statistical significance of $\beta_2$ is more than 5%.

In order to use our model for “future” oil price prediction based on price movements in commodity (copper), long-term interest rate, USD spot price and VIX index, we generate out-of-sample data set and construct the estimation in the following way:

99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
# out-of-sample log return Dataframe
dlog_oos = dlog0[dlog0.index >= '2014-11-01']
 
# what is the last WPI Crude Oil price in in-sample
last = df[df.index == "2014-10-31"].values[0][0]  # 80.53 USD/barrel
 
oilreg = pd.DataFrame(columns=["Date", "OilPrice"])  # regressed oil prices
par = reg_m(y, x).params  # parameters from regression
 
for i in range(dlog_oos.shape[0]):
    x1 = dlog_oos.iloc[i]["VIX"]
    x2 = dlog_oos.iloc[i]["USD"]
    x3 = dlog_oos.iloc[i]["TrB"]
    x4 = dlog_oos.iloc[i]["Copp"]
    w = par[0]*x1 + par[1]*x2 + par[2]*x3 + par[3]*x4 + par[4]
    if(i==0):
        oilprice = np.exp(w) * last
    else:
        oilprice = np.exp(w) * oilprice
    oilreg.loc[len(oilreg)] = [dlog_oos.index[i], oilprice]
 
oilreg.index = oilreg.Date
oilreg.drop('Date', axis=1, inplace=True)

where oilreg Dataframe stores time and regressed oil prices. Visualisation of the results

123
124
125
126
127
128
129
130
plt.figure(figsize=(10,5))
plt.plot(df.WPI, color=grey, label="WPI Crude Oil")
plt.plot(oilreg, "r", label="Predicted Oil Price")
plt.xlim(['2012-01-01', '2016-11-01'])  # in-sample & out-of-sample
plt.ylim([20, 120])
plt.ylabel("USD/barrel")
plt.legend(loc=3)
plt.grid()

delivers
unknown
Bernanke’s multiple regression model points at an idea saying: when stock traders respond to a change in oil prices, they do so not necessarily because the oil movement in consequential in itself. Our finding (for different data sets than Bernanke’s) points at lack of strong dependancy on VIX thus on traders activity. Again, as in case of one-factor regression model discussed in Section 1.1, the stability of underlying correlations with all four considered factors remains statistically variable in time.

2. Non-Linear Correlations

So far we have understood how fragile the quantification and interpretation of the linear correlation between two time-series can be. We agreed on its time-dependant character and how data sample size may influence derived results. Linear correlation is a tool that captures 1-to-1 feedback between two data sets. What if a response is lagged? We need to look for some other useful tools good enough for this challenge. In this Section we will analyse two of them.

2.1. Cross-Correlation and Cross-Bicorrelation

If a change of one quantity has a noticeable effect on another one, this can be observed with or without time delay. In the time-series analysis we talk about one time-series being lagged (or led) behind (before) the other one. Since an exisiting lag (lead) is not observed directly (a linear feedback) it introduces a non-linear relationship between two time-series.

A degree of non-linear correlation can be measured by the application of two independent methods, namely, cross-correlations and cross-bicorrelations designed to examine the data dependence between any two (three) values led/lagged by $r$ ($s$) periods. By period one should understand the sampling frequency of the underlying time-series, for example a quarter, i.e. 3 month period.

Let $x(t) \equiv \{ x_t \} = \{ x_1, x_2, …, x_N \}$ for $t=1,…,N$ represents, e.g. the WPI Crude Oil price-series, and $\{ y_t \}$ the time-series of a selected quantity. For two time-series, $x(t)$ and $y(t)$, the cross-correlation is defined as:
$$
C_{xy}(r) = (N-r)^{-1} \sum_{i=1}^{N-r} x(t_i) y(t_i + r)
$$ whereas the cross-bicorrelation is usually assumed as:
$$
C_{xyy}(r,s) = (N-m)^{-1} \sum_{i=1}^{N-m} x(t_i) y(t_i + r) y(t_i + s)
$$ where $m=\max(r,s)$ (Brooks & Hinich 1999). Both formulae measure the inner product between the individual values of two time-series. In case of cross-correlation $C_{xy}(r)$ it is given by $\langle x(t_i) y(t_i + r) \rangle$ where the value of $y(t)$ is lagged by $r$ periods. The summation, similarly as in the concept of the Fourier transform, picks up the maximum power for a unique set of lags $r$ or $(r,s)$ for $C_{xy}(r)$ and $C_{xyy}(r,s)$, respectively. The use of cross-correlation method should return the greatest value of $C_{xy}(r)$ among all examined $r$’s. Here, $1 \le r < N$ in a first approximation. It reveals a direct non-linear correlation between $x(t_i)$ and $y(t_i + r)$ $\forall i$.

The concept of cross-bicorrelation goes one step further and examines the relationship between $x(t_i)$ and $y(t_i + r)$ and $y(t_i + s)$, e.g. how the current oil price at time $t$, $x(t_i)$, affects $y$ lagged by $r$ and $s$ independent time periods. Note that cross-bicorrelation also allows for $C_{xxy}(r,s)$ form, i.e. if we require to look for relationship of the current and lagged (by $r$) values of $x$ time-series and period $y$ time-series lagged by $s$.

2.2. Statistical Tests for Cross-Correlation and Cross-Bicorrelation

The application of both tools for small data samples ($N < 50$) does not guarantee a statistically significant non-linear correlations every single time the computations are done. The need for statistical testing of the results returned by both cross-correlation and cross-bicorrelation methods has beed addressed by Hinich (1996) and Brook & Hinich (1999). For two time-series, with a weak stationarity, the null hypothesis $H_0$ for the test is that two series are independent pure white noise processes against the alternative hypothesis $H_1$ saying that some cross-covariances, $E[x(t_i) y(t_i + r)]$, or cross-bicovariances, $E[x(t_i) y(t_i + r) y(t_i + s)]$, are nonzero. The test statistics, respectively for cross-correlation and cross-bicorrelation, are given by:
$$
H_{xy}(N) = \sum_{r=1}^{L} (N-L) C_{xy}^2(r) = (N-L) \sum_{r=1}^{L} C_{xy}^2(r)
$$ and
$$
H_{xyy}(N) = \sum_{s=-L}^{L} ‘ \sum_{r=1}^{L} (N-m) C_{xyy}^2(r,s) \ \ \ (‘-s \ne -1, 1, 0)
$$ where it has been shown by Hinich (1996) that:
$$
H_{xy}(N) \sim \chi^2_L
$$ and
$$
H_{xyy}(N) \sim \chi^2_{(L-1)(L/2)}
$$ where $\sim$ means are asymptotically distributed with the corresponding degrees for $N \rightarrow \infty$ and $m = \max(r, s)$ as previously. For both $H_{xy}(N)$ and $H_{xyy}(N)$ statistics, the number of lags $L$ is specified as $L = N^b$ where $0 < b < 0.5$. Based on the results of the Monte-Carlo simulations, Hinich & Parrerson (1995) recommended the use of $b=0.4$ in order to maximise the power of test while ensuring a valid approximation to the asymptotic theory. Further down, we compute $H_{xy}(N)$ and $H_{xyy}(N)$ statistics for every cross-correlation and cross-bicorrelation and determine the corresponding significance assuming the confidence level of 90% ($\alpha = 0.1$) if: $$ \mbox{p-value} < \alpha $$ where $$ \mbox{p-value} = 1 - \mbox{Pr} \left[ \chi^2_{\mbox{dof}} < H_{x(y)y} | H_1 \right] \ . $$ Given that, any resultant values of cross-correlations and cross-bicorrelations one can assume as statistically significant if p-value $< \alpha$. A recommended number of lags to be investigated is $L = N^{0.4}$ where $N$ is the length of $x(t)$ and $y(t)$ time-series. Employing floor rounding for $L$ we obtain the following relationship between $N$ and $L$ for the sample sizes of $N \le 50$:

L = []
for N in range(1, 51):
    L.append(np.floor(N**0.4))
 
fig, ax = plt.subplots(figsize=(8,3))
plt.plot(np.arange(1, 51), L, '.')
plt.xlabel("N")
plt.ylabel("L")
ax.margins(x=0.025, y=0.05) # add extra padding
ax.tick_params(axis='y',which='minor',left='off') # without minor yticks
plt.grid(b=True, which='minor', color='k', linestyle=':', alpha=0.3)
plt.minorticks_on()

unknown-2

2.3. Cross-Correlation and Cross-Bicorrelation in Python

Let’s translate cross-correlation to Python language and run a simple test for a random time-series:

# cross-correlation
def Cxy(x, y, r, N):
    z = 0
    for i in range(0, N-r-1):
        z += x[i] * y[i+r]
    z /= (N-r)
    return z
 
# test statistic for cross-correlation
def Hxy(x, y, L, N):
    z = 0
    for r in range(1, L+1):
        z += Cxy(x, y, r, N)**2
    z *= (N - L)   
    return z

where we defined two functions deriving $C_{xy}(r)$ and $H_{xy}(N)$, respectively. Now, let’s consider some simplified time-series of $N=16$, both normalised to be $\sim N(0,1)$:

# sample data
y = np.array([0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0])
x = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
 
# normalise such that x, y ~ N(0,1)
x = (x - np.mean(x))/np.std(x, ddof=1)
y = (y - np.mean(y))/np.std(y, ddof=1)

As one can see, there is expected strong cross-correlation for lag of $r = 2$ (strongest) with $r = 1$ to be detected too. We verify the overall statistical significance of cross-correlations for given time-series in the following way:

N = len(x)
L = int(np.floor(N**0.4))
 
for r in range(1, L+1):
    print("r =%2g\t Cxy(r=%g) = %.4f" % (r, r, Cxy(x, y, r)))
 
dof = L    
pvalue = 1 - scipy.stats.chi2.cdf(Hxy(x, y, L, N), dof)
alpha = 0.1
 
print("Hxy(N) = %.4f" % Hxy(x, y, L, N))
print("p-value = %.4f" % pvalue)
print("H0 rejected in favour of H1 at %g%% c.l.: %s" % 
      (100*(1-alpha), pvalue<alpha))

confirming both our expectations and significance of detected cross-correlations at 90% confidence level:

r = 1	 Cxy(r=1) = 0.0578
r = 2	 Cxy(r=2) = 0.8358
r = 3	 Cxy(r=3) = -0.3200
Hxy(N) = 10.4553
p-value = 0.0151
H0 rejected in favour of H1 at 90% c.l.: True

For cross-bicorrelation we have:

# cross-bicorrelation
def Cxyy(x, y, r, s, N):
    z = 0
    m = np.max([r, s])
    for i in range(0, N-m-1):
        z += x[i] * y[i+r] * y[i+s]
    z /= (N-m)
    return z
 
# test statistic for cross-bicorrelation
def Hxyy(x, y, L, N):
    z = 0
    for s in range(2, L+1):
        for r in range(1, L+1):
            m = np.max([r, s])
            z += (N-m) * Cxyy(x, y, r, s, N)**2
    return z    
 
 
# sample data
y = np.array([0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0])
x = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0])
 
# normalise such that x, y ~ N(0,1)
x = (x - np.mean(x))/np.std(x, ddof=1)
y = (y - np.mean(y))/np.std(y, ddof=1)
 
N = len(x)  # len(x) should be equal to len(y)
L = int(np.floor(N**0.4))
 
for r in range(0, L+1):
    for s in range(r+1, L+1):
        print("r =%2g, s =%2g\t Cxyy(r=%g, s=%g) = %.5f" % 
             (r, s, r, s, Cxyy(x, y, r, s, N)))
 
dof = (L-1)*(L/2)
pvalue = 1 - scipy.stats.chi2.cdf(Hxy(x, y, L, N), dof)
alpha = 0.1
 
print("Hxyy(N) = %.4f" % Hxyy(x, y, L, N))
print("p-value = %.4f" % pvalue)
print("H0 rejected in favour of H1 at %g%% c.l.: %s" % 
      (100*(1-alpha), pvalue<alpha))

where for a new sample data we expect to detect two strongest cross-bicorrelations for pairs of lags $(r,s)$ equal $(2,3)$ and $(1,2)$, respectively. By running the code we obtain indeed that:

r = 0, s = 1	 Cxyy(r=0, s=1) = -0.18168
r = 0, s = 2	 Cxyy(r=0, s=2) = -0.23209
r = 0, s = 3	 Cxyy(r=0, s=3) = 0.05375
r = 1, s = 2	 Cxyy(r=1, s=2) = 0.14724
r = 1, s = 3	 Cxyy(r=1, s=3) = -0.22576
r = 2, s = 3	 Cxyy(r=2, s=3) = 0.18276
Hxyy(N) = 5.3177
p-value = 0.0833
H0 rejected in favour of H1 at 90% c.l.: True

With these two examples we showed both cross-correlation and cross-bicorrelation methods in action. We also pointed at the important fact of an existing dependency between the length of time-series and the expected number of lags to be examined. The longer time-series the higher number of degrees-of-freedom.

Given that state of the play, there exist two independent and appealing ways for the application of cross-correlation and cross-bicorrelation in finance: (1) the first one regards the examination of time-series with a small number of data points (e.g. $N \le 50$); (2) alternatively, if we deal with a long ($N \ge 1000$) time-series, the common practice is to divide them into $k$ chunks (non-overlapping windows) of length at least $N_i$ of $50$ to $100$ where $i=1,…,k$ and compute cross-correlation and cross-bicorrelation metrics for every data window separately. In consequence, within this approach, we gain an opportunity to report on time-period dependent data segment with statistically signifiant (or not) non-linear correlations! (see e.g. Coronado et alii 2015).

In the Section that follows, we solely focus on the application of the first approach. Rewarding in many aspects if you’re open-mined.

2.4. Detecting Non-Linearity between Oil Prices and Stock Fundamentals

Since looking for a direct correlation between crude oil prices and stocks can be regarded as the first approach to the examination of mutual relationships, it’s not sexy; not any more. With cross-correlation and cross-bicorrelation methods you can step up and break the boredom.

The stock fundamentals consist a broad space of parameters that actually deserve to be cross-correlated with the oil prices. It is much more intuitive that any (if any) effect of rising or falling crude oil prices may affect different companies with various time lags. Moreover, for some businesses we might detect a non-linear relationships existing only between oil and company’s e.g. revenues whereas for others, changing oil prices might affect more than one factor. If the patterns is present indeed, with the application of cross-correlation and cross-bicorrelation methods, you gain a brand new tool for data analysis. Let’s have a look at a practical example.

As previously, we will need both the WPI Crude Oil price-series (see Section 1.1) and access to a large database containing company’s fundamentals over a number of quarters or years. For the latter let me use a demo version of Quandl.com‘s Zacks Fundamentals Collection C available here (download Zacks Fundamentals Expanded ZACKS-FE.csv file). It contains 852 rows of data for 30 selected companies traded at US stock markets (NYSE, NASDAQ, etc.). This sample collection covers data back to 2011 i.e. ca. 22 to 23 quarterly reported financials. There is over 620 fundamentals to select from. Their full list you can find here (download the ZFC Definitions csv file). Let’s put it all in a new Python listing:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
# Non-Linear Cross-Bicorrelations between the Oil Prices and Stock Fundamentals
# (c) 2016 QuantAtRisk.com, by Pawel Lachowicz
 
import numpy as np
import pandas as pd
import scipy.stats
import datetime
from matplotlib import pyplot as plt
 
%matplotlib inline
 
grey = 0.7, 0.7, 0.7
 
import warnings
warnings.filterwarnings('ignore')
 
 
# WTI Crude Oil price-series
# src: https://fred.stlouisfed.org/series/DCOILWTICO/downloaddata
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d')
wpi = pd.read_csv("DCOILWTICO.csv", parse_dates=['DATE'], date_parser=dateparse)
wpi = wpi[wpi.VALUE != "."]
wpi.VALUE = wpi.VALUE.astype(float)
 
# Zacks Fundamentals Collection C (Fundamentals Expanded)
# src: https://www.quandl.com/databases/ZFC
df = pd.read_csv('ZACKS-FE.csv')
 
col = ['Ticker', 'per_end_date', 'per_code', 'Revenue', "EBIT", 
          "GrossProfit", "CAPEX"]
data = pd.DataFrame(columns=col)  # create an empty DataFrame
 
tickers = ["XOM", "WMT", "BA", "AAPL", "JNJ"]
 
for i in range(df.shape[0]):
    ticker = df.iloc[i]["ticker"]
    pc = df.iloc[i]["per_code"]
    # extract only reported fundamentals for quarters "QR0 and all "QR-"
    # in total 22 to 23 rows of data per stock
    if((pc[0:3] == "QR-") | (pc[0:3] == "QR0")) and (ticker in tickers):
        data.loc[len(data)] = [df.iloc[i]["ticker"],
                               df.iloc[i]["per_end_date"],
                               df.iloc[i]["per_code"],
                               df.iloc[i]["tot_revnu"],
                               df.iloc[i]["ebit"],
                               df.iloc[i]["gross_profit"],
                               df.iloc[i]["cap_expense"]] 
 
print(data.head(25))

where we created a new Dataframe, data, storing Total Revenue, EBIT, Gross Profit (revenue from operations less associated costs), and CAPEX (capital expenditure) for a sample of 5 stocks (XOM, WMT, BA, AAPL, and JNJ) out of 30 available in that demo collection. Our goal here is just to show a path for a possible analysis of all those numbers with, kept in mind, an encouragement to build a bigger tool for stock screening based on detection of non-linear correlations amongst selected fundamentals and oil prices.

First 25 lines of data Dataframe are:

   Ticker per_end_date per_code  Revenue     EBIT  GrossProfit    CAPEX
0    AAPL   2011-03-31    QR-22  24667.0   7874.0      10218.0  -1838.0
1    AAPL   2011-06-30    QR-21  28571.0   9379.0      11922.0  -2615.0
2    AAPL   2011-09-30    QR-20  28270.0   8710.0      11380.0  -4260.0
3    AAPL   2011-12-31    QR-19  46333.0  17340.0      20703.0  -1321.0
4    AAPL   2012-03-31    QR-18  39186.0  15384.0      18564.0  -2778.0
5    AAPL   2012-06-30    QR-17  35023.0  11573.0      14994.0  -4834.0
6    AAPL   2012-09-30    QR-16  35966.0  10944.0      14401.0  -8295.0
7    AAPL   2012-12-31    QR-15  54512.0  17210.0      21060.0  -2317.0
8    AAPL   2013-03-31    QR-14  43603.0  12558.0      16349.0  -4325.0
9    AAPL   2013-06-30    QR-13  35323.0   9201.0      13024.0  -6210.0
10   AAPL   2013-09-30    QR-12  37472.0  10030.0      13871.0  -8165.0
11   AAPL   2013-12-31    QR-11  57594.0  17463.0      21846.0  -1985.0
12   AAPL   2014-03-31    QR-10  45646.0  13593.0      17947.0  -3367.0
13   AAPL   2014-06-30     QR-9  37432.0  10282.0      14735.0  -5745.0
14   AAPL   2014-09-30     QR-8  42123.0  11165.0      16009.0  -9571.0
15   AAPL   2014-12-31     QR-7  74599.0  24246.0      29741.0  -3217.0
16   AAPL   2015-03-31     QR-6  58010.0  18278.0      23656.0  -5586.0
17   AAPL   2015-06-30     QR-5  49605.0  14083.0      19681.0  -7629.0
18   AAPL   2015-09-30     QR-4  51501.0  14623.0      20548.0 -11247.0
19   AAPL   2015-12-31     QR-3  75872.0  24171.0      30423.0  -3612.0
20   AAPL   2016-03-31     QR-2  50557.0  13987.0      19921.0  -5948.0
21   AAPL   2016-06-30     QR-1  42358.0  10105.0      16106.0  -8757.0
22   AAPL   2016-09-30      QR0  46852.0  11761.0      17813.0 -12734.0
23     BA   2011-03-31    QR-22  14910.0   1033.0       2894.0   -417.0
24     BA   2011-06-30    QR-21  16543.0   1563.0       3372.0   -762.0

An exemplary visualisation of both data sets for selected company (here: Exxon Mobil Corporation, XOM, engaged in the exploration and production of crude oil and natural gas, manufacturing of petroleum products, and transportation and sale of crude oil, natural gas and petroleum products) we obtain with:

49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
fig, ax1 = plt.subplots(figsize=(10,5))
plt.plot(wpi.DATE, wpi.VALUE, color=grey, label="WPI Crude Oil (daily)")
plt.grid(True)
plt.legend(loc=1)
plt.ylabel("USD per barrel")
 
selection = ["XOM"]
 
ax2 = ax1.twinx()
plt.ylabel("1e6 USD")
for ticker in selection:
    tmp = data[data.Ticker == ticker]
    lab = ticker
    tmp['per_end_date'] =  pd.to_datetime(tmp['per_end_date'], 
                                                        format='%Y-%m-%d')
    plt.plot(tmp.per_end_date, tmp.Revenue, '.-', label=lab + ": Revenue")
    plt.plot(tmp.per_end_date, tmp.EBIT, '.-g', label=lab + ": EBIT")
    plt.plot(tmp.per_end_date, tmp.GrossProfit, '.-m', label=lab + ": 
                                                            Gross Profit")
    plt.plot(tmp.per_end_date, tmp.CAPEX, '.-r', label=lab + ": CAPEX")
    plt.legend(loc=3)

unknown-2
Every company has different periods of time (per_end_date) when they report their financials. That is why we need to make sure that in the process of comparison of selected fundamentals with oil prices, the oil price is picked up around the date defined by per_end_date. Let me use $\pm 5$ day average around those points:

70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
# cross-correlation
def Cxy(x, y, r, N):
    z = 0
    for i in range(0, N-r-1):
        z += x[i] * y[i+r]
    z /= (N-r)
    return z
 
# test statistic for cross-correlation
def Hxy(x, y, L, N):
    z = 0
    for r in range(1, L+1):
        z += Cxy(x, y, r, N)**2
    z *= (N - L)   
    return z
 
# cross-bicorrelation
def Cxyy(x, y, r, s, N):
    z = 0
    m = np.max([r, s])
    for i in range(0, N-m-1):
        z += x[i] * y[i+r] * y[i+s]
    z /= (N-m)
    return z
 
def Hxyy(x, y, L, N):
    z = 0
    for s in range(2, L+1):
        for r in range(1, L+1):
            m = np.max([r, s])
            z += (N-m) * Cxyy(x, y, r, s, N)**2
    return z
 
 
fig, ax1 = plt.subplots(figsize=(10,5))
plt.plot(wpi.DATE, wpi.VALUE, color=grey, label="WPI Crude Oil (daily)")
plt.grid(True)
plt.legend(loc=1)
plt.ylabel("USD per barrel")
 
selection = ["XOM"]
 
ax2 = ax1.twinx()
plt.ylabel("1e6 USD")
for ticker in selection:
    tmp = data[data.Ticker == ticker]
    lab = ticker
    tmp['per_end_date'] =  pd.to_datetime(tmp['per_end_date'], 
                                               format='%Y-%m-%d')
    plt.plot(tmp.per_end_date, tmp.Revenue, '.-', label=lab + ": Revenue")
    plt.plot(tmp.per_end_date, tmp.EBIT, '.-g', label=lab + ": EBIT")
    plt.plot(tmp.per_end_date, tmp.GrossProfit, '.-m', label=lab + ": 
                                                            Gross Profit")
    plt.plot(tmp.per_end_date, tmp.CAPEX, '.-r', label=lab + ": CAPEX")
    plt.legend(loc=3)
 
    col = ['per_end_date', 'avgWPI']
    wpi2 = pd.DataFrame(columns=col)  # create an empty DataFrame
 
    for i in range(tmp.shape[0]):
        date = tmp.iloc[i]["per_end_date"]
        date0 = date
        date1 = date + datetime.timedelta(days=-5)
        date2 = date + datetime.timedelta(days=+5)
        wpiavg = wpi[(wpi.DATE >= date1) & (wpi.DATE <= date2)]
        avg = np.mean(wpiavg.VALUE)
        wpi2.loc[len(wpi2)] = [date0, avg]
 
    # plot quarterly averaged oil price-series
    plt.sca(ax1)  # set ax1 axis for plotting
    plt.plot(wpi2.per_end_date, wpi2.avgWPI, '.-k')

unknown-3
where a black line denotes quarterly changing WPI Crude Oil prices stored now in a new Dataframe of wpi2.

Let’s leave only XOM in selection list variable. After line #140 the Dataframe of tmp stores XOM’s selected fundamentals data. In what follows we will use both Dateframes (tmp and wpi2) to look for non-linear correlations as the final goal of this investigation:

142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
    # select some factors
    fundamentals = ["Revenue", "EBIT", "GrossProfit", "CAPEX"]
 
    # compute for them cross-correlations and cross-bicorrelations
    # and display final results
    #
    for f in fundamentals:
        print("%s: %s\n" % (ticker, f))
        # input data
        x = wpi2.avgWPI.values
        y = tmp[f].values
 
        # normalised time-series
        x = (x - np.mean(x))/np.std(x, ddof=1)
        y = (y - np.mean(y))/np.std(y, ddof=1)
 
        N = len(x)  # len(x) should be equal to len(y)
        L = int(np.floor(N**0.4))
 
        print("Cross-Correlation")
        for r in range(1, L+1):
            print("  r =%2g\t Cxy(r=%g) = %.4f" % 
                  (r, r, Cxy(x, y, r, N)))
 
        dof = L
        pvalue = 1 - scipy.stats.chi2.cdf(Hxy(x, y, L, N), dof)
        alpha = 0.1
 
        print("  Hxy(N) = %.4f" % Hxy(x, y, L, N))
        print("  p-value = %.4f" % pvalue)
        print("  H0 rejected in favour of H1 at %g%% c.l.: %s" % 
                                      (100*(1-alpha), pvalue<alpha))
 
        print("Cross-Bicorrelation")
        for r in range(0, L+1):
            for s in range(r+1, L+1):
                print("  r =%2g, s =%2g\t Cxyy(r=%g, s=%g) = %.5f" % 
                     (r, s, r, s, Cxyy(x, y, r, s, N)))
 
        dof = (L-1)*(L/2)
        pvalue = 1 - scipy.stats.chi2.cdf(Hxy(x, y, L, N), dof)
 
        print("  Hxyy(N) = %.4f" % Hxyy(x, y, L, N))
        print("  p-value = %.4f" % pvalue)
        print("  H0 rejected in favour of H1 at %g%% c.l.: %s\n" % 
                                       (100*(1-alpha), pvalue<alpha))

Our Python code generates for XOM the following output:

XOM: Revenue
 
Cross-Corelation
  r = 1	 Cxy(r=1) = 0.7869
  r = 2	 Cxy(r=2) = 0.6239
  r = 3	 Cxy(r=3) = 0.4545
  Hxy(N) = 24.3006
  p-value = 0.0000
  H0 rejected in favour of H1 at 90% c.l.: True
Cross-Bicorrelation
  r = 0, s = 1	 Cxyy(r=0, s=1) = -0.44830
  r = 0, s = 2	 Cxyy(r=0, s=2) = -0.30186
  r = 0, s = 3	 Cxyy(r=0, s=3) = -0.19315
  r = 1, s = 2	 Cxyy(r=1, s=2) = -0.36678
  r = 1, s = 3	 Cxyy(r=1, s=3) = -0.22161
  r = 2, s = 3	 Cxyy(r=2, s=3) = -0.24446
  Hxyy(N) = 10.3489
  p-value = 0.0000
  H0 rejected in favour of H1 at 90% c.l.: True
 
XOM: EBIT
 
Cross-Correlation
  r = 1	 Cxy(r=1) = 0.7262
  r = 2	 Cxy(r=2) = 0.5881
  r = 3	 Cxy(r=3) = 0.3991
  Hxy(N) = 20.6504
  p-value = 0.0001
  H0 rejected in favour of H1 at 90% c.l.: True
Cross-Bicorrelation
  r = 0, s = 1	 Cxyy(r=0, s=1) = -0.39049
  r = 0, s = 2	 Cxyy(r=0, s=2) = -0.25963
  r = 0, s = 3	 Cxyy(r=0, s=3) = -0.17480
  r = 1, s = 2	 Cxyy(r=1, s=2) = -0.28698
  r = 1, s = 3	 Cxyy(r=1, s=3) = -0.18030
  r = 2, s = 3	 Cxyy(r=2, s=3) = -0.21284
  Hxyy(N) = 6.8759
  p-value = 0.0001
  H0 rejected in favour of H1 at 90% c.l.: True
 
XOM: GrossProfit
 
Cross-Correlation
  r = 1	 Cxy(r=1) = 0.7296
  r = 2	 Cxy(r=2) = 0.5893
  r = 3	 Cxy(r=3) = 0.3946
  Hxy(N) = 20.7065
  p-value = 0.0001
  H0 rejected in favour of H1 at 90% c.l.: True
Cross-Bicorrelation
  r = 0, s = 1	 Cxyy(r=0, s=1) = -0.38552
  r = 0, s = 2	 Cxyy(r=0, s=2) = -0.24803
  r = 0, s = 3	 Cxyy(r=0, s=3) = -0.17301
  r = 1, s = 2	 Cxyy(r=1, s=2) = -0.28103
  r = 1, s = 3	 Cxyy(r=1, s=3) = -0.17393
  r = 2, s = 3	 Cxyy(r=2, s=3) = -0.19780
  Hxyy(N) = 6.2095
  p-value = 0.0001
  H0 rejected in favour of H1 at 90% c.l.: True
 
XOM: CAPEX
 
Cross-Correlation
  r = 1	 Cxy(r=1) = -0.2693
  r = 2	 Cxy(r=2) = -0.3024
  r = 3	 Cxy(r=3) = -0.2333
  Hxy(N) = 4.3682
  p-value = 0.2244
  H0 rejected in favour of H1 at 90% c.l.: False
Cross-Bicorrelation
  r = 0, s = 1	 Cxyy(r=0, s=1) = 0.00844
  r = 0, s = 2	 Cxyy(r=0, s=2) = -0.10889
  r = 0, s = 3	 Cxyy(r=0, s=3) = -0.14787
  r = 1, s = 2	 Cxyy(r=1, s=2) = -0.10655
  r = 1, s = 3	 Cxyy(r=1, s=3) = -0.16628
  r = 2, s = 3	 Cxyy(r=2, s=3) = -0.08072
  Hxyy(N) = 6.4376
  p-value = 0.2244
  H0 rejected in favour of H1 at 90% c.l.: False

A quick analysis points at detection of significant non-linear correlations (at 90% confidence level) between WPI Crude Oil and Revenue, EBIT, and GrossProfit for lag $r=1$ (based on cross-correlations) and for lags $(r=0, s=1)$ (based on cross-bicorrelations). There is no significant non-linear relationship between XOM’s CAPEX and oil prices.

Interestingly, rerunning the code for Boeing Company (NYSE ticker: BA) reveals no significant cross-(bi)correlations between the same factors and oil. For example, Boeing’s revenue improves year-over-year,

unknown-4

but it appears to have any significant non-linear link to the oil prices whatsoever:

BA: Revenue
 
Cross-Correlation
  r = 1	 Cxy(r=1) = -0.3811
  r = 2	 Cxy(r=2) = -0.3216
  r = 3	 Cxy(r=3) = -0.1494
  Hxy(N) = 5.4200
  p-value = 0.1435
  H0 rejected in favour of H1 at 90% c.l.: False
Cross-Bicorrelation
  r = 0, s = 1	 Cxyy(r=0, s=1) = 0.11202
  r = 0, s = 2	 Cxyy(r=0, s=2) = 0.03588
  r = 0, s = 3	 Cxyy(r=0, s=3) = -0.08083
  r = 1, s = 2	 Cxyy(r=1, s=2) = 0.01526
  r = 1, s = 3	 Cxyy(r=1, s=3) = 0.00616
  r = 2, s = 3	 Cxyy(r=2, s=3) = -0.07252
  Hxyy(N) = 0.3232
  p-value = 0.1435
  H0 rejected in favour of H1 at 90% c.l.: False

The most interesting aspect of cross-correlation and cross-bicorrelation methods applied in oil vs. stock fundamentals research regards new opportunities to discover correlations for stocks that so far have been completely ignored (or not suspected) to have anything in common with oil price movements in the markets. Dare to go this path. I hope you will find new assets to invest in. The information is there. Now you are equipped with new tools. Use them wisely. Retire early.

REFERENCES
    Bernanke B. S., 2016, The relationship between stocks and oil prices
    Brooks C., Hinich M. J., 1999, Cross-correlations and cross-bicorrelations in Sterling
          exchange rates
(pdf)
    Coronado et al., 2015, A study of co-movements between USA and Latin American stock
          markets: a cross-bicorrelations perspective
(pdf)
    Hamilton J., 2014, Oil prices as an indicator of global economic conditions
    Hinich M.J., 1996. Testing for dependence in the input to a linear time series model.
          Journal of Nonparametric Statistics 6, 205–221.
    Hinich M.J., Patterson D.M., 1995, Detecting epochs of transient dependence in white
          noise
, Mimeo, University of Texas at Austin

DOWNLOADS
    COPPER.csv, DCOILWTICO.csv, DGS10.csv, DTWEXM.csv, VIXCLS.csv, ZACKS-FE.csv

How to Find a Company Name given a Stock Ticker Symbol utilising Quandl API

Quandl.com offers an easy solution to that task. Their WIKI database contains 3339 major stock tickers and corresponding company names. They can be fetched via secwiki_tickers.csv file. For a plain file of portfolio.lst storing the list of your tickers, for example:

AAPL
IBM
JNJ
MSFT
TXN

you can scan the .csv file for the company name coding in Python:

# How to Find a Company Name given a Stock Ticker Symbol utilising Quandl API
# (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
 
import pandas as pd
 
df = pd.read_csv('secwiki_tickers.csv')
dp = pd.read_csv('portfolio.lst',names=['pTicker'])
 
pTickers = dp.pTicker.values  # converts into a list
 
tmpTickers = []
 
for i in range(len(pTickers)):
    test = df[df.Ticker==pTickers[i]]
    if not (test.empty):
        print("%-10s%s" % (pTickers[i], list(test.Name.values)[0]))

what returns:

AAPL      Apple Inc.
IBM       International Business Machines Corporation
JNJ       Johnson & Johnson
MSFT      Microsoft Corporation
TXN       Texas Instruments Inc.

Please note that there is a possibility to combine more stocks from other Quandl’s resources in one file. For more information see Quandl.com‘s documentation online.

I will dedicate a separate Chapter on hacking the financial websites in my next book of Python for Quants. Volume II. in mid-2015.

Pre-Processing of Asset Price Series for Portfolio Optimization

Portfolio Optimization is a significant component of Matlab’s Financial Toolbox. It provides us with ready-to-use solution in finding optimal weights of assets that we consider for trading deriving them based on the historical asset performance. From a practical point of view, we can include it in our algorithmic trading strategy and backtest its applicability under different initial conditions. This is a subject of my next up-coming post. However, before we can enjoy the view from the peak, we need to climb the mountain first.

In Matlab, the portfolio is created as a dedicated object of the same name. It doesn’t read the raw stock data. We need to feed that beast. Two major ingredients satisfy the input: a vector of the expected asset returns and a covariance matrix. Matlab helps us to estimate these moments but first we need to deliver asset data in a digestable form.

In this post we will see how one can quickly download the stock data from the Internet based on our own stock selection and pre-process them for solving portfolio optimization problem in Matlab.

Initial Setup for Portfolio Object

Let’s say that at any point of time you have your own list of stocks you wish to buy. For simplicity let’s also assume that the list contains stocks traded on NYSE or NASDAQ. Since you have been a great fun of this game, now you are almost ready to buy what you jotted down on your ShoppingList.lst. Here, an example of 10 tech stocks:

AAPL   AOL   BIDU   GOOG   HPQ   IBM   INTC   MSFT   NVDA   TXN

They will constitute your portfolio of stocks. The problem of portfolio optimization requires a look back in time in the space of returns obtained in trading by each stock. Based on them the Return Proxy and Risk Proxy can be found.

The return matrix $R$ of dimensions $(N-1)\times M$ where $N$ stands for number of historical prices (e.g. derived daily, or monthly, etc.) and $M$ for the number of stocks in our portfolio, is required by Matlab as an input. We will see how does it work in next post. For now let’s solely focus on creation of this matrix.

In the article Create a Portfolio of Stocks based on Google Finance Data fed by Quandl I discussed Quandl.com as an attractive data provider for US stocks. Here, we will follow this solution making use of Quandl resources to pull out the stock price series for our shopping list. Ultimately, we aim at building a function, here: QuandlForPortfolio, that does the job for us:

% Pre-Processing of Asset Price Series for Portfolio Optimization in Matlab
%  (c) 2013, QuantAtRisk.com, by Pawel Lachowicz
 
clear all; close all; clc;
 
% Input Parameters
n=1*365;
tickers='ShoppingList.lst';
qcodes='QuandlStockCodeListUS.xlsx';
 
[X,Y,R,AssetList] = QuandlForPortfolio(n,tickers,qcodes);

We call this function with three input parameters. The first one, $n$, denotes a number of calendar days from today (counting backwards) for which we wish to retrieve the stock data. Usually, 365 days will correspond to about 250$-$252 trading days. The second parameter is a path/file name to our list of stock (desired to be taken into account in the portfolio optimisation process) while the last input defines the path/file name to the file storing stocks’ tickers and associated Quandl Price Codes (see here for more details).

Feeding the Beast

The QuandlForPortfolio Matlab function is an extended version of the previously discussed solution. It contains an important correcting procedure for the data fetched from the Quandl servers. First, let’s have a closer look on the function itself:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
% Function assists in fetching Google Finance data from the Quandl.com
%  server for a given list of tickers of stocks traded on NYSE or
%  NASDAQ. Data are retrieved for last 'n' days with daily sampling.
%
% INPUT
%   n       : number of calendar days from 'today' (e.g. 365 would
%             correspond to about 252 business days)
%   tickers : a path/file name of a text file listing tickers
%   qcodes  : a path/file name of Excel workbook (.xlsx) containing a list
%              of tickers and Quandl Price Codes in the format of
%              [Ticker,Stock Name,Price Code,Ratios Code,In Market?]
% OUTPUT
%   X0        : [Nx1] column vector with days
%   Y0        : [NxM] matrix with Close Prices for M stocks
%   R0        : [(N-1)xM] matrix of Retruns
%   AssetList : a list of tickers (cell array)
%
% (c) 2013, QuantAtRisk.com, by Pawel Lachowicz
 
function [X0,Y0,R0,AssetList0] = QuandlForPortfolio(n,tickers,qcodes)
    fileID = fopen(tickers);
    tmp = textscan(fileID, '%s');
    fclose(fileID);
    AssetList=tmp{1};  % a list as a cell array
 
    % Read in the list of tickers and internal Quandl codes
    %
    [~,text,~] = xlsread(qcodes);
    quandlc=text(:,1); % again, as a list in a cell array
    quandlcode=text(:,3); % corresponding Quandl's Price Code
 
    date1=datestr(today-n,'yyyy-mm-dd'); % from
    date2=datestr(today,'yyyy-mm-dd');   % to
 
    % Fetch the data from Quandl.com
    %
    QData={};
    for i=1:length(AssetList)
        for j=1:length(quandlc)
            if(strcmp(AssetList{i},quandlc{j}))
                fprintf('%4.0f %s\n',i,quandlc{j});
                fts=0;
                [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ...
                   'authcode','x',...
                   'start_date',date1,'end_date',date2,'collapse','daily');
                QData{i}=fts;
            end
        end
    end
 
    % Post-Processing of Fetched Data
    %
    % create a list of days across all tickers
    TMP=[];
    for i=1:length(QData)
        tmp=fts2mat(QData{i},1);
        tmp=tmp(:,1);
        TMP=[TMP; tmp];
    end
    ut=unique(TMP);
    % use that list to find these days that are not present
    %  among all data sets
    TMP=[];
    for i=1:length(QData)
        tmp=fts2mat(QData{i},1);
        tmp=tmp(:,1);
        TMP=[TMP; setdiff(ut,tmp)];
    end
    ut=unique(TMP);
    % finally, extract Close Prices from FTS object and store them
    %  in Y0 matrix, plus corresponding days in X0
    X0=[];
    Y0=[]; 
    for i=1:length(QData)
        tmp=fts2mat(QData{i},1);
        cp=[];
        for j=1:size(tmp,1)
            [r,~,~]=find(ut==tmp(j,1));
            if(isempty(r))
                cp=[cp; tmp(j,5)]; % column 5 corresponds to Close Price
                if(i<2)
                    % create a time column vector listing days
                    % common among all data sets
                    X0=[X0; tmp(j,1)];
                end
            end
        end
        Y0=[Y0 cp];
    end
    % transform Close Prices into Returns, R(i)=cp(i)/cp(i-1)-1
    R0=tick2ret(Y0);
    AssetList0=AssetList';
end

The main bottleneck comes from the fact that Matlab’s portfolio object demands an equal number of historical returns ($N-1$) in the matrix of $R$ for all $M$ assets. We design the function in the way that it sets the common timeframe for all stocks listed on our shopping list. Of course, we ensure that all stocks were traded in the markets for about $n$ last days (rough estimation).

Now, the timeframe of $n$ last days should be understood as a first approximation. We fetch the data from Quandl (numeric date, Open, High, Low, Close, Volume) and save them in the cell array QData (lines #37-49) for each stock separately as FTS objects (Financial Time-Series objects; see Financial Toolbox). However, it may occur that not every stock we fetched displays the same amount of data. That is why we need to investigate for what days and for what stocks we miss the data. We achieve that by scanning each FTS object and creating a unique list of all days for which we have data (lines #54-60).

Next, we loop again over the same data sets but now we compare that list with a list of all dates for each stock individually (lines #63-69), capturing (line #67) those dates that are missing. Their complete list is stored as a vector in line #69. Eventually, given that, we are able to compile the full data set (e.g. Close Prices; here line #80) for all stocks in our portfolio ensuring that we will include only those dates for which we have prices across all $M$ assets (lines #70-91).

Beast Unleashed

We test our data pre-processing simply by running the block of code listed above engaging QuandlForPortfolio function and we check the results in the Matlab’s command window as follows:

>> whos X Y R AssetList
  Name             Size            Bytes  Class     Attributes
 
  AssetList        1x10             1192  cell                
  R              250x10            20000  double              
  X              251x1              2008  double              
  Y              251x10            20080  double

what confirms the correctness of dimensions as expected.

At this stage, the aforementioned function can be used two-fold. First, we are interested in the portfolio optimisation and we look back at last $n$ calendar days since the most current one (today). The second usage is handy too. We consider our stocks on the shopping list and fetch for their last, say, $n=7\times365$ days with data. If all stocks were traded over past 7 years we should be able to collect a reach data set. If not, the function will adjust the beginning and end date to meet the initial time constrains as required for $R$ matrix construction. For the former case, we can use 7-year data sample for direct backtesting of algo models utilizing Portfolio Optimization.

Stay tuned as we will rock this land in the next post!

Any Questions?

Share them across QuantCove.com – the official Forum of QuantAtRisk.

Contact Form Powered By : XYZScripts.com