Quantitative Analysis, Risk Management, Modelling, Algo Trading, and Big Data Analysis

Upcoming… 2nd Edition of Python for Quants. On Sale Jan 31, 2017.

A brand new 2nd Edition of my Python for Quants. Volume I. book is coming out on Jan 31, 2017! Powered by great feedback from those who purchased the 1st Edition, and supplemented and extended with new content, I'm pleased to deliver the friendliest introduction to Python for finance for everyone!

What's new in the 2nd Edition:

  1. Advanced NumPy for Quants: this time more complete, incl. linear and non-linear algebra for finance, accelerated computations on GPU, and advanced matrix manipulations;
  2. Statistics: more practical applications for financial data analysis powered by SciPy and statsmodels libraries;
  3. pandas for Time-Series: the best introduction to the most powerful library for data handling and processing with advanced application in financial time-series analysis;
  4. 2D and 3D Plotting: a smart use of matplotlib, seaborn, and cufflinks for computational aspects of Python and pandas.


Interested in purchasing the book?
Enjoy 20% Off Before the Official Premiere!$^{*}$

Pre-Order Today!



$^{*}$ Prepay today to secure 20% off the regular price of USD 49.00 after the Official Premiere. Pay USD 9.90 now and only USD 29.30 later; your total book price will be USD 39.20. Your Pre-Order Deposit will be saved under your name and entitles you to the offered discount (non-refundable). If you wish to upgrade from the 1st Ed., you will be entitled to a 45% discount (your personal data will be cross-verified with our records).


Non-Linear Cross-Bicorrelations between the Oil Prices and Stock Fundamentals

When we talk about correlations in finance, by default we assume linear relationships between two "co-moving" time-series. In other words, if one time-series changes its values over a given time period, we look for a tight, immediate response in the other time-series. If we find it, we say they are correlated. But wait a minute! Is the game always about looking for an immediate feedback between two time-series? Is "correlated" always a synonym for a direct, i.e. linear, response?

Think for a second about the oil prices. If they rise within a quarter, the effect is not the same for every company. Certain firms are tightly linked to oil/petroleum prices, which affect their product value, distribution, sales, etc., while for other companies, e.g. online services, a change in the crude oil price has a negligible effect. Is it so?

As a data scientist, quantitative (business) analyst, or investment researcher, you need to know that mathematics and statistics deliver ample tools you may use to uncover relationships between two (or more) financial assets that can be somehow related, correlated, or connected. The wording does not need to be precise to reflect the goal of the investigation: correlation, in general.

Correlation can be linear or non-linear. Non-linearity of correlation is somewhat counterintuitive. We are used to thinking in a linear way, which is why it is so hard to readjust our thinking to a non-linear domain. The changes of the oil prices might have a non-negligible effect on the airlines, causing air-ticket prices to rise or fall due to a recalculated oil/petroleum surcharge. The effect may not be immediate. A rise in oil prices today can be "detected" with a time delay of days, weeks, or even months, i.e. non-linearly correlated.

In this post we analyse both linear and non-linear correlation methods that have (and have not) been applied so far in research on oil price tectonics across the markets. In particular, we first discuss one-factor and multiple linear regression models in order to turn, in consequence, toward non-linearity that can be captured by cross-correlation and cross-bicorrelation methods. Developing an extensive but easy-to-follow Python code, we analyse a Quandl.com database storing over 620 financial factors (stock fundamentals) per stock. Finally, we show how both non-linear mathematical tools can be applied to detect non-linear correlations for different financial assets.

1. Linear Correlations

1.1. Oil Prices versus Stock Markets

Correlations between increasing/decreasing oil prices and stock markets have been a subject of investigation for a number of years. The initial interest was in the search for linear correlations between raw price/return time-series of stocks and oil benchmarks (e.g. WTI Crude Oil). The justification behind such an approach to data analysis seems to be driven by common sense: an impact of rising oil prices should affect companies dependent on petroleum-based products or services.

The usual presumption states that a decline in the oil price is good news for the economy, especially for net oil importers, e.g. the USA or China. An increase in oil prices usually raises input costs for most businesses and forces consumers to spend more money on gasoline, thereby reducing the corporate earnings of other businesses. The opposite should be true when oil prices fall.

Oil prices are determined by supply and demand for petroleum-based products. During an economic expansion, prices might rise as a result of increased consumption; they might also fall as a result of increased production. That pattern has been observed on many occasions so far.

A quantitative approach to measuring the strength of the linear response of the stock market to the oil price can be demonstrated using plain linear regression modelling. If we agree that the S&P 500 Index is a proxy for the US stock market, we should be able to capture the relationship. In the following example, we compare WTI Crude Oil prices with the S&P 500 Index over the 2-year period between Nov 2014 and Nov 2016. First we plot both time-series, and next we apply a one-factor linear regression model,
$$
y(t) = \beta x(t) + \epsilon(t)
$$ in order to fit the data best:

import numpy as np
import pandas as pd
from scipy import stats
from matplotlib import pyplot as plt
import pandas_datareader.data as web
 
%matplotlib inline
 
grey = 0.7, 0.7, 0.7
 
import warnings
warnings.filterwarnings('ignore')
 
# downloading S&P 500 Index from Yahoo! Finance
sp500 = web.DataReader("^GSPC", data_source='yahoo',
                  start='2014-11-01', end='2016-11-01')['Adj Close']
 
# WPI Crude Oil price-series
# src: https://fred.stlouisfed.org/series/DCOILWTICO/downloaddata
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d')
wpi = pd.read_csv("DCOILWTICO.csv", parse_dates=['DATE'], date_parser=dateparse)
wpi = wpi[wpi.VALUE != "."]
wpi.VALUE = wpi.VALUE.astype(float)
wpi.index = wpi.DATE
wpi.drop('DATE', axis=1, inplace=True) 
 
# combine both time-series
data = pd.concat([wpi, sp500], axis=1).dropna()  # and remove NaNs, if any
data.columns = ["WPI", "S&P500"]
print(data.head())
              WPI       S&P500
2014-11-03  78.77  2017.810059
2014-11-04  77.15  2012.099976
2014-11-05  78.71  2023.569946
2014-11-06  77.87  2031.209961
2014-11-07  78.71  2031.920044
plt.figure(figsize=(8,8))
# plot time-series
ax1 = plt.subplot(2,1,1)
plt.plot(data.WPI, 'k', label="WPI Crude Oil")
plt.legend(loc=3)
plt.ylabel("WPI Crude Oil [U$/barrel]")
 
# fit one-factor linear model
slope, intercept, r_value, p_value, std_err = \
                  stats.linregress(data.WPI, data["S&P500"])
 
# best fit: model
xline = np.linspace(np.min(data.WPI), np.max(data.WPI), 100)
line = slope*xline + intercept
 
# plot scatter plot and linear model
ax2 = ax1.twinx()
plt.plot(data["S&P500"], 'm', label="S&P500")
plt.ylabel("S&P 500 Index [U$]")
plt.legend(loc="best")
plt.subplot(2,1,2)
plt.plot(data.WPI, data["S&P500"],'.', color=grey)
plt.plot(xline, line,'--')
plt.xlabel("WPI Crude Oil [U$/barrel]")
plt.ylabel("S&P 500 Index [U$]")
plt.title("R$^2$ = %.2f" % r2_value, fontsize=11)

Our Python code generates a figure with both time-series (top panel) and the scatter plot with the fitted line (bottom panel), where we can see that the WTI Crude Oil price vs. S&P 500 linear correlation is weak, with $R^2 = 0.36$ only. This is a static picture.

As one might suspect, correlation is time-dependent. Therefore, it is wiser to check a rolling linear correlation for a given data window size. Below, we assume its length to be 50 days:

rollcorr = pd.rolling_corr(wpi, sp500, 50).dropna()
 
plt.figure(figsize=(10,3))
plt.plot(rollcorr)
plt.grid()
plt.ylabel("Linear Correlation (50 day rolling window)", fontsize=8)

In 2008, Andrea Pescatori measured changes in the S&P 500 and oil prices in the same way as demonstrated above. He noted that the variables occasionally moved in the same direction at the same time but, even then, the relationship remained weak. He concluded that there was no correlation at the 95% confidence level.

The lack of correlation might be attributed to factors like wages, interest rates, industrial metals, plastics, or computer technology, which can offset changes in energy costs. On the other hand, corporations may have become increasingly sophisticated at reading the futures markets and much better at anticipating shifts in factor prices (e.g. a company should be able to switch production processes to compensate for added fuel costs).

No matter what we analyse, either oil prices vs. market indexes or oil prices vs. individual stock trading records, the linear correlation method is able to measure only a 1-to-1 response of $y$ to $x$. The lower the correlation coefficient, the less valid the linear model is as a descriptor of true events and mutual relationships under study.
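
To see why a zero-lag (Pearson) correlation can completely miss a genuine but delayed relationship, consider the minimal synthetic sketch below; the 20-day lag, the random-walk series, and the noise level are purely illustrative assumptions:

import numpy as np
 
np.random.seed(1)
 
# synthetic example: y follows x with a (hypothetical) 20-day delay
lag = 20
x = np.random.randn(500).cumsum()              # a random-walk "price"
y = np.roll(x, lag) + 0.5*np.random.randn(500)
x, y = x[lag:], y[lag:]                        # drop the wrapped-around segment
 
rx, ry = np.diff(x), np.diff(y)                # daily changes
 
print("corr at lag 0:  %.3f" % np.corrcoef(rx, ry)[0, 1])
print("corr at lag 20: %.3f" % np.corrcoef(rx[:-lag], ry[lag:])[0, 1])

For these settings, the zero-lag coefficient hovers around zero while the lag-20 coefficient comes out strongly positive (about 0.8); a linear model fitted at lag zero would simply report "no correlation".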

1.2. Multiple Linear Regression Model for Oil Price Changes

In February 2016, Ben S. Bernanke found that a positive correlation of stocks and oil might arise because both are responding to an underlying shift in global demand. Using a simple multiple linear regression model, he attempted to explain the daily changes of the oil prices and concluded that over 90% of them could be driven by changes of commodity prices (copper), the US dollar (spot price), and the 10-yr Treasury interest rate, and over 95% by extending the model with the inclusion of daily changes in VIX:
$$
\Delta p_{oil, t} = \beta_1 \Delta p_{copp, t} + \beta_2 \Delta p_{10yr, t} + \beta_3 \Delta p_{USD, t} + \beta_4 \Delta p_{VIX, t} + \beta_0
$$ The premise is that commodity prices, the long-term interest rate, and the USD are likely to respond to investors' perceptions of global and US demand for oil. Additionally, the presence of VIX is in line with the following idea: if investors retreat from commodities and/or stocks during periods of high uncertainty/risk aversion, then shocks to volatility may be another reason for the observed tendency of stocks and oil prices to move together.

The model, inspired by the work of James Hamilton (2014), aims at measuring the effect of demand shifts on the oil market. For the sake of elegance, we can code it in Python as follows. First, we download all the necessary price-series to files and process them within pandas DataFrames (more on quantitative aspects of pandas for quants you will find in my upcoming book Python for Quants. Volume II.).

import numpy as np
import pandas as pd
from scipy import stats
import datetime
from datetime import datetime
from matplotlib import pyplot as plt
import statsmodels.api as sm
 
%matplotlib inline
 
grey = 0.7, 0.7, 0.7
 
import warnings
warnings.filterwarnings('ignore')
 
# Download WPI Crude Oil prices
# src: https://fred.stlouisfed.org/series/DCOILWTICO/downloaddata
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d')
wpi = pd.read_csv("DCOILWTICO.csv", parse_dates=['DATE'], date_parser=dateparse)
wpi = wpi[wpi.VALUE != "."]
wpi.VALUE = wpi.VALUE.astype(float)
wpi.index = wpi.DATE
wpi.drop('DATE', axis=1, inplace=True)
 
# Copper Futures 
# src: http://www.investing.com/commodities/copper-historical-data
cop = pd.read_csv("COPPER.csv", index_col=["Date"])
cop.index = pd.to_datetime(pd.Series(cop.index), format='%b %d, %Y')
 
# Nominal interest rate on 10-year Treasury bonds
# src: https://fred.stlouisfed.org/series/DGS10/
tb = pd.read_csv('DGS10.csv')
tb = tb[tb.DGS10 != "."]
tb.DGS10 = tb.DGS10.astype(float)
tb['DATE'] =  pd.to_datetime(tb['DATE'], format='%Y-%m-%d')
tb.index = tb.DATE
tb.drop('DATE', axis=1, inplace=True)
 
# Trade Weighted U.S. Dollar Index
# src: https://fred.stlouisfed.org/series/DTWEXM
usd = pd.read_csv('DTWEXM.csv')
usd = usd[usd.DTWEXM != "."]
usd.DTWEXM = usd.DTWEXM.astype(float)
usd['DATE'] =  pd.to_datetime(usd['DATE'], format='%Y-%m-%d')
usd.index = usd.DATE
usd.drop('DATE', axis=1, inplace=True)
 
# CBOE Volatility Index: VIX© (VIXCLS)
# src: https://fred.stlouisfed.org/series/VIXCLS/downloaddata
vix = pd.read_csv('VIXCLS.csv')
vix = vix[vix.VALUE != "."]
vix.VALUE = vix.VALUE.astype(float)
vix['DATE'] =  pd.to_datetime(vix['DATE'], format='%Y-%m-%d')
vix.index = vix.DATE
vix.drop('DATE', axis=1, inplace=True)

A plain use of the pd.read_csv function does not always work the way we expect, therefore some extra steps need to be implemented in order to convert the .csv file content into a DataFrame: correct parsing of the dates, conversion of the prices from string to float, and replacement of the index with the data included in the 'DATE' column followed by its removal.

As a side note, in order to download the copper data we need to manually mark the whole table at the website, paste it into Excel or Mac's Numbers, trim it if required, and save it down to a .csv file. A problem might appear because the dates are given in the "Nov 26, 2016" format. For this case, the following trick in Python saves a lot of time:

from datetime import datetime
expr = 'Nov 26, 2016'
datetime.strptime(expr, '%b %d, %Y')   # returns datetime.datetime(2016, 11, 26, 0, 0)

which justifies the use of the %b descriptor (see more on Python's datetime date formats and parsing here and here).

pandas DataFrames are a fast and convenient way to concatenate multiple time-series with indexes set to calendar dates. The following step ensures that all five price-series have data points on the same dates (if given); if any data point (for any time-series) is missing (NaN), the whole row is removed from the resulting DataFrame df thanks to the .dropna() function:

df = pd.concat([wpi, cop, tb, usd, vix], join='outer', axis=1).dropna()
df.columns = ["WPI", "Copp", "TrB", "USD", "VIX"]
 
print(df.head(10))
              WPI   Copp   TrB      USD    VIX
2007-11-15  93.37  3.082  4.17  72.7522  28.06
2007-11-16  94.81  3.156  4.15  72.5811  25.49
2007-11-19  95.75  2.981  4.07  72.7241  26.01
2007-11-20  99.16  3.026  4.06  72.4344  24.88
2007-11-21  98.57  2.887  4.00  72.3345  26.84
2007-11-23  98.24  2.985  4.01  72.2719  25.61
2007-11-26  97.66  3.019  3.83  72.1312  28.91
2007-11-27  94.39  2.958  3.95  72.5061  26.28
2007-11-28  90.71  3.001  4.03  72.7000  24.11
2007-11-29  90.98  3.063  3.94  72.6812  23.97

Given all the price-series and our model in mind, we are interested in running the regression on the daily changes in the natural logarithm of the crude oil price, copper price, 10-yr interest rate, USD spot price, and VIX. The use of logs is a common practice in finance and economics (see here why). In contrast to the daily percent change, we derive the log returns in Python in the following way:

np.random.seed(7)
df = pd.DataFrame(100 + np.random.randn(100).cumsum(), columns=['price'])
df['pct_change'] = df.price.pct_change()
df['log_ret'] = np.log(df.price) - np.log(df.price.shift(1))
 
print(df.head())
        price  pct_change   log_ret
0  101.690526         NaN       NaN
1  101.224588   -0.004582 -0.004592
2  101.257408    0.000324  0.000324
3  101.664925    0.004025  0.004016
4  100.876002   -0.007760 -0.007790

Having that, we continue coding our main model:

headers = df.columns
 
dlog = pd.DataFrame()  # log returns Dateframe
 
for j in range(df.shape[1]):
    price = df.iloc[:, j]
    dl = np.log(price) - np.log(price.shift(1))
    dlog[j] = dl
 
dlog.dropna(inplace=True)
dlog.columns = headers
print(dlog.head())
                 WPI      Copp       TrB       USD       VIX
2007-11-16  0.015305  0.023727 -0.004808 -0.002355 -0.096059
2007-11-19  0.009866 -0.057047 -0.019465  0.001968  0.020195
2007-11-20  0.034994  0.014983 -0.002460 -0.003992 -0.044417
2007-11-21 -0.005968 -0.047024 -0.014889 -0.001380  0.075829
2007-11-23 -0.003353  0.033382  0.002497 -0.000866 -0.046910

and run the multiple linear regression model for the in-sample data, here selected as the 2-year in-sample period between Nov 2012 and Nov 2014:

dlog0 = dlog.copy()
 
# in-sample data selection
dlog = dlog[(dlog.index > "2012-11-01") & (dlog.index < "2014-11-01")]
 
# define input data
y = dlog.WPI.values
x = [dlog.Copp.values, dlog.TrB.values, dlog.USD.values, dlog.VIX.values ]
 
# Multiple Linear Regression using the 'statsmodels' library
def reg_m(y, x):
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    # note: regressors are stacked in reverse order, so in the summary
    # x1 corresponds to x[-1] (VIX) and x4 to x[0] (copper)
    results = sm.OLS(y, X).fit()
    return results
 
# display a summary of the regression
print(reg_m(y, x).summary())
                           OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.105
Model:                            OLS   Adj. R-squared:                  0.098
Method:                 Least Squares   F-statistic:                     14.49
Date:                Tue, 29 Nov 2016   Prob (F-statistic):           3.37e-11
Time:                        11:38:44   Log-Likelihood:                 1507.8
No. Observations:                 499   AIC:                            -3006.
Df Residuals:                     494   BIC:                            -2985.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            -0.0303      0.008     -3.800      0.000        -0.046    -0.015
x2            -0.3888      0.178     -2.180      0.030        -0.739    -0.038
x3             0.0365      0.032      1.146      0.252        -0.026     0.099
x4             0.2261      0.052      4.357      0.000         0.124     0.328
const      -3.394e-05      0.001     -0.064      0.949        -0.001     0.001
==============================================================================
Omnibus:                       35.051   Durbin-Watson:                   2.153
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               75.984
Skew:                          -0.392   Prob(JB):                     3.16e-17
Kurtosis:                       4.743   Cond. No.                         338.
==============================================================================
 
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is 
    correctly specified.

A complete model is given by the following parameters:

par = reg_m(y, x).params
print(par)
[ -3.02699345e-02  -3.88784262e-01   3.64964420e-02   2.26134337e-01
  -3.39418282e-05]

or more formally as:
$$
\Delta p_{oil, t} = 0.2261 \Delta p_{copp, t} + 0.0365 \Delta p_{10yr, t} -0.3888 \Delta p_{USD, t} -0.0303 \Delta p_{VIX, t}
$$ The adjusted $R^2$ is close to zero (0.098) and the p-value of $\beta_2$ (the coefficient of the 10-yr rate) exceeds 5%, i.e. that term is not statistically significant.

In order to use our model for "future" oil price prediction based on price movements in the commodity (copper), the long-term interest rate, the USD spot price, and the VIX index, we generate an out-of-sample data set and construct the prediction in the following way:

# out-of-sample log return Dataframe
dlog_oos = dlog0[dlog0.index >= '2014-11-01']
 
# what is the last WPI Crude Oil price in in-sample
last = df[df.index == "2014-10-31"].values[0][0]  # 80.53 USD/barrel
 
oilreg = pd.DataFrame(columns=["Date", "OilPrice"])  # regressed oil prices
par = reg_m(y, x).params  # parameters from regression
 
for i in range(dlog_oos.shape[0]):
    x1 = dlog_oos.iloc[i]["VIX"]
    x2 = dlog_oos.iloc[i]["USD"]
    x3 = dlog_oos.iloc[i]["TrB"]
    x4 = dlog_oos.iloc[i]["Copp"]
    w = par[0]*x1 + par[1]*x2 + par[2]*x3 + par[3]*x4 + par[4]
    if(i==0):
        oilprice = np.exp(w) * last
    else:
        oilprice = np.exp(w) * oilprice
    oilreg.loc[len(oilreg)] = [dlog_oos.index[i], oilprice]
 
oilreg.index = oilreg.Date
oilreg.drop('Date', axis=1, inplace=True)

where the oilreg DataFrame stores the dates and regressed oil prices. Visualisation of the results

plt.figure(figsize=(10,5))
plt.plot(df.WPI, color=grey, label="WPI Crude Oil")
plt.plot(oilreg, "r", label="Predicted Oil Price")
plt.xlim(['2012-01-01', '2016-11-01'])  # in-sample & out-of-sample
plt.ylim([20, 120])
plt.ylabel("USD/barrel")
plt.legend(loc=3)
plt.grid()

delivers a chart of the actual WTI Crude Oil price (grey) with the predicted oil price (red) overplotted for the out-of-sample period.
Bernanke's multiple regression model points at the following idea: when stock traders respond to a change in oil prices, they do so not necessarily because the oil movement is consequential in itself. Our finding (for different data sets than Bernanke's) points at a lack of strong dependency on VIX and thus on traders' activity. Again, as in the case of the one-factor regression model discussed in Section 1.1, the stability of the underlying correlations with all four considered factors remains variable in time.

2. Non-Linear Correlations

So far we have seen how fragile the quantification and interpretation of the linear correlation between two time-series can be. We agreed on its time-dependent character and on how the data sample size may influence the derived results. Linear correlation is a tool that captures a 1-to-1 feedback between two data sets. What if the response is lagged? We need to look for other tools good enough for this challenge. In this Section we will analyse two of them.

2.1. Cross-Correlation and Cross-Bicorrelation

If a change in one quantity has a noticeable effect on another, this can be observed with or without a time delay. In time-series analysis we talk about one time-series being lagged (or led) behind (before) the other one. Since an existing lag (lead) is not observed directly (as a linear feedback), it introduces a non-linear relationship between the two time-series.

A degree of non-linear correlation can be measured by the application of two independent methods, namely cross-correlation and cross-bicorrelation, designed to examine the data dependence between any two (three) values led/lagged by $r$ ($s$) periods. By a period one should understand the sampling frequency of the underlying time-series, for example a quarter, i.e. a 3-month period.

Let $x(t) \equiv \{ x_t \} = \{ x_1, x_2, …, x_N \}$ for $t=1,…,N$ represent, e.g., the WTI Crude Oil price-series, and $\{ y_t \}$ the time-series of a selected quantity. For two time-series, $x(t)$ and $y(t)$, the cross-correlation is defined as:
$$
C_{xy}(r) = (N-r)^{-1} \sum_{i=1}^{N-r} x(t_i) y(t_i + r)
$$ whereas the cross-bicorrelation is usually assumed as:
$$
C_{xyy}(r,s) = (N-m)^{-1} \sum_{i=1}^{N-m} x(t_i) y(t_i + r) y(t_i + s)
$$ where $m=\max(r,s)$ (Brooks & Hinich 1999). Both formulae measure the inner product between the individual values of the two time-series. In the case of cross-correlation, $C_{xy}(r)$, it is given by $\langle x(t_i) y(t_i + r) \rangle$ where the value of $y(t)$ is lagged by $r$ periods. The summation, similarly to the concept of the Fourier transform, picks up the maximum power for a unique set of lags $r$ or $(r,s)$ for $C_{xy}(r)$ and $C_{xyy}(r,s)$, respectively. The cross-correlation method should return the greatest value of $C_{xy}(r)$ among all examined $r$'s; here, $1 \le r < N$ in a first approximation. It reveals a direct non-linear correlation between $x(t_i)$ and $y(t_i + r)$ $\forall i$.

The concept of cross-bicorrelation goes one step further and examines the relationship between $x(t_i)$ and $y(t_i + r)$ and $y(t_i + s)$, e.g. how the current oil price at time $t_i$, $x(t_i)$, affects $y$ lagged by $r$ and $s$ independent time periods. Note that cross-bicorrelation also allows a $C_{xxy}(r,s)$ form, i.e. if we require a relationship between the current and lagged (by $r$) values of the $x$ time-series and the $y$ time-series lagged by $s$.
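
Before turning to the statistical tests, here is a compact NumPy sketch of the two estimators defined above, with the formulas translated directly into vectorised code (the explicit-loop versions used later in this Section follow the same definitions); both functions expect the series to be standardised to zero mean and unit variance beforehand:

import numpy as np
 
def cross_corr(x, y, r):
    # C_xy(r) = (N-r)^{-1} sum_i x(t_i) y(t_i + r)
    N = len(x)
    return np.dot(x[:N-r], y[r:]) / (N - r)
 
def cross_bicorr(x, y, r, s):
    # C_xyy(r,s) = (N-m)^{-1} sum_i x(t_i) y(t_i + r) y(t_i + s),  m = max(r, s)
    N = len(x)
    m = max(r, s)
    return np.sum(x[:N-m] * y[r:N-m+r] * y[s:N-m+s]) / (N - m)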

2.2. Statistical Tests for Cross-Correlation and Cross-Bicorrelation

The application of both tools to small data samples ($N < 50$) does not guarantee statistically significant non-linear correlations every single time the computations are done. The need for statistical testing of the results returned by both cross-correlation and cross-bicorrelation methods has been addressed by Hinich (1996) and Brooks & Hinich (1999). For two time-series with weak stationarity, the null hypothesis $H_0$ of the test is that the two series are independent pure white noise processes, against the alternative hypothesis $H_1$ that some cross-covariances, $E[x(t_i) y(t_i + r)]$, or cross-bicovariances, $E[x(t_i) y(t_i + r) y(t_i + s)]$, are nonzero. The test statistics, respectively for cross-correlation and cross-bicorrelation, are given by:
$$
H_{xy}(N) = \sum_{r=1}^{L} (N-L) C_{xy}^2(r) = (N-L) \sum_{r=1}^{L} C_{xy}^2(r)
$$ and
$$
H_{xyy}(N) = {\sum_{s=-L}^{L}}' \sum_{r=1}^{L} (N-m)\, C_{xyy}^2(r,s), \quad \text{where the primed sum excludes } s = -1, 0, 1
$$ where it has been shown by Hinich (1996) that:
$$
H_{xy}(N) \sim \chi^2_L
$$ and
$$
H_{xyy}(N) \sim \chi^2_{(L-1)(L/2)}
$$ where $\sim$ means "is asymptotically distributed as" a chi-square distribution with the corresponding degrees of freedom for $N \rightarrow \infty$, and $m = \max(r, s)$ as previously. For both the $H_{xy}(N)$ and $H_{xyy}(N)$ statistics, the number of lags $L$ is specified as $L = N^b$ with $0 < b < 0.5$. Based on the results of Monte-Carlo simulations, Hinich & Patterson (1995) recommended the use of $b=0.4$ in order to maximise the power of the test while ensuring a valid approximation to the asymptotic theory. Further down, we compute the $H_{xy}(N)$ and $H_{xyy}(N)$ statistics for every cross-correlation and cross-bicorrelation and regard the corresponding result as significant at the 90% confidence level ($\alpha = 0.1$) if: $$ \mbox{p-value} < \alpha $$ where $$ \mbox{p-value} = 1 - \mbox{Pr} \left[ \chi^2_{\mbox{dof}} < H_{x(y)y} \, | \, H_1 \right] \ . $$ A recommended number of lags to investigate is $L = N^{0.4}$ where $N$ is the length of the $x(t)$ and $y(t)$ time-series. Employing floor rounding for $L$, we obtain the following relationship between $N$ and $L$ for sample sizes of $N \le 50$:

L = []
for N in range(1, 51):
    L.append(np.floor(N**0.4))
 
fig, ax = plt.subplots(figsize=(8,3))
plt.plot(np.arange(1, 51), L, '.')
plt.xlabel("N")
plt.ylabel("L")
ax.margins(x=0.025, y=0.05) # add extra padding
ax.tick_params(axis='y',which='minor',left='off') # without minor yticks
plt.grid(b=True, which='minor', color='k', linestyle=':', alpha=0.3)
plt.minorticks_on()


2.3. Cross-Correlation and Cross-Bicorrelation in Python

Let's translate cross-correlation into Python and run a simple test for some sample time-series:

# cross-correlation
def Cxy(x, y, r, N):
    z = 0
    for i in range(0, N-r-1):
        z += x[i] * y[i+r]
    z /= (N-r)
    return z
 
# test statistic for cross-correlation
def Hxy(x, y, L, N):
    z = 0
    for r in range(1, L+1):
        z += Cxy(x, y, r, N)**2
    z *= (N - L)   
    return z

where we defined two functions deriving $C_{xy}(r)$ and $H_{xy}(N)$, respectively. Now, let's consider some simplified time-series of $N=16$ points each, both normalised to zero mean and unit variance:

# sample data
y = np.array([0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0])
x = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
 
# normalise such that x, y ~ N(0,1)
x = (x - np.mean(x))/np.std(x, ddof=1)
y = (y - np.mean(y))/np.std(y, ddof=1)

As one can see, we expect to detect a strong cross-correlation at the lag of $r = 2$ (the strongest), with a weaker one at $r = 1$. We verify the overall statistical significance of the cross-correlations for the given time-series in the following way:

import scipy.stats
 
N = len(x)
L = int(np.floor(N**0.4))
 
for r in range(1, L+1):
    print("r =%2g\t Cxy(r=%g) = %.4f" % (r, r, Cxy(x, y, r, N)))
 
dof = L
pvalue = 1 - scipy.stats.chi2.cdf(Hxy(x, y, L, N), dof)
alpha = 0.1
 
print("Hxy(N) = %.4f" % Hxy(x, y, L, N))
print("p-value = %.4f" % pvalue)
print("H0 rejected in favour of H1 at %g%% c.l.: %s" % 
      (100*(1-alpha), pvalue<alpha))

confirming both our expectations and the significance of the detected cross-correlations at the 90% confidence level:

r = 1	 Cxy(r=1) = 0.0578
r = 2	 Cxy(r=2) = 0.8358
r = 3	 Cxy(r=3) = -0.3200
Hxy(N) = 10.4553
p-value = 0.0151
H0 rejected in favour of H1 at 90% c.l.: True

For cross-bicorrelation we have:

# cross-bicorrelation
def Cxyy(x, y, r, s, N):
    z = 0
    m = np.max([r, s])
    for i in range(0, N-m-1):
        z += x[i] * y[i+r] * y[i+s]
    z /= (N-m)
    return z
 
# test statistic for cross-bicorrelation
def Hxyy(x, y, L, N):
    z = 0
    for s in range(2, L+1):
        for r in range(1, L+1):
            m = np.max([r, s])
            z += (N-m) * Cxyy(x, y, r, s, N)**2
    return z    
 
 
# sample data
y = np.array([0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0])
x = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0])
 
# normalise such that x, y ~ N(0,1)
x = (x - np.mean(x))/np.std(x, ddof=1)
y = (y - np.mean(y))/np.std(y, ddof=1)
 
N = len(x)  # len(x) should be equal to len(y)
L = int(np.floor(N**0.4))
 
for r in range(0, L+1):
    for s in range(r+1, L+1):
        print("r =%2g, s =%2g\t Cxyy(r=%g, s=%g) = %.5f" % 
             (r, s, r, s, Cxyy(x, y, r, s, N)))
 
dof = (L-1)*(L/2)
pvalue = 1 - scipy.stats.chi2.cdf(Hxyy(x, y, L, N), dof)
alpha = 0.1
 
print("Hxyy(N) = %.4f" % Hxyy(x, y, L, N))
print("p-value = %.4f" % pvalue)
print("H0 rejected in favour of H1 at %g%% c.l.: %s" % 
      (100*(1-alpha), pvalue<alpha))

where for the new sample data we expect to detect the two strongest cross-bicorrelations for the lag pairs $(r,s)$ equal to $(2,3)$ and $(1,2)$, respectively. By running the code we indeed obtain:

r = 0, s = 1	 Cxyy(r=0, s=1) = -0.18168
r = 0, s = 2	 Cxyy(r=0, s=2) = -0.23209
r = 0, s = 3	 Cxyy(r=0, s=3) = 0.05375
r = 1, s = 2	 Cxyy(r=1, s=2) = 0.14724
r = 1, s = 3	 Cxyy(r=1, s=3) = -0.22576
r = 2, s = 3	 Cxyy(r=2, s=3) = 0.18276
Hxyy(N) = 5.3177
p-value = 0.0833
H0 rejected in favour of H1 at 90% c.l.: True

With these two examples we have shown both the cross-correlation and cross-bicorrelation methods in action. We also pointed at the important dependency between the length of the time-series and the expected number of lags to be examined: the longer the time-series, the higher the number of degrees of freedom.

Given that state of play, there exist two independent and appealing ways to apply cross-correlation and cross-bicorrelation in finance: (1) the first one regards the examination of time-series with a small number of data points (e.g. $N \le 50$); (2) alternatively, if we deal with long ($N \ge 1000$) time-series, the common practice is to divide them into $k$ chunks (non-overlapping windows) of length $N_i$ of at least 50 to 100 points, where $i=1,…,k$, and compute the cross-correlation and cross-bicorrelation metrics for every data window separately. In consequence, within this approach, we gain an opportunity to report on time-period dependent data segments with statistically significant (or not) non-linear correlations! (see e.g. Coronado et al. 2015).
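
For completeness, a minimal sketch of approach (2) might look as follows; it reuses the Cxy and Hxy functions defined earlier in this Section, and the window length and significance level are free parameters of my own choosing:

import numpy as np
import scipy.stats
 
def windowed_cross_corr_test(x, y, win=50, alpha=0.1):
    # split two long series into non-overlapping windows of length 'win'
    # and run the H_xy test (Cxy, Hxy as defined above) in every window
    results = []
    for beg in range(0, len(x) - win + 1, win):
        xw, yw = x[beg:beg+win], y[beg:beg+win]
        # standardise each window separately
        xw = (xw - np.mean(xw)) / np.std(xw, ddof=1)
        yw = (yw - np.mean(yw)) / np.std(yw, ddof=1)
        N = len(xw)
        L = int(np.floor(N**0.4))
        H = Hxy(xw, yw, L, N)
        pvalue = 1 - scipy.stats.chi2.cdf(H, L)
        results.append((beg, H, pvalue, pvalue < alpha))
    return results  # (window start index, H_xy statistic, p-value, significant?)

The same loop applies to $H_{xyy}(N)$; windows flagged as significant point at the time periods within which the non-linear dependence was actually present.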

In the Section that follows, we focus solely on the application of the first approach. It is rewarding in many aspects if you're open-minded.

2.4. Detecting Non-Linearity between Oil Prices and Stock Fundamentals

Looking for a direct correlation between crude oil prices and stocks can be regarded as only the first approach to the examination of their mutual relationships; it's not sexy, not any more. With the cross-correlation and cross-bicorrelation methods you can step up and break the boredom.

The stock fundamentals constitute a broad space of parameters that deserve to be cross-correlated with the oil prices. It is much more intuitive that any (if any) effect of rising or falling crude oil prices may affect different companies with various time lags. Moreover, for some businesses we might detect a non-linear relationship existing only between oil and, e.g., the company's revenues, whereas for others changing oil prices might affect more than one factor. If the pattern is indeed present, with the application of the cross-correlation and cross-bicorrelation methods you gain a brand new tool for data analysis. Let's have a look at a practical example.

As previously, we will need both the WTI Crude Oil price-series (see Section 1.1) and access to a large database containing companies' fundamentals over a number of quarters or years. For the latter, let me use a demo version of Quandl.com's Zacks Fundamentals Collection C available here (download the Zacks Fundamentals Expanded ZACKS-FE.csv file). It contains 852 rows of data for 30 selected companies traded on US stock markets (NYSE, NASDAQ, etc.). This sample collection covers data back to 2011, i.e. ca. 22 to 23 quarterly reported financials per stock. There are over 620 fundamentals to select from; you can find their full list here (download the ZFC Definitions csv file). Let's put it all in a new Python listing:

# Non-Linear Cross-Bicorrelations between the Oil Prices and Stock Fundamentals
# (c) 2016 QuantAtRisk.com, by Pawel Lachowicz
 
import numpy as np
import pandas as pd
import scipy.stats
import datetime
from matplotlib import pyplot as plt
 
%matplotlib inline
 
grey = 0.7, 0.7, 0.7
 
import warnings
warnings.filterwarnings('ignore')
 
 
# WTI Crude Oil price-series
# src: https://fred.stlouisfed.org/series/DCOILWTICO/downloaddata
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d')
wpi = pd.read_csv("DCOILWTICO.csv", parse_dates=['DATE'], date_parser=dateparse)
wpi = wpi[wpi.VALUE != "."]
wpi.VALUE = wpi.VALUE.astype(float)
 
# Zacks Fundamentals Collection C (Fundamentals Expanded)
# src: https://www.quandl.com/databases/ZFC
df = pd.read_csv('ZACKS-FE.csv')
 
col = ['Ticker', 'per_end_date', 'per_code', 'Revenue', "EBIT", 
          "GrossProfit", "CAPEX"]
data = pd.DataFrame(columns=col)  # create an empty DataFrame
 
tickers = ["XOM", "WMT", "BA", "AAPL", "JNJ"]
 
for i in range(df.shape[0]):
    ticker = df.iloc[i]["ticker"]
    pc = df.iloc[i]["per_code"]
    # extract only reported fundamentals for quarters "QR0 and all "QR-"
    # in total 22 to 23 rows of data per stock
    if((pc[0:3] == "QR-") | (pc[0:3] == "QR0")) and (ticker in tickers):
        data.loc[len(data)] = [df.iloc[i]["ticker"],
                               df.iloc[i]["per_end_date"],
                               df.iloc[i]["per_code"],
                               df.iloc[i]["tot_revnu"],
                               df.iloc[i]["ebit"],
                               df.iloc[i]["gross_profit"],
                               df.iloc[i]["cap_expense"]] 
 
print(data.head(25))

where we created a new DataFrame, data, storing the Total Revenue, EBIT, Gross Profit (revenue from operations less associated costs), and CAPEX (capital expenditure) for a sample of 5 stocks (XOM, WMT, BA, AAPL, and JNJ) out of the 30 available in the demo collection. Our goal here is just to show a path for a possible analysis of all those numbers, with an encouragement to build a bigger tool for stock screening based on the detection of non-linear correlations amongst selected fundamentals and oil prices (a sketch of such a screening loop follows below).
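
As a hint of what such a screening tool might look like, the sketch below loops over all tickers and fundamentals and collects the cross-correlation test p-values into a single table. It reuses the Cxy and Hxy functions from Section 2.3 and mirrors the $\pm 5$ day oil-price averaging described further down; the helper oil_around and all settings are my own illustrative choices:

import datetime
import numpy as np
import pandas as pd
import scipy.stats
 
def oil_around(dates, wpi, days=5):
    # average WTI price within +/- 'days' around each reporting date
    out = []
    for d in dates:
        d1 = d - datetime.timedelta(days=days)
        d2 = d + datetime.timedelta(days=days)
        out.append(wpi[(wpi.DATE >= d1) & (wpi.DATE <= d2)].VALUE.mean())
    return np.array(out)
 
screen = []
for ticker in tickers:
    tmp = data[data.Ticker == ticker].copy()
    tmp['per_end_date'] = pd.to_datetime(tmp['per_end_date'], format='%Y-%m-%d')
    x = oil_around(tmp.per_end_date, wpi)
    for f in ["Revenue", "EBIT", "GrossProfit", "CAPEX"]:
        y = tmp[f].values.astype(float)
        # standardise both series
        x_ = (x - x.mean()) / x.std(ddof=1)
        y_ = (y - y.mean()) / y.std(ddof=1)
        N = len(x_)
        L = int(np.floor(N**0.4))
        pval = 1 - scipy.stats.chi2.cdf(Hxy(x_, y_, L, N), L)
        screen.append([ticker, f, pval])
 
print(pd.DataFrame(screen, columns=["Ticker", "Fundamental", "p-value"]))

Sorting the resulting table by p-value immediately highlights the ticker/fundamental pairs worth a closer cross-bicorrelation look.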

The first 25 rows of the data DataFrame are:

   Ticker per_end_date per_code  Revenue     EBIT  GrossProfit    CAPEX
0    AAPL   2011-03-31    QR-22  24667.0   7874.0      10218.0  -1838.0
1    AAPL   2011-06-30    QR-21  28571.0   9379.0      11922.0  -2615.0
2    AAPL   2011-09-30    QR-20  28270.0   8710.0      11380.0  -4260.0
3    AAPL   2011-12-31    QR-19  46333.0  17340.0      20703.0  -1321.0
4    AAPL   2012-03-31    QR-18  39186.0  15384.0      18564.0  -2778.0
5    AAPL   2012-06-30    QR-17  35023.0  11573.0      14994.0  -4834.0
6    AAPL   2012-09-30    QR-16  35966.0  10944.0      14401.0  -8295.0
7    AAPL   2012-12-31    QR-15  54512.0  17210.0      21060.0  -2317.0
8    AAPL   2013-03-31    QR-14  43603.0  12558.0      16349.0  -4325.0
9    AAPL   2013-06-30    QR-13  35323.0   9201.0      13024.0  -6210.0
10   AAPL   2013-09-30    QR-12  37472.0  10030.0      13871.0  -8165.0
11   AAPL   2013-12-31    QR-11  57594.0  17463.0      21846.0  -1985.0
12   AAPL   2014-03-31    QR-10  45646.0  13593.0      17947.0  -3367.0
13   AAPL   2014-06-30     QR-9  37432.0  10282.0      14735.0  -5745.0
14   AAPL   2014-09-30     QR-8  42123.0  11165.0      16009.0  -9571.0
15   AAPL   2014-12-31     QR-7  74599.0  24246.0      29741.0  -3217.0
16   AAPL   2015-03-31     QR-6  58010.0  18278.0      23656.0  -5586.0
17   AAPL   2015-06-30     QR-5  49605.0  14083.0      19681.0  -7629.0
18   AAPL   2015-09-30     QR-4  51501.0  14623.0      20548.0 -11247.0
19   AAPL   2015-12-31     QR-3  75872.0  24171.0      30423.0  -3612.0
20   AAPL   2016-03-31     QR-2  50557.0  13987.0      19921.0  -5948.0
21   AAPL   2016-06-30     QR-1  42358.0  10105.0      16106.0  -8757.0
22   AAPL   2016-09-30      QR0  46852.0  11761.0      17813.0 -12734.0
23     BA   2011-03-31    QR-22  14910.0   1033.0       2894.0   -417.0
24     BA   2011-06-30    QR-21  16543.0   1563.0       3372.0   -762.0

We obtain an exemplary visualisation of both data sets for a selected company (here: Exxon Mobil Corporation, XOM, engaged in the exploration and production of crude oil and natural gas, the manufacturing of petroleum products, and the transportation and sale of crude oil, natural gas, and petroleum products) with:

fig, ax1 = plt.subplots(figsize=(10,5))
plt.plot(wpi.DATE, wpi.VALUE, color=grey, label="WPI Crude Oil (daily)")
plt.grid(True)
plt.legend(loc=1)
plt.ylabel("USD per barrel")
 
selection = ["XOM"]
 
ax2 = ax1.twinx()
plt.ylabel("1e6 USD")
for ticker in selection:
    tmp = data[data.Ticker == ticker]
    lab = ticker
    tmp['per_end_date'] =  pd.to_datetime(tmp['per_end_date'], 
                                                        format='%Y-%m-%d')
    plt.plot(tmp.per_end_date, tmp.Revenue, '.-', label=lab + ": Revenue")
    plt.plot(tmp.per_end_date, tmp.EBIT, '.-g', label=lab + ": EBIT")
    plt.plot(tmp.per_end_date, tmp.GrossProfit, '.-m', label=lab + ": Gross Profit")
    plt.plot(tmp.per_end_date, tmp.CAPEX, '.-r', label=lab + ": CAPEX")
    plt.legend(loc=3)

Every company has different dates (per_end_date) on which it reports its financials. That is why we need to make sure that, when comparing the selected fundamentals with oil prices, the oil price is picked up around the date defined by per_end_date. Let me use a $\pm 5$ day average around those points:

# cross-correlation
def Cxy(x, y, r, N):
    z = 0
    for i in range(0, N-r-1):
        z += x[i] * y[i+r]
    z /= (N-r)
    return z
 
# test statistic for cross-correlation
def Hxy(x, y, L, N):
    z = 0
    for r in range(1, L+1):
        z += Cxy(x, y, r, N)**2
    z *= (N - L)   
    return z
 
# cross-bicorrelation
def Cxyy(x, y, r, s, N):
    z = 0
    m = np.max([r, s])
    for i in range(0, N-m-1):
        z += x[i] * y[i+r] * y[i+s]
    z /= (N-m)
    return z
 
def Hxyy(x, y, L, N):
    z = 0
    for s in range(2, L+1):
        for r in range(1, L+1):
            m = np.max([r, s])
            z += (N-m) * Cxyy(x, y, r, s, N)**2
    return z
 
 
fig, ax1 = plt.subplots(figsize=(10,5))
plt.plot(wpi.DATE, wpi.VALUE, color=grey, label="WPI Crude Oil (daily)")
plt.grid(True)
plt.legend(loc=1)
plt.ylabel("USD per barrel")
 
selection = ["XOM"]
 
ax2 = ax1.twinx()
plt.ylabel("1e6 USD")
for ticker in selection:
    tmp = data[data.Ticker == ticker]
    lab = ticker
    tmp['per_end_date'] =  pd.to_datetime(tmp['per_end_date'], 
                                               format='%Y-%m-%d')
    plt.plot(tmp.per_end_date, tmp.Revenue, '.-', label=lab + ": Revenue")
    plt.plot(tmp.per_end_date, tmp.EBIT, '.-g', label=lab + ": EBIT")
    plt.plot(tmp.per_end_date, tmp.GrossProfit, '.-m', label=lab + ": Gross Profit")
    plt.plot(tmp.per_end_date, tmp.CAPEX, '.-r', label=lab + ": CAPEX")
    plt.legend(loc=3)
 
    col = ['per_end_date', 'avgWPI']
    wpi2 = pd.DataFrame(columns=col)  # create an empty DataFrame
 
    for i in range(tmp.shape[0]):
        date = tmp.iloc[i]["per_end_date"]
        date0 = date
        date1 = date + datetime.timedelta(days=-5)
        date2 = date + datetime.timedelta(days=+5)
        wpiavg = wpi[(wpi.DATE >= date1) & (wpi.DATE <= date2)]
        avg = np.mean(wpiavg.VALUE)
        wpi2.loc[len(wpi2)] = [date0, avg]
 
    # plot quarterly averaged oil price-series
    plt.sca(ax1)  # set ax1 axis for plotting
    plt.plot(wpi2.per_end_date, wpi2.avgWPI, '.-k')

In the resulting figure, a black line denotes the quarterly averaged WTI Crude Oil prices, stored now in a new DataFrame, wpi2.

Let's leave only XOM in the selection list variable. After the listing above, the DataFrame tmp stores XOM's selected fundamentals data. In what follows, we will use both DataFrames (tmp and wpi2) to look for non-linear correlations, the final goal of this investigation:

    # select some factors
    fundamentals = ["Revenue", "EBIT", "GrossProfit", "CAPEX"]
 
    # compute for them cross-correlations and cross-bicorrelations
    # and display final results
    #
    for f in fundamentals:
        print("%s: %s\n" % (ticker, f))
        # input data
        x = wpi2.avgWPI.values
        y = tmp[f].values
 
        # normalised time-series
        x = (x - np.mean(x))/np.std(x, ddof=1)
        y = (y - np.mean(y))/np.std(y, ddof=1)
 
        N = len(x)  # len(x) should be equal to len(y)
        L = int(np.floor(N**0.4))
 
        print("Cross-Correlation")
        for r in range(1, L+1):
            print("  r =%2g\t Cxy(r=%g) = %.4f" % 
                  (r, r, Cxy(x, y, r, N)))
 
        dof = L
        pvalue = 1 - scipy.stats.chi2.cdf(Hxy(x, y, L, N), dof)
        alpha = 0.1
 
        print("  Hxy(N) = %.4f" % Hxy(x, y, L, N))
        print("  p-value = %.4f" % pvalue)
        print("  H0 rejected in favour of H1 at %g%% c.l.: %s" % 
                                      (100*(1-alpha), pvalue<alpha))
 
        print("Cross-Bicorrelation")
        for r in range(0, L+1):
            for s in range(r+1, L+1):
                print("  r =%2g, s =%2g\t Cxyy(r=%g, s=%g) = %.5f" % 
                     (r, s, r, s, Cxyy(x, y, r, s, N)))
 
        dof = (L-1)*(L/2)
        pvalue = 1 - scipy.stats.chi2.cdf(Hxyy(x, y, L, N), dof)
 
        print("  Hxyy(N) = %.4f" % Hxyy(x, y, L, N))
        print("  p-value = %.4f" % pvalue)
        print("  H0 rejected in favour of H1 at %g%% c.l.: %s\n" % 
                                       (100*(1-alpha), pvalue<alpha))

Our Python code generates for XOM the following output:

XOM: Revenue
 
Cross-Correlation
  r = 1	 Cxy(r=1) = 0.7869
  r = 2	 Cxy(r=2) = 0.6239
  r = 3	 Cxy(r=3) = 0.4545
  Hxy(N) = 24.3006
  p-value = 0.0000
  H0 rejected in favour of H1 at 90% c.l.: True
Cross-Bicorrelation
  r = 0, s = 1	 Cxyy(r=0, s=1) = -0.44830
  r = 0, s = 2	 Cxyy(r=0, s=2) = -0.30186
  r = 0, s = 3	 Cxyy(r=0, s=3) = -0.19315
  r = 1, s = 2	 Cxyy(r=1, s=2) = -0.36678
  r = 1, s = 3	 Cxyy(r=1, s=3) = -0.22161
  r = 2, s = 3	 Cxyy(r=2, s=3) = -0.24446
  Hxyy(N) = 10.3489
  p-value = 0.0000
  H0 rejected in favour of H1 at 90% c.l.: True
 
XOM: EBIT
 
Cross-Correlation
  r = 1	 Cxy(r=1) = 0.7262
  r = 2	 Cxy(r=2) = 0.5881
  r = 3	 Cxy(r=3) = 0.3991
  Hxy(N) = 20.6504
  p-value = 0.0001
  H0 rejected in favour of H1 at 90% c.l.: True
Cross-Bicorrelation
  r = 0, s = 1	 Cxyy(r=0, s=1) = -0.39049
  r = 0, s = 2	 Cxyy(r=0, s=2) = -0.25963
  r = 0, s = 3	 Cxyy(r=0, s=3) = -0.17480
  r = 1, s = 2	 Cxyy(r=1, s=2) = -0.28698
  r = 1, s = 3	 Cxyy(r=1, s=3) = -0.18030
  r = 2, s = 3	 Cxyy(r=2, s=3) = -0.21284
  Hxyy(N) = 6.8759
  p-value = 0.0001
  H0 rejected in favour of H1 at 90% c.l.: True
 
XOM: GrossProfit
 
Cross-Correlation
  r = 1	 Cxy(r=1) = 0.7296
  r = 2	 Cxy(r=2) = 0.5893
  r = 3	 Cxy(r=3) = 0.3946
  Hxy(N) = 20.7065
  p-value = 0.0001
  H0 rejected in favour of H1 at 90% c.l.: True
Cross-Bicorrelation
  r = 0, s = 1	 Cxyy(r=0, s=1) = -0.38552
  r = 0, s = 2	 Cxyy(r=0, s=2) = -0.24803
  r = 0, s = 3	 Cxyy(r=0, s=3) = -0.17301
  r = 1, s = 2	 Cxyy(r=1, s=2) = -0.28103
  r = 1, s = 3	 Cxyy(r=1, s=3) = -0.17393
  r = 2, s = 3	 Cxyy(r=2, s=3) = -0.19780
  Hxyy(N) = 6.2095
  p-value = 0.0001
  H0 rejected in favour of H1 at 90% c.l.: True
 
XOM: CAPEX
 
Cross-Correlation
  r = 1	 Cxy(r=1) = -0.2693
  r = 2	 Cxy(r=2) = -0.3024
  r = 3	 Cxy(r=3) = -0.2333
  Hxy(N) = 4.3682
  p-value = 0.2244
  H0 rejected in favour of H1 at 90% c.l.: False
Cross-Bicorrelation
  r = 0, s = 1	 Cxyy(r=0, s=1) = 0.00844
  r = 0, s = 2	 Cxyy(r=0, s=2) = -0.10889
  r = 0, s = 3	 Cxyy(r=0, s=3) = -0.14787
  r = 1, s = 2	 Cxyy(r=1, s=2) = -0.10655
  r = 1, s = 3	 Cxyy(r=1, s=3) = -0.16628
  r = 2, s = 3	 Cxyy(r=2, s=3) = -0.08072
  Hxyy(N) = 6.4376
  p-value = 0.2244
  H0 rejected in favour of H1 at 90% c.l.: False

A quick analysis points at the detection of significant non-linear correlations (at the 90% confidence level) between WTI Crude Oil and Revenue, EBIT, and GrossProfit for the lag $r=1$ (based on cross-correlations) and for the lags $(r=0, s=1)$ (based on cross-bicorrelations). There is no significant non-linear relationship between XOM's CAPEX and the oil prices.

Interestingly, rerunning the code for the Boeing Company (NYSE ticker: BA) reveals no significant cross-(bi)correlations between the same factors and oil. For example, Boeing's revenue improves year-over-year, but it appears to have no significant non-linear link to the oil prices whatsoever:

BA: Revenue
 
Cross-Correlation
  r = 1	 Cxy(r=1) = -0.3811
  r = 2	 Cxy(r=2) = -0.3216
  r = 3	 Cxy(r=3) = -0.1494
  Hxy(N) = 5.4200
  p-value = 0.1435
  H0 rejected in favour of H1 at 90% c.l.: False
Cross-Bicorrelation
  r = 0, s = 1	 Cxyy(r=0, s=1) = 0.11202
  r = 0, s = 2	 Cxyy(r=0, s=2) = 0.03588
  r = 0, s = 3	 Cxyy(r=0, s=3) = -0.08083
  r = 1, s = 2	 Cxyy(r=1, s=2) = 0.01526
  r = 1, s = 3	 Cxyy(r=1, s=3) = 0.00616
  r = 2, s = 3	 Cxyy(r=2, s=3) = -0.07252
  Hxyy(N) = 0.3232
  p-value = 0.1435
  H0 rejected in favour of H1 at 90% c.l.: False

The most interesting aspect of the cross-correlation and cross-bicorrelation methods applied to oil vs. stock-fundamentals research regards the new opportunities to discover correlations for stocks that so far have been completely ignored (or not suspected) to have anything in common with oil price movements in the markets. Dare to go down this path. I hope you will find new assets to invest in. The information is there. Now you are equipped with new tools. Use them wisely. Retire early.

REFERENCES
    Bernanke B. S., 2016, The relationship between stocks and oil prices
    Brooks C., Hinich M. J., 1999, Cross-correlations and cross-bicorrelations in Sterling exchange rates (pdf)
    Coronado et al., 2015, A study of co-movements between USA and Latin American stock markets: a cross-bicorrelations perspective (pdf)
    Hamilton J., 2014, Oil prices as an indicator of global economic conditions
    Hinich M. J., 1996, Testing for dependence in the input to a linear time series model, Journal of Nonparametric Statistics 6, 205–221
    Hinich M. J., Patterson D. M., 1995, Detecting epochs of transient dependence in white noise, Mimeo, University of Texas at Austin

DOWNLOADS
    COPPER.csv, DCOILWTICO.csv, DGS10.csv, DTWEXM.csv, VIXCLS.csv, ZACKS-FE.csv

Python for Algo-Trading: Workshops in Poland


QuantAtRisk.com would love to invite You to attend


where we will cover all essential aspects of algo-trading on the Warsaw Stock Exchange and selected EU/UK/US exchanges plus the FOREX markets.


Interested? Click here for Early-Bird Registration Special Offer and Details!

Organised and Led by
Dr. Pawel Lachowicz (Sydney, Australia)

Financial Time-Series Segmentation Based On Turning Points in Python

A determination of peaks and troughs for any financial time-series seems to be always in high demand, especially in algorithmic trading. A number of numerical methods can be found in the literature. The main problem arises when a smart differentiation between a local trend and the "global" sentiment needs to be translated into computer language. In this short post, we refer fully to the publication of Yin, Si, & Gong (2011) on Financial Time-Series Segmentation using Turning Points, wherein the authors proposed an appealing way to simplify the "noisy" character of financial (high-frequency) time-series.

Since this publication presents an easy-to-digest historical introduction to the problem along with a novel pseudo-code solution, let me skip that part here and refer you to the paper itself (download the .pdf here).

We develop a Python implementation of the pseudo-code as follows. We start with some data. Let us use the 4-level order-book record of the Hang Seng Index as traded on Jan 4, 2016 (download 20160104_orderbook.csv.zip; 8MB). The data cover both the morning and afternoon trading sessions:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
 
# Reading Orderbook
data = pd.read_csv('20160104_orderbook.csv')
data['MidPrice0'] = (data.AskPrice0 + data.BidPrice0)/2.  # mid-price
 
# Split Data according to Sessions
delta = np.diff(data.Timestamp)
# find a good separation index
k = np.where(delta > np.max(delta)/2)[0][0] + 1
 
data1 = data[0:k].copy()  # Session 12:15-15:00
data2 = data[k+1:].copy() # Session 16:00-19:15
data2.index = range(len(data2))
 
plt.figure(figsize=(10,5))
plt.plot(data1.Timestamp, data1.MidPrice0, 'r', label="Session 12:15-15:00")
plt.plot(data2.Timestamp, data2.MidPrice0, 'b', label="Session 16:00-19:15")
plt.legend(loc='best')
plt.axis('tight')

revealing the mid-price series for both sessions (red: 12:15-15:00, blue: 16:00-19:15).

The Turning Points pseudo-algorithm of Yin, Si, & Gong (2011) can be organised using simple Python functions in a straightforward way, namely:

def first_tps(p):
    tp = []
    for i in range(1, len(p)-1):
        if((p[i] < p[i+1]) and (p[i] < p[i-1])) or ((p[i] > p[i+1]) \
           and (p[i] > p[i-1])):
            tp.append(i)
    return tp
 
def contains_point_in_uptrend(i, p):
    if(p[i] < p[i+1]) and (p[i] < p[i+2]) and (p[i+1] < p[i+3]) and \
              (p[i+2] < p[i+3]) and \
              (abs(p[i+1] - p[i+2]) < abs(p[i] - p[i+2]) + abs(p[i+1] - p[i+3])):
        return True
    else:
        return False
 
def contains_point_in_downtrend(i, p):
    if(p[i] > p[i+1]) and (p[i] > p[i+2]) and (p[i+1] > p[i+3]) and \
           (p[i+2] > p[i+3]) and \
           (abs(p[i+2] - p[i+1]) < abs(p[i] - p[i+2]) + abs(p[i+1] - p[i+3])):
        return True
    else:
        return False
 
def points_in_the_same_trend(i, p, thr):
    if(abs(p[i]/p[i+2]-1) < thr) and (abs(p[i+1]/p[i+3]-1) < thr):
        return True
    else:
        return False
 
def turning_points(idx, p, thr):
    i = 0
    tp = []
    while(i < len(idx)-3):
        if contains_point_in_downtrend(idx[i], p) or \
           contains_point_in_uptrend(idx[i], p) \
              or points_in_the_same_trend(idx[i], p, thr):
            tp.extend([idx[i], idx[i+3]])
            i += 3
        else:
            tp.append(idx[i])
            i += 1
    return tp

The algorithm allows us to specify a number $k$ (or a range) of sub-levels for the time-series segmentation. The "deeper" we go, the more distinctive peaks and troughs remain. Have a look:

thr = 0.05
sep = 75  # separation for plotting
 
P1 = data1.MidPrice0.values
P2 = data2.MidPrice0.values
tp1 = first_tps(P1)
tp2 = first_tps(P2)
 
plt.figure(figsize=(16,10))
 
plt.plot(data1.Timestamp, data1.MidPrice0, 'r', label="Session 12:15-15:00")
plt.plot(data2.Timestamp, data2.MidPrice0, 'b', label="Session 16:00-19:15")
plt.legend(loc='best')
 
for k in range(1, 10):  # k over a given range of sub-levels
    tp1 = turning_points(tp1, P1, thr)
    tp2 = turning_points(tp2, P2, thr)
    plt.plot(data1.Timestamp[tp1], data1.MidPrice0[tp1]-sep*k, 'k')
    plt.plot(data2.Timestamp[tp2], data2.MidPrice0[tp2]-sep*k, 'k')
 
plt.axis('tight')
plt.ylabel('Price')
plt.xlabel('Timestamp')


It is highly tempting to use the code as a supportive indicator for the confirmation of new trends in a single time-series, or to build a concurrently running decomposition (segmentation at the same sub-level) for two or more parallel time-series (e.g. FX pairs). Enjoy!

Computation of the Loss Distribution not only for Operational Risk Managers

In Operational Risk Management, given a number of risk-type and business-line combinations, the quest is all about providing the risk management board with an estimation of the losses the bank (or any other financial institution, hedge fund, etc.) can suffer. If you think about it for a second, the spectrum of things that might go wrong is wide, e.g. the failure of a computer system, an internal or external fraud, clients, products, business practices, damage to physical goods, and so on. These, blended with business lines, e.g. corporate finance, trading and sales, retail banking, commercial banking, payment and settlement, agency services, asset management, or retail brokerage, return over 50 combinations of operational risk factors one needs to consider. Separately and carefully. And it's a tough one.

Why? A good question, "why?"! Simply because of two main reasons. First, for an operational risk manager the sample of data describing each risk is usually insufficient (statistically speaking: the sample is small over the life period of the financial organisation). Secondly, when something goes wrong, the next event (of the same kind) may take place in the not-too-distant future or in the far-distant future. The biggest problem the operational risk manager meets in his/her daily work regards the prediction of any losses due to operational failures. Therefore, the time of the (next) event enters the equation as an independent variable: the loss frequency distribution. The second player in the game is the loss severity distribution, i.e. if the worst strikes, how much might the bank/financial body/an investor/a trader lose?!

From a trader's perspective we know well that Value-at-Risk (VaR) and Expected Shortfall are two quantitative risk measures that address similar questions. But from the viewpoint of operational risk, the estimation of losses requires a different approach.

In this post, after Hull (2015), we present an algorithm in Python for the computation of the loss distribution given the best estimation of the loss frequency and loss severity distributions. Though designed with operational risk analysts in mind, in the end we argue its usefulness for any algo-trader and/or portfolio risk manager.
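
To give a flavour of what follows, below is a minimal Monte-Carlo sketch of the standard recipe (after Hull 2015): draw the number of loss events per year from the loss frequency distribution, draw the size of each loss from the loss severity distribution, and aggregate. The Poisson/lognormal choice and all parameter values are purely illustrative assumptions, not estimates for any real bank:

import numpy as np
 
np.random.seed(2)
 
lam, mu, sigma = 3.0, 0.0, 0.75   # hypothetical: 3 events/year; lognormal severity in $M
nsim = 100000                     # number of simulated years
 
total_loss = np.zeros(nsim)
for i in range(nsim):
    n = np.random.poisson(lam)    # loss frequency: number of events in that year
    if n > 0:
        # loss severity: sum of n independent lognormal losses
        total_loss[i] = np.sum(np.random.lognormal(mu, sigma, n))
 
# risk measures read off the simulated aggregate loss distribution
var99 = np.percentile(total_loss, 99)
es99 = total_loss[total_loss > var99].mean()
print("mean annual loss = %.2f $M,  99%% VaR = %.2f $M,  99%% ES = %.2f $M"
      % (total_loss.mean(), var99, es99))

The same skeleton serves a trader or portfolio risk manager once the frequency and severity distributions are estimated from their own loss records.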

1. Operational Losses: Case Study of the Vanderloo Bank

Access to operational loss data is much harder to obtain than in the case of stocks traded on an exchange. The data usually stay within the walls of the bank, with internal access only. A recommended practice for operational risk managers around the world is to share those unique data despite confidentiality. Only then can we build a broader knowledge and understanding of risks and losses incurred due to operational activities.

Let’s consider a case study of a hypothetical Vanderloo Bank. The bank was founded in 1988 in the Netherlands and its main line of business was concentrated around building unique customer relationships and loans for small local businesses. Despite a vivid vision and firmly set goals for the future, Vanderloo Bank could not avoid a number of operational roadblocks that led to substantial operational losses:

Year Month Day Business Line Risk Category Loss ($M)
0 1989.0 1.0 13.0 Trading and Sales Internal Fraud 0.530597
1 1989.0 2.0 9.0 Retail Brokerage Process Failure 0.726702
2 1989.0 4.0 14.0 Trading and Sales System Failure 1.261619
3 1989.0 6.0 11.0 Asset Managment Process Failure 1.642279
4 1989.0 7.0 23.0 Corporate Finance Process Failure 1.094545
5 1990.0 10.0 21.0 Trading and Sales Employment Practices 0.562122
6 1990.0 12.0 24.0 Payment and Settlement Process Failure 4.009160
7 1991.0 8.0 23.0 Asset Managment Business Practices 0.495025
8 1992.0 1.0 28.0 Asset Managment Business Practices 0.857785
9 1992.0 3.0 14.0 Commercial Banking Damage to Assets 1.257536
10 1992.0 5.0 26.0 Retail Banking Internal Fraud 1.591007
11 1992.0 8.0 9.0 Corporate Finance Employment Practices 0.847832
12 1993.0 1.0 11.0 Corporate Finance System Failure 1.314225
13 1993.0 1.0 19.0 Retail Banking Internal Fraud 0.882371
14 1993.0 2.0 24.0 Retail Banking Internal Fraud 1.213686
15 1993.0 6.0 12.0 Commercial Banking System Failure 1.231784
16 1993.0 6.0 16.0 Agency Services Damage to Assets 1.316528
17 1993.0 7.0 11.0 Retail Banking Process Failure 0.834648
18 1993.0 9.0 21.0 Retail Brokerage Process Failure 0.541243
19 1993.0 11.0 11.0 Asset Managment Internal Fraud 1.380636
20 1994.0 11.0 22.0 Retail Banking External Fraud 1.426433
21 1995.0 2.0 14.0 Commercial Banking Process Failure 1.051281
22 1995.0 11.0 21.0 Commercial Banking External Fraud 2.654861
23 1996.0 8.0 17.0 Agency Services Process Failure 0.837237
24 1997.0 7.0 13.0 Retail Brokerage Internal Fraud 1.107019
25 1997.0 7.0 24.0 Agency Services External Fraud 1.513146
26 1997.0 8.0 8.0 Retail Banking Process Failure 1.002040
27 1997.0 9.0 2.0 Agency Services Damage to Assets 0.646596
28 1997.0 9.0 12.0 Retail Banking Employment Practices 0.966086
29 1998.0 1.0 8.0 Retail Banking Internal Fraud 0.938803
30 1998.0 1.0 12.0 Retail Banking System Failure 0.922069
31 1998.0 2.0 5.0 Asset Managment Process Failure 1.042259
32 1998.0 4.0 18.0 Commercial Banking External Fraud 0.969562
33 1998.0 5.0 12.0 Retail Banking External Fraud 0.683715
34 1999.0 1.0 3.0 Trading and Sales Internal Fraud 2.035785
35 1999.0 4.0 27.0 Retail Brokerage Business Practices 1.074277
36 1999.0 5.0 8.0 Retail Banking Employment Practices 0.667655
37 1999.0 7.0 10.0 Agency Services System Failure 0.499982
38 1999.0 7.0 17.0 Retail Brokerage Process Failure 0.803826
39 2000.0 1.0 26.0 Commercial Banking Business Practices 0.714091
40 2000.0 7.0 23.0 Trading and Sales System Failure 1.479367
41 2001.0 6.0 16.0 Retail Brokerage System Failure 1.233686
42 2001.0 11.0 5.0 Agency Services Process Failure 0.926593
43 2002.0 5.0 14.0 Payment and Settlement Damage to Assets 1.321291
44 2002.0 11.0 11.0 Retail Banking External Fraud 1.830254
45 2003.0 1.0 14.0 Corporate Finance System Failure 1.056228
46 2003.0 1.0 28.0 Asset Managment System Failure 1.684986
47 2003.0 2.0 28.0 Commercial Banking Damage to Assets 0.680675
48 2004.0 1.0 11.0 Asset Managment Process Failure 0.559822
49 2004.0 6.0 19.0 Commercial Banking Internal Fraud 1.388681
50 2004.0 7.0 3.0 Retail Banking Internal Fraud 0.886769
51 2004.0 7.0 21.0 Retail Brokerage Employment Practices 0.606049
52 2004.0 7.0 27.0 Asset Managment Employment Practices 1.634348
53 2004.0 11.0 26.0 Asset Managment Damage to Assets 0.983355
54 2005.0 1.0 9.0 Corporate Finance Damage to Assets 0.969710
55 2005.0 9.0 17.0 Commercial Banking System Failure 0.634609
56 2006.0 2.0 24.0 Agency Services Business Practices 0.637760
57 2006.0 3.0 21.0 Retail Banking Employment Practices 1.072489
58 2006.0 6.0 25.0 Payment and Settlement System Failure 0.896459
59 2006.0 12.0 25.0 Trading and Sales Process Failure 0.731953
60 2007.0 6.0 9.0 Commercial Banking System Failure 0.918233
61 2008.0 1.0 5.0 Corporate Finance External Fraud 0.929702
62 2008.0 2.0 14.0 Retail Brokerage System Failure 0.640201
63 2008.0 2.0 14.0 Commercial Banking Internal Fraud 1.580574
64 2008.0 3.0 18.0 Corporate Finance Process Failure 0.731046
65 2009.0 2.0 1.0 Agency Services System Failure 0.630870
66 2009.0 2.0 6.0 Retail Banking External Fraud 0.639761
67 2009.0 4.0 14.0 Payment and Settlement Internal Fraud 1.022987
68 2009.0 5.0 25.0 Retail Banking Business Practices 1.415880
69 2009.0 7.0 8.0 Retail Banking Business Practices 0.906526
70 2009.0 12.0 26.0 Agency Services System Failure 1.463529
71 2010.0 2.0 13.0 Asset Managment Damage to Assets 0.664935
72 2010.0 3.0 24.0 Payment and Settlement Process Failure 1.848318
73 2010.0 10.0 16.0 Commercial Banking External Fraud 1.020736
74 2010.0 12.0 27.0 Retail Banking Employment Practices 1.126265
75 2011.0 2.0 5.0 Retail Brokerage Process Failure 1.549890
76 2011.0 6.0 24.0 Corporate Finance Damage to Assets 2.153238
77 2011.0 11.0 6.0 Asset Managment System Failure 0.601332
78 2011.0 12.0 1.0 Payment and Settlement External Fraud 0.551183
79 2012.0 2.0 21.0 Corporate Finance External Fraud 1.866740
80 2013.0 4.0 22.0 Retail Brokerage External Fraud 0.672756
81 2013.0 6.0 27.0 Payment and Settlement Employment Practices 1.119233
82 2013.0 8.0 17.0 Commercial Banking System Failure 1.034078
83 2014.0 3.0 1.0 Asset Managment Employment Practices 2.099957
84 2014.0 4.0 4.0 Retail Brokerage External Fraud 0.929928
85 2014.0 6.0 5.0 Retail Banking System Failure 1.399936
86 2014.0 11.0 17.0 Asset Managment Process Failure 1.299063
87 2014.0 12.0 3.0 Agency Services System Failure 1.787205
88 2015.0 2.0 2.0 Payment and Settlement System Failure 0.742544
89 2015.0 6.0 23.0 Commercial Banking Employment Practices 2.139426
90 2015.0 7.0 18.0 Trading and Sales System Failure 0.499308
91 2015.0 9.0 9.0 Retail Banking Employment Practices 1.320201
92 2015.0 9.0 18.0 Corporate Finance Business Practices 2.901466
93 2015.0 10.0 21.0 Commercial Banking Internal Fraud 0.808329
94 2016.0 1.0 9.0 Retail Banking Internal Fraud 1.314893
95 2016.0 3.0 28.0 Asset Managment Business Practices 0.702811
96 2016.0 3.0 25.0 Payment and Settlement Internal Fraud 0.840262
97 2016.0 4.0 6.0 Retail Banking Process Failure 0.465896

 

Having a record of 98 events, we can now begin building a statistical picture of the loss frequency and loss severity distributions.

2. Loss Frequency Distribution

For loss frequency, the natural probability distribution to use is a Poisson distribution. It assumes that losses happen randomly through time so that in any short period of time $\Delta t$ there is a probability of $\lambda \Delta t$ of a loss occurring. The probability of $n$ losses in time $T$ [years] is:
$$
\mbox{Pr} = \exp{(-\lambda T)} \frac{(\lambda T)^n}{n!}
$$ where the parameter $\lambda$ can be estimated as the average number of losses per year (Hull 2015). Given our table in the Python pandas’ DataFrame format, df, we code:

# Computation of the Loss Distribution not only for Operational Risk Managers
# (c) 2016 QuantAtRisk.com, Pawel Lachowicz
 
from scipy.stats import lognorm, norm, poisson
from matplotlib  import pyplot as plt
import numpy as np
import pandas as pd
 
# reading Vanderloo Bank operational loss data
df = pd.read_hdf('vanderloo.h5', 'df')
 
# count the number of loss events in given year
fre = df.groupby("Year").size()
print(fre)

where the last operation groups and displays the number of losses in each year:

Year
1989.0    5
1990.0    2
1991.0    1
1992.0    4
1993.0    8
1994.0    1
1995.0    2
1996.0    1
1997.0    5
1998.0    5
1999.0    5
2000.0    2
2001.0    2
2002.0    2
2003.0    3
2004.0    6
2005.0    2
2006.0    4
2007.0    1
2008.0    4
2009.0    6
2010.0    4
2011.0    4
2012.0    1
2013.0    3
2014.0    5
2015.0    6
2016.0    4
dtype: int64

The estimation of Poisson’s $\lambda$ requires solely the computation of:

# estimate lambda parameter
lam = np.sum(fre.values) / (df.Year[df.shape[0]-1] - df.Year[0])
print(lam)
3.62962962963

which informs us that during the 1989–2016 period, i.e. over 27 years, there were on average $\lambda = 3.6$ losses per year. Assuming the Poisson distribution as the best descriptor of the loss frequency distribution, we model the probability of operational losses of the Vanderloo Bank in the following way:

# draw random variables from a Poisson distribution with lambda=lam
prvs = poisson.rvs(lam, size=(10000))
 
# plot the pdf (loss frequency distribution)
h = plt.hist(prvs, bins=range(0, 11))
plt.close("all")
y = h[0]/np.sum(h[0])
x = h[1]
 
plt.figure(figsize=(10, 6))
plt.bar(x[:-1], y, width=0.7, align='center', color="#2c97f1")
plt.xlim([-1, 11])
plt.ylim([0, 0.25])
plt.ylabel("Probability", fontsize=12)
plt.title("Loss Frequency Distribution", fontsize=14)
plt.savefig("f01.png")

revealing:
f01
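
As a quick cross-check (a sketch, not part of the original code), the bar heights above can be compared against the analytic Poisson probabilities given by the formula earlier in this Section, using lam estimated a few lines above:

# Sketch: analytic Poisson probabilities Pr(n) = exp(-lam*T) (lam*T)^n / n!
# for T = 1 year; these should closely match the simulated bar heights above.
from scipy.stats import poisson

for n in range(6):
    print('Pr(n=%d) = %.4f' % (n, poisson.pmf(n, lam)))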

3. Loss Severity Distribution

The data collected in the last column of df allow us to plot and estimate the best fit of the loss severity distribution. In the practice of operational risk managers, the lognormal distribution is a common choice:

c = .7, .7, .7  # define grey color
 
plt.figure(figsize=(10, 6))
plt.hist(df["Loss ($M)"], bins=25, color=c, normed=True)
plt.xlabel("Incurred Loss ($M)", fontsize=12)
plt.ylabel("N", fontsize=12)
plt.title("Loss Severity Distribution", fontsize=14)
 
x = np.arange(0, 5, 0.01)
sig, loc, scale = lognorm.fit(df["Loss ($M)"])
pdf = lognorm.pdf(x, sig, loc=loc, scale=scale)
plt.plot(x, pdf, 'r')
plt.savefig("f02.png")
 
print(sig, loc, scale)  # lognormal pdf's parameters
0.661153638163 0.328566816132 0.647817560825

where the lognormal probability density function (pdf) we use (in SciPy's parameterisation) is given by:
$$
p(y;\ \sigma, loc, scale) = \frac{1}{scale} \, \frac{1}{x\sigma\sqrt{2\pi}} \exp{ \left[ -\frac{1}{2} \left(\frac{\log{x}}{\sigma} \right)^2 \right] }
$$
where $x = (y - loc)/scale$. The fit of the pdf to the data returns:
f02
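
If in doubt about the parameterisation above (the 1/scale factor comes from the change of variable), a minimal sketch (not part of the original code) evaluates the formula by hand and compares it with SciPy, e.g. for an arbitrary example loss y = 1.0:

# Sketch: verify that the fitted lognormal pdf matches the formula above,
# using sig, loc, scale obtained from the fit; y = 1.0 is an arbitrary loss.
import numpy as np
from scipy.stats import lognorm

y = 1.0
x = (y - loc)/scale
manual = 1/scale * 1/(x*sig*np.sqrt(2*np.pi)) * np.exp(-0.5*(np.log(x)/sig)**2)
print(manual, lognorm.pdf(y, sig, loc=loc, scale=scale))  # both values should agree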

4. Loss Distribution

The loss frequency distribution must be combined with the loss severity distribution for each risk type/business line combination in order to determine a loss distribution. The most common assumption here is that loss severity is independent of loss frequency. Hull (2015) suggests the following steps to be taken in building the Monte Carlo simulation leading to modelling of the loss distribution:

1. Sample from the frequency distribution to determine the number of loss events ($n$)
2. Sample $n$ times from the loss severity distribution to determine the loss experienced
      for each loss event ($L_1, L_2, …, L_n$)
3. Determine the total loss experienced ($=L_1 + L_2 + … + L_n$)



When many simulation trials are used, we obtain a total distribution for losses of the type being considered. In Python we code those steps in the following way:

def loss(r, loc, sig, scale, lam):
    X = []
    for x in range(11):  # up to 10 loss events considered
        if(r < poisson.cdf(x, lam)):  # x denotes a loss number
            out = 0
        else:
            out = lognorm.rvs(s=sig, loc=loc, scale=scale)
        X.append(out)
    return np.sum(X)  # = L_1 + L_2 + ... + L_n
 
 
# run 1e5 Monte Carlo simulations
losses = []
for _ in range(100000):
    r = np.random.random()
    losses.append(loss(r, loc, sig, scale, lam))
 
 
h = plt.hist(losses, bins=range(0, 16))
_ = plt.close("all")
y = h[0]/np.sum(h[0])
x = h[1]
 
plt.figure(figsize=(10, 6))
plt.bar(x[:-1], y, width=0.7, align='center', color="#ff5a19")
plt.xlim([-1, 16])
plt.ylim([0, 0.20])
plt.title("Modelled Loss Distribution", fontsize=14)
plt.xlabel("Loss ($M)", fontsize=12)
plt.ylabel("Probability of Loss", fontsize=12)
plt.savefig("f03.png")

revealing:
f03

The loss function has been designed so that it considers up to 10 loss events. We run $10^5$ simulations. In each trial we first draw a random number r from a uniform distribution. For every loss number $x$ ($x = 0, 1, …, 10$), if r is less than the value of the Poisson cumulative distribution function (with $\lambda = 3.6$) at x, we assume a zero loss; otherwise we draw a random variable from the lognormal distribution (with the parameters found via the fitting procedure a few lines earlier). Simple as that.
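
For comparison, an equivalent and perhaps more direct formulation (a sketch only, not the author's implementation) draws the number of loss events straight from the Poisson distribution and sums that many lognormal severities; it should reproduce a very similar loss distribution:

# Sketch: draw the number of loss events n directly from Poisson(lam) and
# sum n lognormal severities, using lam, sig, loc, scale fitted above.
import numpy as np
from scipy.stats import lognorm, poisson

losses_alt = []
for _ in range(100000):
    n = poisson.rvs(lam)                                  # number of loss events
    L = lognorm.rvs(s=sig, loc=loc, scale=scale, size=n)  # severity of each event
    losses_alt.append(np.sum(L))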

The resultant loss distribution as shown in the chart above describes the expected severity of future losses (due to operational “fatal” activities of Vanderloo Bank) given by the corresponding probabilities.

5. Beyond Operational Risk Management

A natural extension of the numerical procedure applied here is the modelling of, e.g., the anticipated (predicted) loss distribution for any portfolio of N assets. One can estimate it based on the track record of losses incurred in trading to date. By doing so, we gain an additional tool in our arsenal of quantitative risk measures and modelling. Stay tuned, as a new post will illustrate that case.

DOWNLOAD
     vanderloo.h5

REFERENCES
    Hull, J., 2015, Risk Management and Financial Institutions, 4th Ed.

Probability of Black Swan Events at NYSE

Featured by a Russian website Financial One (Apr 26, 2016)

The prediction of extreme rare events (EREs) in the financial markets remains one of the toughest problems. Firstly, because of the very limited knowledge we have of their distribution and underlying correlations across the markets. Literally, we walk in the dark, hoping it won’t happen today, not to the stocks we hold in our portfolios. But is that darkness really so dark?

In this post we will use textbook knowledge of classical statistics in order to analyze the universe of 2500+ stocks traded at the New York Stock Exchange (NYSE) that experienced the most devastating daily loss in their history (1 data point per stock). Based on that, we will try to shed new light on the most mind-boggling questions: when will the next ERE happen, how long do we need to wait for it, and if it hits, what will its magnitude “of devastation” be?

1. Extreme Rare Events of NYSE

An ERE meeting certain criteria has its own name in the financial world: a black swan. We will define it in the next Section. Historically speaking, at some point, people were convinced that only white swans existed until someone noticed a population of black swans in Western Australia. The impossible became possible.

We observe the same in the stock markets. Black Monday on Oct 19, 1987 is no doubt the greatest historical event where the markets suffered 20%+ losses. The event was so rare that no risk analyst could predict it and/or protect against it. History repeated itself in 2000 when the Internet bubble caused many IT stocks to lose a lot over a period of a few weeks. The events of 9/11 in New York forced the closure of the US stock exchanges for 6 days, and on Sep 17, as a consequence of fear, about 25 stocks at NYSE lost more than 15% in one day. Uncertainty cannot be avoided but it can be minimized. Is it so? The financial crisis of 2007-2008 triggered a significant number of black swans to fly again and fly high…

Did we learn enough from that? Do we protect more efficiently against black swans after 2009 market recovery? Will you be surprised if I tell you “no”?!

Let’s conduct a data analysis of the worst EREs ever spotted at NYSE among all its stocks. To do that, we will use the Python language and the Yahoo! Finance database. The data quality is not required to be superbly accurate and we will download only adjusted-close prices with the corresponding dates. A list of companies (inter alia their tickers) traded at NYSE can be extracted from a .xls file available for download at http://www.nasdaq.com/screening/company-list.aspx. It contains 3226 tickers, while, as we will see below, Yahoo! Finance recognizes only 2716 of them. Since 2716 is a huge data set meeting a lot of requirements for statistical significance, I’m more than happy to go along with it.

In the process of stock data download, we look for information on the most extreme daily loss per stock and collect them all in a pandas’ DataFrame, dfblack. It will contain a Date, the magnitude of the ERE (Return), and Delta0. The latter we calculate on the spot as the number of business days between some initial point in time (here, 1900-01-01) and the day of the ERE occurrence. The processing time may take up to 45 min depending on your Internet speed. Be patient, it will be worth waiting that long. Using HDF5, we save the dfblack DataFrame on disk.

# Probability of Black Swan Events at NYSE
# (c) 2016 by Pawel Lachowicz, QuantAtRisk.com
 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pandas_datareader.data as web
import datetime as dt
from scipy.stats import gumbel_l, t, norm, expon, poisson
from math import factorial
 
# Constants
c = (.7, .7, .7)  # grey color
 
col = ['Date', 'Return', 'Delta0']
dfblack = pd.DataFrame(columns=col)  # create an empty DataFrame
 
i = n = 0
for lstline in open("nyseall.lst",'r').readlines():
    ticker=lstline[:-1]
    i += 1
    print(i, ticker)
    try:
        ok = True
        data = web.DataReader(str(ticker), data_source='yahoo', 
                 start='1900-01-01', end='2016-04-14')
    except:
        ok = False
    if(ok):
        n += 1
        data['Return'] = data['Adj Close' ] / data['Adj Close'].shift(1) - 1
        data = data[1:]  # skip first row
 
        df = data['Return'].copy()
        dfs = df.sort_values()
        start = dt.date(1900, 1, 1)
        end = dt.date(dfs.index[0].year, dfs.index[0].month, dfs.index[0].day)
        delta0 = np.busday_count(start,end)
        tmin = end
        rmin = dfs[0]
        dfblack.loc[len(dfblack)]=[tmin, rmin, delta0]  # add new row to dfblack
 
    else:
        print(' not found')
 
dfblack.set_index(dfblack.Date, inplace=True)  # set index by Date
dfblack.sort_index(inplace=True)  # sort DataFrame by Date
del dfblack['Date']  # remove Date column
 
print("No. of stocks with valid data: %g" % n)
 
# saving to file
store = pd.HDFStore('dfblack.h5')
store['dfblack'] = dfblack
store.close()
 
# reading from file
store = pd.HDFStore('dfblack.h5')
dfblack = pd.read_hdf('dfblack.h5', 'dfblack')
store.close()
 
dfblack0 = dfblack.copy()  # a backup copy

where nyseall.lst is a plain text file listing all NYSE tickers:

DDD
MMM
WBAI
WUBA
AHC
...

The first instinct of a savvy risk analyst would be to plot the distribution of the magnitudes of EREs as extracted for the NYSE universe. We achieve that by:

plt.figure(num=1, figsize=(9, 5))
plt.hist(dfblack.Return, bins=50, color='grey')  # a histogram
plt.xlabel('Daily loss', fontsize = 12)
plt.xticks(fontsize = 12)
 
plt.savefig('fig01.png', format='png')

which reveals:
fig01
Peaked around -18% with a long left tail of losses reaching nearly -98%. Welcome to the devil’s playground! But, that’s just the beginning. The information on dates corresponding to the occurrence of those EREs allows us to build a 2D picture:

# Black Monday data sample
df_bm = dfblack[dfblack.index == dt.date(1987, 10, 19)]  
# Reopening of NYSE after 9/11 data sample
df_wtc = dfblack[dfblack.index == dt.date(2001, 9, 17)]  
# Financial Crisis of 2008 data sample
df_cre = dfblack[(dfblack.index >= dt.date(2008, 9, 19)) & 
           (dfblack.index <= dt.date(2009, 3, 6))]
# Drawdown of 2015/2016 data sample
df_16 = dfblack[(dfblack.index >= dt.date(2015, 12, 1)) & 
           (dfblack.index <= dt.date(2016, 2, 11))]
 
plt.figure(figsize=(13, 6))
plt.plot(dfblack.Return*100, '.k')
plt.plot(df_bm.Return*100, '.r')
plt.plot(df_wtc.Return*100, '.b')
plt.plot(df_cre.Return*100, '.m')
plt.plot(df_16.Return*100, '.y')
plt.xlabel('Time Interval [%s to %s]' % (dfblack.index[0], dfblack.index[-1]), 
               fontsize=12)
plt.ylabel('Magnitude [%]', fontsize=12)
plt.title('Time Distribution of the Most Extreme Daily Losses of %g NYSE '
          'Stocks (1 point per stock)' % (len(dfblack)), fontsize=13)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.xlim([dfblack.index[0], dt.date(2016, 4, 14)])
 
plt.savefig('fig02.png', format='png')

delivering:
fig02
Keeping in mind that every stock (out of 2716) is represented by only 1 data point, the figure reveals a dramatic state of play: a significantly lower number of EREs before 1995 and a progressing increase after that, with a surprisingly huge number in the past 5 years! As discussed in my previous post, Detecting Human Fear in Electronic Trading, the surge in EREs could be explained by the domination of algorithmic trading over manual trades. It may also suggest that the algorithmic risk management for many stocks is either not as good as we want it to be or very sensitive to the global stock buying/selling pressure across the markets (the butterfly effect). On the other hand, the number of new stocks being introduced to NYSE may cause them “to fight for life”, and those most vulnerable, short-lived, of unestablished position, reputation, financial history, or low liquidity are more likely to be subject to an ERE occurrence. Regardless of the primary cause, there are more and more “sudden drops in altitude” now than ever. I find it alarming.

For clarity, in the above figure, the historical events of Black Monday 1987-10-19 (89 points), the reopening of NYSE after 9/11 (25 points), the Financial Crisis of 2007-2008 (858 points), and the drawdown of 2015/2016 (256 points!) have been marked by red, blue, purple, and dark yellow colours, respectively.

2. Black Swans of NYSE

The initial distribution of Extreme Rare Events of NYSE (Figure 1) provides us with an opportunity to define a black swan event. There is a statistical theory, Extreme Value Theory (EVT), delivering an appealing explanation of the behaviour of rare events. We have discussed it in detail in the Extreme VaR for Portfolio Managers and Black Swan and Extreme Loss Modeling articles. Thanks to that, we gain the knowledge and tools needed to describe what we observe. As previously, in this post we will only be considering the Gumbel distribution $G$ with the corresponding probability density function (pdf) $g$ given as:
$$
G(z;\ a,b) = e^{-e^{-z}} \ \ \mbox{for}\ \ z=\frac{x-a}{b}, \ x\in\Re
$$ and
$$
g(z;\ a,b) = b^{-1} e^{-z}e^{-e^{-z}} \ .
$$ where $a$ and $b$ are the location parameter and scale parameter, respectively. You can convince yourself that fitting the $g(z;\ a,b)$ function truly works in practice:

# fitting
locG, scaleG = gumbel_l.fit(dfblack.Return)  # location, scale parameters
 
dx = 0.0001
x = [dx*i for i in range(-int(1/dx), int(1/dx)+1)]
x2 = x.copy()
 
plt.figure(num=3, figsize=(9, 4))
plt.hist(dfblack.Return, bins=50, normed=True, color=c, 
             label="NYSE Data of 2716 EREs")
pdf1 = gumbel_l.pdf(x2, loc=locG, scale=scaleG)
y = pdf1.copy()
plt.plot(x2, y, 'r', linewidth=2, label="Gumbel PDF fit")
plt.xlim([-1, 0])
plt.xlabel('Daily loss [%]', fontsize = 12)
plt.xticks(fontsize = 12)
plt.legend(loc=2)
plt.savefig('fig03.png', format='png')

fig03
By deriving:

print(locG, scaleG, locG-scaleG, locG+scaleG)
 
# Pr(Loss < locG-scaleG)
Pr_bs = gumbel_l.cdf(locG-scaleG, loc=locG, scale=scaleG)
print(Pr_bs)

we see that

-0.151912998711 0.0931852781844 -0.245098276896 -0.058727720527
0.307799372445

i.e. the Gumbel pdf’s location parameter (peak) is at -15.2% with a scale of 9.3%. Using an analogy to the Normal distribution, we define two brackets, namely:
$$
a-b \ \ \mbox{and} \ \ a+b
$$ to be -24.5% and -5.9%, respectively. Given that, we define a black swan event as a daily loss of:
$$
L < a-b
$$ magnitude. Based on the NYSE data sample, we find that 30.8% of EREs (i.e. losses greater than 24.5%) meet that criterion. We re-plot their time-magnitude history as follows:

# skip data after 2015-04-10 (last 252 business days)
dfb = dfblack0[dfblack0.index < dt.date(2015, 4, 10)]
# re-fit Gumbel (to remain politically correct in-sample ;)
locG, scaleG = gumbel_l.fit(dfb.Return)
print(locG, locG-scaleG)  # -0.16493276305 -0.257085820659
 
# extract black swans from EREs data set
dfb = dfb[dfb.Return < locG-scaleG]  
dfb0 = dfb.copy()
 
plt.figure(figsize=(13, 6))
plt.plot(dfb.Return*100, '.r')
plt.xlabel('Time Interval [%s to %s]' % (dfb.index[0], dfb.index[-1]), 
               fontsize=14)
plt.ylabel('Black Swan Magnitude [%]', fontsize=14)
plt.title('Time Distribution of Black Swans of %g NYSE Stocks (L < %.1f%%)' % 
              (len(dfb), (locG-scaleG)*100), fontsize=14)
plt.xticks(fontsize = 14)
plt.yticks(fontsize = 14)
plt.ylim([-100, 0])
plt.xlim([dfb.index[0], dt.date(2016, 4, 14)])
plt.savefig('fig04.png', format='png')

fig04
where we deliberately skipped the last 252+ data points to allow for some backtesting later on (out-of-sample). That reduces the number of black swans from 807 down to 627 (in-sample).

3. Modelling the Probability of an Event Based on Observed Occurrences

Let’s be honest. It’s nearly impossible to predict in advance the exact day when the next black swan will appear, for which stocks, or whether it will be isolated or trigger several stocks to fall on the same day. Therefore, what follows from now on you should consider as common sense in the quantitative modelling of black swans, backed by our knowledge of statistics. And yes, despite the opinions of Taleb (2010) in his book The Black Swan: The Impact of the Highly Improbable, we might derive some risk uncertainties and become more aware of the impact if the worst strikes.

The classical guideline we find in Vose’s book Quantitative Risk Analysis: A Guide to Monte Carlo Simulation Modelling (1996). David delivers not only solid explanations of many statistical distributions but also excellent hints on their practical usage for risk analysts. Since I discovered this book in Singapore in 2010, it has been my favourite reference manual of all time.

The modelling of the probability of an event based on observed occurrences may follow two distinct processes, analogous to the difference between discrete and continuous distributions: (1) an event that may occur only among a set of specific discrete opportunities, and (2) an event that may occur over a continuum of opportunity.

Considering the black swan events we are not limited by a finite number of instances, therefore we will be taking into account the continuous exposure probability modelling. A continuous exposure process is characterized by the mean interval between events (MIBE) $\beta$. The underlying assumption refers to the uncertainty displaying the properties of a Poisson process, i.e. the probability of an event is independent of however many events have occurred in the past or how recently. And this is our starting point towards a better understanding of NYSE black swans.

3.1. Mean Interval Between Black Swan Events (MIBE)

The MIBE is the average interval between $n$ observed occurrences of a black swan event. Its true value can be estimated from the observed occurrences using the central limit theorem:
$$
\mbox{MIBE}\ \beta = \mbox{Normal} \left( \bar{t}, \frac{\sigma}{\sqrt{n-1}} \right)
$$ where $\bar{t}$ is the average of the $n-1$ observed intervals $t_i$ between the $n$ observed contiguous black swan events and $\sigma$ is the standard deviation of the $t_i$ intervals. The larger the value of $n$, the greater our confidence on knowing its true value.

First, we estimate $\bar{t}$ from the data. Additionally, using NumPy’s histogram function, we confirm that the plain mean is the same as the expected value, $E(\Delta t)$, derived from the histogram,
$$
E(\Delta t) = \sum_{i} p_i x_i \ ,
$$ i.e. $\bar{t} = E(\Delta t)$, where $p_i$ is the probability of observing $x_i$. In Python we execute that check as follows:

deltat = np.diff(dfb.Delta0)  # time differences between points
 
print('tbar = %.2f [days]' % np.mean(deltat))
print('min, max = %.2f, %.2f [days]' % (np.min(deltat), np.max(deltat)))
print()
 
# Confirm the mean using histogram data (mean = the expected value)
y, x = np.histogram(deltat, bins=np.arange(np.ceil(np.max(deltat)+2)),
                    density=True)
hmean = np.sum(x[:-1]*y)
print('E(dt) = %.2f [days]' % hmean)
 
# plot the histogram of time differences
plt.figure(figsize=(9, 5))
plt.bar(x[:-1], y, width=1, color=c)
plt.xlim([0, 20])
plt.title('Time Differences between Black Swan Events', 
              fontsize=12)
plt.xlabel('Days', fontsize=12)
plt.savefig('fig05.png', format='png')

which returns

tbar = 17.38 [days]
min, max = 0.00, 950.00 [days]
 
E(dt) = 17.38 [days]

It is worth noticing that the largest recorded separation between historical black swans at NYSE was 950 days, i.e. 2.6 years. By plotting the distribution of time differences, we get:
fig05
which can be read more clearly if we type, e.g.:

# the first 5 data points from the histogram
print(y[0:5])
print(x[0:5])

returning

[ 0.30990415  0.12300319  0.08466454  0.04472843  0.03354633]
[ 0.  1.  2.  3.  4.]

which means that our data suggest a 31.0% probability of observing black swan events on the same day, 12.3% with a 1-day gap, 8.5% with a 2-day separation, and so on.

Since the MIBE $\beta$ is not given by a single number, the recipe we have for its estimation (see the formula above) only allows us to model the Mean Interval Between Black Swan Events as follows. First, we calculate the corresponding standard deviation:

ne = len(dfb)                    # 627, the number of black swans in the data sample
tbar = np.mean(deltat)           # mean interval between black swan events
sig = np.std(deltat, ddof=1)     # standard deviation of the intervals
 
print(tbar, sig)
print(tbar, sig/np.sqrt(ne-1))
15.8912 57.5940828519
15.8912 2.30192251209

and next
 
betaMIBE_sample = norm.rvs(loc=tbar, scale=sig/np.sqrt(ne-1), size=10000)
 
plt.figure(figsize=(9, 5))
plt.hist(betaMIBE_sample, bins=25, color=c)
plt.xlabel("MIBE beta [days]", fontsize=12)
plt.title("Mean Interval Between Black Swan Events", fontsize=12)
plt.savefig('fig06.png', format='png')

fig06

3.2. Time Till Next Black Swan Event

Once we have estimated the MIBE $\beta$, it is possible to use its value in order to estimate the time till the next black swan event. It obeys
$$
\mbox{TTNE} = \mbox{Expon}(\beta)
$$ i.e. the exponential distribution given by the probability density function of:
$$
f(x; \beta) = \lambda \exp(-\lambda x) = \frac{1}{\beta} \exp\left(-\frac{x}{\beta}\right) \ .
$$ In Python we model TTNE in the following way:

betaMIBE_one = norm.rvs(loc=tbar, scale=sig/np.sqrt(ne-1), size=1)
 
# The time till next event = Expon(beta)
ttne = expon.rvs(loc=0, scale=betaMIBE_one, size=100000)
 
y, x = np.histogram(ttne, bins=np.arange(int(np.max(ttne))), density=True)
 
plt.figure(figsize=(9, 5))
plt.bar(x[:-1], y, width=1, color=c, edgecolor=c)
plt.xlim([0, 120])
plt.title("Time Till Next Black Swan Event", fontsize=12)
plt.xlabel("Days", fontsize=12)
plt.savefig('fig07.png', format='png')

fig07
where by checking

print(y[1:5])
print(x[1:5])

we get

[0.05198052  0.0498905   0.04677047  0.04332043]
[1 2 3 4]

id est, there is a 5.20% probability that the next black swan event at NYSE will occur on the next day, 4.99% in 2 days, 4.68% in 3 days, etc.

3.3. The Expected Number of Black Swans per Business Day

The distribution of the number of black swan events to occur at NYSE per unit time (here, per business day) can be modelled using Poisson distribution:
$$
\mbox{Poisson}(\lambda) \ \ \mbox{where} \ \ \lambda = \frac{1}{\beta} \ .
$$ The Poisson distribution models the number of occurrences of an event in a time $T$ when the time between successive events follows a Poisson process. If $\beta$ is the mean time between events, as used by the Exponential distribution, then $\lambda = T/\beta$. We will use it in the “out-of-sample” backtesting in Section 4.

Let’s turn words into action by executing the following Python code:

# generate a set of Poisson rvs
nep = poisson.rvs(1/betaMIBE_one, size=100000)
 
y, x = np.histogram(nep, bins=np.arange(np.max(nep)+1), density=True)
 
plt.figure(figsize=(9, 5))
plt.title("The Expected Number of Black Swans per Business Day", fontsize=12)
plt.ylabel("N", fontsize=12)
plt.ylabel("Probability", fontsize=12)
plt.xticks(x)
plt.xlim([-0.1, 2.1])
plt.bar(x[:-1], y, width=0.05, color='r', edgecolor='r', align='center')
plt.savefig('fig08.png', format='png')

fig08
The derived probabilities:

print(y)
print(x[:-1])
[0.95465  0.04419  0.00116]
[0 1 2]

inform us that there is a 95.5% chance of no black swan event occurring on a given business day, a 4.4% chance that we will record only one, and a 0.12% chance that we may find 2 out of 2500+ NYSE stocks suffering a loss greater than 25.7% on the next business day (unit time).
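
These sampled probabilities can be cross-checked analytically (a sketch, not part of the original code) by evaluating the Poisson pmf at the point estimate $\lambda = 1/\bar{t}$ rather than at a single random draw of $\beta$:

# Sketch: analytic Poisson probabilities of k = 0, 1, 2 black swan events per
# business day, using the point estimate 1/tbar; the values should be close
# to, though not identical with, the sampled histogram above (which used a
# single random draw of beta).
from scipy.stats import poisson

for k in range(3):
    print('Pr(X=%d) = %.5f' % (k, poisson.pmf(k, 1/tbar)))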

3.4. Probability of the Occurrence of Several Events in an Interval

The Poisson$(1/\beta)$ distribution calculates the distribution of the number of events that will occur in a single unit interval. The probability of exactly $k$ black swan events in such an interval is given by:
$$
\mbox{Pr}[X = k] = \frac{\lambda^k}{k!} \exp(-\lambda) = \frac{1}{\beta^k k!} \exp\left(-\frac{1}{\beta}\right)
$$
with $\beta$ rescaled to reflect the period of exposure, e.g.:
$$
\beta_{yr} = \beta/252
$$ where 252 stands for the number of business days in a calendar year. Therefore, the number of black swan events in the next business year can be modelled by:
$$
N = \mbox{Poisson}(\lambda) = \mbox{Poisson}(1/\beta_{yr}) = \mbox{Poisson}\left(252/\mbox{Normal}(\bar{t}, \sigma/\sqrt{n-1})\right)
$$ which we can achieve in Python by coding:

inNdays = 252
 
N = poisson.rvs(inNdays/norm.rvs(loc=tbar, scale=sig/np.sqrt(ne-1)), size=100000)
exN = np.round(np.mean(N))
stdN = np.round(np.std(N, ddof=1))
 
y, x = np.histogram(N, bins=np.arange(int(np.max(N))), density=True)
 
_ = plt.figure(figsize=(9, 5))
_ = plt.title("The Expected Number of Black Swans in next %g Business Days" % 
              inNdays, fontsize=12)
_ = plt.ylabel("N", fontsize=12)
_ = plt.ylabel("Probability", fontsize=12)
_ = plt.bar(x[:-1], y, width=1, color=c)
plt.savefig('fig09.png', format='png')
 
# probability that it will occur <N> events in next 'inNdays' days
print('E(N) = %g' % exN)
print('stdN = %g' % stdN)
 
tmp = [(i,j) for i, j in zip(x,y) if i == exN]
print('Pr(X=%g) = %.4f' % (exN, tmp[0][1]))
 
tmp = [(i,j) for i, j in zip(x,y) if i == 0]
print('Pr(X=%g) = %.4f' % (0, tmp[0][1]))

Our Monte Carlo simulation may return, for instance:
fig9
where the calculated measures are:

E(N) = 16
stdN = 4
Pr(X=16) = 0.0992
Pr(X=0) = 0.0000

The interpretation is straightforward. On average, we expect to record $16\pm 4$ black swan events in the next 252 business days among 2500+ stocks traded at NYSE as of Apr 10, 2015 COB. The probability of observing exactly 16 black swans is 9.92%. It is natural, based on our data analysis, that the resultant probability of the “extreme luck” of not having any black swan at NYSE, Pr$(X=0)$, in the following trading year is zero. But, miracles happen!

9.92% is derived based on a single run of the simulation. Using the analytical formula for Pr$(X=N)$ given above, we compute the uncertainty of
$$
\mbox{Pr}[X = (N=16)] = \frac{1}{(\beta_{yr})^{16} 16!} \exp\left(-\frac{1}{\beta_{yr}}\right) \ ,
$$

Pr = []
for nsim in range(100000):
    betayr = norm.rvs(loc=tbar, scale=sig/np.sqrt(ne-1), size=1) / inNdays
    p = 1/(betayr**exN * factorial(int(exN))) * np.exp(-1/betayr)
    Pr.append(p)
 
print('Pr[X = E(N) = %g] = %.2f +- %.2f' % (exN, np.mean(Pr), 
         np.std(Pr, ddof=1)))

that yields

Pr[X = E(N) = 16] = 0.08 +- 0.02

and stays in agreement with our result.

4. Prediction and Out-of-Sample Backtesting

All right, time to verify the theoretical statistics in practice. Our goal is to analyze the sample of black swan data skipped so far but available to us. This procedure is usually known as “out-of-sample” backtesting, i.e. we know “the future” as it has already happened but we pretend it is unknown.

In the first step, we look for a suitable time interval of exactly 252 business days “in the future”. It turns out to be:

print(np.busday_count(dt.date(2015, 4, 10) , dt.date(2016,3, 29)))  # 252

Next, we extract the out-of-sample (oos) DataFrame and illustrate it in a similar fashion as previously for the in-sample data:

oos = dfblack0[(dfblack0.index >= dt.date(2015, 4, 10)) &
               (dfblack0.index <= dt.date(2016, 3, 29))]
 
# extract black swans
oos = oos[oos.Return < locG-scaleG]
 
plt.figure(figsize=(13, 6))
plt.plot(oos.Return*100, '.b')
plt.xlabel('Time Interval [%s to %s]' % (dfblack.index[0], oos.index[-1]), 
                  fontsize=14)
plt.ylabel('Black Swan Magnitude [%]', fontsize=14)
plt.title('Time Distribution of Out-of-Sample Black Swans of %g '
          'NYSE Stocks (L < %.1f%%)' %
          (len(oos), (locG-scaleG)*100), fontsize=14)
plt.xticks(fontsize = 14)
plt.yticks(fontsize = 14)
plt.ylim([-100, 0])
plt.xlim([dfblack.index[0], oos.index[-1]])
plt.savefig('fig10.png', format='png')

fig10
Based on that, we compare the MIBE as predicted with what the next 252 business days reveal:

deltat_oos = np.diff(oos.Delta0)  # time differences between points
 
tbar_oos = np.mean(deltat_oos)
print('MIBE (predicted)   = %.2f +- %.2f [days]' % (tbar, sig/np.sqrt(ne-1)))
print('MIBE out-of-sample = %.2f         [days]' % tbar_oos)
print()
 
y, x = np.histogram(deltat_oos, bins=np.arange(np.ceil(np.max(deltat)+2)),
                    density=True)
 
print(y[0:5])  # [ 0.3814433   0.20618557  0.17525773  0.06185567  0.05154639]
print(x[0:5])  # [ 0.  1.  2.  3.  4.]
 
# Predicted probabilities were:
# [ 0.30990415  0.12300319  0.08466454  0.04472843  0.03354633]

The outcome is surprising, i.e.

MIBE (predicted)   = 17.38 +- 2.30 [days]
MIBE out-of-sample = 2.32          [days]

clearly indicating that the Apr 2015–Apr 2016 period was nearly 7.5 times more “active” than the “all-time” NYSE black swan data would predict it to be. Have I already said “it’s alarming”?! Well, at least scary enough for all who trade on a daily basis holding portfolios rich in NYSE assets. The probability of having black swans on the same day increased from 31% to 38%. A new, pure definition of sophisticated gambling.

A verification of the time till the next black swan event requires a gentle modification of the oos DataFrame. Namely, we need to eliminate all data points with the same Delta0’s:

oos2 = oos.Delta0.drop_duplicates(inplace=False)
tdiff = np.diff(oos2)
 
y, x = np.histogram(tdiff, bins=np.arange(int(np.max(tdiff))), density=True)
 
_ = plt.figure(figsize=(9, 5))
_ = plt.bar(x[:-1], y, width=1, color=c, edgecolor=c)
_ = plt.xlim([1, 30])
_ = plt.title("Time Till Next Black Swan Event (Out-of-Sample)", fontsize=12)
_ = plt.xlabel("Days", fontsize=12)
_ = plt.ylabel("Probability", fontsize=12)
plt.savefig('fig11.png', format='png')
 
print(y[1:5])
print(x[1:5])

fig11
Again, from our modelling of Poisson($1/\beta$) we expected about a 5% chance of waiting a day, two, or three for the next black swan event. The out-of-sample data suggest,

[0.33898305  0.28813559  0.10169492  0.08474576]
[1 2 3 4]

that we are 6.3 times more likely to record a black swan event at NYSE on the next business day. Terrifying.

The calculation of the expected number of black swans per business day in the out-of-sample data I’m leaving to you as homework in Python programming (see the sketch below for a possible starting point). Our earlier prediction of those events in the next 252 business days returned 16$\pm$4. As we have already seen, in the considered time interval after Apr 10, 2015 we recorded 98 swans, which is $6.1^{+2.1}_{-1.2}$ times more than forecasted based on the broad sample.
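
For those who would like a head start with that homework, one possible approach (a sketch only, assuming the oos DataFrame and tbar from the code above) is to compare the empirical daily counts with the in-sample Poisson prediction:

# Sketch: empirical distribution of the number of out-of-sample black swans
# per business day vs. the in-sample Poisson prediction; assumes oos and tbar
# defined above, and dt/np imported earlier.
import numpy as np
from scipy.stats import poisson

ndays = np.busday_count(dt.date(2015, 4, 10), dt.date(2016, 3, 29))  # 252
per_day = oos.groupby(oos.index).size()      # counts for days with >= 1 event
nzero = ndays - len(per_day)                 # business days with no black swan

print('mean number of black swans per business day: %.3f' % (len(oos)/ndays))
print('Pr(X=0): empirical %.4f vs predicted %.4f'
      % (nzero/ndays, poisson.pmf(0, 1/tbar)))
for k in range(1, 4):
    print('Pr(X=%d): empirical %.4f vs predicted %.4f'
          % (k, np.sum(per_day.values == k)/ndays, poisson.pmf(k, 1/tbar)))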

In the next post of this series, we will try to develop a methodology for a quantitative derivation of the probability of a black swan for an $N$-asset portfolio.

Reflection

Investors invest in stocks in anticipation of earning a profit. As we have witnessed above, the birds become a serious threat to flying high, with a significant increase in this investment air-traffic. Your plane (stock) can be grounded and your airline (portfolio) can suffer huge losses. Hoping is not the solution. Better ways of protecting against the unavoidable should be implemented for investments while your money is in the game.

But what should you do when a black swan strikes? Denzel Washington once said: If you pray for rain, you gotta deal with the mud too.

Detecting Human Fear in Electronic Trading: Emotional Quantum Entanglement

This post presents an appealing proof of the progressing domination of algorithmic trading over human trading. By analysing the US stock market between 1960 and 1990, we estimate the human engagement (human factor) in live trading decisions taken after 2000. We find a clear distinction between the 2000-2002 “dot-com era” trading behaviour and the 2007-2008 crisis. The use of peculiar data samples is motivated by recent discoveries in quantum physics that help us better understand the physics of human thoughts and shared emotions when related to the same topic, here, the fear of losing.

1. Introduction

When I was 16, someone drew my attention to the phenomenon of fear spreading across a room. The example was taken (nearly) from daily life. Namely, imagine that you are sitting in an IMAX cinema. It’s dark and everyone is fully absorbed by a new 2.5-hour-long action movie with James Bond in 3D. There is a lot of tension in the air. Suddenly, out of the blue, someone stands up and screams loudly: There is a snake here!! What happens next??! There is panic! And… it’s spreading instantaneously across the room! How is it possible?!

Believe me or not, but I devoted the past 20 years of my life to the investigation of this unbelievable event. Well, from the sidelines it looks normal, right? A snake in a dark cinema room is a rare occurrence indeed, however it is not about the cinema! It is all about the information imprinted in our brains: the knowledge about snakes. They are fast, dangerous, and (in most cases) deadly venomous! That’s all we know about them when leaving elementary school. That’s all we need to know in order to create a firm association in our minds that “a snake” means “it will kill you”! That’s it. End of story.

When I was 18, some people were telling me stories about the human aura. A “mystic” energy-like field surrounding our bodies. Permanently. If you know how to interact with it, you can influence it in a good or bad way. You can help in healing or interrupt the flow of “chi” energy around the body. At that time, it was a mind-blowing (more than ever) piece of knowledge handed to me without any proof of its validity. I guess you know what I mean by that. When we are young, we assume so many “facts” on faith. Few dare to explore them. Deeper.

Science is a field of study that takes an event as it is and seeks the ways, methods, and approaches to prove or discard its truthfulness. Before humankind invented the Infra-Red (IR) camera, people had no idea about IR imaging of our bodies. By seeing “outside the box” we proved (once and for all) that our body’s emission extends beyond the visible light within which our eyes can “see” the world around us. Therefore, just because something is invisible and undetected does not mean that it does not exist. Sometimes, discovering it is a matter of time. Sometimes, it’s a matter of turning the impossible into “faith” that it is possible.

Humans. We are made to have and share emotions. What are they, for example, the emotion of fear? If you ask a neuroscientist, he/she will tell you: it’s a past experience (therefore a new block of knowledge) coded in and by your brain. Once anything similar is “experienced again” by you, your brain will seek the relevant information and react “as previously”. With every brain activity (not solely related to human emotions) the corresponding reaction is engaged via chemical, neurological, and muscular feedback. If a response to “a new event” in your life has to be made, you learn by the “hit and miss” method. If you know “what to do”, you are more “experienced” (or your mind’s processes trigger an intellectual thinking leading you, more likely, to finding the “best” solution to the existing problem or situation). The gravity of our “thinking” becomes exponentially more complex the deeper you go into it. Shortly you discover an astonishing fact about yourself: the more you explore yourself, the less you understand yourself!

When I was 20, I studied quantum mechanics at uni. I spent nearly 6 months looking at the bizarre bra-ket notation of quantum states without a clear understanding of what the lecturer was talking about. But the dead Schrodinger’s cat was not as dead as it appeared to be. I came back to quantum mechanics in my mid-30s, backed by a PhD in astrophysics, and with my relentless determination to seek and find the answers to the most fundamental questions related to the nature of the human mind’s way of thinking and interpersonal communication.

When I was young, someone taught me all about making my dreams a reality. The formula was staggering: close your eyes and imagine yourself as a person who has already achieved it. Damn straightforward… And it all turned out to work that way. I followed the rules, I worked hard on my life and my dreams (e.g. of living one day in Sydney, Australia) and… now I live exactly where I dreamt of living… in Sydney, Australia! Not in Mozambique nor Buenos Aires.

That was a big shot. Plenty of “impossible” events took place along the way…

I used to read a lot. I was deeply surprised by what I found in the words of Jesus Christ. Again, I’m an amateur when it comes to the Bible. I’m not a pro. But what I can quote here will, I think, be sufficient: “whatever you ask for in prayer, believe that you have received it, and it will be yours” (Mark 11:24). The Present Perfect tense: that you have received it. That was Jesus’ confirmation of what people teach today: close your eyes and imagine you are already living your dream and it will soon become your reality. Period. We find the same message in the book of Wallace D. Wattles, The Science of Getting Rich (1910), and in the books of Napoleon Hill (1930s), just to name the best ones.

How is it possible that emotions and feelings mixed with our inner vision of “reality” seek and find their way into real life? Do we really attract them, or is it a playground of “God” who alone knows the equation of state and “does” the miracles? Or maybe we, humans, possess powers of mind, a potential within that we barely tap into.

The past 50+ years of developments in science, philosophy, religion, and brain research have started to turn our understanding of “unknown” processes towards one common denominator in this game: quantum mechanics. The bizarre one.

2. Quantum Mechanics of Human Thoughts

The early years of the 20th century were the cornerstone of quantum physics and a battlefield between it and classical physics. Bohr against Einstein. Einstein did not believe that God played dice. The randomness built in at micro scales frightened him. The birth and progress of quantum mechanics (QM) introduced a new package of uncertainties: a vision of the Universe ruled by laws going against our classical logic, remaining counter-intuitive.

QM was soon confirmed to be valid in the observational world of wonders. Just recall the two-slit experiment with an electron passing through. A mechanical wave on water interferes. The electron (or any other particle) doing the same triggers interference patterns, however with a completely different distribution of minima and maxima when contrasted with classical physics’ expectations. By observing a wave on the water, we clearly see the effect of “going through” the slits. In the quantum world, we have no idea through which slit the particle chose to pass.

Quantum Mechanics delivers a super-complex picture of reality at the fundamental level. The movement of a particle in space-time (4D) could be described as a “wave”. A special kind of wave, far different from our “classical” comprehension. Technically speaking, we associate a so-called wave function, $\Psi$, with moving or interacting particles. The wave function is a solution to the Schrodinger equation:
$$
\hat{H}\Psi(x,t) = i\hbar \frac{\partial \Psi}{\partial t}
$$ which describes the time evolution of a particle. How bizarre this is one can already see from the presence of the imaginary unit $i$ “built into” the equation.

The most widely accepted interpretation of $\Psi$ reveals its dimensionless nature to be a wave of probabilities spreading “across the room”. It is impossible to see or measure $\Psi$ in a quantum world. Once the observation is done, for example, of a moving electron or photon by a detector, the whole wave function collapses instantaneously. We can derive the amplitudes of probabilities by:
$$
|\Psi(x,t)|^2
$$ but never be certain where the particle is located before the observation.

$\Psi$ is an abstract mathematical object living in the Hilbert space. It represents a particular pure quantum state of a specific isolated system of one or more particles. By choosing a specific system of coordinates, e.g. $\xi = (r, \vartheta, \psi)$ or $\xi = (x,y,z, \sigma_z)$, one can represent the wave function of $\Psi(\xi)$ as:
$$
\Psi(\xi) = \left( \Phi_{\xi}, \Psi \right) = \left( \Phi_{(x,y,z,\sigma_z)}, \Psi \right)
$$ i.e. the inner product of a state vector $\Psi$ and a state $\Phi(\xi)$ in which the particle is located in 3D by $x$, $y$, $z$, having a definite value spin $z$-projection of $\sigma_z$. For example, the wave function for a hydrogen atom in different quantum states in spherical coordinates is given by:
$$
\Psi_{n,m,l}(r, \vartheta, \psi)
$$ i.e. depends on quantum numbers (for a full formula see here). The visualization of probability density for Hydrogen (see here) provides us with a better “feeling” where we can find its electron in a given state. But again, we never know where the electron is until the moment of observation.

A wave function for $N$ particles in 4D space-time can be denoted as:
$$
\Psi(\textbf{r}_1,\textbf{r}_2,…,\textbf{r}_N, t)
$$ where $\textbf{r}_i\equiv \textbf{r}_{(x,y,z), i}$ are the position vectors for all $i=1,2,…,N$ particles at time $t$. Interestingly, for $N$ particles in QM there are not $N$ wave functions (one for each particle separately) but rather one wave function describing them all. Moreover, that leads to the phenomenon of Quantum Entanglement — a physical phenomenon that occurs when pairs or groups of particles are generated or interact in ways such that the quantum state of each particle cannot be described independently — instead, a quantum state must be described for the system as a whole. That’s very important, as we will discuss shortly.

What is a human thought? Technically speaking, it is an electrical impulse generated in the brain. Where does it originate exactly? It’s difficult to say. We have no way or method to capture the birthplace of a thought, for example, when you decide to think about a white ping-pong ball. With the help of ElectroEncephaloGraphy (EEG) we can measure the electrical activity of the brain. EEG measures voltage fluctuations resulting from ionic currents within the neurons of the brain. On the other hand, functional Magnetic Resonance Imaging, or functional MRI (fMRI), is a functional neuroimaging procedure using MRI technology that measures brain activity by detecting changes associated with blood flow. Both approaches, though very useful in medicine and neurobiology, still leave vast room for improvement when it comes to “capturing” our thoughts.

An electrical impulse sent by the brain commences a whole avalanche of neuro-chemical-muscular processes in our body. It is super-complex. No doubt. In 2004, a team of neurobiologists led by Massimo Scanziani from the University of California San Diego uncovered evidence that sheds light on the long-standing mystery of how the brain makes sense of the information contained in electrical impulses sent to it by millions of neurons from the body. We are bombarded by thousands of pieces of information every second. Somehow, our brain does an amazing job for us. How information is sorted by the brain has been an open question. The group discovered that different neurons in the brain are dedicated to responding to specific portions of the information (see a full report here). But what is a human thought? Just an electrical impulse or a series of impulses in a specific “state”?

That’s where quantum mechanics may help us build a clearer picture, therefore turning metaphysics into real physics. If, technically speaking, a thought emerging from our “intellect” or “consciousness” can be associated with a series of elementary particles (e.g. electrons, etc.) “running” through our bodies, is it possible for them to “spread across the room”, “around the world”, or even “around the Universe”?! I.e. to leave “us” (the body which we consider as something that usually defines us in the visible band of the EM wavelengths) and interact with other(s) at great distances?

It no longer seems to be science fiction. A particle in the quantum world may “tunnel” through a wall, which in the world we know is impossible. You cannot walk through the Great Wall of China through the bricks of stone (unless your name is David Copperfield and you did it; see the movie if you missed that; a great magical trick with light and stairs).

The hypothesis of quantum tunnelling taking place in our bodies has been formulated by Turin in 1996. “Turin posited that the approximately 350 types of human smell receptors perform an act of quantum tunneling when a new odorant enters the nostril and reaches the olfactory nerve. After the odorant attaches to one of the nerve’s receptors, electrons from that receptor tunnel through the odorant, jiggling it back and forth. In this view, the odorant’s unique pattern of vibration is what makes a rose smell rosy and a wet dog smell wet-doggy.” — we read in Discover Magazine (2009). Quantum tunneling has also been observed in enzymes, the proteins that facilitate molecular reactions within cells; the papers published in 2006 and 2007 in Science and Biophysical Journal, respectively.

If quantum tunnelling is taking place as a part of “us” being alive, it may be a fundamental indication towards understanding how our thoughts spread in and out. Again, this is at the abstract level of our understanding of “what’s going on” and we are forced only to speculate.

Nevertheless, a human thought may be viewed in the quantum world as a time-dependent evolution of (charged) particles in the form of a wave function in a given quantum state. It is tempting to postulate, therefore I dare to do it here, that concentrated thinking (i.e. when we focus on one task only or we feel the emotion of fear/happiness/etc.) is the brain activity “releasing” groups of particles whose quantum state must be described as a whole — quantum entanglement. If so, this is where the magic begins!

Say, if two people “feel” the same way, then through the quantum entanglement of their quantum state(s), the wave function describing a similar thought process,
$$
\Psi(\textbf{r}_1,\textbf{r}_2,…,\textbf{r}_N, t)
$$ must be, in the quantum world, a good representation of “shared emotions” or “shared thinking” about the same thing. According to the above formula, the more people “think” or “share” the same emotions, the more entangled they become at the quantum level. If true, that would define emotional quantum entanglement in a more physical way.

It’s nearly impossible to prove that QM works that way. However, there are empirical methods one can use to convince oneself that feeling “something” works regardless of the distance. For example, a mother seems to be tightly linked with her child. If the child is locked in its room, out of sight, out of reach, and even acoustically separated, and something is happening to it, the mother magically “senses” that state and reacts. Another example. If you find yourself among a group of people who stand or sit in silence, you quickly “sense” that there is something, e.g. something wrong “in the air”. Such experiments were conducted in the past. Your ability to “detect” the feeling is phenomenal. The groups sitting in silence were simply asked to think intensively about something horrible that had taken place in their lives. One more? You start thinking about someone who you wish were where you are. But you are in Chile and that person is far away. You receive a phone call from him within an hour (out of the blue!) and he claims that he had “that feeling” about you, and found a way to find and call you.

The number of similar examples is endless. Talking to flowers in order to help them bloom is my favourite one. Emitting different emotions and causing a change of structure in frozen water were investigated by Russian scientists already in the 1960s (see a full movie) and by Dr Masaru Emoto (Hado) in 1994 (see his movie and website).

But we all know it from our daily life. There are people whom you meet for the very first time and you “feel” great and positive from that very first contact. They “emit” positive vibrations. On the other side, when your spouse’s mother is a negative person, always complaining and in her spirit wishing you not the best, you “sense” it. You start avoiding her. Something triggers your fear or anxiety.

If fear is an emotion spreading instantaneously (governed or not by the unknown or already-known rules of quantum mechanics), we all become affected if we “resonate” at the same “wave”. The recent experiments of Xing-Can Yao et al. (2012) proved that we can entangle eight photons. The complicated instrumental setup they used acted as a smart “brain” making this entanglement possible. What if our brain is an extraordinary instrument that, at the quantum level, picks up, emits, and entangles a specific group of thoughts emitted by others in a given “emotional” state?! What if it works that way indeed…

3. Detecting Fear among Traders

Pretend it is the 28th of May, 1962. You are a broker/trader pushing deals over the phone. But that day is peculiar. The S&P 500 index opened at 59.15, which also turned out to be its maximum price (high). At the end of the trading session it closed the day at 55.50 and the market dropped 6.17% in a single day. The low was recorded at 55.42. The ratio between the Open-to-Close and High-to-Low ranges was over 97.5%. There was nearly no pressure to “buy”, i.e. strong supply vs low demand. People were selling stocks like crazy. The same did not happen again until the 25th of October 1982, when the loss was nearly 4% at a ratio of 100% and over 8 times higher traded volume. What is going on here?
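To make the ratio criterion concrete, here is a quick check of the quoted 1962 numbers (my arithmetic, using the definitions above):
$$
\frac{|O - C|}{|H - L|} = \frac{|59.15 - 55.50|}{|59.15 - 55.42|} = \frac{3.65}{3.73} \approx 0.979 ,
\qquad
\left(\frac{C}{O} - 1\right)\cdot 100\% = \left(\frac{55.50}{59.15} - 1\right)\cdot 100\% \approx -6.17\% .
$$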

If you watched the Wall Street movie or The Wolf of Wall Street, both included scenes from an abnormal trading session where traders were affected by tension and nervousness due to falling stock prices. Well before 1990, orders in the American stock markets were handled by people working in brokerage firms. They were responsible for decisions made upon earlier agreements with their clients. As we have discussed earlier, a day when the market turns south causes people to think and act in a similar fashion. The feelings spread across the floor and no one wants to climb the mountain while the avalanche is coming down. A tidal wave either lifts all boats or puts them down. We call that phenomenon “human nature”, but quantum physics may be involved in understanding why that is so.

Let’s consider the S&P 500 index. It is a good representation of US stock market behaviour, its sentiments, and its alertness to unusual trading factors. Assuming that between 1960 and 1990 human engagement in trading decisions was dominant (over today’s computerised automatic order execution), one may select a peculiar sample from the historical data. Namely, again, we look for days when the ratio between the Open-to-Close and High-to-Low ranges of the S&P 500 was over 95% and the index closed with a loss. By that criterion we allow for a 5% buying/selling pressure deviation (Open-High and Low-Close). Additionally, we use information on the volume traded on those days. Why? Our second assumption relates to the fact that people (not computers) tend to display behaviour patterns that are fairly consistent over time. Therefore, the volume “pushed” on the days we seek should remain of roughly the same magnitude (which we confirm in Figure 1 of our analysis below).

Using the Python language and the basic S&P 500 data provider of Yahoo! Finance (the precision of prices is not a major concern here), first we download the index prices and extract a data sample based on the above-mentioned criteria:

# Detecting Human Fear in Electronic Trading
# (c) 2016 Pawel Lachowicz, QuantAtRisk.com
 
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
import pandas_datareader.data as web
from scipy.stats.stats import pearsonr
from scipy import stats
 
data = web.DataReader("^gspc", data_source='yahoo', start='1960-01-01', 
          end='1989-12-31')['Open','High','Low','Close']
 
data['OC'] = abs(data.Open - data.Close)
data['HL'] = abs(data.High - data.Low)
data['Ratio'] = data.OC/data.HL
data['Loss'] = (data.Close/data.Open - 1)*100  # daily loss in percent
 
events = data[(data.Ratio > 0.95)  & (data.Loss < 0)]
print(events.shape[0])

The analysis returns 71 days between 1960 and 1990. Not a lot, but sufficient to take some further steps.

4. Volume of Fears

The observation we dare to make here is directly linked to the examination of the relationship between the volume traded on those days and the percentage loss. Again, we expect the volume to be somehow dependent on the magnitude of the daily loss, but how strongly dependent? Let’s derive the following:

pr, pvalue = pearsonr(events.Loss.values, events.Volume.values)
slope, intercept, r_value, p_value, std_e = stats.linregress(events.Loss.values,
                                                  events.Volume.values)
 
print(pr, pvalue)           # Pearson correlation coefficient
print(r_value**2, p_value)  # R-square value from a linear regression

where we find:

-0.789294146279   2.91365177983e-16
 0.622985249351   2.91365177983e-16

pointing at a fairly solid correlation, though not as strong as intuitively expected. Based on the linear regression we plot this relationship:

plt.figure(num=1, figsize=(13, 6))
plt.plot(events.Loss, events.Volume, 'ro')
x = np.linspace(-100, 0, 100)
y = slope*x + intercept
plt.plot(x, y, 'k:')
plt.xlim([np.min(events.Loss), np.max(events.Loss)])
plt.ylim([np.min(events.Volume), np.max(events.Volume)])
plt.xlabel('Daily Loss [%]')
plt.ylabel('Volume')
plt.savefig('fig01.png', format='png')
plt.show()

revealing:
[Figure 1: Daily Loss [%] vs Volume for the 71 selected days, with the linear fit (dotted line)]
The linear fit describes the simplest model relating the daily volume traded to, in good approximation, the fear in the market expressed by the loss on those days.

Since we made a remark on the amount of volume traded by people in those years, in the next section we will apply it to the data recorded after 1999.

5. Disentangling Humans from Computers

We repeat the same procedure as outlined above but now for a data set covering the years 2000 to 2016:

del data, events
 
data = web.DataReader("^gspc", data_source='yahoo', start='2000-01-01', 
                        end='2016-03-01')['Open','High','Low','Close']
 
data['OC'] = abs(data.Open - data.Close)
data['HL'] = abs(data.High - data.Low)
data['Ratio'] = data.OC/data.HL
data['Loss'] = (data.Close/data.Open - 1)*100
 
events = data[(data.Ratio > 0.95)  & (data.Loss < 0)]
print(events.shape[0])

which reveals 147 days of the same character. Since it only makes sense to compare “apples” with “apples”, we build an estimate of the human factor (HF; human engagement in trading decisions) after 1999 as:
$$
\mbox{HF} = \frac{ax+b}{y}
$$ or

events['HF'] = (slope*events.Loss + intercept)/events.Volume

where the coefficients $a$ (slope) and $b$ (intercept) have been found in the earlier linear regression analysis, and $x$ and $y$ denote the Daily Loss and Volume (taken between 2000 and 2016), respectively. In other words, $y$ is expected to be far higher “now” than “then” due to the global involvement of algorithmic automated (daily, intraday, and high-frequency) trading.

Therefore $\mbox{HF}$ ought to give us an idea of the fraction of volume traded solely by people, with the rest attributed to human-independent trading “robots”. By plotting the results:

ev1 = events[(events.index.year >= 2000) & (events.index.year <= 2002)]
ev2 = events[(events.index.year >= 2007) & (events.index.year <= 2008)]
 
plt.figure(num=2, figsize=(13, 6))
plt.scatter(100*events.HF, events.Loss, c='black', edgecolors='black', s=50)
plt.scatter(100*ev1.HF, ev1.Loss, c='red', edgecolors='red', s=50)
plt.scatter(100*ev2.HF, ev2.Loss, c='yellow', edgecolors='yellow', s=50)
plt.legend(['2000-2016', '2000-2002', '2007-2008'], loc=4)
plt.xlim([0, 25])
plt.ylim([-10, 1])
 
plt.xlabel('Human Factor [%]')
plt.ylabel('Daily Loss [%]')
plt.savefig('fig02.png', format='png')
plt.show()

we get:
[Figure 2: Human Factor [%] vs Daily Loss [%] for 2000-2016, with 2000-2002 and 2007-2008 events marked in colour]

The result is spectacular. Not only do we see a vast spread in human factors, but we also learn that $\mbox{HF}$ extends only up to about 20%. It may mean that from the year 2000 onwards, over 80% of the trading in the US market (as reflected by the S&P 500 index) has been dominated by algorithms.

In addition, we overplot with different colours the $\mbox{HF}$s corresponding to the 2000-2002 “dot-com era” bubble (red markers) and the 2007-2008 crisis (yellow markers). What is rather unsurprising is a clear boundary between both groups. During the last financial crisis, according to our findings, over 95% of the selling of stocks was executed by computer-based decisions.

6. End Note

Will computers eliminate human trading in the not-too-distant future? Fear not. Just take a quantum leap of faith that it won’t happen soon…

Predicting Heavy and Extreme Losses in Real-Time for Portfolio Holders (2)

This part is awesome. Trust me! Previously, in Part 1, we examined two independent methods in order to estimate the probability of a very rare event (heavy or extreme loss) that an asset could experience on the next day. The first one was based on the computation of the tail probability, i.e.: given the upper threshold $L_{thr}$, find the corresponding $\alpha$ such that $\mbox{Pr}(L < -L_{thr}) = \alpha$, where the random variable $L$ was referred to as a daily asset return (in percent). In this case, we showed that this approach was strongly dependent on the number of similar events in the past that would contribute to the changes of $\alpha$'s in time. The second method was based on the classical concept of conditional probability. For example, given that the index (benchmark, etc.) drops by $L'$, what is the probability that an asset will lose $L\le L'$? We found that this method fails if the benchmark never experienced a loss of $L'$.

In both cases, the lack of information on the occurrence of a rare event was the major problem. Therefore, is there any tool we can use to predict an event that has not happened yet?!

This post addresses an appealing solution and delivers a black box that every portfolio holder can use as an integrated part of his or her risk management computer system.

1. Bayesian Conjugates

Bayesian statistics treats the problem of finding the probability of an event a little bit differently than what we know from school. For example, if we flip a fair coin, in the long run we expect the same proportion of heads and tails. Formally, this is known as the frequentist interpretation. The Bayesian interpretation assumes nothing about the coin (whether it is fair or weighted) and, with every toss of the coin, it builds a grander picture of the underlying probability distribution. And it’s fascinating because we can start with an assumption that the coin is made such that we should observe heads more frequently whereas, after $N$ tosses, we may find that the coin is fair.
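As a minimal toy sketch of that idea (my own illustration, not part of the original derivation), assume an uninformed Beta(1, 1) prior over the probability of heads and update it toss by toss:

import numpy as np
from scipy.stats import beta

np.random.seed(7)
tosses = np.random.binomial(1, 0.5, size=500)  # a fair coin, 500 tosses

a, b = 1, 1  # uninformed Beta(1, 1) prior
for x in tosses:
    a += x       # heads observed
    b += 1 - x   # tails observed

print("Posterior: Beta(%d, %d)" % (a, b))
print("Posterior mean Pr(heads) = %.3f" % (a / (a + b)))
print("95%% credible interval   = [%.3f, %.3f]" % (beta.ppf(0.025, a, b),
                                                   beta.ppf(0.975, a, b)))

After enough tosses the posterior mean settles near the observed proportion of heads, even though our initial belief about the coin was completely agnostic.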

Let me skip the theoretical part on that aspect of Bayesian statistics which has been beautifully explained by Mike Halls-Moore within his two articles (Bayesian Statistics: A Beginner’s Guide and Bayesian Inference Of A Binomial Proportion – The Analytical Approach) that I recommend you to study before reading this post further down.

Looking along the return-series for any asset (spanned over $M$ days), we may denote by $x_i = 1$ ($i=1,…,M$) a day when a rare event took place (e.g. an asset lost more than -16% or so) and by $x_i = 0$ a lack of such event. Treating days as a series of Bernoulli trials with unknown probability $p$, if we start with an uninformed prior, such as Beta($\alpha, \beta$) with $\alpha=\beta=1$ then the concept of Bayesian conjugates can be applied in order to update a prior belief. The update is given by:
$$
\alpha' = \alpha + n_e = \alpha + \sum_{i=1}^{N\le M} x_i \\
\beta' = \beta + N - n_e = \beta + N - \sum_{i=1}^{N\le M} x_i
$$
which simply means that after $N$ days from some initial moment in the past, having observed a rare event $n_e$ times, the probability distribution is given by Beta($\alpha'$, $\beta'$). Therefore, the posterior predictive mean for the Bernoulli likelihood can be interpreted as a new probability of the event taking place on the next day:
$$
\mbox{Pr}(L < -L_{thr}) = \frac{\alpha'}{\alpha' + \beta'} $$ And that's the key concept we are going to use now for the estimation of probability for heavy or extreme losses that an asset can experience in trading.

As a quick visualisation, imagine you analyse the last 252 trading days of the AAPL stock. The stock never lost more than 90% in one day. No such extreme event took place. According to our new method, the probability that AAPL will plummet 90% (or more) in the next trading session is not zero(!) but
$$
\mbox{Pr}(L < -90\%) = \frac{\alpha'}{\alpha' + \beta'} = \frac{1+0}{1+252-0} = 0.3953\% $$ with 95% confidence interval: $$ \left[ B^{-1}(0.05, \alpha', \beta');\ B^{-1}(0.95, \alpha', \beta') \right] $$ where $B^{-1}$ denotes the percent point function (quantile function) for Beta distribution.

The expected number of events corresponding to a loss of $L < -L_{thr}$ during the next 252 trading days we find by:
$$
E(n_e) = \left\lceil \frac{252}{[B^{-1}(0.95, \alpha', \beta')]^{-1}} \right\rceil
$$
where $\lceil \cdot \rceil$ denotes the rounding operation.
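A minimal sketch of the three formulas above, plugging in the 252-day AAPL example with no recorded event ($N=252$, $n_e=0$); the numbers are illustrative only:

import numpy as np
from scipy.stats import beta

N, ne = 252, 0                 # trading days observed, rare events recorded
a, b = 1 + ne, 1 + N - ne      # posterior Beta(alpha', beta') from a Beta(1, 1) prior

pr = a / (a + b)               # posterior predictive Pr(L < -Lthr)
cl1 = beta.ppf(0.05, a, b)     # lower bound of the 95% confidence interval
cl2 = beta.ppf(0.95, a, b)     # upper bound of the 95% confidence interval
Ene = np.ceil(252 * cl2)       # expected number of such events over the next 252 days

print("Pr = %.4f%%  CI = [%.4f%%, %.4f%%]  E(ne) = %g"
      % (pr*100, cl1*100, cl2*100, Ene))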

2. Analysis of Facebook since its IPO

Having a tool that includes information on whether an event took place or not is powerful. You may also notice that the more information we gather (i.e. $N\to\infty$), the closer we are to the “true” probability function. It also means that the spread around a new posterior predictive mean,
$$
\sigma = \sqrt{ \frac{\alpha' \beta'} { (\alpha'+\beta')^2 (\alpha'+\beta'+1) } } \ ,
$$
tends to zero. It is logical that the analysis of an asset traded daily for the past 19 years will deliver a “more trustworthy” estimate of $\mbox{Pr}(L < -L_{thr})$ than a situation where we take into account only a limited period of time. However, what if the stock is a newbie in the market?
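A quick check of that shrinkage (a sketch; the two sample sizes are arbitrary):

from math import sqrt

def posterior_sigma(N, ne, a0=1, b0=1):
    # standard deviation of the Beta posterior after N days with ne rare events
    a, b = a0 + ne, b0 + N - ne
    return sqrt(a * b / ((a + b)**2 * (a + b + 1)))

print(posterior_sigma(252, 0))       # ~1 year of daily data
print(posterior_sigma(19 * 252, 0))  # ~19 years of daily data: a much tighter posterior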

Let’s investigate the case of Facebook (NASDAQ:FB). The record we have spans a bit over 3.5 years, back to Facebook’s initial public offering (IPO) on May 18th, 2012. From that day until today (December 6, 2015) the return-series (daily data) is $M=892$ points long. Using the analytical formulations given above and writing a Python code:

# Predicting Heavy and Extreme Losses in Real-Time for Portfolio Holders (2)
# (c) 2015 Pawel Lachowicz, QuantAtRisk.com
 
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt
import pandas_datareader.data as web
 
 
# Fetching Yahoo! Finance for Facebook
data = web.DataReader("FB", data_source='yahoo',
                  start='2012-05-18', end='2015-12-04')['Adj Close']
cp = np.array(data.values)  # daily adj-close prices
ret = cp[1:]/cp[:-1] - 1    # compute daily returns
N = len(ret)
 
# Plotting FB price- and return-series
plt.figure(num=2, figsize=(9, 6))
plt.subplot(2, 1, 1)
plt.plot(cp)
plt.axis("tight")
plt.ylabel("FB Adj Close [USD]")
plt.subplot(2, 1, 2)
plt.plot(ret, color=(.6, .6, .6))
plt.axis("tight")
plt.ylabel("Daily Returns")
plt.xlabel("Time Period [days]")
#plt.show()
 
# provide a threshold
Lthr = -0.21
 
# how many events of L < -21% occured?
ne = np.sum(ret < Lthr)
# of what magnitude?
ev = ret[ret < Lthr]
# avgloss = np.mean(ev)  # if ev is non-empty array
 
 
# prior
alpha0 = beta0 = 1
# posterior
alpha1 = alpha0 + ne
beta1 = beta0 + N - ne
pr = alpha1/(alpha1+beta1)
cl1 = beta.ppf(0.05, alpha1, beta1)
cl2 = beta.ppf(0.95, alpha1, beta1)
ne252 = np.round(252/(1/cl2))
 
print("ne = %g" % ne)
print("alpha', beta' = %g, %g" % (alpha1, beta1))
print("Pr(L < %3g%%) = %5.2f%%\t[%5.2f%%, %5.2f%%]" % (Lthr*100, pr*100,
                                                       cl1*100, cl2*100))
print("E(ne) = %g" % ne252)

first we visualise both the price- and return-series:
[Figure: FB adjusted close price-series (top) and daily returns (bottom)]
and from the computations we find:

ne = 0
alpha', beta' = 1, 893
Pr(L < -21%) =  0.11%	[ 0.01%,  0.33%]
E(ne) = 1

There is no record of an event within the 892 days of FB trading, $n_e = 0$, in which the stock lost 21% or more in a single day. However, the probability that such an event may happen at the end of the next NASDAQ session is 0.11%. Non-zero, but nearly zero.

We obtain a quick verification of the probability estimate returned by our Bayesian method by changing the value of the threshold from -0.21 to Lthr = 0. Then we get:

alpha', beta' = 422, 472
Pr(L <   0%) = 47.20%	[44.46%, 49.95%]
E(ne) = 126

i.e. a 47.2% chance that on the next day the stock will close lower. This is in excellent agreement with what can be calculated based on the return-series, i.e. there were 421 days with negative daily returns, thus $421/892 = 0.4719$. Really cool, don’t you think?!

Now, imagine for a second that after the first few days since the Facebook IPO we recalculate the posterior predictive mean, day by day, eventually reaching the most current estimate of the probability that the stock may lose 21% or more at the end of the next trading day. Such a procedure allows us to plot the change of $\mbox{Pr}(L < -L_{thr})$ over the whole trading history of FB. We achieve that by running a modified version of the previous code:

import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt
import pandas_datareader.data as web
 
 
data = web.DataReader("FB", data_source='yahoo',
                  start='2012-05-18', end='2015-12-05')['Adj Close']
cp = np.array(data.values)  # daily adj-close prices
returns = cp[1:]/cp[:-1] - 1   # compute daily returns
 
Lthr = -0.21
 
Pr = CL1 = CL2 = np.array([])
for i in range(2, len(returns)):
    # data window
    ret = returns[0:i]
    N = len(ret)
    ne = np.sum(ret < Lthr)
    # prior
    alpha0 = beta0 = 1
    # posterior
    alpha1 = alpha0 + ne
    beta1 = beta0 + N - ne
    pr = alpha1/(alpha1+beta1)
    cl1 = beta.ppf(0.05, alpha1, beta1)
    cl2 = beta.ppf(0.95, alpha1, beta1)
    #
    Pr = np.concatenate([Pr, np.array([pr*100])])
    CL1 = np.concatenate([CL1, np.array([cl1*100])])
    CL2 = np.concatenate([CL2, np.array([cl2*100])])
 
plt.figure(num=1, figsize=(10, 5))
plt.plot(Pr, "k")
plt.plot(CL1, "r--")
plt.plot(CL2, "r--")
plt.axis("tight")
plt.ylim([0, 5])
plt.grid((True))
plt.show()

which displays:
[Figure: Evolution of Pr(L < -21%) for FB with the 95% confidence interval (red dashed lines)]
i.e. after 892 days we reach the value of 0.11% as found earlier. The black line tracks the change of the probability estimate while both red dashed lines denote the 95% confidence interval. The black line is smooth because, due to the lack of a -21% event in the past, the recalculated probability did not change its value “drastically”.

This is not the case when we consider the scenario for Lthr = -0.11, delivering:
[Figure: Evolution of Pr(L < -11%) for FB with the 95% confidence interval]
which stays in agreement with:

ne = 1
alpha', beta' = 2, 892
Pr(L < -11%) =  0.22%	[ 0.04%,  0.53%]
E(ne) = 1

and confirms that only one daily loss of $L < -11\%$ took place during the entire happy life of Facebook on NASDAQ so far.

Since the computer can find the number of losses to be within a certain interval, that gives us an opportunity to estimate:
$$
\mbox{Pr}(L_1 \le L < L_2) $$ in the following way:

import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt
import pandas_datareader.data as web
 
 
data = web.DataReader("FB", data_source='yahoo',
                  start='2012-05-18', end='2015-12-05')['Adj Close']
cp = np.array(data.values)  # daily adj-close prices
ret = cp[1:]/cp[:-1] - 1    # compute daily returns
N = len(ret)
 
dL = 2
for k in range(-50, 0, dL):
    ev = ret[(ret >= k/100) & (ret < (k/100+dL/100))]
    avgloss = 0
    ne = np.sum((ret >= k/100) & (ret < (k/100+dL/100)))
    #
    # prior
    alpha0 = beta0 = 1
    # posterior
    alpha1 = alpha0 + ne
    beta1 = beta0 + N - ne
    pr = alpha1/(alpha1+beta1)
    cl1 = beta.ppf(0.05, alpha1, beta1)
    cl2 = beta.ppf(0.95, alpha1, beta1)
    if(len(ev) > 0):
        avgloss = np.mean(ev)
        print("Pr(%3g%% < L < %3g%%) =\t%5.2f%%\t[%5.2f%%, %5.2f%%]  ne = %4g"
          "  E(L) = %.2f%%" % \
          (k, k+dL, pr*100, cl1*100, cl2*100, ne, avgloss*100))
    else:
        print("Pr(%3g%% < L < %3g%%) =\t%5.2f%%\t[%5.2f%%, %5.2f%%]  ne = " \
              "%4g" %  (k, k+dL, pr*100, cl1*100, cl2*100, ne))

By doing so, we visualise the distribution of potential losses (that may occur on the next trading day) along with their magnitudes:

Pr(-50% < L < -48%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-48% < L < -46%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-46% < L < -44%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-44% < L < -42%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-42% < L < -40%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-40% < L < -38%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-38% < L < -36%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-36% < L < -34%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-34% < L < -32%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-32% < L < -30%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-30% < L < -28%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-28% < L < -26%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-26% < L < -24%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-24% < L < -22%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-22% < L < -20%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-20% < L < -18%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-18% < L < -16%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-16% < L < -14%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-14% < L < -12%) =	 0.11%	[ 0.01%,  0.33%]  ne =    0
Pr(-12% < L < -10%) =	 0.34%	[ 0.09%,  0.70%]  ne =    2  E(L) = -11.34%
Pr(-10% < L <  -8%) =	 0.67%	[ 0.29%,  1.17%]  ne =    5  E(L) = -8.82%
Pr( -8% < L <  -6%) =	 0.89%	[ 0.45%,  1.47%]  ne =    7  E(L) = -6.43%
Pr( -6% < L <  -4%) =	 2.91%	[ 2.05%,  3.89%]  ne =   25  E(L) = -4.72%
Pr( -4% < L <  -2%) =	 9.84%	[ 8.26%, 11.53%]  ne =   87  E(L) = -2.85%
Pr( -2% < L <   0%) =	33.11%	[30.54%, 35.72%]  ne =  295  E(L) = -0.90%

For example, based on the complete trading history of FB, there is a probability of 0.34% that the stock will close with a loss between -12% and -10%, and if so, the “expected” loss would be -11.3% (estimated based solely on 2 events).

Again. The probability, not certainty.

Student t Distributed Linear Value-at-Risk

One of the most underestimated features of financial asset distributions is their kurtosis. A rough approximation of the asset return distribution by the Normal distribution often becomes an evident distortion or misinterpretation of the facts. And we know that. The problem arises if we investigate a Value-at-Risk (VaR) measure. Within a standard approach, it is computed based on the analytical formula:
$$
\mbox{VaR}_{h, \alpha} = \Phi ^{-1} (1-\alpha)\sqrt{h}\sigma - h\mu
$$
which expresses the $(1-\alpha)$ $h$-day VaR, where $\Phi ^{-1} (1-\alpha)$ denotes the standardised Normal distribution $(1-\alpha)$ quantile and $h$ is the time horizon. The majority of stocks display significant excess kurtosis and negative skewness. In plain English that means that the distribution is peaked and likely to have a heavy left tail. In this scenario, the best fit of the Normal probability density function (pdf) to the asset return distribution underestimates the risk accumulated in the far negative territory.

In this post we will introduce the concept of the Student t Distributed Linear VaR, i.e. an alternative method to measure VaR that takes into account “a correction” for the kurtosis of the asset return distribution. We will use the market stock data of IBM as an exemplary case study and investigate the difference between a standard and a non-standard VaR calculation based on parametric models. It will also be an excellent opportunity to learn how to do it in Python, quickly and effectively.

1. Leptokurtic Distributions and Student t VaR

A leptokurtic distribution is one whose density function has a higher peak and greater mass in the tails than the normal density function of the same variance. In a symmetric unimodal distribution, i.e. one whose density function has only one peak, leptokurtosis is indicated by a positive excess kurtosis (Alexander 2008). The impact of leptokurtosis on VaR is non-negligible. Firstly, assuming high significance levels, e.g. $\alpha \le 0.01$, the estimation of VaR for Normal and leptokurtic distribution would return:
$$
\mbox{VaR}_{lepto,\ \alpha\le 0.01} \gt \mbox{VaR}_{Norm,\ \alpha\le 0.01}
$$
which is fine, however for low significance levels the situation may change:
$$
\mbox{VaR}_{lepto,\ \alpha\ge 0.05} \lt \mbox{VaR}_{Norm,\ \alpha\ge 0.05} \ .
$$
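A quick numerical illustration of that crossover (my sketch; $\nu=4$ is chosen only as an example of a strongly leptokurtic case, and the t quantile is rescaled so that both distributions have unit variance):

from scipy.stats import norm, t

nu = 4  # degrees of freedom, purely illustrative
for a in (0.01, 0.05):
    q_t = ((nu - 2) / nu)**0.5 * t.ppf(1 - a, nu)  # variance-matched Student t quantile
    q_n = norm.ppf(1 - a)                          # Normal quantile
    print("alpha = %.2f:  Student t %.3f  vs  Normal %.3f" % (a, q_t, q_n))

For $\alpha=0.01$ the Student t quantile comes out larger than the Normal one, while for $\alpha=0.05$ the ordering reverses.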
A good example of a leptokurtic distribution is the Student t distribution. If it describes the data better and displays significant excess kurtosis, there is a good chance that VaR for high significance levels will be a much better measure of the tail risk. The Student t distribution is formally given by its pdf:
$$
f_\nu(x) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\, \Gamma\left(\frac{\nu}{2}\right)}
(1+\nu^{-1} x^2)^{-(\nu+1)/2}
$$
where $\Gamma$ denotes the gamma function and $\nu$ the degrees of freedom. For $\nu\gt 2$ its variance is $\nu(\nu-2)^{-1}$ and the distribution has a finite excess kurtosis of $\kappa = 6(\nu-4)^{-1}$ for $\nu \gt 4$. It is obvious that for $\nu\to \infty$ the Student t pdf approaches the Normal pdf. Large values of $\nu$ for the fitted Student t pdf in the case of financial asset daily return distributions are unlikely, unless the Normal distribution is the best pdf model indeed (possible to obtain for very large samples).
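Both the variance and the excess-kurtosis formulas can be cross-checked directly against scipy (a small sanity check I add here):

from scipy.stats import t

nu = 6.0                               # any nu > 4 will do
var, kurt = t.stats(nu, moments='vk')  # variance and (excess) kurtosis
print(var, nu / (nu - 2))              # both should be 1.5
print(kurt, 6 / (nu - 4))              # both should be 3.0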

We find the $\alpha$ quantile of the Student t distribution ($\mu=0$ and $\sigma=1$) by integration of the pdf and usually denote it as $\sqrt{\nu^{-1}(\nu-2)}\ t_{\nu}^{-1}(\alpha)$, where $t_{\nu}^{-1}$ is the percent point function of the t distribution. Accounting for the fact that in practice we deal with random variables of the form $X = \mu + \sigma T$, where $X$ describes the asset return given as a transformation of the standardised Student t random variable, we reach the formal definition of the $(1-\alpha)$ $h$-day Student t VaR given by:
$$
\mbox{Student}\ t\ \mbox{VaR}_{\nu, \alpha, h} = \sqrt{\nu^{-1}(\nu-2)h}\ t_{\nu}^{-1}(1-\alpha)\sigma - h\mu
$$
Having that, we are ready to use this knowledge in practice and to understand, based on data analysis, the best regimes where one should (and should not) apply the Student t VaR as a more “correct” measure of risk.

2. Normal and Student t VaR for IBM Daily Returns

Within the following case study we will make use of the Yahoo! Finance data provider in order to fetch the price-series of the IBM stock (adjusted close) and transform it into a return-series. The first task we have to complete before heading further down this road is the verification of the sample skewness and kurtosis. We begin, as usual:

# Student t Distributed Linear Value-at-Risk
#   The case study of VaR at the high significance levels
# (c) 2015 Pawel Lachowicz, QuantAtRisk.com
 
import numpy as np
import math
from scipy.stats import skew, kurtosis, kurtosistest
import matplotlib.pyplot as plt
from scipy.stats import norm, t
import pandas_datareader.data as web
 
 
# Fetching Yahoo! Finance for IBM stock data
data = web.DataReader("IBM", data_source='yahoo',
                  start='2010-12-01', end='2015-12-01')['Adj Close']
cp = np.array(data.values)  # daily adj-close prices
ret = cp[1:]/cp[:-1] - 1    # compute daily returns
 
# Plotting IBM price- and return-series
plt.figure(num=2, figsize=(9, 6))
plt.subplot(2, 1, 1)
plt.plot(cp)
plt.axis("tight")
plt.ylabel("IBM Adj Close [USD]")
plt.subplot(2, 1, 2)
plt.plot(ret, color=(.6, .6, .6))
plt.axis("tight")
plt.ylabel("Daily Returns")
plt.xlabel("Time Period 2010-12-01 to 2015-12-01 [days]")

where we cover the most recent 5 years of IBM trading. Both series visualised look like this:
[Figure: IBM adjusted close price-series (top) and daily returns (bottom), 2010-12-01 to 2015-12-01]
where, in particular, the bottom one reveals five days on which the stock experienced a 5% or higher daily loss.

In Python, a correct computation of the sample skewness and kurtosis is possible thanks to the scipy.stats module. As a test you may verify that for a huge sample (e.g. of 10 million) of random variables $X\sim N(0, 1)$, the expected skewness and kurtosis,

from scipy.stats import skew, kurtosis
X = np.random.randn(10000000)
print(skew(X))
print(kurtosis(X, fisher=False))

are indeed equal to 0 and 3,

0.0003890933008049
3.0010747070253507

respectively.

As we will convince ourselves in a second, this is not the case for our IBM return sample. Continuing,

print("Skewness  = %.2f" % skew(ret))
print("Kurtosis  = %.2f" % kurtosis(ret, fisher=False))
# H_0: the null hypothesis that the kurtosis of the population from which the
# sample was drawn is that of the normal distribution kurtosis = 3(n-1)/(n+1)
_, pvalue = kurtosistest(ret)
beta = 0.05
print("p-value   = %.2f" % pvalue)
if(pvalue < beta):
    print("Reject H_0 in favour of H_1 at %.5f level\n" % beta)
else:
    print("Accept H_0 at %.5f level\n" % beta)

we compute the moments and run the (excess) kurtosis significance test. As explained in the comment, we are interested in whether the derived value of kurtosis is significant at the $\beta=0.05$ level or not. For IBM, the results reveal:

Skewness  = -0.70
Kurtosis  = 8.39
p-value   = 0.00
Reject H_0 in favour of H_1 at 0.05000 level

id est, we reject the null hypothesis that the kurtosis of the population from which the sample was drawn is that of the normal distribution, in favour of the alternative hypothesis ($H_1$) that it is not. And if not, then we have work to do!

In the next step we fit the IBM daily returns data sample with both Normal pdf and Student t pdf in the following way:

# N(x; mu, sig) best fit (finding: mu, stdev)
mu_norm, sig_norm = norm.fit(ret)
dx = 0.0001  # resolution
x = np.arange(-0.1, 0.1, dx)
pdf = norm.pdf(x, mu_norm, sig_norm)
print("Integral norm.pdf(x; mu_norm, sig_norm) dx = %.2f" % (np.sum(pdf*dx)))
print("Sample mean  = %.5f" % mu_norm)
print("Sample stdev = %.5f" % sig_norm)
print()
 
# Student t best fit (finding: nu)
parm = t.fit(ret)
nu, mu_t, sig_t = parm
pdf2 = t.pdf(x, nu, mu_t, sig_t)
print("Integral t.pdf(x; mu, sig) dx = %.2f" % (np.sum(pdf2*dx)))
print("nu = %.2f" % nu)
print()

where we ensure that the complete integrals over both pdfs return 1, as expected:

Integral norm.pdf(x; mu_norm, sig_norm) dx = 1.00
Sample mean  = 0.00014
Sample stdev = 0.01205
 
Integral t.pdf(x; mu, sig) dx = 1.00
nu = 3.66

The best fit of the Student t pdf returns the estimate of 3.66 degrees of freedom.

From this point, the last step is the derivation of two distinct VaR measures, the first one corresponding to the best fit of the Normal pdf and the second one to the best fit of the Student t pdf, respectively. The trick is

# Compute VaR
h = 1  # days
alpha = 0.01  # significance level
StudenthVaR = (h*(nu-2)/nu)**0.5 * t.ppf(1-alpha, nu)*sig_norm - h*mu_norm
NormalhVaR = norm.ppf(1-alpha)*sig_norm - mu_norm
 
lev = 100*(1-alpha)
print("%g%% %g-day Student t VaR = %.2f%%" % (lev, h, StudenthVaR*100))
print("%g%% %g-day Normal VaR    = %.2f%%" % (lev, h, NormalhVaR*100))

that in the calculation of the $(1-\alpha)$ $1$-day Student t VaR ($h=1$) according to:
$$
\mbox{Student}\ t\ \mbox{VaR}_{\nu, \alpha=0.01, h=1} = \sqrt{\nu^{-1}(\nu-2)}\ t_{\nu}^{-1}(0.99)\sigma - \mu
$$
we have to take $\sigma$ and $\mu$ given by the sig_norm and mu_norm variables (in the code) and not the ones returned by the t.fit(ret) function. We find that:

99% 1-day Student t VaR = 3.19%
99% 1-day Normal VaR    = 2.79%

i.e. the VaR derived from the best fit of the Normal pdf may underestimate the tail risk for IBM, as suspected in the presence of highly significant excess kurtosis. Since the beauty of the code is aesthetic, the picture is romantic. We visualise the results of our investigation, extending our code with:

plt.figure(num=1, figsize=(11, 6))
grey = .77, .77, .77
# main figure
plt.hist(ret, bins=50, density=True, color=grey, edgecolor='none')
plt.axis("tight")
plt.plot(x, pdf, 'b', label="Normal PDF fit")
plt.axis("tight")
plt.plot(x, pdf2, 'g', label="Student t PDF fit")
plt.xlim([-0.2, 0.1])
plt.ylim([0, 50])
plt.legend(loc="best")
plt.xlabel("Daily Returns of IBM")
plt.ylabel("Normalised Return Distribution")
# inset
a = plt.axes([.22, .35, .3, .4])
plt.hist(ret, bins=50, density=True, color=grey, edgecolor='none')
plt.plot(x, pdf, 'b')
plt.plot(x, pdf2, 'g')
# Student VaR line
plt.plot([-StudenthVaR, -StudenthVaR], [0, 3], c='g')
# Normal VaR line
plt.plot([-NormalhVaR, -NormalhVaR], [0, 4], c='b')
plt.text(-NormalhVaR-0.01, 4.1, "Norm VaR", color='b')
plt.text(-StudenthVaR-0.0171, 3.1, "Student t VaR", color='g')
plt.xlim([-0.07, -0.02])
plt.ylim([0, 5])
plt.show()

which brings us to:
[Figure: Normalised distribution of IBM daily returns with the Normal (blue) and Student t (green) pdf fits; inset: zoom on the left tail with both VaR levels marked]
where, in the inset, we zoom in on the left tail of the distribution, marking both VaR results. Now it is much more clearly visible why, at $\alpha=0.01$, the Student t VaR is greater: the effect of the leptokurtic distribution. Moreover, the fit of the Student t pdf appears to be a far better parametric model describing the central mass (density) of IBM daily returns.

There is an ongoing debate over which of these two risk measures should be formally applied when it comes to reporting 1-day VaR. I do hope that just by presenting the above-mentioned results I have made you more aware that: (1) it is better to have a closer look at the excess kurtosis of your favourite asset (portfolio) return distribution, and (2) normal is boring.

Someone wise once said: “Don’t be afraid of being different. Be afraid of being the same as everyone else.” The same means normal. Dare to fit non-Normal distributions!

DOWNLOAD
     tStudentVaR.py

REFERENCES
     Alexander, C., 2008, Market Risk Analysis. Volume IV., John Wiley & Sons Ltd

Recovery of Financial Price-Series based on Daily Returns Matrix in Python

Lesson 10>>

As a financial analyst or algo trader, you are often faced with information on, inter alia, daily asset trading in the form of a daily returns matrix. In many cases, it is easier to operate with return-series rather than with price-series. And there are excellent reasons standing behind such a decision, e.g. the possibility to plot the histogram of daily returns, the calculation of daily Value-at-Risk (VaR), etc.

When you use Python (not Matlab), the recovery of price-series from return-series may be a bit of a challenge, especially when you face the problem for the first time. A technical problem, i.e. “how to do it?!” within Python, requires you to switch your thinking mode and adjust your vantage point from Matlab-ish to Pythonic. Therefore, let’s see what the best recipe is to turn your world upside down?!

Say we start with a portfolio of $N$ assets traded over the last $L+1$ days. It will require the use of Python’s NumPy arrays. We begin with:

import numpy as np
import matplotlib.pyplot as plt
 
np.random.seed(2014)
 
# define portfolio
N = 5  # number of assets
L = 4  # number of days
 
# asset close prices
p = np.random.randint(10, 30, size=(N, 1)) + \
    np.random.randn(N, 1)  # a mixture of uniform and N(0,1) rvs
print(p)
print(p.shape)
print()

where we specify the number of assets in the portfolio and the number of days. $L = 4$ has been selected for clarity of printing the outcomes below; however, feel free to increase that number (or both) any time you rerun this code.

Next, we create an ($N\times 1$) matrix of starting random prices for all $N$ assets: a random integer between \$10 and \$30 supplemented by an N(0, 1) random component. Printing p returns:

[[ 25.86301396]
 [ 19.82072772]
 [ 22.33569347]
 [ 21.38584671]
 [ 24.56983489]]
(5, 1)

Now, let’s generate a matrix of random daily returns over the next 4 days for all 5 assets:

r = np.random.randn(N, L)/50
print(r)
print(r.shape)

delivering:

[[ 0.01680965 -0.00620443 -0.02876535 -0.03946471]
 [-0.00467748 -0.0013034   0.02112921  0.01095789]
 [-0.01868982 -0.01764086  0.01275301  0.00858922]
 [ 0.01287237 -0.00137129 -0.0135271   0.0080953 ]
 [-0.00615219 -0.03538243  0.01031361  0.00642684]]
(5, 4)

Having that, our wish is, for each asset, to take its first close-price value from the p array and, using the information on daily returns stored row by row (i.e. asset by asset) in the r array, reconstruct the close-price time-series of the asset:

for i in range(r.shape[0]):
    tmp = []
    for j in range(r.shape[1]):
        if(j == 0):
            tmp.append(p[i][0].tolist())
            y = p[i] * (1 + r[i][j])
        else:
            y = y * (1 + r[i][j])
        tmp.append(y.tolist()[0])
 
    if(i == 0):
        P = np.array(tmp)
    else:
        P = np.vstack([P, np.array(tmp)])
 
print()
print(P)
print(P.shape)
print()

That returns:

[[ 25.86301396  26.29776209  26.13459959  25.38282873  24.38110263]
 [ 19.82072772  19.72801677  19.70230331  20.11859739  20.33905475]
 [ 22.33569347  21.91824338  21.53158666  21.8061791   21.99347727]
 [ 21.38584671  21.66113324  21.63142965  21.33881915  21.51156325]
 [ 24.56983489  24.41867671  23.55468454  23.79761839  23.95056194]]
(5, 5)

Thus, we have two loops: the outer one over rows/assets (index i) and the inner one over columns/days (index j). For j = 0 we copy the price of the asset from p as a “starting close price”, e.g. on the first day. Concurrently, using the first entry of the r matrix we compute the change in price on the next day. In the tmp list we store (per asset) the history of close-price changes over all L+1 days. These operations are based on simple list processing. Finally, having complete information on the i-th asset and its price changes after r.shape[1] + 1 days, we build the new array P with the aid of the np.vstack function (see more in Section 3.3.4 of Python for Quants. Volume I). Therefore, P stores the simulated close-price time-series for all N assets. A vectorised alternative is sketched right below.
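For completeness, the same reconstruction can be obtained without explicit loops. A vectorised sketch (continuing the script above, i.e. assuming p, r, and P are already defined):

# cumulative growth factors per asset, then scale by the starting prices
growth = np.cumprod(1 + r, axis=1)   # shape (N, L)
P2 = np.hstack([p, p * growth])      # shape (N, L+1), day 0 included
print(np.allclose(P, P2))            # True: both methods agree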

We can display them by adding to our main code:

plt.figure(num=1, figsize=(8, 5))
plt.plot(P.T, '+-')
plt.xlabel("Days")
plt.ylabel("Asset Close Price (\$)")

which reveals:
[Figure: Simulated close-price series for all 5 assets over 5 days]
where the transposition of P for plotting has been applied to deliver asset-by-asset price-series (try to plot the array without it, see what happens, and understand why it is so).

The easiest solutions are the most romantic, as we have proven above. You don’t impress your girlfriend by buying 44 red roses. 44 lines of Python code will do the same! Trust me on that! ;-)

This lesson comes from a fragment of my newest book, Python for Quants. Volume I. The bestselling book within the first 3 days after publishing!

5 Words on How To Write A Quant Blog

An extract from Jacques Joubert’s newest article on How To Write A Great Quant Blog.

by Pawel Lachowicz

Do not commence working over your blog without the vision. “If you don’t know where you are going, any road will get you there!” You want to avoid that mistake. Spend some time dreaming of the final form of your site.

Highly sought after content is important but not as much as your commitment to excel in its delivery. Write from your heart. Listen to your inner voice. Follow your own curiosity. Not the trends. Not what would be profitable. Forget about the noise in the blogosphere. If you want to shine as a diamond, you need to get cut as a diamond.

Your blog should be a reflection of your personality. Forget about achieving success quickly. It will take time. It will count in years. Just remember: “Success [in this venture] is something that you attract by becoming an attractive person”.

Don’t rush. Make a plan. Build your blog around the uniqueness of your writing. And don’t worry about “instant” followers.

Lastly, write less frequently. However, deliver pearls. Not plums.

How to Get a List of all NASDAQ Securities as a CSV file using Python?

This post will be short but very informative. You can learn a few good Unix/Linux tricks along the way. The goal is well defined in the title. So, what’s the quickest solution? We will make use of Python in a Unix-based environment. As you will see, for any text file, writing a single line of Unix commands is more than enough to deliver exactly what we need (basic text file processing). If you try to do the same in Windows… well, good luck!

In general, we need to get through the FTP gate of NASDAQ heaven. It is sufficient to log on as an anonymous user, providing a password defined by your email. In fact, any fake email will do the job. Let’s begin coding in Python:

# How to Get a List of all NASDAQ Securities as a CSV file using Python?
# +tested in Python 3.5.0b2, Mac OS X 10.10.3
#
# (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
 
import os
 
os.system("curl --ftp-ssl anonymous:jupi@jupi.com "
          "ftp://ftp.nasdaqtrader.com/SymbolDirectory/nasdaqlisted.txt "
          "> nasdaq.lst")

Here we use the os module from Python’s Standard Library and the Unix command curl. The latter allows us to connect to the FTP server of the NASDAQ exchange, fetch the file nasdaqlisted.txt usually stored in the SymbolDirectory directory, and download it directly to our current folder under the given name of nasdaq.lst. During that process you will see the progress information displayed by curl, e.g.:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   162  100   162    0     0    125      0  0:00:01  0:00:01 --:--:--   125
100  174k  100  174k    0     0  23409      0  0:00:07  0:00:07 --:--:-- 39237

Now, in order to inspect the content of the downloaded file we may run in Python an extra line of code, namely:

os.system("head -20 nasdaq.lst")
print()

which displays the first 20 lines from the top:

<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>
Symbol|Security Name|Market Category|Test Issue|Financial Status|Round Lot Size
AAIT|iShares MSCI All Country Asia Information Technology Index Fund|G|N|N|100
AAL|American Airlines Group, Inc. - Common Stock|Q|N|N|100
AAME|Atlantic American Corporation - Common Stock|G|N|D|100
AAOI|Applied Optoelectronics, Inc. - Common Stock|G|N|N|100
AAON|AAON, Inc. - Common Stock|Q|N|N|100
AAPC|Atlantic Alliance Partnership Corp. - Ordinary Shares|S|N|N|100
AAPL|Apple Inc. - Common Stock|Q|N|N|100
AAVL|Avalanche Biotechnologies, Inc. - Common Stock|G|N|N|100
AAWW|Atlas Air Worldwide Holdings - Common Stock|Q|N|N|100
AAXJ|iShares MSCI All Country Asia ex Japan Index Fund|G|N|N|100
ABAC|Aoxin Tianli Group, Inc. - Common Shares|S|N|N|100
ABAX|ABAXIS, Inc. - Common Stock|Q|N|N|100

As you can see, we are not interested in the first 8 lines of the file. Before cleaning up that mess, let’s inspect the “happy ending” as well:

os.system("tail -5 nasdaq.lst")
print()

displaying

ZVZZT|NASDAQ TEST STOCK|G|Y|N|100
ZWZZT|NASDAQ TEST STOCK|S|Y|N|100
ZXYZ.A|Nasdaq Symbology Test Common Stock|Q|Y|N|100
ZXZZT|NASDAQ TEST STOCK|G|Y|N|100
File Creation Time: 0624201511:02|||||

Again, we notice that the last line does not make our housewarming party any merrier.

Given that information, we employ a heavy but smart one-liner making use of the immortal Unix commands cat and sed in a pipe (pipeline process). The next call in our Python code therefore does 3 miracles in one shot. Have a look:

os.system("tail -n +9 nasdaq.lst | cat | sed '$d' | sed 's/|/ /g' > "
          "nasdaq.lst2")

If you view the output file nasdaq.lst2, you will see that its content is exactly as we wanted it to be, i.e.:

$ echo; head nasdaq.lst2; echo "..."; tail nasdaq.lst2
 
AAIT iShares MSCI All Country Asia Information Technology Index Fund G N N 100
AAL American Airlines Group, Inc. - Common Stock Q N N 100
AAME Atlantic American Corporation - Common Stock G N D 100
AAOI Applied Optoelectronics, Inc. - Common Stock G N N 100
AAON AAON, Inc. - Common Stock Q N N 100
AAPC Atlantic Alliance Partnership Corp. - Ordinary Shares S N N 100
AAPL Apple Inc. - Common Stock Q N N 100
AAVL Avalanche Biotechnologies, Inc. - Common Stock G N N 100
AAWW Atlas Air Worldwide Holdings - Common Stock Q N N 100
AAXJ iShares MSCI All Country Asia ex Japan Index Fund G N N 100
...
ZNGA Zynga Inc. - Class A Common Stock Q N N 100
ZNWAA Zion Oil & Gas Inc - Warrants G N N 100
ZSAN Zosano Pharma Corporation - Common Stock S N N 100
ZSPH ZS Pharma, Inc. - Common Stock G N N 100
ZU zulily, inc. - Class A Common Stock Q N N 100
ZUMZ Zumiez Inc. - Common Stock Q N N 100
ZVZZT NASDAQ TEST STOCK G Y N 100
ZWZZT NASDAQ TEST STOCK S Y N 100
ZXYZ.A Nasdaq Symbology Test Common Stock Q Y N 100
ZXZZT NASDAQ TEST STOCK G Y N 100

The command

tail -n +9 nasdaq.lst

lists all lines of the file starting from line 9, i.e. skipping the first eight lines. Next, we push that output through a pipe and list it as a whole using the cat command. In the next step that output is processed by the sed commands: (a) the first one removes the last line; (b) the second one replaces all "|" tokens with a space. Finally, the processed output is saved as the nasdaq.lst2 file. The power of Unix in a single line. After 15 years of using it I’m still smiling to myself doing that :)
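If you are on Windows, or simply prefer to stay within Python, the same three-step clean-up can be sketched without any Unix tools (my alternative, assuming nasdaq.lst has already been downloaded as above):

# pure-Python equivalent of: tail -n +9 nasdaq.lst | sed '$d' | sed 's/|/ /g'
with open("nasdaq.lst") as f:
    lines = f.read().splitlines()

cleaned = [line.replace("|", " ") for line in lines[8:-1]]  # drop first 8 lines and the last one

with open("nasdaq.lst2", "w") as f:
    f.write("\n".join(cleaned) + "\n")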

All right. What is left? Getting a list of tickers and storing it in a CSV file. Piece of cake. Here we employ the Unix command awk in the following way:

os.system("awk '{print $1}' nasdaq.lst2 > nasdaq.csv")
os.system("echo; head nasdaq.csv; echo '...'; tail nasdaq.csv")

which returns

AAIT
AAL
AAME
AAOI
AAON
AAPC
AAPL
AAVL
AAWW
AAXJ
...
ZNGA
ZNWAA
ZSAN
ZSPH
ZU
ZUMZ
ZVZZT
ZWZZT
ZXYZ.A
ZXZZT

i.e. an isolated list of NASDAQ tickers stored in the nasdaq.csv file. From this point, you can read it into a pandas DataFrame as follows:

import pandas as pd
data = pd.read_csv("nasdaq.csv", index_col=None, header=None)
data.columns=["Ticker"]
print(data)

displaying

      Ticker
0       AAIT
1        AAL
2       AAME
3       AAOI
4       AAON
5       AAPC
...
 
[3034 rows x 1 columns]

That’s it.

In the following post, I will make use of that list to fetch the stock trading data and analyse the distribution of extreme values: the gateway to the prediction of extreme and heavy losses for every portfolio holder (part 2 out of 3). Stay tuned!

DOWNLOADS
   nasdaqtickers.py

RELATED POSTS
   How to Find a Company Name given a Stock Ticker Symbol utilising Quandl API
   Predicting Heavy and Extreme Losses in Real-Time for Portfolio Holders (1)

Predicting Heavy and Extreme Losses in Real-Time for Portfolio Holders (1)

The probability of improbable events. The simplicity amongst complexity. The purity in its best form. The ultimate cure for those who trade, for those who invest. Does it exist? Can we compute it? Is it really something impossible? In this post we challenge ourselves to the frontiers of accessible statistics and data analysis in order to find most optimal computable solutions to this enigmatic but mind-draining problem. Since there is no certainty, everything remains shrouded in the veil of probability. The probability we have to face and hope that something unlikely, determined to take place anyhow, eventually will not happen.


Our goal is to calculate the probability of a very rare event (e.g. a heavy and/or extreme loss) in the trading market (e.g. of a stock plummeting 5% or much more) in a specified time-horizon (e.g. on the next day, in one week, in one month, etc.). The probability. Not the certainty of that event.

In this Part 1, first, we look at the tail of an asset return distribution and compress our knowledge on Value-at-Risk (VaR) to extract the essence required to understand why VaR-stuff is not the best card in our deck. Next, we move to a classical Bayes’ theorem which helps us to derive a conditional probability of a rare event given… yep, another event that (hypothetically) will take place. Eventually, in Part 2, we will hit the bull between its eyes with an advanced concept taken from the Bayesian approach to statistics and map, in real-time, for any return-series its loss probabilities. Again, the probabilities, not certainties.

1. VaR (not) for Rare Events

In the framework of VaR we take into consideration $T$ days of trading history of an asset. Next, we derive a number (VaR) that describes a loss that is likely to take place with a probability of approximately $\alpha$. “To take place” does not mean here that it will take place. In this approach we try to provide some likelihood (a quantitative measure) of the rare event in a specified time-horizon (e.g. on the next day if daily return-series are under our VaR investigation; the scenario considered in this post).

If by $L$ we denote a loss (in percent) an asset can experience on the next day, then:
$$
\mbox{Pr}(L \le -\mbox{VaR}_{1-\alpha}) = \alpha
would be the probability of a loss of $-L\times D$ dollars, where $-L\times D\ge -\mbox{VaR}\times D$, equal, for instance, to $\alpha=0.05$ (also referred to as the $(1-\alpha)$% VaR measure; $\mbox{VaR}_{95}$, etc.), and $D$ is the position size (money invested in the asset in terms of physical currency, e.g. in dollars). In other words, the historical data can help us to find $\mbox{VaR}_{95}$ given $\alpha$, assuming a 5% chance that $\mbox{VaR}_{1-\alpha}$ will be exceeded on the next day.
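Before we do it properly with histograms below, note that a rough empirical $(1-\alpha)$ VaR can be sketched in one line as the $\alpha$-quantile of the return sample (my shortcut, shown here on synthetic returns only):

import numpy as np

def quick_var(ret, alpha=0.05):
    # empirical (1-alpha) VaR: the alpha-quantile of the daily return sample
    return np.percentile(ret, 100 * alpha)

np.random.seed(1)
ret = 0.01 * np.random.standard_t(df=4, size=2500)  # synthetic heavy-tailed returns
print("95%% VaR = %.2f%%" % (100 * quick_var(ret, 0.05)))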

In order to illustrate that case and its shortcomings when it comes to the analysis of rare events, let’s look at the 10-year trading history of two stocks in the NASDAQ market: the highly volatile CAAS (China Automotive Systems, Inc.; with a market capitalisation of 247M) and the highly liquid AAPL (Apple Inc.; with a market capitalisation of 750B). First, we fetch their adjusted close price-series from Yahoo! Finance and derive the corresponding daily return-series utilising Python:

# Predicting Heavy and Extreme Losses in Real-Time for Portfolio Holders
# (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
#
# heavy1.py
 
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import numpy as np
from pyvar import findvar, findalpha
 
 
# ---1. Data Processing
 
# fetch and download daily adjusted-close price series for CAAS
#  and AAPL stocks using Yahoo! Finance public data provider
caas = web.DataReader("CAAS", data_source='yahoo',
                      start='2005-05-13', end='2015-05-13')['Adj Close']
aapl = web.DataReader("AAPL", data_source='yahoo',
                      start='2005-05-13', end='2015-05-13')['Adj Close']
 
CAAScp = np.array(caas.values)
AAPLcp = np.array(aapl.values)
 
f = file("data1.dat","wb")
np.save(f, CAAScp)
np.save(f, AAPLcp)
f.close()
 
# read in the data from a file
f = file("data1.dat","rb")
CAAScp = np.load(f)
AAPLcp = np.load(f)
f.close()
 
# compute return-series
retCAAS = CAAScp[1:]/CAAScp[:-1]-1
retAAPL = AAPLcp[1:]/AAPLcp[:-1]-1

The best way to understand the data is by plotting them:

# plotting (figure #1)
#  adjusted-close price-series
fig, ax1 = plt.subplots(figsize=(10, 6))
plt.xlabel("Trading days 13/05/2005-13/05/2015")
plt.plot(CAAScp, '-r', label="CAAS")
plt.axis("tight")
plt.legend(loc=(0.02, 0.8))
plt.ylabel("CAAS Adj Close Price (US$)")
ax2 = ax1.twinx()
plt.plot(AAPLcp, '-', label="AAPL")
plt.legend(loc=(0.02, 0.9))
plt.axis("tight")
plt.ylabel("AAPL Adj Close Price (US$)")
 
# plotting (figure #2)
#  daily return-series
plt.figure(num=2, figsize=(10, 6))
plt.subplot(211)
plt.grid(True)
plt.plot(retCAAS, '-r', label="CAAS")
plt.axis("tight")
plt.ylim([-0.25,0.5])
plt.legend(loc="upper right")
plt.ylabel("CAAS daily returns")
plt.subplot(212)
plt.grid(True)
plt.plot(retAAPL, '-', label="AAPL")
plt.legend(loc="upper right")
plt.axis("tight")
plt.ylim([-0.25,0.5])
plt.ylabel("AAPL daily returns")
plt.xlabel("Trading days 13/05/2005-13/05/2015")

We obtain the price-series
[Figure: CAAS and AAPL adjusted close price-series, 13/05/2005-13/05/2015]
and return-series
[Figure: CAAS and AAPL daily return-series plotted on a common y-scale]

respectively. For the latter plot, by fixing the scaling of both $y$-axes we immediately gain a chance to inspect the number of daily trades closing with heavy losses. Well, at least at first glance and for both directions of trading. In this post we will be considering long positions only.

Having our data pre-processed, we may implement two different strategies to make use of the VaR framework in order to work out the probabilities of tail events. The first one is based on setting the $\alpha$ level and finding $-\mbox{VaR}_{1-\alpha}$. Let’s assume $\alpha=0.01$; then the following piece of code

# ---2. Computation of VaR given alpha
 
alpha = 0.01
 
VaR_CAAS, bardata1 = findvar(retCAAS, alpha=alpha, nbins=200)
VaR_AAPL, bardata2 = findvar(retAAPL, alpha=alpha, nbins=100)
 
cl = 100.*(1-alpha)

aims at the computation of the corresponding numbers, making use of the function:

def findvar(ret, alpha=0.05, nbins=100):
    # Function computes the empirical Value-at-Risk (VaR) for return-series
    #   (ret) defined as NumPy 1D array, given alpha
    # (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
    #
    # compute a normalised histogram (\int H(x)dx = 1)
    #  nbins: number of bins used (recommended nbins>50)
    hist, bins = np.histogram(ret, bins=nbins, density=True)
    wd = np.diff(bins)
    # cumulative sum from -inf to +inf
    cumsum = np.cumsum(hist * wd)
    # find an area of H(x) for computing VaR
    crit = cumsum[cumsum <= alpha]
    n = len(crit)
    # (1-alpha)VaR
    VaR = bins[n]
    # supplementary data of the bar plot
    bardata = hist, n, wd
    return VaR, bardata

Here, we create the histogram with a specified number of bins and $-\mbox{VaR}_{1-\alpha}$ is found in an empirical manner. For many reasons this approach is much better than fitting the Normal distribution to the data and finding VaR based on the integration of that continuous function. It is a well-known fact that such a function would underestimate the probabilities in the far tail of the return distribution. And the game is all about capturing what is going on out there, right?
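To see how strongly a Normal fit can underestimate far-tail probabilities, compare its fitted tail with the empirical one on a heavy-tailed synthetic sample (a sketch, not the CAAS/AAPL data):

import numpy as np
from scipy.stats import norm

np.random.seed(5)
ret = 0.01 * np.random.standard_t(df=3, size=100000)  # heavy-tailed synthetic returns

thr = -0.05                        # a far-tail threshold
mu, sig = norm.fit(ret)            # best Normal fit to the sample
p_norm = norm.cdf(thr, mu, sig)    # tail probability implied by the Normal fit
p_emp = np.mean(ret < thr)         # empirical tail probability

print("Normal fit: Pr(L < -5%%) = %.3f%%" % (100 * p_norm))
print("Empirical : Pr(L < -5%%) = %.3f%%" % (100 * p_emp))

The empirical tail probability comes out noticeably larger than the one implied by the Normal fit, which is exactly the effect discussed above.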

The results of computation we display as follows:

print("%g%% VaR (CAAS) = %.2f%%" % (cl, VaR_CAAS*100.))
print("%g%% VaR (AAPL) = %.2f%%\n" % (cl, VaR_AAPL*100.))

i.e.

99% VaR (CAAS) = -9.19%
99% VaR (AAPL) = -5.83%

In order to gain a good feeling for those numbers we display the left tails of both return distributions

# plotting (figure #3)
#  histograms of daily returns; H(x)
#
plt.figure(num=3, figsize=(10, 6))
c = (.7,.7,.7)  # grey color (RGB)
#
# CAAS
ax = plt.subplot(211)
hist1, bins1 = np.histogram(retCAAS, bins=200, density=False)
widths = np.diff(bins1)
b = plt.bar(bins1[:-1], hist1, widths, color=c, edgecolor="k", label="CAAS")
plt.legend(loc=(0.02, 0.8))
#
# mark in red all histogram values where int_{-infty}^{VaR} H(x)dx = alpha
hn, nb, _ = bardata1
for i in range(nb):
    b[i].set_color('r')
    b[i].set_edgecolor('k')
plt.text(-0.225, 30, "VaR$_{%.0f}$ (CAAS) = %.2f%%" % (cl, VaR_CAAS*100.))
plt.xlim([-0.25, 0])
plt.ylim([0, 50])
#
# AAPL
ax2 = plt.subplot(212)
hist2, bins2 = np.histogram(retAAPL, bins=100, density=False)
widths = np.diff(bins2)
b = plt.bar(bins2[:-1], hist2, widths, color=c, edgecolor="k", label="AAPL")
plt.legend(loc=(0.02, 0.8))
#
# mark in red all histogram bars where int_{-infty}^{VaR} H(x)dx = alpha
hn, nb, wd = bardata2
for i in range(nb):
    b[i].set_color('r')
    b[i].set_edgecolor('k')
plt.text(-0.225, 30, "VaR$_{%.0f}$ (AAPL) = %.2f%%" % (cl, VaR_AAPL*100.))
plt.xlim([-0.25, 0])
plt.ylim([0, 50])
plt.xlabel("Stock Returns (left tail only)")
plt.show()

where we mark our $\alpha=$1% regions in red:
heavy3

As you may notice, so far, we haven’t done much new. A classical textbook example coded in Python. However, the last figure reveals the main players of the game. For instance, there is only 1 event of a daily loss larger than 15% for AAPL while CAAS experienced 4 heavy losses. The much higher 99% VaR for CAAS takes those 4 historical events into account and puts more weight on the 1-day VaR as estimated in our calculation.

Imagine now that we monitor all those tail extreme/rare events day by day. It’s not too difficult to notice (based on the inspection of Figure #2) that in case of AAPL, the stock recorded its first serious loss of $L \lt -10$% approximately 650 days since the beginning of our “monitoring” which commenced on May 13, 2005. In contrast, CAAS was much more volatile and you needed to wait only ca. 100 days to record the loss of the same magnitude.

If something has not happened yet, e.g. $L \lt -10$%, a VaR-like measure is a highly inadequate measure of the probabilities of rare events. Once such an event takes place, the estimate of VaR changes (is updated) but then decreases in time until a new rare event occurs. Let’s illustrate it with Python. This is our second VaR strategy: finding $\alpha$ given the threshold for rare events. We write a simple function that does the job for us:

def findalpha(ret, thr=1, nbins=100):
    # Function computes the probability P(X<thr)=alpha given threshold
    #   level (thr) and return-series (NumPy 1D array). X denotes the
    #   returns as a rv and nbins is number of bins used for histogram
    # (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
    #
    # compute normalised histogram (\int H(x)dx=1)
    hist, bins = np.histogram(ret, bins=nbins, density=True)
    # compute a default histogram
    hist1, bins1 = np.histogram(ret, bins=nbins, density=False)
    wd = np.diff(bins1)
    x = np.where(bins1 < thr)
    y = np.where(hist1 != 0)
    z = list(set(x[0]).intersection(set(y[0])))
    crit = np.cumsum(hist[z]*wd[z])
    # find alpha
    try:
        alpha = crit[-1]
    except Exception as e:
        alpha = 0
    # count number of events falling into (-inft, thr] intervals
    nevents = np.sum(hist1[z])
    return alpha, nevents

We call it in our main program:

# ---3. Computation of alpha, given the threshold for rare events
 
thr = -0.10
 
alpha1, ne1 = findalpha(retCAAS, thr=thr, nbins=200)
alpha2, ne2 = findalpha(retAAPL, thr=thr, nbins=200)
 
print("CAAS:")
print("  Pr( L < %.2f%% ) = %.2f%%" % (thr*100., alpha1*100.))
print("  %g historical event(s)" % ne1)
print("AAPL:")
print("  Pr( L < %.2f%% ) = %.2f%%" % (thr*100., alpha2*100.))
print("  %g historical event(s)" % ne2)

which returns the following results:

CAAS:
  Pr( L < -10.00% ) = 0.76%
  19 historical event(s)
AAPL:
  Pr( L < -10.00% ) = 0.12%
  3 historical event(s)

And that’s great, however, all these numbers are given as a summary of 10 years of analysed data (May 13, 2005 to May 13, 2015 in our case). The final picture could be better understood if we could dynamically track in time the changes of both probabilities and the number of rare events. We achieve it by the following not-state-of-the-art code:

# ---4. Mapping alphas for Rare Events
 
alphas = []
nevents = []
for t in range(1,len(retCAAS)-1):
    data = retCAAS[0:t]
    alpha, ne = findalpha(data, thr=thr, nbins=500)
    alphas.append(alpha)
    nevents.append(ne)
 
alphas2 = []
nevents2 = []
for t in range(1,len(retAAPL)-1):
    data = retAAPL[0:t]
    alpha, ne = findalpha(data, thr=thr, nbins=500)
    alphas2.append(alpha)
    nevents2.append(ne)
 
# plotting (figure #4)
#   running probability for rare events
#
plt.figure(num=4, figsize=(10, 6))
ax1 = plt.subplot(211)
plt.plot(np.array(alphas)*100., 'r')
plt.plot(np.array(alphas2)*100.)
plt.ylabel("Pr( L < %.2f%% ) [%%]" % (thr*100.))
plt.axis('tight')
ax2 = plt.subplot(212)
plt.plot(np.array(nevents), 'r', label="CAAS")
plt.plot(np.array(nevents2), label="AAPL")
plt.ylabel("# of Events with L < %.2f%%" % (thr*100.))
plt.axis('tight')
plt.legend(loc="upper left")
plt.xlabel("Trading days 13/05/2005-13/05/2015")
plt.show()

revealing the following picture:
heavy4

It’s a great way of looking at and understanding the far left-tail volatility of the asset under your current investigation. The probability between two rare/extreme events decreases for the obvious reason: along the time axis we include more and more data therefore the return distribution evolves and shifts its mass to the right leaving left tail events less and less probable.
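
A toy illustration (my own) of that effect: with a fixed number of recorded tail events, the empirical probability $\mbox{Pr}(L \lt L_{thr})$ decays roughly as $1/n$ while the sample of $n$ trading days grows:

nevents = 2                       # hypothetical number of tail events recorded so far
for n in (250, 500, 1000, 2000):  # growing number of trading days in the sample
    print("n = %4d   Pr = %.2f%%" % (n, 100.*nevents/float(n)))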

The problem remains: if an asset has never experienced an extreme loss of a given magnitude, the probability of such an event, within our VaR framework, simply remains zero! For example, for losses $L \lt -20$%,
heavy4a
CAAS displays 2 extreme events while AAPL none! Therefore, our estimate of an extremely rare daily loss of -20% (or more) for AAPL stock based on 10 years of data is zero, or undefined, or… completely unsound. Can we do better than this?

2. Classical Conditional Prediction for Rare Events

Let’s consider a case where we want to predict the probability of the asset/stock (CAAS; traded at NASDAQ exchange) falling down more than $-L$%. Previously we have achieved such estimation through the integration of its probability density function. An alternative way is via derivation of the conditional probability, i.e. that CAAS will lose more than $-L$% given that NASDAQ index drops more than $-L$% on the next day.

Formula? Well, Bayes’ formula. That’s all we need:
$$
\mbox{Pr}(B|R) = \frac{ \mbox{Pr}(R|B)\ \mbox{Pr}(B) } { \mbox{Pr}(R) } .
$$
Great, now, the meaning of each term. Let $A$ denote the event of the stock daily return being between $-L$% and 0%. Let $B$ be the event of the stock return being $\lt -L$%. Let $R$ be the event of the NASDAQ daily return being $\lt -L$%. Given that, based on both CAAS and NASDAQ historical time-series, we are able to compute $\mbox{Pr}(A)$, $\mbox{Pr}(B)$, $\mbox{Pr}(R|A)$, $\mbox{Pr}(R|B)$, therefore
$$
\mbox{Pr}(R) = \mbox{Pr}(R|A)\mbox{Pr}(A) + \mbox{Pr}(R|B)\mbox{Pr}(B)
$$ as well. Here, $\mbox{Pr}(R|A)$ would stand for the probability of NASDAQ falling down more than $-L$% given the observation of the CAAS return to be in $(-L;0)$% interval and $\mbox{Pr}(R|B)$ would denote the probability of NASDAQ falling down more than $-L$% given the observation of the CAAS return also dropping more than $-L$% on the same day.

With Bayes’ formula we reverse the engineering, aiming at the answer for $\mbox{Pr}(B|R)$ given all available data. Ready to code it? Awesome. Here we go:

# Predicting Heavy and Extreme Losses in Real-Time for Portfolio Holders
# (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
#
# heavy2.py
 
import pandas.io.data as web
import matplotlib.pyplot as plt
import numpy as np
 
# ---1. Data Processing
 
# fetch and download daily adjusted-close price series for CAAS stock
#  and NASDAQ index using Yahoo! Finance public data provider
'''
caas = web.DataReader("CAAS", data_source='yahoo',
                   start='2005-05-13', end='2015-05-13')['Adj Close']
nasdaq = web.DataReader("^IXIC", data_source='yahoo',
                   start='2005-05-13', end='2015-05-13')['Adj Close']
 
CAAScp = np.array(caas.values)
NASDAQcp = np.array(nasdaq.values)
 
f = file("data2.dat","wb")
np.save(f,CAAScp)
np.save(f,NASDAQcp)
f.close()
'''
 
f = file("data2.dat","rb")
CAAScp = np.load(f)
NASDAQcp = np.load(f)
f.close()
 
# compute the return-series
retCAAS = CAAScp[1:]/CAAScp[:-1]-1
retNASDAQ = NASDAQcp[1:]/NASDAQcp[:-1]-1

The same code as used in Section 1 but instead of AAPL data, we fetch NASDAQ index daily close prices. Let’s plot the return-series:

# plotting (figure #1)
#  return-series for CAAS and NASDAQ index
#
plt.figure(num=2, figsize=(10, 6))
plt.subplot(211)
plt.grid(True)
plt.plot(retCAAS, '-r', label="CAAS")
plt.axis("tight")
plt.ylim([-0.25,0.5])
plt.legend(loc="upper right")
plt.ylabel("CAAS daily returns")
plt.subplot(212)
plt.grid(True)
plt.plot(retNASDAQ, '-', label="NASDAQ")
plt.legend(loc="upper right")
plt.axis("tight")
plt.ylim([-0.10,0.15])
plt.ylabel("NASDAQ daily returns")
plt.xlabel("Trading days 13/05/2005-13/05/2015")
plt.show()

i.e.,
heavy5
where we observe a different trading dynamics for the NASDAQ index between the 750th and 1000th day (as counted from May 13, 2005). Interestingly, the heaviest losses of NASDAQ are not ideally correlated with those of CAAS. That makes this case study more exciting! Please also note that CAAS is not a component of the NASDAQ index. Bayes’ formula is designed around independent events and, within a fair approximation, we may think of CAAS as an asset ticking off that box here.

What remains and takes a lot of caution is the code that “looks at” the data in a desired way. Here is its final form:

# ---2. Computations of Conditional Probabilities for Rare Events
 
# isolate return-series displaying negative returns solely
#  set 1 for time stamps corresponding to positive returns
nretCAAS = np.where(retCAAS < 0, retCAAS, 1)
nretNASDAQ = np.where(retNASDAQ < 0, retNASDAQ, 1)
 
# set threshold for rare events
thr = -0.065
 
# compute the sets of events
A = np.where(nretCAAS < 0, nretCAAS, 1)
A = np.where(A >= thr, A, 1)
B = np.where(nretCAAS < thr, retCAAS, 1)
R = np.where(nretNASDAQ < thr, retNASDAQ, 1)
nA = float(len(A[A != 1]))
nB = float(len(B[B != 1]))
n = float(len(nretCAAS[nretCAAS != 1]))  # n must equal to nA + nB
# (optional)
print(nA, nB, n == (nA + nB))  # check, if True then proceed further
print(len(A), len(B), len(R))
print
 
# compute the probabilities
pA = nA/n
pB = nB/n
 
# compute the conditional probabilities
pRA = np.sum(np.where(R+A < 0, 1, 0))/n
pRB = np.sum(np.where(R+B < 0, 1, 0))/n
 
pR = pRA*pA + pRB*pB
 
# display results
print("Pr(A)\t = %5.5f%%" % (pA*100.))
print("Pr(B)\t = %5.5f%%" % (pB*100.))
print("Pr(R|A)\t = %5.5f%%" % (pRA*100.))
print("Pr(R|B)\t = %5.5f%%" % (pRB*100.))
print("Pr(R)\t = %5.5f%%" % (pR*100.))
 
if(pR>0):
    pBR = pRB*pB/pR
    print("\nPr(B|R)\t = %5.5f%%" % (pBR*100.))
else:
    print("\nPr(B|R) impossible to be determined. Pr(R)=0.")

Python’s NumPy library helps us in a tremendous way with its smartly designed function of where, which we employ when constructing the event sets A, B, and R and again when counting the joint events for the conditional probabilities. First we test a logical condition; if it’s evaluated to True we grab the right data (actual returns), else we return 1. That opens for us a couple of shortcuts in finding the number of specific events and making sure we are still on the right side of the force (the counts nA, nB, and n together with the optional sanity checks).
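
A minimal toy example (my own, not from the original listing) of that trick for a hypothetical 5-element return vector:

import numpy as np

r = np.array([0.02, -0.03, -0.12, 0.01, -0.07])  # hypothetical daily returns
neg = np.where(r < 0, r, 1)           # keep negative returns, mask the rest with 1
rare = np.where(neg < -0.065, r, 1)   # keep only losses below the -6.5% threshold

print(neg)                    # [ 1.   -0.03 -0.12  1.   -0.07]
print(rare)                   # [ 1.    1.   -0.12  1.   -0.07]
print(len(rare[rare != 1]))   # 2 rare events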

As you can see, in the code above we specified our threshold level of $L = -6.5$%. Given that, we derive the probabilities:

(1216.0, 88.0, True)
(2516, 2516, 2516)
 
Pr(A)	 = 93.25153%
Pr(B)	 = 6.74847%
Pr(R|A)	 = 0.07669%
Pr(R|B)	 = 0.15337%
Pr(R)	 = 0.08186%
 
Pr(B|R)	 = 12.64368%

The level of -6.5% is pretty arbitrary but delivers an interesting finding. Namely, based on 10 years of data there is a 12.6% chance that on May 14, 2015 CAAS will lose more than -6.5% if NASDAQ drops by the same amount. How much exactly? We don’t know. It’s not certain. Only probable, with 12.6% odds.
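
As a quick sanity check, plugging the printed probabilities back into Bayes’ formula reproduces that figure:
$$
\mbox{Pr}(B|R) = \frac{ \mbox{Pr}(R|B)\ \mbox{Pr}(B) }{ \mbox{Pr}(R) } = \frac{0.0015337 \times 0.0674847}{0.0008186} \approx 0.1264 \ \ \mbox{i.e.}\ 12.64\% .
$$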

The outcome of $\mbox{Pr}(R|B)$ which is close to zero may support our assumption that CAAS has a negligible “influence” on NASDAQ itself. On the other side of the rainbow, it’s much more difficult to interpret $\mbox{Pr}(R)$, the probability of an “isolated” rare event (i.e., $L<-6.5$%), since its estimation is solely based on two players in the game: CAAS and NASDAQ. However, when computed for $N\ge 100$ stocks outside the NASDAQ index but traded at NASDAQ, such a distribution of $\mbox{Pr}(R)$'s would be of great value as another alternative estimator for rare events (I should write about it in a separate post). Now, back to business. There is one problem with our conditional estimation of rare event probabilities as outlined within this Section. A quick check for $L = -10$% reveals:

Pr(A)	 = 98.69632%
Pr(B)	 = 1.30368%
Pr(R|A)	 = 0.00000%
Pr(R|B)	 = 0.00000%
Pr(R)	 = 0.00000%
 
Pr(B|R) impossible to be determined. Pr(R)=0.

Recall that $R$ is the event of the NASDAQ daily return being $\lt -L$%. A quick look at the NASDAQ return-series tells us the whole story and answers the question that comes to your mind: why, why, why zero? Well, simply because there was no event in the past 10-year history of the index in which it slid more than -10%. And we are cooked in the water.

A visual representation of that problem can be obtained by executing the following piece of code:

from pyvar import cpr  # non-at-all PEP8 style ;)
 
prob = []
for t in range(2,len(retCAAS)-1):
    ret1 = retCAAS[0:t]
    ret2 = retNASDAQ[0:t]
    pBR, _ = cpr(ret1, ret2, thr=thr)
    prob.append(pBR)
 
# plotting (figure #2)
#   the conditional probability for rare events given threshold
#
plt.figure(num=2, figsize=(10, 6))
plt.plot(np.array(prob)*100., 'r')
plt.ylabel("Pr(B|R) [%%]")
plt.axis('tight')
plt.title("L$_{thr}$ = %.2f%%" % (thr*100.))
plt.xlabel("Trading days 13/05/2005-13/05/2015")
plt.show()

where we moved all the Bayes’-based calculations from the previous listing into a function cpr (a part of the pyvar.py local library; see Download section below) and repeated our previous experiment, however, this time, tracking $\mbox{Pr}(B|R)$ changing in time given the threshold of $L=-6.5$%. The resulting plot would be:
heavy7
Before ca. the 800th day the NASDAQ index was lacking any event of $L\lt -6.5$%, therefore $\mbox{Pr}(B|R)$ could not be determined. After that period we got some heavy hits and the conditional probability could be derived. The end value (as for May 13, 2015) is 12.64%.

Hope amongst Hopelessness?

There is no better title to summarise the outcomes we derived. Our goal was to come up with some (running) probability for very rare event(s) that, in our case, would be expressed as a 1-day loss of $-L$% in trading for a specified financial asset. In the first attempt we engaged the VaR-related framework and the estimation of occurrence of the rare event based on the PDF integration. In our second attempt we sought an alternative solution making use of Bayes’ conditional prediction of the asset’s loss of $-L$% (or more) given the observation of the same event somewhere else (the NASDAQ index). For the former we ended up with a running probability of $\mbox{Pr}(L \le L_{thr})$ while for the latter with $\mbox{Pr}(B|R)$ given $L_{thr}$ for both B and R events.

Assuming $L_{thr}$ to be equal to -6.5%, we can present the corresponding computations in the following chart:
heavy8

Well, the situation looks hopeless. It doesn’t get any better even if, as discussed earlier, the probability $\mbox{Pr}(R)$ is added to the picture. Both methodologies seem to deliver the answer to the same question, however, somehow, we are left confused, disappointed, misled, in tasteless despair…

Stay tuned as in Part 2 we will see the light at the end of the tunnel. I promise.

DOWNLOAD
    heavy1.py, heavy2.py, pyvar.py

RELATED POSTS
    VaR and Expected Shortfall vs. Black Swan
    Extreme VaR for Portfolio Managers

Hacking Google Finance in Real-Time for Algorithmic Traders. (2) Pre-Market Trading.

Featured in: Data Science Weekly Newsletter, Issue 76 (May 7, 2015)

It has been over a year since I posted the Hacking Google Finance in Real-Time for Algorithmic Traders article. Surprisingly, it became the number one URL of QaR that Google has been displaying in response to various queries and the number two most frequently read post. Thank You! It’s my pleasure to provide quality content covering interesting topics that I find potentially useful.


You may be surprised how fast Python solutions have moved forward, facilitating the lives of quants and algo traders. For instance, yesterday, haphazardly, I found a code that seems to work equally well as compared to my first version and, in fact, is more flexible in the data content that can be retrieved. The idea stays the same as previously, however, our goal this time is to monitor changes of stock prices provided by Google Finance in real-time before the market opens.

Constructing Pre-Market Price-Series

The pre-market trading session typically occurs between 8:00am and 9:30am EDT each trading day though for some stocks we often observe frequent movements much earlier, e.g. at 6:00am. Many investors and traders watch the pre-market trading activity to judge the strength and direction of the market in anticipation for the regular trading session. Pre-market trading activity generally has limited volume and liquidity, and therefore, large bid-ask spreads are common. Many retail brokers offer pre-market trading, but may limit the types of orders that can be used during the pre-market period$^1$.

In Google Finance the stock price in pre-market is usually displayed right beneath the ticker, for example:

AAPLpm

The price of the stock (here: AAPL) varies depending on interest, good/bad news, etc.

In Python we can fetch those changes (I adopt a code found on the Web) in the following way:

import urllib2  # works fine with Python 2.7.9 (not 3.4.+)
import json
import time
 
def fetchPreMarket(symbol, exchange):
    link = "http://finance.google.com/finance/info?client=ig&q="
    url = link+"%s:%s" % (exchange, symbol)
    u = urllib2.urlopen(url)
    content = u.read()
    data = json.loads(content[3:])
    info = data[0]
    t = str(info["elt"])    # time stamp
    l = float(info["l"])    # close price (previous trading day)
    p = float(info["el"])   # stock price in pre-market (after-hours)
    return (t,l,p)
 
 
p0 = 0
while True:
    t, l, p = fetchPreMarket("AAPL","NASDAQ")
    if(p!=p0):
        p0 = p
        print("%s\t%.2f\t%.2f\t%+.2f\t%+.2f%%" % (t, l, p, p-l,
                                                 (p/l-1)*100.))
    time.sleep(60)

In this code we query Google every 60 seconds for an update of the pre-market price (the el field extracted inside fetchPreMarket). What we retrieve is a JSON file of the form:

// [
{
"id": "22144"
,"t" : "AAPL"
,"e" : "NASDAQ"
,"l" : "125.80"
,"l_fix" : "125.80"
,"l_cur" : "125.80"
,"s": "1"
,"ltt":"4:02PM EDT"
,"lt" : "May 5, 4:02PM EDT"
,"lt_dts" : "2015-05-05T16:02:28Z"
,"c" : "-2.90"
,"c_fix" : "-2.90"
,"cp" : "-2.25"
,"cp_fix" : "-2.25"
,"ccol" : "chr"
,"pcls_fix" : "128.7"
,"el": "126.10"
,"el_fix": "126.10"
,"el_cur": "126.10"
,"elt" : "May 6, 6:35AM EDT"
,"ec" : "+0.30"
,"ec_fix" : "0.30"
,"ecp" : "0.24"
,"ecp_fix" : "0.24"
,"eccol" : "chg"
,"div" : "0.52"
,"yld" : "1.65"
,"eo" : ""
,"delay": ""
,"op" : "128.15"
,"hi" : "128.45"
,"lo" : "125.78"
,"vo" : "21,812.00"
,"avvo" : "46.81M"
,"hi52" : "134.54"
,"lo52" : "82.90"
,"mc" : "741.44B"
,"pe" : "15.55"
,"fwpe" : ""
,"beta" : "0.84"
,"eps" : "8.09"
,"shares" : "5.76B"
,"inst_own" : "62%"
,"name" : "Apple Inc."
,"type" : "Company"
}
]

You can retrieve it yourself by executing the following query in a browser:

http://www.google.com/finance/info?infotype=infoquoteall&q=NASDAQ:AAPL

Some of this information you can easily decipher. For our task we need only: el (the asset price in pre-market or after-hours trading, a.k.a. extended hours trading); elt (the corresponding time stamp); and l (the most recent last price). This is exactly what our fetchPreMarket function extracts for us as the variables t, l, and p. Nice and smooth.

When executed before 9.30am EDT (here for NASDAQ:AAPL), we may construct the pre-market price-series every time the price changes:

May 6, 6:35AM EDT	125.80	126.18	+0.38	+0.30%
May 6, 6:42AM EDT	125.80	126.21	+0.41	+0.33%
May 6, 6:45AM EDT	125.80	126.16	+0.36	+0.29%
May 6, 6:46AM EDT	125.80	126.18	+0.38	+0.30%
May 6, 6:49AM EDT	125.80	126.10	+0.30	+0.24%
May 6, 6:51AM EDT	125.80	126.20	+0.40	+0.32%
May 6, 6:57AM EDT	125.80	126.13	+0.33	+0.26%
May 6, 7:00AM EDT	125.80	126.20	+0.40	+0.32%
May 6, 7:01AM EDT	125.80	126.13	+0.33	+0.26%
May 6, 7:07AM EDT	125.80	126.18	+0.38	+0.30%
May 6, 7:09AM EDT	125.80	126.20	+0.40	+0.32%
May 6, 7:10AM EDT	125.80	126.19	+0.39	+0.31%
May 6, 7:10AM EDT	125.80	126.22	+0.42	+0.33%
May 6, 7:12AM EDT	125.80	126.20	+0.40	+0.32%
May 6, 7:22AM EDT	125.80	126.27	+0.47	+0.37%
May 6, 7:28AM EDT	125.80	126.24	+0.44	+0.35%
...
May 6, 9:02AM EDT	125.80	126.69	+0.89	+0.71%
May 6, 9:03AM EDT	125.80	126.71	+0.91	+0.72%
May 6, 9:04AM EDT	125.80	126.73	+0.93	+0.74%
May 6, 9:08AM EDT	125.80	126.67	+0.87	+0.69%
May 6, 9:09AM EDT	125.80	126.69	+0.89	+0.71%
May 6, 9:10AM EDT	125.80	126.68	+0.88	+0.70%
May 6, 9:13AM EDT	125.80	126.67	+0.87	+0.69%
May 6, 9:14AM EDT	125.80	126.72	+0.92	+0.73%
May 6, 9:16AM EDT	125.80	126.74	+0.94	+0.75%
May 6, 9:17AM EDT	125.80	126.72	+0.92	+0.73%
May 6, 9:18AM EDT	125.80	126.70	+0.90	+0.72%
May 6, 9:19AM EDT	125.80	126.73	+0.93	+0.74%
May 6, 9:20AM EDT	125.80	126.75	+0.95	+0.76%
May 6, 9:21AM EDT	125.80	126.74	+0.94	+0.75%
May 6, 9:21AM EDT	125.80	126.79	+0.99	+0.79% (*)
May 6, 9:23AM EDT	125.80	126.78	+0.98	+0.78%
May 6, 9:24AM EDT	125.80	126.71	+0.91	+0.72%
May 6, 9:25AM EDT	125.80	126.73	+0.93	+0.74%
May 6, 9:26AM EDT	125.80	126.75	+0.95	+0.76%
May 6, 9:27AM EDT	125.80	126.70	+0.90	+0.72%
May 6, 9:28AM EDT	125.80	126.75	+0.95	+0.76%
May 6, 9:29AM EDT	125.80	126.79	+0.99	+0.79%

Since prices in the pre-market tend to vary slowly, a 60-second time interval is sufficient to keep our eye on the stock. You can compare a live result retrieved using our Python code at 9:21am (*) with the above screenshot I took at the same time.

A simple joy of Python in action. Enjoy!

HOMEWORK
     1. The code fails after 9.30am EDT (NYC time). Modify it to catch this exception.
     2. Modify the code (or write a new function) that works after 9.30am EDT.
     3. It is possible to get $N\gt1$ quotes for $N$ stocks by calling, for example:
            NASDAQ:AAPL,NYSE:JNJ,… inside the query URL built in fetchPreMarket. Modify the program
            to fetch pre-market time-series, $x_i(t)$ $(i=1,…,N)$, for an $N$-asset portfolio.
            Given that, compute the fractional root-mean-square volatility, $\sigma_{x_i(t)}/\langle x_i(t) \rangle$,
            i.e. the standard deviation divided by the mean, between 6am and 9.30am EDT
            for each asset and check whether you can use it as an indicator for the stock price
            movement after 9.30am (see the sketch after this list). Tip: the higher the frms, the
            more trading is expected in the first 15 min of a new session at Wall Street.
     4. Modify the code to monitor after-hours trading till 4.30pm.
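
For point 3, a minimal sketch (my own, and only a sketch) of the fractional root-mean-square volatility, under the assumption that the pre-market quotes of a single asset have already been collected into a NumPy array x (the sample values below are taken from the AAPL listing above):

import numpy as np

# hypothetical pre-market quotes gathered between 6am and 9.30am EDT
x = np.array([126.18, 126.21, 126.16, 126.18, 126.10, 126.20, 126.13])

frms = np.std(x)/np.mean(x)   # standard deviation divided by the mean
print("frms = %.5f%%" % (frms*100.))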

RELATED POSTS
    Hacking Google Finance in Real-Time for Algorithmic Traders

FURTHER READING
    Chenoweth, M., 2011, Downloading Google Intraday historical data with Python
    NetworkError.org, 2013, Google’s Undocumented Finance API

REFERENCES
    $^1$Pre-Market, Investopedia, http://www.investopedia.com/terms/p/premarket.asp

How to Find a Company Name given a Stock Ticker Symbol utilising Quandl API

Quandl.com offers an easy solution to that task. Their WIKI database contains 3339 major stock tickers and corresponding company names. They can be fetched via the secwiki_tickers.csv file. For a plain text file, portfolio.lst, storing the list of your tickers, for example:

AAPL
IBM
JNJ
MSFT
TXN

you can scan the .csv file for the company name coding in Python:

# How to Find a Company Name given a Stock Ticker Symbol utilising Quandl API
# (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
 
import pandas as pd
 
df = pd.read_csv('secwiki_tickers.csv')
dp = pd.read_csv('portfolio.lst',names=['pTicker'])
 
pTickers = dp.pTicker.values  # extracts the tickers as a NumPy array
 
tmpTickers = []
 
for i in range(len(pTickers)):
    test = df[df.Ticker==pTickers[i]]
    if not (test.empty):
        print("%-10s%s" % (pTickers[i], list(test.Name.values)[0]))

which returns:

AAPL      Apple Inc.
IBM       International Business Machines Corporation
JNJ       Johnson & Johnson
MSFT      Microsoft Corporation
TXN       Texas Instruments Inc.

Please note that it is possible to combine more stocks from other Quandl resources in one file. For more information see Quandl.com‘s documentation online.

I will dedicate a separate chapter to hacking financial websites in my next book, Python for Quants. Volume II., in mid-2015.

Fast Walsh–Hadamard Transform in Python


I felt a bit unsatisfied after my last post on Walsh–Hadamard Transform and Tests for Randomness of Financial Return-Series, leaving you all with a slow version of the Walsh–Hadamard Transform (WHT). Someone wise once said: in order to become a champion, you need to fight one round longer. So here I go, one more time, on WHT in Python. Please excuse me or learn from it. The choice is yours.

This post is not only a great lesson on how one can convert a slow algorithm into its faster version, but also on how to time it and take benefit of its optimisation for speed. Let’s start from the beginning.

1. From Slow WHT to Fast WHT

In the abovementioned post we outlined that the Slow Walsh-Hadamard Transform of any signal $x(t)$ of length $n=2^M$, where $M\in\mathbb{Z}^+$ and $M\ge 2$, may be derived as:
$$
WHT_n = \mathbf{x} \bigotimes_{i=1}^M \mathbf{H_2}
$$ where $\mathbf{H_2}$ is Hadamard matrix of order 2 and signal $x(t)$ is discrete and real-valued.

The Kronecker product of two matrices plays a key role in the definition of WH matrices. Thus, the Kronecker product of $\mathbf{A}$ and $\mathbf{B}$, where the former is a square matrix of order $n$ and the latter is of order $m$, is the square matrix $\mathbf{C}$ such that:
$$
\mathbf{C} = \mathbf{A} \bigotimes \mathbf{B}
$$ Fino & Algazi (1976) showed that if $\mathbf{A}$ and $\mathbf{B}$ are unitary matrices then $\mathbf{C}$ is also unitary and can be computed by the fast algorithm shown below:

fwht

which can be computed in place. The technical details are beyond the scope of this post, however the paper of Fino & Algazi (1976) is a good place to start your research on the Fast Walsh-Hadamard Transform algorithm.

The algorithm requires a bit-reversal permutation. This tricky subject has been covered by Ryan Compton in his post entitled Bit-Reversal Permutation in Python. For me, it’s a gateway (or a shortcut) to convert Matlab’s function of bitrevorder, which permutes data into bit-reversed order. Therefore, for any Python list we obtain its bit-reversed order making use of Ryan’s functions, namely:

def bit_reverse_traverse(a):
    # (c) 2014 Ryan Compton
    # ryancompton.net/2014/06/05/bit-reversal-permutation-in-python/
    n = a.shape[0]
    assert(not n&(n-1) ) # assert that n is a power of 2
    if n == 1:
        yield a[0]
    else:
        even_index = np.arange(n/2)*2
        odd_index = np.arange(n/2)*2 + 1
        for even in bit_reverse_traverse(a[even_index]):
            yield even
        for odd in bit_reverse_traverse(a[odd_index]):
            yield odd
 
def get_bit_reversed_list(l):
    # (c) 2014 Ryan Compton
    # ryancompton.net/2014/06/05/bit-reversal-permutation-in-python/
    n = len(l)
    indexs = np.arange(n)
    b = []
    for i in bit_reverse_traverse(indexs):
        b.append(l[i])
    return b

that can be called for any Python’s list or 1D NumPy object as follows:

from random import randrange, seed
from WalshHadamard import *
 
seed(2015)
 
X=[randrange(-1,2,2) for i in range(2**3)]
print("X = ")
print(X)
 
x=get_bit_reversed_list(X)
x=np.array(x)
print("\nBit Reversed Order of X = ")
print(x)

which returns:

X = 
[1, 1, 1, -1, 1, -1, 1, -1]
 
Bit Reversed Order of X = 
[ 1  1  1  1  1 -1 -1 -1]

i.e. exactly the same output as we obtain by applying Matlab’s bitrevorder(X) function to the same signal $X$, namely:

>> X=[1, 1, 1, -1, 1, -1, 1, -1]
X =
     1     1     1    -1     1    -1     1    -1
 
>> bitrevorder(X)
ans =
     1     1     1     1     1    -1    -1    -1

2. Fast Walsh–Hadamard Transform in Python

Given the above, we get the Fast WHT by adopting a Python version of the Mathworks’ algorithm and making use of Ryan’s bit-reversed ordering, for any real-valued discrete signal $x(t)$, as follows:

def FWHT(X):
    # Fast Walsh-Hadamard Transform for 1D signals
    # of length n=2^M only (non error-proof for now)
    x=get_bit_reversed_list(X)
    x=np.array(x)
    N=len(X)
 
    for i in range(0,N,2):
        x[i]=x[i]+x[i+1]
        x[i+1]=x[i]-2*x[i+1]
 
    L=1
    y=np.zeros_like(x)
    for n in range(2,int(log(N,2))+1):
        M=2**L
        J=0; K=0
        while(K<N):
            for j in range(J,J+M,2):
                y[K]   = x[j]   + x[j+M]
                y[K+1] = x[j]   - x[j+M]
                y[K+2] = x[j+1] + x[j+1+M]
                y[K+3] = x[j+1] - x[j+1+M]
                K=K+4
            J=J+2*M
        x=y.copy()
        L=L+1
 
    y=x/float(N)
    return y   # normalised Fast WHT of X

where an exemplary call of this function in your Python main program could look like this:

from random import randrange, seed
from WalshHadamard import *
 
seed(2015)
 
X=[randrange(-1,2,2) for i in range(2**3)]
 
print("X = ")
print(X)
print("Slow WHT(X) = ")
print(WHT(X)[0])
print("Fast WHT(X) = ")
print(FWHT(X))

returning the output, e.g.:

X = 
[1, 1, 1, -1, 1, -1, 1, -1]
Slow WHT(X) = 
[ 0.25  0.75  0.25 -0.25  0.25 -0.25  0.25 -0.25]
Fast WHT(X) = 
[ 0.25  0.75  0.25 -0.25  0.25 -0.25  0.25 -0.25]

3. Slow vs. Fast Walsh-Hadamard Transform in Python

When it comes to making speed comparisons I always feel uncomfortable due to one major factor: the machine I use to run the test. And since I do not have a better one, I use what I’ve got: my Apple MacBook Pro 2013, with 2.6 GHz Intel Core i7, and 16 GB 1600 MHz DDR3. That’s it. Theodore Roosevelt once said: Do what you can, with what you have, where you are. So, that’s the best what I have where I am.

Let’s design a simple test of the performance of both Fast and Slow WHT in Python as defined above. We will be interested in the computation times of both transforms for a variable length of the discrete signal $X(t)$ (kept identical on every run thanks to Python’s seed function, which freezes the random signal) defined as $n=2^M$ for $M$ in the interval $[4,16]$ as an example.

First we will plot a physical time it takes to get both transforms followed by the graph presenting the speedup we gain by employing Fast WHT. The main code that executes our thought process looks as follows:

from random import randrange, seed
from WalshHadamard import *
import time
 
maxM=16
 
m=[]; s=[]; f=[]
for M in range(4,maxM+1):
    shwt=fwht=t0=fast=slow=0
    # generate random binary (-1,1) signal X
    # of length n=2**M
    seed(2015)
    X=[randrange(-1,2,2) for i in range(2**M)]
    # compute Slow WHT for X
    t0=time.time()
    shwt,_,_=WHT(X)
    slow=time.time()-t0 # time required to get SWHT
    s.append(slow)
    del shwt, slow, t0
    # compute Fast WHT for X
    t0=time.time()
    fwht=FWHT(X)
    fast=time.time()-t0 # time required to get FWHT
    m.append(M)
    f.append(fast)
    del fwht, fast, t0
 
speedup=np.array(s)/np.array(f)
 
f1=plt.figure(1)
plt.plot(m,s,"ro-", label='Slow WHT')
plt.hold(True)
plt.plot(m,f,"bo-", label='Fast WHT')
plt.legend(loc="upper left")
ax = plt.gca()
ax.set_yscale("log")
plt.xlabel("Signal length order, M")
plt.ylabel("Computation time [sec]")
 
f2=plt.figure(2)
plt.plot(m,speedup,'go-')
plt.xlabel("Signal length order, M")
plt.ylabel("Speedup (x times)")
plt.hold(True)
plt.plot((4,M),(1,1),"--k")
plt.xlim(4,M)
 
plt.show()

where we obtain:
figure_1
and the speedup plot:
figure_2
From the graph we notice an immediate benefit of Fast WHT when it comes to much much longer signals. Simplicity of complexity in action. Piece of cake, a walk in a park.

Fast & Furious 8? Well, QuantAtRisk.com delivers you the trailer before the official trailer. Enjoy! But only if you feel the need… the need for speed!

DOWNLOAD
     WalshHadamard.py

REFERENCES
     Fino, B. J., Algazi, V. R., 1976, Unified Matrix Treatment of the Fast Walsh-Hadamard Transform, IEEE Transactions on Computers (pdf)

Walsh–Hadamard Transform and Tests for Randomness of Financial Return-Series

Randomness. A magic that happens behind the scene. An incomprehensible little black box that does the job for us. Quants. Many use it every day without thinking, without considering how those beautifully uniformly distributed numbers are drawn?! Why so fast? Why so random? Is randomness a byproduct of chaos and order tamed somehow? How much trust can we place in this seemingly minor command of random() as a piece of our code?


Randomness Built-in. Not only is that the name of a chapter in my latest book, but it’s also the main reason I’m writing this article: wandering sideways around the subject led me to new, undiscovered territories! In general the output in the form of a random number comes from a dedicated function. To be perfectly honest, it is like a refined gentleman, of sophisticated quality, recognised by its efficiency when a great number of draws is requested. The numbers are not truly random. They are pseudo-random. That means the devil resides in the details. A typical pseudo-random number generator (PRNG) is designed for speed but defined by the underlying algorithm. In most computer languages the Mersenne Twister, developed by 松本 眞 (Makoto Matsumoto) and 西村 拓士 (Takuji Nishimura) in 1997, has become a standard. As one might suspect, an algorithm must repeat itself and, in case of our Japanese hero, its period is superbly long, i.e. $2^{19937}-1$. A reliable implementation in C or C++ guarantees enormous speedups in pseudo-random number generation. Interestingly, the Mersenne Twister is not perfect. Its use has been discouraged when it comes to obtaining cryptographic random numbers. Can you imagine that?! But that’s another story. Ideal for a book chapter indeed!

In this post, I dare to present the very first, meaningful, and practical application of the Walsh–Hadamard Transform (WHT) in quantitative finance. Remarkably, this tool, of marginal use in digital signal processing, has been shown to serve as a great facility for testing any binary sequence for statistically significant randomness!

Within the following sections we introduce the transform to finance. After Oprina et al. (2009) we encode the WHT Significance Tests in Python (see Download section) and verify the hypothesis of randomness of sequences generated by the Mersenne Twister PRNG against the alternative one at very high significance levels of $\alpha\le 0.00001$. Next, we show a practical application of the WHT framework in the search for randomness in any financial time-series by the example of 30-min FX return-series, and compare them to the results obtained from the PRNG available in Python by default. Curious about the findings? Welcome to the world of magic!

1. Walsh–Hadamard Transform And Binary Sequences

I wish I had for you this great opening story on how Jacques Hadamard and Joseph L. Walsh teamed up with Jack Daniels on one Friday night in a corner pub somewhere in San Francisco, arriving at a memorable breakthrough in the theory of numbers. I wish I had! Well, not this time. However, as a make-up, below I present in a nutshell the theoretical part on WHT I’ve been studying for the past three weeks and found truly mind-blowing because of its simplicity.

Let’s consider any real-valued discrete signal $X(t_i)$ where $i=0,1,…,N-1$. Its trimmed version, $x(t_i)$, of the total length of $n=2^M$ such that $2^M\le(N-1)$ and $M\in\mathbb{Z}^+$ at $M\ge 2$ is considered as an input signal for the Walsh–Hadamard Transform, the latter defined as:
$$
WHT_n = \mathbf{x} \bigotimes_{i=1}^M \mathbf{H_2}
$$ where the Hadamard matrix of order $n=2^M$ is obtainable recursively by the formula:
$$
\mathbf{H_{2^M}} =
\begin{pmatrix}
H_{2^{M-1}} & H_{2^{M-1}} \\
H_{2^{M-1}} & -H_{2^{M-1}}
\end{pmatrix}
\ \ \ \ \ \text{therefore} \ \ \ \ \ \mathbf{H_2} =
\begin{pmatrix}
1 & 1 \\
1 & -1
\end{pmatrix}
$$ and $\otimes$ denotes the Kronecker product between two matrices. Given that, $WHT_n$ is the dot product between the signal (a 1D array; vector) and the result of Kronecker-multiplying $\mathbf{H_2}$ by itself $M$ times (Johnson & Puschel 2000).
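
A minimal sketch (my own illustration) of that construction in NumPy, verifying that the $M$-fold Kronecker product of $\mathbf{H_2}$ reproduces the Hadamard matrix of order $2^M$ returned by SciPy, and that the dot product with a $\pm 1$ signal delivers its (slow) WHT:

import numpy as np
from scipy.linalg import hadamard

H2 = np.array([[1, 1], [1, -1]])

M = 3
H = H2
for _ in range(M-1):
    H = np.kron(H, H2)       # H_{2^M} built recursively via Kronecker products

print(np.array_equal(H, hadamard(2**M)))   # True

x = np.array([1, 1, 1, -1, 1, -1, 1, -1])  # an exemplary +/-1 signal of length 2^M
print(np.dot(H, x)/2.**M)    # [ 0.25  0.75  0.25 -0.25  0.25 -0.25  0.25 -0.25]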

1.1. Walsh Functions

The Walsh-Hadamard transform uses the orthogonal square-wave functions, $w_j(x)$, introduced by Walsh (1923), which have only two values $\pm 1$ in the interval $0\le x\lt 1$ and the value zero elsewhere. The original definition of the Walsh functions is based on the following recursive equations:
$$
w_{2j}(x) = w_j(2x)+(-1)^j w_j (2x -1) \quad \mbox{for} \ \ j=1,2,… \\
w_{2j-1}(x) = w_{j-1}(2x)-(-1)^{j-1} w_{j-1} (2x -1) \quad \mbox{for} \ \ j=1,2,…
$$ with the initial condition of $w_0(x)= 1$. You can meet different orderings of the Walsh functions in the literature but in general they correspond to the ordering of the orthogonal harmonic functions $h_j(x)$, which are defined as:
$$
h_{2j}(x) = \cos(2\pi j x)\quad \mbox{for} \ \ j=0,1,… \\
h_{2j-1}(x) = \sin(2\pi j x)\quad \mbox{for} \ \ j=1,2,…
$$ on the interval $0\le x\lt 1$, and have zero value for all other values of $x$ outside this interval. A comparison of both function classes looks as follows:

hwfunctions

where the first eight harmonic functions and Walsh functions are given in the left and right panels, respectively. Walsh functions with even and odd orders are called the cal and sal functions, respectively, and they correspond to the cosine and sine functions in Fourier analysis (Aittokallio et al. 2001).
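
A minimal sketch (my own, not from the original post) that samples the recursive definition above on a dyadic grid; for $M=3$ the rows it prints coincide with the rows of the Walsh matrix derived in the next subsection:

import numpy as np

def walsh(j, x):
    # w_j(x) from the recursion above; non-zero only on [0,1)
    if x < 0.0 or x >= 1.0:
        return 0.0
    if j == 0:
        return 1.0
    if j % 2 == 0:    # w_{2k}(x) = w_k(2x) + (-1)^k w_k(2x-1)
        k = j//2
        return walsh(k, 2*x) + (-1)**k * walsh(k, 2*x-1)
    else:             # w_{2k-1}(x) = w_{k-1}(2x) - (-1)^(k-1) w_{k-1}(2x-1)
        k = (j+1)//2
        return walsh(k-1, 2*x) - (-1)**(k-1) * walsh(k-1, 2*x-1)

xgrid = np.arange(8)/8.0      # dyadic sampling of [0,1)
for j in range(8):
    print(j, [int(walsh(j, xi)) for xi in xgrid])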

1.2. From Hadamard to Walsh Matrix

In Python the Hadamard matrix of order $2^M$ is obtainable as a NumPy 2D array making use of SciPy module as follows:

from scipy.linalg import hadamard
M=3; n=2**M
H=hadamard(n)
print(H)

which in this case returns:

[[ 1  1  1  1  1  1  1  1]
 [ 1 -1  1 -1  1 -1  1 -1]
 [ 1  1 -1 -1  1  1 -1 -1]
 [ 1 -1 -1  1  1 -1 -1  1]
 [ 1  1  1  1 -1 -1 -1 -1]
 [ 1 -1  1 -1 -1  1 -1  1]
 [ 1  1 -1 -1 -1 -1  1  1]
 [ 1 -1 -1  1 -1  1  1 -1]]

Each row of the matrix corresponds to a Walsh function. However, the ordering is different (Hadamard ordering). Therefore, to be able to “see” the shape of the Walsh functions as presented in the figure above, we need to rearrange their indexing. The resultant matrix is known as the Walsh matrix. We derive it in Python:

def Hadamard2Walsh(n):
    # Function computes both Hadamard and Walsh Matrices of n=2^M order
    # (c) 2015 QuantAtRisk.com, coded by Pawel Lachowicz, adopted after
    # au.mathworks.com/help/signal/examples/discrete-walsh-hadamard-transform.html
    import numpy as np
    from scipy.linalg import hadamard
    from math import log
 
    hadamardMatrix=hadamard(n)
    HadIdx = np.arange(n)
    M = log(n,2)+1
 
    for i in HadIdx:
        s=format(i, '#032b')
        s=s[::-1]; s=s[:-2]; s=list(s)
        x=[int(x) for x in s]
        x=np.array(x)
        if(i==0):
            binHadIdx=x
        else:
            binHadIdx=np.vstack((binHadIdx,x))
 
    binSeqIdx = np.zeros((n,M)).T
 
    for k in reversed(xrange(1,int(M))):
        tmp=np.bitwise_xor(binHadIdx.T[k],binHadIdx.T[k-1])
        binSeqIdx[k]=tmp
 
    tmp=np.power(2,np.arange(M)[::-1])
    tmp=tmp.T
    SeqIdx = np.dot(binSeqIdx.T,tmp)
 
    j=1
    for i in SeqIdx:
        if(j==1):
            walshMatrix=hadamardMatrix[i]
        else:
            walshMatrix=np.vstack((walshMatrix,hadamardMatrix[i]))
        j+=1
 
    return (hadamardMatrix,walshMatrix)

Therefore, by calling the function in an exemplary main program:

from WalshHadamard import Hadamard2Walsh
import matplotlib.pyplot as plt
import numpy as np
 
M=3; n=2**M
(H,W)=Hadamard2Walsh(n)
# display Hadamard matrix followed by Walsh matrix (n=8)
print(H); print;
print(W)
 
# a visual comparison of Walsh function (j=2)
M=3;
n=2**M
_,W=Hadamard2Walsh(n)
plt.subplot(2,1,1)
plt.step(np.arange(n).tolist(),W[2],where="post")
plt.xlim(0,n)
plt.ylim(-1.1,1.1); plt.ylabel("order M=3")
 
M=8;
n=2**M
_,W=Hadamard2Walsh(n)
plt.subplot(2,1,2)
plt.step(np.arange(n).tolist(),W[2],where="post")
plt.xlim(0,n)
plt.ylim(-1.1,1.1); plt.ylabel("order M=8")
 
plt.show()

first, we display Hadamard and Walsh matrices of order $n=2^3$, respectively:

[[ 1  1  1  1  1  1  1  1]
 [ 1 -1  1 -1  1 -1  1 -1]
 [ 1  1 -1 -1  1  1 -1 -1]
 [ 1 -1 -1  1  1 -1 -1  1]
 [ 1  1  1  1 -1 -1 -1 -1]
 [ 1 -1  1 -1 -1  1 -1  1]
 [ 1  1 -1 -1 -1 -1  1  1]
 [ 1 -1 -1  1 -1  1  1 -1]]
 
[[ 1  1  1  1  1  1  1  1]
 [ 1  1  1  1 -1 -1 -1 -1]
 [ 1  1 -1 -1 -1 -1  1  1]
 [ 1  1 -1 -1  1  1 -1 -1]
 [ 1 -1 -1  1  1 -1 -1  1]
 [ 1 -1 -1  1 -1  1  1 -1]
 [ 1 -1  1 -1 -1  1 -1  1]
 [ 1 -1  1 -1  1 -1  1 -1]]

and next we visually verify that the shape of the third Walsh function ($j=2$) is preserved for two different orders, here $M=3$ and $M=8$, respectively:
walsh38

The third possibility is to arrange the Walsh functions in increasing order of their sequencies, i.e. the number of zero-crossings (sequency order). However, we won’t be interested in it this time.

1.3. Signal Transformations

As we have seen in the beginning, the WHT is able to perform a signal transformation for any real-valued time-series. The sole requirement is that the signal ought to be of length $2^M$. Now, it should be obvious why that is so. When you think about WHT for a moment longer you may understand its uniqueness as contrasted with the Fourier transform. Firstly, the waveforms are much simpler. Secondly, the complexity of computation is significantly reduced. Finally, if the input signal is converted from its original form down to only two discrete values $\pm 1$, we end up with a bunch of trivial arithmetical calculations while computing the WHT.

The Walsh-Hadamard transform found its application in medical signal processing, audio/sound processing, signal and image compression, pattern recognition, and cryptography. In the most simplistic cases, one deals with the input signal to be of the binary form, e.g. 01001011010110. If we consider the binary function $f: \mathbb{Z}_2^n \rightarrow \mathbb{Z}_2$ then the following transformation is possible:
$$
\bar{f}(\mathbf{x}) = 1 - 2f(\mathbf{x}) = (-1)^{f(\mathbf{x})}
$$ therefore $\bar{f}: \mathbb{Z}_2^n \rightarrow \{-1,1\}$. What it does is perform the following conversion, for instance:
$$
\{0,1,0,1,1,0,1,0,0,0,1,…\} \rightarrow \{-1,1,-1,1,1,-1,1,-1,-1,-1,1,…\}
$$ of the initial binary time-series. However, what is more interesting for us, when it comes to financial return-series processing, is the transformation:
$$
\bar{f}(\mathbf{x}) = \left\{
\begin{array}{l l}
1 & \ \text{if $f(\mathbf{x})\ge 0$}\\
-1 & \ \text{if $f(\mathbf{x})\lt 0$}
\end{array} \right.
$$ Given that, any return-series in the value interval $[-1,1]$ (real-valued) is transformed to the binary form of $\pm 1$, for example:
$$
\{0.031,0.002,-0.018,-0.025,0.011,…\} \rightarrow \{1,1,-1,-1,1,…\} \ .
$$ This simple signal transformation, in connection with the power of the Walsh-Hadamard Transform, opens new possibilities for analysing the underlying true signal. WHT is all “made of” $\pm 1$ values sleeping in its Hadamard matrices. Coming into close contact with a signal of the same form, this is “where the rubber meets the road” (Durkee, Jacobs, & Meat Loaf 1995).
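
In NumPy that mapping is a one-liner; a minimal sketch (my own) for the exemplary return vector above:

import numpy as np

r = np.array([0.031, 0.002, -0.018, -0.025, 0.011])
print(np.where(r >= 0, 1, -1))   # [ 1  1 -1 -1  1]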

1.4. Discrete Walsh-Hadamard Transform in Python

The Walsh-Hadamard transform is an orthogonal transformation that decomposes a signal into a set of orthogonal, rectangular waveforms (Walsh functions). For any real-valued signal we derive WHT as follows:

def WHT(x):
    # Function computes (slow) Discrete Walsh-Hadamard Transform
    # for any 1D real-valued signal
    # (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
    x=np.array(x)
    if(len(x.shape)<2): # make sure x is 1D array
        if(len(x)>3):   # accept x of min length of 4 elements (M=2)
            # check length of signal, adjust to 2**m
            n=len(x)
            M=trunc(log(n,2))
            x=x[0:2**M]
            h2=np.array([[1,1],[1,-1]])
            for i in xrange(M-1):
                if(i==0):
                    H=np.kron(h2,h2)
                else:
                    H=np.kron(H,h2)
 
            return (np.dot(H,x)/2.**M, x, M)
        else:
            print("HWT(x): Array too short!")
            raise SystemExit
    else:
        print("HWT(x): 1D array expected!")
        raise SystemExit

Despite its simplicity, this solution slows down for signals of length $n\ge 2^{22}$ where, in case of my MacBook Pro, 16 GB of RAM is just not enough! Therefore, the mechanical derivation of the WHT making use of Kronecker products between matrices is often referred to as the Slow Walsh-Hadamard Transform. It is obvious that a Fast WHT exists but its application, for the use of this article (and research), is not required. Why? We will discuss it later on.

We can see our Python function WHT(x) in action by coding, for instance:

from WalshHadamard import WHT
from random import randrange
 
x1=[0.123,-0.345,-0.021,0.054,0.008,0.043,-0.017,-0.036]
wht,_,_=WHT(x1)
print("x1  = %s" % x1)
print("WHT = %s" % wht)
 
x2=[randrange(-1,2,2) for i in xrange(2**4)]
wht,_,_=WHT(x2)
print;
print("x2  = %s" % x2)
print("WHT = %s" % wht)

which returns

x1  = [0.123, -0.345, -0.021, 0.054, 0.008, 0.043, -0.017, -0.036]
WHT = [-0.023875  0.047125 -0.018875  0.061125 -0.023375  0.051125 
       -0.044875 0.074625]
 
x2  = [1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, -1, -1]
WHT = [ 0.375  0.125  0.125 -0.125 -0.125  0.125 -0.375 -0.125  
        0.375  0.125 -0.375  0.375 -0.125  0.125  0.125  0.375]

where we performed the WHT of a real-valued signal (e.g. a financial return-series) $x_1$ and of a random binary sequence $x_2$. Is it correct? You can always verify those results in Matlab (version used here: 2014b) by executing the corresponding function of fwht from Matlab’s Signal Processing Toolbox:

>> x1=[0.123, -0.345, -0.021, 0.054, 0.008, 0.043, -0.017, -0.036]
x1 =
    0.1230  -0.3450  -0.0210  0.0540  0.0080  0.0430  -0.0170  -0.0360
 
>> y1 = fwht(x1,length(x1),'hadamard')
y1 =
   -0.0239    0.0471   -0.0189    0.0611   -0.0234    0.0511   -0.0449    0.0746
 
>> x2=[1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, -1, -1]
x2 =
     1 -1  1  1  1  1  1  1  -1  1  1  -1  1  1  -1  -1
 
>> y2 = fwht(x2,length(x2),'hadamard')'
y2 =
    0.3750
    0.1250
    0.1250
   -0.1250
   -0.1250
    0.1250
   -0.3750
   -0.1250
    0.3750
    0.1250
   -0.3750
    0.3750
   -0.1250
    0.1250
    0.1250
    0.3750

All nice, smooth, and in agreement. The difference between WHT(x) and Matlab’s fwht is that the former trims the signal down while the latter allows for padding with zeros. Just keep that in mind if you employ Matlab in your own research.
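
To make that difference concrete, here is a minimal sketch (my own) for a hypothetical 6-element signal:

import numpy as np

x = np.array([0.1, -0.2, 0.3, 0.05, -0.1, 0.2])      # 6 elements, not a power of 2
M = int(np.floor(np.log2(len(x))))
x_trim = x[:2**M]                                    # trimming, as WHT(x) above does
x_pad = np.append(x, np.zeros(2**(M+1) - len(x)))    # zero-padding, as Matlab allows
print(x_trim)   # [ 0.1  -0.2   0.3   0.05]
print(x_pad)    # [ 0.1  -0.2   0.3   0.05 -0.1   0.2   0.    0.  ]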

2. Random Sequences and Walsh-Hadamard Transform Statistical Test

You may raise a question. Why do we need to introduce WHT at all, transform return-series into binary $\pm 1$ sequences, and what does it have in common with randomness? The answer is straightforward and my idea is simple, namely: we may test any financial return time-series for randomness using our fancy WHT approach. A bit of creativity before sundown.

The WHT of a binary signal returns a unique pattern. If the signal $x(t)$ is of a random nature, we should not expect to find any repeatable sub-strings in it. That is the main motivation standing behind PRNGs: to imitate true randomness met in nature.

In 2009 Oprina et al. (hereinafter Op09) proposed a statistical test based on results derived for any binary $\pm 1$ signal $x(t)$. But they were smart: instead of looking at the signal as a whole, they suggested its analysis chunk by chunk (via signal segmentation into equal-length blocks). The statistical test they designed aims at performing a class of autocorrelation tests with the correlation mask given by the rows of the Hadamard matrix. As a supplement to the methodology presented in NIST Special Publication 800-22 (Rukhin et al. 2010), which specifies 16 independent statistical tests for random and pseudorandom number generators for cryptographic applications, Op09 proposed another two methods, based on confidence intervals, which can detect a more general failure in random and pseudorandom generators. All 5 statistical tests of Op09 form the Walsh-Hadamard Transform Statistical Test Suite and, in what follows, we will encode them to Python, focusing both on statistical meaning and application.

The idea standing behind Op09’s suite was a generalised test suitable for different purposes: randomness testing, cryptographic design, cryptanalysis techniques, and steganographic detection. In this Section, first, we concentrate our effort on seeing how this new test suite works for the standard Mersenne Twister PRNG concealed in Python’s random function class. Next, we move to real-life financial data as an input, where we aim to verify whether any chunk (block) of the time-series can be regarded as of random nature at a given significance level of $\alpha$. If so, which one. If not, well, well, well… why not?! For the latter case, this test opens a new window of opportunities to look for the non-stochastic nature of, e.g., trading signals, and their repeatable properties (if any).

2.1. Signal Pre-Processing

Each test requires some input data. In case of the Op09 tests, we need to provide: a trimmed signal $x(t)$ of the total length $n=2^M$; a chosen sequence (block) size of length $2^m$; a significance level $\alpha$ (rejection level); and a probability $p$ of occurrence of the digit 1. In step 1 we transform (if needed) $x(t_i)$ into a $\pm 1$ sequence as specified in Section 1.3. In step 2 we compute the lower and upper rejection limits of the test, $u_{\alpha/2}$ and $u_{1-\alpha/2}$. In step 3 we compute the number of sequences to be processed, $a=n/2^m$, and split $x(t)$ into $a$ adjacent blocks (sequences).
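
For instance, for $n=2^{10}$ and a block length of $2^m=2^5=32$ we get $a = 2^{10}/2^5 = 32$ adjacent blocks of 32 digits each, which is exactly the split reported by the info summary further below.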

Since it sounds complicated, let’s see how easy it is in fact. We design a function that splits the signal $X(t)$ into $a$ blocks of length $b$:

def xsequences(x):
    x=np.array(x)   # X(t) or x(t)
    if(len(x.shape)<2): # make sure x is 1D array
        if(len(x)>3):   # accept x of min length of 4 elements (M=2)
            # check length of signal, adjust to 2**M if needed
            n=len(x)
            M=trunc(log(n,2))
            x=x[0:2**M]
            a=(2**(M/2))  # a number of adjacent sequences/blocks
            b=2**M/a      # a number of elements in each sequence
            y=np.reshape(x,(a,b))
            #print(y)
            return (y,x,a,b,M)
        else:
            print("xsequences(x): Array too short!")
            raise SystemExit
    else:
        print("xsequences(x): 1D array expected!")
        raise SystemExit

where the conversion of a Python list’s (or NumPy 1D array’s) values into a $\pm 1$ signal is obtained by:

def ret2bin(x):
    # Function converts list/np.ndarray values into +/-1 signal
    # (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
    Y=[]; ok=False
    print(str(type(x)))
    print(x)
    if('numpy' in str(type(x)) and 'ndarray' in str(type(x))):
        x=x.tolist()
        ok=True
    elif('list' in str(type(x))):
        ok=True
    if(ok):
        for y in x:
            if(y<0):
                Y.append(-1)
            else:
                Y.append(1)
        return Y
    else:
        print("Error: neither 1D list nor 1D NumPy ndarray")
        raise SystemExit

To be perfectly informed about what took place we may wish to display a summary of the pre-processed signal as follows:

def info(X,xt,a,b,M):
    line()
    print("Signal\t\tX(t)")
    print("  of length\tn = %d digits" % len(X))
    print("trimmed to\tx(t)")
    print("  of length\tn = %d digits (n=2^%d)" % (a*b,M))
    print("  split into\ta = %d sub-sequences " % a)
    print("\t\tb = %d-digit long" % b)
    print

Now, let’s see how this section of our data processing works in practice. As an example, first, we generate a random signal $X(t)$ of length $N=2^{10}+3$ with values spread in the interval of $[-0.2,0.2]$. That should provide us with some sort of feeling of a financial return time-series (e.g. collected daily based on stock or FX-series trading). Next, employing the function of xsequences, we will trim the signal $X(t)$ down to $x(t)$ of length $n=2^{10}$, first converting the return-series into a $\pm 1$ sequence denoted by $x'(t)$. Finally, we split $x(t)$ into $a$ sub-sequences of length $b$; $x_{\rm seq}$. An exemplary main program executing those steps could be coded as follows:

from WalshHadamard import ret2bin, xsequences, info
from random import uniform, seed
import numpy as np
 
# generate a random signal X(t)
seed(4515)
X=[uniform(-0.2,0.2) for i in range(2**10)]; X=X+[0.12,-0.04,0.01]
Xorg=X
 
# convert its values into +/-1 sequence
X=ret2bin(X)  # x'(t)
 
# split X'(t) into a blocks; save result in xseq 2D array
(xseq,x,a,b,M) = xsequences(X)
 
print("X(t) =")
for i in xrange(0,len(Xorg)):
    print("%10.5f" % Xorg[i])
print("\nx'(t) =")
print(np.array(X))
print("\nx(t) =")
print(x)
print("\nxseq =")
print(xseq)
print
 
info(X,x,a,b,M)

returning:

X(t) =
   0.17496
   0.07144
  -0.15979
   0.11344
   0.08134
  -0.01725
     ...
  -0.16005
   0.01419
  -0.08748
  -0.03218
  -0.07908
  -0.02781
   0.12000
  -0.04000
   0.01000
 
X'(t) =
[ 1  1 -1 ...,  1 -1  1]
 
x(t) =
[ 1  1 -1 ..., -1 -1 -1]
 
xseq =
[[ 1  1 -1 ..., -1 -1  1]
 [-1 -1 -1 ...,  1 -1  1]
 [-1 -1  1 ..., -1  1  1]
 ..., 
 [-1  1  1 ...,  1 -1  1]
 [-1  1  1 ..., -1 -1 -1]
 [ 1 -1 -1 ..., -1 -1 -1]]
 
----------------------------------------------------------------------
Signal		X(t)
  of length	n = 1027 digits
trimmed to	x(t)
  of length	n = 1024 digits (n=2^10)
  split into	a = 32 sub-sequences 
		b = 32-digit long

Having such a framework for the initial input data, we are ready to program the WHT Statistical Test based on the Op09 recipe.

2.2. Test Statistic

The 2D matrix $x_{\rm seq}$ holding our signal under investigation is the starting point for its tests for randomness. Op09’s test is based on the computation of the WHT for each row of $x_{\rm seq}$ and on the t-statistics, $t_{ij}$, as a test function based on the Walsh-Hadamard transformation of all sub-sequences of $x(t)$.

It is assumed that for any signal $y(t_i)$ where $i=0,1,…$ the WHT returns a sequence $\{w_i\}$ and: (a) for $w_0$ the mean value is $m_0=2^m(1-2p)$, the variance is given by $\sigma^2_0=2^{m+2}p(1-p)$, and the distribution of $(w_0-m_0)/\sigma_0 \sim N(0,1)$ for $m>7$; (b) for $w_i$ ($i\ge 1$) the mean value is $m_i=0$, the variance is $\sigma^2_i=2^{m+2}p(1-p)$, and the distribution of $(w_i-m_i)/\sigma_i \sim N(0,1)$ for $m>7$. Recalling that $p$ stands for the probability of occurrence of the digit 1 in $x_{{\rm seq},j}$, for $p = 0.5$ (our desired test probability) the mean value of $w_i$ is equal to 0 for every $i$.

In the $x_{\rm seq}$ array, for every $j=0,1,…,(a-1)$ and for every $i=0,1,…,(b-1)$, we compute the t-statistic as follows:
$$
t_{ij} = \frac{w_{ij} - m_i}{\sigma_i}
$$ where $w_{ij}$ is the $i$-th Walsh-Hadamard transform component of the block $j$. In addition, we convert all $t_{ij}$ into $p$-values:
$$
p{\rm-value} = P_{ij} = {\rm Pr}(X\lt t_{ij}) = 1-\frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{\frac{-x^2}{2}} dx
$$ such that $t_{ij}\sim N(0,1)$, i.e. it has a normal distribution with zero mean and unit standard deviation. Are you still with me? Great! The Python code for this part of our analysis may look the following way:

def tstat(x,a,b,M):
    # specify the probability of occurrence of the digit "1"
    p=0.5
    print("Computation of WHTs...")
    for j in xrange(a):
        hwt, _, _ = WHT(x[j])
        if(j==0):
            y=hwt
        else:
            y=np.vstack((y,hwt))   # WHT for xseq
    print("  ...completed")
    print("Computation of t-statistics..."),
    t=[];
    for j in xrange(a):     # over sequences/blocks (rows)
        for i in xrange(b): # over sequence's elements (columns)
            if(i==0):
                if(p==0.5):
                    m0j=0
                else:
                    m0j=(2.**M/2.)*(1.-2.*p)
                sig0j=sqrt((2**M/2)*p*(1.-p))
                w0j=y[j][i]
                t0j=(w0j-m0j)/sig0j
                t.append(t0j)
            else:
                sigij=sqrt((2.**((M+2.)/2.))*p*(1.-p))
                wij=y[j][i]
                tij=wij/sigij
                t.append(tij)
    t=np.array(t)
    print("completed")
    print("Computation of p-values..."),
    # standardised t-statistics; t_{i,j} ~ N(0,1)
    t=(t-np.mean(t))/(np.std(t))
    # p-values = 1-[1/sqrt(2*pi)*integral[exp(-x**2/2),x=-inf..t]]
    P=1-ndtr(t)
    print("completed\n")
    return(t,P,y)

and it returns, as the output, two 1D arrays (vectors) storing the $t_{ij}$ statistics for every $w_{ij}$ (the t variable) and the corresponding $p$-values $P_{ij}$ (the P variable), together with the 2D array of the WHTs themselves (y).

Continuing the above example, we extend its body by adding what follows:

from WalshHadamard import tstat
import matplotlib.pyplot as plt
 
(t, P, xseqWHT) = tstat(xseq,a,b,M)
 
print("t =")
print(t)
print("\np-values =")
print(P)
print("\nw =")
print(xseqWHT)
 
# display WHTs for "a" sequences
plt.imshow(xseqWHT, interpolation='nearest', cmap='PuOr')
plt.colorbar()
plt.xlabel("Sequence's Elements"); plt.ylabel("Sequence Number")
plt.title("Walsh-Hadamard Transforms")
plt.show()

to derive:

Computation of WHTs...
  ...completed
Computation of t-statistics... completed
Computation of p-values... completed
 
t =
[ 0.26609308  0.72730106 -0.69960094 ..., -1.41305193 -0.69960094
  0.01385006]
 
p-values =
[ 0.39508376  0.23352077  0.75791172 ...,  0.92117977  0.75791172
  0.4944748 ]
 
w =
[[ 0.125   0.125  -0.125  ...,  0.125  -0.125  -0.125 ]
 [ 0.1875 -0.1875 -0.0625 ..., -0.0625  0.0625 -0.0625]
 [ 0.0625  0.0625 -0.4375 ..., -0.0625 -0.0625 -0.0625]
 ..., 
 [-0.1875 -0.3125  0.1875 ...,  0.1875  0.1875 -0.1875]
 [-0.125   0.125  -0.125  ..., -0.125  -0.125  -0.375 ]
 [ 0.      0.125  -0.25   ..., -0.25   -0.125   0.    ]]

and, at the end, we display the WHTs computed for every single (horizontal) sequence of the $x_{\rm seq}$ array, as stored in the $w$ matrix returned by the tstat function:
[Figure: Walsh-Hadamard Transforms for all $a=32$ sub-sequences of $x_{\rm seq}$]
Again, every element (a square) in the figure above corresponds to a $w_{ij}$ value. I would also like to draw your attention to the fact that both arrays, $t$ and $P$, are 1D vectors storing all the t-statistics and $p$-values, respectively. In the following WHT tests we will make an additional effort to split them into $a\times b$ matrices corresponding exactly to the $w$-like arrangement. That's an easy part, so don't bother too much about it.
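A minimal sketch of that re-arrangement (toy dimensions only; nothing more than numpy's reshape at work):

import numpy as np
 
# toy example: a flat vector of a*b values re-arranged into an (a x b) matrix,
# so that row j corresponds to the j-th sub-sequence and column i to the i-th
# WHT component -- exactly the layout used by the tests below
a, b = 4, 8
t = np.arange(a*b, dtype=float)   # placeholder for the flat vector of t-statistics
rt = np.reshape(t, (a, b))
print(rt.shape)                   # (4, 8)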

2.3. Statistical Test Framework

In general, here, we are interested in verification of two opposite statistical hypotheses. The concept of hypothesis testing is given in every textbook on the subject. We test our binary signal $x(t)$ for randomness. Therefore, $H_0$: $x(t)$ is generated by a binary memory-less source i.e. the signal does not contain any predictable component; $H_1$: $x(t)$ is not produced by a binary memory-less source, i.e. the signal contains a predictable component.

Within the standard framework of hypothesis testing we can make two errors. The first one refers to $\alpha$ (so-called significance level) and denotes the probability of occurrence of a false positive result. The second one refers to $\beta$ and denotes the probability of the occurrence of a false negative result.

The testing procedure that can be applied here is: for a fixed value of $\alpha$ we find a confidence region for the test statistic and check whether the value of the test statistic falls inside it. The confidence levels are computed using the quantiles $u_{\alpha/2}$ and $u_{1-\alpha/2}$ (unless specified otherwise in the text of the test). Alternatively, if an arbitrary $t_{\rm stat}$ is the value of the test statistic (test function), we may compare $p{\rm-value}={\rm Pr}(X\gt t_{\rm stat})$ with $\alpha$ and decide in favour of randomness when $p$-value $\ge\alpha$.
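As a minimal illustration of that decision rule (a sketch only, for a single, arbitrary value of the test function):

from scipy.stats import norm
from scipy.special import ndtr
 
alpha = 1e-6            # significance level
t_stat = 1.83           # an arbitrary value of the test statistic (illustration only)
 
# confidence region built from the normal quantiles u_{alpha/2} and u_{1-alpha/2}
u1, u2 = norm.ppf(alpha/2.), norm.ppf(1.-alpha/2.)
print("inside the confidence region: %s" % (u1 <= t_stat <= u2))
 
# the alternative decision via the p-value compared against alpha
pvalue = 1.-ndtr(t_stat)   # Pr(X > t_stat) for X ~ N(0,1)
print("p-value = %.6f, supports randomness: %s" % (pvalue, pvalue >= alpha))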

2.4. Crude Decision (Test 1)

The first WHT test from the Op09 suite is a crude decision, or majority decision. For a chosen $\alpha$, with $u_\alpha$ denoting the quantile of order $\alpha$ of the normal distribution, if:
$$
t_{ij} \notin [u_{\alpha/2}; u_{1-\alpha/2}]
$$ then we reject the hypothesis of randomness regarding the $i$-th test statistic of the signal $x(t)$ at the significance level of $\alpha$. Jot down both $j$ and $i$, corresponding to the sequence number and the sequence's element, respectively. Op09 suggests that this test is suitable for small numbers of $a<1/\alpha$, which is practically always fulfilled for our data. We code this test in Python as follows:

def test1(cl,t,a,b,otest):
    alpha=1.-cl/100.
    u1=norm.ppf(alpha/2.)
    u2=norm.ppf(1-alpha/2.)
    Results1=[]
    for l in t:
        if(l<u1 or l>u2):
            Results1.append(0)
        else:
            Results1.append(1)
    nfail=a*b-np.sum(Results1)
    print("Test 1 (Crude Decision)")
    print("  RESULT: %d out of %d test variables stand for " \
          "randomness" % (a*b-nfail,a*b))
    if((a*b-nfail)/float(a*b)>.99):
        print("\t  Signal x(t) appears to be random")
    else:
        print("\t  Signal x(t) appears to be non-random")
    otest.append(100.*(a*b-nfail)/float(a*b)) # gather per cent of positive results
    print("\t  at %.5f%% confidence level" % (100.*(1.-alpha)))
    print
    return(otest)

Op09's decision on the rejection of $H_0$ is too strict for our purposes. In the function we calculate the number of $t_{ij}$'s falling outside the test interval. If it exceeds 1% of all test variables, we claim a lack of evidence for the randomness of $x(t)$ as a whole.

2.5. Proportion of Sequences Passing a Test (Test 2)

Recall that for each row (sub-sequence of $x(t)$) and its elements we have computed both the $t_{ij}$'s and the $P_{ij}$ values. Let's use the latter here. In this test we first check, for every row of the (re-shaped) $P$ 2D array, the number of $p$-values fulfilling $P_{ij}\lt\alpha$. If this number is greater than zero, we reject the $j$-th sub-sequence of $x(t)$ as passing the test at the significance level of $\alpha$. Over all $a$ sub-sequences we count the total number of those which did not pass the test, $n_2$. If:
$$
n_2 \notin \left[ a\alpha + \sqrt{a \alpha (1-\alpha)}\, u_{\alpha/2};\ a\alpha + \sqrt{a \alpha (1-\alpha)}\, u_{1-\alpha/2} \right]
$$ then there is evidence that the signal $x(t)$ is non-random.

We code this test simply as:

def test2(cl,P,a,b,otest):
    alpha=1.-cl/100.
    u1=norm.ppf(alpha/2.)
    u2=norm.ppf(1-alpha/2.)
    Results2=[]
    rP=np.reshape(P,(a,b))  # turning P 1D-vector into (a x b) 2D array!
    for j in xrange(a):
        tmp=rP[j][(rP[j]<alpha)]
        #print(tmp)
        if(len(tmp)>0):
            Results2.append(0)   # fail for sub-sequence
        else:
            Results2.append(1)   # pass
 
    nfail2=a-np.sum(Results2)  # total number of sub-sequences which failed
    t2=nfail2/float(a)
    print("Test 2 (Proportion of Sequences Passing a Test)")
    b1=alpha*a+sqrt(a*alpha*(1-alpha))*u1
    b2=alpha*a+sqrt(a*alpha*(1-alpha))*u2
    if(t2<b1 or t2>b2):
        print("  RESULT: Signal x(t) appears to be non-random")
        otest.append(0.)
    else:
        print("  RESULT: Signal x(t) appears to be random")
        otest.append(100.)
    print("\t  at %.5f%% confidence level" % (100.*(1.-alpha)))
    print
    return(otest)

This test is also well described by Rukhin et al. (2010), though we follow the method of Op09, adjusted to the proportion of $p$-values failing the test for randomness as counted sub-sequence by sub-sequence.

2.6. Uniformity of p-values (Test 3)

In this test the distribution of $p$-values is examined to ensure uniformity. This may be visually illustrated using a histogram, whereby the interval between 0 and 1 is divided into $K=10$ sub-intervals, and the $p$-values, i.e. in our case the $P_{ij}$'s, that lie within each sub-interval are counted and displayed.
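In Python that bin counting is a one-liner with numpy (a toy sketch; Pj below is just a stand-in for a batch of $P_{ij}$'s):

import numpy as np
 
K = 10
Pj = np.random.uniform(size=256)                       # stand-in for a set of P_{ij} p-values
F, edges = np.histogram(Pj, bins=np.arange(0.0, 1.1, 0.1))
print(F)            # occupancy F_i of each of the K=10 sub-intervals of [0,1]
print(F.sum())      # should equal the number of p-values, here 256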

Uniformity may also be determined via an application of a $\chi^2$ test and the determination of a $p$-value corresponding to the Goodness-of-Fit Distributional Test on the $p$-values obtained for an arbitrary statistical test (i.e., a $p$-value of the $p$-values). We accomplish that via computation of the test statistic:
$$
\chi^2 = \sum_{i=1}^{K} \frac{\left( F_i-\frac{a}{K} \right)^2}{\frac{a}{K}}
$$ where $F_i$ is the number of $P_{ij}$'s falling into the $i$-th histogram bin, and $a$ is the number of sub-sequences of $x(t)$ we investigate.

We reject the hypothesis of randomness regarding $i$-th test statistic $t_{ij}$ of $x(t)$ at the significance level of $\alpha$ if:
$$
\chi^2_i \notin [0; \chi^2(\alpha, K-1)] \ .
$$ Here $\chi^2(\alpha, K-1)$ denotes the critical value of the $\chi^2(K-1)$ distribution at the significance level $\alpha$, i.e. its quantile of order $1-\alpha$. In Python we may calculate it in the following way:

from scipy.stats import chi2
alpha=0.001
K=10
# critical value chi^2(alpha,K-1), i.e. the (1-alpha) quantile of the chi2(K-1) distribution;
# the helper Chi2(alpha,K-1) used further below is assumed to wrap this very call
print(chi2.ppf(1.-alpha,K-1))

If our Test 3 value fulfils $\chi_i^2\le \chi^2(\alpha, K-1)$ then we count the $i$-th statistic as not standing against the randomness of $x(t)$. This is equivalent to requiring the corresponding $p$-value of the $\chi^2$ statistic to satisfy $p$-value $\ge\alpha$. The latter can be computed in Python as:

from scipy.special import gammainc
alpha=0.001
K=10
# for some derived variable of chi2_test
pvalue=1-gammainc((K-1)/2.,chi2_test/2.)

where gammainc(a,x) stands for the regularized lower incomplete gamma function defined as:
$$
P(a,x) = \frac{1}{\Gamma(a)} \int_0^x e^{-t} t^{a-1} dt
$$ and $\Gamma(a)$ denotes the standard gamma function.
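As a quick cross-check (a sketch with an arbitrary test value), the expression above is nothing else than the survival function of the $\chi^2(K-1)$ distribution, so both routes should agree:

from scipy.special import gammainc
from scipy.stats import chi2
 
chi2_test, K = 14.2, 10                    # an arbitrary chi-square value (illustration only)
p1 = 1.-gammainc((K-1)/2., chi2_test/2.)   # the formula used in the text
p2 = chi2.sf(chi2_test, K-1)               # scipy's survival function of chi2(K-1)
print("%.8f  %.8f" % (p1, p2))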

Given that, we code Test 3 of Uniformity of $p$-values in the following way:

def test3(cl,P,a,b,otest):
    alpha=1.-cl/100.
    rP=np.reshape(P,(a,b))
    rPT=rP.T
    Results3=0
    for i in xrange(b):
        (hist,bin_edges,_)=plt.hist(rPT[i], bins=list(np.arange(0.0,1.1,0.1)))
        F=hist
        K=len(hist)  # K=10 for bins as defined above
        S=a
        chi2=0
        for j in xrange(K):
            chi2+=((F[j]-S/K)**2.)/(S/K)
        pvalue=1-gammainc((K-1)/2.,chi2/2.)  # p-value of the chi^2 GoF statistic
        if(pvalue>=alpha and chi2<=Chi2(alpha,K-1)):
            Results3+=1
    print("Test 3 (Uniformity of p-values)")
    print("  RESULT: %d out of %d test variables stand for randomness"\
          % (Results3,b))
    if((Results3/float(b))>.99):
        print("\t  Signal x(t) appears to be random")
    else:
        print("\t  Signal x(t) appears to be non-random")
    otest.append(100.*(Results3/float(b)))
    print("\t  at %.5f%% confidence level" % (100.*(1.-alpha)))
    print
    return(otest)

where, again, we allow fewer than 1% of all results to fail without rejecting $x(t)$ as a random signal.

Please note the transposition of the rP matrix. The reason for testing the WHT values (converted into $t_{ij}$'s or $P_{ij}$'s) column-wise is to detect autocorrelation patterns in the tested signal $x(t)$. The same approach is applied by Op09 in Test 4 and Test 5, as discussed in the following sections, and it constitutes the core of the invention and creativity added on top of Rukhin et al. (2010)'s suite of 16 tests for signals generated by various PRNGs, designed to meet the cryptographic levels of acceptance as close-to-truly random.

2.7. Maximum Value Decision (Test 4)

This test is again based on the confidence-levels approach. Let $T_i=\max_j t_{ij}$; then if:
$$
T_i \notin \left[ u_{\left(\frac{\alpha}{2}\right)^{a^{-1}}}; u_{\left(1-\frac{\alpha}{2}\right)^{a^{-1}}} \right]
$$ we reject the hypothesis of randomness (regarding the $i$-th test function) of the signal $x(t)$ at the significance level of $\alpha$.
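Where do those unusual quantile orders come from? A quick way to see it (treating, for a fixed $i$, the $a$ values of $t_{ij}$ as independent $N(0,1)$ variables) is to note that their maximum satisfies:
$$
{\rm Pr}(T_i \le u) = \left[\Phi(u)\right]^a \ ,
$$ so setting $\left[\Phi(u)\right]^a$ equal to $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ at the two edges of the acceptance interval yields the quantiles of orders $(\alpha/2)^{1/a}$ and $(1-\alpha/2)^{1/a}$, exactly as used in the code. We encode the test in Python as simply as this: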

def test4(cl,t,a,b,otest):
    alpha=1.-cl/100.
    rt=np.reshape(t,(a,b))
    rtT=rt.T
    Results4=0
    for i in xrange(b):
        tmp=np.max(rtT[i])
        u1=norm.ppf((alpha/2.)**(1./a))
        u2=norm.ppf((1.-alpha/2.)**(1./a))
        if not(tmp<u1 or tmp>u2):
            Results4+=1
    print("Test 4 (Maximum Value Decision)")
    print("  RESULT: %d out of %d test variables stand for randomness" % (Results4,b))
    if((Results4/float(b))>.99):
        print("\t  Signal x(t) appears to be random")
    else:
        print("\t  Signal x(t) appears to be non-random")
    otest.append(100.*(Results4/float(b)))
    print("\t  at %.5f%% confidence level" % (100.*(1.-alpha)))
    print
    return(otest)

Pay attention to how this test looks at the results derived from the WHTs: it is sensitive to the distribution of the maximal values across the $i$-th elements of the $t$-statistics.

2.8. Sum of Square Decision (Test 5)

The final test makes use of the $C$-statistic defined as:
$$
C_i = \sum_{j=0}^{a-1} t_{ij}^2 \ .
$$ If $C_i \notin [0; \chi^2(\alpha, a)]$ we reject the hypothesis of randomness of $x(t)$ at the significance level of $\alpha$ regarding the $i$-th test function. The Python implementation of this test takes the form:

def test5(cl,t,a,b,otest):
    alpha=1.-cl/100.
    rt=np.reshape(t,(a,b))
    rtT=rt.T
    Results5=0
    for i in xrange(b):
        Ci=0
        for j in xrange(a):
           Ci+=(rtT[i][j])**2.
        if(Ci<=Chi2(alpha,a)):
            Results5+=1
    print("Test 5 (Sum of Square Decision)")
    print("  RESULT: %d out of %d test variables stand for randomness" % (Results5,b))
    if((Results5/float(b))>.99):
        print("\t  Signal x(t) appears to be random")
    else:
        print("\t  Signal x(t) appears to be non-random")
    otest.append(100.*(Results5/float(b)))
    print("\t  at %.5f%% confidence level" % (100.*(1.-alpha)))
    print
    return(otest)

Again, we allow for 1% of false negative results.

2.9. The Overall Test for Randomness of Binary Signal

We accept the signal $x(t)$ as random if the average passing rate from all five WHT statistical tests is at least 99%, i.e. 1% can be due to false negative results, at the significance level of $\alpha$.

def overalltest(cl,otest):
    alpha=1.-cl/100.
    line()
    print("THE OVERALL RESULT:")
    if(np.mean(otest)>=99.0):
        print("   Signal x(t) displays an evidence for RANDOMNESS"),
        T=1
    else:
        print("   Signal x(t) displays an evidence for NON-RANDOMNESS"),
        T=0
    print("at %.5f%% c.l." % (100.*(1.-alpha)))
    print("   based on Walsh-Hadamard Transform Statistical Test\n")
    return(T)

and we run all five tests by calling the following function:

def WHTStatTest(cl,X):
    (xseq,xt,a,b,M) = xsequences(X)
    info(X,xt,a,b,M)
    if(M<7):
        line()
        print("Error:  Signal x(t) too short for WHT Statistical Test")
        print("        Acceptable minimal signal length: n=2^7=128\n")
    else:
        if(M>=7 and M<19):
            line()
            print("Warning: Statistically advisable signal length: n=2^19=524288\n")
        line()
        print("Test Name: Walsh-Hadamard Transform Statistical Test\n")
        (t, P, _) = tstat(xseq,a,b,M)
        otest=test1(cl,t,a,b,[])
        otest=test2(cl,P,a,b,otest)
        otest=test3(cl,P,a,b,otest)
        otest=test4(cl,t,a,b,otest)
        otest=test5(cl,t,a,b,otest)
        T=overalltest(cl,otest)
        return(T)  # 1 if x(t) is random

fed with a binary $\pm 1$ signal $X$ (see the example in Section 2.1). This last function returns the variable T storing 1 for the overall decision that $x(t)$ is random, and 0 otherwise. It can be used for a great number of repeated WHT tests over different signals in a loop, e.g. for determining the ratio of instances in which the WHT Statistical Test is passed.
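For instance, a minimal sketch of such a loop (assuming WalshHadamard.py is on the path) could look like this:

# estimate the fraction of freshly generated +/-1 signals that pass the WHT test
from WalshHadamard import WHTStatTest
from random import randrange
 
cl = 99.9999            # confidence level of the test
ntrials, passed = 20, 0
for _ in range(ntrials):
    X = [randrange(-1, 2, 2) for _ in range(2**13)]   # a new random +/-1 signal
    passed += WHTStatTest(cl, X)                      # adds 1 if x(t) is judged random
print("Fraction of signals passing the test: %.2f" % (passed/float(ntrials)))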

3. Randomness of random()

I know what you think right now. I have just spent an hour reading all that stuff so far, so how about some real-life tests? I'm glad you asked! Here we go! We start with the Mersenne Twister algorithm, the Rolls-Royce engine behind Python's random() function (and its derivatives). The whole fun of the theoretical part given above comes down to a few lines of code, as given below.

Let’s see our WHT Suite of 5 Statistical Tests in action for a very long (of length $n=2^{21}$) random binary signal of $\pm 1$ form. Let’s run the exemplary main program calling the test:

# Walsh-Hadamard Transform and Tests for Randomness of Financial Return-Series
# (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
#
# Mersenne Twister PRNG test using WHT Statistical Test
 
from WalshHadamard import WHTStatTest, line
from random import randrange
 
# define confidence level for WHT Statistical Test
cl=99.9999
 
# generate random binary signal X(t)
X=[randrange(-1,2,2) for i in xrange(2**21)]
line()
print("X(t) =")
for i in range(20):
    print(X[i]),
print("...")
 
WHTStatTest(cl,X)

which returns lovely results, for instance:

----------------------------------------------------------------------
X(t) =
-1 -1 1 1 1 -1 1 -1 1 1 1 1 1 -1 -1 1 1 -1 1 1 ...
----------------------------------------------------------------------
Signal		X(t)
  of length	n = 2097152 digits
trimmed to	x(t)
  of length	n = 2097152 digits (n=2^21)
  split into	a = 1024 sub-sequences 
		b = 2048-digit long
 
----------------------------------------------------------------------
Test Name: Walsh-Hadamard Transform Statistical Test
 
Computation of WHTs...
  ...completed
Computation of t-statistics... completed
Computation of p-values... completed
 
Test 1 (Crude Decision)
  RESULT: 2097149 out of 2097152 test variables stand for randomness
	  Signal x(t) appears to be random
	  at 99.99990% confidence level
 
Test 2 (Proportion of Sequences Passing a Test)
  RESULT: Signal x(t) appears to be random
	  at 99.99990% confidence level
 
Test 3 (Uniformity of p-values)
  RESULT: 2045 out of 2048 test variables stand for randomness
	  Signal x(t) appears to be random
	  at 99.99990% confidence level
 
Test 4 (Maximum Value Decision)
  RESULT: 2047 out of 2048 test variables stand for randomness
	  Signal x(t) appears to be random
	  at 99.99990% confidence level
 
Test 5 (Sum of Square Decision)
  RESULT: 2048 out of 2048 test variables stand for randomness
	  Signal x(t) appears to be random
	  at 99.99990% confidence level
 
----------------------------------------------------------------------
THE OVERALL RESULT:
   Signal x(t) displays an evidence for RANDOMNESS at 99.99990% c.l.
   based on Walsh-Hadamard Transform Statistical Test

where we assumed the significance level of $\alpha=0.000001$. Impressive indeed!

At the same $\alpha$, however, for a shorter signal ($a=256$ sub-sequences; $n=2^{16}$), we still get a significant number of passed tests supporting the randomness of random(), for instance:

----------------------------------------------------------------------
X(t) =
-1 -1 1 -1 1 1 1 1 1 -1 -1 -1 -1 -1 -1 1 1 -1 1 -1 ...
----------------------------------------------------------------------
Signal		X(t)
  of length	n = 65536 digits
trimmed to	x(t)
  of length	n = 65536 digits (n=2^16)
  split into	a = 256 sub-sequences 
		b = 256-digit long
 
----------------------------------------------------------------------
Warning: Statistically advisable signal length: n=2^19=524288
 
----------------------------------------------------------------------
Test Name: Walsh-Hadamard Transform Statistical Test
 
Computation of WHTs...
  ...completed
Computation of t-statistics... completed
Computation of p-values... completed
 
Test 1 (Crude Decision)
  RESULT: 65536 out of 65536 test variables stand for randomness
	  Signal x(t) appears to be random
	  at 99.99990% confidence level
 
Test 2 (Proportion of Sequences Passing a Test)
  RESULT: Signal x(t) appears to be random
	  at 99.99990% confidence level
 
Test 3 (Uniformity of p-values)
  RESULT: 254 out of 256 test variables stand for randomness
	  Signal x(t) appears to be random
	  at 99.99990% confidence level
 
Test 4 (Maximum Value Decision)
  RESULT: 255 out of 256 test variables stand for randomness
	  Signal x(t) appears to be random
	  at 99.99990% confidence level
 
Test 5 (Sum of Square Decision)
  RESULT: 256 out of 256 test variables stand for randomness
	  Signal x(t) appears to be random
	  at 99.99990% confidence level
 
----------------------------------------------------------------------
THE OVERALL RESULT:
   Signal x(t) displays an evidence for RANDOMNESS at 99.99990% c.l.
   based on Walsh-Hadamard Transform Statistical Test

For this random binary signal $x(t)$, the Walsh-Hadamard Transforms for the first 64 signal sub-sequences reveal pretty “random” distributions of $w_{ij}$ values:
[Figure: Walsh-Hadamard Transforms for the first 64 sub-sequences of the random signal]

4. Randomness of Financial Return-Series

Eventually, we stand face to face with the grand-finale question: can financial return time-series be of a random nature? More precisely, if we convert any return-series (regardless of the time step, frequency of trading, etc.) into a binary $\pm 1$ signal, do the corresponding positive and negative returns occur randomly or not? Good question.

Let's see, using the example of the 30-min FX return-series of the USDCHF currency pair traded between Sep/2009 and Nov/2014, how our WHT test for randomness works. We use the tick-data downloaded from Pepperstone.com and rebinned into an evenly distributed 30-min price series, as described in my earlier post on Rebinning Tick-Data for FX Algo Traders. We read the data in Python and convert the price-series into a return-series.

What follows is similar to what we have done in the previous examples. First, we convert the return-series into a binary signal. As we will see, the signal $X(t)$ is 65731 points long and can be split into 256 sub-sequences, each 256 points long; therefore $n=2^{16}$ stands for the trimmed signal $x(t)$. For the sake of clarity, we plot the first 64 segments (each 256 points long) of the USDCHF price-series, marking all segments with vertical lines. Next, we run the WHT Statistical Test for all $a=256$ sequences but display the WHTs only for the first 64 blocks, as visualised for the price-series. The main code takes the form:

# Walsh-Hadamard Transform and Tests for Randomness of Financial Return-Series
# (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
#
# 30-min FX time-series (USDCHF) traded between 9/2009 and 11/2014
 
from WalshHadamard import WHTStatTest, ret2bin, line as Line
import matplotlib.pyplot as plt
import numpy as np
import csv
 
# define confidence level for WHT Statistical Test
cl=99.9999
 
# open the file and read in the 30min price of USD/CHF
P=[]
with open("USDCHF.30m") as f:
    c = csv.reader(f, delimiter=' ', skipinitialspace=True)
    for line in c:
        price=line[6]
        P.append(price)
 
x=np.array(P,dtype=np.float128) # convert to numpy array
r=x[1:]/x[0:-1]-1.  # get a return-series
Line()
print("r(t) =")
for i in range(7):
    print("%8.5f" % r[i]),
print("...")
 
X=ret2bin(r)
print("X(t) =")
for i in range(7):
    print("%8.0f" %X[i]),
print("...")
 
plt.plot(P)
plt.xlabel("Time (from 1/05/2009 to 27/09/2010)")
plt.ylabel("USDCHF (30min)")
#plt.xlim(0,len(P))
plt.ylim(0.95,1.2)
plt.xlim(0,64*256)
plt.gca().xaxis.set_major_locator(plt.NullLocator())
for xs in range(0,64*256,256):  # mark the 64 plotted 256-point segments
    plt.hold(True)
    plt.plot((xs,xs), (0,10), 'k-')
plt.show()
plt.close("all")
 
WHTStatTest(cl,X)

and displays both plots as follows: (a) the clipped USDCHF price-series,
[Figure: USDCHF 30-min price-series; the first 64 segments of 256 points each are marked with vertical lines]
and (b) the WHTs for the first 64 sequences, each 256 points long,
[Figure: Walsh-Hadamard Transforms for the first 64 sub-sequences of the USDCHF binary signal]

From the comparison of both figures one can understand the level of detail at which the WHT results are derived. Interestingly, for the FX return-series, the WHT picture seems to be quite non-uniform, suggesting that our USDCHF return-series is random. Is it so? The final answer is delivered by the results of the WHT statistical test, summarised by our program as follows:

----------------------------------------------------------------------
r(t) =
-0.00092 -0.00033 -0.00018  0.00069 -0.00009 -0.00003 -0.00086 ...
X(t) =
      -1       -1       -1        1       -1       -1       -1 ...
----------------------------------------------------------------------
Signal		X(t)
  of length	n = 65731 digits
trimmed to	x(t)
  of length	n = 65536 digits (n=2^16)
  split into	a = 256 sub-sequences 
		b = 256-digit long
 
----------------------------------------------------------------------
Warning: Statistically advisable signal length: n=2^19=524288
 
----------------------------------------------------------------------
Test Name: Walsh-Hadamard Transform Statistical Test
 
Computation of WHTs...
  ...completed
Computation of t-statistics... completed
Computation of p-values... completed
 
Test 1 (Crude Decision)
  RESULT: 65536 out of 65536 test variables stand for randomness
	  Signal x(t) appears to be random
	  at 99.99990% confidence level
 
Test 2 (Proportion of Sequences Passing a Test)
  RESULT: Signal x(t) appears to be random
	  at 99.99990% confidence level
 
Test 3 (Uniformity of p-values)
  RESULT: 250 out of 256 test variables stand for randomness
	  Signal x(t) appears to be non-random
	  at 99.99990% confidence level
 
Test 4 (Maximum Value Decision)
  RESULT: 255 out of 256 test variables stand for randomness
	  Signal x(t) appears to be random
	  at 99.99990% confidence level
 
Test 5 (Sum of Square Decision)
  RESULT: 256 out of 256 test variables stand for randomness
	  Signal x(t) appears to be random
	  at 99.99990% confidence level
 
----------------------------------------------------------------------
THE OVERALL RESULT:
   Signal x(t) displays an evidence for RANDOMNESS at 99.99990% c.l.
   based on Walsh-Hadamard Transform Statistical Test

Nice!! Wasn't it worth all the effort to see this result?!

I'll leave you with the pure joy of using the software I created. There is a lot to say and a lot to verify. Even the code itself can be amended a bit and adjusted for different numbers of sub-sequences $a$ and lengths $b$. If you discover strong evidence for non-randomness, post it in the comments or drop me an e-mail. I can't do your homework. It's your turn. I need to sleep sometimes… ;-)

DOWNLOAD
     WalshHadamard.py, USDCHF.30m

REFERENCES
    to be added

Applied Portfolio VaR Decomposition. (3) Incremental VaR and Portfolio Revaluation.

Portfolios are like commercial aircraft. They rely on computers. They are dotted around the world. Vulnerable to ever-changing weather conditions. Designed to bring benefits. Crafted to take risks and survive when the unexpected happens. As portfolio managers we have the ability to control those risk levels and adjust positions accordingly. Turbulence may occur anytime. A good pilot then knows what to do.


In Part 1 we applied the analytical VaR methodology to an exemplary 7-asset portfolio of equities, deriving a running portfolio VaR, marginal VaR, and component VaR. In Part 2 we examined the impact of the amount of historical data on the covariance matrix estimation used in the portfolio VaR calculations.

Today, we will add a third component: an incremental VaR.

Altitude Adjustment

For a current portfolio holding $N$ assets with a total exposure of $C$ (dollars) we get a 95% portfolio VaR estimate simply by computing:
$$
\mbox{VaR}_P = 1.65 \sigma_P C = 1.65\left(\boldsymbol{d}’ \boldsymbol{M}_2 \boldsymbol{d} \right)^{1/2}
$$ This can be estimated at any time. For example, if we evaluate a portfolio of US stocks on a daily basis, and new data are available for download after 4 pm when the markets close, the 1-day VaR can be recalculated. If it increases day-over-day and, say, one of the component VaRs behaves in an abnormal way (raising a warning flag), we may want to do something about that particular position: remove it completely or decrease the exposure.

A second scenario could be more optimistic. Say we have extra cash to invest and we want to increase one (or a few) of the currently held positions in our portfolio. Adding more shares of the $i$-th asset will increase the portfolio VaR. The secret is in not pushing the running engines too hard and keeping the oil temperature within acceptable brackets.

Alternatively, imagine you are a risk manager working in a global bank. The bank's portfolio lists 10,453 open positions (across various markets and continents) and you receive a proposal of a new trade from one of the bank's clients. He wants to invest $\$$2,650,000 in a highly illiquid asset. It is physically possible, but the bank's VaR may increase significantly overnight. By how much?

Well, in this or any similar scenario, you need to perform a portfolio revaluation employing the incremental VaR procedure. Your aircraft may be brand new, but remember that big planes fall from the sky too.

In general, by the incremental VaR we mean the difference between the new portfolio VaR (including (a) new trade(s) or a change in (a) current holding(s)) and the recent portfolio VaR:
$$
\mbox{iVaR}_P = \mbox{VaR}_{P+a} - \mbox{VaR}_P
$$ where $a$ symbolises the proposed change. The incremental VaR changes in a non-linear way, as the position size can be increased or reduced by any amount. That is the main difference as compared to the marginal VaR.
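For a portfolio of a moderate size the exact revaluation is straightforward: we simply plug the updated exposure vector into the same analytical formula as above,
$$
\mbox{VaR}_{P+a} = 1.65\left[ (\boldsymbol{d}+\boldsymbol{a})' \boldsymbol{M}_2 (\boldsymbol{d}+\boldsymbol{a}) \right]^{1/2} , \quad\quad
\mbox{iVaR}_P = \mbox{VaR}_{P+a} - \mbox{VaR}_P \ .
$$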

It is possible to expand $\mbox{VaR}_{P+a}$ in a series around the original point, ignoring second-order terms if the deviations $a$ are small:
$$
\mbox{VaR}_{P+a} = \mbox{VaR}_{P} + (\Delta \mbox{VaR})’a + …
$$ and by "small" I mean a dollar change in (a) current holding(s) much, much less than the exposure $C$. In such a case,
$$
\mbox{iVaR}_P^{\rm approx} \approx (\Delta \mbox{VaR})’a
$$ where the apostrophe denotes the transposition of the vector. This approximation becomes important for really large portfolios (e.g. in banks) where the revaluation and a new estimation of the bank's VaR need to be derived quickly. A time saver at a reliable precision of landing. In all other cases, a new vector of extra exposures can simply be added to the present position sizes and the portfolio reevaluated. Let's see how it works in practice!

 

Case Study A: Increasing Position of One Asset

Let's recall the same initial settings as in Part 1. We deal with a 7-asset portfolio of US equities as of the 12th of January, 2015. There is nearly 3 million dollars invested in the stock market (I wish it were mine! ;) and, based on 3 years of historical data for all underlying stocks, the estimated 1-day 95% portfolio VaR is $\$$33,027.94 as forecast for the next trading day. Id est, there is a 5% chance that we could lose 1.1% of the portfolio value before Englishmen in New York raise their cups of tea:

% Applied Portfolio VaR Decomposition. (3) Incremental VaR and Portfolio 
% Revaluation.
%
% (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
 
clear all; close all; clc;
format bank
 
%{
% Read the list of the portfolio components
fileID = fopen('portfolio.lst');
tmp = textscan(fileID, '%s');
fclose(fileID);
pc=tmp{1};  % a list as a cell array
 
% fetch stock data for last 3 years since:
t0=735976; % 12-Jan-2015
date2=datestr(t0,'yyyy-mm-dd');        % to
date1=datestr(t0-3*365,'yyyy-mm-dd');  % from
 
% create an empty array for storing stock data
stockd={};
% scan the tickers and fetch the data from Quandl.com
for i=1:length(pc)
    quandlc=['WIKI/',pc{i}];
    fprintf('%4.0f %s\n',i,quandlc);
    % fetch the data of the stock from Quandl
    % using the recommended Quandl command and
    % saving them directly into Matlab's FTS object (fts)
    fts=0;
    [fts,headers]=Quandl.get(quandlc,'type','fints', ...
                  'authcode','YourQuandlCode',...
                  'start_date',date1,'end_date',date2);
    stockd{i}=fts; % entire FTS object in an array's cell
end
save data
%}
 
load data
 
% limit data to 3 years of business days, select adjusted
% close price time-series and calculate return-series
rs=[];
for i=1:length(pc)
    cp=fts2mat(stockd{i}.Adj_Close,0);
    rv=cp(2:end,1)./cp(1:end-1,1)-1;
    rs=[rs rv(end-(3*5*4*12):end)];
end
rs(isnan(rs))=0.0;
 
% covariance matrix
M2=cov(rs)
 
% an initial dollar exposure per position
d=[   55621.00; ...
     101017.00; ...
      23409.00; ...
    1320814.00; ...
     131145.00; ...
     321124.00; ...
    1046867.00]
 
% invested capital of C
C=sum(d)
 
% the portfolio volatility
vol_P=sqrt(d'*M2*d);
 
% diversified portfolio VaR
VaR_P=1.65*vol_P
 
% volatility of assets (based on daily returns)
v=sqrt(diag(cov(rs)));
 
% individual and undiversified portfolio VaR
VaRi=1.65*abs(v.*d);
uVaR_P=sum(VaRi);
 
% the portfolio betas
beta=C*(M2*d)/vol_P^2;
 
% the portfolio marginal VaR
DVaR=1.65*(M2*d)/vol_P;
 
% the component VaR
cVaR=100*DVaR.*d/VaR_P; % [percent]
cVaRd=DVaR.*d; % dollars

Now, we want to add $\$$10,000 to our exposure in Walt Disney Co. stock. As of Jan/12, 2015 DIS closes at $\$$94.48, therefore we consider buying $\$$10,000/$\$$94.48=105.8 shares on the next day. We are flexible, so we buy at the market price of $\$$95.23 on Jan/13, i.e. an extra 105 shares of DIS. Therefore we added $105\times \$95.23$ or $\$$9,999.15 to our portfolio:

% proposed changes to current exposure
da=[0;       ...
    9999.15; ...
    0;       ...
    0;       ...
    0;       ...
    0;       ...
    0            ];

Of course, we can now estimate the new portfolio VaR assuming the extra 105 shares of DIS. In real weather conditions, this number can be lower. Keep that in mind.

The remaining part seems to be easy to code:

% new portfolio VaR
VaR_Pa=1.65*sqrt((d+da)'*M2*(d+da))
% incremental VaR
iVaR=VaR_Pa-VaR_P;
% incremental VaR (approximation; see text)
iVaR_approx=DVaR'*da;
 
% new component VaR (approximation: re-uses the current marginal VaR and VaR_P)
ncVaR=100*DVaR.*(d+da)/VaR_P; % [percent]
ncVaRd=DVaR.*(d+da); % dollars

followed by a new risk report on the impact of proposed changes:

fprintf('Current exposure:\t\t\t $%1s\n',bankformat(C));
fprintf('Current Portfolio VaR (diversified):\t $%1s\n',bankformat(VaR_P));
fprintf('Proposed changes in holdings:\n');
for i=1:length(d)
    fprintf('   %s\t $%-s\n',pc{i},bankformat(da(i)) );
end
fprintf('New Portfolio VaR (diversified):\t $%1s\n',bankformat(VaR_Pa));
fprintf('Incremental VaR:\t\t\t $%1s\n',bankformat(iVaR));
fprintf('Incremental VaR (approximation):\t $%1s\n',bankformat(iVaR_approx));
fprintf('Change in exposure:\t\t\t %-1.4f%%\n',100*sum(da)/C);
fprintf('Component VaR:\n');
fprintf('\t Current\t\t New\t\t\tChange in CVaR\n');
for i=1:length(d)
    fprintf('   %s\t%6.2f%%   $%-10s\t%6.2f%%   $%-10s\t$%-10s\n', ...
            pc{i},cVaR(i), ...
            bankformat(cVaRd(i)),ncVaR(i), ...
            bankformat(ncVaRd(i)),bankformat(ncVaRd(i)-cVaRd(i)) );
end

In this case study, the report says:

Current exposure:			 $2,999,997.00
Current Portfolio VaR (diversified):	 $33,027.94
Proposed changes in holdings:
   AAPL	 $0.00
   DIS	 $9,999.15
   IBM	 $0.00
   JNJ	 $0.00
   KO	 $0.00
   NKE	 $0.00
   TXN	 $0.00
New Portfolio VaR (diversified):	 $33,084.26
Incremental VaR:			 $56.31
Incremental VaR (approximation):	 $55.84
Change in exposure:			 0.3333%
Component VaR:
	 Current		 New			Change in CVaR
   AAPL	  0.61%   $202.99    	  0.61%   $202.99    	$0.00      
   DIS	  1.71%   $564.13    	  1.88%   $619.97    	$55.84     
   IBM	  0.26%   $84.43     	  0.26%   $84.43     	$0.00      
   JNJ	 30.99%   $10,235.65 	 30.99%   $10,235.65 	$0.00      
   KO	  1.19%   $392.02    	  1.19%   $392.02    	$0.00      
   NKE	  9.10%   $3,006.55  	  9.10%   $3,006.55  	$0.00      
   TXN	 56.14%   $18,542.15 	 56.14%   $18,542.15 	$0.00

That's a great example showing how the approximation of the incremental VaR works in practice. The $\$$10,000 added to DIS is "small" as compared to $C$ of nearly 3 million dollars in the game. By adding 105 shares of DIS we didn't move the 1-day portfolio VaR significantly, which is good. Let's play again.
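Before we do, a quick sanity check of that approximation using only the numbers printed in the report above: the current (dollar) component VaR of DIS implies a marginal VaR of
$$
\Delta\mbox{VaR}_{\rm DIS} = \frac{\mbox{cVaR}_{\rm DIS}}{d_{\rm DIS}} = \frac{\$564.13}{\$101{,}017} \approx 0.005585 \ ,
$$ so that
$$
\mbox{iVaR}_P^{\rm approx} \approx \Delta\mbox{VaR}_{\rm DIS}\, a_{\rm DIS} = 0.005585 \times \$9{,}999.15 \approx \$55.84 \ ,
$$ in agreement with the "Incremental VaR (approximation)" line of the report.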

 

Case Study B: Increase and Decrease Exposure

From the initial risk report we learnt that TXN contributes most to the overall portfolio VaR. Let's say we have an extra $\$$126,000 to invest and we think that AAPL will be the next shining star. At the same time, we want to reduce our market exposure to TXN by selling stock worth a total of $\$$500,000. This operation would be denoted as:

% proposed changes to current exposure
da=[126000;  ...
    0;       ...
    0;       ...
    0;       ...
    0;       ...
    0;       ...
    -500000      ];

and

Current exposure:			 $2,999,997.00
Current Portfolio VaR (diversified):	 $33,027.94
Proposed changes in holdings:
   AAPL	 $126,000.00
   DIS	 $0.00
   IBM	 $0.00
   JNJ	 $0.00
   KO	 $0.00
   NKE	 $0.00
   TXN	 $-500,000.00
New Portfolio VaR (diversified):	 $25,869.04
Incremental VaR:			 $-7,158.90
Incremental VaR (approximation):	 $-8,396.16
Change in exposure:			 -12.4667%
Component VaR:
	 Current		 New			Change in CVaR
   AAPL	  0.61%   $202.99    	  2.01%   $662.85    	$459.85    
   DIS	  1.71%   $564.13    	  1.71%   $564.13    	$0.00      
   IBM	  0.26%   $84.43     	  0.26%   $84.43     	$0.00      
   JNJ	 30.99%   $10,235.65 	 30.99%   $10,235.65 	$0.00      
   KO	  1.19%   $392.02    	  1.19%   $392.02    	$0.00      
   NKE	  9.10%   $3,006.55  	  9.10%   $3,006.55  	$0.00      
   TXN	 56.14%   $18,542.15 	 29.33%   $9,686.13  	$-8,856.02

The incremental VaR reacts to the massive sell-out of TXN, reducing the overall portfolio VaR significantly. Adding $\$$126k to AAPL (share-number precision omitted for the sake of simplicity) does not affect the portfolio risk levels in any dramatic way.

 

Case Study C: Adding Two New Assets

You are a child of a new era, the era of credit cards. More and more people use them worldwide. You have this brilliant idea to add to your portfolio an extra $\$$100,000 invested in (split equally between) Mastercard Inc. and Visa Inc.

Considering such an operation requires a new portfolio revaluation. Let's do it step by step. First, let's fetch the historical data for both MA and V stocks, the same way as we did at the very beginning. We save the ticker names in a separate file named proposal.lst:

MA
V

We supplement our Matlab code with a few extra lines:

% Read the list of the portfolio components
fileID = fopen('proposal.lst');
tmp = textscan(fileID, '%s');
fclose(fileID);
pcE=tmp{1};  % a list as a cell array
 
% fetch stock data for last 3 years since:
t0=735976; % 12-Jan-2015
date2=datestr(t0,'yyyy-mm-dd');        % to
date1=datestr(t0-3*365,'yyyy-mm-dd');  % from
 
% create an empty array for storing stock data
stockdp={};
% scan the tickers and fetch the data from Quandl.com
for i=1:length(pcE)
    quandlc=['WIKI/',pcE{i}];
    fprintf('%4.0f %s\n',i,quandlc);
    % fetch the data of the stock from Quandl
    % using the recommended Quandl command and
    % saving them directly into Matlab's FTS object (fts)
    fts=0;
    [fts,headers]=Quandl.get(quandlc,'type','fints', ...
                  'authcode','YourQuandlCode',...
                  'start_date',date1,'end_date',date2);
    stockdp{i}=fts; % entire FTS object in an array's cell
end
save datap
 
load datap
 
% limit data to 3 years of business days, select adjusted
% close price time-series and calculate return-series
rsp=[];
for i=1:length(pcE)
    cp=fts2mat(stockdp{i}.Adj_Close,0);
    rv=cp(2:end,1)./cp(1:end-1,1)-1;
    rsp=[rsp rv(end-(3*5*4*12):end)];
end
rsp(isnan(rsp))=0.0;

As soon as the new data arrive, we build an extended return-series matrix, now not $721×7$ but $721×9$ in its dimensions:

rsE=[rs rsp];

What follows is pretty intuitive to code:

% covariance matrix
M2E=cov(rsE);
 
% proposed changes to current exposure
daE=[0;       ...
     0;       ...
     0;       ...
     0;       ...
     0;       ...
     0;       ...
     0;       ...
     50000;   ...
     50000        ];
 
% new portfolio VaR
d=[d; 0; 0]  % we have to extend it to allow d+daE
VaR_PaE=1.65*sqrt((d+daE)'*M2E*(d+daE))
 
% a running list of tickers, too
pcE=[pc; pcE]
 
% incremental VaR
iVaRE=VaR_PaE-VaR_P;
 
fprintf('Current exposure:\t\t\t $%1s\n',bankformat(C));
fprintf('Current Portfolio VaR (diversified):\t $%1s\n',bankformat(VaR_P));
fprintf('Proposed changes in holdings:\n');
for i=1:length(daE)
    fprintf('   %s\t $%-s\n',pcE{i},bankformat(daE(i)) );
end
fprintf('New Portfolio VaR (diversified):\t $%1s\n',bankformat(VaR_PaE));
fprintf('Incremental VaR:\t\t\t $%1s\n',bankformat(iVaRE));

generating new risk estimations:

Current exposure:			 $2,999,997.00
Current Portfolio VaR (diversified):	 $33,027.94
Proposed changes in holdings:
   AAPL	 $0.00
   DIS	 $0.00
   IBM	 $0.00
   JNJ	 $0.00
   KO	 $0.00
   NKE	 $0.00
   TXN	 $0.00
   MA	 $50,000.00
   V	 $50,000.00
New Portfolio VaR (diversified):	 $33,715.09
Incremental VaR:			 $687.15

From this example we can see that extending our portfolio with these two assets would lead to an increase of the 1-day portfolio VaR by approximately $\$$687. A digestible amount.

Stay tuned as in Part 4 we go extreme with Extreme VaR.

Endnote

When you pilot an aircraft from Amsterdam to Tokyo, the path usually runs over Norway, the North Pole, and eastern Russia. If one of the engines goes down, you switch it off and land the plane at the nearest airport. The risk in commercial aviation is too high to continue the flight.

As a portfolio manager you deal with similar situations, but you do not necessarily close your portfolio at once. Instead, you reduce the risk of the troublesome assets. You can still fly in the markets, but with reduced thrust and hopes for big gains. But don't worry! Every plane lands with its nose up towards the sky. You can be losing but stay positive till the very end. You never know when cross-winds will change your situation. You may be in the game again before the bell closes the session.

Applied Portfolio VaR Decomposition. (2) Impact vs Moving Elements.


Calculations of the daily Value-at-Risk (VaR) for any $N$-asset portfolio, as we have already studied in Part 1, heavily depend on the covariance matrix we need to estimate. This estimation requires historical return time-series. An often neglected but superbly important question one should ask here is: how long?! How long ought those return-series to be in order to provide us with fair and sufficient information on the daily return distribution of each asset held in the portfolio?

It is intuitive that taking data covering solely the past 250 trading days versus, for example, the past 2500 days (the moving element) would have a substantial impact on the calculation of the current portfolio VaR. In this short post, we will examine this relation.

Case Study (Supplementary Research)

Using the example of the 7-asset portfolio from Part 1, we modify our Matlab code to download the history of (adjusted) close stock prices covering the past 10 calendar years (ca. 3600 days),

% Applied Portfolio VaR Decomposition. (2) Impact vs Moving Elements.
% (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
 
 
clear all; close all; clc;
format short
 
% Read the list of the portfolio components
fileID = fopen('portfolio.lst');
tmp = textscan(fileID, '%s');
fclose(fileID);
pc=tmp{1};  % a list as a cell array
 
% fetch stock data for last 10 years since:
t0=735976; % 12-Jan-2015
date2=datestr(t0,'yyyy-mm-dd');         % to
date1=datestr(t0-10*365,'yyyy-mm-dd');  % from
 
%{
% create an empty array for storing stock data
stockd={};
% scan the tickers and fetch the data from Quandl.com
for i=1:length(pc)
    quandlc=['WIKI/',pc{i}];
    fprintf('%4.0f %s\n',i,quandlc);
    % fetch the data of the stock from Quandl
    % using the recommended Quandl command and
    % saving them directly into Matlab's FTS object (fts)
    fts=0;
    [fts,headers]=Quandl.get(quandlc,'type','fints', ...
                  'authcode','YourQuandlCode',...
                  'start_date',date1,'end_date',date2);
    stockd{i}=fts; % entire FTS object in an array's cell
end
save data2
%}
load data2

Assuming the same initial holdings (position sizes) for each asset:

% an initial dollar exposure per position
d=[   55621.00; ...
     101017.00; ...
      23409.00; ...
    1320814.00; ...
     131145.00; ...
     321124.00; ...
    1046867.00];

we aim at calculation of the portfolio VaR,
$$
\mbox{VaR}_P = 1.65 \sigma_P C = 1.65\left(\boldsymbol{d}’ \boldsymbol{M}_2 \boldsymbol{d} \right)^{1/2}
$$ as a function of time or, alternatively speaking, as a function of the varying length of the return time-series:

% examine VaR_P as a function of historical data
% taken into account
e=[];
for td=(10*250):-1:60   % loop over trading days
    rs=[];
    for i=1:length(pc)
        cp=fts2mat(stockd{i}.Adj_Close,0);
        rv=cp(2:end,1)./cp(1:end-1,1)-1;
        rs=[rs rv(end-td:end)];
    end
    rs(isnan(rs))=0.0;
    % covariance matrix
    M2=cov(rs);
    % the portfolio volatility
    vol_P=sqrt(d'*M2*d);
    % diversified portfolio VaR
    VaR_P=1.65*vol_P;
    e=[e; td VaR_P];
end

The meaning of the loop is straightforward: how does $\mbox{VaR}_P$ change if we take only the last 60 trading days into account, and what is its value when we include more and more data, going back in time as far as 10 years?

The best way to illustrate the results is to compare the running portfolio VaR with the stock market itself, here represented by the SPY ETF,

% download close price time-series for SPY (ETF)
% tracking S&P500 Index
[fts,headers]=Quandl.get('GOOG/NYSE_SPY','type','fints', ...
                  'authcode','YourQuandlCode',...
                  'start_date',date1,'end_date',date2);
SPY_cp=fts2mat(fts.Close,0);
 
% plot results
subplot(2,1,1)
xtmp=(2501:-1:1)./250;
plot(xtmp,SPY_cp(end-10*250:end,1));
xlim([min(xtmp) max(xtmp)]);
set(gca,'xdir','reverse');
ylabel('SPY Close Price [$]')
 
subplot(2,1,2)
plot(e(:,1)./250,e(:,2),'r');
set(gca,'xdir','reverse');
ylabel('VaR_P [$]')
xlabel('Time since 12-Jan-2015 [trading years]')

uncovering the big picture as follows:
[Figure: SPY close price (top) and the running 1-day portfolio VaR (bottom) as functions of the look-back period since 12-Jan-2015]
And now it all becomes clear! Firstly, the $\mbox{VaR}_P$ of our 7-asset portfolio is a function of the data. Secondly, $\mbox{VaR}_P$ changes in the way we might expect when contrasted with the market behaviour. At this point, note the evident increase in the $\mbox{VaR}_P$ value once we include not 6 but 7 years of data. Why? The market went down in 2008 and the impact was imprinted in the daily losses of individual stocks. This contribution therefore enriched the left tails of their return distributions. Including this information in our covariance matrix naturally increases the level of $\mbox{VaR}_P$.

So, how long? How much data should we include to get a fair estimation of $\mbox{VaR}_P$? It's not so easy to answer. I would go with a lower bound of at least 1 year. As for the upper limit, it depends. If you are optimistic that in 2015 stocks will keep their upward momentum, look back up to 5 years. If you are a cautious investor, include 7 to 10 years of data.

Footnote

Some of you may still wonder: why "Impact vs Moving Elements"?! Well, I believe the inspiration for a good title may come from the most unexpected direction. It was 2013 when I heard that song for the first time and I loved it! Through the ages, beautiful women have always inspired men, so let me share my inspiration with you. You can become inspired too ;-)


Applied Portfolio VaR Decomposition. (1) Marginal and Component VaR.

Risk. The only ingredient of life that makes us grow and push outside our comfort zones. In finance, taking risk is a risky business. Once your money has been invested, you need to keep your eyes on the ball that is rolling. Controlling risk is both an art and a science: a quantitative approach that allows us to ease the pressure. With some proper tools we can manage our positions in the markets. No matter whether you invest over a long-term horizon or trade many times per day, your portfolio is subject to constantly changing levels of risk and asset volatility.


VaR. Value-at-Risk. The most hated and most adorable quantitative measure of risk. Your running security amongst the unpredictability of the floor. A drilling pain in the ass; a few sleepless nights you've become so familiar with. The omnipresent and rarely confirmed assumption that asset returns are normally distributed. Well, we all know the truth but, somehow, we love to settle for the analytical solutions allowing us to "see a less attractive girl" — more attractive.

In this post we will stay in the mystic circle of those assumptions as a starting point. Don’t blame me. This is what they teach us in the top-class textbooks on financial risk management and test our knowledge within FRM or CFA exams. You may pass the exam but do you really, I mean, really understand the effects of what you have just studied in real portfolio trading?

We kick off with the theory of VaR decomposition in the framework of an active $N$-asset portfolio where our initial capital $C$ has been fully invested in the stock market (as an example). Next, we investigate the portfolio VaR and, by computing its marginal VaR, we derive new positions in our portfolio that minimise the portfolio VaR.

The Analytical VaR for N-Asset Portfolio. The Beginning.

Let's keep our theoretical considerations to the absolute, essential minimum. At the beginning you have $C$ dollars in hand. You decide to invest them all into $N$ assets (stocks, currencies, commodities, etc.). Great. Say, you pick US equities and buy shares of $N=7$ stocks. The number of shares depends on the stock price, and each position size (in dollars) varies depending on your risk-return appetite. Therefore, the position size can be understood as a weighting factor, $w_i$, for $i=1,…,N$, such that:
$$
\sum_{i=1}^N w_i = 1 \ \ \mbox{or} \ \ 100\mbox{%}
$$ or $d_i=w_iC$ in dollars and $\sum_i d_i=C$. Once the orders have been fulfilled, you are a proud holder of $N$-asset portfolio of stocks. Congratulations!

From now on, every day when the markets close, you check the daily rate of return of each individual asset, $r_{i,t} = P_{i,t}/P_{i,{t-1}}-1$, where by $P$ we denote the stock close-price time-series. Your portfolio rate of return, day-to-day, i.e. from $t-1$ to $t$, will be:
$$
R_{p,t} = \sum_{i=1}^N w_ir_{i,t} = \boldsymbol{w}’\boldsymbol{r}
$$ or alternatively:
$$
R_{p,t} = \sum_{i=1}^N d_ir_{i,t} = \boldsymbol{d}’\boldsymbol{r}
$$ which provides us with the updated portfolio value. Note that this differs from the portfolio expected return, defined as:
$$
\mu_P = E\left(\sum_{i=1}^{N} w_ir_i \right) = \sum_{i=1}^{N} w_i\mu_i = \boldsymbol{w}’\boldsymbol{\mu}
$$ where the expected daily return of the $i$-th asset, $\mu_i$, needs to be estimated based on its historical returns (a return-series). This is the most fragile spot in the practical application of the Modern Portfolio Theory. If we estimate,
$$
\mu_i = \sum_{t’={(t-1)}-T}^{{t-1}} r_{t’}/T
looking back at the last $T$ trading days, it is obvious that we will derive different $\mu_i$'s for $T$=252 (1 year) and for $T$=1260 (5 years). That directly affects the computation of the portfolio covariance matrix and thus, as we will see below, the (expected) portfolio volatility. Keep that crucial remark in mind!

The expected portfolio variance is:
$$
\sigma_P^2 = \boldsymbol{w}’ \boldsymbol{M}_2 \boldsymbol{w}
$$
where $\boldsymbol{M}_2$ is the covariance matrix $N\times N$ with the individual covariances of $c_{ij}=\mbox{cov}(\{r_i\},\{r_j\})$ between security $i$ and $j$ computed based on the return-series. Therefore,
$$
\sigma_P^2 = [w_1,…,w_N]
\left[
\begin{array}{cccc}
c_{11} & c_{12} & … & c_{1N} \\
… & … & … & … \\
c_{N1} & c_{N2} & … & c_{NN}
\end{array}
\right]
\left[
\begin{array}{cccc}
w_{1} \\
… \\
w_{N}
\end{array}
\right]
$$ which can be expressed in terms of the dollar exposure as:
$$
\sigma_P^2C^2 = \boldsymbol{d}’ \boldsymbol{M}_2 \boldsymbol{d} \ .
$$ In order to move from the portfolio variance to the portfolio VaR we need to know the distribution of the portfolio returns, and we assume it to be normal (the delta-normal model). Given that, the portfolio VaR at the 95% confidence level ($\alpha=1.65$) is:
$$
\mbox{VaR}_P = 1.65 \sigma_P C = 1.65\left(\boldsymbol{d}’ \boldsymbol{M}_2 \boldsymbol{d} \right)^{1/2}
$$ which can be understood as the diversified VaR since we use our portfolio to reduce the overall risk of the investment. On the other hand, the 95% individual VaR describing the individual risk of each asset in the portfolio is:
$$
\mbox{VaR}_i = 1.65 |\sigma_i d_i|
$$ where $\sigma_i$ represents the asset's volatility over the past period of $T$ days.

It is extremely difficult to obtain a portfolio with perfect correlation (unity) among all its components. In such a case we would talk about the undiversified VaR, which can be computed as the sum of the $\mbox{VaR}_i$ provided that, as an additional requirement, no short positions are held in our portfolio. Risk managers often report the risk of a live portfolio by quoting both the diversified and the undiversified VaR.

Now, imagine that we add one unit of the $i$-th asset to our portfolio and we aim at understanding the impact of this action on the portfolio VaR. In this case we talk about a marginal contribution to risk, which can be derived through the differentiation of the portfolio volatility with respect to $w_i$:
$$
\frac{\mbox{d}\sigma_P}{\mbox{d}w_i} = \frac{\mbox{d}}{\mbox{d}w_i} \left[ \sum_{i=1}^{N} w_i^2\sigma_i^2 + \sum_{i=1}^N \sum_{j=1, j\ne i}^{N} w_iw_jc_{ij} \right]^{1/2}
$$
$$
= \frac{1}{2} \left[ 2w_i\sigma_i^2 + 2 \sum_{j=1, j\ne i}^{N} w_jc_{ij} \right]
\left[ \sum_{i=1}^{N} w_i^2\sigma_i^2 + \sum_{i=1}^N \sum_{j=1, j\ne i}^{N} w_iw_jc_{ij} \right]^{-1/2}
$$
$$
= \frac{w_i\sigma_i^2 + \sum_{j=1, j\ne i}^{N} w_j c_{ij} } {\sigma_P}
= \frac{c_{iP}}{\sigma_P}
$$ where, again, $c_{ij}$ denotes the covariance between the $i$-th and $j$-th assets and $c_{iP}$ that between the $i$-th component and the portfolio. That allows us to define the marginal VaR for the $i$-th asset in the portfolio in the following way:
$$
\Delta\mbox{VaR}_i = \frac{\mbox{dVaR}_P}{\mbox{d}d_i} = \frac{\mbox{dVaR}_P}{\mbox{d}w_iC} = \frac{\mbox{d}(\alpha\sigma_P C)}{\mbox{d}w_iC} = \alpha\frac{C}{C}
\frac{\mbox{d}\sigma_P}{\mbox{d}w_i} = \alpha \frac{c_{iP}}{\sigma_P}
$$ that has a close relation to the systematic risk of $i$-th asset as confronted with the portfolio itself, described by:
$$
\beta_i = \frac{c_{iP}}{\sigma_P^2}
$$ therefore
$$
\Delta\mbox{VaR}_i = \alpha \sigma_P \beta_i = \beta_i \frac{ \mbox{VaR}_P } {C}
$$ The best and most practical interpretation of the marginal VaR calculated for all positions in the portfolio would be: the higher the $\Delta\mbox{VaR}_i$, the more the corresponding exposure of the $i$-th component should be reduced to lower the overall portfolio VaR. Simple as that. Hold on, we will see that at work in the calculations a little bit later.

Since the portfolio volatility is a highly nonlinear function of its components, a simplistic computation of the individual VaRs and adding them all up injects an error into the portfolio risk estimation. However, with the help of the marginal VaR we gain a tool to capture the fireflies amongst the darkness of the night.

We define the component VaR as:
$$
\mbox{cVaR}_i = \Delta\mbox{VaR}_i d_i = \Delta\mbox{VaR}_i w_i C = \beta_i \frac{ \mbox{VaR}_P w_i C } {C} = \beta_i w_i \mbox{VaR}_P
$$ utilising the observation that:
$$
\sigma_P^2 = \sum_{i=1}^{N} w_i c_{iP} = \sum_{i=1}^{N} w_i \beta_i \sigma_P^2 = \sigma_P^2 \sum_{i=1}^{N} w_i \beta_i \ \ \Rightarrow \ \ \sum_{i=1}^{N} w_i \beta_i = 1
$$ which very nicely leads us to:
$$
\sum_{i=1}^{N} \mbox{cVaR}_i = \mbox{VaR}_P \sum_{i=1}^{N} w_i \beta_i = \mbox{VaR}_P \ .
$$ Having that, we get the percent contribution of the $i$-th asset to the portfolio VaR as:
$$
\frac{ \mbox{cVaR}_i } { \mbox{VaR}_P } = w_i \beta_i
$$ It is possible to calculate the required changes in the position sizes based on our VaR analysis. We obtain them by re-evaluating the portfolio marginal VaRs so that they now meet the following condition:
$$
\Delta\mbox{VaR}_i = \beta_i \frac{ \mbox{VaR}_P } { C } = \ \mbox{constant .}
$$ which is the subject of finding a solution to a "Risk-Minimising Position" problem. All right. The end of the lecture. Time for coding!

 

Case Study

Do You like sex? I do! So, let's do it in Matlab this time! First, we randomly pick seven stocks among US equities (for the simplicity of this example), and their tickers are given in the portfolio.lst text file as follows:

AAPL
DIS
IBM
JNJ
KO
NKE
TXN

and next we download 3 years of their historical data using Quandl.com as the data provider:

% Applied Portfolio VaR Decomposition. (1) Marginal and Component VaR.
%
% (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
 
clear all; close all; clc;
format short
 
% Read the list of the portfolio components
fileID = fopen('portfolio.lst');
tmp = textscan(fileID, '%s');
fclose(fileID);
pc=tmp{1};  % a list as a cell array
 
% fetch stock data for last 3 years since:
t0=735976; % 12-Jan-2015
date2=datestr(t0,'yyyy-mm-dd');        % end date (to)
date1=datestr(t0-3*365,'yyyy-mm-dd');  % start date (from)
 
% create an empty array for storing stock data
stockd={};
for i=1:length(pc) % scan the tickers and fetch the data from Quandl.com
    quandlc=['WIKI/',pc{i}];
    fprintf('%4.0f %s\n',i,quandlc);
    % fetch the data of the stock from Quandl
    % using recommended Quandl's command and
    % saving them directly into Matlab's FTS object (fts)
    fts=0;
    [fts,headers]=Quandl.get(quandlc,'type','fints', ...
                  'authcode','YourQuandlCode',...
                  'start_date',date1,'end_date',date2);
    stockd{i}=fts; % entire FTS object in an array's cell
end
 
% limit data to 3 years of trading days, select the adjusted
% close price time-series and calculate return-series
rs=[];
for i=1:length(pc)
    cp=fts2mat(stockd{i}.Adj_Close,0);
    rv=cp(2:end,1)./cp(1:end-1,1)-1;
    rs=[rs rv(end-(3*5*4*12):end)];
end
rs(isnan(rs))=0.0;

The covariance matrix, $\boldsymbol{M}_2$, computed based on 7 return-series is:

% covariance matrix
M2=cov(rs);

Let’s now assume some random dollar exposure, i.e. position size for each asset,

% exposure per position
d=[   55621.00; ...
     101017.00; ...
      23409.00; ...
    1320814.00; ...
     131145.00; ...
     321124.00; ...
    1046867.00]
 
% invested capital of C
C=sum(d)

returning our total exposure of $C=\$$2,999,997 in the stock market, which is based on the number of shares purchased and/or our risk-return appetite (in Part II of this series, we will witness how that works in simulated trading).

We turn the long theory of VaR decomposition into numbers within just a few lines of code, namely:

% the portfolio volatility
vol_p=sqrt(d'*M2*d);
 
% diversified portfolio VaR
VaR_P=1.65*vol_p;
 
% volatility of assets (based on daily returns)
v=sqrt(diag(cov(rs)));
 
% individual and undiversified portfolio VaR
VaRi=1.65*abs(v.*d);
uVaR_P=sum(VaRi);
 
% the portfolio betas
beta=C*(M2*d)/vol_p^2;
 
% the portfolio marginal VaR
DVaR=1.65*(M2*d)/vol_p;
 
% the component VaR in dollars and in percent
cVaRd=DVaR.*d;           % dollar component VaR (used in the risk report)
cVaR=100*cVaRd/VaR_P;
 
% initial positions of assets in portfolio [percent]
orgpos=100*d/C

The last column vector provides us with information on the original (currently held) percentage exposure of our invested capital as a function of the asset number in the portfolio:

orgpos =
          1.85
          3.37
          0.78
         44.03
          4.37
         10.70
         34.90

Before we generate our very first risk report, let’s develop a short code that would allow us to find a solution to the “Risk-Minimising Position” problem. Again, we want to find new position sizes (different from orgpos) delivering the lowest possible level of portfolio risk. We construct a dedicated function that does the job for us:

% Finding a solution for the "Risk-Minimising Position" problem
%   based on portfolio VaR
% (c) 2015 QuantAtRisk, by Pawel Lachowicz
 
function [best_x,best_dVaR,best_VaRp,best_volp]=rmp(M2,d,VaR_p)
    c=sum(d);
    N=length(d);
    best_VaRp=VaR_p;
    for i=1:100                     % an arbitrary number (recommended >50)
        k=0;
        while(k~=1)
            d=random('uniform',0,1,N,1);
            d=d/sum(d)*c;
            vol_p=sqrt(d'*M2*d);
            % diversified VaR (portfolio)
            VaRp=1.65*vol_p;
            dVaR=1.65*(M2*d)/vol_p;
            test=fix(dVaR*100);     % set precision here
            m=fix(mean(test));
            t=(test==m);
            k=sum(t)/N;
            if(VaRp<best_VaRp)
                best_x=d;
                best_dVaR=dVaR;
                best_VaRp=VaRp;
                best_volp=vol_p;
            end
        end
    end
end

We feed it with three input parameters, namely, the covariance matrix $\boldsymbol{M}_2$, the current dollar exposure per asset $\boldsymbol{d}$, and the current portfolio VaR. Inside, we randomly generate a new dollar-exposure column vector, then re-calculate the portfolio VaR and marginal VaRs until the
$$
\Delta\mbox{VaR}_i = \beta_i \frac{ \mbox{VaR}_P } { C } = \ \mbox{constant}
$$ condition is met. We need to be sure that the new portfolio VaR is lower than the original one and, ideally, the lowest among a large number of random possibilities. That is why we repeat that step 100 times. At this point, as a technical remark, I would suggest using more than 50 iterations and experimenting with the precision required for the marginal VaRs (or $\beta$’s) to be (nearly) the same. The greater both of these golden numbers are, the longer you wait for the final solution.

It is also the apex of the day in a bank’s daily portfolio evaluation process that assesses the bank’s market risk and global exposure. A risk manager knows how long it takes to find new risk-minimised positions and how fast C/C++/GPU solutions need to be implemented to cut the time of those computations. Here, with our Matlab function rmp, we can feel the pain of this game.

Eventually, we finalise our computations:

[nd,nDVaR,nVaR_P,nvol_P]=rmp(M2,d,VaR_P);
redVaR=100*(nVaR_P-VaR_P)/VaR_P;
newpos=100*nd/C;

followed by a nicely formatted report:

fprintf('\n============================================\n');
fprintf('  Portfolio Risk Report as of %s\n',datestr(t0));
fprintf('============================================\n\n');
fprintf('1-day VaR estimation at %1.0f%% confidence level\n\n',95.0)
fprintf('Number of assets:\t\t %1.0f \n',length(d))
fprintf('Current exposure:\t\t $%1s\n',bankformat(C));
fprintf('Portfolio VaR (undiversified):\t $%1s\n',bankformat(uVaR_P));
fprintf('Portfolio VaR (diversified):\t $%1s\n',bankformat(VaR_P));
fprintf('Component VaR [individual VaR]\n');
for i=1:length(d)
    fprintf('   %s\t%6.2f%%   $%-10s\t[$%-10s]\n',pc{i},cVaR(i), ...
            bankformat(cVaRd(i)),bankformat(VaRi(i)));
end
fprintf('\nRisk-Minimising Position scenario\n');
fprintf('--------------------------------------------------------------');
fprintf('\n\t\t   Original   Marginal     New        Marginal\n');
fprintf('\t\t   position   VaR          position   VaR\n');
for i=1:length(d)
    fprintf('\t  %-4s\t    %6.2f%%   %1.5f\t    %6.2f%%   %1.2f\n', ...
            pc{i},orgpos(i), ...
            DVaR(i),newpos(i),nDVaR(i))
end
fprintf('div VaR\t\t $%1s\t\t $%1s\n',bankformat(VaR_P), ...
        bankformat(nVaR_P));
fprintf('\t\t\t\t\t%10.2f%%\n',redVaR);
fprintf('ann Vol\t\t%10.2f%%\t\t%10.2f%%\n',  ...
        sqrt(252)*vol_p/1e6*1e2,sqrt(252)*nvol_P/1e6*1e2);
fprintf('--------------------------------------------------------------');
fprintf('\n\n');

where the bankformat function can be found here.

As a portfolio risk manager you are interested in numbers, numbers, and numbers that, in fact, reveal a lot. It is of paramount importance to analyse the portfolio risk report with attention before taking any further steps or decisions:

 
============================================
  Portfolio Risk Report as of 12-Jan-2015
============================================
 
1-day VaR estimation at 95% confidence level
 
Number of assets:		 7 
Current exposure:		 $2,999,997.00
Portfolio VaR (undiversified):	 $54,271.26
Portfolio VaR (diversified):	 $33,027.94
Component VaR [individual VaR]
   AAPL	  0.61%   $202.99    	[$1,564.29  ]
   DIS	  1.71%   $564.13    	[$1,885.84  ]
   IBM	  0.26%   $84.43     	[$429.01    ]
   JNJ	 30.99%   $10,235.65 	[$17,647.63 ]
   KO	  1.19%   $392.02    	[$2,044.05  ]
   NKE	  9.10%   $3,006.55  	[$7,402.58  ]
   TXN	 56.14%   $18,542.15 	[$23,297.83 ]
 
Risk-Minimising Position scenario
--------------------------------------------------------------
		   Original   Marginal     New        Marginal
		   position   VaR          position   VaR
	  AAPL	      1.85%   0.00365	      6.02%   0.01
	  DIS 	      3.37%   0.00558	      1.69%   0.01
	  IBM 	      0.78%   0.00361	     10.79%   0.01
	  JNJ 	     44.03%   0.00775	     37.47%   0.01
	  KO  	      4.37%   0.00299	     27.36%   0.01
	  NKE 	     10.70%   0.00936	      8.13%   0.01
	  TXN 	     34.90%   0.01771	      8.53%   0.01
div VaR		 $33,027.94		 $26,198.39
					    -20.68%
ann Vol		     31.78%		     25.21%
--------------------------------------------------------------

From the risk report we can spot the difference of $\$$21,243.32 between the undiversified and diversified portfolio VaR, and the largest contribution to the overall portfolio VaR coming from the Texas Instruments Incorporated stock (TXN), even though it is the Johnson & Johnson stock (JNJ) that occupies the largest dollar position in our portfolio.

The original marginal VaR is the largest for TXN, NKE, and JNJ. Therefore, in order to minimise the portfolio VaR, we should cut and/or reduce their positions. On the other hand, the exposure should be increased in the case of KO, IBM, AAPL, and DIS, which display the lowest marginal VaR. Both suggestions find their solution in the form of the derived new holdings, which would reduce the portfolio VaR by 20.68% while concurrently decreasing the annualised portfolio volatility by 6.57 percentage points.

Indeed. Risk is like sex. The more the merrier! Unless you are a risk manager and you calculate your “positions”.

Quants, NumPy, and LOTTO


Since I’m working on Volume I of the Python for Quants ebook and going through NumPy’s abilities, they leave me speechless despite the rain. Somehow. Every time. There is so much flexibility in expressing your thoughts and ideas, so much freedom in coding logical and mathematical concepts as a quant. With its grace of syntax, with its purity of simplicity.

When I was 16 I used to play LOTTO: the game where for your 3 favourite combinations of 6 numbers out of 49 you pay $\$1$. It was obvious that hitting the “six” came with a probability of:
$$
\frac{1}{C^6_{49}} = \left( \frac{49!}{6!(49-6)!} \right)^{-1} = \frac{1}{13983816} \ \ .
$$ I had my favourite 6 numbers. The magic was that I believed that I could win the entire prize. Somehow, the game was worth trying. You never know when your lucky day comes!
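As a quick sanity check of that number, a couple of lines of Python (assuming Python 3.8+ for math.comb) confirm it:

from math import comb  # available since Python 3.8

print(comb(49, 6))         # 13983816 possible "six" combinations
print(1.0 / comb(49, 6))   # ~7.15e-08, the probability of hitting the "six"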

Monte-Carlo Simulation for LOTTO

It is beyond any doubt that our mind can create certain circumstances in the real world (that we do not immediately understand) that allow us to bend reality to our knees. How? Well, if you attend any mind-control seminar you learn about the power of your mind and how to trick mathematical, logical reasoning. That applies to the LOTTO game as well. There are nearly 14M combinations and we aim to win that game. A low probability. A high reward if you are lucky.

I used to play with my favourite combination: 7, 9, 11, 21, 35, 45 for nearly 3 months. I was visualising a situation where one day exactly those numbers would appear in the LOTTO drawing machine next Wednesday or Saturday. It took me 3 months to hit 5 out of 6 numbers! The financial reward was nice, trust me! But because I was of small faith, I did not hit the bull between its eyes. Now it’s your turn!

The Python language is superb. It offers you an opportunity to simulate your luck in just a few minutes. Say you are like me: you randomly pick 6 out of 49 numbers and you play, two times a week, and… wait for your early retirement. With the NumPy library in Python, the estimation of the waiting time is straightforward. Analyse the following code:

# Quants, NumPy, and LOTTO
# (c) 2014 QuantAtRisk.com, by Pawel Lachowicz
 
import numpy as np
 
def lotto_numbers():
    ok = False
    while not ok:
        x = np.random.choice(49,6,replace=False)
        x.sort()
        tmp = np.where(x == 0)
        (m, )= tmp[0].shape
        if(m == 0):
            ok = True
    return x
 
fav = lotto_numbers()  # choose your favourite 6 numbers
print(fav)
 
match = False; i = 0
while not match:
    tmp = lotto_numbers()
    cmp = (tmp == fav)
    i += 1
    if cmp.all():
        match = True
 
print("Iterations: %g\nProbability = %.2e" % (i,1./i) )
print(tmp)

It’s a nice combination of the most useful NumPy functions in action: .random.choice(), .where(), .shape, .sort(), and .all().

In line #9 we create a 6-element vector of random integers drawn from the set $\{0, 1, …, 48\}$ and in line #10 we force them all to be sorted. Because $0$ is not part of the game, we need to check for its presence and draw again if it is detected (note that, in this simplified setup, the number 49 is never drawn). Line #11 returns a temporary array storing the indexes where $x$ has a zero element. Since we use the .random.choice function with the parameter replace=False, we can be sure that there is only one zero (if drawn at all). Therefore, the result of line #12 should be $m$ equal to 0 or 1.
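If the zero-check above feels abstract, a tiny toy example (with made-up draws) shows what np.where and .shape return in that step:

import numpy as np

x = np.array([0, 7, 9, 11, 21, 35])     # a draw that happens to contain a zero
tmp = np.where(x == 0)                   # (array([0]),) -- index of the zero
(m,) = tmp[0].shape
print(m)                                 # 1 -> the draw is rejected and repeated

y = np.array([7, 9, 11, 21, 35, 45])     # a valid draw
(m,) = np.where(y == 0)[0].shape
print(m)                                 # 0 -> ok becomes True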

In the main program (lines #17-29) we repeat the loop until we find a match between a (random) selection of our favourite 6 numbers and a new LOTTO draw. In line #23 the array cmp holds 6 boolean elements as the result of the comparison. Therefore, keeping the arrays sorted is an essential part of the code. Finally, in line #25 the NumPy function .all() returns True if all elements are True, and False otherwise. Nice and easy. An exemplary outcome you can get is:

[ 1  2  4 11 13 18]
Iterations: 37763
Probability = 2.65e-05
[ 1  2  4 11 13 18]

To give you a feeling of how long you need to wait to hit your favourite “six”, let’s calculate:

>>> 37763/(2*52)
363

which returns the number of years. Trust me, if you really, really, really want to win playing only with your lucky numbers, you can. It is a matter of faith, luck, and… time.

As homework, try to modify the code to run it $10^4$ times and plot the histogram of the resulting probabilities. Is it Normal or Uniform?

Now I guess I should say Good luck!

Rebinning Tick-Data for FX Algo Traders

If you work or intend to work with FX data in order to build and backtest your own FX models, the Historical Tick-Data of Pepperstone.com is probably the best place to kick off your algorithmic experience. As for now, they offer tick-data sets of the 15 most frequently traded currency pairs since May 2009. Some of the unzipped files (one month of data) reach over 400 MB in size, i.e. they store 8.5+ million lines with a tick resolution for both bid and ask “prices”. A good thing is you can download them all free of charge and their quality is regarded as very high. A bad thing is that there is a 3-month delay in data accessibility.

Dealing with the rebinning of tick-data, though, is a different story and the subject of this post. We will see how efficiently you can turn Pepperstone’s Tick-Data set(s) into a 5-min time-series as an example. We will make use of scripting in bash (Linux/OS X) supplemented with data processing in Python.

Data Structure

You can download Pepperstone’s historical tick-data from here, month by month, pair by pair. Their inner structure follows the same pattern, namely:

$ head AUDUSD-2014-09.csv 
AUD/USD,20140901 00:00:01.323,0.93289,0.93297
AUD/USD,20140901 00:00:02.138,0.9329,0.93297
AUD/USD,20140901 00:00:02.156,0.9329,0.93298
AUD/USD,20140901 00:00:02.264,0.9329,0.93297
AUD/USD,20140901 00:00:02.265,0.9329,0.93293
AUD/USD,20140901 00:00:02.265,0.93289,0.93293
AUD/USD,20140901 00:00:02.268,0.93289,0.93295
AUD/USD,20140901 00:00:02.277,0.93289,0.93296
AUD/USD,20140901 00:00:02.278,0.9329,0.93296
AUD/USD,20140901 00:00:02.297,0.93288,0.93296

The columns, from left to right, represent respectively: a pair name, the date and tick-time, the bid price, and the ask price.

Pre-Processing

Here, for each .csv file, we aim to split the date into year, month, and day separately, and to remove commas and colons in order to get raw data ready to be read in as a matrix (array) using any other programming language (e.g. Matlab or Python). The matrix is a mathematically intuitive data structure; therefore, making a direct reference to any specific column of it keeps any backtesting engine running at full thrust.

Let’s play with the AUDUSD-2014-09.csv data file. Working in the same directory where the file is located, we begin by writing a bash script (pp.scr) that contains:

# pp.scr
# Rebinning Pepperstone.com Tick-Data for FX Algo Traders 
# (c) 2014 QuantAtRisk, by Pawel Lachowicz
 
clear
echo "..making a sorted list of .csv files"
for i in $1-*.csv; do echo ${i##$1-} $i ${i##.csv};
done | sort -n | awk '{print $2}' > $1.lst
 
python pp.py
head AUDUSD.pp

that you run in Terminal:

$ chmod +x pp.scr
$ ./pp.scr AUDUSD

where the first command makes sure the script becomes executable (you need to perform this task only once). Lines #7-8 of our script, in fact, look for all .csv data files in the local directory starting with the AUDUSD- prefix and create a list of them in the AUDUSD.lst file. Since we work with the AUDUSD-2014-09.csv file only, the AUDUSD.lst file will contain:

$ cat AUDUSD.lst 
AUDUSD-2014-09.csv

as expected. Next, we utilise the power and flexibility of Python in the following way:

# pp.py
import csv
 
fnlst="AUDUSD.lst"
fnout="AUDUSD.pp"
 
for lstline in open(fnlst,'r').readlines():
    fncur=lstline[:-1]
    #print(fncur)
 
    with open(fnout,'w') as f:
        writer=csv.writer(f,delimiter=" ")
 
        i=1 # counts a number of lines with tick-data
        for line in open(fncur,'r').readlines():
            if(i<=5200): # replace with (i>0) to process an entire file
                #print(line)
                year=line[8:12]
                month=line[12:14]
                day=line[14:16]
                hh=line[17:19]
                mm=line[20:22]
                ss=line[23:29]
                bidask=line[30:]
                writer.writerow([year,month,day,hh,mm,ss,bidask])
                i+=1

It is a pretty efficient way to open a really big file and process its information line by line. Just for display purposes, in the code we told the computer to process only the first 5,200 lines. The output of lines #10-11 of pp.scr is the following:

2014 09 01 00 00 01.323 "0.93289,0.93297
"
2014 09 01 00 00 02.138 "0.9329,0.93297
"
2014 09 01 00 00 02.156 "0.9329,0.93298
"
2014 09 01 00 00 02.264 "0.9329,0.93297
"
2014 09 01 00 00 02.265 "0.9329,0.93293
"

since we allowed Python to save bid and ask information as one string (due to a variable number of decimal digits). In order to clean this mess we continue:

# pp.scr (continued)
echo "..removing token: comma"
sed 's/,/ /g' AUDUSD.pp > $1.tmp
rm AUDUSD.pp
 
echo "..removing token: double quotes"
sed 's/"/ /g' $1.tmp > $1.tmp2
rm $1.tmp
 
echo "..removing empty lines"
sed -i '/^[[:space:]]*$/d' $1.tmp2
mv $1.tmp2 AUDUSD.pp
 
echo "head..."
head AUDUSD.pp
echo "tail..."
tail AUDUSD.pp

which brings us to the pre-processed data:

..removing token: comma
..removing token: double quotes
..removing empty lines
head...
2014 09 01 00 00 01.323  0.93289 0.93297
2014 09 01 00 00 02.138  0.9329 0.93297
2014 09 01 00 00 02.156  0.9329 0.93298
2014 09 01 00 00 02.264  0.9329 0.93297
2014 09 01 00 00 02.265  0.9329 0.93293
2014 09 01 00 00 02.265  0.93289 0.93293
2014 09 01 00 00 02.268  0.93289 0.93295
2014 09 01 00 00 02.277  0.93289 0.93296
2014 09 01 00 00 02.278  0.9329 0.93296
2014 09 01 00 00 02.297  0.93288 0.93296
tail...
2014 09 02 00 54 39.324  0.93317 0.93321
2014 09 02 00 54 39.533  0.93319 0.93321
2014 09 02 00 54 39.543  0.93318 0.93321
2014 09 02 00 54 39.559  0.93321 0.93321
2014 09 02 00 54 39.784  0.9332 0.93321
2014 09 02 00 54 39.798  0.93319 0.93321
2014 09 02 00 54 39.885  0.93319 0.93325
2014 09 02 00 54 39.886  0.93319 0.93321
2014 09 02 00 54 40.802  0.9332 0.93321
2014 09 02 00 54 48.829  0.93319 0.93321

Personally, I love that part, as you can learn how to do simple but necessary text-file operations by typing single lines of Unix/Linux commands. Good luck to those who try to repeat the same in Microsoft Windows in under 30 seconds.

Rebinning: 5-min Data

Rebinning has many schools. For some people it’s an art. We just want to have the job done. I opt for simplicity and for understanding of the data we deal with. Imagine we have two adjacent 5-min bins with a tick history of trading:

[Figure: two adjacent 5-min bins of tick data; the red marker denotes the 5-min price estimate, the blue markers the last and first ticks around the bin boundary]
We want to derive the closest possible (or most fair) price estimation every 5 min, denoted in the above picture by a red marker. The old-school approach is to take the average over a number (larger than 5) of tick data points from the left and from the right. That can create an under- or overestimation of the mid-price.

If we trade live, every 5 min we receive information on the last tick point before the minute hits 5 and we wait for the next tick point after 5 (blue markers). Taking the average of their prices (the mid-price) makes the most sense. The precision we look at here is sometimes $10^{-5}$. It is not of much significance if our position is small, but if it is not, the mid-price may start playing a crucial role.

The cons of the old-school approach: a possibly high volatility among all tick-data within the last 5 minutes that we neglect.
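To make the scheme concrete, suppose (purely hypothetically) that the last tick before 00:05 quotes bid/ask of 0.93301/0.93305 and the first tick after 00:05 quotes 0.93299/0.93307. The 00:05 price estimate then follows as:
$$
b_m = \frac{0.93301+0.93299}{2} = 0.93300 \ , \ \ \ a_m = \frac{0.93305+0.93307}{2} = 0.93306 \ ,
$$
$$
P_{00{:}05} = \frac{b_m+a_m}{2} = 0.93303 \ ,
$$ which is exactly what the code below computes for every bin boundary.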

The following Python code (pp2.py) performs 5-min rebinning for our pre-processed AUDUSD-2014-09 file:

# pp2.py
import csv
import numpy as np
 
def convert(data):
     tempDATA = []
     for i in data:
         tempDATA.append([float(j) for j in i.split()])
     return np.array(tempDATA).T
 
fname="AUDUSD.pp"
 
with open(fname) as f:
    data = f.read().splitlines()
 
#print(data)
 
i=1
for d in data:
    list=[s for s in d.split(' ')]
    #print(list)
    # remove empty elements in the list
    dd=[x for x in list if x]
    #print(dd)
    tmp=convert(dd)
    #print(tmp)
    if(i==1):
        a=tmp
        i+=1
    else:
        a = np.vstack([a, tmp])
        i+=1
 
N=i-1
#print("N = %d" % N)
 
# print the first line
tmp=np.array([a[1][0],a[1][1],a[1][2],a[1][3],a[1][4],0.0,(a[1][6]+a[1][7])/2])
print("%.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f" %
             (tmp[0],tmp[1],tmp[2],tmp[3],tmp[4],tmp[5],tmp[6]))
m=tmp
 
# check the boundary conditions (5 min bins)
for i in xrange(2,N-1):
    if( (a[i-1][4]%5!=0.0) and (a[i][4]%5==0.0)):
 
        # BLUE MARKER No. 1
        # (print for i-1)
        #print(" %.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f %10.6f" %
        #      (a[i-1][0],a[i-1][1],a[i-1][2],a[i-1][3],a[i-1][4],a[i-1][5],a[i-1][6],a[i-1][7]))
        b1=a[i-1][6]
        b2=a[i][6]
        a1=a[i-1][7]
        a2=a[i][7]
        # mid-price, and new date for 5 min bin
        bm=(b1+b2)/2
        am=(a1+a2)/2
        Ym=a[i][0]
        Mm=a[i][1]
        Dm=a[i][2]
        Hm=a[i][3]
        MMm=a[i][4]
        Sm=0.0        # set seconds to zero
 
        # RED MARKER
        print("%.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f" %
              (Ym,Mm,Dm,Hm,MMm,Sm,(bm+am)/2))
        tmp=np.array([Ym,Mm,Dm,Hm,MMm,Sm,(bm+am)/2])
        m=np.vstack([m, tmp])
 
        # BLUE MARKER No. 2
        # (print for i)
        #print(" %.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f %10.6f" %
        #      (a[i][0],a[i][1],a[i][2],a[i][3],a[i][4],a[i][5],a[i][6],a[i][7]))

which you run from the pp.scr file as:

# pp.scr (continued)
 
python pp2.py > AUDUSD.dat

in order to get 5-min rebinned FX time-series as follows:

$ head AUDUSD.dat
2014  9  1  0  0  0.000   0.932935
2014  9  1  0  5  0.000   0.933023
2014  9  1  0 10  0.000   0.932917
2014  9  1  0 15  0.000   0.932928
2014  9  1  0 20  0.000   0.932937
2014  9  1  0 25  0.000   0.933037
2014  9  1  0 30  0.000   0.933075
2014  9  1  0 35  0.000   0.933070
2014  9  1  0 40  0.000   0.933092
2014  9  1  0 45  0.000   0.933063

That concludes our efforts. Happy rebinning!

GPU-Accelerated Finance in Python with NumbaPro Library. Really?

When I lived in Singapore in 2010, my life path crossed with some great guys working in the field of High-Performance Computing: Łukasz Orłowski, Marek T. Michalewicz, and Iain Bell from Quadrant Capital. They pointed my attention towards GPU computations utilizing the Nvidia CUDA architecture. There was one problem. Everything was wrapped up in C syntax with a promise to do it more efficiently in C++. Soon.


So I waited and studied C/C++ at least to the level allowing me to understand some CUDA codes. Years were passing by until the day I discovered an article by Mark Harris, NumbaPro: High-Performance Python with CUDA Acceleration, delivering Python-friendly CUDA solutions to all my nightmares involving C/C++ coding. At this point you just need to understand one thing: not every quant or algo trader is a fan of C/C++, just as some people prefer Volvo to Audi, including myself ;)

Let’s have a sincere look at what the game is about. It is more than tempting to put your hands on a piece of code that allows you to speed up some quantitative computations in Python making use of a new library.

Accelerate your Python

I absolutely love Continuum Analytics for the mission they stand for: making the Python language easily accessible and usable by everyone worldwide! It is a great language with a great syntax, easy to pick up, easy to be utilised in the learning process of the fundamentals of programming. Thanks to them, you can now download and install the Anaconda Python distribution for Windows, Mac OS X, or Linux in just a few minutes (see my earlier post on Setting up Python for Quantitative Analysis in OS X 10.10 Yosemite as an example).

When you visit their webpage you can spot Anaconda’s Add-Ons, three additional software packages to their Python distribution. Among them, they offer the Accelerate module containing the NumbaPro library. Once you read the description, and I quote,

Accelerate is an add-on to Continuum’s free enterprise Python distribution, Anaconda. It opens up the full capabilities of your GPU or multi-core processor to Python. Accelerate includes two packages that can be added to your Python installation: NumbaPro and MKL Optimizations. MKL Optimizations makes linear algebra, random number generation, Fourier transforms, and many other operations run faster and in parallel. NumbaPro builds fast GPU and multi-core machine code from easy-to-read Python and NumPy code with a Python-to-GPU compiler.

NumbaPro Features
– NumbaPro compiler targets multi-core CPU and GPUs directly from
    simple Python syntax
– Easily move vectorized NumPy functions to the GPU
– Multiple CUDA device support
– Bindings for CUDA libraries, including cuBlas, cuRand, cuSparse, and cuFFT
– Support for array slicing and fast array math
– Use multiple threads without worrying about the GIL
– Supported on NVIDIA CUDA-enabled GPUs with compute capability 2.0
    or above on Intel/AMD (x86) processors.

your blood pressure increases and the level of endorphins skyrockets. Why? Simply because of the promise to do some tasks faster utilising the GPU in a parallel mode! If you are new to GPU or CUDA I recommend you read some well-written posts on Mike’s website, for instance, Installing Nvidia CUDA on Mac OSX for GPU-based Parallel Computing or Monte Carlo Simulations in CUDA – Barrier Option Pricing. You will grasp the essence of what it is all about. In general, much ado about CUDA is still about making use of your GPU and proving that extra oomph in speed-up. If you have any quantitative problem in mind that can be executed in a parallel mode, NumbaPro is a tool you need to look at, but not every engine sounds the same. Hold on till the end of this post. It will be worth it.

Selling the Speed

When you approach a new concept or a new product and someone tries to sell it to you, they need to impress you to win your attention and boost your curiosity. Imagine for a moment that you have no idea about GPU or CUDA and you want to add two vectors. In Python you can do it as follows:

import numpy as np
from timeit import default_timer as timer
 
def VectorAdd(a,b,c):
    for i in xrange(a.size):
        c[i]=a[i]+b[i]
 
def main():
 
    N=32000000
 
    A=np.ones(N, dtype=np.float32)
    B=np.ones(N, dtype=np.float32)
    C=np.zeros(N, dtype=np.float32)
 
    start=timer()
    VectorAdd(A,B,C)
    totaltime=timer()-start
 
    print("\nCPU time: %g sec" % totaltime)
 
if __name__ == '__main__':
    main()

We all know that once you run the code, Python does not compile it; it goes line by line and interprets what it reads. So, we aim at adding two vectors, $A$ and $B$, containing 32 million elements each. I get the $C=A+B$ vector on my MacBook Pro (2.6 GHz Intel Core i7, 16 GB 1600 MHz DDR3 RAM, NVIDIA GeForce GT 650M 1GB) after:

CPU time: 9.89753 sec

Can we do it better? With NumbaPro the required changes to the code itself are minor. All we need to add is a function decorator that tells how and where the function should be executed. In fact, what NumbaPro does is “compile” the VectorAdd function on the go and deploy the computations to the GPU unit:

import numpy as np
from timeit import default_timer as timer
from numbapro import vectorize
 
@vectorize(["float32(float32,float32)"], target="gpu")
def VectorAdd(a,b):
    return a+b
 
def main():
 
    N=32000000
 
    A=np.ones(N, dtype=np.float32)
    B=np.ones(N, dtype=np.float32)
    C=np.zeros(N, dtype=np.float32)
 
    start=timer()
    C=VectorAdd(A,B)
    totaltime=timer()-start
 
    print("\nGPU time: %g sec" % totaltime)
 
if __name__ == '__main__':
    main()

We get

GPU time: 0.286101 sec

i.e. a 34.6x speed-up. Not bad, right?! Not bad if you’re a sales person indeed! But, hey, what’s that?:

import numpy as np
from timeit import default_timer as timer
 
def main():
 
    N=32000000
 
    A=np.ones(N, dtype=np.float32)
    B=np.ones(N, dtype=np.float32)
    C=np.zeros(N, dtype=np.float32)
 
    start=timer()
    C=A+B
    totaltime=timer()-start
 
    print("\nCPU time: %g sec" % totaltime)
 
if __name__ == '__main__':
    main()

Run it to discover that:

CPU time: 0.0592878 sec

i.e. 4.82x faster than using GPU. Oh, boy! CUDA:NumPy (0:1).

Perfect Pitch

When Nvidia introduced CUDA, among the exemplary C codes utilising CUDA programming we could find the immortal Black-Scholes model for option pricing. In this Nobel-prize winning solution, we derive a call option price for a non-dividend-paying underlying stock:
$$
C(S,t) = N(d_1)S - N(d_2)Ke^{-r(T-t)} \ , \ \ \
d_1 = \frac{\ln(S/K) + (r + \sigma^2/2)(T-t)}{\sigma\sqrt{T-t}} \ , \ \ \
d_2 = d_1 - \sigma\sqrt{T-t}
$$
where $(T-t)$ is the time to maturity (scalar), $r$ is the risk-free rate (scalar), $S$ is the spot price of the underlying asset, $K$ is the strike price, and $\sigma$ is the volatility of returns of the underlying asset. $N(\cdot)$ is the cumulative distribution function (cnd) of the standard normal distribution and has an analytical approximation. A classical way to code it in Python is:

import numpy as np
import time
 
RISKFREE = 0.02
VOLATILITY = 0.30
 
def cnd(d):
    A1 = 0.31938153
    A2 = -0.356563782
    A3 = 1.781477937
    A4 = -1.821255978
    A5 = 1.330274429
    RSQRT2PI = 0.39894228040143267793994605993438
    K = 1.0 / (1.0 + 0.2316419 * np.abs(d))
    ret_val = (RSQRT2PI * np.exp(-0.5 * d * d) *
               (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5))))))
    return np.where(d > 0, 1.0 - ret_val, ret_val)
 
def black_scholes(callResult, putResult, stockPrice, optionStrike, optionYears,
                  Riskfree, Volatility):
    S = stockPrice
    X = optionStrike
    T = optionYears
    R = Riskfree
    V = Volatility
    sqrtT = np.sqrt(T)
    d1 = (np.log(S / X) + (R + 0.5 * V * V) * T) / (V * sqrtT)
    d2 = d1 - V * sqrtT
    cndd1 = cnd(d1)
    cndd2 = cnd(d2)
 
    expRT = np.exp(- R * T)
    callResult[:] = (S * cndd1 - X * expRT * cndd2)
 
def randfloat(rand_var, low, high):
    return (1.0 - rand_var) * low + rand_var * high
 
def main (*args):
    OPT_N = 4000000
    iterations = 10
    if len(args) >= 2:
        iterations = int(args[0])
 
    callResult = np.zeros(OPT_N)
    stockPrice = randfloat(np.random.random(OPT_N), 5.0, 30.0)
    optionStrike = randfloat(np.random.random(OPT_N), 1.0, 100.0)
    optionYears = randfloat(np.random.random(OPT_N), 0.25, 10.0)
    putResult = -np.ones(OPT_N)  # placeholder required by black_scholes' signature
 
    time0 = time.time()
    for i in range(iterations):
        black_scholes(callResult, putResult, stockPrice, optionStrike,
                      optionYears, RISKFREE, VOLATILITY)
    time1 = time.time()
    print("Time: %f msec per option" % ((time1-time0)/iterations/OPT_N*1000))
 
if __name__ == "__main__":
    import sys
    main(*sys.argv[1:])

what returns

Time: 0.000192 msec per option

The essence of this code is to derive 4 million independent results based on feeding the function with random stock prices, option strike prices, and times to maturity. They enter the game undercover as row vectors with randomised values (see lines #45-47). Anaconda Accelerate’s CUDA solution for the same code is:

import numpy as np
import math
import time
from numba import *
from numbapro import cuda
from blackscholes import black_scholes # save the previous code as
                                       # blackscholes.py
 
RISKFREE = 0.02
VOLATILITY = 0.30
 
A1 = 0.31938153
A2 = -0.356563782
A3 = 1.781477937
A4 = -1.821255978
A5 = 1.330274429
RSQRT2PI = 0.39894228040143267793994605993438
 
@cuda.jit(argtypes=(double,), restype=double, device=True, inline=True)
def cnd_cuda(d):
    K = 1.0 / (1.0 + 0.2316419 * math.fabs(d))
    ret_val = (RSQRT2PI * math.exp(-0.5 * d * d) *
               (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5))))))
    if d > 0:
        ret_val = 1.0 - ret_val
    return ret_val
 
@cuda.jit(argtypes=(double[:], double[:], double[:], double[:], double[:],
                    double, double))
def black_scholes_cuda(callResult, putResult, S, X,
                       T, R, V):
#    S = stockPrice
#    X = optionStrike
#    T = optionYears
#    R = Riskfree
#    V = Volatility
    i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if i >= S.shape[0]:
        return
    sqrtT = math.sqrt(T[i])
    d1 = (math.log(S[i] / X[i]) + (R + 0.5 * V * V) * T[i]) / (V * sqrtT)
    d2 = d1 - V * sqrtT
    cndd1 = cnd_cuda(d1)
    cndd2 = cnd_cuda(d2)
 
    expRT = math.exp((-1. * R) * T[i])
    callResult[i] = (S[i] * cndd1 - X[i] * expRT * cndd2)
 
def randfloat(rand_var, low, high):
    return (1.0 - rand_var) * low + rand_var * high
 
def main (*args):
    OPT_N = 4000000
    iterations = 10
 
    callResultNumpy = np.zeros(OPT_N)
    putResultNumpy = -np.ones(OPT_N)
    stockPrice = randfloat(np.random.random(OPT_N), 5.0, 30.0)
    optionStrike = randfloat(np.random.random(OPT_N), 1.0, 100.0)
    optionYears = randfloat(np.random.random(OPT_N), 0.25, 10.0)
    callResultNumba = np.zeros(OPT_N)
    putResultNumba = -np.ones(OPT_N)
    callResultNumbapro = np.zeros(OPT_N)
    putResultNumbapro = -np.ones(OPT_N)
 
# Numpy ----------------------------------------------------------------
    time0 = time.time()
    for i in range(iterations):
        black_scholes(callResultNumpy, putResultNumpy, stockPrice,
                      optionStrike, optionYears, RISKFREE, VOLATILITY)
    time1 = time.time()
    dtnumpy = ((1000 * (time1 - time0)) / iterations)/OPT_N
    print("\nNumpy Time            %f msec per option") % (dtnumpy)
 
# CUDA -----------------------------------------------------------------
    time0 = time.time()
    blockdim = 1024, 1
    griddim = int(math.ceil(float(OPT_N)/blockdim[0])), 1
    stream = cuda.stream()
    d_callResult = cuda.to_device(callResultNumbapro, stream)
    d_putResult = cuda.to_device(putResultNumbapro, stream)
    d_stockPrice = cuda.to_device(stockPrice, stream)
    d_optionStrike = cuda.to_device(optionStrike, stream)
    d_optionYears = cuda.to_device(optionYears, stream)
 
    time2 = time.time()
 
    for i in range(iterations):
        black_scholes_cuda[griddim, blockdim, stream](
            d_callResult, d_putResult, d_stockPrice, d_optionStrike,
            d_optionYears, RISKFREE, VOLATILITY)
        d_callResult.to_host(stream)
        d_putResult.to_host(stream)
        stream.synchronize()
 
    time3 = time.time()
    dtcuda = ((1000 * (time3 - time2)) / iterations)/OPT_N
 
    print("Numbapro CUDA Time    %f msec per option (speed-up %.1fx)
             \n") % (dtcuda, dtnumpy/dtcuda)
#   print(callResultNumbapro)
 
 
if __name__ == "__main__":
    import sys
    main(*sys.argv[1:])

returning

Numpy Time            0.000186 msec per option
Numbapro CUDA Time    0.000024 msec per option (speed-up 7.7x)

To understand why CUDA wins over NumPy this time is not so difficult. First, we have a programmable analytical form of the problem. We deploy it to the GPU and perform exhaustive calculations involving the cnd_cuda function for the estimation of the cumulative distribution function of the standard normal distribution. Splitting the task into many concurrently running threads on the GPU reduces the time. Again, it’s possible because all option prices can be computed independently. CUDA:NumPy (1:1).

Multiplied Promises

In finance, the concept of portfolio optimization is well established (see my ebook on that, Applied Portfolio Optimization with Risk Management, as an example). The idea behind it is to find such a vector of weights, $w$, for all assets that the derived estimated portfolio risk ($\sigma_P$) and return ($\mu_P$) meet our needs or expectations.

An alternative (but not greatly recommended) approach would involve optimization through the randomisation of the $w$ vectors. We could generate a big number of them, say $N$, in order to obtain:
$$
\mu_{P,i} = m w_i^T \ \ \ \mbox{and} \ \ \ \sigma_{P,i}^2 = w_iM_2w_i^T
$$ for $i=1,…,N$. Here, $m$ is a row-vector holding the estimated expected returns of all assets in portfolio $P$, and based on the return-series we end up with the $M_2$ covariance matrix ($M\times M$, where $M$ is the number of assets in $P$). In the first case, we aim at the multiplication of a row-vector by a transposed row-vector of weights, whereas in the latter we perform the multiplication of a row-vector by a square matrix (as the first operation). If $N$ is really big, say a couple of millions, the computation could be somewhat accelerated using the GPU. Therefore, we need a code for matrix multiplication on the GPU in Python.
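For reference, the plain NumPy (CPU) side of that randomised-weights idea is compact; the sketch below uses made-up return data and weight vectors just to fix the shapes involved:

import numpy as np

M = 7                                      # number of assets
N = 100000                                 # number of random weight vectors
R = np.random.randn(250, M) * 0.01         # hypothetical daily return-series
m = R.mean(axis=0)                         # estimated expected returns (row-vector)
M2 = np.cov(R, rowvar=False)               # M x M covariance matrix

W = np.random.rand(N, M)
W /= W.sum(axis=1, keepdims=True)          # each random weight vector sums to 1

mu_P  = W @ m                              # mu_{P,i} = m w_i^T, shape (N,)
var_P = np.einsum('ij,jk,ik->i', W, M2, W) # sigma^2_{P,i} = w_i M2 w_i^T
sig_P = np.sqrt(var_P)

print(mu_P[:3], sig_P[:3])

Whether pushing exactly this operation to the GPU pays off is what the remainder of this section tries to find out.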

Let’s consider a more advanced concept of a ($K\times K$)$\times$($K\times K$) matrix multiplication. If that works faster, then our random portfolios problem should be even faster. Continuum Analytics provides us with a ready-to-use solution:

import numpy as np
from numbapro import cuda
import numba
from timeit import default_timer as timer
from numba import float32
 
bpg = 32
tpb = 32
 
n = bpg * tpb
 
shared_mem_size = (tpb, tpb)
griddim = bpg, bpg
blockdim = tpb, tpb
 
@numba.cuda.jit("void(float32[:,:], float32[:,:], float32[:,:])")
def naive_matrix_mult(A, B, C):
    x, y = cuda.grid(2)
    if x >= n or y >= n:
        return
 
    C[y, x] = 0
    for i in range(n):
        C[y, x] += A[y, i] * B[i, x]
 
 
@numba.cuda.jit("void(float32[:,:], float32[:,:], float32[:,:])")
def optimized_matrix_mult(A, B, C):
 
    # Declare shared memory
    sA = cuda.shared.array(shape=shared_mem_size, dtype=float32)
    sB = cuda.shared.array(shape=shared_mem_size, dtype=float32)
 
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    x, y = cuda.grid(2)
 
    acc = 0
    for i in range(bpg):
        if x < n and y < n:
            # Prefill cache
            sA[ty, tx] = A[y, tx + i * tpb]
            sB[ty, tx] = B[ty + i * tpb, x]
 
        # Synchronize all threads in the block
        cuda.syncthreads()
 
        if x < n and y < n:
            # Compute product
            for j in range(tpb):
                acc += sA[ty, j] * sB[j, tx]
 
        # Wait until all threads finish the computation
        cuda.syncthreads()
 
    if x < n and y < n:
        C[y, x] = acc
 
 
# Prepare data on the CPU
A = np.array(np.random.random((n, n)), dtype=np.float32)
B = np.array(np.random.random((n, n)), dtype=np.float32)
 
print "(%d x %d) x (%d x %d)" % (n, n, n, n)
 
# Prepare data on the GPU
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.device_array_like(A)
 
# Time the unoptimized version
s = timer()
naive_matrix_mult[griddim, blockdim](dA, dB, dC)
numba.cuda.synchronize()
e = timer()
unopt_ans = dC.copy_to_host()
tcuda_unopt = e - s
 
# Time the optimized version
s = timer()
optimized_matrix_mult[griddim, blockdim](dA, dB, dC)
numba.cuda.synchronize()
e = timer()
opt_ans = dC.copy_to_host()
tcuda_opt = e - s
 
assert np.allclose(unopt_ans, opt_ans)
print "CUDA without shared memory:", "%.2f" % tcuda_unopt, "s"
print "CUDA with shared memory   :", "%.2f" % tcuda_opt, "s"
 
s = timer()
np.dot(A,B)
e = timer()
npt=e-s
print "NumPy dot product         :", "%.2f" % npt, "s"

what returns

(1024 x 1024) x (1024 x 1024)
CUDA without shared memory: 0.76 s
CUDA with shared memory   : 0.25 s
NumPy dot product         : 0.06 s

and leads to a CUDA:NumPy (1:2) score in the game. Natural questions arise. Is it about the matrix size? Maybe it is too simple a problem to be solved with such a tool? Or is it the way we approach matrix allocation and deployment to the GPU itself?

The last question made me dig deeper. In Python you can create one big matrix holding a number of smaller matrices. The following code tries to perform 4 million $(2\times 2)$ matrix multiplications where the matrix $B$ is randomised every single time (see our random portfolio problem). In Anaconda Accelerate we achieve it as follows:

import numbapro
import numba.cuda
import numpy as np
from timeit import default_timer as timer
# Use the builtin matrix_multiply in NumPy for CPU test
import numpy.core.umath_tests as ut
 
 
@numbapro.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
                      '(m, n),(n, p)->(m, p)', target='gpu')
def batch_matrix_mult(a, b, c):
    for i in range(c.shape[0]):
        for j in range(c.shape[1]):
            tmp = 0
            for n in range(a.shape[1]):
                 tmp += a[i, n] * b[n, j]
            c[i, j] = tmp
 
 
def main():
 
    n   = 4000000
    dim = 2
 
    sK=0
    KN=10
 
    for K in range(KN):
 
        # Matrix Multiplication:   c = a x b
 
        a = np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
        c = np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
 
        # NUMPY -------------------------------------------------------------
        start = timer()
        b = np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
        d=ut.matrix_multiply(a, b)
        np_time=timer()-start
 
        # CUDA --------------------------------------------------------------
        dc = numba.cuda.device_array_like(c)
        da = numba.cuda.to_device(a)
 
        start = timer()
        b = np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
        db = numba.cuda.to_device(b)
        batch_matrix_mult(da, db, out=dc)
        numba.cuda.synchronize()
        dc.copy_to_host(c)
        cuda_time=timer()-start
 
        sK += np_time/cuda_time
 
        del da, db
 
    print("\nThe average CUDA speed-up: %.5fx") % (sK/KN)
 
 
if __name__ == '__main__':
    main()

leading us to

The average CUDA speed-up: 0.79003x

i.e. a deceleration. Playing with the sizes of the matrices and their number may result in an error caused by the GPU memory required for the allocation of matrices $A$ and $B$ on the GPU. It seems we have CUDA:NumPy (1:3).

Black Magic in Black Box

I approached the NumbaPro solution as a complete rookie. I spent a considerable amount of time searching for Anaconda Accelerate’s GPU codes demonstrating the massive speed-ups as promised. I found different fragments in different places across the Web. With the best method known among all beginners, namely copy and paste, I re-ran what I found. Then modified and re-ran again. And again, and again. I felt the need, the need for speed! But I failed to find my tail wind.

This post may demonstrate my lack of understanding of what is going on, or reveal a blurred picture standing behind it: a magic that works if you know all the tricks. I hope that at least you enjoyed the show!

Covariance Matrix for N-Asset Portfolio fed by Quandl in Python

Constructing your quantitative workshop in Python requires a lot of coding, or at least spending a considerable amount of time assembling different blocks together. There are many simple fragments of code reused many times. The calculation of a covariance matrix is not a problem once NumPy is engaged, but it gains meaning once you add some background idea of what you are trying to achieve.

In this lesson of the Accelerated Python for Quants tutorial, let’s see how to use the Quandl.com data provider in the construction of any $N$-asset portfolio based on SEC securities, and how to calculate the corresponding covariance matrix for the return-series.
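As a preview of where we are heading, a minimal sketch of the idea is given below. It assumes the quandl Python package, the (since retired) WIKI stock database, a valid API key, a small hypothetical ticker list, and that all fetched series cover the same dates:

import numpy as np
import quandl

quandl.ApiConfig.api_key = "YourQuandlKey"   # placeholder
tickers = ["AAPL", "IBM", "JNJ"]             # hypothetical 3-asset portfolio

prices = []
for t in tickers:
    df = quandl.get("WIKI/" + t,
                    start_date="2014-01-01", end_date="2015-01-12")
    prices.append(df["Adj. Close"].values)   # adjusted close price series

P = np.array(prices).T          # (time x assets) price matrix
R = P[1:] / P[:-1] - 1.0        # daily return-series
M2 = np.cov(R, rowvar=False)    # the N x N covariance matrix
print(M2)

The rest of this lesson builds the same thing more carefully, starting from the full list of tickers.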

Quandl and SEC Stock List

Quandl is a great source of data. With their ambition to become the largest data provider on the planet free of charge, no doubt they do an amazing job. You can use their Python API to feed your code directly with Open, High, Low, Close prices for any SEC stock. In the beginning we will need a list of companies (tickers) and, unfortunately, the corresponding internal call-tickers as referred to by Quandl. The .csv file containing all the information you can download from this website or directly here: