
## Probability of Black Swan Events at NYSE

Featured by the Russian website Financial One (Apr 26, 2016)

The prediction of extremely rare events (EREs) in the financial markets remains one of the toughest problems, chiefly because of the very limited knowledge we have of their distribution and of the underlying correlations across the markets. We walk in the dark, hoping it won’t happen today, not to the stocks we hold in our portfolios. But is that darkness really so dark?

In this post we will use textbook classical statistics to analyse the universe of 2500+ stocks traded at the New York Stock Exchange (NYSE), each represented by the most devastating daily loss in its history (one data point per stock). Based on that, we will try to shed new light on the most mind-boggling questions: when will the next ERE happen, how long do we need to wait for it, and if it hits, what will its magnitude “of devastation” be?

1. Extreme Rare Events of NYSE

An ERE meeting certain criteria has its own name in the financial world: a black swan. We will define it in the next Section. Historically speaking, at some point, people were convinced that only white swans existed until someone noticed a population of black swans in Western Australia. The impossible became possible.

We observe the same in the stock markets. Black Monday, Oct 19, 1987, is no doubt the greatest historical event, with the markets suffering 20%+ losses. The event was so rare that no risk analyst could predict it or protect against it. History repeated itself in 2000, when the Internet bubble caused many IT stocks to lose heavily over a period of a few weeks. The events of 9/11 in New York forced the closure of the US stock exchanges for 6 days, and on Sep 17, as a consequence of fear, about 25 stocks at NYSE lost more than 15% in a single day. Uncertainty cannot be avoided, but it can be minimized. Can it? The financial crisis of 2007-2008 triggered a significant number of black swans to fly again, and fly high…

Did we learn enough from that? Do we protect ourselves more efficiently against black swans after the 2009 market recovery? Will you be surprised if I tell you “no”?!

Let’s conduct a data analysis of the most severe EREs ever spotted at NYSE among all its stocks. To do that, we will use the Python language and the Yahoo! Finance database. The data are not required to be superbly accurate, and we will download only the adjusted-close prices with the corresponding dates. A list of companies (inter alia their tickers) traded at NYSE can be extracted from a .xls file available for download at http://www.nasdaq.com/screening/company-list.aspx. It contains 3226 tickers, while, as we will see below, Yahoo! Finance recognizes only 2716 of them. Since 2716 is a large data set meeting many requirements for statistical significance, I’m more than happy to go along with it.

In the process of the stock data download, we look for the most extreme daily loss per stock and collect them all in a pandas DataFrame, dfblack. It will contain a Date, the magnitude of the ERE (Return), and Delta0. The latter we calculate on the spot as the number of business days between some initial point in time (here, 1900-01-01) and the day of the ERE occurrence. The processing time may take up to 45 min depending on your Internet speed. Be patient, it will be worth waiting that long. Using HDF5 we save the dfblack DataFrame to disk.

```python
# Probability of Black Swan Events at NYSE
# (c) 2016 by Pawel Lachowicz, QuantAtRisk.com

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pandas_datareader.data as web
import datetime as dt
from scipy.stats import gumbel_l, t, norm, expon, poisson
from math import factorial

# Constants
c = (.7, .7, .7)  # grey color

col = ['Date', 'Return', 'Delta0']
dfblack = pd.DataFrame(columns=col)  # create an empty DataFrame

i = n = 0
for lstline in open("nyseall.lst", 'r').readlines():
    ticker = lstline[:-1]
    i += 1
    print(i, ticker)
    try:
        ok = True
        data = web.DataReader(str(ticker), data_source='yahoo',
                              start='1900-01-01', end='2016-04-14')
    except:
        ok = False
    if(ok):
        n += 1
        data['Return'] = data['Adj Close'] / data['Adj Close'].shift(1) - 1
        data = data[1:]  # skip first row

        df = data['Return'].copy()
        dfs = df.sort_values()
        start = dt.date(1900, 1, 1)
        end = dt.date(dfs.index[0].year, dfs.index[0].month, dfs.index[0].day)
        delta0 = np.busday_count(start, end)
        tmin = end
        rmin = dfs[0]
        dfblack.loc[len(dfblack)] = [tmin, rmin, delta0]  # add new row to dfblack
    else:
        print(' not found')

dfblack.set_index(dfblack.Date, inplace=True)  # set index by Date
dfblack.sort_index(inplace=True)               # sort DataFrame by Date
del dfblack['Date']                            # remove Date column

print("No. of stocks with valid data: %g" % n)

# saving to file
store = pd.HDFStore('dfblack.h5')
store['dfblack'] = dfblack
store.close()

# reading from file
store = pd.HDFStore('dfblack.h5')
dfblack = pd.read_hdf('dfblack.h5', 'dfblack')
store.close()

dfblack0 = dfblack.copy()  # a backup copy
```

where nyseall.lst is a plain text file listing all NYSE tickers:

```
DDD
MMM
WBAI
WUBA
AHC
...
```

The first instinct of a savvy risk analyst would be to plot the distribution of the magnitudes of EREs as extracted for the NYSE universe. We achieve that by:

```python
plt.figure(num=1, figsize=(9, 5))
plt.hist(dfblack.Return, bins=50, color='grey')  # a histogram
plt.xlabel('Daily loss', fontsize=12)
plt.xticks(fontsize=12)

plt.savefig('fig01.png', format='png')
```

which reveals:

The distribution peaks around -18%, with a long left tail of losses reaching nearly -98%. Welcome to the devil’s playground! But that’s just the beginning. The information on the dates corresponding to the occurrence of those EREs allows us to build a 2D picture:

```python
# Black Monday data sample
df_bm = dfblack[dfblack.index == dt.date(1987, 10, 19)]
# Reopening of NYSE after 9/11 data sample
df_wtc = dfblack[dfblack.index == dt.date(2001, 9, 17)]
# Financial Crisis of 2008 data sample
df_cre = dfblack[(dfblack.index >= dt.date(2008, 9, 19)) &
                 (dfblack.index <= dt.date(2009, 3, 6))]
# Drawdown of 2015/2016 data sample
df_16 = dfblack[(dfblack.index >= dt.date(2015, 12, 1)) &
                (dfblack.index <= dt.date(2016, 2, 11))]

plt.figure(figsize=(13, 6))
plt.plot(dfblack.Return*100, '.k')
plt.plot(df_bm.Return*100, '.r')
plt.plot(df_wtc.Return*100, '.b')
plt.plot(df_cre.Return*100, '.m')
plt.plot(df_16.Return*100, '.y')
plt.xlabel('Time Interval [%s to %s]' % (dfblack.index[0], dfblack.index[-1]),
           fontsize=12)
plt.ylabel('Magnitude [%]', fontsize=12)
plt.title('Time Distribution of the Most Extreme Daily Losses of %g NYSE '
          'Stocks (1 point per stock)' % (len(dfblack)), fontsize=13)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.xlim([dfblack.index[0], dt.date(2016, 4, 14)])

plt.savefig('fig02.png', format='png')
```

delivering:

Keeping in mind that every stock (out of 2716) is represented by only one data point, the figure reveals a dramatic state of play: a significantly lower number of EREs before 1995 and a progressive increase after that, with a surprisingly huge number in the past 5 years! As discussed in my previous post, Detecting Human Fear in Electronic Trading, the surge in EREs could be explained by the domination of algorithmic trading over manual trades. It may also suggest that the algorithmic risk management for many stocks is either not as good as we want it to be or very sensitive to the global stock buying/selling pressure across the markets (the butterfly effect). On the other hand, the number of new stocks being introduced to NYSE keeps growing, and those most vulnerable, short-lived, of unestablished position, reputation, financial history, or of least liquidity are the most likely to be subject to an ERE. Regardless of the primary cause, there are more and more “sudden drops in altitude” now than ever. I find it alarming.

For clarity, in the above figure, the historical events of Black Monday 1987-10-19 (89 points), the reopening of NYSE after 9/11 (25 points), the Financial Crisis of 2007-2008 (858 points), and the drawdown of 2015/2016 (256 points!) have been marked in red, blue, purple, and dark yellow, respectively.

2. Black Swans of NYSE

The initial distribution of Extreme Rare Events of NYSE (Figure 1) provides us with an opportunity to define a black swan event. There is a statistical theory, Extreme Value Theory (EVT), delivering an appealing explanation of the behaviour of rare events. We have discussed it in detail in the Extreme VaR for Portfolio Managers and Black Swan and Extreme Loss Modeling articles. Thanks to it, we gain the knowledge and tools needed to describe what we observe. As previously, in this post we will consider only the Gumbel distribution $G$ with the corresponding probability density function (pdf) $g$ given as:
$$G(z;\ a,b) = e^{-e^{-z}} \ \ \mbox{for}\ \ z=\frac{x-a}{b}, \ x\in\Re$$ and
$$g(z;\ a,b) = b^{-1} e^{-z}e^{-e^{-z}} \ ,$$ where $a$ and $b$ are the location parameter and scale parameter, respectively. You can convince yourself that fitting the $g(z;\ a,b)$ function truly works in practice:

```python
# fitting
locG, scaleG = gumbel_l.fit(dfblack.Return)  # location, scale parameters

dx = 0.0001
x = [dx*i for i in range(-int(1/dx), int(1/dx)+1)]
x2 = x.copy()

plt.figure(num=3, figsize=(9, 4))
plt.hist(dfblack.Return, bins=50, normed=True, color=c,
         label="NYSE Data of 2716 EREs")
pdf1 = gumbel_l.pdf(x2, loc=locG, scale=scaleG)
y = pdf1.copy()
plt.plot(x2, y, 'r', linewidth=2, label="Gumbel PDF fit")
plt.xlim([-1, 0])
plt.xlabel('Daily loss [%]', fontsize=12)
plt.xticks(fontsize=12)
plt.legend(loc=2)
plt.savefig('fig03.png', format='png')
```

By deriving:

```python
print(locG, scaleG, locG-scaleG, locG+scaleG)

# Pr(Loss < locG-scaleG)
Pr_bs = gumbel_l.cdf(locG-scaleG, loc=locG, scale=scaleG)
print(Pr_bs)
```

we see that

```
-0.151912998711 0.0931852781844 -0.245098276896 -0.058727720527
0.307799372445
```

i.e. the Gumbel pdf’s location parameter (peak) is at -15.2% with a scale of 9.3%. Using an analogy to the Normal distribution, we define two brackets, namely:
$$a-b \ \ \mbox{and} \ \ a+b$$ equal to -24.5% and -5.9%, respectively. Given that, we define a black swan event as a daily loss of magnitude:
$$L < a-b \ .$$ Based on the NYSE data sample we find that 30.8% of EREs (i.e. losses greater than 24.5%) meet that criterion. We re-plot their time-magnitude history as follows:

```python
# skip data after 2015-04-10 (last 252 business days)
dfb = dfblack0[dfblack0.index < dt.date(2015, 4, 10)]
# re-fit Gumbel (to remain politically correct in-sample ;)
locG, scaleG = gumbel_l.fit(dfb.Return)
print(locG, locG-scaleG)
# -0.16493276305 -0.257085820659

# extract black swans from EREs data set
dfb = dfb[dfb.Return < locG-scaleG]
dfb0 = dfb.copy()

plt.figure(figsize=(13, 6))
plt.plot(dfb.Return*100, '.r')
plt.xlabel('Time Interval [%s to %s]' % (dfb.index[0], dfb.index[-1]), fontsize=14)
plt.ylabel('Black Swan Magnitude [%]', fontsize=14)
plt.title('Time Distribution of Black Swans of %g NYSE Stocks (L < %.1f%%)' %
          (len(dfb), (locG-scaleG)*100), fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.ylim([-100, 0])
plt.xlim([dfb.index[0], dt.date(2016, 4, 14)])
plt.savefig('fig04.png', format='png')
```

where we deliberately skipped the last 252+ data points to allow ourselves some out-of-sample backtesting later on. That reduces the number of black swans from 807 down to 627 (in-sample).

3. Modelling the Probability of an Event Based on Observed Occurrences

Let’s be honest. It’s nearly impossible to predict in advance the exact day when the next black swan will appear, which stocks it will hit, and whether it will be isolated or trigger several stocks to fall on the same day. Therefore, what follows from now on you should consider as common sense in the quantitative modelling of black swans, backed by our knowledge of statistics. And yes, despite the opinions of Taleb (2010) in his book The Black Swan: The Impact of the Highly Improbable, we might derive some risk uncertainties and become more aware of the impact if the worst strikes.

The classical guideline we find in Vose’s book Quantitative Risk Analysis: A Guide to Monte Carlo Simulation Modelling (1996). David delivers not only solid explanations of many statistical distributions but also excellent hints on their practical usage for risk analysts. Since I discovered this book in 2010 in Singapore, it has been my favourite reference manual of all time.

The modelling of the probability of an event based on observed occurrences may follow two distinct processes, analogous to the difference between discrete and continuous distributions: (1) an event that may occur only among a set of specific discrete opportunities, and (2) an event that may occur over a continuum of opportunity.

Considering black swan events, we are not limited to a finite number of instances; therefore we will use continuous exposure probability modelling. A continuous exposure process is characterised by the mean interval between events (MIBE) $\beta$. The underlying assumption is that the uncertainty displays the properties of a Poisson process, i.e. the probability of an event is independent of however many events have occurred in the past or how recently. And this is our starting point towards a better understanding of NYSE black swans.
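To see what that Poisson-process assumption implies, here is a small side sketch (my own illustration, not part of the original analysis; the value of $\beta$ is assumed, close to the one estimated below): for exponentially distributed waiting times, the chance of waiting at least another $t$ days does not depend on how long we have already waited.

```python
import numpy as np
from scipy.stats import expon

np.random.seed(1)
beta_mibe = 17.4  # assumed mean interval between events [days]
gaps = expon.rvs(scale=beta_mibe, size=100000)  # simulated waiting times

s, t = 10, 5
pr_uncond = np.mean(gaps > t)              # Pr(T > t)
pr_cond = np.mean(gaps[gaps > s] > s + t)  # Pr(T > s+t | T > s)
print("Pr(T > %d)          = %.3f" % (t, pr_uncond))
print("Pr(T > %d | T > %d) = %.3f" % (s + t, s, pr_cond))
# both probabilities agree up to Monte Carlo noise: the process is memoryless
```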

3.1. Mean Interval Between Black Swan Events (MIBE)

The MIBE is the average interval between $n$ observed occurrences of a black swan event. Its true value can be estimated from the observed occurrences using central limit theorem:
$$\mbox{MIBE}\ \beta = \mbox{Normal} \left( \bar{t}, \frac{\sigma}{\sqrt{n-1}} \right)$$ where $\bar{t}$ is the average of the $n-1$ observed intervals $t_i$ between the $n$ observed contiguous black swan events and $\sigma$ is the standard deviation of those intervals. The larger the value of $n$, the greater our confidence in knowing its true value.

First, we estimate $\bar{t}$ from the data. Using NumPy’s histogram function, as an extra check, we confirm that the plain mean is the same as the expected value, $E(\Delta t)$, derived from the histogram,
$$E(\Delta t) = \sum_{i} p_i x_i \ ,$$ i.e. $\bar{t} = E(\Delta t)$, where $p_i$ is the probability of observing $x_i$. In Python we execute that check as follows:

```python
deltat = np.diff(dfb.Delta0)  # time differences between points
tbar = np.mean(deltat)

print('tbar = %.2f [days]' % tbar)
print('min, max = %.2f, %.2f [days]' % (np.min(deltat), np.max(deltat)))
print()

# Confirm the mean using histogram data (mean = the expected value)
y, x = np.histogram(deltat, bins=np.arange(np.ceil(np.max(deltat)+2)),
                    density=True)
hmean = np.sum(x[:-1]*y)
print('E(dt) = %.2f [days]' % hmean)

# plot the histogram
plt.figure(figsize=(9, 5))
plt.bar(x[:-1], y, width=1, color=c)
plt.xlim([0, 20])
plt.title('Time Differences between Black Swan Events', fontsize=12)
plt.xlabel('Days', fontsize=12)
plt.savefig('fig05.png', format='png')
```

which returns

```
tbar = 17.38 [days]
min, max = 0.00, 950.00 [days]

E(dt) = 17.38 [days]
```

What is worth noticing is the fact that the largest recorded separation among historical black swans at NYSE was 950 days, i.e. 2.6 years. By plotting the distribution of time differences, we get:

which can be read more clearly if we type, e.g.:

```python
# the first 5 data points from the histogram
print(y[0:5])
print(x[0:5])
```

returning

```
[ 0.30990415  0.12300319  0.08466454  0.04472843  0.03354633]
[ 0.  1.  2.  3.  4.]
```

which means that our data suggest a 31.0% probability of observing black swan events on the same day, 12.3% with a one-day gap, 8.5% with a two-day separation, and so on.

Since the MIBE $\beta$ is not given by a single number, the recipe we have for its estimation (see the formula above) allows us to model the Mean Interval Between Black Swan Events as follows. First, we calculate the corresponding standard deviation:

```python
sig = np.std(deltat, ddof=1)  # standard deviation

print(tbar, sig)
print(tbar, sig/np.sqrt(ne-1))  # ne = number of black swans, defined below
```
```
15.8912 57.5940828519
15.8912 2.30192251209
```

and next

```python
ne = len(dfb)  # 627, the number of black swans in the data sample

betaMIBE_sample = norm.rvs(loc=tbar, scale=sig/np.sqrt(ne-1), size=10000)

plt.figure(figsize=(9, 5))
plt.hist(betaMIBE_sample, bins=25, color=c)
plt.xlabel("MIBE beta [days]", fontsize=12)
plt.title("Mean Interval Between Black Swan Events", fontsize=12)
plt.savefig('fig06.png', format='png')
```

3.2. Time Till Next Black Swan Event

Once we have estimated the MIBE $\beta$, it is possible to use its value to estimate the time till the next black swan event (TTNE). It obeys
$$\mbox{TTNE} = \mbox{Expon}(\beta)$$ i.e. the exponential distribution given by the probability density function of:
$$f(x; \beta) = \lambda \exp(-\lambda x) = \frac{1}{\beta} \exp\left(-\frac{x}{\beta}\right) \ .$$ In Python we model TTNE in the following way:

```python
betaMIBE_one = norm.rvs(loc=tbar, scale=sig/np.sqrt(ne-1), size=1)

# The time till next event = Expon(beta)
ttne = expon.rvs(loc=0, scale=betaMIBE_one, size=100000)

y, x = np.histogram(ttne, bins=np.arange(int(np.max(ttne))), density=True)

plt.figure(figsize=(9, 5))
plt.bar(x[:-1], y, width=1, color=c, edgecolor=c)
plt.xlim([0, 120])
plt.title("Time Till Next Black Swan Event", fontsize=12)
plt.xlabel("Days", fontsize=12)
plt.savefig('fig07.png', format='png')
```

where by checking

```python
print(y[1:5])
print(x[1:5])
```

we get

```
[0.05198052 0.0498905  0.04677047 0.04332043]
[1 2 3 4]
```

i.e. there is a 5.20% probability that the next black swan event at NYSE will occur on the next day, 4.99% in two days, 4.68% in three days, etc.

3.3. The Expected Number of Black Swans per Business Day

The distribution of the number of black swan events to occur at NYSE per unit time (here, per business day) can be modelled using Poisson distribution:
$$\mbox{Poisson}(\lambda) \ \ \mbox{where} \ \ \lambda = \frac{1}{\beta} \ .$$ The Poisson distribution models the number of occurrences of an event in a time $T$ when the time between successive events follows a Poisson process. If $\beta$ is the mean time between events, as used by the Exponential distribution, then $\lambda = T/\beta$. We will use it in the “out-of-sample” backtesting in Section 4.

Let’s turn words into action by executing the following Python code:

```python
# generate a set of Poisson rvs
nep = poisson.rvs(1/betaMIBE_one, size=100000)

y, x = np.histogram(nep, bins=np.arange(np.max(nep)+1), density=True)

plt.figure(figsize=(9, 5))
plt.title("The Expected Number of Black Swans per Business Day", fontsize=12)
plt.xlabel("N", fontsize=12)
plt.ylabel("Probability", fontsize=12)
plt.xticks(x)
plt.xlim([-0.1, 2.1])
plt.bar(x[:-1], y, width=0.05, color='r', edgecolor='r', align='center')
plt.savefig('fig08.png', format='png')
```

The derived probabilities:

```python
print(y)
print(x[:-1])
```
```
[0.95465 0.04419 0.00116]
[0 1 2]
```

inform us that there is a 95.5% chance of no black swan event occurring on a given business day, 4.4% that we will record exactly one, and 0.12% that we may find 2 out of the 2500+ NYSE stocks suffering a loss greater than 25.7% on the next business day (unit time).

3.4. Probability of the Occurrence of Several Events in an Interval

The Poisson$(1/\beta)$ distribution calculates the distribution of the number of events that will occur in a single unit interval. The probability of exactly $k$ black swan events in that interval is given by:
$$\mbox{Pr}[X = k] = \frac{\lambda^k}{k!} \exp(-\lambda) = \frac{1}{\beta^k k!} \exp\left(-\frac{1}{\beta}\right)$$
with $\beta$ rescaled to reflect the period of exposure, e.g.:
$$\beta_{yr} = \beta/252$$ where 252 stands for the number of business days in a calendar year. Therefore, the number of black swan events in the next business year can be modelled by:
$$N = \mbox{Poisson}(\lambda) = \mbox{Poisson}(1/\beta_{yr}) = \mbox{Poisson}(252/\mbox{Normal}(\bar{t}, \sigma/\sqrt{n-1}))$$ which we can achieve in Python as follows:

```python
inNdays = 252

N = poisson.rvs(inNdays/norm.rvs(loc=tbar, scale=sig/np.sqrt(ne-1)),
                size=100000)
exN = np.round(np.mean(N))
stdN = np.round(np.std(N, ddof=1))

y, x = np.histogram(N, bins=np.arange(int(np.max(N))), density=True)

_ = plt.figure(figsize=(9, 5))
_ = plt.title("The Expected Number of Black Swans in next %g Business Days" %
              inNdays, fontsize=12)
_ = plt.xlabel("N", fontsize=12)
_ = plt.ylabel("Probability", fontsize=12)
_ = plt.bar(x[:-1], y, width=1, color=c)
plt.savefig('fig09.png', format='png')

# probability that <N> events will occur in the next 'inNdays' days
print('E(N) = %g' % exN)
print('stdN = %g' % stdN)

tmp = [(i, j) for i, j in zip(x, y) if i == exN]
print('Pr(X=%g) = %.4f' % (exN, tmp[0][1]))

tmp = [(i, j) for i, j in zip(x, y) if i == 0]
print('Pr(X=%g) = %.4f' % (0, tmp[0][1]))
```

Our Monte Carlo simulation may return, for instance:

where the calculated measures are:

```
E(N) = 16
stdN = 4
Pr(X=16) = 0.0992
Pr(X=0) = 0.0000
```

The interpretation is straightforward. On average, we expect to record $16\pm 4$ black swan events in the next 252 business days among the 2500+ stocks traded at NYSE, as of Apr 10, 2015 COB. The probability of observing exactly 16 black swans is 9.92%. It is natural, based on our data analysis, that the resultant probability of the “extreme luck” of not recording any black swan at NYSE, Pr$(X=0)$, in the following trading year is practically zero. But, miracles happen!

9.92% is derived based on a single run of the simulation. Using the analytical formula for Pr$(X=N)$ given above, we compute the uncertainty of
$$\mbox{Pr}[X = (N=16)] = \frac{1}{(\beta_{yr})^{16} 16!} \exp\left(-\frac{1}{\beta_{yr}}\right) \ ,$$

```python
Pr = []
for nsim in range(100000):
    betayr = norm.rvs(loc=tbar, scale=sig/np.sqrt(ne-1), size=1) / inNdays
    p = 1/(betayr**exN * factorial(int(exN))) * np.exp(-1/betayr)
    Pr.append(p)

print('Pr[X = E(N) = %g] = %.2f +- %.2f' % (exN, np.mean(Pr),
                                            np.std(Pr, ddof=1)))
```

that yields

```
Pr[X = E(N) = 16] = 0.08 +- 0.02
```

and stays in agreement with our result.

4. Prediction and Out-of-Sample Backtesting

All right, time to verify the theoretical statistics in practice. Our goal is to analyse the sample of black swan data we have skipped so far but that is available to us. This procedure is usually known as “out-of-sample” backtesting, i.e. we know “the future” as it has already happened, but we pretend it is unknown.

In the first step, we look for a suitable time interval of exactly 252 business days “in the future”. It turns out to be:

```python
print(np.busday_count(dt.date(2015, 4, 10), dt.date(2016, 3, 29)))  # 252
```

Next, we extract the out-of-sample (oos) DataFrame and illustrate it in a similar fashion as previously for the in-sample data:

```python
oos = dfblack0[(dfblack0.index >= dt.date(2015, 4, 10)) &
               (dfblack0.index <= dt.date(2016, 3, 29))]

# extract black swans
oos = oos[oos.Return < locG-scaleG]

plt.figure(figsize=(13, 6))
plt.plot(oos.Return*100, '.b')
plt.xlabel('Time Interval [%s to %s]' % (dfblack.index[0], oos.index[-1]),
           fontsize=14)
plt.ylabel('Black Swan Magnitude [%]', fontsize=14)
plt.title('Time Distribution of Out-of-Sample Black Swans of %g NYSE Stocks '
          '(L < %.1f%%)' % (len(oos), (locG-scaleG)*100), fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.ylim([-100, 0])
plt.xlim([dfblack.index[0], oos.index[-1]])
plt.savefig('fig10.png', format='png')
```

Based on that, we compare the MIBE as predicted with what the next 252 business days reveal:

```python
deltat_oos = np.diff(oos.Delta0)  # time differences between points

tbar_oos = np.mean(deltat_oos)
print('MIBE (predicted) = %.2f +- %.2f [days]' % (tbar, sig/np.sqrt(ne-1)))
print('MIBE out-of-sample = %.2f [days]' % tbar_oos)
print()

y, x = np.histogram(deltat_oos, bins=np.arange(np.ceil(np.max(deltat)+2)),
                    density=True)

print(y[0:5])  # [ 0.3814433   0.20618557  0.17525773  0.06185567  0.05154639]
print(x[0:5])  # [ 0.  1.  2.  3.  4.]

# Predicted probabilities were:
# [ 0.30990415  0.12300319  0.08466454  0.04472843  0.03354633]
```

The outcome is surprising, i.e.

```
MIBE (predicted) = 17.38 +- 2.30 [days]
MIBE out-of-sample = 2.32 [days]
```

clearly indicating that the Apr 2015–Apr 2016 period was nearly 7.5 times more “active” than the “all-time” NYSE black swan data would predict. Have I already said “it’s alarming”?! Well, at least scary enough for all who trade on a daily basis holding portfolios rich in NYSE assets. The probability of observing black swans on the same day increased from 31% to 38%. A new, pure definition of sophisticated gambling.

A verification of the time till the next black swan event requires a gentle modification of the oos DataFrame. Namely, we need to eliminate all data points with the same Delta0 values:

```python
oos2 = oos.Delta0.drop_duplicates(inplace=False)
tdiff = np.diff(oos2)

y, x = np.histogram(tdiff, bins=np.arange(int(np.max(tdiff))), density=True)

_ = plt.figure(figsize=(9, 5))
_ = plt.bar(x[:-1], y, width=1, color=c, edgecolor=c)
_ = plt.xlim([1, 30])
_ = plt.title("Time Till Next Black Swan Event (Out-of-Sample)", fontsize=12)
_ = plt.xlabel("Days", fontsize=12)
_ = plt.ylabel("Probability", fontsize=12)
plt.savefig('fig11.png', format='png')

print(y[1:5])
print(x[1:5])
```

Again, from our modelling of the time till the next event (Section 3.2) we expected about a 5% chance of waiting a day, two, or three for the next black swan event. The out-of-sample data suggest,

```
[0.33898305 0.28813559 0.10169492 0.08474576]
[1 2 3 4]
```

that we are about 6.3 times more likely to record a black swan event at NYSE on the next business day. Terrifying.

The calculation of the expected number of black swans per business day in the out-of-sample data I leave with you as homework in Python programming. Our earlier prediction of those events in the next 252 business days returned $16\pm4$. As we have already seen, in the considered time interval after Apr 10, 2015 we recorded 98 swans, which is $6.1^{+2.1}_{-1.2}$ times more than forecast based on the broad sample.
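For those who want a starting point, below is a minimal sketch of that homework (one possible solution, not the only one; it assumes the variables from the preceding snippets, i.e. deltat_oos, oos, and the imports, are still in memory). It simply repeats the recipe of Section 3.3 with the out-of-sample MIBE.

```python
# out-of-sample analogue of Section 3.3 (a sketch, not the only solution)
tbar_oos = np.mean(deltat_oos)
sig_oos = np.std(deltat_oos, ddof=1)
ne_oos = len(oos)  # number of out-of-sample black swans

beta_oos = float(norm.rvs(loc=tbar_oos, scale=sig_oos/np.sqrt(ne_oos-1), size=1))
nep_oos = poisson.rvs(1/beta_oos, size=100000)

y, x = np.histogram(nep_oos, bins=np.arange(np.max(nep_oos)+2), density=True)
print(y[0:3])  # Pr(X=0), Pr(X=1), Pr(X=2) per business day, out-of-sample
```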

In next post of this series, we will try to develop a methodology for a quantitative derivation of the probability of black swan for $N$-asset portfolio.

Reflection

Investors invest in stocks with the anticipation of earning profit. As we have witnessed above, the birds have become a serious threat to flying high, with a significant increase in this investment air-traffic. Your plane (stock) can be grounded and your airline (portfolio) can suffer huge losses. Hoping is not a solution. Better ways of protecting against the unavoidable should be implemented for investments while your money is in the game.

But what should you do when a black swan strikes? Denzel Washington once said, “If you pray for rain, you gotta deal with the mud too.”

## Predicting Heavy and Extreme Losses in Real-Time for Portfolio Holders (2)

This part is awesome. Trust me! Previously, in Part 1, we examined two independent methods for estimating the probability of a very rare event (a heavy or extreme loss) that an asset could experience on the next day. The first one was based on the computation of the tail probability, i.e.: given the upper threshold $L_{thr}$, find the corresponding $\alpha$ such that $\mbox{Pr}(L < -L_{thr}) = \alpha$, where the random variable $L$ is the daily asset return (in percent). In this case, we showed that the approach was strongly dependent on the number of similar events in the past that would contribute to the changes of $\alpha$ in time. The second method was based on the classical concept of conditional probability. For example, given that the index (benchmark, etc.) drops by $L'$, what is the probability that an asset will lose $L\le L'$? We found that this method fails if the benchmark never experienced a loss of $L'$.

In both cases, the lack of information on the occurrence of a rare event was the major problem. Therefore, is there any tool we can use to predict an event that has not happened yet?!

This post addresses an appealing solution and delivers a black box that every portfolio holder can use as an integrated part of their risk-management computer system.

1. Bayesian Conjugates

Bayesian statistics treats the problem of finding the probability of an event a little differently than what we know from school. For example, if we flip a fair coin, in the long run we expect the same proportion of heads and tails. Formally, this is known as the frequentist interpretation. The Bayesian interpretation assumes nothing about the coin (is it fair or weighted?) and, with every toss of the coin, builds a grander picture of the underlying probability distribution. And it’s fascinating because we can start with the assumption that the coin is weighted towards heads and, after $N$ tosses, find that the coin is in fact fair.
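As a tiny, hedged illustration of that statement (my own snippet, not from the original post), we can start with a flat Beta(1, 1) prior on the probability of heads and update it toss by toss for a coin that is, in fact, fair:

```python
import numpy as np
from scipy.stats import beta

np.random.seed(7)
tosses = np.random.rand(1000) < 0.5  # a fair coin, though we pretend not to know

a0 = b0 = 1  # uninformed Beta(1, 1) prior
for n in [10, 100, 1000]:
    heads = np.sum(tosses[:n])
    a1, b1 = a0 + heads, b0 + n - heads      # conjugate update
    lo, hi = beta.ppf([0.05, 0.95], a1, b1)  # 90% credible interval
    print("after %4d tosses: Pr(head) = %.3f  [%.3f, %.3f]" %
          (n, a1/(a1 + b1), lo, hi))
# the posterior mean converges towards 0.5 and the interval shrinks
```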

Let me skip the theoretical part on that aspect of Bayesian statistics which has been beautifully explained by Mike Halls-Moore within his two articles (Bayesian Statistics: A Beginner’s Guide and Bayesian Inference Of A Binomial Proportion – The Analytical Approach) that I recommend you to study before reading this post further down.

Looking along the return-series of any asset (spanning $M$ days), we may denote by $x_i = 1$ ($i=1,\dots,M$) a day when a rare event took place (e.g. the asset lost more than 16% or so) and by $x_i = 0$ the lack of such an event. Treating days as a series of Bernoulli trials with unknown probability $p$, if we start with an uninformed prior, such as Beta($\alpha, \beta$) with $\alpha=\beta=1$, then the concept of Bayesian conjugates can be applied in order to update the prior belief. The update is given by:
$$\alpha' = \alpha + n_e = \alpha + \sum_{i=1}^{N\le M} x_i \\ \beta' = \beta + N - n_e = \beta + N - \sum_{i=1}^{N\le M} x_i$$ which simply means that after $N$ days from some initial moment in the past, having observed a rare event $n_e$ times, the probability distribution is given by Beta($\alpha', \beta'$). Therefore, the posterior predictive mean for the Bernoulli likelihood can be interpreted as the new probability of the event taking place on the next day:
$$\mbox{Pr}(L < -L_{thr}) = \frac{\alpha'}{\alpha' + \beta'}$$ And that's the key concept we are going to use now for the estimation of the probability of heavy or extreme losses that an asset can experience in trading.

As a quick visualisation, imagine you analyse the last 252 trading days of AAPL stock. The stock never lost more than 90% in one day; no such extreme event took place. According to our new method, the probability that AAPL will plummet 90% (or more) in the next trading session is not zero(!) but
$$\mbox{Pr}(L < -90\%) = \frac{\alpha'}{\alpha' + \beta'} = \frac{1+0}{1+252-0} = 0.3953\%$$ with 95% confidence interval: $$\left[ B^{-1}(0.05, \alpha', \beta');\ B^{-1}(0.95, \alpha', \beta') \right]$$ where $B^{-1}$ denotes the percent point function (quantile function) for Beta distribution.

We estimate the number of events corresponding to a loss of $L < -L_{thr}$ during the next 252 trading days by: $$E(n_e) = \left\lceil \frac{252}{[B^{-1}(0.95, \alpha', \beta')]^{-1}} \right\rceil$$ where $\lceil \cdot \rceil$ denotes the operation of rounding.
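Before moving to real data, a minimal numeric sketch of the AAPL illustration above (my own check; it assumes $N=252$ observed days and $n_e=0$ events below the threshold, exactly as in the example) looks as follows:

```python
import numpy as np
from scipy.stats import beta

N, ne = 252, 0                      # assumed: 252 observed days, no -90% day seen
alpha1, beta1 = 1 + ne, 1 + N - ne  # posterior Beta(alpha', beta')

pr = alpha1/(alpha1 + beta1)        # posterior predictive mean, roughly 0.4%
cl1, cl2 = beta.ppf([0.05, 0.95], alpha1, beta1)
print("Pr(L < -90%%) = %.2f%%  [%.2f%%, %.2f%%]" % (pr*100, cl1*100, cl2*100))
print("E(ne) over the next 252 days = %g" % np.ceil(252*cl2))
```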

2. Analysis of Facebook since its IPO

Having a tool that includes information on whether an event took place or not is powerful. You may also notice that the more information we gather (i.e. $N\to\infty$), the closer we are to the “true” probability function. It also means that the spread around the posterior predictive mean:
$$\sigma = \sqrt{ \frac{\alpha' \beta'} { (\alpha'+\beta')^2 (\alpha'+\beta'+1) } } \ \to \ 0 .$$ It is logical that the analysis of an asset traded daily for the past 19 years will deliver a “more trustworthy” estimation of $\mbox{Pr}(L < -L_{thr})$ than a situation where we take into account only a limited period of time. However, what if the stock is a newbie in the market?

Let’s investigate the case of Facebook (NASDAQ:FB). The information we have on record spans a bit over 3.5 years, back to Facebook’s initial public offering (IPO) on May 18th, 2012. From that day until today (December 6, 2015) the return-series (daily data) is $M=892$ points long. Using the analytical formulations given above and writing a Python code:

```python
# Predicting Heavy and Extreme Losses in Real-Time for Portfolio Holders (2)
# (c) 2015 Pawel Lachowicz, QuantAtRisk.com

import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt
import pandas_datareader.data as web


# Fetching Yahoo! Finance for Facebook
data = web.DataReader("FB", data_source='yahoo',
                      start='2012-05-18', end='2015-12-04')['Adj Close']
cp = np.array(data.values)  # daily adj-close prices
ret = cp[1:]/cp[:-1] - 1    # compute daily returns
N = len(ret)

# Plotting FB price- and return-series
plt.figure(num=2, figsize=(9, 6))
plt.subplot(2, 1, 1)
plt.plot(cp)
plt.axis("tight")
plt.ylabel("FB Adj Close [USD]")
plt.subplot(2, 1, 2)
plt.plot(ret, color=(.6, .6, .6))
plt.axis("tight")
plt.ylabel("Daily Returns")
plt.xlabel("Time Period [days]")
#plt.show()

# provide a threshold
Lthr = -0.21

# how many events of L < -21% occurred?
ne = np.sum(ret < Lthr)
# of what magnitude?
ev = ret[ret < Lthr]
# avgloss = np.mean(ev)  # if ev is a non-empty array

# prior
alpha0 = beta0 = 1
# posterior
alpha1 = alpha0 + ne
beta1 = beta0 + N - ne
pr = alpha1/(alpha1+beta1)
cl1 = beta.ppf(0.05, alpha1, beta1)
cl2 = beta.ppf(0.95, alpha1, beta1)
ne252 = np.round(252/(1/cl2))

print("ne = %g" % ne)
print("alpha', beta' = %g, %g" % (alpha1, beta1))
print("Pr(L < %3g%%) = %5.2f%%\t[%5.2f%%, %5.2f%%]" %
      (Lthr*100, pr*100, cl1*100, cl2*100))
print("E(ne) = %g" % ne252)
```

first we visualise both the price- and return-series:

and from the computations we find:

```
ne = 0
alpha', beta' = 1, 893
Pr(L < -21%) =  0.11%   [ 0.01%,  0.33%]
E(ne) = 1
```

There is no record of an event within the 892 days of FB trading, $n_e = 0$, in which the stock lost 21% or more in a single day. However, the probability that such an event may happen at the end of the next NASDAQ session is 0.11%. Non-zero, but nearly zero.

A quick verification of the probability estimate returned by our Bayesian method can be obtained by changing the value of the threshold from -0.21 to Lthr = 0. Then, we get:

```
alpha', beta' = 422, 472
Pr(L < 0%) = 47.20%   [44.46%, 49.95%]
E(ne) = 126
```

i.e. a 47.2% chance that the stock will close lower on the next day. This is in excellent agreement with what can be calculated from the return-series directly: there were 421 days with negative daily returns, thus $421/892 = 0.4719$. Really cool, don’t you think?!

Now, imagine for a second that starting a few days after the Facebook IPO we recalculate the posterior predictive mean day-by-day, eventually reaching the most current estimate of the probability that the stock may lose 21% or more at the end of the next trading day. Such a procedure allows us to plot the change of $\mbox{Pr}(L < -L_{thr})$ over the whole trading history of FB. We achieve that by running a modified version of the previous code:

```python
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt
import pandas_datareader.data as web


data = web.DataReader("FB", data_source='yahoo',
                      start='2012-05-18', end='2015-12-05')['Adj Close']
cp = np.array(data.values)    # daily adj-close prices
returns = cp[1:]/cp[:-1] - 1  # compute daily returns

Lthr = -0.21

Pr = CL1 = CL2 = np.array([])
for i in range(2, len(returns)):
    # data window
    ret = returns[0:i]
    N = len(ret)
    ne = np.sum(ret < Lthr)
    # prior
    alpha0 = beta0 = 1
    # posterior
    alpha1 = alpha0 + ne
    beta1 = beta0 + N - ne
    pr = alpha1/(alpha1+beta1)
    cl1 = beta.ppf(0.05, alpha1, beta1)
    cl2 = beta.ppf(0.95, alpha1, beta1)
    #
    Pr = np.concatenate([Pr, np.array([pr*100])])
    CL1 = np.concatenate([CL1, np.array([cl1*100])])
    CL2 = np.concatenate([CL2, np.array([cl2*100])])

plt.figure(num=1, figsize=(10, 5))
plt.plot(Pr, "k")
plt.plot(CL1, "r--")
plt.plot(CL2, "r--")
plt.axis("tight")
plt.ylim([0, 5])
plt.grid(True)
plt.show()
```

which displays:

i.e. after 892 days we reach the value of 0.11% as found earlier. The black line tracks the change of the probability estimate while both red dashed lines denote the 95% confidence interval. The black line is smooth: because of the lack of a -21% event in the past, the recalculated probability did not change its value “drastically”.

This is not the case when we consider the scenario for Lthr = -0.11, delivering:

which stays in agreement with:

```
ne = 1
alpha', beta' = 2, 892
Pr(L < -11%) =  0.22%   [ 0.04%,  0.53%]
E(ne) = 1
```

and confirms that only one daily loss of $L < -11\%$ took place during the entire (so far happy) life of Facebook on NASDAQ.

Since the computer can count the number of losses falling within a given interval, we also have an opportunity to estimate:
$$\mbox{Pr}(L_1 \le L < L_2)$$ in the following way:

```python
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt
import pandas_datareader.data as web


data = web.DataReader("FB", data_source='yahoo',
                      start='2012-05-18', end='2015-12-05')['Adj Close']
cp = np.array(data.values)  # daily adj-close prices
ret = cp[1:]/cp[:-1] - 1    # compute daily returns
N = len(ret)

dL = 2
for k in range(-50, 0, dL):
    ev = ret[(ret >= k/100) & (ret < (k/100+dL/100))]
    avgloss = 0
    ne = np.sum((ret >= k/100) & (ret < (k/100+dL/100)))
    #
    # prior
    alpha0 = beta0 = 1
    # posterior
    alpha1 = alpha0 + ne
    beta1 = beta0 + N - ne
    pr = alpha1/(alpha1+beta1)
    cl1 = beta.ppf(0.05, alpha1, beta1)
    cl2 = beta.ppf(0.95, alpha1, beta1)
    if(len(ev) > 0):
        avgloss = np.mean(ev)
        print("Pr(%3g%% < L < %3g%%) =\t%5.2f%%\t[%5.2f%%, %5.2f%%] ne = %4g"
              " E(L) = %.2f%%" %
              (k, k+dL, pr*100, cl1*100, cl2*100, ne, avgloss*100))
    else:
        print("Pr(%3g%% < L < %3g%%) =\t%5.2f%%\t[%5.2f%%, %5.2f%%] ne = "
              "%4g" % (k, k+dL, pr*100, cl1*100, cl2*100, ne))
```

By doing so, we visualise the distribution of potential losses (that may occur on the next trading day) along with their magnitudes:

```
Pr(-50% < L < -48%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-48% < L < -46%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-46% < L < -44%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-44% < L < -42%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-42% < L < -40%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-40% < L < -38%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-38% < L < -36%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-36% < L < -34%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-34% < L < -32%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-32% < L < -30%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-30% < L < -28%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-28% < L < -26%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-26% < L < -24%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-24% < L < -22%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-22% < L < -20%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-20% < L < -18%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-18% < L < -16%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-16% < L < -14%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-14% < L < -12%) =   0.11%   [ 0.01%,  0.33%] ne =    0
Pr(-12% < L < -10%) =   0.34%   [ 0.09%,  0.70%] ne =    2 E(L) = -11.34%
Pr(-10% < L <  -8%) =   0.67%   [ 0.29%,  1.17%] ne =    5 E(L) = -8.82%
Pr( -8% < L <  -6%) =   0.89%   [ 0.45%,  1.47%] ne =    7 E(L) = -6.43%
Pr( -6% < L <  -4%) =   2.91%   [ 2.05%,  3.89%] ne =   25 E(L) = -4.72%
Pr( -4% < L <  -2%) =   9.84%   [ 8.26%, 11.53%] ne =   87 E(L) = -2.85%
Pr( -2% < L <   0%) =  33.11%   [30.54%, 35.72%] ne =  295 E(L) = -0.90%
```

For example, based on the complete trading history of FB, there is a probability of 0.34% that the stock will close with a loss between -12% and -10%, and if so, the “expected” loss would be -11.3% (estimated based solely on 2 events).

Again. The probability, not certainty.

## Student t Distributed Linear Value-at-Risk

One of the most underestimated features of financial asset distributions is their kurtosis. A rough approximation of the asset return distribution by the Normal distribution often becomes an evident exaggeration or misinterpretation of the facts. And we know that. The problem arises if we investigate a Value-at-Risk (VaR) measure. Within a standard approach, it is computed based on the analytical formula:
$$\mbox{VaR}_{h, \alpha} = \Phi ^{-1} (1-\alpha)\sqrt{h}\sigma - h\mu$$ which expresses the $(1-\alpha)$ $h$-day VaR, where $\Phi ^{-1} (1-\alpha)$ denotes the standardised Normal distribution $(1-\alpha)$ quantile and $h$ is the time horizon. The majority of stocks display significant excess kurtosis and negative skewness. In plain English that means that the distribution is peaked and likely to have a heavy left tail. In this scenario, the best fit of the Normal probability density function (pdf) to the asset return distribution underestimates the risk accumulated in far negative territory.
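To make the scaling explicit, here is a minimal sketch of that analytical formula (with assumed daily $\mu$ and $\sigma$ values, chosen only to illustrate the $\sqrt{h}$ behaviour; they are not fitted to any asset):

```python
from scipy.stats import norm

mu, sig = 0.0001, 0.012  # assumed daily mean and standard deviation of returns
alpha = 0.01             # significance level
for h in [1, 10]:        # time horizons in days
    VaR_h = norm.ppf(1 - alpha)*(h**0.5)*sig - h*mu
    print("%g%% %2d-day Normal VaR = %.2f%%" % (100*(1 - alpha), h, VaR_h*100))
```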

In this post we will introduce the concept of Student t Distributed Linear VaR, i.e. an alternative method to measure VaR by taking into account “a correction” for kurtosis of the asset return distribution. We will use the market stock data of IBM as an exemplary case study and investigate the difference in a standard and non-standard VaR calculation based on the parametric models. It will also be an excellent opportunity to learn how to do it in Python, quickly and effectively.

1. Leptokurtic Distributions and Student t VaR

A leptokurtic distribution is one whose density function has a higher peak and greater mass in the tails than the normal density function of the same variance. In a symmetric unimodal distribution, i.e. one whose density function has only one peak, leptokurtosis is indicated by a positive excess kurtosis (Alexander 2008). The impact of leptokurtosis on VaR is non-negligible. Firstly, assuming high significance levels, e.g. $\alpha \le 0.01$, the estimation of VaR for Normal and leptokurtic distribution would return:
$$\mbox{VaR}_{lepto,\ \alpha\le 0.01} \gt \mbox{VaR}_{Norm,\ \alpha\le 0.01}$$
which is fine, however for low significance levels the situation may change:
$$\mbox{VaR}_{lepto,\ \alpha\ge 0.05} \lt \mbox{VaR}_{Norm,\ \alpha\ge 0.05} \ .$$
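A quick numerical check of both inequalities (my own sketch; it uses the variance-rescaled Student t quantile introduced below, with an assumed $\nu=5$) makes the switch-over visible:

```python
from scipy.stats import norm, t

nu = 5  # assumed degrees of freedom of a leptokurtic return distribution
for a in [0.01, 0.05]:
    q_norm = norm.ppf(1 - a)                     # Normal quantile
    q_t = ((nu - 2)/nu)**0.5 * t.ppf(1 - a, nu)  # variance-rescaled Student t quantile
    print("alpha = %.2f:  Student t %.3f  vs  Normal %.3f" % (a, q_t, q_norm))
# at alpha = 0.01 the Student t quantile is the larger one (heavier tail);
# at alpha = 0.05 the ordering reverses, as stated above
```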
A good example of a leptokurtic distribution is the Student t distribution. If it describes the data better and displays significant excess kurtosis, there is a good chance that VaR for high significance levels will be a much better measure of the tail risk. The Student t distribution is formally given by its pdf:
$$f_\nu(x) = (\nu\pi)^{-1/2}\ \Gamma\left(\frac{\nu}{2}\right)^{-1} \Gamma\left(\frac{\nu+1}{2}\right) (1+\nu^{-1} x^2)^{-(\nu+1)/2}$$ where $\Gamma$ denotes the gamma function and $\nu$ the degrees of freedom. For $\nu\gt 2$ its variance is $\nu(\nu-2)^{-1}$ and the distribution has a finite excess kurtosis of $\kappa = 6(\nu-4)^{-1}$ for $\nu \gt 4$. It is obvious that for $\nu\to \infty$ the Student t pdf approaches the Normal pdf. Large values of $\nu$ for the fitted Student t pdf are unlikely in the case of financial asset daily return distributions, unless the Normal distribution is indeed the best pdf model (possible to obtain for very large samples).
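As a quick sanity check of those moment formulas (a side note, not part of the original text), scipy reproduces both the variance $\nu(\nu-2)^{-1}$ and the excess kurtosis $6(\nu-4)^{-1}$ directly:

```python
from scipy.stats import t

for nu in [5, 6, 10, 100]:
    var, kurt = t.stats(nu, moments='vk')  # variance and (excess) kurtosis
    print("nu = %3d:  var = %.4f (formula %.4f),  excess kurtosis = %.4f (formula %.4f)"
          % (nu, var, nu/(nu - 2), kurt, 6/(nu - 4)))
# for nu -> infinity both approach the Normal values (1 and 0)
```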

We find the $\alpha$ quantile of the Student t distribution ($\mu=0$ and $\sigma=1$) by the integration of the pdf and usually denote it as $\sqrt{\nu^{-1}(\nu-2)}\ t_{\nu}^{-1}(\alpha)$, where $t_{\nu}^{-1}$ is the percent point function of the t distribution. Accounting for the fact that in practice we deal with random variables of the form $X = \mu + \sigma T$, where $X$ describes the asset return given as a transformation of the standardised Student t random variable, we reach the formal definition of the $(1-\alpha)$ $h$-day Student t VaR given by:
$$\mbox{Student}\ t\ \mbox{VaR}_{\nu, \alpha, h} = \sqrt{\nu^{-1}(\nu-2)h}\ t_{\nu}^{-1}(1-\alpha)\sigma - h\mu$$ Having that, we are ready to use this knowledge in practice and to understand, based on data analysis, the regimes where one should (and should not) apply the Student t VaR as a more “correct” measure of risk.

2. Normal and Student t VaR for IBM Daily Returns

Within the following case study we will make use of the Yahoo! Finance data provider in order to fetch the price-series of the IBM stock (adjusted close) and transform it into a return-series. The first thing we have to verify before heading further down this road is the sample skewness and kurtosis. We begin, as usual:

```python
# Student t Distributed Linear Value-at-Risk
# The case study of VaR at the high significance levels
# (c) 2015 Pawel Lachowicz, QuantAtRisk.com

import numpy as np
import math
from scipy.stats import skew, kurtosis, kurtosistest
import matplotlib.pyplot as plt
from scipy.stats import norm, t
import pandas_datareader.data as web


# Fetching Yahoo! Finance for IBM stock data
data = web.DataReader("IBM", data_source='yahoo',
                      start='2010-12-01', end='2015-12-01')['Adj Close']
cp = np.array(data.values)  # daily adj-close prices
ret = cp[1:]/cp[:-1] - 1    # compute daily returns

# Plotting IBM price- and return-series
plt.figure(num=2, figsize=(9, 6))
plt.subplot(2, 1, 1)
plt.plot(cp)
plt.axis("tight")
plt.ylabel("IBM Adj Close [USD]")
plt.subplot(2, 1, 2)
plt.plot(ret, color=(.6, .6, .6))
plt.axis("tight")
plt.ylabel("Daily Returns")
plt.xlabel("Time Period 2010-12-01 to 2015-12-01 [days]")
```

where we cover the most recent 5 years of trading of IBM at NYSE. Both series, visualised, look like:

where, in particular, the bottom one reveals five days on which the stock experienced a 5% or greater daily loss.

In Python, a correct computation of the sample skewness and kurtosis is possible thanks to the scipy.stats module. As a test you may verify that for a huge sample (e.g. of 10 million) of random variables $X\sim N(0, 1)$, the expected skewness and kurtosis,

```python
import numpy as np
from scipy.stats import skew, kurtosis

X = np.random.randn(10000000)
print(skew(X))
print(kurtosis(X, fisher=False))
```

are indeed equal to 0 and 3,

```
0.0003890933008049
3.0010747070253507
```

respectively.

As we will convince ourselves in a second, this is not the case for our IBM return sample. Continuing,

```python
print("Skewness = %.2f" % skew(ret))
print("Kurtosis = %.2f" % kurtosis(ret, fisher=False))
# H_0: the null hypothesis that the kurtosis of the population from which the
# sample was drawn is that of the normal distribution kurtosis = 3(n-1)/(n+1)
_, pvalue = kurtosistest(ret)
beta = 0.05
print("p-value = %.2f" % pvalue)
if(pvalue < beta):
    print("Reject H_0 in favour of H_1 at %.5f level\n" % beta)
else:
    print("Accept H_0 at %.5f level\n" % beta)
```

we compute the moments and run the (excess) kurtosis significance test. As already explained in the comment, we are interested in whether the derived value of kurtosis is significant at the $\beta=0.05$ level. For IBM, the results reveal:

```
Skewness = -0.70
Kurtosis = 8.39
p-value = 0.00
Reject H_0 in favour of H_1 at 0.05000 level
```

i.e., we reject the null hypothesis that the kurtosis of the population from which the sample was drawn is that of the normal distribution, in favour of the alternative hypothesis ($H_1$) that it is not. And if not, then we have work to do!

In the next step we fit the IBM daily returns data sample with both Normal pdf and Student t pdf in the following way:

```python
# N(x; mu, sig) best fit (finding: mu, stdev)
mu_norm, sig_norm = norm.fit(ret)
dx = 0.0001  # resolution
x = np.arange(-0.1, 0.1, dx)
pdf = norm.pdf(x, mu_norm, sig_norm)
print("Integral norm.pdf(x; mu_norm, sig_norm) dx = %.2f" % (np.sum(pdf*dx)))
print("Sample mean  = %.5f" % mu_norm)
print("Sample stdev = %.5f" % sig_norm)
print()

# Student t best fit (finding: nu)
parm = t.fit(ret)
nu, mu_t, sig_t = parm
pdf2 = t.pdf(x, nu, mu_t, sig_t)
print("Integral t.pdf(x; mu, sig) dx = %.2f" % (np.sum(pdf2*dx)))
print("nu = %.2f" % nu)
print()
```

where we ensure that the integrals over both pdfs return 1, as expected:

```
Integral norm.pdf(x; mu_norm, sig_norm) dx = 1.00
Sample mean  = 0.00014
Sample stdev = 0.01205

Integral t.pdf(x; mu, sig) dx = 1.00
nu = 3.66
```

The best fit of the Student t pdf returns the estimate of 3.66 degrees of freedom.

From this point, the last step is the derivation of two distinct VaR measures, the first one corresponding to the best fit of the Normal pdf and the second one to the best fit of the Student t pdf, respectively. The trick is

```python
# Compute VaR
h = 1         # days
alpha = 0.01  # significance level
StudenthVaR = (h*(nu-2)/nu)**0.5 * t.ppf(1-alpha, nu)*sig_norm - h*mu_norm
NormalhVaR = norm.ppf(1-alpha)*sig_norm - mu_norm

lev = 100*(1-alpha)
print("%g%% %g-day Student t VaR = %.2f%%" % (lev, h, StudenthVaR*100))
print("%g%% %g-day Normal VaR    = %.2f%%" % (lev, h, NormalhVaR*100))
```

that in the calculation of the $(1-\alpha)$ $1$-day Student t VaR ($h=1$) according to:
$$\mbox{Student}\ t\ \mbox{VaR}_{\nu, \alpha=0.01, h=1} = \sqrt{\nu^{-1}(\nu-2)}\ t_{\nu}^{-1}(0.99)\sigma - \mu$$ we have to take $\sigma$ and $\mu$ given by the sig_norm and mu_norm variables (in the code) and not the ones returned by the t.fit(ret) function. We find that:

```
99% 1-day Student t VaR = 3.19%
99% 1-day Normal VaR    = 2.79%
```

i.e. VaR derived based on the Normal pdf best fit may underestimate the tail risk for IBM, as suspected in the presence of highly significant excess kurtosis. Since the beauty of the code is aesthetic, the picture is romantic. We visualise the results of our investigation, extending our code with:

```python
plt.figure(num=1, figsize=(11, 6))
grey = .77, .77, .77
# main figure
plt.hist(ret, bins=50, normed=True, color=grey, edgecolor='none')
plt.hold(True)
plt.axis("tight")
plt.plot(x, pdf, 'b', label="Normal PDF fit")
plt.hold(True)
plt.axis("tight")
plt.plot(x, pdf2, 'g', label="Student t PDF fit")
plt.xlim([-0.2, 0.1])
plt.ylim([0, 50])
plt.legend(loc="best")
plt.xlabel("Daily Returns of IBM")
plt.ylabel("Normalised Return Distribution")
# inset
a = plt.axes([.22, .35, .3, .4])
plt.hist(ret, bins=50, normed=True, color=grey, edgecolor='none')
plt.hold(True)
plt.plot(x, pdf, 'b')
plt.hold(True)
plt.plot(x, pdf2, 'g')
plt.hold(True)
# Student VaR line
plt.plot([-StudenthVaR, -StudenthVaR], [0, 3], c='g')
# Normal VaR line
plt.plot([-NormalhVaR, -NormalhVaR], [0, 4], c='b')
plt.text(-NormalhVaR-0.01, 4.1, "Norm VaR", color='b')
plt.text(-StudenthVaR-0.0171, 3.1, "Student t VaR", color='g')
plt.xlim([-0.07, -0.02])
plt.ylim([0, 5])
plt.show()
```

which brings us to:

where in the inset we zoom in on the left tail of the distribution, marking both VaR results. Now it is much more clearly visible why, at $\alpha=0.01$, the Student t VaR is greater: the effect of the leptokurtic distribution. Moreover, the fit of the Student t pdf appears to be a far better parametric model describing the central mass (density) of the IBM daily returns.

There is an ongoing debate over which of these two risk measures should be formally applied when it comes to the reporting of 1-day VaR. I do hope that just by presenting the above results I have made you more aware that: (1) it is better to have a closer look at the excess kurtosis of your favourite asset (portfolio) return distribution, and (2) normal is boring.

Someone wise once said: “Don’t be afraid of being different. Be afraid of being the same as everyone else.” The same means normal. Dare to fit non-Normal distributions!


REFERENCES
Alexander, C., 2008, Market Risk Analysis. Volume IV., John Wiley & Sons Ltd

## Predicting Heavy and Extreme Losses in Real-Time for Portfolio Holders (1)

The probability of improbable events. The simplicity amongst complexity. The purity in its best form. The ultimate cure for those who trade, for those who invest. Does it exist? Can we compute it? Is it really something impossible? In this post we challenge ourselves to the frontiers of accessible statistics and data analysis in order to find most optimal computable solutions to this enigmatic but mind-draining problem. Since there is no certainty, everything remains shrouded in the veil of probability. The probability we have to face and hope that something unlikely, determined to take place anyhow, eventually will not happen.

Our goal is to calculate the probability of a very rare event (e.g. a heavy and/or extreme loss) in the trading market (e.g. of a stock plummeting 5% or much more) in a specified time-horizon (e.g. on the next day, in one week, in one month, etc.). The probability. Not the certainty of that event.

In this Part 1, first, we look at the tail of an asset return distribution and compress our knowledge on Value-at-Risk (VaR) to extract the essence required to understand why VaR-stuff is not the best card in our deck. Next, we move to a classical Bayes’ theorem which helps us to derive a conditional probability of a rare event given… yep, another event that (hypothetically) will take place. Eventually, in Part 2, we will hit the bull between its eyes with an advanced concept taken from the Bayesian approach to statistics and map, in real-time, for any return-series its loss probabilities. Again, the probabilities, not certainties.

1. VaR (not) for Rare Events

In the framework of VaR we take into consideration $T$ days of the trading history of an asset. Next, we derive a number (VaR) that describes a loss that is likely to take place with a probability of approximately $\alpha$. “To take place” does not mean here that it will take place. In this approach we try to provide some likelihood (a quantitative measure) of a rare event in a specified time-horizon (e.g. on the next day, if daily return-series are under our VaR investigation; the scenario considered in this post).

If by $L$ we denote the loss (in percent) an asset can experience on the next day, then:
$$\mbox{Pr}(L \le -\mbox{VaR}_{1-\alpha}) = \alpha$$ is the probability, equal for instance to $\alpha=0.05$, of a dollar loss of $-L\times D$ where $-L\times D \ge \mbox{VaR}_{1-\alpha}\times D$ and $D$ is the position size (money invested in the asset in terms of physical currency, e.g. in dollars); this is also referred to as the $(1-\alpha)$ VaR measure ($\mbox{VaR}_{95}$, etc.). In other words, the historical data can help us to find $\mbox{VaR}_{95}$ given $\alpha$, assuming a 5% chance that $\mbox{VaR}_{1-\alpha}$ will be exceeded on the next day.

In order to illustrate that case and its shortcomings when it comes to the analysis of rare events, let’s look at the 10-year trading history of two stocks in the NASDAQ market: the highly volatile CAAS (China Automotive Systems, Inc.; market capitalisation of 247M) and the highly liquid AAPL (Apple Inc.; market capitalisation of 750B). First, we fetch their adjusted-close price-series from Yahoo! Finance and derive the corresponding daily return-series utilising Python:

```python
# Predicting Heavy and Extreme Losses in Real-Time for Portfolio Holders
# (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
#
# heavy1.py

import pandas.io.data as web
import matplotlib.pyplot as plt
import numpy as np
from pyvar import findvar, findalpha


# ---1. Data Processing

# fetch and download daily adjusted-close price series for CAAS
# and AAPL stocks using Yahoo! Finance public data provider
caas = web.DataReader("CAAS", data_source='yahoo',
                      start='2005-05-13', end='2015-05-13')['Adj Close']
aapl = web.DataReader("AAPL", data_source='yahoo',
                      start='2005-05-13', end='2015-05-13')['Adj Close']

CAAScp = np.array(caas.values)
AAPLcp = np.array(aapl.values)

f = file("data1.dat", "wb")
np.save(f, CAAScp)
np.save(f, AAPLcp)
f.close()

# read in the data from a file
f = file("data1.dat", "rb")
CAAScp = np.load(f)
AAPLcp = np.load(f)
f.close()

# compute return-series
retCAAS = CAAScp[1:]/CAAScp[:-1]-1
retAAPL = AAPLcp[1:]/AAPLcp[:-1]-1
```

The best way to understand the data is by plotting them:

```python
# plotting (figure #1)
# adjusted-close price-series
fig, ax1 = plt.subplots(figsize=(10, 6))
plt.xlabel("Trading days 13/05/2005-13/05/2015")
plt.plot(CAAScp, '-r', label="CAAS")
plt.axis("tight")
plt.legend(loc=(0.02, 0.8))
plt.ylabel("CAAS Adj Close Price (US$)")
ax2 = ax1.twinx()
plt.plot(AAPLcp, '-', label="AAPL")
plt.legend(loc=(0.02, 0.9))
plt.axis("tight")
plt.ylabel("AAPL Adj Close Price (US$)")

# plotting (figure #2)
# daily return-series
plt.figure(num=2, figsize=(10, 6))
plt.subplot(211)
plt.grid(True)
plt.plot(retCAAS, '-r', label="CAAS")
plt.axis("tight")
plt.ylim([-0.25, 0.5])
plt.legend(loc="upper right")
plt.ylabel("CAAS daily returns")
plt.subplot(212)
plt.grid(True)
plt.plot(retAAPL, '-', label="AAPL")
plt.legend(loc="upper right")
plt.axis("tight")
plt.ylim([-0.25, 0.5])
plt.ylabel("AAPL daily returns")
plt.xlabel("Trading days 13/05/2005-13/05/2015")
```

We obtain the price-series

and return-series

respectively. For the latter plot, by fixing the scaling of both $y$-axes we immediately gain a chance to inspect the number of daily trades closing with heavy losses. Well, at least at first glance and for both directions of trading. In this post we will be considering long positions only.

Having our data pre-processed, we may implement two different strategies to make use of the VaR framework in order to work out the probabilities for tail events. The first one is based on setting the $\alpha$ level and finding $-VaR_{1-\alpha}$. Let’s assume $\alpha=0.01$; then the following piece of code

# ---2. Computation of VaR given alpha

alpha = 0.01

VaR_CAAS, bardata1 = findvar(retCAAS, alpha=alpha, nbins=200)
VaR_AAPL, bardata2 = findvar(retAAPL, alpha=alpha, nbins=100)

cl = 100.*(1-alpha)

aims at computation of the corresponding numbers making use of the function:

def findvar(ret, alpha=0.05, nbins=100):
    # Function computes the empirical Value-at-Risk (VaR) for return-series
    #  (ret) defined as NumPy 1D array, given alpha
    # (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
    #
    # compute a normalised histogram (\int H(x)dx = 1)
    #  nbins: number of bins used (recommended nbins>50)
    hist, bins = np.histogram(ret, bins=nbins, density=True)
    wd = np.diff(bins)
    # cumulative sum from -inf to +inf
    cumsum = np.cumsum(hist * wd)
    # find an area of H(x) for computing VaR
    crit = cumsum[cumsum <= alpha]
    n = len(crit)
    # (1-alpha)VaR
    VaR = bins[n]
    # supplementary data of the bar plot
    bardata = hist, n, wd
    return VaR, bardata

Here, we create the histogram with a specified number of bins, and $-\mbox{VaR}_{1-\alpha}$ is found in an empirical manner. For many reasons this approach is much better than fitting a Normal distribution to the data and finding VaR based on the integration of that continuous function. It is a well-known fact that such a function would underestimate the probabilities in the far tail of the return distribution. And the game is all about capturing what is going on out there, right?
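To illustrate that underestimation, here is a small sketch (using a synthetic, fat-tailed Student-t return sample rather than the CAAS/AAPL data) comparing the empirical 1% quantile with the one implied by a fitted Normal distribution:

# a sketch: Gaussian fit vs empirical 1% quantile for a fat-tailed sample
import numpy as np
from scipy.stats import norm, t

np.random.seed(2)
ret = t.rvs(df=3, scale=0.01, size=2500)   # hypothetical heavy-tailed daily returns
emp_q = np.percentile(ret, 1)              # empirical 1% quantile
mu, sig = norm.fit(ret)                    # fit a Normal distribution
fit_q = norm.ppf(0.01, loc=mu, scale=sig)  # 1% quantile of the fitted Normal
print("empirical: %.2f%%   normal fit: %.2f%%" % (emp_q*100, fit_q*100))
# for heavy-tailed data the fitted Normal usually reports a milder (less negative) loss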

We display the results of the computation as follows:

print("%g%% VaR (CAAS) = %.2f%%" % (cl, VaR_CAAS*100.))
print("%g%% VaR (AAPL) = %.2f%%\n" % (cl, VaR_AAPL*100.))

i.e.

99% VaR (CAAS) = -9.19%
99% VaR (AAPL) = -5.83%

In order to gain a good feeling for those numbers, we display the left tails of both return distributions

85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 # plotting (figure #3) # histograms of daily returns; H(x) # plt.figure(num=3, figsize=(10, 6)) c = (.7,.7,.7) # grey color (RGB) # # CAAS ax = plt.subplot(211) hist1, bins1 = np.histogram(retCAAS, bins=200, density=False) widths = np.diff(bins1) b = plt.bar(bins1[:-1], hist1, widths, color=c, edgecolor="k", label="CAAS") plt.legend(loc=(0.02, 0.8)) # # mark in red all histogram values where int_{-infty}^{VaR} H(x)dx = alpha hn, nb, _ = bardata1 for i in range(nb): b[i].set_color('r') b[i].set_edgecolor('k') plt.text(-0.225, 30, "VaR$_{%.0f}$ (CAAS) = %.2f%%" % (cl, VaR_CAAS*100.)) plt.xlim([-0.25, 0]) plt.ylim([0, 50]) # # AAPL ax2 = plt.subplot(212) hist2, bins2 = np.histogram(retAAPL, bins=100, density=False) widths = np.diff(bins2) b = plt.bar(bins2[:-1], hist2, widths, color=c, edgecolor="k", label="AAPL") plt.legend(loc=(0.02, 0.8)) # # mark in red all histogram bars where int_{-infty}^{VaR} H(x)dx = alpha hn, nb, wd = bardata2 for i in range(nb): b[i].set_color('r') b[i].set_edgecolor('k') plt.text(-0.225, 30, "VaR$_{%.0f}$ (AAPL) = %.2f%%" % (cl, VaR_AAPL*100.)) plt.xlim([-0.25, 0]) plt.ylim([0, 50]) plt.xlabel("Stock Returns (left tail only)") plt.show()

where we mark our $\alpha=$1% regions in red:

As you may notice, so far we haven’t done much new: a classical textbook example coded in Python. However, the last figure reveals the main players of the game. For instance, there is only 1 event of a daily loss larger than 15% for AAPL, while CAAS experienced 4 heavy losses. The much higher 99% VaR for CAAS takes those 4 historical events into account and puts more weight on the 1-day VaR as estimated in our calculation.

Imagine now that we monitor all those tail extreme/rare events day by day. It’s not too difficult to notice (based on the inspection of Figure #2) that in the case of AAPL, the stock recorded its first serious loss of $L \lt -10$% approximately 650 days after the beginning of our “monitoring”, which commenced on May 13, 2005. In contrast, CAAS was much more volatile and you needed to wait only ca. 100 days to record a loss of the same magnitude.
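Such waiting times are easy to pull out programmatically; a short sketch (reusing the retCAAS and retAAPL return-series computed above) could look as follows:

# a sketch: index of the first trading day with a loss below the threshold
import numpy as np

thr = -0.10
for name, ret in [("CAAS", retCAAS), ("AAPL", retAAPL)]:
    days = np.where(ret < thr)[0]
    if len(days) > 0:
        print("%s: first loss below %.0f%% on trading day #%d" % (name, thr*100., days[0]))
    else:
        print("%s: no loss below %.0f%% in the sample" % (name, thr*100.))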

If something has not yet happened, e.g. a loss of $L \lt -10$%, a VaR-like measure is a highly inadequate gauge of the probability of rare events. Once the event takes place, the VaR estimate changes (is updated), but it then decreases in time until a new rare event occurs. Let’s illustrate it with Python. This is our second VaR strategy: finding $\alpha$ given the threshold for rare events. We write a simple function that does the job for us:

def findalpha(ret, thr=1, nbins=100):
    # Function computes the probability P(X<thr)=alpha given threshold
    #  level (thr) and return-series (NumPy 1D array). X denotes the
    #  returns as a rv and nbins is number of bins used for histogram
    # (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
    #
    # compute normalised histogram (\int H(x)dx=1)
    hist, bins = np.histogram(ret, bins=nbins, density=True)
    # compute a default histogram
    hist1, bins1 = np.histogram(ret, bins=nbins, density=False)
    wd = np.diff(bins1)
    x = np.where(bins1 < thr)
    y = np.where(hist1 != 0)
    z = list(set(x[0]).intersection(set(y[0])))
    crit = np.cumsum(hist[z]*wd[z])
    # find alpha
    try:
        alpha = crit[-1]
    except Exception as e:
        alpha = 0
    # count number of events falling into (-inf, thr] intervals
    nevents = np.sum(hist1[z])
    return alpha, nevents

We call it in our main program:

# ---3. Computation of alpha, given the threshold for rare events

thr = -0.10

alpha1, ne1 = findalpha(retCAAS, thr=thr, nbins=200)
alpha2, ne2 = findalpha(retAAPL, thr=thr, nbins=200)

print("CAAS:")
print(" Pr( L < %.2f%% ) = %.2f%%" % (thr*100., alpha1*100.))
print(" %g historical event(s)" % ne1)
print("AAPL:")
print(" Pr( L < %.2f%% ) = %.2f%%" % (thr*100., alpha2*100.))
print(" %g historical event(s)" % ne2)

which returns the following results:

CAAS:
 Pr( L < -10.00% ) = 0.76%
 19 historical event(s)
AAPL:
 Pr( L < -10.00% ) = 0.12%
 3 historical event(s)

And that’s great; however, all these numbers merely summarise 10 years of analysed data (May 13, 2005 to May 13, 2015 in our case). The final picture could be better understood if we could dynamically track the changes of both probabilities and the number of rare events in time. We achieve that with the following (not exactly state-of-the-art) code:

141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 # ---4. Mapping alphas for Rare Events   alphas = [] nevents = [] for t in range(1,len(retCAAS)-1): data = retCAAS[0:t] alpha, ne = findalpha(data, thr=thr, nbins=500) alphas.append(alpha) nevents.append(ne)   alphas2 = [] nevents2 = [] for t in range(1,len(retAAPL)-1): data = retAAPL[0:t] alpha, ne = findalpha(data, thr=thr, nbins=500) alphas2.append(alpha) nevents2.append(ne)   # plotting (figure #4) # running probability for rare events # plt.figure(num=4, figsize=(10, 6)) ax1 = plt.subplot(211) plt.plot(np.array(alphas)*100., 'r') plt.plot(np.array(alphas2)*100.) plt.ylabel("Pr( L < %.2f%% ) [%%]" % (thr*100.)) plt.axis('tight') ax2 = plt.subplot(212) plt.plot(np.array(nevents), 'r', label="CAAS") plt.plot(np.array(nevents2), label="AAPL") plt.ylabel("# of Events with L < %.2f%%" % (thr*100.)) plt.axis('tight') plt.legend(loc="upper left") plt.xlabel("Trading days 13/05/2005-13/05/2015") plt.show()

revealing the following picture:

It’s a great way of looking at and understanding the far left-tail volatility of the asset under your current investigation. The probability between two rare/extreme events decreases for an obvious reason: along the time axis we include more and more data, therefore the return distribution evolves and shifts its mass to the right, leaving left-tail events less and less probable.

The question remains: if an asset has never experienced an extreme loss of a given magnitude, the probability of such an event, within our VaR framework, simply remains zero! For example, for losses $L \lt -20$%,

CAAS displays 2 extreme events while AAPL displays none! Therefore, our estimate of a superbly rare daily loss of -20% (or more) for AAPL stock, based on 10 years of data, is zero, or undefined, or… completely unsound. Can we do better than this?

2. Classical Conditional Prediction for Rare Events

Let’s consider a case where we want to predict the probability of an asset/stock (CAAS; traded at the NASDAQ exchange) falling by more than $-L$%. Previously we achieved such an estimation through the integration of its probability density function. An alternative way is via derivation of the conditional probability, i.e. that CAAS will lose more than $-L$% given that the NASDAQ index drops by more than $-L$% on the next day.

Formula? Well, Bayes’ formula. That’s all we need:
$$\mbox{Pr}(B|R) = \frac{ \mbox{Pr}(R|B)\ \mbox{Pr}(B) } { \mbox{Pr}(R) } .$$
Great, now, the meaning of each term. Let $A$ denote the event of the stock’s daily return being between $-L$% and 0%. Let $B$ be the event of the stock’s return being $\lt -L$%. Let $R$ be the event of the NASDAQ daily return being $\lt -L$%. Given that, based on both the CAAS and NASDAQ historical time-series, we are able to compute $\mbox{Pr}(A)$, $\mbox{Pr}(B)$, $\mbox{Pr}(R|A)$, $\mbox{Pr}(R|B)$, therefore
$$\mbox{Pr}(R) = \mbox{Pr}(R|A)\mbox{Pr}(A) + \mbox{Pr}(R|B)\mbox{Pr}(B)$$ as well. Here, $\mbox{Pr}(R|A)$ stands for the probability of NASDAQ falling by more than $-L$% given that the CAAS return lands in the $(-L;0)$% interval, and $\mbox{Pr}(R|B)$ denotes the probability of NASDAQ falling by more than $-L$% given that the CAAS return also drops by more than $-L$% on the same day.
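To see the mechanics of that total-probability step, here is a small worked example with purely illustrative numbers (not the CAAS/NASDAQ estimates derived below). Taking
$$\mbox{Pr}(A)=0.93,\quad \mbox{Pr}(B)=0.07,\quad \mbox{Pr}(R|A)=0.001,\quad \mbox{Pr}(R|B)=0.002$$
we get
$$\mbox{Pr}(R) = 0.001\times 0.93 + 0.002\times 0.07 = 0.00107, \qquad \mbox{Pr}(B|R) = \frac{0.002\times 0.07}{0.00107} \approx 0.13 \ .$$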

With Bayes’ formula we reverse the engineering, aiming to provide the answer for $\mbox{Pr}(B|R)$ given all available data. Ready to code it? Awesome. Here we go:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 # Predicting Heavy and Extreme Losses in Real-Time for Portfolio Holders # (c) 2015 QuantAtRisk.com, by Pawel Lachowicz # # heavy2.py   import pandas.io.data as web import matplotlib.pyplot as plt import numpy as np   # ---1. Data Processing   # fetch and download daily adjusted-close price series for CAAS stock # and NASDAQ index using Yahoo! Finance public data provider ''' caas = web.DataReader("CAAS", data_source='yahoo', start='2005-05-13', end='2015-05-13')['Adj Close'] nasdaq = web.DataReader("^IXIC", data_source='yahoo', start='2005-05-13', end='2015-05-13')['Adj Close']   CAAScp = np.array(caas.values) NASDAQcp = np.array(nasdaq.values)   f = file("data2.dat","wb") np.save(f,CAAScp) np.save(f,NASDAQcp) f.close() '''   f = file("data2.dat","rb") CAAScp = np.load(f) NASDAQcp = np.load(f) f.close()   # compute the return-series retCAAS = CAAScp[1:]/CAAScp[:-1]-1 retNASDAQ = NASDAQcp[1:]/NASDAQcp[:-1]-1

This is the same code as used in Section 1, but instead of AAPL data we fetch the NASDAQ index daily close prices. Let’s plot the return-series:

37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 # plotting (figure #1) # return-series for CAAS and NASDAQ index # plt.figure(num=2, figsize=(10, 6)) plt.subplot(211) plt.grid(True) plt.plot(retCAAS, '-r', label="CAAS") plt.axis("tight") plt.ylim([-0.25,0.5]) plt.legend(loc="upper right") plt.ylabel("CAAS daily returns") plt.subplot(212) plt.grid(True) plt.plot(retNASDAQ, '-', label="NASDAQ") plt.legend(loc="upper right") plt.axis("tight") plt.ylim([-0.10,0.15]) plt.ylabel("NASDAQ daily returns") plt.xlabel("Trading days 13/05/2005-13/05/2015") plt.show()

i.e.,

where we observe different trading dynamics for the NASDAQ index between the 750th and 1000th day (as counted from May 13, 2005). Interestingly, the heaviest losses of NASDAQ are not ideally correlated with those of CAAS. That makes this case study more exciting! Please also note that CAAS is not part of the NASDAQ index (i.e., not one of its components). Bayes’ formula is designed around independent events and, within a fair approximation, we may think of CAAS as an asset ticking that box here.

What remains, and requires a lot of caution, is the code that “looks at” the data in the desired way. Here is its final form:

58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 # ---2. Computations of Conditional Probabilities for Rare Events   # isolate return-series displaying negative returns solely # set 1 for time stamps corresponding to positive returns nretCAAS = np.where(retCAAS < 0, retCAAS, 1) nretNASDAQ = np.where(retNASDAQ < 0, retNASDAQ, 1)   # set threshold for rare events thr = -0.065   # compute the sets of events A = np.where(nretCAAS < 0, nretCAAS, 1) A = np.where(A >= thr, A, 1) B = np.where(nretCAAS < thr, retCAAS, 1) R = np.where(nretNASDAQ < thr, retNASDAQ, 1) nA = float(len(A[A != 1])) nB = float(len(B[B != 1])) n = float(len(nretCAAS[nretCAAS != 1])) # n must equal to nA + nB # (optional) print(nA, nB, n == (nA + nB)) # check, if True then proceed further print(len(A), len(B), len(R)) print   # compute the probabilities pA = nA/n pB = nB/n   # compute the conditional probabilities pRA = np.sum(np.where(R+A < 0, 1, 0))/n pRB = np.sum(np.where(R+B < 0, 1, 0))/n   pR = pRA*pA + pRB*pB   # display results print("Pr(A)\t = %5.5f%%" % (pA*100.)) print("Pr(B)\t = %5.5f%%" % (pB*100.)) print("Pr(R|A)\t = %5.5f%%" % (pRA*100.)) print("Pr(R|B)\t = %5.5f%%" % (pRB*100.)) print("Pr(R)\t = %5.5f%%" % (pR*100.))   if(pR>0): pBR = pRB*pB/pR print("\nPr(B|R)\t = %5.5f%%" % (pBR*100.)) else: print("\nPr(B|R) impossible to be determined. Pr(R)=0.")

Python’s NumPy library helps us in a tremendous way with its smartly designed where function, which we employ in lines #69-72 and #86-87. First we test a logical condition; if it evaluates to True we grab the right data (actual returns), else we return 1. That opens up a couple of shortcuts for finding the number of specific events and for making sure we are still on the right side of the force (lines #73-79).

As you can see, in line #66 we specified our threshold level of $L = -6.5$%. Given that, we derive the probabilities:

(1216.0, 88.0, True)
(2516, 2516, 2516)

Pr(A)    = 93.25153%
Pr(B)    = 6.74847%
Pr(R|A)  = 0.07669%
Pr(R|B)  = 0.15337%
Pr(R)    = 0.08186%

Pr(B|R)  = 12.64368%

The level of -6.5% is pretty arbitrary but delivers an interesting finding. Namely, based on 10 years of data there is a 12.6% chance that on May 14, 2015 CAAS will lose more than 6.5% if NASDAQ drops by the same amount. How much exactly? We don’t know. It’s not certain. Only probable at 12.6%.

The outcome of $\mbox{Pr}(R|B)$, which is close to zero, may support our assumption that CAAS has a negligible “influence” on NASDAQ itself. On the other side of the rainbow, it’s much more difficult to interpret $\mbox{Pr}(R)$, the probability of an “isolated” rare event (i.e., $L<-6.5$%), since its estimation is solely based on two players in the game: CAAS and NASDAQ. However, when computed for $N\ge 100$ stocks outside the NASDAQ index but traded at NASDAQ, such a distribution of $\mbox{Pr}(R)$'s would be of great value as another alternative estimator for rare events (I should write a separate post about it). Now, back to business. There is one problem with our conditional estimation of rare event probabilities as outlined within this Section. A quick check for $L = -10$% reveals:

Pr(A)    = 98.69632%
Pr(B)    = 1.30368%
Pr(R|A)  = 0.00000%
Pr(R|B)  = 0.00000%
Pr(R)    = 0.00000%

Pr(B|R) impossible to be determined. Pr(R)=0.

Recall that $R$ is the event of the NASDAQ daily return being $\lt -L$%. A quick look at the NASDAQ return-series tells us the whole story and answers the question that comes to mind: why, why, why zero? Well, simply because there was no event in the past 10-year history of the index where it slid more than 10% in a single day. And we are left dead in the water.

A visual representation of that problem can be obtained by executing the following piece of code:

104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 from pyvar import cpr # non-at-all PEP8 style ;)   prob = [] for t in range(2,len(retCAAS)-1): ret1 = retCAAS[0:t] ret2 = retNASDAQ[0:t] pBR, _ = cpr(ret1, ret2, thr=thr) prob.append(pBR)   # plotting (figure #2) # the conditional probability for rare events given threshold # plt.figure(num=2, figsize=(10, 6)) plt.plot(np.array(prob)*100., 'r') plt.ylabel("Pr(B|R) [%%]") plt.axis('tight') plt.title("L$_{thr}$ = %.2f%%" % (thr*100.)) plt.xlabel("Trading days 13/05/2005-13/05/2015") plt.show()

where we moved all the Bayes’-based calculations (lines #58-102) into a function cpr (a part of the pyvar.py local library; see the Download section below) and repeated our previous experiment, this time, however, tracking $\mbox{Pr}(B|R)$ changing in time given the threshold of $L=-6.5$%. The resulting plot would be:

Before ca. the 800th day the NASDAQ index lacked any event of $L\lt -6.5$%, therefore $\mbox{Pr}(B|R)$ could not be determined. After that period we got some heavy hits and the conditional probability could be derived. The end value (as of May 13, 2015) is 12.64%.

Hope amongst Hopelessness?

There is no better title to summarise the outcomes we derived. Our goal was to come up with some (running) probability for very rare event(s) that, in our case, would be expressed as a 1-day loss of $-L$% in trading for a specified financial asset. In the first attempt we engaged the VaR-related framework and estimated the occurrence of the rare event based on the PDF integration. In our second attempt we sought an alternative solution, making use of Bayes’ conditional prediction of the asset’s loss of $-L$% (or more) given the observation of the same event somewhere else (the NASDAQ index). For the former we ended up with a running probability of $\mbox{Pr}(L \le L_{thr})$, while for the latter with $\mbox{Pr}(B|R)$ given $L_{thr}$ for both $B$ and $R$ events.

Assuming $L_{thr}$ to be equal to -6.5%, we can present the corresponding computations in the following chart:

Well, the situation looks hopeless. It doesn’t get any better even if, as discussed earlier, the probability $\mbox{Pr}(R)$ is added. Both methodologies seem to deliver an answer to the same question; however, somehow, we are left confused, disappointed, misled, in tasteless despair…

Stay tuned as in Part 2 we will see the light at the end of the tunnel. I promise.

DOWNLOAD
heavy1.py, heavy2.py, pyvar.py

Featured in: Data Science Weekly Newsletter, Issue 76 (May 7, 2015)

It has been over a year since I posted the Hacking Google Finance in Real-Time for Algorithmic Traders article. Surprisingly, it became the number one URL of QaR that Google has been displaying as a result for various queries, and the number two most frequently read post. Thank you! It’s my pleasure to provide quality content covering interesting topics that I find potentially useful.

You may be surprised how fast Python solutions have moved forward, facilitating the life of quants and algo traders. For instance, yesterday, quite by chance, I found a code that seems to work equally well compared to my first version and, in fact, is more flexible in the data content that can be retrieved. The idea stays the same as previously; however, our goal this time is to monitor changes of stock prices provided by Google Finance in real-time before the market opens.

Constructing Pre-Market Price-Series

The pre-market trading session typically occurs between 8:00am and 9:30am EDT each trading day though for some stocks we often observe frequent movements much earlier, e.g. at 6:00am. Many investors and traders watch the pre-market trading activity to judge the strength and direction of the market in anticipation for the regular trading session. Pre-market trading activity generally has limited volume and liquidity, and therefore, large bid-ask spreads are common. Many retail brokers offer pre-market trading, but may limit the types of orders that can be used during the pre-market period$^1$.

In Google Finance the stock price in pre-market is usually displayed right beneath the ticker, for example:

The price of the stock (here: AAPL) varies depending on interest, good/bad news, etc.

In Python we can fetch those changes (I adapt a code found on the Web) in the following way:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 import urllib2 # works fine with Python 2.7.9 (not 3.4.+) import json import time   def fetchPreMarket(symbol, exchange): link = "http://finance.google.com/finance/info?client=ig&q=" url = link+"%s:%s" % (exchange, symbol) u = urllib2.urlopen(url) content = u.read() data = json.loads(content[3:]) info = data[0] t = str(info["elt"]) # time stamp l = float(info["l"]) # close price (previous trading day) p = float(info["el"]) # stock price in pre-market (after-hours) return (t,l,p)     p0 = 0 while True: t, l, p = fetchPreMarket("AAPL","NASDAQ") if(p!=p0): p0 = p print("%s\t%.2f\t%.2f\t%+.2f\t%+.2f%%" % (t, l, p, p-l, (p/l-1)*100.)) time.sleep(60)

In this code we target Google to get every 60 seconds an update of the pre-market price (line #14). What we retrieve is a JSON file of the form:

// [ { "id": "22144" ,"t" : "AAPL" ,"e" : "NASDAQ" ,"l" : "125.80" ,"l_fix" : "125.80" ,"l_cur" : "125.80" ,"s": "1" ,"ltt":"4:02PM EDT" ,"lt" : "May 5, 4:02PM EDT" ,"lt_dts" : "2015-05-05T16:02:28Z" ,"c" : "-2.90" ,"c_fix" : "-2.90" ,"cp" : "-2.25" ,"cp_fix" : "-2.25" ,"ccol" : "chr" ,"pcls_fix" : "128.7" ,"el": "126.10" ,"el_fix": "126.10" ,"el_cur": "126.10" ,"elt" : "May 6, 6:35AM EDT" ,"ec" : "+0.30" ,"ec_fix" : "0.30" ,"ecp" : "0.24" ,"ecp_fix" : "0.24" ,"eccol" : "chg" ,"div" : "0.52" ,"yld" : "1.65" ,"eo" : "" ,"delay": "" ,"op" : "128.15" ,"hi" : "128.45" ,"lo" : "125.78" ,"vo" : "21,812.00" ,"avvo" : "46.81M" ,"hi52" : "134.54" ,"lo52" : "82.90" ,"mc" : "741.44B" ,"pe" : "15.55" ,"fwpe" : "" ,"beta" : "0.84" ,"eps" : "8.09" ,"shares" : "5.76B" ,"inst_own" : "62%" ,"name" : "Apple Inc." ,"type" : "Company" } ]

You can fetch it individually by executing the following query in a browser:

http://www.google.com/finance/info?infotype=infoquoteall&q=NASDAQ:AAPL

Some of that information you can easily decipher. For our task we need to get only: el (an asset price in pre-market or after-hours trading; a.k.a. extended hours trading); elt (the corresponding time stamp); and l (the most recent last price). This is what our Python code does for us in lines #12-14. Nice and smooth.

When executed before 9.30am EDT (here for NASDAQ:AAPL), we may construct the pre-market price-series every time the price changes:

May 6, 6:35AM EDT 125.80 126.18 +0.38 +0.30% May 6, 6:42AM EDT 125.80 126.21 +0.41 +0.33% May 6, 6:45AM EDT 125.80 126.16 +0.36 +0.29% May 6, 6:46AM EDT 125.80 126.18 +0.38 +0.30% May 6, 6:49AM EDT 125.80 126.10 +0.30 +0.24% May 6, 6:51AM EDT 125.80 126.20 +0.40 +0.32% May 6, 6:57AM EDT 125.80 126.13 +0.33 +0.26% May 6, 7:00AM EDT 125.80 126.20 +0.40 +0.32% May 6, 7:01AM EDT 125.80 126.13 +0.33 +0.26% May 6, 7:07AM EDT 125.80 126.18 +0.38 +0.30% May 6, 7:09AM EDT 125.80 126.20 +0.40 +0.32% May 6, 7:10AM EDT 125.80 126.19 +0.39 +0.31% May 6, 7:10AM EDT 125.80 126.22 +0.42 +0.33% May 6, 7:12AM EDT 125.80 126.20 +0.40 +0.32% May 6, 7:22AM EDT 125.80 126.27 +0.47 +0.37% May 6, 7:28AM EDT 125.80 126.24 +0.44 +0.35% ... May 6, 9:02AM EDT 125.80 126.69 +0.89 +0.71% May 6, 9:03AM EDT 125.80 126.71 +0.91 +0.72% May 6, 9:04AM EDT 125.80 126.73 +0.93 +0.74% May 6, 9:08AM EDT 125.80 126.67 +0.87 +0.69% May 6, 9:09AM EDT 125.80 126.69 +0.89 +0.71% May 6, 9:10AM EDT 125.80 126.68 +0.88 +0.70% May 6, 9:13AM EDT 125.80 126.67 +0.87 +0.69% May 6, 9:14AM EDT 125.80 126.72 +0.92 +0.73% May 6, 9:16AM EDT 125.80 126.74 +0.94 +0.75% May 6, 9:17AM EDT 125.80 126.72 +0.92 +0.73% May 6, 9:18AM EDT 125.80 126.70 +0.90 +0.72% May 6, 9:19AM EDT 125.80 126.73 +0.93 +0.74% May 6, 9:20AM EDT 125.80 126.75 +0.95 +0.76% May 6, 9:21AM EDT 125.80 126.74 +0.94 +0.75% May 6, 9:21AM EDT 125.80 126.79 +0.99 +0.79% (*) May 6, 9:23AM EDT 125.80 126.78 +0.98 +0.78% May 6, 9:24AM EDT 125.80 126.71 +0.91 +0.72% May 6, 9:25AM EDT 125.80 126.73 +0.93 +0.74% May 6, 9:26AM EDT 125.80 126.75 +0.95 +0.76% May 6, 9:27AM EDT 125.80 126.70 +0.90 +0.72% May 6, 9:28AM EDT 125.80 126.75 +0.95 +0.76% May 6, 9:29AM EDT 125.80 126.79 +0.99 +0.79%

Since prices in pre-market tend to vary slowly, a 60-second time interval is sufficient to keep an eye on the stock. You can compare a live result retrieved using our Python code at 9:21am (*) with the above screenshot I took at the same time.

A simple joy of Python in action. Enjoy!

HOMEWORK
1. The code fails after 9.30am EST (NYC time). Modify it to catch this exception.
2. Modify the code (or write a new function) that works after 9.30am EDT.
3. It is possible to get quotes for $N\gt1$ stocks by calling, for example,
NASDAQ:AAPL,NYSE:JNJ,… in line #7 of the code. Modify the program
to fetch pre-market time-series, $x_i(t)$ $(i=1,…,N)$, for an $N$-asset portfolio.
Given that, compute the fractional root-mean-square volatility, $\sigma_{x_i(t)}/\langle x_i(t) \rangle$,
i.e. the standard deviation divided by the mean, between 6am and 9.30am EDT
for each asset, and check whether you can use it as an indicator for the stock price movement
after 9.30am (see the frms sketch right after this list). Tip: the higher the frms, the more trading is expected in the first 15 min
of a new session on Wall Street.
4. Modify the code to monitor after-hours trading till 4.30pm.
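For exercise 3, a minimal sketch of that fractional root-mean-square volatility (a hypothetical helper, not part of the original code) applied to one pre-market price-series could be:

# a sketch: fractional rms volatility = standard deviation / mean of x_i(t)
import numpy as np

def frms(x):
    x = np.asarray(x, dtype=float)
    return np.std(x)/np.mean(x)

# a few sample AAPL pre-market quotes taken from the output above
premarket_aapl = [126.18, 126.21, 126.16, 126.18, 126.10, 126.20]
print("frms(AAPL) = %.6f" % frms(premarket_aapl))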

NetworkError.org, 2013, Google’s Undocumented Finance API

REFERENCES
$^1$Pre-Market, Investopedia, http://www.investopedia.com/terms/p/premarket.asp

## Walsh–Hadamard Transform and Tests for Randomness of Financial Return-Series

Randomness. A magic that happens behind the scenes. An incomprehensible little black box that does the job for us. Quants. Many use it every day without thinking, without considering how those beautifully uniformly distributed numbers are drawn. Why so fast? Why so random? Is randomness a byproduct of chaos and order tamed somehow? How much trust can we place in this seemingly minor command of random() as a piece of our code?

Randomness Built-in. Not only is that the name of a chapter in my latest book, it is also the main reason I’m writing this article: wandering sideways around the subject led me to new, undiscovered territories! In general the output, in the form of a random number, comes from a dedicated function. To be perfectly honest, it is like a refined gentleman of sophisticated quality, recognised by its efficiency when a great number of draws is requested. The numbers are not truly random. They are pseudo-random. That means the devil resides in the details. A typical pseudo-random number generator (PRNG) is designed for speed but defined by its underlying algorithm. In most computer languages the Mersenne Twister, developed by 松本 眞 and 西村 拓士 in 1997, has become a standard. As one might suspect, an algorithm must repeat itself and, in the case of our Japanese hero, its period is superbly long, i.e. $2^{19937}-1$. A reliable implementation in C or C++ guarantees enormous speedups in pseudo-random number generation. Interestingly, Mersenne Twister is not perfect. Its use is discouraged when it comes to obtaining cryptographic random numbers. Can you imagine that?! But that’s another story. Ideal for a book chapter indeed!

In this post, I dare to present the very first, meaningful, and practical application of the Walsh–Hadamard Transform (WHT) in quantitative finance. Remarkably, this tool, of marginal use in digital signal processing, has been shown to serve as a great facility in testing any binary sequence for statistically significant randomness!

Within the following sections we introduce the transform to finance. After Oprina et al. (2009) we encode the WHT Significance Tests in Python (see the Download section) and verify the hypothesis of randomness of sequences generated by the Mersenne Twister PRNG against the alternative one at very high significance levels of $\alpha\le 0.00001$. Next, we show a practical application of the WHT framework in the search for randomness in any financial time-series by the example of 30-min FX return-series, and compare them to the results obtained from the PRNG available in Python by default. Curious about the findings? Welcome to the world of magic!

1. Walsh–Hadamard Transform And Binary Sequences

I wish I had for you this great opening story on how Jacques Hadamard and Joseph L. Walsh teamed up with Jack Daniels on one Friday night in a corner pub somewhere in San Francisco, coming up with a memorable breakthrough in the theory of numbers. I wish I had! Well, not this time. However, as a make-up, below I will present in a nutshell the theoretical part on WHT that I have been studying for the past three weeks and found truly mind-blowing because of its simplicity.

Let’s consider any real-valued discrete signal $X(t_i)$ where $i=0,1,…,N-1$. Its trimmed version, $x(t_i)$, of the total length of $n=2^M$ such that $2^M\le(N-1)$ and $M\in\mathbb{Z}^+$ at $M\ge 2$ is considered as an input signal for the Walsh–Hadamard Transform, the latter defined as:
$$WHT_n = \mathbf{x} \bigotimes_{i=1}^M \mathbf{H_2}$$ where the Hadamard matrix of order $n=2^M$ is obtainable recursively by the formula:
$$\mathbf{H_{2^M}} = \begin{pmatrix} H_{2^{M-1}} & H_{2^{M-1}} \\ H_{2^{M-1}} & -H_{2^{M-1}} \end{pmatrix} \ \ \ \ \ \text{therefore} \ \ \ \ \ \mathbf{H_2} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$$ and $\otimes$ denotes the Kronecker product between two matrices. Given that, $WHT_n$ is the dot product between the signal (1D array; vector) and resultant Kronecker multiplications of $\mathbf{H_2}$ $M$-times (Johnson & Puschel 2000).
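As a tiny sketch of that recursion, $\mathbf{H_{2^M}}$ can be built by repeated Kronecker products of $\mathbf{H_2}$ with itself (for the Sylvester construction this yields the same matrix as the SciPy call shown in Section 1.2):

# a sketch: building H_{2^M} via repeated Kronecker products (here M=3)
import numpy as np

H2 = np.array([[1,  1],
               [1, -1]])
M = 3
H = H2
for _ in range(M-1):
    H = np.kron(H, H2)
print(H)   # an 8x8 Hadamard matrix

Up to the normalisation factor of $2^{-M}$ used later in this post, the transform of a signal x of length $2^M$ is then simply np.dot(H, x).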

1.1. Walsh Functions

The Walsh-Hadamard transform uses the orthogonal square-wave functions, $w_j(x)$, introduced by Walsh (1923), which have only two values $\pm 1$ in the interval $0\le x\lt 1$ and the value zero elsewhere. The original definition of the Walsh functions is based on the following recursive equations:
$$w_{2j}(x) = w_j(2x)+(-1)^j w_j (2x -1) \quad \mbox{for} \ \ j=1,2,… \\ w_{2j-1}(x) = w_{j-1}(2x)-(-1)^{j-1} w_{j-1} (2x -1) \quad \mbox{for} \ \ j=1,2,…$$ with the initial condition of $w_0(x)= 1$. You can meet different orderings of the Walsh functions in the literature, but in general they correspond to the ordering of the orthogonal harmonic functions $h_j(x)$, which are defined as:
$$h_{2j}(x) = \cos(2\pi j x)\quad \mbox{for} \ \ j=0,1,… \\ h_{2j-1}(x) = \sin(2\pi j x)\quad \mbox{for} \ \ j=1,2,…$$ on the interval $0\le x\lt 1$, and have zero value for all other values of $x$ outside this interval. A comparison of both function classes looks as follows:

where the first eight harmonic functions and Walsh functions are given in the left and right panels, respectively. Walsh functions with even and odd orders are called the cal and sal functions, respectively, and they correspond to the cosine and sine functions in Fourier analysis (Aittokallio et al. 2001).

1.2. From Hadamard to Walsh Matrix

In Python the Hadamard matrix of order $2^M$ is obtainable as a NumPy 2D array, making use of the SciPy module, as follows:

from scipy.linalg import hadamard
M=3; n=2**M
H=hadamard(n)
print(H)

which in this case returns:

[[ 1  1  1  1  1  1  1  1]
 [ 1 -1  1 -1  1 -1  1 -1]
 [ 1  1 -1 -1  1  1 -1 -1]
 [ 1 -1 -1  1  1 -1 -1  1]
 [ 1  1  1  1 -1 -1 -1 -1]
 [ 1 -1  1 -1 -1  1 -1  1]
 [ 1  1 -1 -1 -1 -1  1  1]
 [ 1 -1 -1  1 -1  1  1 -1]]

Each row of the matrix corresponds to a Walsh function. However, the ordering is different (Hadamard ordering). Therefore, to be able to “see” the shape of the Walsh functions as presented in the figure above, we need to rearrange their indexing. The resultant matrix is known as the Walsh matrix. We derive it in Python:

def Hadamard2Walsh(n): # Function computes both Hadamard and Walsh Matrices of n=2^M order # (c) 2015 QuantAtRisk.com, coded by Pawel Lachowicz, adopted after # au.mathworks.com/help/signal/examples/discrete-walsh-hadamard-transform.html import numpy as np from scipy.linalg import hadamard from math import log   hadamardMatrix=hadamard(n) HadIdx = np.arange(n) M = log(n,2)+1   for i in HadIdx: s=format(i, '#032b') s=s[::-1]; s=s[:-2]; s=list(s) x=[int(x) for x in s] x=np.array(x) if(i==0): binHadIdx=x else: binHadIdx=np.vstack((binHadIdx,x))   binSeqIdx = np.zeros((n,M)).T   for k in reversed(xrange(1,int(M))): tmp=np.bitwise_xor(binHadIdx.T[k],binHadIdx.T[k-1]) binSeqIdx[k]=tmp   tmp=np.power(2,np.arange(M)[::-1]) tmp=tmp.T SeqIdx = np.dot(binSeqIdx.T,tmp)   j=1 for i in SeqIdx: if(j==1): walshMatrix=hadamardMatrix[i] else: walshMatrix=np.vstack((walshMatrix,hadamardMatrix[i])) j+=1   return (hadamardMatrix,walshMatrix)

Therefore, by calling the function in an exemplary main program:

from WalshHadamard import Hadamard2Walsh import matplotlib.pyplot as plt import numpy as np   M=3; n=2**M (H,W)=Hadamard2Walsh(n) # display Hadamard matrix followed by Walsh matrix (n=8) print(H); print; print(W)   # a visual comparison of Walsh function (j=2) M=3; n=2**M _,W=Hadamard2Walsh(n) plt.subplot(2,1,1) plt.step(np.arange(n).tolist(),W[2],where="post") plt.xlim(0,n) plt.ylim(-1.1,1.1); plt.ylabel("order M=3")   M=8; n=2**M _,W=Hadamard2Walsh(n) plt.subplot(2,1,2) plt.step(np.arange(n).tolist(),W[2],where="post") plt.xlim(0,n) plt.ylim(-1.1,1.1); plt.ylabel("order M=8")   plt.show()

first, we display Hadamard and Walsh matrices of order $n=2^3$, respectively:

[[ 1  1  1  1  1  1  1  1]
 [ 1 -1  1 -1  1 -1  1 -1]
 [ 1  1 -1 -1  1  1 -1 -1]
 [ 1 -1 -1  1  1 -1 -1  1]
 [ 1  1  1  1 -1 -1 -1 -1]
 [ 1 -1  1 -1 -1  1 -1  1]
 [ 1  1 -1 -1 -1 -1  1  1]
 [ 1 -1 -1  1 -1  1  1 -1]]

[[ 1  1  1  1  1  1  1  1]
 [ 1  1  1  1 -1 -1 -1 -1]
 [ 1  1 -1 -1 -1 -1  1  1]
 [ 1  1 -1 -1  1  1 -1 -1]
 [ 1 -1 -1  1  1 -1 -1  1]
 [ 1 -1 -1  1 -1  1  1 -1]
 [ 1 -1  1 -1 -1  1 -1  1]
 [ 1 -1  1 -1  1 -1  1 -1]]

and next we visually verify that the shape of the third Walsh function ($j=2$) is preserved for two different orders, here $M=3$ and $M=8$, respectively:

A third possibility of ordering the Walsh functions is to arrange them in increasing order of their sequencies, i.e. the number of zero-crossings: sequency order. However, we won’t be interested in it this time.

1.3. Signal Transformations

As we have seen in the beginning, the WHT is able to perform a signal transformation for any real-valued time-series. The sole requirement is that the signal ought to be of length $2^M$. Now, it should be obvious why that is so. When you think about WHT for a second longer you may understand its uniqueness as contrasted with the Fourier transform. Firstly, the waveforms are much simpler. Secondly, the complexity of the computation is significantly reduced. Finally, if the input signal is converted from its original form down to only two discrete values $\pm 1$, we end up with a bunch of trivial arithmetical calculations while computing the WHT.

The Walsh-Hadamard transform found its application in medical signal processing, audio/sound processing, signal and image compression, pattern recognition, and cryptography. In the most simplistic cases, one deals with an input signal of binary form, e.g. 01001011010110. If we consider the binary function $f: \mathbb{Z}_2^n \rightarrow \mathbb{Z}_2$ then the following transformation is possible:
$$\bar{f}(\mathbf{x}) = 1 - 2f(\mathbf{x}) = (-1)^{f(\mathbf{x})}$$ therefore $\bar{f}: \mathbb{Z}_2^n \rightarrow \{-1,1\}$. What it does is perform the following conversion, for instance:
$$\{0,1,0,1,1,0,1,0,0,0,1,…\} \rightarrow \{-1,1,-1,1,1,-1,1,-1,-1,-1,1,…\}$$ of the initial binary time-series. However, what would be more interesting for us as it comes to the financial return-series processing, is the transformation:
$$\bar{f}(\mathbf{x}) = \left\{ \begin{array}{l l} 1 & \ \text{if } f(\mathbf{x})\ge 0\\ -1 & \ \text{if } f(\mathbf{x})\lt 0 \end{array} \right.$$ Given that, any return-series in the value interval $[-1,1]$ (real-valued) is transformed to the binary form of $\pm 1$, for example:
$$\{0.031,0.002,-0.018,-0.025,0.011,…\} \rightarrow \{1,1,-1,-1,1,…\} \ .$$ This simple signal transformation in connection with the power of Walsh-Hadamard Transform opens new possibilities of analysing the underlying true signal. WHT is all “made of” $\pm 1$ values sleeping in its Hadamard matrices. Coming in a close contact with the signal of the same form this is “where the rubber meets the road” (Durkee, Jacobs, & Meat Loaf 1995).
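In NumPy that conversion is a one-liner (the article’s ret2bin function, defined in Section 2.1, does the same job for plain lists and arrays):

# a sketch: mapping a return-series onto a +/-1 binary sequence
import numpy as np

ret = np.array([0.031, 0.002, -0.018, -0.025, 0.011])
xbin = np.where(ret >= 0, 1, -1)
print(xbin)    # [ 1  1 -1 -1  1]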

1.4. Discrete Walsh-Hadamard Transform in Python

The Walsh-Hadamard transform is an orthogonal transformation that decomposes a signal into a set of orthogonal, rectangular waveforms (Walsh functions). For any real-valued signal we derive WHT as follows:

def WHT(x): # Function computes (slow) Discrete Walsh-Hadamard Transform # for any 1D real-valued signal # (c) 2015 QuantAtRisk.com, by Pawel Lachowicz x=np.array(x) if(len(x.shape)<2): # make sure x is 1D array if(len(x)>3): # accept x of min length of 4 elements (M=2) # check length of signal, adjust to 2**m n=len(x) M=trunc(log(n,2)) x=x[0:2**M] h2=np.array([[1,1],[1,-1]]) for i in xrange(M-1): if(i==0): H=np.kron(h2,h2) else: H=np.kron(H,h2)   return (np.dot(H,x)/2.**M, x, M) else: print("HWT(x): Array too short!") raise SystemExit else: print("HWT(x): 1D array expected!") raise SystemExit

Despite its simplicity, this solution slows down for signals of length $n\ge 2^{22}$ where, in the case of my MacBook Pro, 16 GB of RAM is just not enough! Therefore, the mechanical derivation of WHT making use of Kronecker products between matrices is often referred to as the Slow Walsh-Hadamard Transform. A Fast WHT obviously exists, but its application is not required for this article (and research). Why? We will discuss it later on.

We can see our Python WHT(x) function in action by coding, for instance:

from WalshHadamard import WHT
from random import randrange

x1=[0.123,-0.345,-0.021,0.054,0.008,0.043,-0.017,-0.036]
wht,_,_=WHT(x1)
print("x1 = %s" % x1)
print("WHT = %s" % wht)

x2=[randrange(-1,2,2) for i in xrange(2**4)]
wht,_,_=WHT(x2)
print; print("x2 = %s" % x2)
print("WHT = %s" % wht)

which returns

x1 = [0.123, -0.345, -0.021, 0.054, 0.008, 0.043, -0.017, -0.036]
WHT = [-0.023875  0.047125 -0.018875  0.061125 -0.023375  0.051125 -0.044875  0.074625]

x2 = [1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, -1, -1]
WHT = [ 0.375  0.125  0.125 -0.125 -0.125  0.125 -0.375 -0.125  0.375  0.125 -0.375  0.375 -0.125  0.125  0.125  0.375]

where we performed the WHT of a real-valued signal (e.g. a financial return-series) $x_1$ and of a random binary sequence $x_2$. Is it correct? You can always verify those results in Matlab (version used here: 2014b) by executing the corresponding function fwht from Matlab’s Signal Processing Toolbox:

>> x1=[0.123, -0.345, -0.021, 0.054, 0.008, 0.043, -0.017, -0.036] x1 = 0.1230 -0.3450 -0.0210 0.0540 0.0080 0.0430 -0.0170 -0.0360   >> y1 = fwht(x1,length(x1),'hadamard') y1 = -0.0239 0.0471 -0.0189 0.0611 -0.0234 0.0511 -0.0449 0.0746   >> x2=[1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, -1, -1] x2 = 1 -1 1 1 1 1 1 1 -1 1 1 -1 1 1 -1 -1   >> y2 = fwht(x2,length(x2),'hadamard')' y2 = 0.3750 0.1250 0.1250 -0.1250 -0.1250 0.1250 -0.3750 -0.1250 0.3750 0.1250 -0.3750 0.3750 -0.1250 0.1250 0.1250 0.3750

All nice, smooth, and in agreement. The difference between WHT(x) and Matlab’s fwht is that the former trims the signal down while the latter allows for padding with zeros. Just keep that in mind if you employ Matlab in your own research.

2. Random Sequences and Walsh-Hadamard Transform Statistical Test

You may raise a question. Why do we need to introduce WHT at all, transform return-series into binary $\pm 1$ sequences, and what does it have in common with randomness? The answer is straightforward and my idea is simple, namely: we may test any financial return time-series for randomness using our fancy WHT approach. A bit of creativity before sundown.

The WHT of a binary signal returns a unique pattern. If the signal $x(t)$ is of a random nature, we should not expect to find any repeatable sub-string in it. That’s the main motivation standing behind PRNGs: to imitate true randomness met in nature.

In 2009 Oprina et al. (hereinafter Op09) proposed a statistical test based on results derived for any binary $\pm 1$ signal $x(t)$. But they were smart: instead of looking at the signal as a whole, they suggested analysing it chunk by chunk (via signal segmentation into equal-length blocks). The statistical test they designed aims at performing a class of autocorrelation tests with the correlation mask given by the rows of the Hadamard matrix. As a supplement to the methodology presented in NIST Special Publication 800-22 (Rukhin et al. 2010), which specifies 16 independent statistical tests for random and pseudorandom number generators for cryptographic applications, Op09 proposed another two methods, based on confidence intervals, which can detect a more general failure in random and pseudorandom generators. All 5 statistical tests of Op09 form the Walsh-Hadamard Transform Statistical Test Suite and, in what follows, we will encode them in Python, focusing both on statistical meaning and application.

The idea standing behind Op09’s suite was a generalised test suitable for different purposes: randomness testing, cryptographic design, cryptanalysis techniques and steganographic detection. In this Section, first, we concentrate our effort on seeing how this new test suite works for the standard Mersenne Twister PRNG concealed in Python’s random class. Next, we move to real-life financial data as an input, where we aim to verify whether any chunk (block) of the time-series can be regarded as being of a random nature at a given significance level of $\alpha$. If so, which one? If not, well, well, well… why not?! For the latter case, this test opens a new window of opportunity to look for the non-stochastic nature of, e.g., trading signals, and their repeatable properties (if any).

2.1. Signal Pre-Processing

Each test requires some input data. In the case of the Op09 tests, we need to provide: a trimmed signal $x(t)$ of the total length $n=2^M$; a chosen sequence (block) size of length $2^m$; a significance level $\alpha$ (rejection level); and a probability $p$ of occurrence of the digit 1. In step 1 we transform (if needed) $x(t_i)$ into a $\pm 1$ sequence as specified in Section 1.3. In step 2 we compute the lower and upper rejection limits of the test, $u_{\alpha/2}$ and $u_{1-\alpha/2}$. In step 3 we compute the number of sequences to be processed, $a=n/2^m$, and split $x(t)$ into $a$ adjacent blocks (sequences).
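Steps 2 and 3 reduce to a few lines; a short sketch (assuming, as in the example below, a trimmed signal of $n=2^{10}$ elements split into blocks of $2^5$ elements) could be:

# a sketch: rejection limits and the number of blocks for the Op09 tests
from scipy.stats import norm

alpha = 0.001
u_lo = norm.ppf(alpha/2.)       # lower rejection limit u_{alpha/2}
u_hi = norm.ppf(1.-alpha/2.)    # upper rejection limit u_{1-alpha/2}

n = 2**10                       # trimmed signal length
b = 2**5                        # block (sequence) length
a = n//b                        # number of adjacent blocks
print(u_lo, u_hi, a)            # approx. -3.29, 3.29, 32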

Since it sounds complicated, let’s see how easy it is in fact. We design a function that splits the signal $X(t)$ into $a$ blocks of length $b$:

def xsequences(x): x=np.array(x) # X(t) or x(t) if(len(x.shape)<2): # make sure x is 1D array if(len(x)>3): # accept x of min length of 4 elements (M=2) # check length of signal, adjust to 2**M if needed n=len(x) M=trunc(log(n,2)) x=x[0:2**M] a=(2**(M/2)) # a number of adjacent sequences/blocks b=2**M/a # a number of elements in each sequence y=np.reshape(x,(a,b)) #print(y) return (y,x,a,b,M) else: print("xsequences(x): Array too short!") raise SystemExit else: print("xsequences(x): 1D array expected!") raise SystemExit

where the conversion of a Python list’s (or NumPy 1D array’s) values into a $\pm 1$ signal is obtained by:

def ret2bin(x): # Function converts list/np.ndarray values into +/-1 signal # (c) 2015 QuantAtRisk.com, by Pawel Lachowicz Y=[]; ok=False print(str(type(x))) print(x) if('numpy' in str(type(x)) and 'ndarray' in str(type(x))): x=x.tolist() ok=True elif('list' in str(type(x))): ok=True if(ok): for y in x: if(y<0): Y.append(-1) else: Y.append(1) return Y else: print("Error: neither 1D list nor 1D NumPy ndarray") raise SystemExit

To be perfectly informed about what took place, we may wish to display a summary of the pre-processed signal as follows:

def info(X,xt,a,b,M): line() print("Signal\t\tX(t)") print(" of length\tn = %d digits" % len(X)) print("trimmed to\tx(t)") print(" of length\tn = %d digits (n=2^%d)" % (a*b,M)) print(" split into\ta = %d sub-sequences " % a) print("\t\tb = %d-digit long" % b) print

Now, let’s see how this section of our data processing works in practice. As an example, first we generate a random signal $X(t)$ of length $N=2^{10}+3$ with values spread in the interval $[-0.2,0.2]$. That should provide us with some sort of feeling of a financial return time-series (e.g. collected daily based on stock or FX-series trading). Next, employing the xsequences function, we will trim the signal $X(t)$ down to $x(t)$ of length $n=2^{10}$, first converting the return-series into the $\pm 1$ sequence denoted by $x'(t)$. Finally, we split $x(t)$ into $a$ sub-sequences of length $b$, stored in $x_{\rm seq}$. An exemplary main program executing those steps could be coded as follows:

from WalshHadamard import ret2bin, xsequences, info
from random import uniform, seed
import numpy as np

# generate a random signal X(t)
seed(4515)
X=[uniform(-0.2,0.2) for i in range(2**10)]; X=X+[0.12,-0.04,0.01]
Xorg=X

# convert its values into +/-1 sequence
X=ret2bin(X)  # x'(t)

# split x'(t) into a blocks; save result in xseq 2D array
(xseq,x,a,b,M) = xsequences(X)

print("X(t) =")
for i in xrange(0,len(Xorg)): print("%10.5f" % Xorg[i])
print("\nx'(t) =")
print(np.array(X))
print("\nx(t) =")
print(x)
print("\nxseq =")
print(xseq)
print

info(X,x,a,b,M)

returning:

X(t) = 0.17496 0.07144 -0.15979 0.11344 0.08134 -0.01725 ... -0.16005 0.01419 -0.08748 -0.03218 -0.07908 -0.02781 0.12000 -0.04000 0.01000   X'(t) = [ 1 1 -1 ..., 1 -1 1]   x(t) = [ 1 1 -1 ..., -1 -1 -1]   xseq = [[ 1 1 -1 ..., -1 -1 1] [-1 -1 -1 ..., 1 -1 1] [-1 -1 1 ..., -1 1 1] ..., [-1 1 1 ..., 1 -1 1] [-1 1 1 ..., -1 -1 -1] [ 1 -1 -1 ..., -1 -1 -1]]   ---------------------------------------------------------------------- Signal X(t) of length n = 1027 digits trimmed to x(t) of length n = 1024 digits (n=2^10) split into a = 32 sub-sequences b = 32-digit long

Having such a framework for the initial input data, we are ready to program the WHT Statistical Test based on the Op09 recipe.

2.2. Test Statistic

The 2D matrix $x_{\rm seq}$ holding our signal under investigation is the starting point for its tests for randomness. Op09’s test is based on the computation of the WHT for each row of $x_{\rm seq}$ and of the t-statistics, $t_{ij}$, as a test function based on the Walsh-Hadamard transformation of all sub-sequences of $x(t)$.

It is assumed that for any signal $y(t_i)$, where $i=0,1,…$, the WHT returns a sequence $\{w_i\}$ and: (a) for $w_0$ the mean value is $m_0=2^m(1-2p)$; the variance is given by $\sigma^2_0=2^{m+2}p(1-p)$ and the distribution of $(w_0-m_0)/\sigma_0 \sim N(0,1)$ for $m>7$; (b) for $w_i$ ($i\ge 1$) the mean value is $m_i=0$; the variance is $\sigma^2_i=2^{m+2}p(1-p)$ and the distribution of $(w_i-m_i)/\sigma_i \sim N(0,1)$ for $m>7$. Recalling that $p$ stands for the probability of occurrence of the digit 1 in $x_{{\rm seq},j}$, for $p = 0.5$ (our desired test probability) the mean value of $w_i$ is equal to 0 for every $i$.

In the $x_{\rm seq}$ array, for every $j=0,1,…,(a-1)$ and for every $i=0,1,…,(b-1)$ we compute the t-statistic as follows:
$$t_{ij} = \frac{w_{ij} - m_i}{\sigma_i}$$ where $w_{ij}$ is the $i$-th Walsh-Hadamard transform component of the block $j$. In addition, we convert all $t_{ij}$ into $p$-values:
$$p{\rm -value} = P_{ij} = {\rm Pr}(X\ge t_{ij}) = 1-\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t_{ij}} e^{-x^2/2}\, dx$$ where $t_{ij}\sim N(0,1)$, i.e. it has a normal distribution with zero mean and unit standard deviation. Are you still with me? Great! The Python code for this part of our analysis may look the following way:

def tstat(x,a,b,M): # specify the probability of occurrence of the digit "1" p=0.5 print("Computation of WHTs...") for j in xrange(a): hwt, _, _ = WHT(x[j]) if(j==0): y=hwt else: y=np.vstack((y,hwt)) # WHT for xseq print(" ...completed") print("Computation of t-statistics..."), t=[]; for j in xrange(a): # over sequences/blocks (rows) for i in xrange(b): # over sequence's elements (columns) if(i==0): if(p==0.5): m0j=0 else: m0j=(2.**M/2.)*(1.-2.*p) sig0j=sqrt((2**M/2)*p*(1.-p)) w0j=y[j][i] t0j=(w0j-m0j)/sig0j t.append(t0j) else: sigij=sqrt((2.**((M+2.)/2.))*p*(1.-p)) wij=y[j][i] tij=wij/sigij t.append(tij) t=np.array(t) print("completed") print("Computation of p-values..."), # standardised t-statistics; t_{i,j} ~ N(0,1) t=(t-np.mean(t))/(np.std(t)) # p-values = 1-[1/sqrt(2*pi)*integral[exp(-x**2/2),x=-inf..t]] P=1-ndtr(t) print("completed\n") return(t,P,y)

and returns, as the output, two arrays storing the $t_{ij}$ statistics for every $w_{ij}$ (variable t) and the corresponding $p$-values $P_{ij}$ (variable P).
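As a quick plug-in of those formulas (purely illustrative numbers, using the unnormalised transform components of the definitions above): for $p=0.5$ and blocks of $2^m$ elements with $m=5$,
$$m_i = 0, \qquad \sigma_i = \sqrt{2^{m+2}\,p(1-p)} = \sqrt{2^{7}\cdot 0.25} = \sqrt{32} \approx 5.66,$$
so a single component of $w_{ij}=8$ gives $t_{ij}=8/5.66\approx 1.41$ and a $p$-value of $1-\Phi(1.41)\approx 0.08$.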

Continuing the above example, we extend its body by adding what follows:

28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 from WalshHadamard import tstat   (t, P, xseqWHT) = tstat(xseq,a,b,M)   print("t =") print(t) print("\np-values =") print(P) print("\nw =") print(xseqWHT)   # display WHTs for "a" sequences plt.imshow(xseqWHT, interpolation='nearest', cmap='PuOr') plt.colorbar() plt.xlabel("Sequence's Elements"); plt.ylabel("Sequence Number") plt.title("Walsh-Hadamard Transforms") plt.show()

to derive:

Computation of WHTs... ...completed Computation of t-statistics... completed Computation of p-values... completed   t = [ 0.26609308 0.72730106 -0.69960094 ..., -1.41305193 -0.69960094 0.01385006]   p-values = [ 0.39508376 0.23352077 0.75791172 ..., 0.92117977 0.75791172 0.4944748 ]   w = [[ 0.125 0.125 -0.125 ..., 0.125 -0.125 -0.125 ] [ 0.1875 -0.1875 -0.0625 ..., -0.0625 0.0625 -0.0625] [ 0.0625 0.0625 -0.4375 ..., -0.0625 -0.0625 -0.0625] ..., [-0.1875 -0.3125 0.1875 ..., 0.1875 0.1875 -0.1875] [-0.125 0.125 -0.125 ..., -0.125 -0.125 -0.375 ] [ 0. 0.125 -0.25 ..., -0.25 -0.125 0. ]]

and, at the end, we display the WHTs computed for every single (horizontal) sequence of the $x_{\rm seq}$ array, as stored in the $w$ matrix by the tstat function:

Again, every element (a square) in the above figure corresponds to a $w_{ij}$ value. I would also like to draw your attention to the fact that both arrays $t$ and $P$ are 1D vectors storing all t-statistics and $p$-values, respectively. In the following WHT tests we will make an additional effort to split them into $a\times b$ matrices corresponding exactly to the $w$-like arrangement. That’s an easy part, so don’t worry too much about it.

2.3. Statistical Test Framework

In general, here, we are interested in verification of two opposite statistical hypotheses. The concept of hypothesis testing is given in every textbook on the subject. We test our binary signal $x(t)$ for randomness. Therefore, $H_0$: $x(t)$ is generated by a binary memory-less source i.e. the signal does not contain any predictable component; $H_1$: $x(t)$ is not produced by a binary memory-less source, i.e. the signal contains a predictable component.

Within the standard framework of hypothesis testing we can make two errors. The first one refers to $\alpha$ (so-called significance level) and denotes the probability of occurrence of a false positive result. The second one refers to $\beta$ and denotes the probability of the occurrence of a false negative result.

The testing procedure that can be applied here is: for a fixed value of $\alpha$ we find a confidence region for the test statistic and check if the statistical test value is in the confidence region. The confidence limits are computed using the quantiles $u_{\alpha/2}$ and $u_{1-\alpha/2}$ (unless specified otherwise in the text of the test). Alternatively, if an arbitrary $t_{\rm stat}$ is the value of the test statistic (test function), we may compare the $p{\rm -value}={\rm Pr}(X\ge t_{\rm stat})$ with $\alpha$ and decide on randomness when the $p$-value $\ge\alpha$.

2.4. Crude Decision (Test 1)

The first WHT test from the Op09 suite is a crude decision, or majority decision. For a chosen $\alpha$, with $u_\alpha$ denoting the quantile of order $\alpha$ of the normal distribution, if:
$$t_{ij} \notin [u_{\alpha/2}; u_{1-\alpha/2}]$$ then we reject the hypothesis of randomness regarding the $i$-th test statistic of the signal $x(t)$ at the significance level of $\alpha$. Jot down both $j$ and $i$, corresponding to the sequence number and the sequence’s element, respectively. Op09 suggest that this test is suitable for small numbers of $a<1/\alpha$, which is generally always fulfilled for our data. We code this test in Python as follows:

def test1(cl,t,a,b,otest): alpha=1.-cl/100. u1=norm.ppf(alpha/2.) u2=norm.ppf(1-alpha/2.) Results1=[] for l in t: if(l<u1 or l>u2): Results1.append(0) else: Results1.append(1) nfail=a*b-np.sum(Results1) print("Test 1 (Crude Decision)") print(" RESULT: %d out of %d test variables stand for " \ "randomness" % (a*b-nfail,a*b)) if((a*b-nfail)/float(a*b)>.99): print("\t Signal x(t) appears to be random") else: print("\t Signal x(t) appears to be non-random") otest.append(100.*(a*b-nfail)/float(a*b)) # gather per cent of positive results print("\t at %.5f%% confidence level" % (100.*(1.-alpha))) print return(otest)

Op09’s decision on the rejection of $H_0$ is too stiff. In the function we calculate the number of $t_{ij}$’s falling outside the test interval. If their number exceeds 1%, we claim a lack of evidence of randomness for $x(t)$ as a whole.

2.5. Proportion of Sequences Passing a Test (Test 2)

Recall that for each row (sub-sequence of $x(t)$) and its elements, we have computed both the $t_{ij}$’s and the $P_{ij}$ values. Let’s use the latter here. In this test, first, for every row of the (re-shaped) $P$ 2D array we count the number of $p$-values with $P_{ij}\lt\alpha$. If this number is greater than zero, the $j$-th sub-sequence of $x(t)$ fails to pass the test at the significance level of $\alpha$. Over all $a$ sub-sequences we count the total number of those which did not pass the test, $n_2$. If:
$$n_2 \notin \left[ a\alpha + \sqrt{a \alpha (1-\alpha)}\, u_{\alpha/2};\ a\alpha + \sqrt{a \alpha (1-\alpha)}\, u_{1-\alpha/2} \right]$$ then there is evidence that the signal $x(t)$ is non-random.
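For purely illustrative numbers, with $a=32$ sub-sequences and $\alpha=0.001$ the interval evaluates to roughly
$$\left[\, 0.032 - 0.179\times 3.29;\ 0.032 + 0.179\times 3.29 \,\right] \approx [-0.56;\ 0.62\,],$$
so any $n_2\ge 1$ would already fall outside it.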

We code this test simply as:

def test2(cl,P,a,b,otest): alpha=1.-cl/100. u1=norm.ppf(alpha/2.) u2=norm.ppf(1-alpha/2.) Results2=[] rP=np.reshape(P,(a,b)) # turning P 1D-vector into (a x b) 2D array! for j in xrange(a): tmp=rP[j][(rP[j]<alpha)] #print(tmp) if(len(tmp)>0): Results2.append(0) # fail for sub-sequence else: Results2.append(1) # pass   nfail2=a-np.sum(Results2) # total number of sub-sequences which failed t2=nfail2/float(a) print("Test 2 (Proportion of Sequences Passing a Test)") b1=alpha*a+sqrt(a*alpha*(1-alpha))*u1 b2=alpha*a+sqrt(a*alpha*(1-alpha))*u2 if(t2<b1 or t2>b2): print(" RESULT: Signal x(t) appears to be non-random") otest.append(0.) else: print(" RESULT: Signal x(t) appears to be random") otest.append(100.) print("\t at %.5f%% confidence level" % (100.*(1.-alpha))) print return(otest)

This test is also described well by Rukhin et al. (2010) though we follow the method of Op09 adjusted for the proportion of $p$-values failing the test for randomness as counted sub-sequence by sub-sequence.

2.7. Uniformity of p-values (Test 3)

In this test, the distribution of $p$-values is examined to ensure uniformity. This may be visually illustrated using a histogram, whereby, the interval between 0 and 1 is divided into $K=10$ sub-intervals, and the $p$-values, i.e. in our case $P_{ij}$’s, that lie within each sub-interval are counted and displayed.

Uniformity may also be determined via an application of a $\chi^2$ test and the determination of a $p$-value corresponding to the Goodness-of-Fit Distributional Test on the $p$-values obtained for an arbitrary statistical test (i.e., a $p$-value of the $p$-values). We accomplish that via computation of the test statistic:
$$\chi^2 = \sum_{i=1}^{K} \frac{\left( F_i-\frac{a}{K} \right)^2}{\frac{a}{K}}$$ where $F_i$ is the number of $P_{ij}$ in the histogram’s bin of $i$, and $a$ is the number of sub-sequences of $x(t)$ we investigate.

We reject the hypothesis of randomness regarding the $i$-th test statistic of $x(t)$ at the significance level of $\alpha$ if:
$$\chi^2_i \notin [0; \chi^2(\alpha, K-1)] \ .$$ Here $\chi^2(\alpha, K-1)$ denotes the upper critical value of the $\chi^2(K-1)$ distribution at significance level $\alpha$ (i.e. its quantile of order $1-\alpha$). In Python we may calculate it in the following way:

from scipy.stats import chi2

alpha = 0.001
K = 10
# Chi2(alpha, df) is the article's helper returning the upper critical value
# of the chi-squared distribution with df degrees of freedom (presumably a
# wrapper around chi2.ppf(1 - alpha, df))
print(Chi2(alpha, K - 1))

If the Test 3 value satisfies $\chi_i^2\le \chi^2(\alpha, K-1)$, we count the $i$-th statistic as not standing against the randomness of $x(t)$. This is equivalent to requiring the corresponding $p$-value (the $p$-value of the $p$-values) to satisfy $p\ge\alpha$. The latter can be computed in Python as:

from scipy.special import gammainc

alpha = 0.001
K = 10
# chi2_test denotes the Test 3 statistic derived above
pvalue = 1 - gammainc((K - 1)/2., chi2_test/2.)

where gammainc(a,x) stands for the regularized lower incomplete gamma function defined as:
$$P(a,x) = \frac{1}{\Gamma(a)} \int_0^x e^{-t} t^{a-1} dt$$ and $\Gamma(a)$ denotes the standard gamma function.
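As a quick sanity check (an addition to the original text), this $p$-value formula agrees with SciPy's chi-squared survival function, since the $\chi^2(K-1)$ CDF is exactly the regularized lower incomplete gamma function evaluated at $((K-1)/2,\ \chi^2/2)$:

from scipy.special import gammainc
from scipy.stats import chi2

K = 10
chi2_test = 12.3   # an arbitrary example value of the Test 3 statistic
p1 = 1 - gammainc((K - 1)/2., chi2_test/2.)
p2 = chi2.sf(chi2_test, K - 1)
print(p1, p2)      # both print the same value (approx. 0.2)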

Given that, we code Test 3 of Uniformity of $p$-values in the following way:

def test3(cl,P,a,b,otest): alpha=1.-cl/100. rP=np.reshape(P,(a,b)) rPT=rP.T Results3=0 for i in xrange(b): (hist,bin_edges,_)=plt.hist(rPT[i], bins=list(np.arange(0.0,1.1,0.1))) F=hist K=len(hist) # K=10 for bins as defined above S=a chi2=0 for j in xrange(K): chi2+=((F[j]-S/K)**2.)/(S/K) pvalue=1-gammainc(9/2.,chi2/2.) if(pvalue>=alpha and chi2<=Chi2(alpha,K-1)): Results3+=1 print("Test 3 (Uniformity of p-values)") print(" RESULT: %d out of %d test variables stand for randomness"\ % (Results3,b)) if((Results3/float(b))>.99): print("\t Signal x(t) appears to be random") else: print("\t Signal x(t) appears to be non-random") otest.append(100.*(Results3/float(b))) print("\t at %.5f%% confidence level" % (100.*(1.-alpha))) print return(otest)

where again we allow fewer than 1% of all results to fail without rejecting $x(t)$ as a random signal.

Please note the transposition of the rP matrix. The reason for testing the WHT values (converted to $t_{ij}$’s or $P_{ij}$’s) column-wise is to detect autocorrelation patterns in the tested signal $x(t)$. The same approach is applied by Op09 in Test 4 and Test 5, discussed in the following sections, and constitutes the core of the invention and creativity added to Rukhin et al. (2010)’s suite of 16 tests for signals generated by various PRNGs, designed to meet cryptographic levels of acceptance as close-to-truly random.

2.8. Maximum Value Decision (Test 4)

This test is again based on the confidence level approach. Let $T_{i}=\max_j t_{ij}$; then if:
$$T_{i} \notin \left[ u_{\left(\frac{\alpha}{2}\right)^{a^{-1}}}; u_{\left(1-\frac{\alpha}{2}\right)^{a^{-1}}} \right]$$ we reject the hypothesis of randomness (regarding the $i$-th test function) of the signal $x(t)$ at the significance level of $\alpha$. We encode it in Python simply as:

def test4(cl,t,a,b,otest): alpha=1.-cl/100. rt=np.reshape(t,(a,b)) rtT=rt.T Results4=0 for i in xrange(b): tmp=np.max(rtT[i]) u1=norm.ppf((alpha/2.)**(1./a)) u2=norm.ppf((1.-alpha/2.)**(1./a)) if not(tmp<u1 or tmp>u2): Results4+=1 print("Test 4 (Maximum Value Decision)") print(" RESULT: %d out of %d test variables stand for randomness" % (Results4,b)) if((Results4/float(b))>.99): print("\t Signal x(t) appears to be random") else: print("\t Signal x(t) appears to be non-random") otest.append(100.*(Results4/float(b))) print("\t at %.5f%% confidence level" % (100.*(1.-alpha))) print return(otest)

Pay attention to how this test looks at the results derived from the WHTs. It is sensitive to the distribution of the maximal values across the $i$-th elements of the $t$-statistics.

2.9. Sum of Square Decision (Test 5)

The final test makes use of the $C$-statistic defined as:
$$C_i = \sum_{j=0}^{a-1} t_{ij}^2 \ .$$ If $C_i \notin [0; \chi^2(\alpha, a)]$, we reject the hypothesis of randomness of $x(t)$ at the significance level of $\alpha$ regarding the $i$-th test function. The Python implementation of this test takes the form:

def test5(cl,t,a,b,otest): alpha=1.-cl/100. rt=np.reshape(t,(a,b)) rtT=rt.T Results5=0 for i in xrange(b): Ci=0 for j in xrange(a): Ci+=(rtT[i][j])**2. if(Ci<=Chi2(alpha,a)): Results5+=1 print("Test 5 (Sum of Square Decision)") print(" RESULT: %d out of %d test variables stand for randomness" % (Results5,b)) if((Results5/float(b))>.99): print("\t Signal x(t) appears to be random") else: print("\t Signal x(t) appears to be non-random") otest.append(100.*(Results5/float(b))) print("\t at %.5f%% confidence level" % (100.*(1.-alpha))) print return(otest)

Again, we allow for 1% of false negative results.

2.10. The Overall Test for Randomness of Binary Signal

We accept the signal $x(t)$ as random if the average passing rate from all five WHT statistical tests is at least 99%, i.e. up to 1% can be due to false negative results, at the significance level of $\alpha$.

def overalltest(cl,otest): alpha=1.-cl/100. line() print("THE OVERALL RESULT:") if(np.mean(otest)>=99.0): print(" Signal x(t) displays an evidence for RANDOMNESS"), T=1 else: print(" Signal x(t) displays an evidence for NON-RANDOMNESS"), T=0 print("at %.5f%% c.l." % (100.*(1.-alpha))) print(" based on Walsh-Hadamard Transform Statistical Test\n") return(T)

and run all 5 tests by calling the following function:

def WHTStatTest(cl,X): (xseq,xt,a,b,M) = xsequences(X) info(X,xt,a,b,M) if(M<7): line() print("Error: Signal x(t) too short for WHT Statistical Test") print(" Acceptable minimal signal length: n=2^7=128\n") else: if(M>=7 and M<19): line() print("Warning: Statistically advisable signal length: n=2^19=524288\n") line() print("Test Name: Walsh-Hadamard Transform Statistical Test\n") (t, P, _) = tstat(xseq,a,b,M) otest=test1(cl,t,a,b,[]) otest=test2(cl,P,a,b,otest) otest=test3(cl,P,a,b,otest) otest=test4(cl,t,a,b,otest) otest=test5(cl,t,a,b,otest) T=overalltest(cl,otest) return(T) # 1 if x(t) is random

fed with the binary $\pm 1$ signal $X$ (see the example in Section 2.1). The last function returns the variable T, storing $1$ for the overall decision that $x(t)$ is random and $0$ otherwise. It can be used for a large number of repeated WHT tests over different signals in a loop, e.g. for determining the ratio of instances in which the WHT Statistical Test passed.
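A minimal sketch of such a loop (an addition to the original code; the number of trials and the signal length chosen here are arbitrary):

from WalshHadamard import WHTStatTest
from random import randrange

cl = 99.9999
ntrials = 20                         # arbitrary number of repeated tests
passed = 0
for k in range(ntrials):
    X = [randrange(-1, 2, 2) for i in range(2**16)]   # fresh +/-1 signal
    passed += WHTStatTest(cl, X)     # 1 if x(t) appears random, 0 otherwise
print("Pass ratio: %.2f%%" % (100.*passed/ntrials))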

3. Randomness of random()

I know what you are thinking right now: I have just spent an hour reading all that stuff, so how about some real-life tests? I’m glad you asked! Here we go! We start with the Mersenne Twister algorithm, the Rolls-Royce engine of Python’s random() function (and its derivatives). The whole fun of the theoretical part given above comes down to a few lines of code, as given below.

Let’s see our WHT Suite of 5 Statistical Tests in action for a very long (of length $n=2^{21}$) random binary $\pm 1$ signal. Let’s run the exemplary main program calling the test:

# Walsh-Hadamard Transform and Tests for Randomness of Financial Return-Series
# (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
#
# Mersenne Twister PRNG test using WHT Statistical Test

from WalshHadamard import WHTStatTest, line
from random import randrange

# define confidence level for WHT Statistical Test
cl=99.9999

# generate random binary signal X(t)
X=[randrange(-1,2,2) for i in xrange(2**21)]

line()
print("X(t) =")
for i in range(20):
    print(X[i]),
print("...")

WHTStatTest(cl,X)

which returns lovely results, for instance:

---------------------------------------------------------------------- X(t) = -1 -1 1 1 1 -1 1 -1 1 1 1 1 1 -1 -1 1 1 -1 1 1 ... ---------------------------------------------------------------------- Signal X(t) of length n = 2097152 digits trimmed to x(t) of length n = 2097152 digits (n=2^21) split into a = 1024 sub-sequences b = 2048-digit long   ---------------------------------------------------------------------- Test Name: Walsh-Hadamard Transform Statistical Test   Computation of WHTs... ...completed Computation of t-statistics... completed Computation of p-values... completed   Test 1 (Crude Decision) RESULT: 2097149 out of 2097152 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   Test 2 (Proportion of Sequences Passing a Test) RESULT: Signal x(t) appears to be random at 99.99990% confidence level   Test 3 (Uniformity of p-values) RESULT: 2045 out of 2048 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   Test 4 (Maximum Value Decision) RESULT: 2047 out of 2048 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   Test 5 (Sum of Square Decision) RESULT: 2048 out of 2048 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   ---------------------------------------------------------------------- THE OVERALL RESULT: Signal x(t) displays an evidence for RANDOMNESS at 99.99990% c.l. based on Walsh-Hadamard Transform Statistical Test

where we assumed the significance level of $\alpha=0.000001$. Impressive indeed!

At the same $\alpha$, however, for shorter signal sub-sequences ($a=256; n=2^{16}$), we still get a significant number of passed tests supporting the randomness of random(), for instance:

---------------------------------------------------------------------- X(t) = -1 -1 1 -1 1 1 1 1 1 -1 -1 -1 -1 -1 -1 1 1 -1 1 -1 ... ---------------------------------------------------------------------- Signal X(t) of length n = 65536 digits trimmed to x(t) of length n = 65536 digits (n=2^16) split into a = 256 sub-sequences b = 256-digit long   ---------------------------------------------------------------------- Warning: Statistically advisable signal length: n=2^19=524288   ---------------------------------------------------------------------- Test Name: Walsh-Hadamard Transform Statistical Test   Computation of WHTs... ...completed Computation of t-statistics... completed Computation of p-values... completed   Test 1 (Crude Decision) RESULT: 65536 out of 65536 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   Test 2 (Proportion of Sequences Passing a Test) RESULT: Signal x(t) appears to be random at 99.99990% confidence level   Test 3 (Uniformity of p-values) RESULT: 254 out of 256 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   Test 4 (Maximum Value Decision) RESULT: 255 out of 256 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   Test 5 (Sum of Square Decision) RESULT: 256 out of 256 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   ---------------------------------------------------------------------- THE OVERALL RESULT: Signal x(t) displays an evidence for RANDOMNESS at 99.99990% c.l. based on Walsh-Hadamard Transform Statistical Test

For this random binary signal $x(t)$, the Walsh-Hadamard Transforms for the first 64 signal sub-sequences reveal pretty “random” distributions of $w_{ij}$ values:

4. Randomness of Financial Return-Series

Eventually, in the grand finale, we stand face to face with the question: can financial return time-series be of a random nature? More precisely, if we convert any return-series (regardless of time step, frequency of trading, etc.) to a binary $\pm 1$ signal, do the corresponding positive and negative returns occur randomly or not? Good question.
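In the code further below the conversion is handled by a ret2bin function imported from the WalshHadamard module. Since that function is not listed in this post, here is a possible minimal implementation (an assumption on my part): positive returns map to $+1$, all other returns to $-1$.

import numpy as np

def ret2bin(r):
    # map a return-series to a binary +/-1 signal by the sign of each return
    r = np.asarray(r, dtype=float)
    return list(np.where(r > 0., 1., -1.))

print(ret2bin([0.0012, -0.0005, 0.0000, 0.0031]))   # [1.0, -1.0, -1.0, 1.0]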

Let’s see, using the example of the 30-min FX return-series of the USDCHF currency pair traded between Sep/2009 and Nov/2014, how our WHT test for randomness works. We use tick-data downloaded from Pepperstone.com and rebinned into an evenly spaced 30-min price series, as described in my earlier post Rebinning Tick-Data for FX Algo Traders. We read the data in Python and convert the price-series into a return-series.

What follows is similar to what we have done in the previous examples. First, we convert the return-series into a binary signal. As we will see, the signal $X(t)$ is 65731 points long and can be split into 256 sub-sequences, each 256 points long; therefore $n=2^{16}$ stands for the trimmed signal $x(t)$. For the sake of clarity, we plot the first 64 segments (256-point long) of the USDCHF price-series, marking all segments with vertical lines. Next, we run the WHT Statistical Test for all $a=256$ sequences but display the WHTs only for the first 64 blocks, as visualised for the price-series. The main code takes the form:

# Walsh-Hadamard Transform and Tests for Randomness of Financial Return-Series # (c) 2015 QuantAtRisk.com, by Pawel Lachowicz # # 30-min FX time-series (USDCHF) traded between 9/2009 and 11/2014   from WalshHadamard import WHTStatTest, ret2bin, line as Line import matplotlib.pyplot as plt import numpy as np import csv   # define confidence level for WHT Statistical Test cl=99.9999   # open the file and read in the 30min price of USD/CHF P=[] with open("USDCHF.30m") as f: c = csv.reader(f, delimiter=' ', skipinitialspace=True) for line in c: price=line[6] P.append(price)   x=np.array(P,dtype=np.float128) # convert to numpy array r=x[1:]/x[0:-1]-1. # get a return-series Line() print("r(t) =") for i in range(7): print("%8.5f" % r[i]), print("...")   X=ret2bin(r) print("X(t) =") for i in range(7): print("%8.0f" %X[i]), print("...")   plt.plot(P) plt.xlabel("Time (from 1/05/2009 to 27/09/2010)") plt.ylabel("USDCHF (30min)") #plt.xlim(0,len(P)) plt.ylim(0.95,1.2) plt.xlim(0,64*256) plt.gca().xaxis.set_major_locator(plt.NullLocator()) for x in range(0,256*265,256): plt.hold(True) plt.plot((x,x), (0,10), 'k-') plt.show() plt.close("all")   WHTStatTest(cl,X)

and displays both plots as follows: (a) USDCHF clipped price-series,

and (b) the WHTs for the first 64 sequences, each 256 points long,

From the comparison of both figures one can understand the level of detail at which the WHT results are derived. Interestingly, for the FX return-series the WHT picture seems to be quite non-uniform, suggesting that our USDCHF return-series is random. Is it so? The final answer is delivered by the results of the WHT statistical test, summarised by our program as follows:

---------------------------------------------------------------------- r(t) = -0.00092 -0.00033 -0.00018 0.00069 -0.00009 -0.00003 -0.00086 ... X(t) = -1 -1 -1 1 -1 -1 -1 ... ---------------------------------------------------------------------- Signal X(t) of length n = 65731 digits trimmed to x(t) of length n = 65536 digits (n=2^16) split into a = 256 sub-sequences b = 256-digit long   ---------------------------------------------------------------------- Warning: Statistically advisable signal length: n=2^19=524288   ---------------------------------------------------------------------- Test Name: Walsh-Hadamard Transform Statistical Test   Computation of WHTs... ...completed Computation of t-statistics... completed Computation of p-values... completed   Test 1 (Crude Decision) RESULT: 65536 out of 65536 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   Test 2 (Proportion of Sequences Passing a Test) RESULT: Signal x(t) appears to be random at 99.99990% confidence level   Test 3 (Uniformity of p-values) RESULT: 250 out of 256 test variables stand for randomness Signal x(t) appears to be non-random at 99.99990% confidence level   Test 4 (Maximum Value Decision) RESULT: 255 out of 256 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   Test 5 (Sum of Square Decision) RESULT: 256 out of 256 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   ---------------------------------------------------------------------- THE OVERALL RESULT: Signal x(t) displays an evidence for RANDOMNESS at 99.99990% c.l. based on Walsh-Hadamard Transform Statistical Test

Nice!! Wasn’t it worth all the effort to see this result?!

I’ll leave you with the pure joy of using the software I created. There is a lot to say and a lot to verify. Even the code itself can be amended a bit and adjusted for different numbers of sub-sequences $a$ and lengths $b$. If you discover strong evidence for non-randomness, post it in the comments or drop me an e-mail. I can’t do your homework. It’s your turn. I need to sleep sometimes… ;-)

REFERENCES

## Covariance Matrix for N-Asset Portfolio fed by Quandl in Python

Constructing your quantitative workshop in Python requires a lot of coding, or at least a considerable amount of time spent assembling different blocks together. There are many simple fragments of code reused many times. The calculation of a covariance matrix is not a problem once NumPy is engaged, but it gains meaning only once you add some background idea of what you are trying to achieve.

In this lesson of the Accelerated Python for Quants tutorial, let’s see how to use the Quandl.com data provider to construct an $N$-asset portfolio of SEC securities and how to calculate the corresponding covariance matrix of their return-series.

Quandl and SEC Stock List

Quandl is a great source of data. With their ambition to become the largest free-of-charge data provider on the planet, no doubt they do an amazing job. You can use their Python API to feed your code directly with Open, High, Low, Close prices for any SEC stock. To begin with, we will need a list of companies (tickers) and, unfortunately, the corresponding internal call-tickers as referred to by Quandl. The .csv file containing all the information can be downloaded from this website or directly here: secwiki_tickers.csv. The file contains the following header:

Ticker Name Sector Industry Price Collection

where we are interested in matching the Ticker with the Price field. For the AAPL stock the latter displays the “WIKI/AAPL” code. That’s all we need to grab for now.

Our Portfolio

Let’s say we have the freedom to select any of the 2277 stocks from our SEC universe (provided in the secwiki_tickers.csv file). For the sake of simplicity I’ll select only three and save them in a plain text file portfolio.lst containing:

AAPL IBM TXN

I do it on purpose in order to show you how easily we can read this list from a file in Python. As usual, we start our adventure with the data pre-processing part:

1 2 3 4 5 6 7 8 9 10 11 # Covariance Matrix for N-Asset Portfolio fed by Quandl in Python # (c) 2014 QuantAtRisk, by Pawel Lachowicz   import Quandl import numpy as np import pandas as pd   df=pd.read_csv('secwiki_tickers.csv')   dp=pd.read_csv('portfolio.lst',names=['pTicker']) pTickers=dp.pTicker.values # converts into a list

In order to install the Quandl module in your Mac/Linux environment, simply type pip install Quandl (for more information see here). In the code above, we employ pandas’ read_csv function both for reading in the data from the .csv file and from the plain text file. For the latter, in line #10, we add a column name, pTicker, to point at the portfolio tickers, and next we convert the pandas DataFrame column into a plain array of tickers.

Now, we are going to use the power of Python over Matlab to search for the corresponding Quandl (Price) ticker codes:

13 14 15 16 17 18 19 20 21 tmpTickers=[] for i in range(len(pTickers)): test=df[df.Ticker==pTickers[i]] if not(test.empty): tmp=test.Price.values+'.4' # of <type 'numpy.ndarray'> tmp2=tmp.tolist() tmpTickers.append(tmp2)   print(tmpTickers)

The search is executed in line #15 almost automatically. Its result is a DataFrame record (empty, or containing the corresponding information on the stock) as read out from the df .csv data source. Simply, if a ticker from our portfolio.lst file does not exist in the universe, it is skipped. Quandl allows you to retrieve information on a stock’s Open, High, Low, Close, and Volume by calling the (Price) Quandl ticker in the following forms, respectively:

'WIKI/AAPL.1' 'WIKI/AAPL.2' 'WIKI/AAPL.3' 'WIKI/AAPL.4' 'WIKI/AAPL.5'

Below we will stick to Close prices. That is why, in line #17, we add '.4' to the string. Please note that we should get the same data by calling the 'GOOG/NASDAQ_AAPL.4' and 'WIKI/AAPL.4' Quandl tickers. Because the variable tmp is of numpy.ndarray type, in line #18 we convert it into a list. The final list tmpTickers therefore contains all the corresponding Quandl tickers in the form:

[['WIKI/AAPL.4'], ['WIKI/IBM.4'], ['WIKI/TXN.4']]

and before they can be used for return-series retrieval from the Quandl database, we need to trim each item a bit in the following way:

23 24 25 26 27 28 29 30 tmp=[] for i in range(len(tmpTickers)): tmp2=str(tmpTickers[i]).strip('[]') print(tmp) tmp.append(str(tmp2).strip('\'\''))   QuandlTickers=tmp print(QuandlTickers)

which returns:

['WIKI/AAPL.4', 'WIKI/IBM.4', 'WIKI/TXN.4']

We fetch the data pretty easily:

32 33 34 35 36 37 data= Quandl.get(QuandlTickers, authtoken='YourAuthToken', \ trim_start='2014-10-01', trim_end='2014-11-04', \ transformation='rdiff')   d=data.values.T print(d)

where rdiff enforces the transformation of the price-series into a return-series expressed as the row-on-row % change, y'[t] = (y[t] - y[t-1])/y[t-1] (see more here: Using the Quandl API). By default we download daily returns:

[[-0.01558313 0.00725953 -0.0028028 0. -0.00873319 0.02075949 0.00218254 -0.00287072 -0.00913333 -0.01062018 -0.01225316 -0.01312282 0.01464783 0.02139859 0.0271652 0.00507466 0.01786581 0.00372031 -0.00104543 0.01550756 0.00562114 -0.00335383 0.00953449 0.01296296 -0.00731261] [-0.01401254 -0.00138911 0.0094163 0.0019611 -0.01761532 0.0196543 -0.01552598 -0.00262847 -0.01296187 0.00152572 -0.01115343 -0.01050894 0.0122887 -0.0711343 -0.03471319 -0.00882191 0.00241053 -0.0006166 -0.00129566 0.01068759 -0.00085575 0.00544476 0.00030423 -0.00024331 -0.01040399] [-0.01698469 nan nan -0.00416489 -0.01319035 0.020213 -0.01959949 -0.07127336 -0.0189518 0.00906272 0.01063578 0.01941066 0.00183528 0.01694527 0.05314118 -0.00320718 0.00815101 0.01212766 0.00798823 0.01147028 -0.00350515 -0.01655287 0.0448138 0.00845751 0.00638978]]
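To make the rdiff transformation concrete, here is a tiny illustration (an addition to the original post) of the same row-on-row percentage change computed by hand on a hypothetical price series:

import numpy as np

prices = np.array([100.0, 101.0, 99.5, 102.0])   # hypothetical close prices
rdiff = prices[1:]/prices[:-1] - 1.0              # y'[t] = (y[t]-y[t-1])/y[t-1]
print(rdiff)                                      # approx. [0.01, -0.0149, 0.0251]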

You may notice some missing data denoted by nan. It may happen for any dataset. Life is not perfect. Quandl is only an option. As a first approximation, we may fill the missing returns with zeros using pandas:

39 40 41 42 43 44 for i in range(d.shape[0]): df=pd.DataFrame(d[i]) df.fillna(0,inplace=True) d[i]=df.values.T   print(d)

which works quickly and efficiently:

[[-0.01558313 0.00725953 -0.0028028 0. -0.00873319 0.02075949 0.00218254 -0.00287072 -0.00913333 -0.01062018 -0.01225316 -0.01312282 0.01464783 0.02139859 0.0271652 0.00507466 0.01786581 0.00372031 -0.00104543 0.01550756 0.00562114 -0.00335383 0.00953449 0.01296296 -0.00731261] [-0.01401254 -0.00138911 0.0094163 0.0019611 -0.01761532 0.0196543 -0.01552598 -0.00262847 -0.01296187 0.00152572 -0.01115343 -0.01050894 0.0122887 -0.0711343 -0.03471319 -0.00882191 0.00241053 -0.0006166 -0.00129566 0.01068759 -0.00085575 0.00544476 0.00030423 -0.00024331 -0.01040399] [-0.01698469 0. 0. -0.00416489 -0.01319035 0.020213 -0.01959949 -0.07127336 -0.0189518 0.00906272 0.01063578 0.01941066 0.00183528 0.01694527 0.05314118 -0.00320718 0.00815101 0.01212766 0.00798823 0.01147028 -0.00350515 -0.01655287 0.0448138 0.00845751 0.00638978]]

Having data is better than missing them, but it is important to check how many nans every return-series has. In our case of TXN, the missing information constitutes 8% of the 25 data points. A good rule of thumb is to have less than 3%; otherwise the impact on any further calculations may propagate significantly. At this point it is good to recall Bruce Lee who said: “Empty your mind, be formless, shapeless, like water. If you put water into a cup, it becomes the cup. If you put water in a bottle, it becomes the bottle. If you put it in a teacup, it becomes the teacup. Now, water can flow, and it can crush. Be water, my friend.” The two added zeros contribute to the central part of the TXN distribution of returns; the more we “fill in” with zeros, the more peaked the distribution becomes and the thinner the tails of a fitted normal distribution appear. Therefore, in such a case, I would agree with Bruce only on the “crush” part of his wisdom.

Covariance Matrix

Keeping the above short note on some dirty tricks in mind, we obtain the desired covariance matrix simply and painlessly,

46 47 covmat=np.cov(d) print(covmat)

id est,

[[  1.44660397e-04  -2.13905277e-05   1.31742330e-04]
 [ -2.13905277e-05   3.12211511e-04  -5.62146677e-05]
 [  1.31742330e-04  -5.62146677e-05   5.44348868e-04]]

which accomplishes our efforts.


## GARCH(1,1) Model in Python

In my previous article, GARCH(p,q) Model and Exit Strategy for Intraday Algorithmic Traders, we described the essentials of the GARCH(p,q) model and provided an exemplary implementation in Matlab. In general, we apply a GARCH model in order to estimate the volatility one time-step forward, where:
$$\sigma_t^2 = \omega + \alpha r_{t-1}^2 + \beta \sigma_{t-1}^2$$ based on the most recent updates of $r$ and $\sigma$, where $r_{t-1} = \ln({P_{t-1}}/{P_{t-2}})$ and $P$ corresponds to an asset price. For any financial time-series, $\{r_j\}$, the estimation of the $(\omega,\alpha,\beta)$ parameters can be conducted using the maximum likelihood method. The latter is an iterative process of looking for the parameter set that maximises the log-likelihood sum:
$$\sum_{i=3}^{N} \left[ -\ln(\sigma_i^2) - \frac{r_i^2}{\sigma_i^2} \right]$$ where $N$ denotes the length of the return series $\{r_j\}$ ($j=2,\dots,N$) under study.

Let’s assume we have a test array of input data, $\{r_j\}$, stored in the Python variable r, and we write a function, GARCH11_logL, that will be used in the optimisation process. It takes two input parameters: param is a 3-element array with trial values corresponding to $(\omega,\alpha,\beta)$, and r denotes the return-series.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 # GARCH(1,1) Model in Python # uses maximum likelihood method to estimate (omega,alpha,beta) # (c) 2014 QuantAtRisk, by Pawel Lachowicz; tested with Python 3.5 only   import numpy as np from scipy import optimize import statistics as st   r = np.array([0.945532630498276, 0.614772790142383, 0.834417758890680, 0.862344782601800, 0.555858715401929, 0.641058419842652, 0.720118656981704, 0.643948007732270, 0.138790608092353, 0.279264178231250, 0.993836948076485, 0.531967023876420, 0.964455754192395, 0.873171802181126, 0.937828816793698])   def GARCH11_logL(param, r): omega, alpha, beta = param n = len(r) s = np.ones(n)*0.01 s[2] = st.variance(r[0:3]) for i in range(3, n): s[i] = omega + alpha*r[i-1]**2 + beta*(s[i-1]) # GARCH(1,1) model logL = -((-np.log(s) - r**2/s).sum()) return logL

At this point it is important to note that in line #32 we multiply the sum by $-1$ in order to find the maximal value of the expression. Why? Because we employ the optimize.fmin function from SciPy’s optimize module, which performs minimisation; minimising the negative log-likelihood is equivalent to maximising the log-likelihood. Therefore, we seek the best estimates as follows:

34 o = optimize.fmin(GARCH11_logL,np.array([.1,.1,.1]), args=(r,), full_output=1)

and we display the results,

36 37 38 R = np.abs(o[0]) print() print("omega = %.6f\nbeta = %.6f\nalpha = %.6f\n" % (R[0], R[2], R[1]))

which, for the array r as given in the code, returns the following results:

Optimization terminated successfully. Current function value: 14.705098 Iterations: 88 Function evaluations: 162   omega = 0.788244 beta = 0.498230 alpha = 0.033886

Python vs. Matlab Solution

Programming requires caution. It is always a good practice to test the outcome of one algorithm against alternative solutions. Let’s run the GARCH(1,1) model estimation for the same input array and compare Python and Matlab results:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 % GARCH(1,1) Model in Matlab 2013a % (c) 2014 QuantAtRisk, by Pawel Lachowicz   clear all; close all; clc;   r=[0.945532630498276, ... 0.614772790142383, ... 0.834417758890680, ... 0.862344782601800, ... 0.555858715401929, ... 0.641058419842652, ... 0.720118656981704, ... 0.643948007732270, ... 0.138790608092353, ... 0.279264178231250, ... 0.993836948076485, ... 0.531967023876420, ... 0.964455754192395, ... 0.873171802181126, ... 0.937828816793698]';   % GARCH(p,q) parameter estimation model = garch(1,1) % define model [fit,VarCov,LogL,Par] = estimate(model,r); % extract model parameters parC=Par.X(1); % omega parG=Par.X(2); % beta (GARCH) parA=Par.X(3); % alpha (ARCH) % estimate unconditional volatility gamma=1-parA-parG; VL=parC/gamma; volL=sqrt(VL); % redefine model with estimatated parameters model=garch('Constant',parC,'GARCH',parG,'ARCH',parA)

which returns:

model =   GARCH(1,1) Conditional Variance Model: -------------------------------------- Distribution: Name = 'Gaussian' P: 1 Q: 1 Constant: NaN GARCH: {NaN} at Lags [1] ARCH: {NaN} at Lags [1]   ____________________________________________________________ Diagnostic Information   Number of variables: 3   Functions Objective: @(X)OBJ.nLogLikeGaussian(X,V,E,Lags,1,maxPQ,T,nan,trapValue) Gradient: finite-differencing Hessian: finite-differencing (or Quasi-Newton)   Constraints Nonlinear constraints: do not exist   Number of linear inequality constraints: 1 Number of linear equality constraints: 0 Number of lower bound constraints: 3 Number of upper bound constraints: 3   Algorithm selected sequential quadratic programming     ____________________________________________________________ End diagnostic information Norm of First-order Iter F-count f(x) Feasibility Steplength step optimality 0 4 1.748188e+01 0.000e+00 5.758e+01 1 27 1.723863e+01 0.000e+00 1.140e-03 6.565e-02 1.477e+01 2 31 1.688626e+01 0.000e+00 1.000e+00 9.996e-01 1.510e+00 3 35 1.688234e+01 0.000e+00 1.000e+00 4.099e-02 1.402e+00 4 39 1.686305e+01 0.000e+00 1.000e+00 1.440e-01 8.889e-01 5 44 1.685246e+01 0.000e+00 7.000e-01 2.379e-01 5.088e-01 6 48 1.684889e+01 0.000e+00 1.000e+00 9.620e-02 1.379e-01 7 52 1.684835e+01 0.000e+00 1.000e+00 2.651e-02 2.257e-02 8 56 1.684832e+01 0.000e+00 1.000e+00 8.389e-03 7.046e-02 9 60 1.684831e+01 0.000e+00 1.000e+00 1.953e-03 7.457e-02 10 64 1.684825e+01 0.000e+00 1.000e+00 7.888e-03 7.738e-02 11 68 1.684794e+01 0.000e+00 1.000e+00 3.692e-02 7.324e-02 12 72 1.684765e+01 0.000e+00 1.000e+00 1.615e-01 5.862e-02 13 76 1.684745e+01 0.000e+00 1.000e+00 7.609e-02 8.429e-03 14 80 1.684740e+01 0.000e+00 1.000e+00 2.368e-02 4.072e-03 15 84 1.684739e+01 0.000e+00 1.000e+00 1.103e-02 3.142e-03 16 88 1.684739e+01 0.000e+00 1.000e+00 1.183e-03 2.716e-04 17 92 1.684739e+01 0.000e+00 1.000e+00 9.913e-05 1.378e-04 18 96 1.684739e+01 0.000e+00 1.000e+00 6.270e-05 2.146e-06 19 97 1.684739e+01 0.000e+00 7.000e-01 4.327e-07 2.146e-06   Local minimum possible. Constraints satisfied.   fmincon stopped because the size of the current step is less than the default value of the step size tolerance and constraints are satisfied to within the selected value of the constraint tolerance.   <stopping criteria details>     GARCH(1,1) Conditional Variance Model: ---------------------------------------- Conditional Probability Distribution: Gaussian   Standard t Parameter Value Error Statistic ----------- ----------- ------------ ----------- Constant 0.278061 26.3774 0.0105417 GARCH{1} 0.457286 49.4915 0.0092397 ARCH{1} 0.0328433 1.65576 0.0198358   model =   GARCH(1,1) Conditional Variance Model: -------------------------------------- Distribution: Name = 'Gaussian' P: 1 Q: 1 Constant: 0.278061 GARCH: {0.457286} at Lags [1] ARCH: {0.0328433} at Lags [1]

id est
$$(\omega,\beta,\alpha)_{\rm Matlab} = (0.278061,0.457286,0.0328433) \ .$$This differs from the Python solution, which was
$$(\omega,\beta,\alpha)_{\rm Py} =(0.788244,0.498230,0.033886) \ .$$At this stage it is difficult to assess which solution is “better”. Both algorithms and the applied methodologies may simply differ, which is usually the case. Having that in mind, further extensive tests are required, for example, of the dependence of the Python solution on the trial input $(\omega,\alpha,\beta)$ values as displayed in line #33 of the Python code.
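A minimal sketch of such a sensitivity test (an addition to the original code; it reuses GARCH11_logL and r defined above, and the list of starting points is arbitrary):

import numpy as np
from scipy import optimize

starts = [(.1, .1, .1), (.5, .2, .2), (.05, .05, .9), (.3, .1, .6)]
for p0 in starts:
    # re-run the estimation from a different trial (omega, alpha, beta)
    o = optimize.fmin(GARCH11_logL, np.array(p0), args=(r,),
                      full_output=1, disp=0)
    omega, alpha, beta = np.abs(o[0])
    print("start %-18s ->  omega=%.4f  alpha=%.4f  beta=%.4f"
          % (str(p0), omega, alpha, beta))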

## Quantitative Risk Assessment Model for Investment Options

Working in the superannuation industry in Australia has some great advantages. A nice atmosphere at work, gym sessions with colleagues during lunch time, endless talks about girls after hours. However, as Brian Tracy once said: When you go to work, you work. And this is true. And this is rewarding.

Superannuation is the Australian way of making people rich when they retire. Sadly, barely understood by many, it offers a wide palette of investment options for your super (9-12% of every salary is credited to the super account by the employer; regulated by law). Over 120 super funds across Australia, over 7000 options to choose from. All styles of investments, from cash options, through growth and diversified fixed interest, ending among high growth-high risk options. Trust me, the funds’ portfolio managers do their best to beat the benchmarks, the markets, and competing super funds.

The industry establishes standards. Regulations. Smarter and unified ways of reporting. On 29 June 2010, Australian Prudential Regulation Authority (APRA) issued a letter to superannuation trustees advising that APRA would be providing guidance on the disclosure of superannuation investment risk to fund members. The letter to trustees states that risk should be measured as the likely number of negative annual returns over any 20 year period. And this is where our story begins.

Risk Assessment

A working practice of super funds across Australia revealed that since 2013 the way of reporting risk (following the abovementioned definition) still leaves a lot behind the curtain. It’s not clear how the assessment of an individual investment option’s risk has been derived.

APRA’s guidance states that this classification system should help members ‘readily distinguish the characteristics of each investment strategy’. To achieve this, there needs to be sufficient categories to meaningfully differentiate between various options based on risk.

The funds report the risk score/label according to a seven-level classification system to provide sufficient granularity. The Joint ASFA/FSC Working Group’s analysis supports the claim that the number of negative annual periods over any 20 year period is likely to fall in the range of 0 to 7 for the majority of investment options:

Risk Band Risk Label Estimated number of negative annual returns over any 20 year period
1 Very Low Less than 0.5
2 Low 0.5 to less than 1
3 Low to Medium 1 to less than 2
4 Medium 2 to less than 3
5 Medium to High 3 to less than 4
6 High 4 to less than 6
7 Very High 6 or Greater

Having access to any investment option’s performance over past years does not fully solve the problem of its risk assessment following the guidance. Firstly, not every option’s lifetime is long enough to derive meaningful results; only on a few occasions does the maximal data span reach 10 years. Secondly, the data have gaps, so we cannot fetch monthly performance figures for every option. Thirdly, access to the data requires a huge amount of labour by dedicated people who put all the grains of sand into one jar and are then willing to sell it (e.g. the research house of SuperRatings in Sydney).

Lastly, knowing an option’s performance over the past 3 years, the questions arise: how should we estimate its risk score correctly? How many times in the upcoming 20-year period is the considered investment option going to record a negative annual return? We need a model.

Model

A quantitative idea standing behind the unification of APRA’s risk measure guidance for investment options can find its solution in the fundamentals of statistics. Below we design a way of estimating the answer to the problem in question.

If we recall the concept of the Binomial Distribution, we immediately recognise its potential application. Consider a time-series of monthly returns, $\{ r_i \}$, of the total length of $N$ months. If $N$ is larger than 12, we can calculate
$$m=(N-12)+1$$ times the annual return,
$$R_j = \left[ \prod_{i=j}^{j+11} (1+r_i) \right] - 1 \ \ \mbox{for} \ j = 1,\dots,m$$ where the $r_i$ are given as decimals and $j$ denotes a specific window of 12 consecutive months. $R_j$ should be understood as a rolling annual return, given the statistically justified minimum requirement of $N\ge 31$ months of data for a specific option. For example, if we have 5 years of data (60 months) then we are able to get $m=(60-12)+1=49$ test annual returns based on the uninterrupted data sequences. Here, the word test is crucial.

We build our risk score model in the following way. For any option of data length of $N$ months we construct a vector of $m$ annual returns $R$. We count the total number of negative values and denote it as $k$ out of $m$ trials. The probability of a single year to close up with the negative annual return is given as:
$$p=\frac{k}{m}$$ where $p$ can be assigned as the probability of success (in obtaining the negative annual return). Under the assumption of independent trials, one can find the probability of obtaining exactly $k$ successes out of, in general, $n$ trials making use of Binomial Distribution that is given by the probability mass function:
$$\Pr(X=k) = {{n}\choose{k}} p^k (1-p)^{n-k}$$ where obviously
$${{n}\choose{k}} = \frac{n!}{k!(n-k)!} \ \ .$$ From the data analysis of the option performance, we aim at calculating the probability of $k$ negative annual returns in any $n=20$ trials, namely:
$$\Pr(X=k) = {{20}\choose{k}} p^k (1-p)^{20-k}$$
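Before turning to the spreadsheet macro below, here is a compact Python sketch of the same logic (an addition to the original text; the function name, the simulated input, and the search over $k=1,\dots,7$ mirror the macro but are illustrative only):

import numpy as np
from math import factorial

def risk_band(monthly_returns, trials=20):
    # rolling 12-month (annual) returns
    r = np.asarray(monthly_returns, dtype=float)
    m = len(r) - 12 + 1
    R = np.array([np.prod(1. + r[j:j+12]) - 1. for j in range(m)])
    k = np.sum(R < 0.)                 # number of negative annual returns
    p = k/float(m)                     # probability of 'success'
    # most likely number of negative years out of 'trials', checked for 1..7
    def pmf(x):
        c = factorial(trials)/(factorial(x)*factorial(trials - x))
        return c * p**x * (1. - p)**(trials - x)
    return max(range(1, 8), key=pmf)   # risk band 1..7

np.random.seed(1)
print(risk_band(np.random.normal(0.005, 0.03, 60)))   # 5 years of dummy monthly data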

The model can be easily coded as an Excel/VBA macro. The core algorithm for assessing an option’s risk score can be grasped below:

If m > 0 Then 'probability of success Dim p As Double p = k / m 'probability for r=1..7 Dim pr As Double, trials As Integer, r As Integer, newton As Double Dim ans As Integer, v0 As Double, v1 As Double v0 = -1 v1 = -1 trials = 20 For r = 1 To 7 newton = Application.Fact(trials) / _ (Application.Fact(r) * Application.Fact(trials - r)) pr = newton * Application.Power(p, r) * Application.Power(1 - p, trials - r) If pr > v0 Then v0 = pr v1 = r End If Next r 'a new risk score If v1 = 1 Then scoreSR = 1 ' "Very Low" Else If v1 = 2 Then scoreSR = 2 ' "Low" Else If v1 = 3 Then scoreSR = 3 ' "Low to Medium" Else If v1 = 4 Then scoreSR = 4 ' "Medium" Else If v1 = 5 Then scoreSR = 5 ' "Medium to High" Else If v1 = 6 Then scoreSR = 6 ' "High" Else If v1 = 7 Then scoreSR = 7 ' "Very High" Else scoreSR = "" End If End If End If End If End If End If End If

The resulting solution may lead us to the reformulation of the risk assessment for individual investment options as presented in the following table:

Risk Band New Risk Label Estimated number of negative annual returns over any 20 year period ($k$)
1 Very Low 0 to 1
2 Low 2
3 Low to Medium 3
4 Medium 4
5 Medium to High 5
6 High 6
7 Very High 7 or Greater

Intuitively, any Cash Option which always records positive monthly returns is also ranked (by our model) as an option of Very Low risk. You can apply this model to any monthly return series since we play with the numbers. This is what quants do best. Number crunching!

Nasty Assumption

The motivation behind the model lies in one word you can find in the definition we underlined at the very beginning: Estimated number of negative annual returns over any 20 year period. ‘Any’ allows us to choose any random sequence of 12 consecutive monthly returns, and that pays off in a larger number of trials under the assumption of independence.

Now, right, the nasty piece of the model: the independence of individual trials. This is a cornerstone when it comes to the Binomial Distribution. As a smart quant you can question its validity within our model. I tackled this problem for a while and came to an empirical analogy supporting our line of defence.

The independence of individual trials can be justified by analogy with the following experiment. Imagine that a sheet of A4-format paper is available. The paper has a colour gradually changing from left to right, from black to light grey, and the colour gradient is constant in the vertical direction. We cut the sheet horizontally into a finite number of strips of paper of 1 cm height, however starting every new strip 1 cm to the right (the concept of a sliding window in time-series analysis). Any comparison of two adjacent strips of paper indicates a finite common area of the same colour as defined by the gradient: correlation. We put all the strips into one jar and churn it. Next we randomly pull strips out of the jar. Each strip is an independent trial out of all strips placed inside the jar, as the memory of the correlation between two formerly adjacent strips has been lost.

Invest smartly and don’t worry. Correlations are everywhere.

## Pre-Processing of Asset Price Series for Portfolio Optimization

Portfolio Optimization is a significant component of Matlab’s Financial Toolbox. It provides us with a ready-to-use solution for finding the optimal weights of the assets we consider for trading, derived from the historical asset performance. From a practical point of view, we can include it in our algorithmic trading strategy and backtest its applicability under different initial conditions. This is the subject of my next upcoming post. However, before we can enjoy the view from the peak, we need to climb the mountain first.

In Matlab, the portfolio is created as a dedicated object of the same name. It doesn’t read the raw stock data; we need to feed that beast. Two major ingredients satisfy the input: a vector of the expected asset returns and a covariance matrix. Matlab helps us to estimate these moments, but first we need to deliver the asset data in a digestible form.

In this post we will see how one can quickly download the stock data from the Internet based on our own stock selection and pre-process them for solving portfolio optimization problem in Matlab.

Initial Setup for Portfolio Object

Let’s say that at any point in time you have your own list of stocks you wish to buy. For simplicity, let’s also assume that the list contains only stocks traded on NYSE or NASDAQ. Since you have been a great fan of this game, now you are almost ready to buy what you jotted down in your ShoppingList.lst. Here, an example of 10 tech stocks:

AAPL AOL BIDU GOOG HPQ IBM INTC MSFT NVDA TXN

They will constitute your portfolio of stocks. The problem of portfolio optimization requires a look back in time in the space of returns obtained in trading by each stock. Based on them the Return Proxy and Risk Proxy can be found.

The return matrix $R$, of dimensions $(N-1)\times M$, where $N$ stands for the number of historical prices (e.g. daily, monthly, etc.) and $M$ for the number of stocks in our portfolio, is required by Matlab as an input. We will see how it works in the next post. For now, let’s focus solely on the creation of this matrix.

In the article Create a Portfolio of Stocks based on Google Finance Data fed by Quandl I discussed Quandl.com as an attractive data provider for US stocks. Here, we will follow this solution making use of Quandl resources to pull out the stock price series for our shopping list. Ultimately, we aim at building a function, here: QuandlForPortfolio, that does the job for us:

% Pre-Processing of Asset Price Series for Portfolio Optimization in Matlab
% (c) 2013, QuantAtRisk.com, by Pawel Lachowicz

clear all; close all; clc;

% Input Parameters
n=1*365;
tickers='ShoppingList.lst';
qcodes='QuandlStockCodeListUS.xlsx';

[X,Y,R,AssetList] = QuandlForPortfolio(n,tickers,qcodes);

We call this function with three input parameters. The first one, $n$, denotes the number of calendar days from today (counting backwards) for which we wish to retrieve the stock data. Usually, 365 calendar days will correspond to about 250$-$252 trading days. The second parameter is a path/file name to our list of stocks (desired to be taken into account in the portfolio optimisation process), while the last input defines the path/file name to the file storing the stocks’ tickers and associated Quandl Price Codes (see here for more details).

Feeding the Beast

The QuandlForPortfolio Matlab function is an extended version of the previously discussed solution. It contains an important correcting procedure for the data fetched from the Quandl servers. First, let’s have a closer look at the function itself:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 % Function assists in fetching Google Finance data from the Quandl.com % server for a given list of tickers of stocks traded on NYSE or % NASDAQ. Data are retrieved for last 'n' days with daily sampling. % % INPUT % n : number of calendar days from 'today' (e.g. 365 would % correspond to about 252 business days) % tickers : a path/file name of a text file listing tickers % qcodes : a path/file name of Excel workbook (.xlsx) containing a list % of tickers and Quandl Price Codes in the format of % [Ticker,Stock Name,Price Code,Ratios Code,In Market?] % OUTPUT % X0 : [Nx1] column vector with days % Y0 : [NxM] matrix with Close Prices for M stocks % R0 : [(N-1)xM] matrix of Retruns % AssetList : a list of tickers (cell array) % % (c) 2013, QuantAtRisk.com, by Pawel Lachowicz   function [X0,Y0,R0,AssetList0] = QuandlForPortfolio(n,tickers,qcodes) fileID = fopen(tickers); tmp = textscan(fileID, '%s'); fclose(fileID); AssetList=tmp{1}; % a list as a cell array   % Read in the list of tickers and internal Quandl codes % [~,text,~] = xlsread(qcodes); quandlc=text(:,1); % again, as a list in a cell array quandlcode=text(:,3); % corresponding Quandl's Price Code   date1=datestr(today-n,'yyyy-mm-dd'); % from date2=datestr(today,'yyyy-mm-dd'); % to   % Fetch the data from Quandl.com % QData={}; for i=1:length(AssetList) for j=1:length(quandlc) if(strcmp(AssetList{i},quandlc{j})) fprintf('%4.0f %s\n',i,quandlc{j}); fts=0; [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ... 'authcode','x',... 'start_date',date1,'end_date',date2,'collapse','daily'); QData{i}=fts; end end end   % Post-Processing of Fetched Data % % create a list of days across all tickers TMP=[]; for i=1:length(QData) tmp=fts2mat(QData{i},1); tmp=tmp(:,1); TMP=[TMP; tmp]; end ut=unique(TMP); % use that list to find these days that are not present % among all data sets TMP=[]; for i=1:length(QData) tmp=fts2mat(QData{i},1); tmp=tmp(:,1); TMP=[TMP; setdiff(ut,tmp)]; end ut=unique(TMP); % finally, extract Close Prices from FTS object and store them % in Y0 matrix, plus corresponding days in X0 X0=[]; Y0=[]; for i=1:length(QData) tmp=fts2mat(QData{i},1); cp=[]; for j=1:size(tmp,1) [r,~,~]=find(ut==tmp(j,1)); if(isempty(r)) cp=[cp; tmp(j,5)]; % column 5 corresponds to Close Price if(i<2) % create a time column vector listing days % common among all data sets X0=[X0; tmp(j,1)]; end end end Y0=[Y0 cp]; end % transform Close Prices into Returns, R(i)=cp(i)/cp(i-1)-1 R0=tick2ret(Y0); AssetList0=AssetList'; end

The main bottleneck comes from the fact that Matlab’s portfolio object demands an equal number of historical returns ($N-1$) in the matrix $R$ for all $M$ assets. We design the function in such a way that it sets a common timeframe for all stocks listed on our shopping list. Of course, we ensure that all stocks were traded in the markets for about the last $n$ days (a rough estimation).

Now, the timeframe of the last $n$ days should be understood as a first approximation. We fetch the data from Quandl (numeric date, Open, High, Low, Close, Volume) and save them in the cell array QData (lines #37-49) for each stock separately as FTS objects (Financial Time-Series objects; see the Financial Toolbox). However, it may occur that not every stock we fetched displays the same amount of data. That is why we need to investigate for which days and for which stocks we miss the data. We achieve that by scanning each FTS object and creating a unique list of all days for which we have data (lines #54-60).

Next, we loop again over the same data sets, but now we compare that list with the list of all dates for each stock individually (lines #63-69), capturing (line #67) those dates that are missing. Their complete list is stored as a vector in line #69. Eventually, given that, we are able to compile the full data set (e.g. Close Prices; here line #80) for all stocks in our portfolio, ensuring that we include only those dates for which we have prices across all $M$ assets (lines #70-91).
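For readers working on the Python side, the same “common dates only” alignment can be expressed with pandas in a few lines. This is only a sketch of the idea (the tickers and prices below are made up), not a port of the Matlab function above:

import pandas as pd

s1 = pd.Series([10., 11., 12.],
               index=pd.to_datetime(['2013-10-01', '2013-10-02', '2013-10-03']))
s2 = pd.Series([20., 21.],
               index=pd.to_datetime(['2013-10-01', '2013-10-03']))

Y0 = pd.concat({'AAA': s1, 'BBB': s2}, axis=1).dropna()  # keep shared dates only
R0 = Y0.pct_change().dropna()                            # matrix of returns
print(Y0)
print(R0)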

Beast Unleashed

We test our data pre-processing simply by running the block of code listed above, engaging the QuandlForPortfolio function, and we check the results in Matlab’s command window as follows:

>> whos X Y R AssetList
  Name             Size            Bytes  Class     Attributes

  AssetList        1x10             1192  cell
  R              250x10            20000  double
  X              251x1              2008  double
  Y              251x10            20080  double

which confirms the correctness of the dimensions, as expected.

At this stage, the aforementioned function can be used in a two-fold manner. First, we are interested in the portfolio optimisation and we look back over the last $n$ calendar days from the most current one (today). The second usage is handy too: we consider the stocks on our shopping list and fetch their last, say, $n=7\times365$ days of data. If all stocks were traded over the past 7 years, we should be able to collect a rich data set. If not, the function will adjust the beginning and end date to meet the initial time constraints as required for the construction of the $R$ matrix. In the latter case, we can use the 7-year data sample for direct backtesting of algo models utilizing Portfolio Optimization.

Stay tuned as we will rock this land in the next post!

Any Questions?

Share them across QuantCove.com – the official Forum of QuantAtRisk.

## Anxiety Detection Model for Stock Traders based on Principal Component Analysis

So, is there a way to disentangle the emotional part involved in trading from all other factors (e.g. the application of technical analysis, bad news consequences, IPOs, etc.) which are somehow easier to deduce? In this post I will try to make a quantitative attempt towards solving this problem. Although the solution will not have the final and closed form, my goal is to deliver an inspiration for quants and traders interested in the subject by putting a simple idea into practice: the application of Principal Component Analysis.

1. Principal Component Analysis (PCA)

Called by many one of the most valuable results of applied linear algebra, Principal Component Analysis delivers a simple, non-parametric method of extracting relevant information from often confusing data sets. Real-world data usually hold some relationships among their variables and, as a good approximation, in the first instance we may suspect them to be of a linear (or close to linear) form. Linearity is one of the stringent but powerful assumptions standing behind PCA.

Imagine we observe the daily change of prices of $m$ stocks (being a part of your portfolio or a specific market index) over the last $n$ days. We collect the data in $\boldsymbol{X}$, a matrix of size $m\times n$. Each of its $n$ column vectors (one per day) lies in an $m$-dimensional vector space spanned by an orthonormal basis, therefore each is a linear combination of this set of unit-length basis vectors: $\boldsymbol{BX} = \boldsymbol{X}$ where the basis $\boldsymbol{B}$ is the identity matrix $\boldsymbol{I}$. Within the PCA approach we ask a simple question: is there another basis, a linear combination of the original basis, that better represents our data set? In other words, we look for a transformation matrix $\boldsymbol{P}$ acting on $\boldsymbol{X}$ in order to deliver its re-representation:
$$\boldsymbol{PX} = \boldsymbol{Y} \ .$$ The rows of $\boldsymbol{P}$ become a set of new basis vectors for expressing the columns of $\boldsymbol{X}$. This change of basis makes the row vectors of $\boldsymbol{P}$ in this transformation the principal components of $\boldsymbol{X}$. But how to find a good $\boldsymbol{P}$?

Consider for a moment what we can do with a set of $m$ observables spanned over $n$ days? It is not a mystery that many stocks over different periods of time co-vary, i.e. their price movements are closely correlated and follow the same direction. The statistical method to measure the mutual relationship among $m$ vectors (correlation) is achieved by the calculation of a covariance matrix. For our data set of $\boldsymbol{X}$:
$$\boldsymbol{X}_{m\times n} = \left[ \begin{array}{cccc} \boldsymbol{x_1} \\ \boldsymbol{x_2} \\ … \\ \boldsymbol{x_m} \end{array} \right] = \left[ \begin{array}{cccc} x_{1,1} & x_{1,2} & … & x_{1,n} \\ x_{2,1} & x_{2,2} & … & x_{2,n} \\ … & … & … & … \\ x_{m,1} & x_{m,2} & … & x_{m,n} \end{array} \right]$$
the covariance matrix takes the following form:
$$cov(\boldsymbol{X}) \equiv \frac{1}{n-1} \boldsymbol{X}\boldsymbol{X}^{T}$$ where we multiply $\boldsymbol{X}$ by its transpose (assuming each row of $\boldsymbol{X}$ has had its mean subtracted) and the factor $(n-1)^{-1}$ keeps the variance estimate unbiased. The diagonal elements of $cov(\boldsymbol{X})$ are the variances corresponding to each row of $\boldsymbol{X}$, whereas the off-diagonal terms of $cov(\boldsymbol{X})$ represent the covariances between different rows (prices of the stocks). Please note that the above multiplication assures us that $cov(\boldsymbol{X})$ is a square symmetric matrix of size $m\times m$.
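A quick numerical check of this definition (an addition to the text; the toy data are random) against NumPy's covariance routine:

import numpy as np

np.random.seed(42)
X = np.random.randn(4, 100)           # 4 'stocks' observed over 100 'days'
X -= X.mean(axis=1, keepdims=True)    # subtract the mean of each row
C1 = np.dot(X, X.T)/(X.shape[1] - 1)  # the definition above
C2 = np.cov(X)                        # NumPy uses the same (n-1) normalisation
print(np.allclose(C1, C2))            # True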

All right, but what does it have in common with our PCA method? PCA looks for a way to optimise the matrix of $cov(\boldsymbol{X})$ by a reduction of redundancy. Sounds a bit enigmatic? I bet! Well, all we need to understand is that PCA wants to ‘force’ all off-diagonal elements of the covariance matrix to be zero (in the best possible way). The guys in the Department of Statistics will tell you the same as: removing redundancy diagonalises $cov(\boldsymbol{X})$. But how, how?!

Let’s come back to our previous notation of $\boldsymbol{PX}=\boldsymbol{Y}$. $\boldsymbol{P}$ transforms $\boldsymbol{X}$ into $\boldsymbol{Y}$. We also marked that:
$$\boldsymbol{P} = [\boldsymbol{p_1},\boldsymbol{p_2},…,\boldsymbol{p_m}]$$ was a new basis we were looking for. PCA assumes that all basis vectors $\boldsymbol{p_k}$ are orthonormal, i.e. $\boldsymbol{p_i}\boldsymbol{p_j}=\delta_{ij}$, and that the directions with the largest variances are the most principal. So, PCA first selects a normalised direction in $m$-dimensional space along which the variance in $\boldsymbol{X}$ is maximised. That is first principal component $\boldsymbol{p_1}$. In the next step, PCA looks for another direction along which the variance is maximised. However, because of orthonormality condition, it looks only in all directions perpendicular to all previously found directions. In consequence, we obtain an orthonormal matrix of $\boldsymbol{P}$. Good stuff, but still sounds complicated?

The goal of PCA is to find such a $\boldsymbol{P}$, where $\boldsymbol{Y}=\boldsymbol{PX}$, that $cov(\boldsymbol{Y})=(n-1)^{-1}\boldsymbol{YY}^T$ is diagonalised.

We can evolve the notation of the covariance matrix as follows:
$$(n-1)cov(\boldsymbol{Y}) = \boldsymbol{YY}^T = \boldsymbol{(PX)(PX)}^T = \boldsymbol{PXX}^T\boldsymbol{P}^T = \boldsymbol{P}(\boldsymbol{XX}^T)\boldsymbol{P}^T = \boldsymbol{PAP}^T$$ where we made a quick substitution of $\boldsymbol{A}=\boldsymbol{XX}^T$. It is easy to prove that $\boldsymbol{A}$ is symmetric. It takes a longer while to find a proof for the following two theorems: (1) a matrix is symmetric if and only if it is orthogonally diagonalisable; (2) a symmetric matrix is diagonalised by a matrix of its orthonormal eigenvectors. Just check your favourite algebra textbook. The second theorem provides us with a right to denote:
$$\boldsymbol{A} = \boldsymbol{EDE}^T$$ where $\boldsymbol{D}$ is a diagonal matrix and $\boldsymbol{E}$ is a matrix of eigenvectors of $\boldsymbol{A}$. That brings us to the end of the rainbow.

We select the matrix $\boldsymbol{P}$ to be such that each row $\boldsymbol{p_i}$ is an eigenvector of $\boldsymbol{XX}^T$, therefore
$$\boldsymbol{P} = \boldsymbol{E}^T .$$

Given that, we see that $\boldsymbol{E}=\boldsymbol{P}^T$, thus we find $\boldsymbol{A}=\boldsymbol{EDE}^T = \boldsymbol{P}^T\boldsymbol{DP}$, which leads us to a magnificent relationship between $\boldsymbol{P}$ and the covariance matrix:
$$(n-1)cov(\boldsymbol{Y}) = \boldsymbol{PAP}^T = \boldsymbol{P}(\boldsymbol{P}^T\boldsymbol{DP})\boldsymbol{P}^T = (\boldsymbol{PP}^T)\boldsymbol{D}(\boldsymbol{PP}^T) = (\boldsymbol{PP}^{-1})\boldsymbol{D}(\boldsymbol{PP}^{-1})$$ or
$$cov(\boldsymbol{Y}) = \frac{1}{n-1}\boldsymbol{D},$$ i.e. the choice of $\boldsymbol{P}$ diagonalises $cov(\boldsymbol{Y})$, where we have also silently used the matrix algebra theorem saying that the inverse of an orthogonal matrix is its transpose ($\boldsymbol{P^{-1}}=\boldsymbol{P}^T$). Fascinating, right?! Let’s see now how one can use all that complicated machinery in the quest of looking for human emotions among the endless rivers of market numbers bombarding our sensors every day.
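Before we do, here is a small numerical illustration (in Python, as an addition to the text) of the recipe derived above: a $\boldsymbol{P}$ built from the eigenvectors of $\boldsymbol{XX}^T$ indeed diagonalises the covariance of $\boldsymbol{Y}=\boldsymbol{PX}$.

import numpy as np

np.random.seed(0)
m, n = 5, 250                         # 5 'stocks', 250 'days' of toy data
X = np.random.randn(m, n)
X -= X.mean(axis=1, keepdims=True)    # mean-subtracted rows

A = np.dot(X, X.T)                    # m x m, symmetric
evals, E = np.linalg.eigh(A)          # columns of E: orthonormal eigenvectors
P = E.T                               # rows of P: the principal components
Y = np.dot(P, X)

covY = np.dot(Y, Y.T)/(n - 1)
print(np.allclose(covY, np.diag(np.diag(covY))))   # True: covY is diagonal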

2. Covariances of NASDAQ, Eigenvalues of Anxiety

We will try to build a simple quantitative model for detection of the nervousness in the trading markets using PCA.

By simplicity I mean the following model assumption: no matter what the data conceal, the 1st Principal Component (1-PC) of the PCA solution links the complicated relationships among a subset of stocks triggered by a latent factor attributed by us to a common behaviour of traders (human and pre-programmed algos). It is a pretty reasonable assumption, much stronger than, for instance, the influence of Saturn’s gravity on the annual silver price fluctuations. Since PCA does not tell us what its 1-PC means in reality, it is our job to seek meaningful explanations. Therefore, a human factor fits the frame as a trial value very well.

Let’s consider the NASDAQ-100 index. It is composed of 100 of the largest non-financial (largely technology) companies listed on NASDAQ. The most current list you can find here: nasdaq100.lst, downloadable as a text file. As usual, we will perform all calculations in the Matlab environment. Let’s start with data collection and pre-processing:

% Anxiety Detection Model for Stock Traders
% making use of the Principal Component Analysis (PCA)
% and utilising publicly available Yahoo! stock data
%
% (c) 2013 QuantAtRisk.com, by Pawel Lachowicz

clear all; close all; clc;

% Reading a list of NASDAQ-100 components
nasdaq100=(dataread('file',['nasdaq100.lst'], '%s', 'delimiter', '\n'))';

% Time period we are interested in
d1=datenum('Jan 2 1998');
d2=datenum('Oct 11 2013');

% Check and download the stock data for a requested time period
stocks={};
for i=1:length(nasdaq100)
    try
        % Fetch the Yahoo! adjusted daily close prices between selected
        % days [d1;d2]
        tmp = fetch(yahoo,nasdaq100{i},'Adj Close',d1,d2,'d');
        stocks{i}=tmp;
        disp(i);
    catch err
        % no full history available for requested time period
    end
end

where, first, we try to check whether, for a given list of NASDAQ-100 components, the full data history (adjusted close prices) is available via the Yahoo! server (please refer to my previous post on Yahoo! Stock Data in Matlab and a Model for Dividend Backtesting for more information on the connectivity).

The cell array stocks becomes populated with two-dimensional matrices: the time-series corresponding to stock prices (time, price). Since the Yahoo! database does not contain a full history for all stocks of our interest, we may expect them to have different time spans. For the purpose of demonstrating the PCA method, we apply additional screening of the downloaded data, i.e. we require the data to span the full period defined by the $d1$ and $d2$ variables and, additionally, to have the same (maximal available) number of data points (observations, trials). We achieve that by:

% Additional screening
d=[]; j=1; data={};
for i=1:length(nasdaq100)
    d=[d; i min(stocks{i}(:,1)) max(stocks{i}(:,1)) size(stocks{i},1)];
end
for i=1:length(nasdaq100)
    if(d(i,2)==d1) && (d(i,3)==d2) && (d(i,4)==max(d(:,4)))
        data{j}=sortrows(stocks{i},1);
        fprintf('%3i %1s\n',i,nasdaq100{i})
        j=j+1;
    end
end
m=length(data);

The temporary matrix $d$ holds the index of the stock as read from the nasdaq100.lst file, the first and last day numbers of the available data, and the total number of data points in the time-series, respectively:

>> d
d =
     1   729757   735518   3970
     2   729757   735518   3964
     3   729757   735518   3964
     4   729757   735518   3969
    ..       ..       ..     ..
    99   729757   735518   3970
   100   729757   735518   3970

Our screening method saves the $m=21$ selected stocks into the data cell array, corresponding to the following companies from our list:

  1 AAPL    7 ALTR    9 AMAT   10 AMGN   20 CERN   21 CHKP   25 COST
 26 CSCO   30 DELL   39 FAST   51 INTC   64 MSFT   65 MU     67 MYL
 74 PCAR   82 SIAL   84 SNDK   88 SYMC   96 WFM    99 XRAY  100 YHOO

Okay, some people say that seeing is believing. All right. Let’s see how it works. Recall the fact that we demanded our stock data to span the period between ‘Jan 2 1998’ and ‘Oct 11 2013’. We found 21 stocks meeting those criteria. Now, let’s assume we pick a random date, say, Jul 2 2007, and we extract for all 21 stocks their price history over the last 90 calendar days. We save their prices (skipping the time columns) into the $Z$ matrix as follows:

t=datenum('Jul 2 2007');
Z=[];
for i=1:m
    [r,c,v]=find((data{i}(:,1)<=t) & (data{i}(:,1)>t-90));
    Z=[Z data{i}(r,2)]
end

and we plot them all together:

plot(Z)
xlim([1 length(Z)]);
ylabel('Stock price (US$)');
xlabel('T-90d');

It’s easy to deduce that the top line corresponds to Apple, Inc. (AAPL) adjusted close prices. The data-processing step not mentioned earlier is that we need to transform our time-series into a comparable form. We can do it by subtracting the average value and dividing each of them by their standard deviations. Why? For the simple reason that it puts all series on an equal footing for mutual comparison. We call that step a normalisation or standardisation of the time-series under investigation:

[N,M]=size(Z);
X=(Z-repmat(mean(Z),[N 1]))./repmat(std(Z),[N 1]);

This represents the matrix $\boldsymbol{X}$ that I discussed in the theoretical part of this post. Note that the dimensions are reversed in Matlab. Therefore, the normalised time-series,

% Display normalised stock prices
plot(X)
xlim([1 length(Z)]);
ylabel('(Stock price-Mean)/StdDev');
xlabel('T-90d');

look like:

For a given matrix $\boldsymbol{X}$, its covariance matrix,

% Calculate the covariance matrix, cov(X)
CovX=cov(X);
imagesc(CovX);

as for data spanning 90 calendar days back from Jul 2 2007, looks like: where the colour coding goes from the maximal values (most reddish) down to the minimal values (most blueish). The diagonal of the covariance matrix simply tells us that for normalised time-series their variances are equal to 1, as expected.

Going one step forward, based on the given covariance matrix, we look for the matrix $\boldsymbol{P}$ whose columns are the corresponding eigenvectors:

% Find P
[P,~]=eigs(CovX,5);
imagesc(P);
set(gca,'xticklabel',{1,2,3,4,5},'xtick',[1 2 3 4 5]);
xlabel('Principal Component')
ylabel('Stock');
set(gca,'yticklabel',{'AAPL', 'ALTR', 'AMAT', 'AMGN', 'CERN', ...
'CHKP', 'COST', 'CSCO', 'DELL', 'FAST', 'INTC', 'MSFT', 'MU', ...
'MYL', 'PCAR', 'SIAL', 'SNDK', 'SYMC', 'WFM', 'XRAY', 'YHOO'}, ...
'ytick',[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21]);

which results in $\boldsymbol{P}$ displayed as: where we computed PCA for five principal components in order to illustrate the process. Since the colour coding is the same as in the previous figure, a visual inspection of the 1-PC indicates negative values for at least 16 out of 21 eigenvalues. That simply means that over the last 90 days the global dynamics for those stocks was directed south, in favour of traders holding short positions in those stocks.

It is important to note at this very moment that 1-PC does not represent the ‘price momentum’ itself. That would be too easy. It represents the latent variable responsible for a common behaviour in the stock dynamics, whatever it is. Based on our model assumption (see above) we suspect it may indicate a human factor latent in the trading.

3. Game of Nerves

The last figure communicates an additional message. There is a remarkable coherence of the eigenvalues for 1-PC and pretty random patterns for the remaining four principal components. One may check that, in the case of our data sample, this feature is maintained over many years. That allows us to limit our interest to 1-PC only. It’s getting exciting, isn’t it? Let’s come back to our main code.
Having now a pretty good grasp of the algebra of PCA at work, we may limit our investigation of 1-PC to any time period of our interest, below spanned between the $t1$ and $t2$ variables:

% Select time period of your interest
t1=datenum('July 1 2006');
t2=datenum('July 1 2010');
results=[];
for t=t1:t2
    tmp=[];
    A=[]; V=[];
    for i=1:m
        [r,c,v]=find((data{i}(:,1)<=t) & (data{i}(:,1)>t-60));
        A=[A data{i}(r,2)];
    end
    [N,M]=size(A);
    X=(A-repmat(mean(A),[N 1]))./repmat(std(A),[N 1]);
    CovX=cov(X);
    [V,D]=eigs(CovX,1);
    % Find all negative eigenvalues of the 1st Principal Component
    [r,c,v]=find(V(:,1)<0);
    % Extract them into a new vector
    neg1PC=V(r,1);
    % Calculate a percentage of negative eigenvalues relative
    % to all values available
    ratio=length(neg1PC)/m;
    % Build a new time-series of 'ratio' change over the required
    % time period (spanned between t1 and t2)
    results=[results; t ratio];
end

We build our anxiety detection model based on the change of the number of negative eigenvalues of the 1st Principal Component (relative to their total number; here equal to 21). As a result, we generate a new time-series tracing this variable over the $[t1;t2]$ time period. We plot the results, contrasted with the NASDAQ-100 Index, in the following way:

% Fetch NASDAQ-100 Index from Yahoo! data-server
nasdaq = fetch(yahoo,'^ndx','Adj Close',t1,t2,'d');
% Plot it
subplot(2,1,1)
plot(nasdaq(:,1),nasdaq(:,2),'color',[0.6 0.6 0.6]);
ylabel('NASDAQ-100 Index');
% Add a plot corresponding to a new time-series we've generated
subplot(2,1,2)
plot(results(:,1),results(:,2),'color',[0.6 0.6 0.6])
% add overplot 30d moving average based on the same data
hold on; plot(results(:,1),moving(results(:,2),30),'b')

leading us to:

I use a 30-day moving average (a solid blue line) in order to smooth the results (moving.m). Please note that in the data extraction step above I also replaced the earlier window of 90 days with 60 days. Somehow, it is more reasonable to examine the market dynamics with PCA over the past two months than over longer periods (but it’s a matter of taste and needs).

Eventually, we construct the core element of the model, namely, we detect nervousness among traders when the percentage of negative eigenvalues of the 1st Principal Component increases over (at least) five consecutive days:

% Model Core
x1=results(:,1);
y1=moving(results(:,2),30);
tmp=[];
% Find moments of time where the percentage of negative 1-PC
% eigenvalues increases over time (minimal requirement of
% five consecutive days)
for i=6:length(x1)   % start at 6 so that y1(i-5) exists
    if(y1(i)>y1(i-1))&&(y1(i-1)>y1(i-2))&&(y1(i-2)>y1(i-3))&& ...
      (y1(i-3)>y1(i-4))&&(y1(i-4)>y1(i-5))
        tmp=[tmp; x1(i)];
    end
end
% When found
z=[];
for i=1:length(tmp)
    for j=1:length(nasdaq)
        if(tmp(i)==nasdaq(j,1))
            z=[z; nasdaq(j,1) nasdaq(j,2)];
        end
    end
end
subplot(2,1,1); hold on; plot(z(:,1),z(:,2),'r.','markersize',7);

The results of the model we over-plot with red markers on top of the NASDAQ-100 Index:

Our simple model takes us into a completely new territory of the unexplored space of latent variables. Firstly, it does not predict the future. That still (unfortunately) remains unknown. However, what it delivers is a fresh look at the past dynamics in the market. Secondly, it is easy to read out from the plot that the results cluster into three subgroups.
The first subgroup corresponds to actions in stock trading having further negative consequences (see the events of 2007-2009 and the avalanche of prices): here the negative dynamics over the preceding 60 calendar days continued. The second subgroup covers those periods of time when anxiety led to negative dynamics among stock traders but, due to other factors (e.g. financial, global, political, etc.), the stocks surged, dragging the Index up. The third subgroup (less frequent) corresponds to instances of relatively flat changes of the Index, revealing a typical pattern of psychological hesitation about the trading direction.

No matter how we might interpret the results, the human factor in trading is evident. Hopefully, the PCA approach captures it. If not, all we are left with is our best friend: a trader’s intuition.

Acknowledgments

This article is dedicated to Dr. Dariusz Grech of the Physics and Astronomy Department of the University of Wroclaw, Poland, for his superbly important and mind-blowing lectures on linear algebra in the 1998/99 academic year.

## Yahoo! Stock Data in Matlab and a Model for Dividend Backtesting

Within the evolution of Mathworks’ MATLAB programming environment, finally, in the most recent version labelled 2013a, we received a long-awaited command-line facility for pulling stock data directly from the Yahoo! servers. What does that mean for quants and algo traders? Honestly, a lot. Now, by simply writing a few commands, we can have nearly everything we want. However, please keep in mind that Yahoo! data are free, therefore their precision does not always remain at the same level of quality as, e.g., data downloaded from Bloomberg resources. Anyway, just for pure backtesting of your models, this step introduces a big leap in dealing with daily stock data. As usual, we have the possibility of getting open, high, low, close, and adjusted close prices of stocks, supplemented with traded volume, the dates, and the values of dividends.

In this post I present a short example of how one can retrieve the data of SPY (tracking the performance of the S&P 500 index) using Yahoo! data in the new Matlab 2013a, and I show a simple code for testing the time period of buying-holding-and-selling SPY (or any other stock paying dividends) to make a profit every time. The beauty of Yahoo!’s new feature in Matlab 2013a has been fully described in the official article Request data from Yahoo! data servers, where you can find all the details required to build the code into your Matlab programs.

Model for Dividends

It is a well known opinion (based on many years of market observations) that one may expect a drop in the stock price within a short timeframe (e.g. a few days) after the day when the stock’s dividend has been announced. And probably every quant, sooner or later, is tempted to verify that hypothesis. That’s your homework. However, today, let’s look at a slightly differently defined problem based on the omni-working reversed rule: what goes down, must go up. Let’s consider an exchange traded fund, the SPDR S&P 500 ETF Trust, labelled at NYSE as SPY.

First, let’s pull out the Yahoo! data of adjusted close prices of SPY from Jan 1, 2009 up to Aug 27, 2013:

% Yahoo! Stock Data in Matlab and a Model for Dividend Backtesting
% (c) 2013 QuantAtRisk.com, by Pawel Lachowicz

close all; clear all; clc;

date_from=datenum('Jan 1 2009');
date_to=datenum('Aug 27 2013');

stock='SPY';

adjClose = fetch(yahoo,stock,'adj close',date_from,date_to);
div = fetch(yahoo,stock,date_from,date_to,'v')
returns=(adjClose(2:end,2)./adjClose(1:end-1,2)-1);

% plot adjusted close price of SPY and mark days when dividends
% have been announced
plot(adjClose(:,1),adjClose(:,2),'color',[0.6 0.6 0.6])
hold on;
plot(div(:,1),min(adjClose(:,2))+10,'ob');
ylabel('SPY (US$)');
xlabel('Jan 1 2009 to Aug 27 2013');

and visualize them:

Having the data ready for backtesting, let’s look for the most profitable period of time of buying-holding-and-selling SPY assuming that we buy SPY one day after the dividends have been announced (at the market price), and we hold for $dt$ days (here, tested to be between 1 and 40 trading days).

% find the most profitable period of holding SPY (long position)
neg=[];
for dt=1:40

    buy=[]; sell=[];
    for i=1:size(div,1)
        % find the dates when the dividends have been announced
        [r,c,v]=find(adjClose(:,1)==div(i,1));
        % mark the corresponding SPY price with a blue circle marker
        hold on; plot(adjClose(r,1),adjClose(r,2),'ob');
        % assume you buy long SPY the next day at the market price (close
        % price); with the fetched series ordered most-recent-first,
        % the next trading day sits in row r-1
        buy=[buy; adjClose(r-1,1) adjClose(r-1,2)];
        % assume you sell SPY 'dt' days after you bought it, at the market
        % price (close price)
        sell=[sell; adjClose(r-1-dt,1) adjClose(r-1-dt,2)];
    end

    % calculate profit-and-loss of each trade (excluding transaction costs)
    PnL=sell(:,2)./buy(:,2)-1;
    % summarize the results
    neg=[neg; dt sum(PnL<0) sum(PnL<0)/length(PnL)];

end

If we now sort the results according to the percentage of negative returns (column 3 of neg matrix), we will be able to get:

>> sortrows(neg,3)

ans =
   18.0000    2.0000    0.1111
   17.0000    3.0000    0.1667
   19.0000    3.0000    0.1667
   24.0000    3.0000    0.1667
    9.0000    4.0000    0.2222
   14.0000    4.0000    0.2222
   20.0000    4.0000    0.2222
   21.0000    4.0000    0.2222
   23.0000    4.0000    0.2222
   25.0000    4.0000    0.2222
   28.0000    4.0000    0.2222
   29.0000    4.0000    0.2222
   13.0000    5.0000    0.2778
   15.0000    5.0000    0.2778
   16.0000    5.0000    0.2778
   22.0000    5.0000    0.2778
   27.0000    5.0000    0.2778
   30.0000    5.0000    0.2778
   31.0000    5.0000    0.2778
   33.0000    5.0000    0.2778
   34.0000    5.0000    0.2778
   35.0000    5.0000    0.2778
   36.0000    5.0000    0.2778
    6.0000    6.0000    0.3333
    8.0000    6.0000    0.3333
   10.0000    6.0000    0.3333
   11.0000    6.0000    0.3333
   12.0000    6.0000    0.3333
   26.0000    6.0000    0.3333
   32.0000    6.0000    0.3333
   37.0000    6.0000    0.3333
   38.0000    6.0000    0.3333
   39.0000    6.0000    0.3333
   40.0000    6.0000    0.3333
    5.0000    7.0000    0.3889
    7.0000    7.0000    0.3889
    1.0000    9.0000    0.5000
    2.0000    9.0000    0.5000
    3.0000    9.0000    0.5000
    4.0000    9.0000    0.5000

which simply indicates that the optimal period of holding the long position in SPY equals 18 days. We can mark all trades (18-day holding period) in the chart:

where the trade open and close prices (according to our model described above) have been marked in the plot by black and red circle markers, respectively. Only 2 out of 18 trades (the PnL matrix) turned out to be negative, with losses of 2.63% and 4.26%. The complete distribution of profits and losses from all trades can be obtained in the following way:

figure(2);
hist(PnL*100,length(PnL))
ylabel('Number of trades')
xlabel('Return (%)')

returning

Let’s make some money!

The above Matlab code delivers a simple application of the newest built-in connectivity with the Yahoo! server and the ability to download the stock data of our interest. We have tested the optimal holding period for SPY since the beginning of 2009 till now (a global uptrend). The same code can easily be used and/or modified to verify any period and any stock for which dividends have been released in the past. This fairly simple approach, though not too frequent in trading, provides us with some extra idea of how we can beat the market, assuming that the future is going to be/remain more or less the same as the past. So, let’s make some money!

## Modern Time Analysis of Black Swans

I decided to take the data analysis on Black Swan and Extreme Loss Modeling to the next level and examine the time distribution of extreme losses across the entire S&P 500 universe of traded stocks. Previously, we were interested, first, in finding the maximum loss among all trading days for a given stock in a specified time interval (stock life-time on the trading floor), secondly, in plotting the distribution of those extreme losses (found to be well fitted with the Gumbel distribution). The investigation when all those losses occurred in time can provide us with an extra vantage point, and if we are lucky researchers we can discover some new information on the population of Black Swans.

An excellent approach to data analysis is through time analysis. It is one of many, many data analytic techniques available, but it is the closest to my heart and I would like to provide you with a sophisticated taste of easy-to-digest mathematics applied to financial time-series analysis. So what is data analysis about in our case? Can we code it, or be tempted by ready-to-use dedicated data analytic software? Both ways are accessible, but let’s do it in a more educational way, taking a computer programming class in Matlab.

By time analysis we understand the behaviour of a given quantity in time, e.g. the price of an asset traded in the exchange market. It is pretty straightforward that our eye looks for patterns and tries to classify them somehow. If we observe a repeating structure, sequence, or shape, that may help us to code its recognition within our trading platforms. The quest for hunting the most fundamental pattern of hidden volatility in the data has an obvious candidate: a sine wave. Fairly painless to understand, covered by maths teachers in high schools, the ‘up and down’ approach, good moods versus bad moods, bear markets followed by bull markets. A sinusoidal behaviour is everywhere around us. Therefore it is so important to know how to find it!

In quantitative finance and risk modeling, a periodicity or cyclicality constitutes an attractive and simplest model of volatility. To be able to figure out what spectrum of characteristic frequencies our given data set conceals we may need to use a proper tool. In time-series analysis this tool is known as a periodogram. However, before starting using it properly, it is essential to understand the theoretical background.

1. The Method of Period Detection

1.1. General Formulations and Model Orthogonality

In general, we learn from experiments by fitting data, $x$, with a model, $x_{\|}$. The data contain $n$ measurements, and the model, $n_{\|}$ free parameters. The consistency of the data with the model is measured by a function, $\Theta$, called a statistic. A given model, $x_{\|}$, using a given statistic (e.g., $\chi^2$), yields its particular value, $\Theta_1$. Various methods used in the analysis of time series differ both in their choice of the model and the statistic; hence are difficult to compare directly. To enable such a comparison and for determining the significance of results, $\Theta$ is converted into the false alarm probability,$P_1$. This is done considering a hypothetic situation, $H_1$, in which $x$ is pure white noise. Then each pair $(x_{\|},\Theta)$ corresponds to certain cumulative probability distribution of $\Theta$, namely $P(n_{\|},n;\Theta)$, with $P_1$ being the tail probability that under the hypothesis $H_1$ the experiment yields $\Theta>\Theta_1$, i.e., $P_1(\Theta>\Theta_1) = 1-P(n_{\|},n;\Theta_1)$.

Up to here, we have just outlined the classical Neyman-Pearson procedure of statistics. The specific method for analysis of time series used here differs from those commonly encountered in astronomy only in the choices of $x_{\|}$ and $\Theta$. Then, our accounting for variance, correlation and multiple frequencies in calculating $P$ is dictated by the laws of statistics. The probabilities derived by us from the data are the false alarm probabilities. However, we also call them below just probabilities or significance levels.

We note then that Fourier harmonics are not orthogonal in terms of the scalar product with weights at unevenly distributed observations. Certain statistical procedures employing classical probability distributions hold for orthogonal models only and fail in other cases. To avoid that, a popular variant of the power spectrum, the Lomb-Scargle periodogram (Lomb 1976; Scargle 1982; hereafter LS), $P_{\rm LS}(\nu)$, relies on a special choice of phase such that the sine and cosine functions become orthogonal:
$$P_{\rm LS}(\nu) = A_{\rm LS} \left|\hat{x}_{\rm LS}(\nu) \right|^2 .$$ Square of the Fourier amplitude, $\hat{x}(\nu)$, takes form:
$$\begin{eqnarray} \left|\hat{x}_{\rm LS}(\nu)\right|^2 & = \left[ \sum_{k=1}^{n} (x_k-\bar{x}) \cos (2\pi\nu (t_k-\tau)) \right]^2 + \nonumber \\ & \left[ \sum_{k=1}^{n} (x_k-\bar{x}) \sin (2\pi\nu (t_k-\tau)) \right]^2 . \end{eqnarray}$$ The phase $\tau$ is defined as:
$$\tan(4\pi\nu\tau) = \frac{\sum_{k=1}^{n} \sin(4\pi\nu t_k)} {\sum_{k=1}^{n} \cos(4\pi\nu t_k)}.$$ where, as usual, we consider a time-series $\{x_i\}\ (i=1,…,n)$; $\bar{x}$ denotes the subtracted mean value, and the discrete Fourier transform takes our signal from the time to the frequency domain. Originally, the normalization of the LS periodogram was proposed as $A_{\rm LS}=1/(2\sigma^2)$ in order to normalize to the level of white noise, but different variants are available as well.

In practice, some simplifications to the full version of the LS periodogram are applied when we are interested in understanding the power density spectrum given by the $|\hat{x}(\nu)|^2$ distribution. The following Matlab function allows us to obtain a modified LS solution for a time-series $[t,y]$:

% Function calculates the power density spectrum (periodogram) for
% a time-series [t,y] based on the discrete Fourier transform. The
% series is expected to be evenly distributed, but gaps are allowed.
%
% (c) 2013 QuantAtRisk.com, by Pawel Lachowicz
%
% Input:
%   t    : time vector [1xn]
%   y    : time-series [1xn]
%   dt   : average time-series sampling time
% Output:
%   freq : frequency vector [1xm]
%   pds  : power spectral density [1xm]

function [freq,pds]=qarPDS(t,y,dt);
    n=length(t);                    % number of data points
    X=fft(y,n);                     % discrete Fourier transform
    Pyy=X.*conj(X);                 % raw power
    freq=[1:n/2]*(1/n/dt);          % frequency grid in physical units
    pds=(Pyy(2:(floor(n/2)+1)));    % skip the zero-frequency term
    pds=pds';                       % pds
end

We will use it later on in the time analysis of our S&P 500 data of extreme losses across the whole universe of traded stocks. The function computes the periodogram for any time-series, so don’t be worried too much about the initial data units.

Now, back to the first base. We may extend the approach used within LS method by employing Szego orthogonal trigonometric polynomials as model functions. A series of $n_{\|}=2N+1$ polynomials corresponds to the orthogonal combinations of the $N$ lowest Fourier harmonics (Schwarzenberg-Czerny 1996). Orthogonal series are optimal from the statistical point of view because, by virtue of the Fisher lemma (Fisz 1963; Schwarzenberg-Czerny 1998), they guarantee the minimum variance of the fit residuals for a given model complexity (given by $n_{\|}$). Szego polynomials are also convenient in computations since the least-square solution may be obtained using recurrence and orthogonal projections, resulting in high computational efficiency, with the number of steps $\propto N$ instead of $N^3$ for $N$ harmonics.

1.2. Variance, the AoV statistics, and model complexity

The LS method employs the sine as a model, and the quadratic norm, $$\Theta_{\chi^2}=\|x-x_{\|}\|^2 ,$$ as the statistic. The corresponding probability distribution is $\chi^2$ with 2 degrees of freedom. Prior to use of the $\chi^2$ distribution, $\Theta_{\chi^2}$ has to be divided by the signal variance, $V$. However, $V$ is usually not known and has to be estimated from the data themselves. Then, neither $\Theta_{\chi^2}$ nor the variance estimate is independent of the other, nor does their ratio follow the $\chi^2$ distribution, an effect which has to be accounted for. A simple way to do it is to apply the Fisher Analysis of Variance (AoV) statistic, $$\Theta\equiv (n-n_{\|}) \|x_{\|}\|^2/ (n_{\|}\|x - x_{\|}\|^2) .$$ Hence we call our method, involving the Szego polynomial model and the AoV statistic, the multi-harmonic analysis of variance or mhAoV periodogram (Schwarzenberg-Czerny 1996). The probability distribution is then the Fisher-Snedecor distribution, $F$, rather than $\chi^2$, and $P_1= 1-F(n_{\|},n_{\perp};\Theta)$ where $n_{\perp}=n-n_{\|}$. For everything else fixed, replacing $\chi^2$ with $F$ for $n=100$ yields an increase of $P_1(\chi^2)=0.001$ to $P_1(F)=0.01$. Thus, accounting for the unknown variance makes the mhAoV detection less significant, but more trustworthy. In this work, $n$ usually is larger, for which $P_1(F)/ P_1(\chi^2)$ reduces to several.

Apart from the choice of the statistic, our method for $N=1$ differs from the LS one in that the average flux is subtracted in the latter (thus yielding $n_\|=2$), whereas a constant term is fitted in the former (which can often be of significant advantage, see Foster 1995). If the periodic modulation in the data differs significantly from a sinusoid (e.g., due to dips, eclipses, etc.), then our $N>1$ models account for that more complex shape and perform considerably better than the LS one.

1.3. Multiple trials

Probability can be assigned to a period found in data according to one of two statistical hypotheses. Namely, (i) one knows in advance the trial frequency, $\nu_0$ (from other data), and would like to check whether it is also present in a given data set or (ii) one searches a whole range, $\Delta\nu$, of $N_{\rm eff}$ frequencies and finds the frequency, $\nu$, corresponding to the most significant modulation. The two cases correspond to the probabilities $P_1$ and $P_{N_{\rm eff}}$ to win in a lottery after 1 and $N_{\rm eff}$ trials, respectively, i.e., they represent the false alarm probabilities in single and multiple experiments, respectively. They are related by
$$P_{N_{\rm eff}}= 1-(1-P_1)^{N_{\rm eff}} .$$ Note that the hypothesis (ii) and the probability $P_{N_{\rm eff}}$ must be always employed in order to claim any new frequency in the object under study. The hypothesis (i) is rarely used. However, since $P_1\lt P_{N_{\rm eff}}$, it is the more sensitive one. For this reason, we advocate its use in the situations where the modulation frequency is already known, and we aim at checking for its manifestation in the same object but in a new band, new data set, etc. We stress that we do not use the hypothesis (i) to claim any new frequency.
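As a quick sanity check of that relation (with numbers chosen purely for illustration), a single-trial false alarm probability of $P_1=0.01$ searched over $N_{\rm eff}=100$ independent frequencies inflates to $P_{100}=1-(1-0.01)^{100}\approx 0.63$, i.e. a peak that looks significant in a single trial becomes perfectly consistent with noise once the number of trials is accounted for.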

An obstacle hampering use of the (ii) hypothesis is that no analytical method is known to calculate $N_{\rm eff}$. The number $N_{\rm eff}$ corresponds to independent trials, whereas values of periodograms at many frequencies are correlated because of the finite width of the peaks, $\delta\nu$, and because of aliasing. As no analytical method is known to determine $N_{\rm eff}$, Monte Carlo simulations have been used (e.g., Paltani 2004). Here, we use a simple conservative estimate, $N_{\rm eff}= \min(\Delta\nu/\delta\nu, N_{\rm calc},n)$, where $N_{\rm calc}$ is the number of the values at which the periodogram is calculated. The estimate is conservative in the sense that it corresponds to the upper limit on $P_{N_{\rm eff}}$, and thus the minimum significance of detection. This effect applies to all methods of period search (Horne & Baliunas 1986). In general, it may reduce significance of a new frequency detection for large $N_{\rm eff}$ as $P_{N_{\rm eff}}\gg P_1$. In practice, it underscores the role of any prior knowledge, in a way similar to the Bayesian statistics: with any prior knowledge of the given frequency we are able to use the hypothesis (i) to claim the detection with large significance (small $P_1$).

1.4. Correlation length

The $P_1$, and other common probability distributions used to set the detection criteria, are derived under the assumption of the noise being statistically
independent. Often this is not the case, as seen, e.g., in light curves of cataclysmic variables (CVs). The correlated noise, often termed red noise, obeys different probability distribution than the standard $P_1$, and hence may have a profound effect. For example, noise with a Gaussian autocorrelation function (ACF) correlated over a time interval, $\delta t$, yields a power spectrum with the Gaussian shape centered at $\nu=0$ and the width $\delta\nu=1/\delta t$. It may be demonstrated that the net effect of the correlation on $P_1$ in analysis of low frequency processes is to decimate the number of independent observations by a factor $n_{\rm corr}$, the average number of observations in the correlation interval $\delta t$ (Schwarzenberg-Czerny 1991). Effectively, one should use $n_{\perp}/n_{\rm corr}$ and $\Theta/n_{\rm corr}$ instead of $n_{\perp}$ and $\Theta$ in calculating $P_1$. This result holds generally, for both least squares and maximum likelihood analyses of time series.

For independent observations, $m=2$ consecutive residuals have the same sign on average (e.g., Fisz 1963). Thus, counting the average length, $m$, of series of residuals of the same sign provides an estimate of the number of consecutive observations being correlated, $n_{\rm corr}$. Note that $m=n/l$ where $l$ is the number of such series (both positive and negative). For correlated observations, the average length of series with the same sign is $m=2n_{\rm corr}$, which allows us to calculate $n_{\rm corr}$.
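A minimal Matlab sketch of that counting procedure (my own illustration, not part of the original method's code; the vector resid of fit residuals is assumed to be given and is replaced here by a placeholder):

% Estimate the correlation length n_corr from the average length of
% runs of residuals having the same sign (illustrative sketch)
resid=randn(100,1);            % placeholder for the fit residuals
s=sign(resid); s(s==0)=1;      % signs of the residuals
l=sum(abs(diff(s))>0)+1;       % number of runs of equal sign
m=length(resid)/l;             % average run length
ncorr=max(m/2,1);              % m = 2*n_corr for correlated noise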

Let $\Theta$ denote the Fisher-Snedecor statistics from the mhAoV periodogram (i.e. from Fourier series fit) computed for $n_{\|}=2N+1$ parameters, $n$ observations and $n_{\perp}=n-n_{\|}$ degrees of freedom. To account for $n_{\rm corr}$, we calculate $P_1$ as follows,

$$P_1 = 1 - F\left(n_{\|},\frac{n_{\perp}}{n_{\rm corr}};\frac{\Theta}{n_{\rm corr}}\right) = I_{z}\left(\frac{n_{\perp}}{2n_{\rm corr}},\frac{n_{\|}}{2}\right) ,$$

where $z= n_{\perp}/(n_{\perp}+n_{\|}\Theta)$ and $I_z(a,b)$ is the incomplete (regularized) beta function (Abramowitz & Stegun 1971), see Schwarzenberg-Czerny 1998 and references therein. In the popular Mathematica (Wolfram 1996) that function is called BetaRegularized. In Matlab, the following function does the calculations for us:

% Function computes the mhAoV periodogram peak significance
% Usage: [P1,Pneff]=pneff(n,nh,ncorr,neff,theta)

function [P1,Pneff]=pneff(n,nh,ncorr,neff,theta);
    npar=2.0*nh+1;
    nper=n-npar;
    z=nper/(nper+npar*theta);
    a=nper/(2.0*ncorr);
    b=npar/2.0;
    P1=betainc(z,a,b);
    Pneff=1.0-(1.0-P1)^neff;
end

In the following section we will apply both approaches, the modified Lomb-Scargle (LS) and the multi-harmonic AoV periodograms, to financial data, and we will discuss the potential consequences coming from the time analysis of the largest daily losses for stocks traded publicly within the S&P 500 universe. So buckle up, we are ready for take-off!

2. The Annual Migration Routes of Black Swans

A theory can be both beautiful and exhausting. So let’s do some work to capture the beauty. Our goal is to re-analyze the data on extreme losses extracted by us previously within the Black Swan and Extreme Loss Modeling article. First, we extract the value of the maximum loss for each stock and store them in the Matlab matrix data as follows:

% Modern Time Analysis of Black Swans among
% Traded Stocks in S&P 500
%
% (c) 2013 QuantAtRisk.com, by Pawel Lachowicz

clear all; close all; clc;
tic;

%% DATA READING AND VISUALIZATION

% read a list of stock names
StockNames=dataread('file',['sp500u.lst'],'%s','delimiter', '\n');
K=length(StockNames); % the number of stocks in the universe
% path to data files
path=['./SP500_1984_2011'];

data=[];
fprintf('data reading and preprocessing..\n');
for si=1:K
    % --stock name
    stock=StockNames{si};
    fprintf('%4.0f %7s\n',si,stock);
    % --load data
    n=[path,'/',stock,'.dat'];
    % check for NULL and change to NaN (using 'sed' command
    % in Unix/Linux/MacOS environment)
    cmd=['sed -i ''s/NULL/NaN/g''',' ',n];
    [status,result]=system(cmd);
    % construct FTS object for daily data
    FTS=ascii2fts(n,1,2);
    % fill any missing values denoted by NaNs
    FTS=fillts(FTS);
    % extract the close price of the stock
    cp=fts2mat(FTS.CLOSE,0);
    dd=fts2mat(FTS.CLOSE,1);
    % extract Matlab matrix containing value of maximum
    % loss per stock and corresponding day
    rtmp=cp(2:end)./cp(1:end-1)-1;    % daily returns
    dtmp=dd(2:end,1);                 % time vector
    tmp{si}=[dtmp rtmp];
    [tmp1,tmp2]=min(tmp{si}(:,2));    % maximum loss
    data=[data; dtmp(tmp2) tmp1];     % [time of maximum loss, loss value]
end

data=sortrows(data,1); % sort data according to time of loss occurrence

where, again, the required data files can be downloaded here as sp500u.zip (23.8 MB) and sp500u.lst, respectively.

The visualization of collected data provides us with a new dimension on time distribution of the maximum losses (Black Swans events) across the S&P 500 universe as traded between 3-Jan-1984 and 8-Mar-2011:

plot(data(:,1),data(:,2)*100,'.-','color',[0.7 0.7 0.7])
xlim([min(data(:,1)) max(data(:,1))]);
xlabel('Time (days)');
ylabel('R_{\rm min} (%)');

All individual stock maximum losses have been depicted with dot markers. As we found within Gumbel distribution analysis, the expected value was -22.6% with a heavy tail extending up to nearly complete losses of $\sim$98%.

Changing the way we look at our data, we allow ourselves to connect the dots and think about the data as a new time-series $x_i\ (i=1,…,n=954)$. From this standpoint we can continue our analysis in various directions. Let’s have a look at one case in particular: annual average maximum losses as a function of time. Why? Such an approach has been suggested as interesting by McNeil, Frey, and Embrechts in their book Quantitative Risk Management, section 7.1.4, making use of the block maxima method in order to find return levels for stress losses. We turn this idea into practice by rebinning our time-series $\{x_i\}$ with a new time step of 252 (trading) days, utilizing the code published within my past post on Rebinning of Financial Time-Series, as follows:

[rx,ry]=rebin(data(:,1),data(:,2),[],252);
hold on; plot(rx,ry*100,'or');

and, allowing ourselves a small data profanity, with a gentle interpolation between the points:

xi=rx(1):0.01:rx(end);
rdatai=interp1(rx,ry,xi,'pchip');
rdatai=[xi; rdatai]';
hold on; plot(rdatai(:,1),rdatai(:,2)*100,'r-');

resulting in:

Next, based on non-interpolated data we compute the Fourier power spectrum (a modified LS periodogram) as follows:

% Periodogram
[freq,pds]=qarPDS(rx,ry,252);

figure(2);
plot(freq,pds);
xlabel('Frequency [1/d]');
ylabel('Power Spectrum');

which returns:

The periodogram is calculated on a fixed frequency grid with a frequency step of $\Delta\nu = 1/T = 0.000104$ [1/d]. The peak of highest power corresponds to the sine modulation detected in the time-series, whose period equals $1/0.001462$, or about 684 days. The maximal allowed frequency is the Nyquist frequency of $1/(2\Delta t)$, or 0.00198 [1/d]. Honestly, the plot is terrible. To improve its quality, it is allowed in spectral analysis of time-series to apply over-sampling in frequency, i.e. to adopt a frequency grid of computations with a step of $\Delta\nu = 1/(kT)$, where $k$ denotes the (integer) over-sampling factor. Why do we need the over-sampling? One of the reasons is: to find the value of the periodicity as accurately as possible.
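If one prefers to stay in Matlab, a similar effect can be approximated by zero-padding the FFT, i.e. evaluating the spectrum on a $k$ times finer frequency grid. The snippet below is only a hypothetical variation on the earlier qarPDS logic (the factor k and the variable names are mine):

% Oversampled periodogram via a zero-padded FFT (illustrative sketch)
k=5;                                  % over-sampling factor
n=length(ry);
nfft=k*n;                             % zero-padded transform length
X=fft(ry-mean(ry),nfft);              % subtract the mean, pad with zeros
Pyy=X.*conj(X);
freq=(1:floor(nfft/2))/(nfft*252);    % finer frequency grid [1/d], 252 d sampling
pds=Pyy(2:floor(nfft/2)+1);
plot(freq,pds); xlabel('Frequency [1/d]'); ylabel('Power Spectrum');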

Let’s see how mhAoV periodogram copes with this task in practice. The source codes for mhAoV can be downloaded directly from Alex’s webpage (available in Fortran 95 and Python), though I still make use of our old version executable directly in Unix/Linux environment: aov. Let’s first store rebinned data (with 252 d step) in an external file of rdata.dat:

rdata=[rx ry];
fn=['./rdata.dat'];
fid=fopen(fn,'wt');
fprintf(fid,'%f %f\n',rdata');
fclose(fid);

and, next, let’s compute aov periodogram:

./aov -f=0.000104,0.002 -nh=1 -fos=5 rdata.dat

mhAov periodogram, by Alex Schwarzenberg-Czerny
ver. 27.9.2006 updated by Pawel Lachowicz

datname=rdata.dat trfname=rdata.trf maxname=rdata.max
nobs=38 method=ORTAOV nh=1 nbf=20 fos=5.00
frstart=0.0001040 frend=0.0019809 frstep=0.0000107 nfr=176

  frequency       period        theta    quality
  0.0014747     678.1132747    5.53729    0.743
  0.0013152     760.3542866    1.39906    0.146
  0.0016301     613.4435376    1.37416    0.138
  0.0003351    2984.5922602    1.30742    0.116
  0.0001733    5771.4262041    1.22450    0.088
  0.0011538     866.7094426    1.12090    0.050

i.e. employing the model which contains a single sinusoid (nh=1) and adopting over-sampling in frequency with $k=5$ (fos). It turns out that the highest value of the mhAoV statistic, $\Theta=5.537$, corresponds to a period of 678 days.

Fitting the annual data with the model defined as:
$$f(t) = c + A\sin(2\pi t/P - g\pi)$$ we find, for the period $P_1=678$ d, the estimates of the amplitude $A_1=0.12$ and the phase $g_1=0.79$, and we over-plot the best fit on the rebinned data.
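One possible way to obtain such a fit in Matlab, for a fixed trial period, is ordinary least squares on the equivalent linear form $c + a\sin(2\pi t/P) + b\cos(2\pi t/P)$; the sketch below is mine (it assumes the rebinned series rx, ry from above):

% Least-squares fit of f(t)=c+A*sin(2*pi*t/P-g*pi) for a fixed period P
P=678;                                     % trial period [days]
t=rx(:); y=ry(:);                          % force column vectors
w=2*pi/P;
M=[ones(size(t)) sin(w*t) cos(w*t)];       % design matrix (linear in c, a, b)
p=M\y;                                     % least-squares solution p=[c; a; b]
A=sqrt(p(2)^2+p(3)^2)                      % amplitude
g=atan2(-p(3),p(2))/pi                     % phase in the A*sin(2*pi*t/P-g*pi) convention
hold on; plot(t,M*p*100,'b-');             % over-plot the fitted model (in percent)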

This model is not perfect but gives us a good taste of the concealed periodic pattern, following a sinusoidal model for about 60% of the time between 1984 and 2011. This is an interesting result, though the computation of:

>> [P1,Pneff]=pneff(38,1,1.3,7.1,5.53729)

P1 =
    0.0138

Pneff =
    0.0941

indicates a false alarm probability of about 9% for this periodicity, i.e. the detection is of only marginal significance. This can be understood as a poor fit of the model to the complicated and variable shape of the annual changes in maximal losses for different traded stocks.

We could achieve some improvement of the model by including a variation of the amplitude as a function of time, i.e. $A_1(t)$. In practice this can be extracted from wavelet analysis via computation of the continuous wavelet transform. If this is a subject of your interest, check a draft of this approach in the paper of Czerny et alii (2010) that I co-authored.

3. Conclusions

Was Joseph Fourier a birdwatcher? We don’t know. But his approach to time-series analysis allowed us to check whether any periodic patterns in the annual migration (occurrences) of Black Swan events exist. With a rather low significance we found a cyclical trend repeating every 678 days. Will that allow us to forecast the future and the next density of massive losses incurred by individual stocks? Well, now, equipped with the power tools of the modern approach to time analysis, we can always come back and verify our hypotheses.

References

The theory of the methods of period detection is based on the publication of Lachowicz et alii (2006). For deeper references to the source readings mentioned in the text, check the reference section of the aforementioned publication.

## Forecasting Volatility, Returns and VaR for Intraday and High-Frequency Traders

Forecasting the future has always been part of humanity's untamed desire to possess it. In the enjoyable Hollywood production Next, Nicolas Cage, playing the character of Frank Cadillac, has the ability to see the future, but only up to a few minutes ahead. This allows him to take almost immediate actions to avoid the risks. Now, just imagine for a moment that you are an (algorithmic) intraday trader. What would you offer for a glimpse of knowing what is going to happen within the following couple of minutes? What sort of risk would you undertake? Is it really possible to deduce the next move on your trading chessboard? Most probably the best answer to this question is: it is partially possible. Why then partially? Well, even with the help of mathematics and statistics, obviously God didn’t want us to know the future: he put his fingerprint into our equations and called it a random variable. Smart, isn’t it? Therefore, our goal is to work harder in order to guess what is going to happen next!

In this post I will shortly describe one of the most popular methods of forecasting future volatility in financial time-series employing a GARCH model. Next, I will make use of 5-min intraday stock data of close prices to show how to infer possible stock value in next 5 minutes using current levels of volatility in intraday trading. Ultimately, I will discuss an exit strategy from a trade based on forecasted worst case scenario (stock price is forecasted to exceed the assumed stop-loss level). But first, let’s warm up with some cute equations we cannot live without.

Inferring Volatility

Capturing and digesting volatility is somewhat like an art that does not attempt to represent external, recognizable reality but seeks to achieve its effect using shapes, forms, colors, and textures. The basic idea we want to describe here is the volatility $\sigma_t$ of a random variable (rv), e.g. an asset price, on day $t$, as estimated at the end of the previous day $t-1$. How to do it in the easiest way? It’s simple. First let’s assume that the logarithmic rate of change of the asset price between two time steps is:
$$r_{t-1} = \ln\frac{P_{t-1}}{P_{t-2}}$$ what corresponds to the return expressed in percents as $R_{t-1}=100[\exp(r_{t-1})-1]$ and we will be using this transformation throughout the rest of the text. This notation leaves us with a window of opportunity to denote $r_t$ as an innovation to the rate of return under condition that we are able, somehow, to deduce, infer, and forecast a future asset price of $P_t$.

Using classical definition of a sample variance, we are allowed to write it down as:
$$\sigma_t^2 = \frac{1}{m-1} \sum_{i=1}^{m} (r_{t-i}-\langle r \rangle)^2$$ what is our forecast of the variance rate in next time step $t$ based on past $m$ data points, and $\langle r \rangle=m^{-1}\sum_{i=1}^{m} r_{t-i}$ is a sample mean. Now, if we examine return series which is sampled every one day, or an hour, or a minute, it is worth to notice that $\langle r \rangle$ is very small as compared with the standard deviation of changes. This observation pushes us a bit further in rewriting the estimation of $\sigma_t$ as:
$$\sigma_t^2 = \frac{1}{m} \sum_{i=1}^{m} r_{t-i}^2$$ where $m-1$ has been replaced with $m$ by adding an extra degree of freedom (equivalent to a maximum likelihood estimate). What is superbly important about this formula is the fact that it gives an equal weighting of unity to every value of $r_{t-i}$ as we can always imagine that quantity multiplied by one. But in practice, we may have a small wish to associate some weights $\alpha_i$ as follows:
$$\sigma_t^2 = \sum_{i=1}^{m} \alpha_i r_{t-i}^2$$ where $\sum_{i=1}^{m} \alpha_i = 1$, which replaces the factor of $m^{-1}$ in the previous formula. If you think for a second about the idea of the $\alpha$’s, it is pretty straightforward to understand that every observation of $r_{t-i}$ has some significant contribution to the overall value of $\sigma_t^2$. In particular, if we select $\alpha_i<\alpha_j$ for $i>j$, every past observation further away from the most current time of $t-1$ contributes less and less.
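One popular concrete choice of such decaying weights, used here purely as an illustration and by no means the only possibility, is the exponentially weighted scheme $\alpha_i \propto \lambda^{i-1}$ with $0<\lambda<1$; a short Matlab sketch (placeholder data, my own variable names):

% Exponentially weighted estimate of sigma_t^2 as an example of
% decreasing weights alpha_i (illustrative sketch)
r=randn(250,1)*0.01;              % placeholder daily log returns
m=length(r);
lambda=0.94;                      % decay factor (a common textbook value)
alpha=lambda.^(0:m-1)';           % weights: most recent return gets the largest weight
alpha=alpha/sum(alpha);           % normalise so that sum(alpha)=1
r_recent_first=flipud(r);         % r(t-1) first, r(t-m) last
sigma2_t=sum(alpha.*r_recent_first.^2);
sigma_t=sqrt(sigma2_t)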

In 1982 R. Engle proposed a tiny extension of discussed formula, finalized in the form of AutoRegressive Conditional Heteroscedasticity ARCH($m$) model:
$$\sigma_t^2 = \omega + \sum_{i=1}^{m} \alpha_i r_{t-i}^2$$ where $\omega$ is the weighted long-run variance entering with a weight of $\gamma$, such that $\omega=\gamma V$ and now $\gamma+\sum_{i=1}^{m} \alpha_i = 1$. What the ARCH model allows is the estimation of future volatility, $\sigma_t$, taking into account only the past $m$ weighted rates of return $\alpha_i r_{t-i}$ and the additional parameter $\omega$. In practice, we aim at finding the weights $\alpha_i$ and $\gamma$ using the maximum likelihood method for a given return series $\{r_i\}$. This approach, in general, requires approximately $m>3$ in order to describe $\sigma_t^2$ efficiently. So, the question emerges: can we do much better? And the answer is: of course.

Four years later, in 1986, a new player entered the ring. His name was Mr T (Bollerslev) and he literally crushed Engle in the second round with an innovation of the Generalized AutoRegressive Conditional Heteroscedasticity GARCH($p,q$) model:
$$\sigma_t^2 = \omega + \sum_{i=1}^{p} \alpha_i r_{t-i}^2 + \sum_{j=1}^{q} \beta_j \sigma_{t-j}^2$$ which derives its $\sigma_t^2$ based on the $p$ past observations of $r^2$ and the $q$ most recent estimates of the variance rate. The inferred return is then equal to $r_t=\sigma_t\epsilon_t$ where $\epsilon_t\sim N(0,1)$, which leaves us with a rather pale and wry face, as we know what that truly means in practice! A sort of simplification, meeting wide applause in the financial hustle, is delivered by the GARCH(1,1) model:
$$\sigma_t^2 = \omega + \alpha r_{t-1}^2 + \beta \sigma_{t-1}^2$$ which derives its value based solely on the most recent updates of $r$ and $\sigma$. If we think about it for a short while, GARCH(1,1) should provide us with a good taste of forecasted volatility when the past couple of returns were similar; however its weakness emerges in moments of sudden jumps (shocks) in price changes, which cause overestimated volatility predictions. Well, no model is perfect.
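The recursion itself is a one-liner in code. As a toy illustration (the parameter values below are picked arbitrarily, not estimated from any data), the conditional variance path can be generated as:

% Toy GARCH(1,1) variance recursion (illustrative parameter values)
omega=2e-6; alpha=0.1; beta=0.85;        % arbitrary example parameters
r=randn(500,1)*0.01;                     % placeholder return series
sig2=zeros(size(r));
sig2(1)=var(r);                          % initialise with the sample variance
for t=2:length(r)
    sig2(t)=omega+alpha*r(t-1)^2+beta*sig2(t-1);
end
plot(sqrt(sig2)); ylabel('\sigma_t'); xlabel('t');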

Similarly as in the case of the ARCH model, for GARCH(1,1) we may use the maximum likelihood method to find the best estimates of the $\alpha$ and $\beta$ parameters, leading us to a long-run volatility of $[\omega/(1-\alpha-\beta)]^{1/2}$. This is usually achieved in an iterative process by looking for the parameter set which maximises the sum:
$$\sum_{i=3}^{N} \left[ -\ln(\sigma_i^2) - \frac{r_i^2}{\sigma_i^2} \right]$$ where $N$ denotes the length of the return series $\{r_j\}$ ($j=2,…,N$) available for us. There are special dedicated algorithms for doing that and, as we will see later on, we will make use of one of them in Matlab.
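For completeness, here is a sketch of how that sum could be evaluated for a candidate parameter triple; this is only the objective function (my own illustration), which in practice one would hand to a numerical optimiser or, as we will do below, leave to Matlab's own estimation routines:

% Gaussian quasi log-likelihood of GARCH(1,1) for given (omega,alpha,beta)
omega=2e-6; alpha=0.1; beta=0.85;        % candidate parameter values (arbitrary)
r=randn(500,1)*0.01;                     % placeholder return series
sig2=zeros(size(r));
sig2(1)=var(r); sig2(2)=var(r);          % crude initialisation
LL=0;
for i=3:length(r)
    sig2(i)=omega+alpha*r(i-1)^2+beta*sig2(i-1);
    LL=LL+(-log(sig2(i))-r(i)^2/sig2(i));
end
LL                                        % to be maximised over (omega,alpha,beta)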

For the remaining discussion on verification procedure of GARCH model as a tool to explain volatility in the return time-series, pros and cons, and other comparisons of GARCH to other ARCH-derivatives I refer you to the immortal and infamous quant’s bible of John Hull and more in-depth textbook by a financial time-series role model Ruey Tsay.

Predicting the Unpredictable

The concept of predicting the next move in asset price based on GARCH model appears to be thrilling and exciting. The only worry we may have, and as it has been already recognized by us, is the fact that the forecasted return value is $r_t=\sigma_t\epsilon_t$ with $\epsilon_t$ to be a rv drawn from a normal distribution of $N(0,1)$. That implies $r_t$ to be a rv such $r_t\sim N(0,\sigma_t)$. This model we are allowed to extend further to an attractive form of:
$$r_t = \mu + \sigma_t\epsilon_t \ \ \ \sim N(\mu,\sigma_t)$$ where by $\mu$ we will understand a simple mean over past $k$ data points:
$$\mu = k^{-1} \sum_{i=1}^{k} r_{t-i} \ .$$ Since we rather expect $\mu$ to be very small, its inclusion provides us with an offset in $r_t$ value modeling. Okay, so what does it mean for us? A huge uncertainty in guessing where we gonna end up in a next time step. The greater value of $\sigma_t$ inferred from GARCH model, the wider spread in possible values that $r_t$ will take. God’s fingerprint. End of hopes. Well, not exactly.

Let’s look at the bright side of our $N(\mu,\sigma_t)$ distribution. Remember, never give in too quickly! Look for opportunities! Always! So, the distribution has two tails. We know very well what is concealed in its left tail. It’s the devil’s playground of worst losses! Can a bad thing be a good thing? Yes! But how? It’s simple. We can always compute, for example, the 5% quantile $Q$, which corresponds to finding the value of the rv $r_t$ for which there is a 5% (or smaller) chance that the actual value of $r_t$ will be smaller than or equal to $Q$ (a sort of equivalent of finding VaR). Having this opportunity, we may wish to design a test statistic as follows:
$$d = Q + r’$$ where $r’$ would represent a cumulative return of an open trading position. If you are an intraday (or algorithmic high-frequency) trader and you have a fixed stop-loss limit of $s$ set for every trade you enter, a simple verification of a logical outcome of the condition of:
$$d < s \ ?$$ provides you immediately with an attractive model for your decision to close the position in the next time step, based on the GARCH forecast derived at time $t-1$. All right, let’s cut to the chase and see how that works in practice!
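Condensed into a few lines of Matlab, the exit rule could look as follows (a sketch with assumed variable names; the full, data-driven version is developed in the next section):

% GARCH-based exit rule (sketch); quantities assumed available at time t-1:
%   mu, sig_t : forecasted mean and volatility of the next 5-min log return
%   cumRT     : cumulative return (%) of the open position so far
%   s         : fixed stop-loss level (%), e.g. -0.5
Q=100*(exp(norminv(0.05,mu,sig_t))-1);   % 5% quantile expressed in percent
d=cumRT+Q;                               % worst-case cumulative return
if(d<s)
    disp('close the position at the next time step');
end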

5-min Trading with GARCH Exit Strategy

In order to illustrate the whole theory of the GARCH approach and dancing at the edge of uncertainty of the future, we analyze the intraday 5-min stock data of Toyota Motor Corporation traded at the Tokyo Stock Exchange with the ticker TYO:7203. The data file of TYO7203m5.dat contains a historical record of trading in the following form:

TYO7203
DATES       TIMES  OPEN  HIGH  LOW   CLOSE  VOLUME
07/8/2010   11:00  3135  3140  3130  3130   50900
07/8/2010   11:05  3130  3135  3130  3130   16100
07/8/2010   11:10  3130  3145  3130  3140   163700
...
01/13/2011  16:50  3535  3540  3535  3535   148000
01/13/2011  16:55  3540  3545  3535  3535   1010900

i.e. it is spanned between 8-Jul-2010 11:00 and 13-Jan-2011 16:55. Using Matlab language,

% Forecasting Volatility, Returns and VaR for Intraday
% and High-Frequency Traders
%
% (c) 2013 QuantAtRisk.com, by Pawel Lachowicz

clear all; close all; clc;

%% ---DATA READING AND PREPROCESSING

fname=['TYO7203m5.dat'];
% construct FTS object for 5-min data
fts=ascii2fts(fname,'T',1,2,[1 2]);
% sampling
dt=5; % minutes

% extract close prices
cp=fts2mat(fts.CLOSE,1);

% plot 5-min close prices for entire data set
figure(1);
plot(cp(:,1)-cp(1,1),cp(:,2),'Color',[0.6 0.6 0.6]);
xlabel('Days since 08-Jul-2010 11:00');
ylabel('TYO:7203 Stock Price (JPY)');

let’s plot 5-min close prices for a whole data set available:

Now, a moment of concentration. Let’s imagine you are trading this stock based on 5-min close price data points. It is late morning of 3-Dec-2010, you just finished your second cup of cappuccino and a ham-and-cheese sandwich when at 10:55 your trading model sends a signal to open a new long position in TYO:7203 at 11:00.

% ---DATES SPECIFICATION

% data begin on/at
date1='08-Jul-2010';
time1='11:00';
% last available data point (used for GARCH model feeding)
dateG='03-Dec-2010';
timeG='12:35';
% a corresponding data index for that time point (idxGARCH)
[idxGARCH,~,~]=find(cp(:,1)==datenum([dateG,' ',timeG]));
% enter (trade) long position on 03-Dec-2010 at 11:00am
dateT='03-Dec-2010';
timeT='11:00';
% a corresponding data index for that time point (idx)
[idx,~,~]=find(cp(:,1)==datenum([dateT,' ',timeT]));

You buy $X$ shares at $P_{\rm{open}}=3315$ JPY per share at 11:00 and you wait, observing the asset price every 5 min. The price first goes up, then starts to fall. You look at your watch; it is already 12:35. Following the trading data processing, which your Matlab does for you every 5 minutes,

% ---DATA PREPROCESSING (2)

% extract FTS object spanned between the beginning of data
% and last available data point
ftsGARCH=fetch(fts,date1,time1,dateG,timeG,1,'d');
cpG=fts2mat(ftsGARCH.CLOSE,1);           % extract close prices (vector)
retG=log(cpG(2:end,2)./cpG(1:end-1,2));  % extract log returns (vector)

figure(1);
hold on; plot(cp(1:idxGARCH,1)-cp(1,1),cp(1:idxGARCH,2),'b')  % a blue line in Figure 2

figure(2);
% plot close prices
x=1:5:size(cpG,1)*5;
x=x-x(idx+1);
plot(x,cpG(:,2),'-ok');
xlim([-20 110]); ylim([3285 3330]);
ylabel('TYO:7203 Stock Price (JPY)');
xlabel('Minutes since 03-Dec-2010 11:00');
% add markers to the plot
hold on; plot(-dt,cpG(idx,2),'o','MarkerEdgeColor','g');
hold on; plot(0,cpG(idx+1,2),'o','MarkerFaceColor','k','MarkerEdgeColor','k');
% mark -0.5% stop-loss line
xs=0:5:110;
stoploss=(1-0.005)*3315; % (JPY)
ys=stoploss*ones(1,length(xs));
plot(xs,ys,':r');

you re-plot the charts, here as for the data extracted at 3-Dec-2010 12:35,

where the blue line marks the currently available price history of the stock (please ignore the grey line: it refers to future prices which, of course, we don’t know at 3-Dec-2010 12:35, but we use it here to better understand the data feeding process in time, used next in the GARCH modelling), and

which presents the price history with the time axis adjusted to start at 0 when the trade was entered (black filled marker) as suggested by the model 5 min earlier (green marker). Assuming that in our trading we have a fixed stop-loss level of -0.5% per every single trade, the red dotted line in the figure marks the corresponding stock price of the suggested exit action. As of 12:35, the actual trading price of TYO:7203 is still above the stop-loss. Great! Well, are you sure?

As we have been trading the stock since 11:00, every 5 min we recalculate the GARCH model based on the total data available, updated by a new innovation in the price series. Therefore, as of 12:35, our GARCH modelling in Matlab,

% ---ANALYSIS

% GARCH(p,q) parameter estimation
model = garch(3,3)                  % define model
[fit,VarCov,LogL,Par] = estimate(model,retG)
% extract model parameters
parC=Par.X(1)                       % omega
parG=Par.X(2)                       % beta (GARCH)
parA=Par.X(3)                       % alpha (ARCH)
% estimate unconditional volatility
gamma=1-parA-parG
VL=parC/gamma;
volL=sqrt(VL)
% redefine model with estimated parameters
model=garch('Constant',parC,'GARCH',parG,'ARCH',parA)

% infer 5-min variance based on GARCH model
sig2=infer(model,retG);             % vector

runs,

model =

    GARCH(3,3) Conditional Variance Model:
    --------------------------------------
    Distribution: Name = 'Gaussian'
               P: 3
               Q: 3
        Constant: NaN
           GARCH: {NaN NaN NaN} at Lags [1 2 3]
            ARCH: {NaN NaN NaN} at Lags [1 2 3]

____________________________________________________________
   Diagnostic Information

Number of variables: 7

Functions
Objective:   @(X)OBJ.nLogLikeGaussian(X,V,E,Lags,...)
Gradient:    finite-differencing
Hessian:     finite-differencing (or Quasi-Newton)

Constraints
Nonlinear constraints:                      do not exist
Number of linear inequality constraints:    1
Number of linear equality constraints:      0
Number of lower bound constraints:          7
Number of upper bound constraints:          7

Algorithm selected
    sequential quadratic programming

____________________________________________________________
   End diagnostic information

                                           Norm of      First-order
 Iter  F-count       f(x)    Feasibility  Steplength      step      optimality
    0        8  -2.545544e+04  0.000e+00                             1.392e+09
    1       60  -2.545544e+04  0.000e+00   8.812e-09   8.813e-07     1.000e+02

Local minimum found that satisfies the constraints.

Optimization completed because the objective function is non-decreasing in
feasible directions, to within the selected value of the function tolerance,
and constraints are satisfied to within the selected value of the constraint
tolerance.

and returns the best estimates of GARCH model’s parameters:

    GARCH(1,1) Conditional Variance Model:
    ----------------------------------------
    Conditional Probability Distribution: Gaussian

                                  Standard          t
     Parameter       Value          Error       Statistic
    -----------   -----------   ------------   -----------
     Constant     2.39895e-07   6.87508e-08       3.48934
     GARCH{1}             0.9     0.0224136       40.1542
      ARCH{1}            0.05     0.0037863       13.2055

fit =

    GARCH(1,1) Conditional Variance Model:
    --------------------------------------
    Distribution: Name = 'Gaussian'
               P: 1
               Q: 1
        Constant: 2.39895e-07
           GARCH: {0.9} at Lags [1]
            ARCH: {0.05} at Lags [1]

defining the GARCH(1,1) model on 3-Dec-2010 12:35 for TYO:7203 as follows:
$$\sigma_t^2 = 2.39895\times 10^{-7} + 0.05r_{t-1}^2 + 0.9\sigma_{t-1}^2 \ \ ,$$ where the long-run (unconditional) volatility is $\sqrt{V_L}=0.0022$ ($0.2\%$) at $\gamma=0.05$. Please note that, as a computational trick, we defined the raw template of the GARCH model as GARCH(3,3), which Matlab’s solver reduced down to the resulting GARCH(1,1).
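For completeness, the quoted long-run figures follow from the estimated parameters via the relations used in the code above ($\gamma=1-\alpha-\beta$, $V_L=\omega/\gamma$):
$$ \gamma = 1 - 0.05 - 0.9 = 0.05 \ , \quad V_L = \frac{2.39895\times 10^{-7}}{0.05} \approx 4.8\times 10^{-6} \ , \quad \sqrt{V_L} \approx 0.0022 \ . $$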

Based on the model and data available till $t-1=$ 12:35, we get a forecast of $\sigma_t$ value at $t=$ 12:40 to be

% forecast variance 1-step (5-min) ahead, sigma^2_t
sig2_t=forecast(model,1,'Y0',retG,'V0',sig2);   % scalar
sig_t=sqrt(sig2_t)                              % scalar

$$\sigma_t=0.002 \ (0.2\%) \ .$$
Plotting return series for our trade, at 12:35 we have:

% update a plot of 5-min returns
figure(3);
x=1:length(retG);
x=x-idx;
x=x*5;
plot(x,100*(exp(retG)-1),'ko-');
xlim([-20 110]); ylim([-0.8 0.8]);
xlabel('Minutes since 03-Dec-2010 11:00');
ylabel('R_t [%]');

Estimating the mean value of the return series by taking into account the last 12 data points (60 min),

% estimate mean return over the passing hour (dt*12 = 60 min)
mu=mean(retG(end-12:end))

we get $\mu=-2.324\times 10^{-4}$ (a log value), which allows us to define the model for the return at $t=$ 12:40 as:
$$R_t = \mu + \sigma_t\epsilon_t = -0.02324 + 0.2\epsilon \ , \ \ \ \epsilon\sim N(0,1)\ ,$$ with $\mu$ and $\sigma_t$ expressed in percent. The beauty of God’s fingerprint denoted by $\epsilon$ we can understand better by running a simple Monte-Carlo simulation and drawing 1000 rvs of $r_t=\mu+\sigma_t\epsilon$, which illustrate 1000 possible values of the stock return at 12:40 as predicted by the GARCH model!

% simulate 1000 rvs ~ N(0,sig_t)
r_t=sig_t*randn(1000,1);

One of them could be the one realised at 12:40 but we don’t know which one! However, as discussed in Section 2 of this article, we find the 5% quantile $Q$ from the simulation and mark this value with a red filled circle in the plot.

tmp=[(length(retG)-idx)*5+5*ones(length(r_t),1) 100*(exp(r_t)-1)];
hold on; plot(tmp(:,1),tmp(:,2),'x');

Q=norminv(0.05,mu,sig_t);
Q=100*(exp(Q)-1);

hold on; plot((length(retG)-idx)*5+5,Q,'or','MarkerFaceColor','r');

Including 1000 possibilities (blue markers) of realisation of $r_t$, the updated Figure 4 now looks like:

where we find from simulation $Q=-0.3447\%$. The current P&L for an open long position in TYO:7203,

% current P/L for the trade
P_open=3315;   % JPY (3-Dec-2010 11:00)
cumRT=100*(cpG(end,2)/P_open-1)   % = r'

returns $r’=-0.3017\%$. Employing the proposed test statistic $d$,

% test statistic (percent)
d=cumRT+Q

we get $d=r’+Q = -0.6464\%$ which, for the first time since opening the position at 11:00, exceeds our stop-loss of $s=-0.5\%$. Therefore, based on the GARCH(1,1) forecast and the logical condition of $d\lt s$ being true, our risk management engine sends out at 12:35 an order to close the position at 12:40. The forecasted closing price,

% forecasted stock price under the test (in JPY)
f=P_open*(1+d/100)

is estimated to be 3293.6 JPY, i.e. a loss of -0.6464%.
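The decision step itself is not shown in the listings above. Below is a minimal Matlab sketch of it, reusing $d$ as computed in the code and introducing a stop-loss variable s (a name assumed here, not taken from the original code):

% a sketch of the exit rule described in the text (not part of the original listings)
s=-0.5;      % assumed stop-loss level per trade [%]
if(d<s)      % d = cumRT + Q, the test statistic computed above [%]
    disp('12:35: signal to close the long position at the next 5-min bar (12:40)');
else
    disp('12:35: signal to keep the position open');
end

With $d=-0.6464\%$ and $s=-0.5\%$ the condition is met, and the close order is issued.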

At 12:40 the order is executed at 3305 JPY per share, i.e. with a realized loss of -0.3017%.

The very last question you ask yourself is: using the above GARCH exit strategy, was luck in my favour or not? In other words, did the algorithm take the right decision? We can find out only by letting time flow. As you come back from a super-quick lunch break, you sit down, look at the recent chart of TYO:7203 at 12:55,

and you feel relief, as the price went further down, making the algo’s decision the right one.

For reference, the red filled marker denotes the price at which we closed our long position at 12:40 with a loss of -0.3017%, while the open red circle marks the closing price forecasted at 12:35 under the $d$-statistic, added for the completeness of the grand picture and the idea standing behind this post.

## Rebinning of Financial Time-Series

Working with financial time-series, especially in trading and the subsequent analysis of data trends, we often wish to rebin (resample) our data into a new time-series which provides information on the average value of the underlying records, otherwise characterized by high volatility in time.

The following algorithm in Matlab does the job pretty well:

% Function rebins any time-series [x,y] where x is a time vector
% and y stores time-series values from the current (regular or
% irregular) sampling. A new time-series sampling is dt.
%
% Input: x  : time vector [1xN]
%        y  : data vector [1xN]
%        dy : data error vector [1xN]
%        dt : new rebinning time (dt > current sampling)
%
% (c) 2013, QuantAtRisk.com, by Pawel Lachowicz
%
% Example: x=1:1:1001;           % assume 1001 trading days
%          y=rand(1,1001);       % simulate pseudo data
%          dy=0.1*rand(1,1001);  % simulate pseudo data errors (if required)
%          dt=50;                % rebin data from 1 day sampling down to
%                                %   50 day intervals
%          [rx,ry]=rebin(x,y,dy,dt);   % or [rx,ry]=rebin(x,y,[],dt);
%          plot(x,y,'color',[0.7 0.7 0.7]);  % plot original time-series
%          hold on; plot(rx,ry,'ro-');       % plot down-rebinned time-series

function [rx,ry]=rebin(x,y,dy,dt)
    if(isempty(dy))
        dy=ones(1,length(y));
    end
    rdata=[];
    j=1; k=1;
    t2=x(1);
    while(j<=length(x))
        i=j;
        if((x(i)+dt)>x(end))
            break
        else
            t2=t2+dt;
        end
        i=j;
        sa=0; wa=0; ea=0; il=0;
        while(x(i)<t2)
            sa=sa+(y(i)/dy(i)^2);
            wa=wa+(1/dy(i)^2);
            i=i+1;
            il=il+1;
        end
        ry=sa/wa;
        rx=t2-dt;
        if(il>=1)
            rdata=[rdata; rx ry];
        end
        j=j+il;
    end
    rx=rdata(:,1);
    ry=rdata(:,2);
end

Example

As an example we will use the daily rates of return from trading of AAPL (Apple Inc.) from Jan-1984 to Apr-2011 (download the aaplr.mat data file). Below, a short set of command lines allows us to execute the rebinning of the return series,

clear all; close all; clc;

load aaplr.mat
plot(aaplR)

x=1:length(aaplR);   % 6855 days
y=aaplR;
plot(x,y,'color',[0.7 0.7 0.7]);

% rebin the data with dt=25 days step (one trading month)
dt=25;
dy=[];   % we have no information on data errors

[rx,ry]=rebin(x,y,dy,dt)

% overplot results and display first 500 days
hold on; plot(rx,ry,'ro-');
xlim([0 500]); ylim([-0.125 0.15]);

and plot both time-series, original and rebinned with a new bin time of 25 days for the first 500 days of trading:

where the red line denotes the new rebinned time-series with a bin time of 25 trading days. The function (the algorithm) computes a simple weighted mean based on the data points falling into the interval $\langle t,t+dt \rangle$. If we do not specify the input data-error vector $dy$, as in the example above, we obtain a simple mean as a consequence.
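More precisely, within each bin the function returns the inverse-variance weighted mean implemented in the code above,
$$ \bar{y} = \frac{\sum_i y_i/dy_i^2}{\sum_i 1/dy_i^2} \ , $$
which reduces to the arithmetic mean when all $dy_i$ are equal, e.g. when $dy$ is left empty and set internally to ones.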

## Trend Identification for FX Traders (Part 2)

In my previous post I provided an introduction to trading model invention and design. We made use of FX data for the AUDUSD pair sampled hourly, splitting the data into weekly time-series.

Today, we will work a bit harder on formulating the very first rules for the model. This step requires engaging our creativity in understanding what the data look like, as well as building a basic set of rules that allows an attractive time-series classification. Our objective is to invent a method for classifying last week’s FX pair time-series as being either in a downtrend or in an uptrend.

The most naive way to classify the directional information contained in any time-series is its slope: a straight line fit to the data. Let’s use it as our starting point. Instead of fitting all data points for a given week, we find median values for the first and the last 12 data points, both in Time $(x1,x2)$ and Pair Ratio $(y1,y2)$, as specified in the linear trend estimation step of the code below:

% FX time-series analysis
% (c) Quant at Risk, 2012
%
% Part 2: Classification of weekly time-series

close all;
scrsz = get(0,'ScreenSize');
h=figure('Position',[70 scrsz(4)/2 scrsz(3)/1.1 scrsz(4)/2],'Toolbar','none');
fprintf('\nuptrend/downtrend identification.. ');
% for viewing use the loop
hold off;
set(0,'CurrentFigure',h);

% pre-define variables
trend=zeros(nw,1);
slope=zeros(nw,1);
midp={};   % middle points
endp={};   % end points (median based on last 12 points)

for i=1:nw   %--- a loop over total number of weeks available

    % reading time-series for a current week
    w=week{i};
    x=w(:,1);
    y=w(:,2);

    % plot the time-series
    hold on; plot(x,y,'k');

    % linear trend estimation
    x1=median(x(1:12));   x2=median(x(end-11:end));
    y1=median(y(1:12));   y2=median(y(end-11:end));

    % define end-point of the time-series and mark it on the plot
    endp{i}=[x2 y2];
    hold on; plot(endp{i}(1),endp{i}(2),'b*');

    % find slope
    m=(y2-y1)/(x2-x1);
    slope(i)=m;
    xl=x1:dt:x2;
    yl=m*xl-m*x2+y2;   % a line representing the slope
    hold on; plot(xl,yl,'b:');

    % find middle point of the line and mark it on the plot
    mx=mean(xl);
    my=mean(yl);
    midp{i}=[mx my];
    hold on; plot(midp{i}(1),midp{i}(2),'bo');

As an example of the code execution, for the first two weeks we plot slopes, mid-points and end-points:

We assume that our classification procedure will be based solely on the information provided by end-points and slopes. We make an exception for the first two weeks. For the first week, the distinction between uptrend and downtrend is made based on the positions of the first and last points:

    % Time-Series Classification

    if(i==1)
        ybeg=y(1);
        yend=y(end);
        if(ybeg<yend)
            trend(i)=+1;   % uptrend
            hold on; plot(x,y,'k');
        else
            trend(i)=-1;   % downtrend
            hold on; plot(x,y,'r');
        end
    end

where we mark the result of the classification with a sign of $+1$ or $-1$ in an element of the vector $trend$, and plot the series in black or red, denoting uptrend and downtrend, respectively.

For the second week, our classification rules are enriched by additional information about the end-points of the current and the previous week:

    if(i==2)
        % week(current-1)
        tmp=week{i-1};
        x1=tmp(:,1);
        y1=tmp(:,2);
        y1b=y1(1);
        y1e=y1(end);
        % week(current)
        y0b=y(1);
        y0e=y(end);
        if(y0e>y1e)
            trend(i)=+1;   % uptrend
            hold on; plot(x,y,'k');
        else
            trend(i)=-1;   % downtrend
            hold on; plot(x,y,'r');
        end
    end

For weeks number 3 and higher we do our creative research over the data to define a specific set of rules. We allow ourselves to take into account the information from the two weeks prior to the current one and combine it all together. The following code represents an attractive solution, subject to improvement:

    if(i>2)
        % week(current-2)
        mid2=midp{i-2}(2);
        end2=endp{i-2}(2);
        slp2=slope(i-2);
        % week(current-1)
        mid1=midp{i-1}(2);
        end1=endp{i-1}(2);
        slp1=slope(i-1);
        % week(current)
        mid0=midp{i}(2);
        end0=endp{i}(2);
        slp0=slope(i);
        if((mid0>mid1))
            % up-trend
            if((mid0>mid2)&&(end0>end1))
                trend(i)=+1;
                hold on; plot(x,y,'k');   % strong up-trend
            elseif((mid0>mid2)&&(end0>end2))
                trend(i)=+1;
                hold on; plot(x,y,'k');   % weak up-trend
            elseif((mid0<mid2)&&(end0<end2)&&(slp0<0))
                trend(i)=-1;
                hold on; plot(x,y,'r');   % turns into possible down-trend
            elseif((mid0<mid2)&&(end0<end2)&&(slp0>0))
                trend(i)=+1;
                hold on; plot(x,y,'k');   % turns into possible up-trend
            else
                trend(i)=+1;
                hold on; plot(x,y,'k');   % turns into possible up-trend
            end
        elseif(mid0<mid1)
            % down-trend
            if((mid0<mid2)&&(end0<end1)&&(end0<end2))
                trend(i)=-1;
                hold on; plot(x,y,'r');   % weak down-trend
            elseif((mid0<mid2)&&(end0<end2)&&(end0>end1))
                trend(i)=+1;
                hold on; plot(x,y,'k');   % possible up-trend
            elseif((mid0<mid2)&&(end0>end2))
                trend(i)=+1;
                hold on; plot(x,y,'k');   % turns into possible up-trend
            elseif((mid0>mid2)&&(end0<end1)&&(end0<end2))
                trend(i)=-1;
                hold on; plot(x,y,'r');
            elseif((mid0>mid2)&&(end0>end2))
                trend(i)=+1;
                hold on; plot(x,y,'k');   % turns into possible up-trend
            elseif((mid0>mid2)&&(end0>end1))
                trend(i)=+1;
                hold on; plot(x,y,'k');   % turns into possible up-trend
            else
                trend(i)=-1;
                hold on; plot(x,y,'r');
            end
        end
    end
end

Since one picture is worth millions of lines of code, below we present three examples of our model in action. The last plot corresponds to the latest Global Financial Crisis and shows how the weeks in uptrend of 2009 followed those in downtrend a year before.

It is straightforward to note that our rules perform very intuitively and stay in general agreement with the market sentiment.

## Extracting Time-Series from Tick-Data

Tick-data provided in the .csv (comma-separated values) file format can sometimes be a real problem to handle quickly, especially when the total size runs into hundreds of GB. If your goal is to extract a time-series with, say, hourly time resolution only, this article provides some fresh and constructive guidelines on how to do it smartly in the Linux environment.

First, let’s have a closer look at the data. Say we have a collection of 2148 .csv files holding the FX trading history of the AUDUSD pair, covering nearly 10 years between 2000 and 2010. Each file is 7.1 MB in size, which leaves us with approximately 15 GB of data to process. Looking into a randomly selected file we can identify the header and the data themselves:

$ head -10 audusd_216.csv
Ticks,TimeStamp,Bid Price,Bid Size,Ask Price,Ask Size
632258349000000015,2004-07-19 11:55:00.000,0.7329,1000000,0.7334,1000000
632258349000000016,2004-07-19 11:55:00.000,0.7329,1000000,0.7334,1000000
632258349000000017,2004-07-19 11:55:00.000,0.7329,1000000,0.7333,1000000
632258349000000018,2004-07-19 11:55:00.000,0.7327,1000000,0.7333,1000000
632258349000000019,2004-07-19 11:55:00.000,0.7327,1000000,0.7333,1000000
632258349000000020,2004-07-19 11:55:00.000,0.7328,1000000,0.7333,1000000
632258349000000021,2004-07-19 11:55:00.000,0.7328,1000000,0.7334,1000000
632258349600000000,2004-07-19 11:56:00.000,0.7328,1000000,0.7334,1000000
632258349600000001,2004-07-19 11:56:00.000,0.7328,1000000,0.7336,1000000

Our aim will be to extract the Bid and Ask Price time-series. We will make use of a few standard Linux tools, e.g. sed and awk, supplemented with extra f77 codes. It is also a chance to demonstrate how useful shell programming can be and to explore the enigmatic syntax of its programs. Generally, we will be writing a shell script, executable for any FX pair name, e.g. gbpnzd, eurjpy, and so on.

In the first step of the script we create a list of all files. This is tricky in Linux, as the standard command ‘ls -lotr’ returns the desired list but also all the details on file size, attributes, etc. which we simply do not want. The final for-loop piped through sort and awk in the snippet below solves the problem:

# Extracting Time-Series from Tick-Data .csv files
# (c) Quant at Risk, 2012
#
# Exemplary usage: ./script.src audusd

#!/bin/bash

echo "..making a sorted list of .csv files"
for i in $1_*.csv; do echo ${i##$1_} $i${i##.csv}; done |
    sort -n | awk '{print $2}' > $1.lst

and a newly created file $1.lst (note: $1 corresponds in the shell script to the parameter name we called the script with, e.g. audusd; therefore $1.lst physically means audusd.lst) contains the list:

audusd_1.csv
audusd_2.csv
audusd_3.csv
...
audusd_2148.csv

We create one data file from all 2148 pieces by creating and executing an in-line script:

echo "..creating one data file"
awk '{print "cat",$1," >> tmp.lst"}' $1.lst > tmp.cmd
chmod +x tmp.cmd
./tmp.cmd
rm tmp.cmd
mv tmp.lst $1.tmp

Now, $1.tmp is a 15 GB file and we may wish to remove some unnecessary comments and tokens:

echo "..removing comments"
sed 's/Ticks,TimeStamp,Bid Price,Bid Size,Ask Price,Ask Size//g' $1.tmp > $1.tmp2
rm $1.tmp

echo "..removing empty lines"
sed '/^$/d' $1.tmp2 > $1.tmp
rm $1.tmp2

echo "..removing token ,"
sed 's/,/ /g' $1.tmp > $1.tmp2
rm $1.tmp

echo "..removing token :"
sed 's/:/ /g' $1.tmp2 > $1.tmp
rm $1.tmp2

echo "..removing token -"
sed 's/-/ /g' $1.tmp > $1.tmp2
rm $1.tmp

echo "..removing column with ticks and ask/bid size"
awk '{print $2,$3,$4,$5,$6,$7,$8,$10}' $1.tmp2 > $1.tmp
rm $1.tmp2

In order to convert the time information into a continuous measure of time, we modify an f77 code for our task as follows:

c Extracting Time-Series from Tick-Data .csv files
c (c) Quant at Risk, 2012
c
c Program name: fx_getmjd.for
c Aim: removes ticks and converts trade time into MJD time [day]
c Input data format: YYYY MM DD HH MM SS.SSS BID BID_Vol ASK ASK_Vol

      implicit double precision (a-z)
      integer y,m,d,hh,mm,jd
      integer*8 bidv,askv
      character zb*50

      call getarg(1,zb)
      open(1,file=zb)
      do i=1.d0, 500.d6
        read(1,*,end=1) y,m,d,hh,mm,ss,bid,ask
        jd= d-32075+1461*(y+4800+(m-14)/12)/4+367*(m-2-(m-14)/12*12)
     _      /12-3*((y+4900+(m-14)/12)/100)/4
        mjd=(jd+(hh-12.d0)/24.d0+mm/1440.d0+ss/86400.d0)-2400000.5d0
        mjd=mjd-51544.d0          ! T_0 = 2000.01.01 00:00
        abr=ask/bid
        write(*,2) mjd,bid,ask,abr
      enddo
 1    close(1)
 2    format(F15.8,F8.4,F8.4,F12.6)

      end

and execute it in our script:

echo "..changing a date to MJD"
fx_getmjd $1.tmp > $1.dat
rm $1.tmp

In the aforementioned f77 code, we set the zero time point (MJD = 0.00) at Jan 1, 2000 00:00. Since that day, our time is expressed as a single column measuring time progress in days, with fractional parts tracking hours and minutes. We may now split the data into two separate time-series containing Bid and Ask Prices at the tick-data level:

echo "..splitting into bid/ask/abr files"
awk '{print $1,$2}' $1.dat > $1.bid
awk '{print $1,$3}' $1.dat > $1.ask

A quick inspection of both files reveals we deal with nearly $500\times 10^6$ lines! Before we reach our chief aim, i.e. rebinning the series with a 1-hour time resolution, there is a need, unfortunately, to separate the input into 5 parts, each of a maximum of $100\times 10^6$ lines. The latter may vary depending on the RAM memory available and, if sufficient, this step can even be skipped. We proceed:

echo "..splitting bid/ask/abr into separate files"
fx_splitdat $1 1
fx_splitdat $1 2
fx_splitdat $1 3
fx_splitdat $1 4
fx_splitdat $1 5

where fx_splitdat.for code is given as follows:

c Extracting Time-Series from Tick-Data .csv files
c (c) Quant at Risk, 2012
c
c Program name: fx_splitdat.for
c Exemplary usage: ./fx_splitdat audusd [1,2,3,4,5]

      implicit double precision (a-z)
      integer nc
      character*6 zb
      character*10 zbask,zbbid
      character*16 zb1ask,zb2ask,zb3ask,zb4ask,zb5ask
      character*16 zb1bid,zb2bid,zb3bid,zb4bid,zb5bid
      character*1 par2,st

c     zb- name of length equal 6 characters only
      call getarg(1,zb)
      call getarg(2,par2)    ! case

      write(st,'(a1)') par2
      read(st,'(i1)') nc

      zbask=zb(1:6)//'.ask'
      zbbid=zb(1:6)//'.bid'

      zb1ask=zb(1:6)//'.ask.part1'
      zb2ask=zb(1:6)//'.ask.part2'
      zb3ask=zb(1:6)//'.ask.part3'
      zb4ask=zb(1:6)//'.ask.part4'
      zb5ask=zb(1:6)//'.ask.part5'

      zb1bid=zb(1:6)//'.bid.part1'
      zb2bid=zb(1:6)//'.bid.part2'
      zb3bid=zb(1:6)//'.bid.part3'
      zb4bid=zb(1:6)//'.bid.part4'
      zb5bid=zb(1:6)//'.bid.part5'

      open(11,file=zbask)
      open(12,file=zbbid)

      if(nc.eq.1) then
        open(21,file=zb1ask)
        open(22,file=zb1bid)
        do i=1.d0, 100.d6
          read(11,*,end=1) mjd_ask,dat_ask
          read(12,*,end=1) mjd_bid,dat_bid
          if((i>=1.0).and.(i<100000001.d0)) then
            write(21,2) mjd_ask,dat_ask
            write(22,2) mjd_bid,dat_bid
          endif
        enddo
      endif

      if(nc.eq.2) then
        open(31,file=zb2ask)
        open(32,file=zb2bid)
        do i=1.d0, 200.d6
          read(11,*,end=1) mjd_ask,dat_ask
          read(12,*,end=1) mjd_bid,dat_bid
          if((i>=100000001.d0).and.(i<200000001.d0)) then
            write(31,2) mjd_ask,dat_ask
            write(32,2) mjd_bid,dat_bid
          endif
        enddo
      endif

      if(nc.eq.3) then
        open(41,file=zb3ask)
        open(42,file=zb3bid)
        do i=1.d0, 300.d6
          read(11,*,end=1) mjd_ask,dat_ask
          read(12,*,end=1) mjd_bid,dat_bid
          if((i>=200000001.d0).and.(i<300000001.d0)) then
            write(41,2) mjd_ask,dat_ask
            write(42,2) mjd_bid,dat_bid
          endif
        enddo
      endif

      if(nc.eq.4) then
        open(51,file=zb4ask)
        open(52,file=zb4bid)
        do i=1.d0, 400.d6
          read(11,*,end=1) mjd_ask,dat_ask
          read(12,*,end=1) mjd_bid,dat_bid
          if((i>=300000001.d0).and.(i<400000001.d0)) then
            write(51,2) mjd_ask,dat_ask
            write(52,2) mjd_bid,dat_bid
          endif
        enddo
      endif

      if(nc.eq.5) then
        open(61,file=zb5ask)
        open(62,file=zb5bid)
        do i=1.d0, 500.d6
          read(11,*,end=1) mjd_ask,dat_ask
          read(12,*,end=1) mjd_bid,dat_bid
          if((i>=400000001.d0).and.(i<500000001.d0)) then
            write(61,2) mjd_ask,dat_ask
            write(62,2) mjd_bid,dat_bid
          endif
        enddo
      endif

 1    close(1)
 2    format(F15.8,F8.4)

      stop
      end

and compiling it as usual:

f77 fx_splitdat.for -o fx_splitdat

Finally, we can extract the rebinned Bid and Ask Price time-series with bin time of 1 hour, i.e. $dt=0.041666667$ d, making use of the following f77 code:

c Extracting Time-Series from Tick-Data .csv files
c (c) Quant at Risk, 2012
c
c Program name: fx_rebin.for
c Exemplary usage: ./fx_rebin audusd 2

      implicit double precision (a-z)
      parameter (dim=100.d6)

      double precision f(dim), mjd(dim), step
      character*50 par1, par2, st

      call getarg(1,par1)    ! file name
      call getarg(2,par2)    ! binning [d]

      write(st,'(a20)') par2
      read(st,'(f20)') step

c     reading data
      open(1,file=par1)
      do i=1,100.d6
        read(1,*,end=1)
     _    mjd(i),f(i)
      enddo
 1    close(1)
      n=i-1.d0

c     main loop
      j=1.d0
      k=1.d0
      t2=0.
      t2=dint(mjd(j))
      do while (j.lt.n)
        i=j
        if ((mjd(i)+step).gt.(mjd(n))) then
          print*
          stop
        else
          t2=t2+step
        endif
        i=j
        il=0.d0
        s=0.d0
        do while (mjd(i).lt.t2)
          s=s+f(i)
          i=i+1.d0
          il=il+1.d0       ! how many points in segment
        enddo
        av=s/il
        day=t2-step
        if (il.ge.1.d0) then
          write(*,3) day,av
        endif
        j=j+il
      enddo

 2    format(f30.7,2f30.6)
 3    format(f20.8,f8.4)

 10   stop
      end

executed in our script for all five parts of both tick-data time-series:

echo "..rebinning with dt = 1 h"
dt=0.041666667
fx_rebin $1.ask.part1 $dt > $1.ask.part1.1h
fx_rebin $1.ask.part2 $dt > $1.ask.part2.1h
fx_rebin $1.ask.part3 $dt > $1.ask.part3.1h
fx_rebin $1.ask.part4 $dt > $1.ask.part4.1h
fx_rebin $1.ask.part5 $dt > $1.ask.part5.1h
fx_rebin $1.bid.part1 $dt > $1.bid.part1.1h
fx_rebin $1.bid.part2 $dt > $1.bid.part2.1h
fx_rebin $1.bid.part3 $dt > $1.bid.part3.1h
fx_rebin $1.bid.part4 $dt > $1.bid.part4.1h
fx_rebin $1.bid.part5 $dt > $1.bid.part5.1h

echo "..appending rebinned files"
cat $1.ask.part1.1h $1.ask.part2.1h $1.ask.part3.1h $1.ask.part4.1h $1.ask.part5.1h > $1.ask.1h.tmp
cat $1.bid.part1.1h $1.bid.part2.1h $1.bid.part3.1h $1.bid.part4.1h $1.bid.part5.1h > $1.bid.1h.tmp
rm *part*

echo "..removing empty lines in rebinned files"
sed '/^$/d' $1.ask.1h.tmp > $1.ask.1h
rm $1.ask.1h.tmp
sed '/^$/d' $1.bid.1h.tmp > $1.bid.1h
rm $1.bid.1h.tmp

echo "..done!"

As the final product we obtain two files, say,

audusd.bid.1h
audusd.ask.1h

of the content:

$ tail -5 audusd.bid.1h
3772.62500304  0.9304
3772.66666970  0.9283
3772.70833637  0.9266
3772.75000304  0.9263
3772.79166970  0.9253

where the first column is time (MJD) with 1-hour resolution and the second column contains a simple price average of all tick prices falling into the 1-hour time bin. As for dessert, we plot both Bid and Ask Price time-series, which accomplishes our efforts:

## Bank format of numbers in Matlab

The following Matlab function allows for displaying any double-type number using a nice bank format that includes comma separators and two-digit precision (without rounding). The number $num$ is converted into a new variable of string type.

function [str]=bankformat(num)
    num2=floor(num*1000)/1000;
    r=int32(100*(num-floor(num)));
    str = num2str(num2);
    k=find(str == '.');
    if(isempty(k))
        str=[str,'.00'];
    end
    FIN = min(length(str),find(str == '.')-1);
    for i = FIN-2:-3:2
        str(i+1:end+1) = str(i:end);
        str(i) = ',';
    end
    x=mod(r,10);
    if(x==0)
        str=[str,'0'];
    end
    k=find(str == '.');
    d=length(str)-k;
    if(d>2)
        str=str(1:end-(d-2));
    end
end

Example:

>> n=70149506.5;
>> s=bankformat(n);
>> fprintf('\nInvested capital: £%1s\n',s)

Invested capital: £70,149,506.50