Quantitative Analysis, Risk Management, Modelling, Algo Trading, and Big Data Analysis

Applied Portfolio VaR Decomposition. (1) Marginal and Component VaR.

Risk. The only ingredient of life that makes us grow and push outside our comfort zones. In finance, taking risk is a risky business. Once your money has been invested, you need to keep your eyes on the ball that is rolling. Controlling the risk is both an art and a science: a quantitative approach that allows us to ease the pressure. With some proper tools we can manage our positions in the markets. No matter whether you invest over a long-term horizon or trade many times per day, your portfolio is subject to constantly changing levels of risk and asset volatility.


VaR. Value-at-Risk. The most hated and most adored quantitative measure of risk. Your running security amongst the unpredictability of the floor. A drilling pain in the ass; a few sleepless nights you have become so familiar with. The omnipresent and rarely verified assumption that asset returns are normally distributed. Well, we all know the truth but, somehow, we love to settle for the analytical solutions allowing us to see “a less attractive girl” as more attractive.

In this post we will stay in the mystic circle of those assumptions as a starting point. Don’t blame me. This is what they teach us in the top-class textbooks on financial risk management and test our knowledge within FRM or CFA exams. You may pass the exam but do you really, I mean, really understand the effects of what you have just studied in real portfolio trading?

We kick off from a theory on VaR decomposition in the framework of an active $N$-asset portfolio when our initial capital of $C$ has been fully invested in the stock market (as an example). Next, we investigate the portfolio VaR and by computation of its marginal VaR we derive new positions in our portfolio that minimise the portfolio VaR.

The Analytical VaR for N-Asset Portfolio. The Beginning.

Let’s keep our theoretical considerations to the absolute but essential minimum. At the beginning you have $C$ dollars in hand. You decide to invest them all into $N$ assets (stocks, currencies, commodities, etc.). Great. Say, you pick US equities and buy shares of $N=7$ stocks. The number of shares depends on the stock price, and each position size (in dollars) varies depending on your risk-return appetite. Therefore, the position size can be understood as a weighting factor, $w_i$, for $i=1,…,N$, such that:
$$
\sum_{i=1}^N w_i = 1 \ \ \mbox{or} \ \ 100\mbox{%}
$$ or $d_i=w_iC$ in dollars and $\sum_i d_i=C$. Once the orders have been fulfilled, you are a proud holder of an $N$-asset portfolio of stocks. Congratulations!

From now on, every day, when the markets close, you check the daily rate of return for the individual assets, $r_{i,t} = P_{i,t}/P_{i,{t-1}}-1$, where by $P$ we denote the stock close-price time-series. Your portfolio rate of return, day-to-day, i.e. from $t-1$ to $t$, will be:
$$
R_{p,t} = \sum_{i=1}^N w_ir_{i,t} = \boldsymbol{w}’\boldsymbol{r}
$$ or alternatively:
$$
R_{p,t} = \sum_{i=1}^N d_ir_{i,t} = \boldsymbol{d}’\boldsymbol{r}
$$ which provides us with the updated portfolio value. Note that this differs from the portfolio expected return, defined as:
$$
\mu_P = E\left(\sum_{i=1}^{N} w_ir_i \right) = \sum_{i=1}^{N} w_i\mu_i = \boldsymbol{w}’\boldsymbol{\mu}
$$ where the expected daily return of the $i$-th asset, $\mu_i$, needs to be estimated based on its historical returns (a return-series). This is the most fragile spot in the practical application of the Modern Portfolio Theory. If we estimate,
$$
\mu_i = \sum_{t’={(t-1)}-T}^{{t-1}} r_{t’}/T
$$ looking back at the last $T$ trading days, it is obvious that we will derive different $\mu_i$’s for $T=252$ (1 year) and for $T=1260$ (5 years). That directly affects the computation of the portfolio covariance matrix and thus, as we will see below, the (expected) portfolio volatility. Keep that crucial remark in mind!
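To see how strongly the choice of $T$ drives the estimates, here is a minimal Matlab sketch; the matrix rs of daily return-series (one column per asset, at least 1260 rows) is an assumed input, not part of the listings below:

% sensitivity of the estimated moments to the lookback window T
% rs : [Tmax x N] matrix of daily returns (assumed input)
for T=[252 1260]
    sample=rs(end-T+1:end,:);    % last T trading days
    mu=mean(sample)';            % expected daily returns, [N x 1]
    M2=cov(sample);              % covariance matrix, [N x N]
    fprintf('T = %4d   mean(mu) = %8.5f   mean(vol) = %8.5f\n', ...
            T,mean(mu),mean(sqrt(diag(M2))));
end

Both the expected returns and the covariance matrix, and therefore everything derived from them below, shift with $T$.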

The expected portfolio variance is:
$$
\sigma_P^2 = \boldsymbol{w}’ \boldsymbol{M}_2 \boldsymbol{w}
$$
where $\boldsymbol{M}_2$ is the covariance matrix $N\times N$ with the individual covariances of $c_{ij}=\mbox{cov}(\{r_i\},\{r_j\})$ between security $i$ and $j$ computed based on the return-series. Therefore,
$$
\sigma_P^2 = [w_1,…,w_N]
\left[
\begin{array}{cccc}
c_{11} & c_{12} & … & c_{1N} \\
… & … & … & … \\
c_{N1} & c_{N2} & … & c_{NN}
\end{array}
\right]
\left[
\begin{array}{cccc}
w_{1} \\
… \\
w_{N}
\end{array}
\right]
$$ which can be expressed in terms of the dollar exposure as:
$$
\sigma_P^2C^2 = \boldsymbol{d}’ \boldsymbol{M}_2 \boldsymbol{d} \ .
$$ In order to move from the portfolio variance to the portfolio VaR we need to know the distribution of the portfolio returns. And we assume it to be normal (delta-normal model). Given that, the portfolio VaR at the 95% confidence level ($\alpha=1.65$) is:
$$
\mbox{VaR}_P = 1.65 \sigma_P C = 1.65\left(\boldsymbol{d}’ \boldsymbol{M}_2 \boldsymbol{d} \right)^{1/2}
$$ which can be understood as the diversified VaR, since we use our portfolio to reduce the overall risk of the investment. On the other hand, the 95% individual VaR, describing the individual risk of each asset in the portfolio, is:
$$
\mbox{VaR}_i = 1.65 |\sigma_i d_i|
$$ where $\sigma_i$ represents the asset volatility over past period of $T$.

It is extremely difficult to obtain a portfolio with perfect correlation (unity) among all its components. Only in such a case would we talk about the undiversified VaR, which can be computed as the sum of the $\mbox{VaR}_i$’s provided, as an additional required condition, that no short position is held in our portfolio. When reporting the risk of a live portfolio, risk managers often quote both the diversified and the undiversified VaR.
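As a toy illustration of the difference between the two numbers (all inputs assumed, delta-normal 95% level), consider a two-asset case in Matlab:

% diversified vs undiversified 95% VaR for a toy 2-asset portfolio
alpha=1.65;                        % 95% confidence level (delta-normal)
d=[100000; 200000];                % dollar exposures (assumed)
sig=[0.02; 0.03];                  % daily volatilities (assumed)
rho=0.4;                           % correlation (assumed)
M2=[sig(1)^2 rho*sig(1)*sig(2); rho*sig(1)*sig(2) sig(2)^2];
VaR_P=alpha*sqrt(d'*M2*d)          % diversified VaR
uVaR_P=sum(alpha*abs(sig.*d))      % undiversified VaR = sum of individual VaRs

For any correlation below unity the diversified VaR comes out smaller than the undiversified one.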

Now, imagine that we add one unit of the $i$-th asset to our portfolio and we aim at understanding the impact of this action on the portfolio VaR. In this case we talk about a marginal contribution to risk, which can be derived by differentiating the portfolio volatility with respect to $w_i$:
$$
\frac{\mbox{d}\sigma_P}{\mbox{d}w_i} = \frac{\mbox{d}}{\mbox{d}w_i} \left[ \sum_{i=1}^{N} w_i^2\sigma_i^2 + \sum_{i=1}^N \sum_{j=1, j\ne i}^{N} w_iw_jc_{ij} \right]^{1/2}
$$
$$
= \frac{1}{2} \left[ 2w_i\sigma_i^2 + 2 \sum_{j=1, j\ne i}^{N} w_jc_{ij} \right]
\left[ \sum_{i=1}^{N} w_i^2\sigma_i^2 + \sum_{i=1}^N \sum_{j=1, j\ne i}^{N} w_iw_jc_{ij} \right]^{-1/2}
$$
$$
= \frac{w_i\sigma_i^2 + \sum_{j=1, j\ne i}^{N} w_j c_{ij} } {\sigma_P}
= \frac{c_{iP}}{\sigma_P}
$$ where, again, $c_{ij}$ denotes the covariance between the $i$-th and $j$-th asset and $c_{iP}$ that between the $i$-th component and the portfolio. That allows us to define the marginal VaR for the $i$-th asset in the portfolio in the following way:
$$
\Delta\mbox{VaR}_i = \frac{\mbox{dVaR}_P}{\mbox{d}d_i} = \frac{\mbox{dVaR}_P}{\mbox{d}w_iC} = \frac{\mbox{d}(\alpha\sigma_P C)}{\mbox{d}w_iC} = \alpha\frac{C}{C}
\frac{\mbox{d}\sigma_P}{\mbox{d}w_i} = \alpha \frac{c_{iP}}{\sigma_P}
$$ that has a close relation to the systematic risk of $i$-th asset as confronted with the portfolio itself, described by:
$$
\beta_i = \frac{c_{iP}}{\sigma_P^2}
$$ therefore
$$
\Delta\mbox{VaR}_i = \alpha \sigma_P \beta_i = \beta_i \frac{ \mbox{VaR}_P } {C}
$$ The best and most practical interpretation of the marginal VaR calculated for all positions in the portfolio is this: the higher the $\Delta\mbox{VaR}_i$, the more the exposure of the corresponding $i$-th component should be reduced in order to lower the overall portfolio VaR. Simple as that. Hold on, we will see that at work in the calculations a little bit later.

Since the portfolio volatility is a highly nonlinear function of its components, a simplistic computation of individual VaRs followed by adding them all up injects error into the portfolio risk estimation. However, with the help of the marginal VaR we gain a tool to capture the fireflies amongst the darkness of the night.

We define the component VaR as:
$$
\mbox{cVaR}_i = \Delta\mbox{VaR}_i d_i = \Delta\mbox{VaR}_i w_i C = \beta_i \frac{ \mbox{VaR}_P w_i C } {C} = \beta_i w_i \mbox{VaR}_P
$$ utilising the observation that:
$$
\sigma_P = \frac{1}{\sigma_P}\sum_{i=1}^{N} w_i c_{iP} = \sum_{i=1}^{N} w_i \beta_i \sigma_P = \sigma_P \sum_{i=1}^{N} w_i \beta_i
$$ which implies $\sum_{i=1}^{N} w_i \beta_i = 1$ and very nicely leads us to:
$$
\sum_{i=1}^{N} \mbox{cVaR}_i = \mbox{VaR}_P \sum_{i=1}^{N} w_i \beta_i = \mbox{VaR}_P \ .
$$ Having that, we get the percent contribution of the $i$-th asset to the portfolio VaR as:
$$
\frac{ \mbox{cVaR}_i } { \mbox{VaR}_P } = w_i \beta_i
$$ It is possible to calculate the required changes in the position sizes based on our VaR analysis. We obtain them by re-evaluating the portfolio marginal VaRs so that they meet the following condition:
$$
\Delta\mbox{VaR}_i = \beta_i \frac{ \mbox{VaR}_P } { C } = \ \mbox{constant ,}
$$ which is the subject of the “Risk-Minimising Position” problem solved below. All right. The end of the lecture. Time for coding!
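Before the case study, a quick numerical sanity check of the identities above; a minimal sketch with an assumed toy 3-asset covariance matrix and exposures:

% verify dVaR_i = alpha*c_iP/sigma_P and sum(cVaR_i) = VaR_P (toy numbers)
alpha=1.65;
d=[1e5; 2e5; 3e5];                 % dollar exposures (assumed)
M2=[4.0 1.2 0.8; ...
    1.2 9.0 2.1; ...
    0.8 2.1 6.3]*1e-4;             % toy covariance matrix (assumed)
C=sum(d);
vol_P=sqrt(d'*M2*d);               % dollar portfolio volatility
VaR_P=alpha*vol_P;                 % diversified portfolio VaR
beta=C*(M2*d)/vol_P^2;             % asset betas vs the portfolio
DVaR=alpha*(M2*d)/vol_P;           % marginal VaR (= beta*VaR_P/C)
cVaR=DVaR.*d;                      % component VaR in dollars
fprintf('sum(cVaR) = %.2f  vs  VaR_P = %.2f\n',sum(cVaR),VaR_P);

The two printed numbers agree, which is exactly the additivity of the component VaR derived above.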

 

Case Study

Do You like sex? I do! So, let’s do it in Matlab this time! First, we randomly pick seven stocks among US equities (for the simplicity of this example); their tickers are given in the portfolio.lst text file as follows:

AAPL
DIS
IBM
JNJ
KO
NKE
TXN

and next we download their 3 years of historical data from Quandl.com as the data provider:

% Applied Portfolio VaR Decomposition. (1) Marginal and Component VaR.
%
% (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
 
clear all; close all; clc;
format short
 
% Read the list of the portfolio components
fileID = fopen('portfolio.lst');
tmp = textscan(fileID, '%s');
fclose(fileID);
pc=tmp{1};  % a list as a cell array
 
% fetch stock data for last 3 years since:
t0=735976; % 12-Jan-2015
date2=datestr(t0,'yyyy-mm-dd');        % from
date1=datestr(t0-3*365,'yyyy-mm-dd');  % to
 
% create an empty array for storing stock data
stockd={};
for i=1:length(pc) % scan the tickers and fetch the data from Quandl.com
    quandlc=['WIKI/',pc{i}];
    fprintf('%4.0f %s\n',i,quandlc);
    % fetch the data of the stock from Quandl
    % using the recommended Quandl command and
    % saving them directly into Matlab's FTS object (fts)
    fts=0;
    [fts,headers]=Quandl.get(quandlc,'type','fints', ...
                  'authcode','YourQuandlCode',...
                  'start_date',date1,'end_date',date2);
    stockd{i}=fts; % entire FTS object in an array's cell
end
 
% limit data to 3 years of trading days, select the adjusted
% close price time-series and calculate return-series
rs=[];
for i=1:length(pc)
    cp=fts2mat(stockd{i}.Adj_Close,0);
    rv=cp(2:end,1)./cp(1:end-1,1)-1;
    rs=[rs rv(end-(3*5*4*12):end)];
end
rs(isnan(rs))=0.0;

The covariance matrix, $\boldsymbol{M}_2$, computed based on 7 return-series is:

% covariance matrix
M2=cov(rs);

Let’s now assume some random dollar exposure, i.e. position size for each asset,

% exposure per position
d=[   55621.00; ...
     101017.00; ...
      23409.00; ...
    1320814.00; ...
     131145.00; ...
     321124.00; ...
    1046867.00]
 
% invested capital of C
C=sum(d)

returning our total exposure of $C=\$$2,999,997 in the stock market, which is based on the number of shares purchased and/or our risk-return appetite (in Part II of this series, we will see how that works in simulated trading).

We turn a long theory on the VaR decomposition into numbers just within a few lines of code, namely:

% the portfolio volatility
vol_p=sqrt(d'*M2*d);
 
% diversified portfolio VaR
VaR_P=1.65*vol_p;
 
% volatility of assets (based on daily returns)
v=sqrt(diag(cov(rs)));
 
% individual and undiversified portfolio VaR
VaRi=1.65*abs(v.*d);
uVaR_P=sum(VaRi);
 
% the portfolio betas
beta=C*(M2*d)/vol_p^2;
 
% the portfolio marginal VaR
DVaR=1.65*(M2*d)/vol_p;
 
% the component VaR [percent]
cVaR=100*DVaR.*d/VaR_P;

% the component VaR [dollars] (needed by the risk report below)
cVaRd=DVaR.*d;
 
% initial positions of assets in portfolio [percent]
orgpos=100*d/C

The last column vector provides us with information on the original (currently held) rate of exposure of our invested capital as a function of the asset number in the portfolio:

orgpos =
          1.85
          3.37
          0.78
         44.03
          4.37
         10.70
         34.90

Before we generate our very first risk report, let’s develop a short piece of code that will allow us to find a solution to the “Risk-Minimising Position” problem. Again, we want to find new position sizes (different from orgpos) at the lowest possible level of portfolio risk. We construct a dedicated function that does the job for us:

% Finding a solution for the "Risk-Minimising Position" problem
%   based on portfolio VaR
% (c) 2015 QuantAtRisk, by Pawel Lachowicz
 
function [best_x,best_dVaR,best_VaRp,best_volp]=rmp(M2,d,VaR_p)
    c=sum(d);
    N=length(d);
    best_VaRp=VaR_p;
    for i=1:100                     % an arbitrary number (recommended >50)
        k=0;
        while(k~=1)
            d=random('uniform',0,1,N,1);
            d=d/sum(d)*c;
            vol_p=sqrt(d'*M2*d);
            % diversified VaR (portfolio)
            VaRp=1.65*vol_p;
            dVaR=1.65*(M2*d)/vol_p;
            test=fix(dVaR*100);     % set precision here
            m=fix(mean(test));
            t=(test==m);
            k=sum(t)/N;
            if(VaRp<best_VaRp)
                best_x=d;
                best_dVaR=dVaR;
                best_VaRp=VaRp;
                best_volp=vol_p;
            end
        end
    end
end

We feed it with three input parameters, namely, the covariance matrix $\boldsymbol{M}_2$, the current dollar exposure per asset $\boldsymbol{d}$, and the current portfolio VaR. Inside, we randomly generate a new dollar-exposure column vector, then re-calculate the portfolio VaR and marginal VaRs until the
$$
\Delta\mbox{VaR}_i = \beta_i \frac{ \mbox{VaR}_P } { C } = \ \mbox{constant}
$$ condition is met. We need to be sure that the new portfolio VaR is lower than the original one and, in some sense, the lowest among a large number of random possibilities. That is why we repeat that step 100 times. At this point, as a technical remark, I would suggest using more than 50 iterations and experimenting with the precision required for the marginal VaRs (or $\beta$’s) to be (nearly) the same. The greater both of those golden numbers are, the longer you wait for the final solution.

This is also the apex of the day in a bank’s daily portfolio evaluation process that assesses the bank’s market risk and global exposure. A risk manager knows how long it takes to find new risk-minimised positions and how fast C/C++/GPU solutions need to be implemented to cut the time of those computations. Here, with our Matlab function rmp, we can feel the pain of this game.
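To get a feel for that computational cost, you can simply time the very call we are about to make (assuming M2, d and VaR_P from the listing above):

% time the search for risk-minimising positions
tic;
[nd,nDVaR,nVaR_P,nvol_P]=rmp(M2,d,VaR_P);
fprintf('rmp finished in %.1f seconds\n',toc);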

Eventually, we finalise our computations:

[nd,nDVaR,nVaR_P,nvol_P]=rmp(M2,d,VaR_P);
redVaR=100*(nVaR_P-VaR_P)/VaR_P;
newpos=100*nd/C;

followed by a nicely formatted report:

fprintf('\n============================================\n');
fprintf('  Portfolio Risk Report as of %s\n',datestr(t0));
fprintf('============================================\n\n');
fprintf('1-day VaR estimation at %1.0f%% confidence level\n\n',95.0)
fprintf('Number of assets:\t\t %1.0f \n',length(d))
fprintf('Current exposure:\t\t $%1s\n',bankformat(C));
fprintf('Portfolio VaR (undiversified):\t $%1s\n',bankformat(uVaR_P));
fprintf('Portfolio VaR (diversified):\t $%1s\n',bankformat(VaR_P));
fprintf('Component VaR [individual VaR]\n');
for i=1:length(d)
    fprintf('   %s\t%6.2f%%   $%-10s\t[$%-10s]\n',pc{i},cVaR(i), ...
            bankformat(cVaRd(i)),bankformat(VaRi(i)));
end
fprintf('\nRisk-Minimising Position scenario\n');
fprintf('--------------------------------------------------------------');
fprintf('\n\t\t   Original   Marginal     New        Marginal\n');
fprintf('\t\t   position   VaR          position   VaR\n');
for i=1:length(d)
    fprintf('\t  %-4s\t    %6.2f%%   %1.5f\t    %6.2f%%   %1.2f\n', ...
            pc{i},orgpos(i), ...
            DVaR(i),newpos(i),nDVaR(i))
end
fprintf('div VaR\t\t $%1s\t\t $%1s\n',bankformat(VaR_P), ...
        bankformat(nVaR_P));
fprintf('\t\t\t\t\t%10.2f%%\n',redVaR);
fprintf('ann Vol\t\t%10.2f%%\t\t%10.2f%%\n',  ...
        sqrt(252)*vol_p/1e6*1e2,sqrt(252)*nvol_P/1e6*1e2);
fprintf('--------------------------------------------------------------');
fprintf('\n\n');

where the bankformat function can be found here.

As a portfolio risk manager you are interested in numbers, numbers, and numbers that, in fact, reveal a lot. It is of paramount importance to analyse the portfolio risk report with attention before taking any further steps or decisions:

 
============================================
  Portfolio Risk Report as of 12-Jan-2015
============================================
 
1-day VaR estimation at 95% confidence level
 
Number of assets:		 7 
Current exposure:		 $2,999,997.00
Portfolio VaR (undiversified):	 $54,271.26
Portfolio VaR (diversified):	 $33,027.94
Component VaR [individual VaR]
   AAPL	  0.61%   $202.99    	[$1,564.29  ]
   DIS	  1.71%   $564.13    	[$1,885.84  ]
   IBM	  0.26%   $84.43     	[$429.01    ]
   JNJ	 30.99%   $10,235.65 	[$17,647.63 ]
   KO	  1.19%   $392.02    	[$2,044.05  ]
   NKE	  9.10%   $3,006.55  	[$7,402.58  ]
   TXN	 56.14%   $18,542.15 	[$23,297.83 ]
 
Risk-Minimising Position scenario
--------------------------------------------------------------
		   Original   Marginal     New        Marginal
		   position   VaR          position   VaR
	  AAPL	      1.85%   0.00365	      6.02%   0.01
	  DIS 	      3.37%   0.00558	      1.69%   0.01
	  IBM 	      0.78%   0.00361	     10.79%   0.01
	  JNJ 	     44.03%   0.00775	     37.47%   0.01
	  KO  	      4.37%   0.00299	     27.36%   0.01
	  NKE 	     10.70%   0.00936	      8.13%   0.01
	  TXN 	     34.90%   0.01771	      8.53%   0.01
div VaR		 $33,027.94		 $26,198.39
					    -20.68%
ann Vol		     31.78%		     25.21%
--------------------------------------------------------------

From the risk report we can spot the difference of $\$$21,243.32 between the undiversified and diversified portfolio VaR, and the largest contribution to the overall portfolio VaR coming from the Texas Instruments Incorporated stock (TXN), even though it is the Johnson & Johnson stock (JNJ) that occupies the largest dollar position in our portfolio.

The original marginal VaR is the largest for TXN, NKE, and JNJ. Therefore, in order to minimise the portfolio VaR, we should cut and/or reduce their positions. On the other hand, the exposure should be increased for KO, IBM, AAPL, and DIS, which display the lowest marginal VaR. Both suggestions find their solution in the form of the derived new holdings, which would reduce the portfolio VaR by 20.68% while concurrently decreasing the annualised portfolio volatility by 6.57%.

Indeed. Risk is like sex. The more the merrier! Unless you are a risk manager and you calculate your “positions”.

GARCH(1,1) Model in Python

In my previous article GARCH(p,q) Model and Exit Strategy for Intraday Algorithmic Traders we described the essentials of GARCH(p,q) model and provided an exemplary implementation in Matlab. In general, we apply GARCH model in order to estimate the volatility one time-step forward, where:
$$
\sigma_t^2 = \omega + \alpha r_{t-1}^2 + \beta \sigma_{t-1}^2
$$ based on the most recent update of $r$ and $\sigma$, where $r_{t-1} = \ln({P_{t-1}}/{P_{t-2}})$ and $P$ corresponds to the asset price. For any financial time-series, $\{r_j\}$, the estimation of the $(\omega,\alpha,\beta)$ parameters can be conducted using the maximum likelihood method. The latter is an iterative process of searching for the maximum of the log-likelihood sum:
$$
\sum_{i=3}^{N} \left[ -\ln(\sigma_i^2) - \frac{r_i^2}{\sigma_i^2} \right]
$$ where $N$ denotes the length of the return series $\{r_j\}$ ($j=2,…,N$) under study.

Let’s assume we have a test array of input data, $\{r_j\}$, stored in the Python variable r, and we write a function, GARCH11_logL, that will be used in the optimisation process. It takes two input parameters: param, a 3-element array with trial values corresponding to $(\omega,\alpha,\beta)$, and r, which denotes the return-series.

# GARCH(1,1) Model in Python
#   uses maximum likelihood method to estimate (omega,alpha,beta)
# (c) 2014 QuantAtRisk, by Pawel Lachowicz; tested with Python 3.5 only
 
import numpy as np
from scipy import optimize
import statistics as st
 
r = np.array([0.945532630498276,
              0.614772790142383,
              0.834417758890680,
              0.862344782601800,
              0.555858715401929,
              0.641058419842652,
              0.720118656981704,
              0.643948007732270,
              0.138790608092353,
              0.279264178231250,
              0.993836948076485,
              0.531967023876420,
              0.964455754192395,
              0.873171802181126,
              0.937828816793698])
 
def GARCH11_logL(param, r):
    omega, alpha, beta = param
    n = len(r)
    s = np.ones(n)*0.01
    s[2] = st.variance(r[0:3])
    for i in range(3, n):
        s[i] = omega + alpha*r[i-1]**2 + beta*(s[i-1])  # GARCH(1,1) model
    logL = -((-np.log(s) - r**2/s).sum())
    return logL

At this point it is important to note that in the logL line we multiply the sum by $-1$ in order to find the maximum of the expression with a minimiser. Why? Because we employ the optimize.fmin function from SciPy's optimize module, which minimises the objective. Therefore, we seek the best estimates as follows:

o = optimize.fmin(GARCH11_logL,np.array([.1,.1,.1]), args=(r,), full_output=1)

and we display the results,

R = np.abs(o[0])
print()
print("omega = %.6f\nbeta  = %.6f\nalpha = %.6f\n" % (R[0], R[2], R[1]))

which, for the array r as given in the code, returns the following results:

Optimization terminated successfully.
         Current function value: 14.705098
         Iterations: 88
         Function evaluations: 162
 
omega = 0.788244
beta  = 0.498230
alpha = 0.033886

Python vs. Matlab Solution

Programming requires caution. It is always a good practice to test the outcome of one algorithm against alternative solutions. Let’s run the GARCH(1,1) model estimation for the same input array and compare Python and Matlab results:

% GARCH(1,1) Model in Matlab 2013a
% (c) 2014 QuantAtRisk, by Pawel Lachowicz
 
clear all; close all; clc;
 
r=[0.945532630498276, ...
   0.614772790142383, ...
   0.834417758890680, ...
   0.862344782601800, ...
   0.555858715401929, ...
   0.641058419842652, ...
   0.720118656981704, ...
   0.643948007732270, ...
   0.138790608092353, ...
   0.279264178231250, ...
   0.993836948076485, ...
   0.531967023876420, ...
   0.964455754192395, ...
   0.873171802181126, ...
   0.937828816793698]';
 
% GARCH(p,q) parameter estimation
model = garch(1,1) % define model
[fit,VarCov,LogL,Par] = estimate(model,r);
% extract model parameters
parC=Par.X(1); % omega
parG=Par.X(2); % beta (GARCH)
parA=Par.X(3); % alpha (ARCH)
% estimate unconditional volatility
gamma=1-parA-parG;
VL=parC/gamma;
volL=sqrt(VL);
% redefine model with estimatated parameters
model=garch('Constant',parC,'GARCH',parG,'ARCH',parA)

which returns:

model = 
 
    GARCH(1,1) Conditional Variance Model:
    --------------------------------------  
    Distribution: Name = 'Gaussian'
               P: 1
               Q: 1
        Constant: NaN
           GARCH: {NaN} at Lags [1]
            ARCH: {NaN} at Lags [1]
 
____________________________________________________________
   Diagnostic Information
 
Number of variables: 3
 
Functions 
Objective:       @(X)OBJ.nLogLikeGaussian(X,V,E,Lags,1,maxPQ,T,nan,trapValue)
Gradient:        finite-differencing
Hessian:         finite-differencing (or Quasi-Newton)
 
Constraints
Nonlinear constraints:                do not exist
 
Number of linear inequality constraints:    1
Number of linear equality constraints:      0
Number of lower bound constraints:          3
Number of upper bound constraints:          3
 
Algorithm selected
   sequential quadratic programming
 
 
____________________________________________________________
   End diagnostic information
                                                          Norm of First-order
 Iter F-count            f(x) Feasibility  Steplength        step  optimality
    0       4    1.748188e+01   0.000e+00                           5.758e+01
    1      27    1.723863e+01   0.000e+00   1.140e-03   6.565e-02   1.477e+01
    2      31    1.688626e+01   0.000e+00   1.000e+00   9.996e-01   1.510e+00
    3      35    1.688234e+01   0.000e+00   1.000e+00   4.099e-02   1.402e+00
    4      39    1.686305e+01   0.000e+00   1.000e+00   1.440e-01   8.889e-01
    5      44    1.685246e+01   0.000e+00   7.000e-01   2.379e-01   5.088e-01
    6      48    1.684889e+01   0.000e+00   1.000e+00   9.620e-02   1.379e-01
    7      52    1.684835e+01   0.000e+00   1.000e+00   2.651e-02   2.257e-02
    8      56    1.684832e+01   0.000e+00   1.000e+00   8.389e-03   7.046e-02
    9      60    1.684831e+01   0.000e+00   1.000e+00   1.953e-03   7.457e-02
   10      64    1.684825e+01   0.000e+00   1.000e+00   7.888e-03   7.738e-02
   11      68    1.684794e+01   0.000e+00   1.000e+00   3.692e-02   7.324e-02
   12      72    1.684765e+01   0.000e+00   1.000e+00   1.615e-01   5.862e-02
   13      76    1.684745e+01   0.000e+00   1.000e+00   7.609e-02   8.429e-03
   14      80    1.684740e+01   0.000e+00   1.000e+00   2.368e-02   4.072e-03
   15      84    1.684739e+01   0.000e+00   1.000e+00   1.103e-02   3.142e-03
   16      88    1.684739e+01   0.000e+00   1.000e+00   1.183e-03   2.716e-04
   17      92    1.684739e+01   0.000e+00   1.000e+00   9.913e-05   1.378e-04
   18      96    1.684739e+01   0.000e+00   1.000e+00   6.270e-05   2.146e-06
   19      97    1.684739e+01   0.000e+00   7.000e-01   4.327e-07   2.146e-06
 
Local minimum possible. Constraints satisfied.
 
fmincon stopped because the size of the current step is less than
the default value of the step size tolerance and constraints are 
satisfied to within the selected value of the constraint tolerance.
 
<stopping criteria details>
 
 
    GARCH(1,1) Conditional Variance Model:
    ----------------------------------------
    Conditional Probability Distribution: Gaussian
 
                                  Standard          t     
     Parameter       Value          Error       Statistic 
    -----------   -----------   ------------   -----------
     Constant       0.278061       26.3774      0.0105417
     GARCH{1}       0.457286       49.4915      0.0092397
      ARCH{1}      0.0328433       1.65576      0.0198358
 
model = 
 
    GARCH(1,1) Conditional Variance Model:
    --------------------------------------  
    Distribution: Name = 'Gaussian'
               P: 1
               Q: 1
        Constant: 0.278061
           GARCH: {0.457286} at Lags [1]
            ARCH: {0.0328433} at Lags [1]

id est
$$
(\omega,\beta,\alpha)_{\rm Matlab} = (0.278061,0.457286,0.0328433) \ .
$$This differs slightly from the Python solution, which was
$$
(\omega,\beta,\alpha)_{\rm Py} =(0.788244,0.498230,0.033886) \ .
$$At this stage it is difficult to assess which solution is “better”. The two algorithms and the applied methodologies simply differ, which is usually the case. Having that in mind, further extensive tests are required, for example, of the dependence of the Python solution on the trial input $(\omega,\alpha,\beta)$ values passed to optimize.fmin in the Python code.
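Whichever estimates you trust more, their application is the same: the fitted triple feeds the recursion from the top of this section to forecast the next-step variance. A minimal Matlab sketch, assuming parC, parA, parG and the return vector r from the listing above, with the sample variance used as a crude seed for $\sigma_t^2$:

% one-step-ahead volatility forecast from the fitted GARCH(1,1) parameters
s_t=var(r);                          % crude proxy for sigma_t^2 (assumed seed)
s_fwd=parC+parA*r(end)^2+parG*s_t;   % sigma_{t+1}^2 = omega + alpha*r_t^2 + beta*sigma_t^2
vol_fwd=sqrt(s_fwd)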

Gap-on-Open Profitable Trading Strategy


After a longer while, QuantAtRisk is back to business. As an algo trader I have always been tempted to test a gap-on-open trading strategy. There were various reasons behind it, but the most popular one was always the omni-discussed one: good/bad news on the stock. And what? The stock price skyrocketed/dropped on the following days. When we approach such price patterns, we talk about triggers or triggered events. The core of the algorithm’s activity is the trigger identification and taking the proper action: to go long or short. That’s it. In both cases we want to make money.

In this post we will design the initial conditions for our gap-on-open trading strategy acting as the triggers, and we will backtest a realistic scenario of betting our money on those stocks that opened higher on the next trading day. Our goal is to find the optimal holding period for such trades closed with a profit.

Portfolio

Our strategy can be backtested using any $N$-asset portfolio. Here, for simplicity, let us use a random subset of 10 stocks (portfolio.lst) that are part of the current Dow Jones index:

AXP   CSCO   DIS   IBM   JNJ   KO   NKE   PG   UTX   XOM

In Matlab, we fetch the stock prices from Google Finance data provider accessible via Quandl.com’s Matlab API (see this post for its setup in Matlab). We commence writing our main backtesting code as follows:

% Gap on Open Trading Strategy
%  Fetching stock prices via Quandl and Strategy Backtesting
%
% (c) 2014 by Pawel Lachowicz, QuantAtRisk.com
 
 
clear all; close all; clc;
 
fname=['portfolio.lst'];
 
% Model's parameter #1 (years)
parm1=1;  
ndays=parm1*365;
lday=datenum('2014-08-05');
% fetching stock data
[Top,Thp,Tlp,Tcp,N,ntdays]=FetchQuandl(fname,ndays,lday);

where we use a pre-designed function of FetchQuandl to import 4 separate price-series of each stock’s open (Top), high (Thp), low (Tlp), and close (Tcp) daily prices:

function [Top,Thp,Tlp,Tcp,N,ntdays]=FetchQuandl(fname,ndays,lday)
    % Read the list of Dow Jones components
    fileID = fopen(fname);
    tmp = textscan(fileID,'%s');
    fclose(fileID);
    components=tmp{1};  % a list as a cell array
 
    % Read in the list of tickers and internal codes from Quandl.com
    [~,text,~] = xlsread('QuandlStockCodeListUS.xlsx');
    quandlc=text(:,1);    % again, as a list in a cell array
    quandlcode=text(:,3); % corresponding Quandl's Price Code
 
    % fetch stock data for last ‘ndays’
    date2=datestr(lday,'yyyy-mm-dd');       % from
    date1=datestr(lday-ndays,'yyyy-mm-dd'); % to
 
    Rop={}; Tcp={};
    % scan all tickers and fetch the data from Quandl.com
    for i=1:length(components)
        for j=1:length(quandlc)
            if(strcmp(components{i},quandlc{j}))
                fprintf('%4.0f %s\n',i,quandlc{j});
                fts=0;
                [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ...
                              'authcode','PutHereYourQuandlCode',...
                              'start_date',date1,'end_date',date2);
                cp=fts2mat(fts.Close,1); Tcp{i}=cp;     % close price-series
                op=fts2mat(fts.Open,1);  Top{i}=op;     % open price-series
                hp=fts2mat(fts.High,1);  Thp{i}=hp;     % high price
                lp=fts2mat(fts.Low,1);   Tlp{i}=lp;     % low price
                %Rcp{i}=cp(2:end,2)./cp(1:end-1,2)-1;   % return-series cp
            end
        end
    end
    N=length(components);
    ntdays=length(Tcp{1});
end

Please note that parm1 specifies the number of years, i.e. how far our backtest should extend backward in time (converted into a number of calendar days in ndays), counting from the last day given by lday.

Trading Model

First, let us design the trading strategy. We scan concurrently four price-series for each stock separately. We define the strategy’s trigger as follows:
if the stock’s open price on day $t$ is higher than its close price on day $t-1$, and the stock’s lowest price on day $t$ is higher than its highest price on day $t-1$, we make a BUY LONG decision! We buy that stock on the next trading day at its market price (close price). This approach should effectively remove the slippage bias (see more on slippage in stock trading here).

Now, we run the backtest on each stock and each open trade. We select the second parameter (parm2) to be the number of days we hold the stock. In the following piece of code, we allow the stock to be sold after between 1 and 21 calendar days ($\pm$ weekend or public-holiday periods):

% pre-defined matrix for backtest final results
results=[];
 
for parm2=0:20
 
    cR=[];
    for i=1:N
        % just for a purpose of plotting of price-series
        if(i==1)
            % open (blue color)
            plot(Top{i}(:,1),Top{i}(:,2),'')
            hold on
            % close (red color)
            plot(Tcp{i}(:,1),Tcp{i}(:,2),'r')
            hold on
            % high (green color)
            plot(Thp{i}(:,1),Thp{i}(:,2),'g')
            %
            xlabel('Days');
            ylabel('AXP Stock Prices [US$]');
        end
 
        Tbuy=[];
        for t=2:ntdays
            % define indicators 
            ind1=Tcp{i}(t-1,2);  % cp on (t-1)day
            ind2=Thp{i}(t-1,2);  % hp on (t-1)day
            ind3=Top{i}(t,2);    % op on (t)day
            ind4=Tlp{i}(t,2);    % lp on (t)day
            % detect trigger
            if(ind1<ind3)&&(ind2<ind4)
                % plotting only for AXP
                if(i==1)
                    hold on;
                    plot(Top{i}(t,1),Top{i}(t,2),'o');
                end
                % date of a trigger
                tday=Top{i}(t,1);
                nextbusdate=busdate(tday,1); % find next trading date
                Tbuy=[Tbuy; nextbusdate];
            end
        end
        Tsell=busdate(Tbuy+parm2,1);

Here, the Tbuy and Tsell arrays store the actual dates of the trade entries and exits. Now, we will use them to look up the prices at each trade’s open and close and derive the profit and loss for each stock:

        R=[];
        for k=1:length(Tbuy)
            j=find(Tbuy(k)==Tcp{i}(:,1));
            pbuy=Tcp{i}(j,2);
            j=find(Tsell(k)==Tcp{i}(:,1));
            psell=Tcp{i}(j,2);
            ret=(psell/pbuy-1); % return per trade
            R=[R; ret];
        end
 
        compR=prod(R+1)-1;  % compound return per stock
        cR=[cR; compR];
 
    end
 
    results=[results cR];
 
end

In the inner loop over the portfolio (index $i$, here running over the 10 stocks) we capture all trades per stock and calculate a multi-period compound return (compR) as if we were trading that stock solely using our model.

For instance, for stock $i=1$ (AXP) from our portfolio, our code displays the 1-year price-series:
(figure: AXP 1-year open/close/high price-series; days meeting our trigger criteria are denoted by open-circle markers) If you now re-run the backtest, substituting the inner stock loop so that it reads:

    for i=1:1

and then run a few extra plotting lines:

figure(2)
stem((0:20),100*results)
xlabel('Holding Period [days]');
ylabel('AXP: Compound Return [%]');

we obtain an appealing result:
(figure: AXP compound return [%] vs holding period [days])
The chart reveals that for AXP, over the past 251 days (going backwards from Aug 4, 2014), we had 16 triggers and therefore 16 trades and, surprisingly, regardless of the holding period, the compound return from all closed trades was highly positive (profitable).

This is not the case if we consider, for example, $i=4$, IBM stock:
(figure: IBM compound return [%] vs holding period [days]) This result indicates that for different holding periods (and different stocks, of course) certain extra trading indicators should be applied to limit the losses (e.g. profit targets).

If we traded the whole portfolio using our gap-on-open model, we would end up with a very encouraging result:
(figure: compound return [%] averaged over the 10 stocks vs holding period [days]) where for each holding period we display the compound return averaged over the 10 stocks. Taking into account the global up-trend in the US stock markets between August 2013 and August 2014, this strategy is worth considering with further modifications (e.g. short or both long and short triggers, FX time-series, etc.).
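The portfolio-averaged chart can be reproduced directly from the results matrix built in the backtest loop (rows corresponding to stocks, columns to holding periods); a minimal sketch:

% average compound return across the 10 stocks for each holding period
figure(3)
stem((0:20),100*mean(results))   % column means = average over stocks
xlabel('Holding Period [days]');
ylabel('Portfolio-Averaged Compound Return [%]');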

Someone wise once said: Sometimes you win. Sometimes you learn. In algo trading we all learn to win.

In the next post…
Marginal Value-at-Risk for Portfolio Managers


 

Visualisation of N-Asset Portfolio in Matlab

Many of you who wished to use Matlab’s Financial Toolbox for portfolio visualisation most probably realised that something was wrong: the annualised risk and return for different time-series did not always match what was expected. Below you can find an improved piece of code which does the job correctly.

The input sampling of your financial time-series can now be specified directly as daily, weekly, monthly, or quarterly, and the risk-return diagram will be displayed with annualised results. Otherwise, instead of the annualised values, you have the option to specify another time horizon. All smooth and easy.
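The correction boils down to applying a consistent annualisation convention to both axes. For data sampled with periods observations per year (252 for daily, 52 for weekly, 12 for monthly, 4 for quarterly), each risk-return point is mapped as:
$$
\sigma_{\rm ann} = \sigma\sqrt{\rm periods}, \ \ \ \mu_{\rm ann} = (1+\mu)^{\rm periods} - 1 \ ,
$$
which is exactly what the plot and scatter calls inside the function below do with plotrsk and plotret.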

% Specialised plotting function for visualisation of portfolios
% based on the Financial Toolbox's portfolio objects
% 
% Modified by Pawel Lachowicz, 2014/03/01
%
% Remarks: Essential corrections introduced in the calculation and plotting
% of the annualised risk and returns coded incorrectly by the team at 
% The MathWorks, Inc.
 
function plotportfolio(varargin)
 
    plottitle = varargin{1};  % plot title, e.g. 'Efficient Frontier'
    plotfreq =  varargin{2};  % frequency of raw data (dt; or 'daily', etc.)
    plotfreq2=  varargin{3};  % e.g. ’annualised’
    plotfreq3=  varargin{4};
    plotlegend = [];
 
    switch plotfreq2
        case 'annualised'
            switch plotfreq
                case 'daily'
                    periods=252;
                case 'weekly'
                    periods=52;
                case 'monthly'
                    periods=12;
                case 'quarterly'
                    periods=4;
            end
        otherwise
            periods=plotfreq3;
    end
 
    for i = 5:nargin
        plotinfo = varargin{i};
 
        plottype = plotinfo{1};
        plotrsk  = plotinfo{2};
        plotret  = plotinfo{3};
        if numel(plotinfo) > 3
            plotlabel = plotinfo{4};
        else
            plotlabel = [];
        end
        if numel(plotinfo) > 4
            plotstyle = plotinfo{5};
            if isempty(plotstyle)
                plotstyle = 'b';
            end
        else
            if strcmpi(plottype,'line')
                plotstyle = 'b';
            else
                plotstyle = 'g';
            end
        end
        if numel(plotinfo) > 5
            plotline = plotinfo{6};
            if isempty(plotline)
                plotline = 1;
            end
        else
            if strcmpi(plottype,'line')
                plotline = 2;
            else
                plotline = [];
            end
        end
 
        % line plot
        if strcmpi(plottype,'line')
            hold on;
            for k = 1:size(plotrsk,2)
                plot(plotrsk(:,k)*sqrt(periods),((plotret(:,k)+1).^periods-1), ...
                     plotstyle, 'LineWidth', plotline);
                if i == 2 && k == 1
                    hold on
                end
                if ~isempty(plotlabel) && ~isempty(plotlabel{k})
                    plotlegend = [ plotlegend, {plotlabel{k}} ];
                end
            end
 
        % scatter plot
        else
            if any(plotstyle == '.')
                % use the full plotret vector here; k is not defined in this branch
                scatter(plotrsk*sqrt(periods),((plotret+1).^periods-1),'or','Filled');
            else
                scatter(plotrsk*sqrt(periods),((plotret+1).^periods-1),'og','Filled');
            end
            if i==2
                hold on
            end
            if ~isempty(plotlabel)
                for k = 1:numel(plotrsk)
                    if ~isempty(plotlabel{k})
                        text(plotrsk(k)*sqrt(periods)+0.005, ...
                            ((plotret(k)+1).^periods-1), plotlabel{k},'FontSize', 9);
                    end
                end
            end
        end
    end
 
    if ~isempty(plotlegend)
        legend(plotlegend,'Location','SouthEast');
    end
    if strcmpi(plotfreq2,'annualised')
        xlabel('Annualised Volatility');
        ylabel('Annualised Expected Returns');
    else
        xlabel('Volatility');
        ylabel('Returns');
    end
    grid on
 
end

Now, you can call it with various parameters of your interest, matching the data directly, for instance:

plotportfolio('Portfolio',dt,'annualised',[], ... 
    {'line', prsk, pret}, ...
    {'scatter',CashRsk,CashRet, {'Cash'}}, ...
    {'scatter', sqrt(diag(p.AssetCovar)), p.AssetMean, p.AssetList,'.r'});

For some awesome practical application of this function in the process of construction and analysis of $N$-asset portfolio grab Applied Portfolio Optimization with Risk Management using Matlab ebook today.

Extreme VaR for Portfolio Managers


Regardless of who you are, either an algo trader hidden in the wilderness of Alaska or an active portfolio manager in a large financial institution in São Paulo, your investments are subject to sudden and undesired movements in the markets. There is one fear. We know it very well. It’s a low-probability, high-impact event. An extreme loss. An avalanche hitting your account balance hard. Really hard.

Financial risk management has developed a set of quantitative tools helping to raise awareness of the worst-case scenarios. The two most popular are Value-at-Risk (VaR) and Expected Shortfall (ES). Both numbers provide you with the amount of stress you put on your shoulders when the worst indeed takes place.

Quoting the $VaR_{0.05}^{\rm 1-day}$ value, we assume that in 95% of cases we should feel confident that the maximal loss for our portfolio, $C\times VaR_{0.05}^{\rm 1-day}$, will not be exceeded in trading on the next day (1-day VaR), where $C$ is the current portfolio value (e.g. $\$$35,292.13). However, if the unforeseen happens, our best guess is that we will lose $C\times ES_{0.05}^{\rm 1-day}$ dollars, where $ES_{0.05}^{\rm 1-day}$ stands for the 1-day expected shortfall estimate. Usually $ES_{0.05}^{\rm 1-day} > VaR_{0.05}^{\rm 1-day}$. We illustrated this case previously in the VaR and Expected Shortfall vs. Black Swan article using IBM stock trading data. We also discussed a superbly rare case of extreme loss in the markets on Oct 19, 1987 (-22%), far beyond any measure.

In this post we will underline an additional quantitative risk measure, quite often overlooked by portfolio managers: the extreme value-at-risk (EVaR). Using Matlab and an exemplary portfolio of stocks, we’re gonna confront VaR, ES, and EVaR for a better understanding of the amount of risk involved in the game. Getting hit hard is inevitable under drastic market swings, so you need to know the highest pain threshold: EVaR.

Generalised Extreme-Value Theory and Extreme VaR

The extreme-value theory (EVT) plays an important role in today’s financial world. It provides a tailor-made solution for those who wish to deal with highly unexpected events. In the classical VaR approach, the risk manager is concentrated on and concerned with central-tendency statistics. We are surrounded by them every day: mean values, standard deviations, $\chi^2$ distributions, etc. They are blood under the skin. The point is that they are governed by central limit theorems which, on a further note, cannot be applied to extremes. That is where EVT enters the game.

From the statistical point of view, the problem is simple. We assume that $X$ is a random variable of the loss and that the $X$’s are independent and identically distributed (iid); by that we also adopt a general framework in which $X$ comes from an unknown distribution $F(x)=Pr(X\le x)$. EVT makes use of the fundamental Fisher-Tippett theorem and delivers a practical solution for a large sample of $n$ extreme random loss variables, known as the generalised extreme-value (GEV) distribution, defined as:
$$
H(z;a,b) = \left\{
\begin{array}{lr}
\exp\left[-\left( 1+ z\frac{x-b}{a} \right)^{-1/z} \right] & : z \ne 0\\
\exp\left[-\exp\left(-\frac{x-b}{a} \right) \right] & : z = 0 \\
\end{array}
\right.
$$ Under the condition that $x$ meets $1+z(x-b)/a > 0$, the parameters $a$ and $b$ in the GEV distribution ought to be understood as the scale and location parameters of the limiting distribution $H$. Their meaning is close to, but distinct from, the standard deviation and mean. The last parameter, $z$, corresponds to a tail index. That tail is important in $H$. Why? Simply because it points our attention to the heaviness of extreme losses in the data sample.

In our previous article on Black Swan and Extreme Loss Modeling we applied $H$ with $z=0$ to fit the data. This case of GEV is known as the Gumbel distribution, corresponding to $F(x)$ having exponential (light) tails. If the underlying data fitted with the $H$ model reveal $z>0$, we talk about the Fréchet distribution. From a practical point of view, the Fréchet case is more useful as it takes heavy tails into account more carefully. By plotting the standardised Gumbel and Fréchet distributions ($a=1$, $b=0$), you can notice that the gross mass of $H$ is contained between $x$ values of -2 and 6, which translates into the $b-2a$ to $b+6a$ interval. With $z>0$, the tail contributes more significantly to the production of very large $X$-values. The remaining case, $z<0$, the Weibull distribution, we leave out of our interest here (the case of tails lighter than normal).
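To visualise that remark, here is a minimal Matlab sketch plotting the standardised Gumbel case next to a Fréchet-type GEV pdf with gevpdf; the shape value of 0.3 is an arbitrary choice for illustration:

% standardised Gumbel (z=0) vs a Frechet-type (z>0) GEV density, a=1, b=0
x=-4:0.01:10;
pdf_gumbel=gevpdf(x,0,1,0);      % z = 0 (Gumbel)
pdf_frechet=gevpdf(x,0.3,1,0);   % z > 0 (Frechet-type, heavier right tail)
figure;
plot(x,pdf_gumbel,'b',x,pdf_frechet,'r');
legend('Gumbel (z=0)','Frechet (z=0.3)');
xlabel('x'); ylabel('GEV pdf');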

If we set the left-hand side of the GEV distribution to $p$, take logs of both sides of the above equations and rearrange them accordingly, we end up with:
$$
\ln(p) = \left\{
\begin{array}{lr}
-\left(1+zk \right)^{-1/z} & : z \ne 0\\
-\exp(-k) & : z = 0 \\
\end{array}
\right.
$$ where $k=(x-b)/a$ and $p$ denotes any chosen cumulative probability. From this point we can derive $x$ values in order to obtain quantiles associated with $p$, namely:
$$
x = \left\{
\begin{array}{ll}
b - \frac{a}{z} \left[1-(-\ln(p))^{-z}\right] & : ({\rm Fréchet}; z>0) \\
b - a\ln\left[-\ln(p)\right] & : ({\rm Gumbel}; z=0) \\
\end{array}
\right.
$$ Note that when $z\rightarrow 0$ the Fréchet quantiles tend to their Gumbel equivalents. Also, please notice that, for instance, for the standardised Gumbel distribution the 95% quantile is $-\ln[-\ln(0.95)] = 2.9702$ and points at its location in the right tail, not in the left tail usually associated with the heaviest losses. Don’t worry. Gandhi says relax. In practice we either fit the GEV distribution to the losses with reversed sign or we reverse the sign of the 0.95 quantile of the random loss variable $x$. That point will become clearer with the examples that follow.
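A one-line check in Matlab of the standardised Gumbel 95% quantile quoted above:

% 95% quantile of the standardised Gumbel distribution (a=1, b=0)
q95=-log(-log(0.95))      % 2.9702
q95=gevinv(0.95,0,1,0)    % the same via the Statistics Toolbox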

We allow ourselves to define the Extreme Value-at-Risk (EVaR), associated directly with the GEV distribution $H_n$ fitted to the data ($n$ data points) and satisfying
$$
Pr[H_n < H'] = p = {Pr[X < H']}^n = \alpha^n
$$
where $\alpha$ is the VaR confidence level associated with the threshold $H'$, in the following way:
$$
\mbox{EVaR} = \left\{
\begin{array}{ll}
b_n - \frac{a_n}{z_n} \left[1-(-n\ln(\alpha))^{-nz_n}\right] & : ({\rm Fréchet}; z>0) \\
b_n - a_n\ln\left[-n\ln(\alpha)\right] & : ({\rm Gumbel}; z=0) \\
\end{array}
\right.
$$ These formulas provide a practical way of estimating VaR based on the analysis of the distribution and frequency of extreme losses in the data sample, and they are meant to work best for very high confidence levels. Additionally, as pointed out by Dowd and Hauksson, the EVaR does not obey the square-root-of-holding-period rule which applies to a standard VaR derived for (log-)normal return distributions.

1-day EVaR for Assets in your Portfolio

All right. Time to connect the dots. Imagine you hold a portfolio of 30 blue chips that are part of the Dow Jones Industrial index. Before the market opens, regardless of the quantity and direction of the positions you hold in each of those 30 stocks, you, as a cautious portfolio manager, want to know the estimate of the extreme loss that could take place on the next trading day.

You look at the last 252 trading days of every stock; therefore you are able to extract a sample of $N=30$ largest losses (one per stock over the last 365 calendar days). Given this sample, we could fit it with the GEV distribution and find the best estimates for the $z, a$ and $b$ parameters. However, in practice, $N=30$ is a small sample. There is nothing wrong with extracting, for each stock, the 5 worst losses that took place within the last 252 trading days. That allows us to work with an $n=150$ sample (see the code below).

Let’s use Matlab and the approach described in Create a Portfolio of Stocks based on Google Finance Data fed by Quandl post, here re-quoted in a shorter form:

% Extreme Value-at-Risk
%
% (c) 2014 QuantAtRisk.com, by Pawel Lachowicz
 
 
close all; clear all; clc;
 
% Read the list of Dow Jones components
fileID = fopen('dowjones.lst');
tmp = textscan(fileID, '%s');
fclose(fileID);
dowjc=tmp{1};  % a list as a cell array
 
% Read in the list of tickers and internal code from Quandl.com
[ndata, text, alldata] = xlsread('QuandlStockCodeListUS.xlsx');
quandlc=text(:,1); % again, as a list in a cell array
quandlcode=text(:,3) % corresponding Quandl's Price Code
 
% fetch stock data for last 365 days
date2=datestr(today,'yyyy-mm-dd')     % from
date1=datestr(today-365,'yyyy-mm-dd') % to
 
stockd={};
% scan the Dow Jones tickers and fetch the data from Quandl.com
for i=1:length(dowjc)
    for j=1:length(quandlc)
        if(strcmp(dowjc{i},quandlc{j}))
            fprintf('%4.0f %s\n',i,quandlc{j});
            fts=0;
            [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ...
                          'authcode','YourQUANDLtocken',...
                          'start_date',date1,'end_date',date2);
            cp=fts2mat(fts.Close,0); % close price
            R{i}=cp(2:end)./cp(1:end-1)-1; % return-series
        end
    end
end
N=length(dowjc);

where we make use of the external files dowjones.lst (tickers of stocks in our portfolio) and QuandlStockCodeListUS.xlsx (see here for more details on utilising Quandl.com‘s API for fetching Google Finance stock data).

The Matlab cell array R holds 30 return-series (each 251 points long). Having that, we find a set of the top 5 maximal daily losses for each stock, ending up with a data sample of $n=150$ points as mentioned earlier.

Rmin=[];
for i=1:N
    ret=sort(R{i});
    Mn=ret(1:5); % extract 5 worst losses per stock
    Rmin=[Rmin; Mn];
end

Now, we fit the GEV distribution,
$$
H(z;a,b) = \exp\left[-\left( 1+ z\frac{x-b}{a} \right)^{-1/z} \right] \ \ ,
$$ employing a ready-to-use function, gevfit, from Matlab’s Statistics Toolbox:

% fit GEV distribution
[parmhat,parmci] = gevfit(Rmin)
z=parmhat(1); % tail index
a=parmhat(2); % scale parameter
b=parmhat(3); % location parameter

which returns for date2 = 2014-03-01,

parmhat =
   -0.8378    0.0130   -0.0326
parmci =
   -0.9357    0.0112   -0.0348
   -0.7398    0.0151   -0.0304

id est, the best estimates of the model’s parameters:
$$
z_{150} = -0.8378^{+0.0980}_{-0.0979}, \ \ a_{150} = 0.0130^{+0.0021}_{-0.0018}, \ \ b_{150} = -0.0326^{+0.0022}_{-0.0022} \ \ .
$$ At this point, it is essential to note that negative $z$ indicates Fréchet distribution since we fitted data with negative signs.

Plotting data with model,

% plot data with the GEV distribution for found z,a,b
x=-1:0.001:1;
hist(Rmin,-1:0.01:0);
h=findobj(gca,'Type','patch');
set(h,'FaceColor',[0.7 0.7 0.7],'EdgeColor',[0.6 0.6 0.6]);
h=findobj(gca,'Type','box');
set(h,'Color','k');
% GEV pdf 
pdf1=gevpdf(x,z,a,b);
y=0.01*length(Rmin)*pdf1;
line(x,y,'color','r'); box on;
xlabel('R_{min}');
ylabel(['f(R_{min}|z,a,b)']); 
xlim([-0.15 0]);

illustrates our efforts in a clear manner:

(figure: histogram of the $n=150$ worst daily losses $R_{\rm min}$ with the fitted GEV pdf $f(R_{\rm min}|z,a,b)$ overplotted in red)

Eventually, performing the calculation,

n=length(Rmin);  % sample size (here 150)
z=-z;
EVaR=b-(a/z)*(1-(-n*log(0.95))^(-n*z))

we calculate the best estimation of 1-day EVaR as discussed:
$$
\mbox{EVaR} =
b_{150} - \frac{a_{150}}{-z_{150}} \left[1-(-n\ln(0.95))^{nz_{150}}\right] = -0.0481 \ \ .
$$ This number tells us that among 30 bluechips in our portfolio we should expect extreme loss of 4.81% on the following day (estimated using last 252 days of trading history).

Now, is 4.81% a lot or not? That reminds me of an immortal question asked many years ago by one of my colleagues, a guy who loved studying philosophy: Is 4 kilograms of potatoes a lot or not? The best way to address it is by comparison. Let us again quickly look at the return-series of the 30 stocks in our portfolio (the R cell array in the code). For each individual series we derive its normalised distribution (histogram), fit it with a normal probability density function, and calculate the corresponding 95% VaR (the var local variable) in the following way:

%--figure;
VaR=[];
for i=1:N
    rs=R{i};
    % create the histogram for return series
    [H,h]=hist(rs,100);
    sum=0;
    for i=1:length(H)-1
        sum=sum+(H(i)*(h(i+1)-h(i)));
    end
    H=H/sum;
    %--bar(h,H,'FaceColor',[0.7 0.7 0.7],'EdgeColor',[0.7 0.7 0.7])
    % fit the histogram with a normal distribution
    [muhat,sigmahat] = normfit([rs]');
    Y=normpdf(h,muhat(1),sigmahat(1));
    %--hold on; plot(h,Y,'b-'); % and plot the fit using a blue line
    % find VaR for stock 'i'
    sum=0;
    i=1;
    while (i<length(h))
        sum=sum+(Y(i)*(h(i+1)-h(i)));
        if(sum>=0.05) 
            break;
        end
        i=i+1;
    end
    var=h(i)
    %--hold on; plot(h(i),0,'ro','markerfacecolor','r'); % mark VaR_fit
    VaR=[VaR; var];
end

The vector VaR then holds all thirty 95% VaR estimates. By running the last piece of code,

figure;
[H,h]=hist(VaR*100,20);
line=[EVaR*100 0; EVaR*100 6];
bar(h,H,'FaceColor',[0.7 0.7 0.7],'EdgeColor',[0.7 0.7 0.7])
hold on; plot(line(:,1),line(:,2),'r:');
xlim([-6 0]); ylim([0 6]);
xlabel('95% VaR [percent]'); ylabel('Histogram');

we plot their distribution and compare them to the EVaR of -4.81%, here marked by the red dotted line in the chart:

(figure: histogram of the thirty 95% VaR values [percent], with the 1-day EVaR of -4.81% marked by a red dotted line)

The calculated 1-day 95% EVaR of -4.81% sits much further out than all the classical 95% VaRs derived for the 30 stocks. It’s an extreme case indeed.

You can use the 1-day EVaR to calculate the expected loss in your portfolio simply by assuming that every single stock will drop by 4.81% on the next day. This number (in the currency of your portfolio) will surely make you more alert.
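As a minimal sketch of that translation (the vector of dollar positions is an assumed example; EVaR comes from the calculation above):

% portfolio-level extreme loss implied by the 1-day EVaR
pos=1e4*ones(30,1);                    % e.g. $10,000 long in each stock (assumed)
extremeLoss=sum(abs(pos))*abs(EVaR);   % every stock assumed to drop by |EVaR|
fprintf('Expected 1-day extreme loss: $%.2f\n',extremeLoss);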

Richard Branson once said: If you gonna fail, fail well. Don’t listen to him.

Accelerated Matlab for Quants. (4) Values, Vectors, and Value-at-Risk


 

Accelerated Matlab for Quants. (3) Advanced Arithmetic.

In the 3rd part of this tutorial on Accelerated Matlab for Quants, we work through new patterns of using numbers in Matlab. Basic school arithmetic remains the same, but we dare to mix it with thinking about how to direct the flow of logical coding to get what we want. A first flavour of random numbers is not omitted.

If you are starting your Matlab adventure with QuantAtRisk, listen to your intuition. Download the code from this lesson (lesson03.m) and dare to spend time amending it as much as you want. If you don’t understand something, drop me an email (pawel at quantatrisk .com). Otherwise, dare to make mistakes and study Matlab’s manual. I can’t do your push-ups, but my wish is to inspire you to explore Matlab on your own! Just remember, every guru was once a beginner! Enjoy!!


 

Accelerated Matlab for Quants. (2) Your First Matlab Script.

The 2nd part of my tutorial on Accelerated Matlab for Quants is now available on QuantAtRisk YouTube channel. It covers the introductory topics on:
   – opening and editing the Matlab script
   – clearing all variables from the memory
   – displaying and suppressing values of variables
   – running the script


 

Pre-Processing of Asset Price Series for Portfolio Optimization

Portfolio Optimization is a significant component of Matlab’s Financial Toolbox. It provides us with a ready-to-use solution for finding the optimal weights of the assets we consider for trading, derived from their historical performance. From a practical point of view, we can include it in our algorithmic trading strategy and backtest its applicability under different initial conditions. This is the subject of my next, upcoming post. However, before we can enjoy the view from the peak, we need to climb the mountain first.

In Matlab, the portfolio is created as a dedicated object of the same name. It doesn't read the raw stock data. We need to feed that beast. Two major ingredients satisfy the input: a vector of the expected asset returns and a covariance matrix. Matlab helps us to estimate these moments, but first we need to deliver the asset data in a digestible form.

In this post we will see how one can quickly download the stock data from the Internet based on our own stock selection and pre-process them for solving portfolio optimization problem in Matlab.

Initial Setup for Portfolio Object

Let's say that at any point of time you have your own list of stocks you wish to buy. For simplicity let's also assume that the list contains stocks traded on NYSE or NASDAQ. Since you have been a great fan of this game, now you are almost ready to buy what you jotted down on your ShoppingList.lst. Here is an example of 10 tech stocks:

AAPL   AOL   BIDU   GOOG   HPQ   IBM   INTC   MSFT   NVDA   TXN

They will constitute your portfolio of stocks. The problem of portfolio optimization requires a look back in time at the returns generated in trading by each stock. Based on them, the Return Proxy and Risk Proxy can be found.

The return matrix $R$ of dimensions $(N-1)\times M$, where $N$ stands for the number of historical prices (e.g. daily, monthly, etc.) and $M$ for the number of stocks in our portfolio, is required by Matlab as an input. We will see how it works in the next post. For now let's focus solely on the creation of this matrix.
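
As a teaser, if the Close Prices were already collected in an $N\times M$ matrix Y, the return matrix could be obtained in one line (a minimal sketch; the function discussed below does exactly this via tick2ret):

% daily simple returns from an N x M matrix of Close Prices (a sketch)
R = Y(2:end,:)./Y(1:end-1,:) - 1;   % (N-1) x M matrix of returns
% equivalently, with the Financial Toolbox:
% R = tick2ret(Y);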

In the article Create a Portfolio of Stocks based on Google Finance Data fed by Quandl I discussed Quandl.com as an attractive data provider for US stocks. Here, we will follow this solution making use of Quandl resources to pull out the stock price series for our shopping list. Ultimately, we aim at building a function, here: QuandlForPortfolio, that does the job for us:

% Pre-Processing of Asset Price Series for Portfolio Optimization in Matlab
%  (c) 2013, QuantAtRisk.com, by Pawel Lachowicz
 
clear all; close all; clc;
 
% Input Parameters
n=1*365;
tickers='ShoppingList.lst';
qcodes='QuandlStockCodeListUS.xlsx';
 
[X,Y,R,AssetList] = QuandlForPortfolio(n,tickers,qcodes);

We call this function with three input parameters. The first one, $n$, denotes the number of calendar days from today (counting backwards) for which we wish to retrieve the stock data. Usually, 365 days will correspond to about 250$-$252 trading days. The second parameter is the path/file name of our list of stocks (those to be taken into account in the portfolio optimisation process), while the last input defines the path/file name of the file storing the stocks' tickers and the associated Quandl Price Codes (see here for more details).

Feeding the Beast

The QuandlForPortfolio Matlab function is an extended version of the previously discussed solution. It contains an important correcting procedure for the data fetched from the Quandl servers. First, let’s have a closer look on the function itself:

% Function assists in fetching Google Finance data from the Quandl.com
%  server for a given list of tickers of stocks traded on NYSE or
%  NASDAQ. Data are retrieved for last 'n' days with daily sampling.
%
% INPUT
%   n       : number of calendar days from 'today' (e.g. 365 would
%             correspond to about 252 business days)
%   tickers : a path/file name of a text file listing tickers
%   qcodes  : a path/file name of Excel workbook (.xlsx) containing a list
%              of tickers and Quandl Price Codes in the format of
%              [Ticker,Stock Name,Price Code,Ratios Code,In Market?]
% OUTPUT
%   X0        : [Nx1] column vector with days
%   Y0        : [NxM] matrix with Close Prices for M stocks
%   R0        : [(N-1)xM] matrix of Returns
%   AssetList : a list of tickers (cell array)
%
% (c) 2013, QuantAtRisk.com, by Pawel Lachowicz
 
function [X0,Y0,R0,AssetList0] = QuandlForPortfolio(n,tickers,qcodes)
    fileID = fopen(tickers);
    tmp = textscan(fileID, '%s');
    fclose(fileID);
    AssetList=tmp{1};  % a list as a cell array
 
    % Read in the list of tickers and internal Quandl codes
    %
    [~,text,~] = xlsread(qcodes);
    quandlc=text(:,1); % again, as a list in a cell array
    quandlcode=text(:,3); % corresponding Quandl's Price Code
 
    date1=datestr(today-n,'yyyy-mm-dd'); % from
    date2=datestr(today,'yyyy-mm-dd');   % to
 
    % Fetch the data from Quandl.com
    %
    QData={};
    for i=1:length(AssetList)
        for j=1:length(quandlc)
            if(strcmp(AssetList{i},quandlc{j}))
                fprintf('%4.0f %s\n',i,quandlc{j});
                fts=0;
                [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ...
                   'authcode','x',...
                   'start_date',date1,'end_date',date2,'collapse','daily');
                QData{i}=fts;
            end
        end
    end
 
    % Post-Processing of Fetched Data
    %
    % create a list of days across all tickers
    TMP=[];
    for i=1:length(QData)
        tmp=fts2mat(QData{i},1);
        tmp=tmp(:,1);
        TMP=[TMP; tmp];
    end
    ut=unique(TMP);
    % use that list to find these days that are not present
    %  among all data sets
    TMP=[];
    for i=1:length(QData)
        tmp=fts2mat(QData{i},1);
        tmp=tmp(:,1);
        TMP=[TMP; setdiff(ut,tmp)];
    end
    ut=unique(TMP);
    % finally, extract Close Prices from FTS object and store them
    %  in Y0 matrix, plus corresponding days in X0
    X0=[];
    Y0=[]; 
    for i=1:length(QData)
        tmp=fts2mat(QData{i},1);
        cp=[];
        for j=1:size(tmp,1)
            [r,~,~]=find(ut==tmp(j,1));
            if(isempty(r))
                cp=[cp; tmp(j,5)]; % column 5 corresponds to Close Price
                if(i<2)
                    % create a time column vector listing days
                    % common among all data sets
                    X0=[X0; tmp(j,1)];
                end
            end
        end
        Y0=[Y0 cp];
    end
    % transform Close Prices into Returns, R(i)=cp(i)/cp(i-1)-1
    R0=tick2ret(Y0);
    AssetList0=AssetList';
end

The main bottleneck comes from the fact that Matlab's portfolio object demands an equal number of historical returns ($N-1$) in the matrix $R$ for all $M$ assets. We design the function so that it sets a common timeframe for all stocks listed on our shopping list. Of course, we assume that all stocks have been traded in the markets for roughly the last $n$ days.

Now, the timeframe of the last $n$ days should be understood as a first approximation. We fetch the data from Quandl (numeric date, Open, High, Low, Close, Volume) and save them in the cell array QData (lines #37-49) for each stock separately as FTS objects (Financial Time-Series objects; see Financial Toolbox). However, it may occur that not every stock we fetched delivers the same amount of data. That is why we need to investigate for which days and for which stocks we miss the data. We achieve that by scanning each FTS object and creating a unique list of all days for which we have data (lines #54-60).

Next, we loop again over the same data sets but now we compare that list with the list of all dates for each stock individually (lines #63-69), capturing (line #67) those dates that are missing. Their complete list is stored as a vector in line #69. Eventually, given that, we are able to compile the full data set (e.g. Close Prices; here line #80) for all stocks in our portfolio, ensuring that we include only those dates for which we have prices across all $M$ assets (lines #70-91).
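
For illustration only, the same date-alignment idea can be sketched in a simplified form with Matlab's intersect and ismember functions (A and B below are hypothetical [date, price] matrices for two stocks; this is not part of the function above):

% a simplified sketch of aligning two price series on common dates
commonDays = intersect(A(:,1), B(:,1));    % dates present in both series
pA = A(ismember(A(:,1), commonDays), 2);   % prices of A on common dates
pB = B(ismember(B(:,1), commonDays), 2);   % prices of B on common dates
Y  = [pA pB];                              % aligned Close-Price matrix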

Beast Unleashed

We test our data pre-processing simply by running the block of code listed above, which calls the QuandlForPortfolio function, and we check the results in Matlab's command window as follows:

>> whos X Y R AssetList
  Name             Size            Bytes  Class     Attributes
 
  AssetList        1x10             1192  cell                
  R              250x10            20000  double              
  X              251x1              2008  double              
  Y              251x10            20080  double

which confirms that the dimensions are as expected.

At this stage, the aforementioned function can be used in two ways. First, we are interested in the portfolio optimisation and we look back at the last $n$ calendar days counting from the most current one (today). The second usage is handy too. We consider the stocks on our shopping list and fetch their last, say, $n=7\times365$ days of data. If all stocks were traded over the past 7 years we should be able to collect a rich data set. If not, the function will adjust the beginning and end date to meet the initial time constraints as required for the construction of the $R$ matrix. In the latter case, we can use the 7-year data sample for direct backtesting of algo models utilizing Portfolio Optimization.
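
For example, a sketch of the second usage (the 7-year look-back) only requires changing the first input parameter:

% fetch roughly 7 years of daily data for the same shopping list (a sketch)
n=7*365;
[X,Y,R,AssetList] = QuandlForPortfolio(n,'ShoppingList.lst', ...
                                       'QuandlStockCodeListUS.xlsx');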

Stay tuned as we will rock this land in the next post!

Any Questions?

Share them across QuantCove.com – the official Forum of QuantAtRisk.

QuantAtRisk on YouTube!

Since Dec/1 you can follow the QuantAtRisk YouTube Channel! My goal is to deliver quality information and visualisations of solutions to complex financial problems.

We kick off this session with the very first video on Accelerated Matlab for Quants, a tutorial for complete beginners in Matlab programming.

[Embedded YouTube video]

 

Enjoy!

Ideal Stock Trading Model for the Purpose of Backtesting Only


There is only one goal in algorithmic trading: to win the game. To emerge victorious. To feel the sunlight again after months of walking through the valley of shadows, stuck in the depths of algo drawdowns. An endless quest for the most brilliant minds, to find the way to win over and over, and over again. To discover a perfect solution as your new private weapon in this battleground. Is it possible? If the answer were so simple, we wouldn't be so engaged in seeking a golden mean.

However, algo trading is a journey, and sometimes in the process of programming and testing of our trading systems, we need to have an ideal trading model ready-to-use straight away! What I mean by the ideal model is a sort of template, a benchmark that will allow us to list a number of successful trades, their starting and closing times, open and close price of the trades being executed, and the realized return coming down to our pocket from each trade. Such a trading model template also allows us to look at the trading data from a different perspective and re-consider and/or apply an additional risk management framework. In fact, the benefits are limitless for the backtesting purposes.

In this post we will explore one of the easiest ways of programming a perfect model: re-directing the time-axis backwards. Using the example of Google, Inc. (GOOG) stock listed on NASDAQ, we will analyse the stock's trading history and find all possible trades returning at least 3% over the past decade.

I will use the results of this strategy in the upcoming (this week!) series of new posts on Portfolio Optimization in Matlab for Algo Traders.

Model and Data

Let’s imagine we are interested in finding a large number of trades with the expected return from each trade to be at least 3%. We consider GOOG stock (daily close prices) with data spanning 365$\times$10 days back in time since the present day (last 10 years). We will make use of Google Finance data powered by Quandl as described in one of my previous posts, namely, Create a Portfolio of Stocks based on Google Finance Data fed by Quandl. Shall we begin?

% Ideal Stock Trading Model for the Purpose of Backtesting Only
%
% (c) 2013 QuantAtRisk.com, by Pawel Lachowicz
 
 
clear all; close all; clc;
 
stocklist={'GOOG'};
 
% Read in the list of tickers and internal code from Quandl.com
[ndata, text, alldata] = xlsread('QuandlStockCodeListUS.xlsx');
quandlc=text(:,1); % again, as a list in a cell array
quandlcode=text(:,3); % corresponding Quandl's Price Code
 
% fetch stock data for last 10 years
date2=datestr(today,'yyyy-mm-dd')        % to   (today)
date1=datestr(today-365*10,'yyyy-mm-dd') % from (10 years back)
 
stockd={};
for i=1:length(stocklist)
    for j=1:length(quandlc)
        if(strcmp(stocklist{i},quandlc{j}))
            fprintf('%4.0f %s\n',i,quandlc{j});
            % fetch the data of the stock from Quandl
            % using recommended Quandl's command and
            % saving them directly into FTS object (fts)
            fts=0;
            [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ...
                          'authcode','ENTER_YOUR_CODE',...
                          'start_date',date1,'end_date',date2);
            stockd{i}=fts; % entire FTS object in an array's cell
        end
    end
end

We can visualise the GOOG data extracted from Google Finance via Quandl immediately as follows:

% plot the close prices of GOOG
cp=fts2mat(stockd{1}.Close,1);
plot(cp(:,1),cp(:,2),'color',[0.6 0.6 0.6])
xlim([min(cp(:,1)) max(cp(:,1))]);
ylim([min(cp(:,2)) max(cp(:,2))]);
xlabel('Nov 2003 - Nov 2013 (days)');
ylabel('GOOG Close Price ($)');

ideal-fig01

We open the trade on the first day (long position) and, as time goes by, on each following day we check whether the stock value has increased by 3% or more. If not, we move the current time position one day forward and check again. If the condition is met, we close the open trade on the following trading (business) day at the stock's market price. We allow a new trade to be opened at the stock's market price on the next business day after the day when the previous trade has been closed (exercised). Additionally, we check whether the 'running' return of the open trade drops below -5%. If so, we substitute the open time of the trade with the time when this condition has been fulfilled (and do the same with the open price). This additional rule allows us to restart the search for a new profitable trade.

t0=cp(1,1);    % starting day for backtesting
tN=cp(end,1);  % last day
 
trades=[]; % open a log-book for all executed trades
status=0; % status meaning: 0-no open trade, 1-open trade
t=t0;
% we loop over time (t) [days]
while(t<tN)
    [r,~,~]=find(t==cp(:,1)); % check the row in cp vector
    if(~isempty(r))
       if(~status)
           topen=t; % time when we open the trade
           popen=cp(r,2); % assuming market price of the stock
           status=1;
       else
           ptmp=cp(r,2); % running close price
           rtmp=ptmp/popen-1; % running return of the open trade
           if(rtmp>0.03) % check 3% profit condition
               % if met, then
               tclose=busdate(t,1); % close time of the trade
                                    %  assumed on the next business day
               t=busdate(tclose,1); % next day in the loop
               if(tclose<=tN)
                   [r,~,~]=find(tclose==cp(:,1));
                   pclose=cp(r,2); % close price
                   ret=pclose/popen-1; % realized profit/loss
                   % save the trade details into log-book
                   trades=[trades; topen tclose popen pclose ret*100];
                   status=0; % change status of trading to not-open
                   % mark the opening of the trade as blue dot marker
                   hold on; plot(topen,popen,'b.','markerfacecolor','b');
                   % mark the end time of the trade
                   hold on; plot(tclose,pclose,'r.','markerfacecolor','r');
               end
           elseif(rtmp<=-0.05) % check an additional condition
               topen=t; % overwrite the time
               popen=cp(r,2); % and the price
               status=1; % sustain the status of the trade as 'open'
           else
               t=t+1;
           end
       end
    else
        t=t+1;
    end
end

In this piece of code, in the matrix trades (a log-book of all exercised trades) we store the history of all successful trades meeting our earlier assumed criteria. The only uncertainty that we allow to slip into our perfect solution is related to the instance when the close price on the next business day occurs to be lower, making the realized profit from the trade less than 3%. By plotting all good trades, with the ending day $tN$ set to Nov 18, 2013, we get a messy picture:

ideal-fig02

which translates into a more intuitive one once we examine the distribution of profits from all trades:

figure(3);
hist(trades(:,5),50);
xlabel('Profit/loss (%)');
ylabel('Number of trades');

ideal-fig03

At this point the most valuable information is contained in the log-book, whose content we can analyse trade by trade:

>> format shortg
>> trades
 
trades =
 
   7.3218e+05   7.3218e+05       100.34        109.4       9.0293
   7.3218e+05   7.3220e+05       104.87          112       6.7989
   7.3221e+05   7.3221e+05       113.97       119.36       4.7293
   7.3221e+05   7.3222e+05       117.84       131.08       11.236
   7.3222e+05   7.3222e+05        129.6       138.37        6.767
   7.3223e+05   7.3224e+05       137.08       144.11       5.1284
   7.3224e+05   7.3224e+05       140.49       172.43       22.735
   7.3224e+05   7.3225e+05        187.4       190.64       1.7289
   ...
   7.3533e+05   7.3535e+05       783.05       813.45       3.8823
   7.3535e+05   7.3536e+05        809.1       861.55       6.4825
   7.3536e+05   7.3537e+05       857.23       915.89        6.843
   7.3546e+05   7.3549e+05       856.91       888.67       3.7063
   7.3549e+05   7.3553e+05       896.19       1003.3       11.952

where the columns correspond to the open and close times of the trade (Matlab's continuous serial date numbers for the financial time-series; see the datestr command for converting them to the yyyy-mm-dd format), the open and close prices of GOOG stock, and the realized profit/loss of the trade, respectively.
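
For a more readable log-book, the serial date numbers in the first two columns can be converted with datestr, e.g. (a small sketch applied to the first trade):

% convert the open/close serial dates of the first trade into strings
datestr(trades(1,1),'yyyy-mm-dd')   % trade open date
datestr(trades(1,2),'yyyy-mm-dd')   % trade close date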

Questions? Discuss on Forum.

Just dive directly into Backtesting section on QaR Forum and keep up, never give up.

Create a Portfolio of Stocks based on Google Finance Data fed by Quandl

There is an old saying: Be careful what you wish for. Many quants and traders have wished for a long time for better quality of publicly available data. It wasn't an easy task to accomplish but, with a minimum of effort, it has become possible to get all you truly dreamt of. When I started my experience with daily backtesting of my algo strategies I initially relied on Yahoo! data. Yeah, easily accessible but of low quality. Since the inception of Google Finance the same dream has been reborn. More precise data, covering more aspects of trading, platforms, markets, instruments. But how to get them all?


Quandl.com provides us with access to an enormous database. Just visit the webpage to find out what I mean by barely putting my thoughts together in one sentence. In fact, the portal's accessibility via different programming languages (e.g. Matlab, C++, Python, etc.) makes that database more alive than ever. The space of parameters the quants get access to becomes astronomical in number!

Alleluja! Let's ride this horse. Enough of watching the online charts changing the direction of our investing moods with a one-minute time resolution. We were born to be smarter than the stochastic processes taking control over our daily strategies. In this short post, making use of the Matlab R2013a platform, I want to show you how quickly you can construct any portfolio of US stocks taken from the Dow Jones Index universe (as an example). We will use the Google Finance stock data fed by Quandl.com.

Google Finance Stock Data for Model Feeding

Similar to Yahoo! Finance's resources, Quandl.com has its own system of coding the names of all US stocks. Therefore, we need to know it before taking any further step. The .csv file containing all of the codes for Quandl's stock data can be obtained via the API for Stock Data webpage. The file contains a header of the form:

Ticker      Stock Name      Price Code      Ratios Code      In Market?

linking Google Finance ticker with Quandl’s code (Price Code). Doing some extra job with data filtering, I created the pre-processed set of all tickers excluding those non-active in the markets and removing all invalid data. You can download it here QuandlStockCodeListUS.xlsx as an Excel workbook.

The second list we need is a list of the potential stocks we want to deal with in our portfolio. As of today, you can download the current list of the Dow Jones Index's components here: dowjones.lst. Let's start with it:

% Create a Portfolio from Dow Jones Stocks based on Google Finance
%  Data fed by Quandl.com
%
% (c) 2013 QuantAtRisk.com, by Pawel Lachowicz
 
 
close all; clear all; clc;
 
% Read the list of Dow Jones components
fileID = fopen('dowjones.lst');
tmp = textscan(fileID, '%s');
fclose(fileID);
dowjc=tmp{1};  % a list as a cell array

The guys from Mathworks suggest making it a new and good practice to use the textscan command (lines 10-12) instead of dataread for reading data from text files. The latter will be removed in a future Matlab release.

Next, we need to import the current list of tickers for US stocks and their corresponding codes in Quandl's database. We will make use of my pre-processed data set QuandlStockCodeListUS.xlsx, as mentioned earlier, in the following way:

% Read in the list of tickers and internal code from Quandl.com
[ndata, text, alldata] = xlsread('QuandlStockCodeListUS.xlsx');
quandlc=text(:,1); % again, as a list in a cell array

For the simplicity of our example, we will consider all 30 stocks as the complete sample set of stocks in our portfolio. Portfolio construction is not solely a selection of stocks. It usually involves a look-back period over the past performance of each individual stock. In the following move, we will collect daily stock returns over the last 365 calendar days (corresponding to about 252 trading days).

This framework is pretty handy. Why? If you build or run a model live, most probably you wish to update your database when the US markets are closed. You fetch Quandl.com for data. Now, if your model employs risk management or portfolio optimisation as an integrated building block, it's very likely you are interested in risk(-return) estimation, e.g. by calculating Value-at-Risk (VaR) or similar risk measures. Of course, the choice of the time-frame is up to you, based on your strategy.

Having that said,

% fetch stock data for last 365 days
date2=datestr(today,'yyyy-mm-dd')     % to   (today)
date1=datestr(today-365,'yyyy-mm-dd') % from (365 days back)

which employs Matlab's datestr command with a format specifier as a parameter. It's really useful when it comes to data fetching from Quandl.com, as the date format they like to see in your code is yyyy-mm-dd. The above two commands return in Matlab the following values:

date2 =
  2013-11-03
date1 =
  2012-11-03

as for the time of writing this article (string type).

Get an Access to Quandl’s Resources


Excellent! Let's catch some fish in the pond! Before doing so we need to make sure that Quandl's software is installed and linked to our Matlab console. In order to download the Quandl Matlab package, just visit this website and follow the installation procedure at the top of the page. Add the proper path in the Matlab environment too. The best practice is to register yourself on Quandl. That allows you to raise the daily number of data calls from 50 to 500. If you think you will need more data requests per day, you can send a request from the same webpage.

When registered, in your Profile section on Quandl, you will gain access to your individual Authorization Token, for example:
You will need to replace 'yourauthenticationtoken' in the Matlab code (see line 35) with your token in order to get the data.

Fetching Stock Data

From this point on, the road stops winding. For all 30 stocks belonging to the current composition of the Dow Jones Index, we get their Open, High, Low, Close prices (as retrieved from Google Finance) in the following way:

% create an empty array for storing stock data
stockd={};
% scan the Dow Jones tickers and fetch the data from Quandl.com
for i=1:length(dowjc)
    for j=1:length(quandlc)
        if(strcmp(dowjc{i},quandlc{j}))
            fprintf('%4.0f %s\n',i,quandlc{j});
            % fetch the data of the stock from Quandl
            % using recommended Quandl's command and
            % saving them directly into Matlab's FTS object (fts)
            fts=0;
            [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ...
                          'authcode','yourauthenticationtoken',...
                          'start_date',date1,'end_date',date2);
            stockd{i}=fts; % entire FTS object in an array's cell
        end
    end
end
% use 'save' command to save all actual variables from memory on the disc,
%  if you need it for multiple code running
% save data

The code in lines 35 to 37 is an example of how you call Quandl for data and fetch them into a FINTS object (a financial time series object from the Financial Toolbox in Matlab). So easy. If you wish to change the time frequency, apply a transformation, or get the data in .csv format rather than FINTS, all of that is available as parameters at the level of the Quandl.get() call (explore all the possibilities here).
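
For instance, a variant of that call with weekly sampling could look as follows (a sketch; the 'collapse' parameter is the same one used with 'daily' in the QuandlForPortfolio function shown earlier):

% a sketch: the same Quandl call, but with weekly sampling
[fts,headers]=Quandl.get(quandlcode{j},'type','fints', ...
              'authcode','yourauthenticationtoken', ...
              'start_date',date1,'end_date',date2, ...
              'collapse','weekly');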

Having that, you can easily plot the Google Finance data, for instance, of IBM stock straight away:

% display High-Low Chart for IBM
highlow(stockd{11})
title('IBM');
ylabel('Stock Price ($)');

to get:
quandl-fig01

or prepare a return-risk plot for your further portfolio optimisation

ret=[];
for i=1:length(stockd)
    % extract the Close Price
    tmp=fts2mat(stockd{i}.Close,0);
    % calculate a vector of daily returns
    r=tmp(2:end)./tmp(1:end-1)-1;
    ret=[ret r];
end
% calculate the annualized mean returns for 30 stocks
%  (note: daily mean returns annualize with ~250, volatilities with sqrt(250))
mret=mean(ret)*250;
% and corresponding standard deviations
msig=std(ret)*sqrt(250);
 
% plot return-risk diagram
p=100*[msig; mret]'
plot(p(:,1),p(:,2),'b.');
xlim([0 30]);
xlabel('Annualized Risk of Stock (%)')
ylabel('Annualized Mean Return (%)');

providing us with
quandl-fig02

Forum: Further Discussions

You can discuss the current post with other quants within a new Quant At Risk Forum (Data Handling and Pre-Processing section).

Anxiety Detection Model for Stock Traders based on Principal Component Analysis

Everybody would agree on one thing: nervousness among traders may lead to massive sell-offs of stocks, an avalanche of falling prices, and huge losses. We have witnessed this sort of behaviour many times. Anxiety is the feeling of uncertainty, a human inborn instinct that triggers a self-defending mechanism against high risks. The apex of anxiety is fear and panic. When fear spreads among the markets, all goes south. It is a dream goal for all traders to capture the nervousness and fear just by looking at or studying the market behaviour, trading patterns, ask-bid spreads, or the flow of orders. The major problem is that we know very well how important a factor human behaviour is in the game, yet it is not observed directly. It has been puzzling me for a few years, ever since I finished reading Your Money and Your Brain: How the New Science of Neuroeconomics Can Help Make You Rich by Jason Zweig, sitting on the beach of one of the Gili Islands, Indonesia, in December of 2009. A perfect spot far away from the trading charts.

So, is there a way to disentangle the emotional part involved in trading from all other factors (e.g. the application of technical analysis, bad news consequences, IPOs, etc.) which are somehow easier to deduce? In this post I will try to make a quantitative attempt towards solving this problem. Although the solution will not have the final and closed form, my goal is to deliver an inspiration for quants and traders interested in the subject by putting a simple idea into practice: the application of Principal Component Analysis.

1. Principal Component Analysis (PCA)

Called by many one of the most valuable results of applied linear algebra, Principal Component Analysis delivers a simple, non-parametric method of extracting relevant information from often confusing data sets. Real-world data usually hold some relationships among their variables and, as a good approximation, in the first instance we may suspect them to be of linear (or close to linear) form. And linearity is one of the stringent but powerful assumptions standing behind PCA.

Imagine we observe the daily change of prices of $m$ stocks (being a part of your portfolio or a specific market index) over the last $n$ days. We collect the data in $\boldsymbol{X}$, an $m\times n$ matrix. Each of its $n$ column vectors lies in an $m$-dimensional vector space spanned by an orthonormal basis, therefore each is a linear combination of this set of unit-length basis vectors: $\boldsymbol{BX} = \boldsymbol{X}$ where the basis $\boldsymbol{B}$ is the identity matrix $\boldsymbol{I}$. Within the PCA approach we ask a simple question: is there another basis, a linear combination of the original basis, that best represents our data set? In other words, we look for a transformation matrix $\boldsymbol{P}$ acting on $\boldsymbol{X}$ in order to deliver its re-representation:
$$
\boldsymbol{PX} = \boldsymbol{Y} \ .
$$ The rows of $\boldsymbol{P}$ become a set of new basis vectors for expressing the columns of $\boldsymbol{X}$. This change of basis makes the row vectors of $\boldsymbol{P}$ in this transformation the principal components of $\boldsymbol{X}$. But how to find a good $\boldsymbol{P}$?

Consider for a moment what we can do with a set of $m$ observables spanning $n$ days. It is not a mystery that many stocks co-vary over different periods of time, i.e. their price movements are closely correlated and follow the same direction. The statistical way to measure the mutual relationship among the $m$ vectors (correlation) is the calculation of a covariance matrix. For our data set $\boldsymbol{X}$:
$$
\boldsymbol{X}_{m\times n} =
\left[
\begin{array}{cccc}
\boldsymbol{x_1} \\
\boldsymbol{x_2} \\
… \\
\boldsymbol{x_m}
\end{array}
\right]
=
\left[
\begin{array}{cccc}
x_{1,1} & x_{1,2} & … & x_{1,n} \\
x_{2,1} & x_{2,2} & … & x_{2,n} \\
… & … & … & … \\
x_{m,1} & x_{m,2} & … & x_{m,n}
\end{array}
\right]
$$
the covariance matrix takes the following form:
$$
cov(\boldsymbol{X}) \equiv \frac{1}{n-1} \boldsymbol{X}\boldsymbol{X}^{T}
$$ where we multiply $\boldsymbol{X}$ (with the mean of each row subtracted) by its transpose, and the factor $(n-1)^{-1}$ keeps the variance estimate unbiased. The diagonal elements of $cov(\boldsymbol{X})$ are the variances corresponding to each row of $\boldsymbol{X}$, whereas the off-diagonal terms of $cov(\boldsymbol{X})$ represent the covariances between different rows (prices of the stocks). Please note that the above multiplication assures us that $cov(\boldsymbol{X})$ is a square symmetric matrix of size $m\times m$.
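
A quick numerical check of this definition (a minimal sketch on random data; the rows are mean-centred first so that the formula matches Matlab's cov applied to the transposed matrix):

% verify cov(X) = X*X'/(n-1) for mean-centred rows (random example)
m=5; n=100;
X=randn(m,n);                      % m time-series of n observations each
Xc=X-repmat(mean(X,2),[1 n]);      % subtract the mean of each row
C1=Xc*Xc'/(n-1);                   % covariance as defined above
C2=cov(Xc');                       % Matlab's cov works column-wise
max(max(abs(C1-C2)))               % ~0, up to numerical precision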

All right, but what does it have in common with our PCA method? PCA looks for a way to optimise the matrix of $cov(\boldsymbol{X})$ by a reduction of redundancy. Sounds a bit enigmatic? I bet! Well, all we need to understand is that PCA wants to ‘force’ all off-diagonal elements of the covariance matrix to be zero (in the best possible way). The guys in the Department of Statistics will tell you the same as: removing redundancy diagonalises $cov(\boldsymbol{X})$. But how, how?!

Let’s come back to our previous notation of $\boldsymbol{PX}=\boldsymbol{Y}$. $\boldsymbol{P}$ transforms $\boldsymbol{X}$ into $\boldsymbol{Y}$. We also marked that:
$$
\boldsymbol{P} = [\boldsymbol{p_1},\boldsymbol{p_2},…,\boldsymbol{p_m}]
$$ was the new basis we were looking for. PCA assumes that all basis vectors $\boldsymbol{p_k}$ are orthonormal, i.e. $\boldsymbol{p}_i\cdot\boldsymbol{p}_j=\delta_{ij}$, and that the directions with the largest variances are the most principal. So, PCA first selects a normalised direction in the $m$-dimensional space along which the variance in $\boldsymbol{X}$ is maximised. That is the first principal component, $\boldsymbol{p_1}$. In the next step, PCA looks for another direction along which the variance is maximised. However, because of the orthonormality condition, it searches only among directions perpendicular to all previously found ones. In consequence, we obtain an orthonormal matrix $\boldsymbol{P}$. Good stuff, but still sounds complicated?

The goal of PCA is to find such a $\boldsymbol{P}$, where $\boldsymbol{Y}=\boldsymbol{PX}$, that $cov(\boldsymbol{Y})=(n-1)^{-1}\boldsymbol{YY}^T$ is diagonalised.

We can evolve the notation of the covariance matrix as follows:
$$
(n-1)cov(\boldsymbol{Y}) = \boldsymbol{YY}^T = \boldsymbol{(PX)(PX)}^T = \boldsymbol{PXX}^T\boldsymbol{P}^T = \boldsymbol{P}(\boldsymbol{XX}^T)\boldsymbol{P}^T = \boldsymbol{PAP}^T
$$ where we made a quick substitution of $\boldsymbol{A}=\boldsymbol{XX}^T$. It is easy to prove that $\boldsymbol{A}$ is symmetric. It takes a longer while to find a proof for the following two theorems: (1) a matrix is symmetric if and only if it is orthogonally diagonalisable; (2) a symmetric matrix is diagonalised by a matrix of its orthonormal eigenvectors. Just check your favourite algebra textbook. The second theorem provides us with a right to denote:
$$
\boldsymbol{A} = \boldsymbol{EDE}^T
$$ where $\boldsymbol{D}$ is a diagonal matrix and $\boldsymbol{E}$ is a matrix of eigenvectors of $\boldsymbol{A}$. That brings us to the end of the rainbow.

We select the matrix $\boldsymbol{P}$ to be such that each row $\boldsymbol{p_i}$ is an eigenvector of $\boldsymbol{XX}^T$, therefore
$$
\boldsymbol{P} = \boldsymbol{E}^T .
$$

Given that, we see that $\boldsymbol{E}=\boldsymbol{P}^T$, thus we find $\boldsymbol{A}=\boldsymbol{EDE}^T = \boldsymbol{P}^T\boldsymbol{DP}$ what leads us to a magnificent relationship between $\boldsymbol{P}$ and the covariance matrix:
$$
(n-1)cov(\boldsymbol{Y}) = \boldsymbol{PAP}^T = \boldsymbol{P}(\boldsymbol{P}^T\boldsymbol{DP})\boldsymbol{P}^T
= (\boldsymbol{PP}^T)\boldsymbol{D}(\boldsymbol{PP}^T) =
(\boldsymbol{PP}^{-1})\boldsymbol{D}(\boldsymbol{PP}^{-1})
$$ or
$$
cov(\boldsymbol{Y}) = \frac{1}{n-1}\boldsymbol{D},
$$ i.e. the choice of $\boldsymbol{P}$ diagonalises $cov(\boldsymbol{Y})$ where silently we also used the matrix algebra theorem saying that the inverse of an orthogonal matrix is its transpose ($\boldsymbol{P^{-1}}=\boldsymbol{P}^T$). Fascinating, right?! Let’s see now how one can use all that complicated machinery in the quest of looking for human emotions among the endless rivers of market numbers bombarding our sensors every day.
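
Before leaving the theory, here is a minimal numerical sketch (random data, nothing beyond the derivation above) verifying that this choice of $\boldsymbol{P}$ indeed diagonalises $cov(\boldsymbol{Y})$:

% sanity check of the PCA diagonalisation (a sketch)
m=4; n=200;
X=randn(m,n); X=X-repmat(mean(X,2),[1 n]);   % mean-centred data
A=X*X';                                      % A = X*X' (symmetric)
[E,D]=eig(A);                                % A = E*D*E'
P=E';                                        % rows of P = eigenvectors of A
Y=P*X;
covY=Y*Y'/(n-1);                             % should be (nearly) diagonal
max(max(abs(covY-diag(diag(covY)))))         % off-diagonal terms ~0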

2. Covariances of NASDAQ, Eigenvalues of Anxiety


We will try to build a simple quantitative model for detection of the nervousness in the trading markets using PCA.

By its simplicity I will understand the following model assumption: no matter what the data conceal, the 1st Principal Component (1-PC) of PCA solution links the complicated relationships among a subset of stocks triggered by a latent factor attributed by us to a common behaviour of traders (human and pre-programmed algos). It is a pretty reasonable assumption, much stronger than, for instance, the influence of Saturn’s gravity on the annual silver price fluctuations. Since PCA does not tell us what its 1-PC means in reality, this is our job to seek for meaningful explanations. Therefore, a human factor fits the frame as a trial value very well.

Let’s consider the NASDAQ-100 index. It is composed of 100 technology stocks. The most current list you can find here: nasdaq100.lst downloadable as a text file. As usual, we will perform all calculations using Matlab environment. Let’s start with data collection and pre-processing:

% Anxiety Detection Model for Stock Traders
%  making use of the Principal Component Analysis (PCA)
%  and utilising publicly available Yahoo! stock data
%
% (c) 2013 QuantAtRisk.com, by Pawel Lachowicz
 
clear all; close all; clc;
 
 
% Reading a list of NASDAQ-100 components
nasdaq100=(dataread('file',['nasdaq100.lst'], '%s', 'delimiter', '\n'))';
 
% Time period we are interested in
d1=datenum('Jan 2 1998');
d2=datenum('Oct 11 2013');
 
% Check and download the stock data for a requested time period
stocks={};
for i=1:length(nasdaq100)
    try
        % Fetch the Yahoo! adjusted daily close prices between selected
        % days [d1;d2]
        tmp = fetch(yahoo,nasdaq100{i},'Adj Close',d1,d2,'d');
        stocks{i}=tmp;
        disp(i);
    catch err
        % no full history available for requested time period
    end
end

where, first, we check whether for the given list of NASDAQ-100 components the full data history (adjusted close prices) is available via the Yahoo! server (please refer to my previous post, Yahoo! Stock Data in Matlab and a Model for Dividend Backtesting, for more information on the connectivity).

The cell array stocks becomes populated with two-dimensional matrices: the time-series corresponding to stock prices (time, price). Since the Yahoo! database does not contain the full history for all stocks of our interest, we may expect their time spans to differ. For the purpose of demonstrating the PCA method, we apply an additional screening of the downloaded data, i.e. we require the data to span the interval defined by the $d1$ and $d2$ variables and, additionally, to have the same (maximal available) number of data points (observations, trials). We achieve that by:

% Additional screening
d=[]; 
j=1; 
data={};
for i=1:length(nasdaq100)
    d=[d; i min(stocks{i}(:,1)) max(stocks{i}(:,1)) size(stocks{i},1)];
end
for i=1:length(nasdaq100)
    if(d(i,2)==d1) && (d(i,3)==d2) && (d(i,4)==max(d(:,4)))
        data{j}=sortrows(stocks{i},1);
        fprintf('%3i %1s\n',i,nasdaq100{i})
        j=j+1;
    end
end
m=length(data);

The temporary matrix $d$ holds, respectively: the index of the stock as read in from the nasdaq100.lst file, the first and last day numbers of the available data, and the total number of data points in the time-series:

>> d
d =
      1      729757      735518        3970
      2      729757      735518        3964
      3      729757      735518        3964
      4      729757      735518        3969
     ..          ..          ..          ..
     99      729757      735518        3970
    100      729757      735518        3970

Our screening method saves the data of $m=21$ selected stocks into the data cell array, corresponding to the following companies from our list:

  1 AAPL
  7 ALTR
  9 AMAT
 10 AMGN
 20 CERN
 21 CHKP
 25 COST
 26 CSCO
 30 DELL
 39 FAST
 51 INTC
 64 MSFT
 65 MU
 67 MYL
 74 PCAR
 82 SIAL
 84 SNDK
 88 SYMC
 96 WFM
 99 XRAY
100 YHOO

Okay, some people say that seeing is believing. All right. Let's see how it works. Recall the fact that we demanded our stock data to span the period between Jan 2, 1998 and Oct 11, 2013. We found 21 stocks meeting those criteria. Now, let's assume we pick a random date, say, Jul 2, 2007, and we extract for all 21 stocks their price history over the last 90 calendar days. We save their prices (skipping the time columns) into the $Z$ matrix as follows:

t=datenum('Jul 2 2007');
Z=[];
for i=1:m
    [r,c,v]=find((data{i}(:,1)<=t) & (data{i}(:,1)>t-90));
    Z=[Z data{i}(r,2)];
end

and we plot them all together:

plot(Z)
xlim([1 length(Z)]);
ylabel('Stock price (US$)');
xlabel('T-90d');

anxiety-fig01
It's easy to deduce that the top line corresponds to the Apple, Inc. (AAPL) adjusted close prices.

One data-processing step not mentioned so far is that we need to transform our time-series into a comparable form. We do that by subtracting the average value from each series and dividing it by its standard deviation. Why? For the simple reason of putting them all on an equal footing for mutual comparison. We call that step the normalisation or standardisation of the time-series under investigation:

[N,M]=size(Z);
X=(Z-repmat(mean(Z),[N 1]))./repmat(std(Z),[N 1]);

This represents the matrix $\boldsymbol{X}$ that I discussed in a theoretical part of this post. Note, that the dimensions are reversed in Matlab. Therefore, the normalised time-series,

% Display normalized stock prices
plot(X)
xlim([1 length(Z)]);
ylabel('(Stock price-Mean)/StdDev');
xlabel('T-90d');

look like:
anxiety-fig02
For a given matrix of $\boldsymbol{X}$, its covariance matrix,

% Calculate the covariance matrix, cov(X)
CovX=cov(X);
imagesc(CovX);

as for data spanned 90 calendar day back from Jul 2 2007, looks like:
anxiety-fig03
where the colour coding goes from the maximal values (most reddish) down to the minimal values (most bluish). The diagonal of the covariance matrix simply tells us that for normalised time-series their variances are equal to 1, as expected.

Going one step forward, based on the given covariance matrix, we look for the matrix of $\boldsymbol{P}$ whose columns are the corresponding eigenvectors:

% Find P
[P,~]=eigs(CovX,5);
imagesc(P);
set(gca,'xticklabel',{1,2,3,4,5},'xtick',[1 2 3 4 5]);
xlabel('Principal Component')
ylabel('Stock');
set(gca,'yticklabel',{'AAPL', 'ALTR', 'AMAT', 'AMGN', 'CERN', ...
 'CHKP', 'COST', 'CSCO', 'DELL', 'FAST', 'INTC', 'MSFT', 'MU', ...
 'MYL', 'PCAR', 'SIAL', 'SNDK', 'SYMC', 'WFM', 'XRAY', 'YHOO'}, ...
 'ytick',[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21]);

which results in $\boldsymbol{P}$ displayed as:
anxiety-fig04
where we computed PCA for five principal components in order to illustrate the process. Since the colour coding is the same as in the previous figure, a visual inspection of the 1-PC indicates negative values for at least 16 out of the 21 components (loadings) of the first eigenvector. That simply means that over the last 90 days the common dynamics of those stocks was directed south, in favour of traders holding short positions in those stocks.
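
The visual impression can be quantified with a one-liner (a sketch, assuming P is the matrix returned by eigs above, with eigenvectors stored in columns):

% count the negative loadings of the 1st Principal Component (a sketch)
sum(P(:,1)<0)   % number of stocks (out of 21) with a negative 1-PC loading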

It is important to note at this very moment that 1-PC does not represent the 'price momentum' itself. That would be too easy. It represents the latent variable responsible for a common behaviour in the stock dynamics, whatever it is. Based on our model assumption (see above) we suspect it may indicate a human factor latent in the trading.

3. Game of Nerves

The last figure communicates an additional message. There is a remarkable coherence of the loadings for 1-PC and pretty random patterns for the remaining four principal components. One may check that, in the case of our data sample, this feature is maintained over many years. That allows us to limit our interest to 1-PC only.

It's getting exciting, isn't it? Let's come back to our main code. Having now a pretty good grasp of the algebra of PCA at work, we may limit our investigation of 1-PC to any time period of our interest, below spanned between the dates defined by the $t1$ and $t2$ variables:

% Select time period of your interest
t1=datenum('July 1 2006');
t2=datenum('July 1 2010');
 
results=[];
for t=t1:t2
    tmp=[];
    A=[]; V=[];
    for i=1:m
        [r,c,v]=find((data{i}(:,1)<=t) & (data{i}(:,1)>t-60));
        A=[A data{i}(r,2)];
    end
    [N,M]=size(A);
    X=(A-repmat(mean(A),[N 1]))./repmat(std(A),[N 1]);
    CovX=cov(X);
    [V,D]=eigs(CovX,1);
    % Find all negative loadings (components) of the 1st Principal
    %  Component's eigenvector
    [r,c,v]=find(V(:,1)<0);
    % Extract them into a new vector
    neg1PC=V(r,1);
    % Calculate the percentage of negative loadings relative
    %  to all values available
    ratio=length(neg1PC)/m;
    % Build a new time-series of 'ratio' change over required
    % time period (spanned between t1 and t2)
    results=[results; t ratio];    
end

We build our anxiety-detection model on the change of the number of negative loadings of the 1st Principal Component (relative to their total number; here equal to 21). As a result, we generate a new time-series tracing this variable over the $[t1;t2]$ time period. We plot the results in one chart, contrasted with the NASDAQ-100 Index, in the following way:

% Fetch NASDAQ-100 Index from Yahoo! data-server
nasdaq = fetch(yahoo,'^ndx','Adj Close',t1,t2,'d');
% Plot it
subplot(2,1,1)
plot(nasdaq(:,1),nasdaq(:,2),'color',[0.6 0.6 0.6]);
ylabel('NASDAQ-100 Index');
% Add a plot corresponding to a new time-series we've generated
subplot(2,1,2)
plot(results(:,1),results(:,2),'color',[0.6 0.6 0.6])
% add overplot 30d moving average based on the same data
hold on; plot(results(:,1),moving(results(:,2),30),'b')

leading us to:
anxiety-fig05
I use a 30-day moving average (a solid blue line) in order to smooth the results (moving.m). Please note that in line #56 I also replaced the earlier value of 90 days with 60 days. Somehow, it is more reasonable to examine the market dynamics with PCA over the past two months than over longer periods (but it's a matter of taste and needs).

Eventually, we construct the core element of the model, namely, we detect nervousness among traders when the percentage of negative 1-PC loadings increases over (at least) five consecutive days:

% Model Core
x1=results(:,1);
y1=moving(results(:,2),30);
tmp=[];
% Find moments of time where the percentage of negative 1-PC
% loadings increases over time (minimal requirement of
% five consecutive daily increases, i.e. six consecutive days)
for i=6:length(x1)
    if(y1(i)>y1(i-1))&&(y1(i-1)>y1(i-2))&&(y1(i-2)>y1(i-3))&& ...
      (y1(i-3)>y1(i-4))&&(y1(i-4)>y1(i-5))
        tmp=[tmp; x1(i)];
    end
end
% When found
z=[];
for i=1:length(tmp)
    for j=1:length(nasdaq)
        if(tmp(i)==nasdaq(j,1))
            z=[z; nasdaq(j,1) nasdaq(j,2)];
        end
    end
end
subplot(2,1,1); 
hold on; plot(z(:,1),z(:,2),'r.','markersize',7);

The results of the model we over-plot with red markers on top of the NASDAQ-100 Index:
anxiety-fig06
Our simple model takes us into completely new territory, an unexplored space of latent variables. Firstly, it does not predict the future. That still (unfortunately) remains unknown. However, what it delivers is a fresh look at the past dynamics in the market. Secondly, it is easy to read from the plot that the results cluster into three subgroups.

The first subgroup corresponds to actions in stock trading that had further negative consequences (see the events of 2007-2009 and the avalanche of prices); here the negative dynamics observed over the preceding 60 calendar days continued. The second subgroup covers those periods of time when anxiety led to negative dynamics among stock traders but, due to other factors (e.g. financial, global, political, etc.), the stocks surged, dragging the Index up. The third subgroup (less frequent) corresponds to instances of relatively flat changes of the Index, revealing a typical pattern of psychological hesitation about the trading direction.

No matter how we might interpret the results, the human factor in trading is evident. Hopefully, the PCA approach captures it. If not, all we are left with is our best friend: a trader’s intuition.

Acknowledgments

An article dedicated to Dr. Dariusz Grech of Physics and Astronomy Department of University of Wroclaw, Poland, for his superbly important! and mind-blowing lectures on linear algebra in the 1998/99 academic year.

Yahoo! Stock Data in Matlab and a Model for Dividend Backtesting


Within the evolution of Mathworks' MATLAB programming environment, finally, in the most recent version labelled 2013a, we received a long-awaited command-line facility for pulling stock data directly from the Yahoo! servers. What does that mean for quants and algo traders? Honestly, a lot. Now, simply by writing a few commands, we can have nearly all we want. However, please keep in mind that Yahoo! data are free, therefore their precision does not always remain at the same level of quality as, e.g., data downloaded from Bloomberg resources. Anyway, just for pure backtesting of your models, this step introduces a big leap in dealing with daily stock data. As usual, we have the possibility of getting open, high, low, close, and adjusted close prices of stocks, supplemented with the traded volume and the dates, plus the values of dividends.

In this post I present a short example of how one can retrieve the data of SPY (tracking the performance of the S&P 500 index) using Yahoo! data in the new Matlab 2013a, and I show a simple code with which one can test which period of buying-holding-and-selling SPY (or any other stock paying dividends) makes a profit most reliably.

The beauty of Yahoo! new feature in Matlab 2013a has been fully described in the official article of Request data from Yahoo! data servers where you can find all details required to build the code into your Matlab programs.

Model for Dividends

It is a well-known opinion (based on many years of market observations) that one may expect a drop of the stock price within a short timeframe (e.g. a few days) after the day when the stock's dividend has been announced. And probably every quant, sooner or later, is tempted to verify that hypothesis. It's your homework (a small sketch follows the first chart below). However, today, let's look at a slightly differently defined problem based on the ever-working reversed rule: what goes down, must go up. Let's consider the SPDR S&P 500 ETF Trust, an exchange-traded fund listed on NYSE as SPY.

First, let’s pull out the Yahoo! data of adjusted Close prices of SPY from Jan 1, 2009 up to Aug 27, 2013

% Yahoo! Stock Data in Matlab and a Model for Dividend Backtesting
% (c) 2013 QuantAtRisk.com, by Pawel Lachowicz
 
close all; clear all; clc;
 
date_from=datenum('Jan 1 2009');
date_to=datenum('Aug 27 2013');
 
stock='SPY';
 
adjClose = fetch(yahoo,stock,'adj close',date_from,date_to);
div = fetch(yahoo,stock,date_from,date_to,'v')
returns=(adjClose(2:end,2)./adjClose(1:end-1,2)-1);
 
% plot adjusted Close price of SPY and mark days when dividends
% have been announced
plot(adjClose(:,1),adjClose(:,2),'color',[0.6 0.6 0.6])
hold on;
plot(div(:,1),min(adjClose(:,2))+10,'ob');
ylabel('SPY (US$)');
xlabel('Jan 1 2009 to Aug 27 2013');

and visualize them:

spy-1
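
As an aside, the 'homework' mentioned above (how the price behaves in the days right after a dividend announcement) can be sketched with the data we have just fetched; remember that Yahoo! returns the rows in reverse chronological order, so 'k days later' means a smaller row index:

% a sketch of the 'homework': average k-day return after a dividend date
k=5; post=[];
for i=1:size(div,1)
    r=find(adjClose(:,1)==div(i,1));   % row of the announcement day
    if(~isempty(r) && r-k>=1)
        post=[post; adjClose(r-k,2)/adjClose(r,2)-1];
    end
end
mean(post)*100   % average k-day post-dividend return (%)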

Having the data ready for backtesting, let’s look for the most profitable period of time of buying-holding-and-selling SPY assuming that we buy SPY one day after the dividends have been announced (at the market price), and we hold for $dt$ days (here, tested to be between 1 and 40 trading days).

% find the most profitable period of holding SPY (long position)
neg=[];
for dt=1:40
 
buy=[]; sell=[];
for i=1:size(div,1)
    % find the dates when the dividends have been announced
    [r,c,v]=find(adjClose(:,1)==div(i,1));
    % mark the corresponding SPY price with blue circle marker
    hold on; plot(adjClose(r,1),adjClose(r,2),'ob');
    % assume you buy long SPY next day at the market price (close price)
    buy=[buy; adjClose(r-1,1) adjClose(r-1,2)];
    % assume you sell SPY in 'dt' days after you bought SPY at the market
    % price (close price)
    sell=[sell; adjClose(r-1-dt,1) adjClose(r-1-dt,2)];
end
 
% calculate profit-and-loss of each trade (excluding transaction costs)
PnL=sell(:,2)./buy(:,2)-1;
% summarize the results
neg=[neg; dt sum(PnL<0) sum(PnL<0)/length(PnL)];
 
end

If we now sort the results according to the percentage of negative returns (column 3 of neg matrix), we will be able to get:

>> sortrows(neg,3)
 
ans =
   18.0000    2.0000    0.1111
   17.0000    3.0000    0.1667
   19.0000    3.0000    0.1667
   24.0000    3.0000    0.1667
    9.0000    4.0000    0.2222
   14.0000    4.0000    0.2222
   20.0000    4.0000    0.2222
   21.0000    4.0000    0.2222
   23.0000    4.0000    0.2222
   25.0000    4.0000    0.2222
   28.0000    4.0000    0.2222
   29.0000    4.0000    0.2222
   13.0000    5.0000    0.2778
   15.0000    5.0000    0.2778
   16.0000    5.0000    0.2778
   22.0000    5.0000    0.2778
   27.0000    5.0000    0.2778
   30.0000    5.0000    0.2778
   31.0000    5.0000    0.2778
   33.0000    5.0000    0.2778
   34.0000    5.0000    0.2778
   35.0000    5.0000    0.2778
   36.0000    5.0000    0.2778
    6.0000    6.0000    0.3333
    8.0000    6.0000    0.3333
   10.0000    6.0000    0.3333
   11.0000    6.0000    0.3333
   12.0000    6.0000    0.3333
   26.0000    6.0000    0.3333
   32.0000    6.0000    0.3333
   37.0000    6.0000    0.3333
   38.0000    6.0000    0.3333
   39.0000    6.0000    0.3333
   40.0000    6.0000    0.3333
    5.0000    7.0000    0.3889
    7.0000    7.0000    0.3889
    1.0000    9.0000    0.5000
    2.0000    9.0000    0.5000
    3.0000    9.0000    0.5000
    4.0000    9.0000    0.5000

which indicates that the most optimal period of holding the long position in SPY equals 18 days. We can mark all trades (with the 18-day holding period) in the chart:

spy-2

where the trade open and close prices (according to our model described above) have been marked in the plot with black and red circle markers, respectively. Only 2 out of 18 trades (the PnL matrix) turned out to be negative, with losses of 2.63% and 4.26%. The complete distribution of profits and losses from all trades can be obtained in the following way:

figure(2);
hist(PnL*100,length(PnL))
ylabel('Number of trades')
xlabel('Return (%)')

returning

spy-3

Let’s make some money!

The above Matlab code delivers a simple application of the newest built-in connectivity with the Yahoo! server and the ability to download the stock data of our interest. We have tested the optimal holding period for SPY since the beginning of 2009 till now (a global uptrend). The same code can easily be used and/or modified to verify any period and any stock for which dividends have been released in the past. A fairly simple approach, though not too frequent in trading, provides us with some extra idea of how we can beat the market, assuming that the future is going to be/remain more or less the same as the past. So, let's make some money!

Modern Time Analysis of Black Swans


I decided to take the data analysis on Black Swan and Extreme Loss Modeling to the next level and examine the time distribution of extreme losses across the entire S&P 500 universe of traded stocks. Previously, we were interested, first, in finding the maximum loss among all trading days for a given stock in a specified time interval (stock life-time on the trading floor), secondly, in plotting the distribution of those extreme losses (found to be well fitted with the Gumbel distribution). The investigation when all those losses occurred in time can provide us with an extra vantage point, and if we are lucky researchers we can discover some new information on the population of Black Swans.

An excellent approach to data analysis is through time analysis. It is one of many, many data analytic techniques available, but it is the one closest to my heart and I would like to provide you with a sophisticated taste of easy-to-digest mathematics applied to financial time-series analysis. So what is data analysis about in our case? Can we code it, or should we be tempted by ready-to-use dedicated data analytic software? Both ways are accessible, but let's do it in a more educational way, taking a computer programming class in Matlab.

By time analysis we mean understanding the behaviour of a given quantity in time, e.g. the price of an asset traded in the exchange market. It is pretty straightforward that our eye looks for patterns and tries to classify them somehow. If we observe a repeating structure, or sequence, or shape $-$ that may help us to code its recognition within our trading platforms. The quest for the most fundamental pattern of hidden volatility in the data has an obvious candidate: a sine wave. Fairly painless to understand, covered by maths teachers in high schools, an 'up and down' approach, good moods versus bad moods, bear markets followed by bull markets. A sinusoidal behaviour is everywhere around us. Therefore it is so important to know how to find it!

In quantitative finance and risk modeling, periodicity or cyclicality constitutes an attractive and simple model of volatility. To be able to figure out what spectrum of characteristic frequencies our given data set conceals, we need to use a proper tool. In time-series analysis this tool is known as the periodogram. However, before starting to use it properly, it is essential to understand the theoretical background.

1. The Method of Period Detection

1.1. General Formulations and Model Orthogonality

In general, we learn from experiments by fitting data, $x$, with a model, $x_{\|}$. The data contain $n$ measurements and the model $n_{\|}$ free parameters. The consistency of the data with the model is measured by a function, $\Theta$, called a statistic. A given model, $x_{\|}$, using a given statistic (e.g., $\chi^2$), yields its particular value, $\Theta_1$. Various methods used in the analysis of time series differ both in their choice of the model and the statistic; hence they are difficult to compare directly. To enable such a comparison and to determine the significance of results, $\Theta$ is converted into the false alarm probability, $P_1$. This is done by considering a hypothetical situation, $H_1$, in which $x$ is pure white noise. Then each pair $(x_{\|},\Theta)$ corresponds to a certain cumulative probability distribution of $\Theta$, namely $P(n_{\|},n;\Theta)$, with $P_1$ being the tail probability that under the hypothesis $H_1$ the experiment yields $\Theta>\Theta_1$, i.e., $P_1(\Theta>\Theta_1) = 1-P(n_{\|},n;\Theta_1)$.

Up to here, we have just outlined the classical Neyman-Pearson procedure of statistics. The specific method for analysis of time series used here differs from those commonly encountered in astronomy only in the choices of $x_{\|}$ and $\Theta$. Then, our accounting for variance, correlation and multiple frequencies in calculating $P$ is dictated by the laws of statistics. The probabilities derived by us from the data are the false alarm probabilities. However, we also call them below just probabilities or significance levels.

We note that Fourier harmonics are not orthogonal in terms of the scalar product with weights at unevenly distributed observations. Certain statistical procedures employing classical probability distributions hold for orthogonal models only and fail in other cases. To avoid that, a popular variant of the power spectrum, the Lomb-Scargle periodogram (Lomb 1976; Scargle 1982; hereafter LS), $P_{\rm LS}(\nu)$, relies on a special choice of phase such that the sine and cosine functions become orthogonal:
$$
\begin{equation}
P_{\rm LS}(\nu) = A_{\rm LS} \left|\hat{x}_{\rm LS}(\nu) \right|^2 .
\end{equation}
$$ The square of the Fourier amplitude, $\hat{x}(\nu)$, takes the form:
$$
\begin{eqnarray}
\left|\hat{x}_{\rm LS}(\nu)\right|^2 & = & \left[ \sum_{k=1}^{n} (x_k-\bar{x})
\cos\left(2\pi\nu (t_k-\tau)\right) \right]^2 \nonumber \\
& & + \left[ \sum_{k=1}^{n} (x_k-\bar{x}) \sin\left(2\pi\nu (t_k-\tau)\right)
\right]^2 .
$$ The phase $\tau$ is defined as:
$$
\begin{equation}
\tan(4\pi\nu\tau) = \frac{\sum_{k=1}^{n} \sin(4\pi\nu t_k)}
{\sum_{k=1}^{n} \cos(4\pi\nu t_k)}.
\end{equation}
$$ where, as usual, we consider a time-series $\{x_k\}\ (k=1,…,n)$; $\bar{x}$ denotes the mean value being subtracted, and the discrete Fourier transform takes our signal from the time to the frequency domain. Originally, the normalization of the LS periodogram was proposed as $A_{\rm LS}=1/2\sigma^2$ in order to normalize to the level of white noise, but different variants are available as well.

In practice, some simplifications of the full LS periodogram are applied when we are interested in understanding the power density spectrum of the $|\hat{x}(\nu)|^2$ distribution. The following Matlab function allows us to obtain a modified LS solution for a time-series $[t,y]$:

% Function calculates the power density spectrum (periodogram) for
%  a time-series [t,y] based on the discrete Fourier transform. The
%  series is expected to be evenly sampled, but gaps are allowed.
%
% (c) 2013 QuantAtRisk.com, by Pawel Lachowicz
%
% Input:
%           t       : time vector   [1xn]
%           y       : time-series   [1xn]
%           dt      : average time-series sampling time
% Output:
%           freq    : frequency vector        [1xm]
%           pds     : power spectral density  [1xm]
 
function [freq,pds]=qarPDS(t,y,dt)
    n=length(t);                % number of data points
    m=floor(n/2);               % number of positive frequencies
    X=fft(y,n);                 % discrete Fourier transform
    Pyy=X.*conj(X);             % raw power
    freq=(1:m)*(1/(n*dt));      % frequency grid in physical units
    pds=Pyy(2:m+1);             % skip the zero-frequency (DC) term
    pds=pds(:).';               % return a row vector [1xm]
end

We will use it later on in the time analysis of our S&P 500 data of extreme losses across the whole universe of traded stocks. The function computes the periodogram for any time-series, so don't worry too much about the initial data units.
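As a quick sanity check of the function (a hypothetical test, not part of the original analysis), one may feed it with a pure sine wave and verify that the periodogram peaks at the injected frequency:

% sanity check of qarPDS on a synthetic, evenly sampled sine wave
dt=1;                          % sampling time (1 day)
t=0:dt:1023;                   % time vector
y=sin(2*pi*t/50);              % injected 50-day periodicity
[freq,pds]=qarPDS(t,y,dt);
[~,imax]=max(pds);
fprintf('peak at %.4f 1/d (expected %.4f 1/d)\n',freq(imax),1/50);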

Now, back to the first base. We may extend the approach used within LS method by employing Szego orthogonal trigonometric polynomials as model functions. A series of $n_{\|}=2N+1$ polynomials corresponds to the orthogonal combinations of the $N$ lowest Fourier harmonics (Schwarzenberg-Czerny 1996). Orthogonal series are optimal from the statistical point of view because, by virtue of the Fisher lemma (Fisz 1963; Schwarzenberg-Czerny 1998), they guarantee the minimum variance of the fit residuals for a given model complexity (given by $n_{\|}$). Szego polynomials are also convenient in computations since the least-square solution may be obtained using recurrence and orthogonal projections, resulting in high computational efficiency, with the number of steps $\propto N$ instead of $N^3$ for $N$ harmonics.

1.2. Variance, the AoV statistics, and model complexity


The LS method employs the sine as a model and the quadratic norm, $$\Theta_{\chi^2}=\|x-x_{\|}\|^2 , $$ as the statistic. The corresponding probability distribution is $\chi^2$ with 2 degrees of freedom. Prior to the use of the $\chi^2$ distribution, $\Theta_{\chi^2}$ has to be divided by the signal variance, $V$. However, $V$ is usually not known and has to be estimated from the data themselves. Then neither $\Theta_{\chi^2}$ and the variance estimate are independent, nor does their ratio follow the $\chi^2$ distribution, and this effect has to be accounted for. A simple way to do it is to apply the Fisher Analysis of Variance (AoV) statistic, $$\Theta\equiv (n-n_{\|}) \|x_{\|}\|^2/ (n_{\|}\|x - x_{\|}\|^2) .$$ Hence we call our method, involving the Szego polynomial model and the AoV statistic, the multi-harmonic analysis of variance or mhAoV periodogram (Schwarzenberg-Czerny 1996). The probability distribution is then the Fisher-Snedecor distribution, $F$, rather than $\chi^2$, and $P_1= 1-F(n_{\|},n_{\perp};\Theta)$ where $n_{\perp}=n-n_{\|}$. For everything else fixed, replacing $\chi^2$ with $F$ for $n=100$ yields an increase of $P_1(\chi^2)=0.001$ to $P_1(F)=0.01$. Thus, accounting for the unknown variance makes the mhAoV detection less significant, but more trustworthy. In this work $n$ is usually larger, for which the ratio $P_1(F)/P_1(\chi^2)$ reduces to a factor of several.
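To make the statistic less abstract, here is a minimal Matlab sketch of computing $\Theta$ and the corresponding single-trial false alarm probability; x, xfit and npar are hypothetical placeholders for the data vector, the fitted model and $n_{\|}$, not variables defined elsewhere in this post:

% AoV statistic and its single-trial false alarm probability (sketch)
n=length(x);                   % number of data points
nper=n-npar;                   % n_perp, residual degrees of freedom
Theta=(nper*sum(xfit.^2))/(npar*sum((x-xfit).^2));   % the formula above
P1=1-fcdf(Theta,npar,nper)     % Fisher-Snedecor tail probability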

Apart from the choice of the statistic, our method for $N=1$ differs from the LS one in that the average flux is subtracted in the latter (thus yielding $n_\|=2$) whereas a constant term is fitted in the former (which can often be of significant advantage, see Foster 1995). If the periodic modulation in the data differs significantly from a sinusoid (e.g., due to dips, eclipses, etc.), then our $N>1$ models account for that more complex shape and perform considerably better than the LS one.

1.3. Multiple trials

Probability can be assigned to a period found in data according to one of two statistical hypotheses. Namely, (i) one knows in advance the trial frequency, $\nu_0$ (from other data), and would like to check whether it is also present in a given data set or (ii) one searches a whole range, $\Delta\nu$, of $N_{\rm eff}$ frequencies and finds the frequency, $\nu$, corresponding to the most significant modulation. The two cases correspond to the probabilities $P_1$ and $P_{N_{\rm eff}}$ to win in a lottery after 1 and $N_{\rm eff}$ trials, respectively, i.e., they represent the false alarm probabilities in single and multiple experiments, respectively. They are related by
$$ P_{N_{\rm eff}}= 1-(1-P_1)^{N_{\rm eff}} .$$ Note that the hypothesis (ii) and the probability $P_{N_{\rm eff}}$ must always be employed in order to claim any new frequency in the object under study. The hypothesis (i) is rarely used. However, since $P_1\lt P_{N_{\rm eff}}$, it is the more sensitive one. For this reason, we advocate its use in situations where the modulation frequency is already known, and we aim at checking for its manifestation in the same object but in a new band, a new data set, etc. We stress that we do not use the hypothesis (i) to claim any new frequency.

An obstacle hampering the use of hypothesis (ii) is that no analytical method is known to calculate $N_{\rm eff}$. The number $N_{\rm eff}$ corresponds to independent trials, whereas the values of periodograms at many frequencies are correlated because of the finite width of the peaks, $\delta\nu$, and because of aliasing. For this reason Monte Carlo simulations have been used to determine $N_{\rm eff}$ (e.g., Paltani 2004). Here, we use a simple conservative estimate, $N_{\rm eff}= \min(\Delta\nu/\delta\nu, N_{\rm calc},n)$, where $N_{\rm calc}$ is the number of values at which the periodogram is calculated. The estimate is conservative in the sense that it corresponds to the upper limit on $P_{N_{\rm eff}}$, and thus the minimum significance of detection. This effect applies to all methods of period search (Horne & Baliunas 1986). In general, it may reduce the significance of a new frequency detection for large $N_{\rm eff}$, as $P_{N_{\rm eff}}\gg P_1$. In practice, it underscores the role of any prior knowledge, in a way similar to Bayesian statistics: with prior knowledge of the given frequency we are able to use hypothesis (i) to claim the detection with large significance (small $P_1$).
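For the record, a tiny numerical sketch of the conservative $N_{\rm eff}$ estimate and the resulting multi-trial probability; all numbers below are illustrative only and do not refer to our data set:

% conservative estimate of the number of independent trials (illustrative numbers)
DeltaNu=0.002;                     % searched frequency range
deltaNu=0.0001;                    % typical peak width, ~1/T
Ncalc=176; n=38;                   % computed frequencies, data points
Neff=min([DeltaNu/deltaNu, Ncalc, n])
% multi-trial false alarm probability for a single-trial P1
P1=0.01;
PNeff=1-(1-P1)^Neff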

1.4. Correlation length

The $P_1$, and other common probability distributions used to set the detection criteria, are derived under the assumption of the noise being statistically independent. Often this is not the case, as seen, e.g., in light curves of cataclysmic variables (CVs). Correlated noise, often termed red noise, obeys a different probability distribution than the standard $P_1$, and hence may have a profound effect. For example, noise with a Gaussian autocorrelation function (ACF), correlated over a time interval $\delta t$, yields a power spectrum with a Gaussian shape centered at $\nu=0$ and width $\delta\nu=1/\delta t$. It may be demonstrated that the net effect of the correlation on $P_1$ in the analysis of low-frequency processes is to decimate the number of independent observations by a factor $n_{\rm corr}$, the average number of observations in the correlation interval $\delta t$ (Schwarzenberg-Czerny 1991). Effectively, one should use $n_{\perp}/n_{\rm corr}$ and $\Theta/n_{\rm corr}$ instead of $n_{\perp}$ and $\Theta$ in calculating $P_1$. This result holds generally, for both least-squares and maximum-likelihood analyses of time series.

For independent observations, $m=2$ consecutive residuals have the same sign on average (e.g., Fisz 1963). Thus, counting the average length, $m$, of series of residuals of the same sign provides an estimate of the number of consecutive observations being correlated, $n_{\rm corr}$. Note that $m=n/l$ where $l$ is the number of such series (both positive and negative). For correlated observations, the average length of series with the same sign is $m=2n_{\rm corr}$, which allows us to calculate $n_{\rm corr}$.
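A minimal sketch of that estimate in Matlab; res is a hypothetical vector of fit residuals, not a variable defined elsewhere in this post:

% estimate n_corr from the average length of same-sign runs of residuals
s=sign(res(:)');               % signs of the residuals
l=sum(diff(s)~=0)+1;           % number of same-sign series (runs)
m=length(res)/l;               % average run length
ncorr=max(m/2,1)               % m = 2*n_corr for correlated observations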

Let $\Theta$ denote the Fisher-Snedecor statistics from the mhAoV periodogram (i.e. from Fourier series fit) computed for $n_{\|}=2N+1$ parameters, $n$ observations and $n_{\perp}=n-n_{\|}$ degrees of freedom. To account for $n_{\rm corr}$, we calculate $P_1$ as follows,
\begin{equation}
P_1=1-F\left(n_{\|},\frac{n_{\perp}}{n_{\rm corr}};
\frac{\Theta}{n_{\rm corr}}\right)=
I_{z}\left(\frac{n_{\perp}}{2n_{\rm corr}},\frac{n_{\|}}{2}\right)\label{P1},
\end{equation}
where $ z= n_{\perp}/(n_{\perp}+n_{\|}\Theta)$ and $I_z(a,b)$ is the incomplete (regularized) beta function (Abramowitz & Stegun 1971), see Schwarzenberg-Czerny 1998 and references therein. In the popular Mathematica (Wolfram 1996) that function is called BetaRegularized. In Matlab, the following function does the calculations for us:

% Function computes the mhAoV periodogram peak significance
% Usage: [P1,Pneff]=pneff(n,nh,ncorr,neff,theta)
 
function [P1,Pneff]=pneff(n,nh,ncorr,neff,theta)
    npar=2.0*nh+1;              % number of model parameters, n_par
    nper=n-npar;                % degrees of freedom, n_perp
    z=nper/(nper+npar*theta);
    a=nper/(2.0*ncorr);         % correlation length taken into account
    b=npar/2.0;
    P1=betainc(z,a,b);          % single-trial false alarm probability
    Pneff=1.0-(1.0-P1)^neff;    % multi-trial false alarm probability
end

In the following section we will apply both approaches, the modified Lomb-Scargle (LS) and the multi-harmonic AoV periodograms, to financial data, and we will discuss the potential consequences of the time analysis of the largest daily losses for stocks traded publicly within the S&P 500 universe. So buckle up, we are ready for take-off!

2. The Annual Migration Routes of Black Swans

A theory can be both beautiful and exhausting. So let's do some work to capture the beauty. Our goal is to re-analyze the data on extreme losses extracted previously within the Black Swan and Extreme Loss Modeling article. First, we extract the value of the maximum loss for each stock and store them in a Matlab matrix as follows:

% Modern Time Analysis of Black Swans among 
%  Traded Stocks in S&P 500
%
% (c) 2013 QuantAtRisk.com, by Pawel Lachowicz
 
 
clear all; close all; clc;
tic;
 
%% DATA READING AND VISUALIZATION
 
% read a list of stock names
StockNames=dataread('file',['sp500u.lst'],'%s','delimiter', '\n');
K=length(StockNames); % the number of stocks in the universe
% path to data files
path=['./SP500_1984_2011'];
 
data=[];
fprintf('data reading and preprocessing..\n');
for si=1:K
    % --stock name
    stock=StockNames{si};
    fprintf('%4.0f  %7s\n',si,stock);
    % --load data
    n=[path,'/',stock,'.dat'];
    % check for NULL and change to NaN (using 'sed' command
    % in Unix/Linux/MacOS environment)
    cmd=['sed -i ''s/NULL/NaN/g''',' ',n]; [status,result]=system(cmd);
    % construct FTS object for daily data
    FTS=ascii2fts(n,1,2);
    % fill any missing values denoted by NaNs
    FTS=fillts(FTS);
    % extract the close price of the stock
    cp=fts2mat(FTS.CLOSE,0);
    dd=fts2mat(FTS.CLOSE,1);
    % extract Matlab matrix containing value of maximum
    % loss per stock and corresponding day
    rtmp=cp(2:end)./cp(1:end-1)-1; % daily returns
    dtmp=dd(2:end,1); % time vector
    tmp{si}=[dtmp rtmp]; 
    [tmp1,tmp2]=min(tmp{si}(:,2)); % maximum loss 
    data=[data; dtmp(tmp2) tmp1];  % [time of maximum loss, loss value]
end
 
data=sortrows(data,1); % sort data according to time of loss occurrence

where, again, the required data files can be downloaded here as sp500u.zip (23.8 MB) and sp500u.lst, respectively.

The visualization of the collected data provides us with a new perspective on the time distribution of the maximum losses (Black Swan events) across the S&P 500 universe, as traded between 3-Jan-1984 and 8-Mar-2011:

plot(data(:,1),data(:,2)*100,'.-','color',[0.7 0.7 0.7])
xlim([min(data(:,1)) max(data(:,1)) ]);
xlabel('Time (days)');
ylabel('R_{min} (%)');

[Figure 1: time distribution of maximum daily losses across the S&P 500 universe]

All individual stock maximum losses have been depicted with dot markers. As we found within Gumbel distribution analysis, the expected value was -22.6% with a heavy tail extending up to nearly complete losses of $\sim$98%.

Changing the way we look at our data, we connect the dots and think about the data as a new time-series $x_i\ (i=1,…,n=954)$. From this standpoint we can continue our analysis in various directions. Let's have a look at one case in particular: annual average maximum losses as a function of time. Why? Such an approach has been suggested as interesting by McNeil, Frey, and Embrechts in their book Quantitative Risk Management, section 7.1.4, making use of the block maxima method in order to find return levels for stress losses. We turn this idea into practice by rebinning our time-series $\{x_i\}$ with a new time step of 252 (trading) days, utilizing the code published within my past post on Rebinning of Financial Time-Series, as follows:

[rx,ry]=rebin(data(:,1),data(:,2),[],252);
hold on; plot(rx,ry*100,'or');

and, allowing ourselves a little data profanity, with a gentle interpolation between the points:

xi=rx(1):0.01:rx(end);
rdatai=interp1(rx,ry,xi,'pchip');
rdatai=[xi; rdatai]';
hold on; plot(rdatai(:,1),rdatai(:,2)*100,'r-');

resulting in:

[Figure 2: annual (252-day) rebinned maximum-loss series with interpolation]

Next, based on non-interpolated data we compute the Fourier power spectrum (a modified LS periodogram) as follows:

% Periodogram
[freq,pds]=qarPDS(rx,ry,252);
 
figure(2);
plot(freq,pds);
xlabel('Frequency [1/d]');
ylabel('Power Spectrum');

which returns:

[Figure 3: modified LS periodogram of the rebinned series]

The periodogram is calculated on a fixed frequency grid with a frequency step of $\Delta\nu = 1/T = 0.000104$ [1/d]. The peak of highest power corresponds to a sine modulation detected in the time-series whose period is equal to $1/0.001462$ or 684 days. The maximal allowed frequency is the Nyquist frequency of $1/(2\Delta t)$ or 0.00198 [1/d]. Honestly, the plot is terrible. To improve its quality it is allowed in spectral analysis of time-series to apply over-sampling in frequency, i.e. to adopt a frequency grid of computations with a step of $\Delta\nu = 1/(kT)$ where $k$ denotes the over-sampling (an integer) factor. Why do we need the over-sampling? One of the reasons is: to find the value of periodicity as accurately as possible.
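A hedged sketch of what over-sampling in frequency means in practice: instead of relying on the natural fft grid, the periodogram can be evaluated explicitly on a $k$-times finer grid. This sketch reuses the rebinned rx, ry vectors from the code above; it is not the code used to produce the figures:

% modified periodogram on a k-times over-sampled frequency grid (sketch)
k=5;                             % over-sampling factor
T=max(rx)-min(rx);               % total time span of the rebinned series
dnu=1/(k*T);                     % over-sampled frequency step
nu=dnu:dnu:1/(2*252);            % up to the Nyquist frequency (dt = 252 d)
yc=ry(:).'-mean(ry);             % mean-subtracted series
tt=rx(:).';
P=zeros(size(nu));
for j=1:length(nu)
    P(j)=abs(sum(yc.*exp(-2*pi*1i*nu(j)*tt)))^2;   % DFT power at nu(j)
end
figure; plot(nu,P); xlabel('Frequency [1/d]'); ylabel('Power Spectrum');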

Let's see how the mhAoV periodogram copes with this task in practice. The source codes for mhAoV can be downloaded directly from Alex's webpage (available in Fortran 95 and Python), though I still make use of our old version, executable directly in a Unix/Linux environment: aov. Let's first store the rebinned data (with a 252 d step) in an external file, rdata.dat:

rdata=[rx ry];
fn=['./rdata.dat'];
fid=fopen(fn,'wt');
fprintf(fid,'%f %f\n',rdata');
fclose(fid);

and, next, let’s compute aov periodogram:

./aov -f=0.000104,0.002 -nh=1 -fos=5 rdata.dat 
 
mhAov periodogram, by Alex Schwarzenberg-Czerny
ver. 27.9.2006 updated by Pawel Lachowicz
 
datname=rdata.dat trfname=rdata.trf maxname=rdata.max nobs=38
method=ORTAOV nh=1 nbf=20 fos=5.00
frstart=0.0001040 frend=0.0019809 frstep=0.0000107 nfr=176
  frequency    period     theta    quality
  0.0014747 678.1132747    5.53729   0.743
  0.0013152 760.3542866    1.39906   0.146
  0.0016301 613.4435376    1.37416   0.138
  0.0003351 2984.5922602   1.30742   0.116
  0.0001733 5771.4262041   1.22450   0.088
  0.0011538 866.7094426    1.12090   0.050

i.e. employing a model which contains a single sinusoid (nh=1) and adopting over-sampling in frequency with $k=5$ (fos). It turns out that the highest value of the mhAoV statistic, $\Theta=5.537$, corresponds to a period of 678 days.

Fitting the annual data with the model defined as:
$$
f(t) = c + A\sin(2\pi t/P - g\pi)
$$ we find, for $P_1=678$ d, the amplitude estimate $A_1=0.12$ and phase $g_1=0.79$, and we over-plot the best fit as follows:

[Figure 4: the rebinned series with the best-fit 678-day sinusoid over-plotted]
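For the record, a minimal sketch of how such amplitude and phase estimates could be obtained in Matlab with the period fixed at the mhAoV value; this is a hypothetical least-squares fit over the rebinned rx, ry data, not the original fitting code:

% least-squares fit of f(t)=c+A*sin(2*pi*t/P-g*pi) with P fixed at 678 d (sketch)
P=678;
f=@(p,t) p(1)+p(2)*sin(2*pi*t/P-p(3)*pi);        % p=[c A g]
sse=@(p) sum((ry(:)-f(p,rx(:))).^2);             % sum of squared errors
pbest=fminsearch(sse,[mean(ry) 0.1 0.5]);        % crude starting point [c A g]
hold on; plot(rx,f(pbest,rx)*100,'g-');          % over-plot the fit (in %)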

This model is not perfect but it delivers a good taste of a concealed periodic pattern following a sinusoidal model in about 60% of the time between 1984 and 2011. This is an interesting result, though the computation of:

>> [P1,Pneff]=pneff(38,1,1.3,7.1,5.53729)
 
P1 =
    0.0138
Pneff =
    0.0941

points to a false alarm probability of about 9% for this periodicity, i.e. a detection of rather modest significance. This can be understood as a poor fit of the model to the complicated and variable shape of the annual changes in maximal losses across different traded stocks.

Some improvement of the model could be achieved by including a time-dependent amplitude, $A_1(t)$. In practice this can be extracted from wavelet analysis via computation of the continuous wavelet transform. If this approach is of interest to you, check the draft in the paper by Czerny et al. (2010), which I co-authored.

3. Conclusions

Was Joseph Fourier a birdwatcher? We don't know. But his approach to time-series analysis allowed us to check whether any periodic patterns in the annual migration (occurrences) of Black Swan events exist. With rather low significance we found a cyclical trend repeating every 678 days. Will that allow us to forecast the future and the next density of massive losses incurred by individual stocks? Well, now, equipped with the power tools of a modern approach to time analysis, we can always come back and verify our hypotheses.

References

The theory on the methods of period detection is based on the publication of Lachowicz et al. (2006). For deeper references to the source readings mentioned in the text, check the reference section inside the aforementioned publication.

GARCH(p,q) Model and Exit Strategy for Intraday Algorithmic Traders

Forecasting the future has always been a part of humans' untamed desire to possess. In the enjoyable Hollywood production of Next, Nicolas Cage, playing the character of Frank Cadillac, has the ability to see the future just up to a few minutes ahead. This allows him to take almost immediate actions to avoid the risks. Now, just imagine for a moment that you are an (algorithmic) intraday trader. What would you offer for a glimpse of knowing what is going to happen within the following couple of minutes? What sort of risk would you undertake? Is it really possible to deduce the next move on your trading chessboard? Most probably the best answer to this question is: it is partially possible. Why only partially? Well, even with the help of mathematics and statistics, obviously God didn't want us to know the future, putting his fingerprint into our equations and calling it a random variable. Smart, isn't it? Therefore, our goal is to work harder in order to guess what is going to happen next!


In this post I will shortly describe one of the most popular methods of forecasting future volatility in financial time-series, employing a GARCH model. Next, I will make use of 5-min intraday stock data of close prices to show how to infer the possible stock value in the next 5 minutes, using current levels of volatility in intraday trading. Ultimately, I will discuss an exit strategy from a trade based on a forecasted worst-case scenario (the stock price is forecasted to exceed the assumed stop-loss level). But first, let's warm up with some cute equations we cannot live without.

Inferring Volatility

Capturing and digesting volatility is somewhat like an art that does not attempt to represent external, recognizable reality but seeks to achieve its effect using shapes, forms, colors, and textures. The basic idea we want to describe here is the volatility $\sigma_t$ of a random variable (rv), e.g. an asset price, on day $t$ as estimated at the end of the previous day $t-1$. How to do it in the easiest way? It's simple. First, let's assume that the logarithmic rate of change of the asset price between two time steps is:
$$
r_{t-1} = \ln\frac{P_{t-1}}{P_{t-2}}
$$ which corresponds to the return expressed in percent as $R_{t-1}=100[\exp(r_{t-1})-1]$, and we will be using this transformation throughout the rest of the text. This notation leaves us with a window of opportunity to denote $r_t$ as an innovation to the rate of return under the condition that we are able, somehow, to deduce, infer, and forecast the future asset price $P_t$.

Using classical definition of a sample variance, we are allowed to write it down as:
$$
\sigma_t^2 = \frac{1}{m-1} \sum_{i=1}^{m} (r_{t-i}-\langle r \rangle)^2
$$ which is our forecast of the variance rate in the next time step $t$ based on the past $m$ data points, and $\langle r \rangle=m^{-1}\sum_{i=1}^{m} r_{t-i}$ is the sample mean. Now, if we examine a return series sampled every day, or hour, or minute, it is worth noticing that $\langle r \rangle$ is very small compared with the standard deviation of the changes. This observation pushes us a bit further in rewriting the estimation of $\sigma_t$ as:
$$
\sigma_t^2 = \frac{1}{m} \sum_{i=1}^{m} r_{t-i}^2
$$ where $m-1$ has been replaced with $m$ by adding an extra degree of freedom (equivalent to a maximum likelihood estimate). What is superbly important about this formula is the fact that it gives an equal weighting to every value of $r_{t-i}^2$ (each term can be imagined as multiplied by one, with the common factor $m^{-1}$ in front). But in practice, we may wish to associate some weights $\alpha_i$ as follows:
$$
\sigma_t^2 = \sum_{i=1}^{m} \alpha_i r_{t-i}^2
$$ where $\sum_{i=1}^{m} \alpha_i = 1$, which replaces the factor of $m^{-1}$ in the previous formula. If you think for a second about the idea of the $\alpha$'s, it is pretty straightforward to understand that every observation $r_{t-i}$ makes its own contribution to the overall value of $\sigma_t^2$. In particular, if we select $\alpha_i<\alpha_j$ for $i>j$, observations further in the past from the most current time of $t-1$ contribute less and less.
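A popular special case of such decreasing weights is the exponentially weighted scheme (the RiskMetrics-style choice of $\lambda=0.94$ is used below only as an illustration); a short sketch, assuming r is a vector of past returns with r(end) being the most recent one:

% exponentially decreasing weights alpha_i ~ lambda^(i-1), normalized to 1 (sketch)
lambda=0.94;                                     % decay factor
m=length(r);
alpha=(1-lambda)*lambda.^(0:m-1)/(1-lambda^m);   % sum(alpha)=1
sigma2_t=sum(alpha(:).*flipud(r(:)).^2);         % r_{t-1} gets the largest weight
sigma_t=sqrt(sigma2_t)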

In 1982 R. Engle proposed a tiny extension of discussed formula, finalized in the form of AutoRegressive Conditional Heteroscedasticity ARCH($m$) model:
$$
\sigma_t^2 = \omega + \sum_{i=1}^{m} \alpha_i r_{t-i}^2
$$ where $\omega$ is the weighted long-run variance taking its position with a weight of $\gamma$, such that $\omega=\gamma V$ and now $\gamma+\sum_{i=1}^{m} \alpha_i = 1$. What the ARCH model allows is the estimation of future volatility, $\sigma_t$, taking into account only the past $m$ weighted squared rates of return, $\alpha_i r_{t-i}^2$, and the additional parameter $\omega$. In practice, we aim at finding the weights $\alpha_i$ and $\gamma$ using the maximum likelihood method for a given return series $\{r_i\}$. This approach, in general, requires approximately $m>3$ in order to describe $\sigma_t^2$ efficiently. So, the question emerges: can we do much better? And the answer is: of course.

Four years later, in 1986, a new player entered the ring. His name was Mr T (Bollerslev) and he literally crushed Engle in the second round with an innovation of the Generalized AutoRegressive Conditional Heteroscedasticity GARCH($p,q$) model:
$$
\sigma_t^2 = \omega + \sum_{i=1}^{p} \alpha_i r_{t-i}^2 + \sum_{j=1}^{q} \beta_j \sigma_{t-j}^2
$$ which derives its $\sigma_t^2$ based on the $p$ past observations of $r^2$ and the $q$ most recent estimates of the variance rate. The inferred return is then equal to $r_t=\sigma_t\epsilon_t$ where $\epsilon_t\sim N(0,1)$, which leaves us with a rather pale and wry face, as we know what that truly means in practice! A sort of simplification meeting wide applause in financial circles is the GARCH(1,1) model:
$$
\sigma_t^2 = \omega + \alpha r_{t-1}^2 + \beta \sigma_{t-1}^2
$$ which derives its value based solely on the most recent update of $r$ and $\sigma$. If we think for a short while, GARCH(1,1) should provide us with a good taste of forecasted volatility when the past couple of returns were similar; however, its weakness emerges in moments of sudden jumps (shocks) in price changes, which cause overestimated volatility predictions. Well, no model is perfect.

Similarly as in the case of the ARCH model, for GARCH(1,1) we may use the maximum likelihood method to find the best estimates of the $\alpha$ and $\beta$ parameters, leading us to a long-run volatility of $[\omega/(1-\alpha-\beta)]^{1/2}$. It is usually achieved in an iterative process by maximizing the sum:
$$
\sum_{i=3}^{N} \left[ -\ln(\sigma_i^2) – \frac{r_i^2}{\sigma_i^2} \right]
$$ where $N$ denotes the length of the return series $\{r_j\}$ ($j=2,…,N$) available to us. There are dedicated algorithms for doing that and, as we will see later on, we will make use of one of them in Matlab.
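To make that objective function more tangible, here is a minimal sketch of the quantity being maximized for GARCH(1,1); r is a hypothetical return vector and the parameter values are illustrative trial values only:

% Gaussian log-likelihood kernel of GARCH(1,1) for trial parameters (sketch)
omega=2.4e-7; alpha=0.05; beta=0.90;     % illustrative trial values
N=length(r);
sig2=zeros(N,1); sig2(2)=var(r);         % seed the variance recursion
for i=3:N
    sig2(i)=omega+alpha*r(i-1)^2+beta*sig2(i-1);
end
LL=sum(-log(sig2(3:N))-r(3:N).^2./sig2(3:N))   % maximize over omega, alpha, beta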

For the remaining discussion on verification procedure of GARCH model as a tool to explain volatility in the return time-series, pros and cons, and other comparisons of GARCH to other ARCH-derivatives I refer you to the immortal and infamous quant’s bible of John Hull and more in-depth textbook by a financial time-series role model Ruey Tsay.

Predicting the Unpredictable


The concept of predicting the next move in an asset price based on a GARCH model appears to be thrilling and exciting. The only worry we may have, as already recognized, is the fact that the forecasted return value is $r_t=\sigma_t\epsilon_t$ with $\epsilon_t$ an rv drawn from the normal distribution $N(0,1)$. That implies $r_t$ to be an rv such that $r_t\sim N(0,\sigma_t)$. We are allowed to extend this model further to the attractive form of:
$$
r_t = \mu + \sigma_t\epsilon_t \ \ \ \sim N(\mu,\sigma_t)
$$ where by $\mu$ we will understand a simple mean over past $k$ data points:
$$
\mu = k^{-1} \sum_{i=1}^{k} r_{t-i} \ .
$$ Since we rather expect $\mu$ to be very small, its inclusion provides us with an offset in the modeling of the $r_t$ value. Okay, so what does it mean for us? A huge uncertainty in guessing where we are going to end up in the next time step. The greater the value of $\sigma_t$ inferred from the GARCH model, the wider the spread of possible values that $r_t$ will take. God's fingerprint. End of hopes. Well, not exactly.

Let's look at the bright side of our $N(\mu,\sigma_t)$ distribution. Remember, never give in too quickly! Look for opportunities! Always! So, the distribution has two tails. We know very well what is concealed in its left tail. It's the devil's playground of worst losses! Can a bad thing be a good thing? Yes! But how? It's simple. We can always compute, for example, the 5% quantile $Q$, which corresponds to finding the value of the rv $r_t$ for which there is a 5% or smaller chance that the actual value of $r_t$ will be smaller than or equal to $Q$ (a sort of equivalent of finding VaR). Having this opportunity, we may wish to design a test statistic as follows:
$$
d = Q + r’
$$ where $r'$ would represent the cumulative return of an open trading position. If you are an intraday (or algorithmic high-frequency) trader and you have a fixed stop-loss limit of $s$ set for every trade you enter, a simple verification of the logical outcome of the condition:
$$
d < s \ ? $$ provides you immediately with an attractive model for your decision to close your position in the next time step, based on the GARCH forecast derived at time $t-1$. All right, let's cut to the chase and see how that works in practice!

5-min Trading with GARCH Exit Strategy

In order to illustrate the whole theory of the GARCH approach and dancing at the edge of uncertainty of the future, we analyze the intraday 5-min stock data of Toyota Motor Corporation traded on the Tokyo Stock Exchange with the ticker TYO:7203. The data file TYO7203m5.dat contains a historical record of trading in the following form:

TYO7203
DATES      TIMES OPEN HIGH LOW  CLOSE VOLUME
07/8/2010  11:00 3135 3140 3130 3130  50900
07/8/2010  11:05 3130 3135 3130 3130  16100
07/8/2010  11:10 3130 3145 3130 3140  163700
...
01/13/2011 16:50 3535 3540 3535 3535  148000
01/13/2011 16:55 3540 3545 3535 3535  1010900

i.e. it is spanned between 8-Jul-2010 11:00 and 13-Jan-2011 16:55. Using Matlab language,

% Forecasting Volatility, Returns and VaR for Intraday 
%   and High-Frequency Traders
%
% (c) 2013 QuantAtRisk.com, by Pawel Lachowicz
 
clear all; close all; clc;
 
%% ---DATA READING AND PREPROCESSING
 
fname=['TYO7203m5.dat'];
% construct FTS object for 5-min data
fts=ascii2fts(fname,'T',1,2,[1 2]);
% sampling
dt=5; % minutes
 
% extract close prices
cp=fts2mat(fts.CLOSE,1);
 
% plot 5-min close prices for entire data set
figure(1);
plot(cp(:,1)-cp(1,1),cp(:,2),'Color',[0.6 0.6 0.6]);
xlabel('Days since 08-Jul-2010 11:00');
ylabel('TYO:7203 Stock Price (JPY)');

let's plot the 5-min close prices for the whole data set available:

[Figure 1: TYO:7203 5-min close prices for the whole data set]

Now, a moment of concentration. Let’s imagine you are trading this stock based on 5-min close price data points. It is late morning of 3-Dec-2010, you just finished your second cup of cappuccino and a ham-and-cheese sandwich when at 10:55 your trading model sends a signal to open a new long position in TYO:7203 at 11:00.

% ---DATES SPECIFICATION
 
% data begin on/at
date1='08-Jul-2010';
time1='11:00';
% last available data point (used for GARCH model feeding)
dateG='03-Dec-2010';
timeG='12:35';
% a corresponding data index for that time point (idxGARCH)
[idxGARCH,~,~]=find(cp(:,1)==datenum([dateG,' ' ,timeG]));
% enter (trade) long position on 03-Dec-2010 at 11:00am
dateT='03-Dec-2010';
timeT='11:00';
% a corresponding data index for that time point (idx)
[idx,~,~]=find(cp(:,1)==datenum([dateT,' ',timeT]));

You buy $X$ shares at $P_{\rm open}=3315$ JPY per share at 11:00 and you wait, observing the asset price every 5 min. The price first goes up, then starts to fall. You look at your watch; it is already 12:35. Following the trading data processing, which your Matlab does for you every 5 minutes,

 % ---DATA PREPROCESSING (2)
 
% extract FTS object spanned between the beginning of data
%  and last available data point
ftsGARCH=fetch(fts,date1,time1,dateG,timeG,1,'d');
cpG=fts2mat(ftsGARCH.CLOSE,1); % extract close prices (vector)
retG=log(cpG(2:end,2)./cpG(1:end-1,2)); % extract log return (vector)
 
figure(1); 
hold on; 
plot(cp(1:idxGARCH,1)-cp(1,1),cp(1:idxGARCH,2),'b') % a blue line in Figure 2
 
figure(2);
% plot close prices
x=1:5:size(cpG,1)*5;
x=x-x(idx+1); % 
plot(x,cpG(:,2),'-ok');
xlim([-20 110]);
ylim([3285 3330]);
ylabel('TYO:7203 Stock Price (JPY)'); 
xlabel('Minutes since 03-Dec-2010 11:00');
% add markers to the plot
hold on; %
plot(-dt,cpG(idx,2),'o','MarkerEdgeColor','g');
hold on; 
plot(0,cpG(idx+1,2),'o','MarkerFaceColor','k','MarkerEdgeColor','k');
% mark -0.5% stoploss line
xs=0:5:110;
stoploss=(1-0.005)*3315; % (JPY)
ys=stoploss*ones(1,length(xs));
plot(xs,ys,':r');

you re-plot the charts; here, for the data extracted at 3-Dec-2010 12:35,

[Figure 2: TYO:7203 close prices up to 3-Dec-2010 12:35 (blue) over the full history (grey)]

where the blue line marks the currently available price history of the stock (please ignore the grey line as it refers to future prices which, of course, we don't know at 3-Dec-2010 12:35; we use it here only to better understand the data feeding process in time, used next in the GARCH modeling), and

[Figure 3: price history around the trade, with the -0.5% stop-loss line]

which presents the price history with the time axis adjusted to start at 0 when the trade was entered (black filled marker), as suggested by the model 5 min earlier (green marker). Assuming that in our trading we have a fixed stop-loss level of -0.5% for every single trade, the red dotted line in the figure marks the corresponding stock price of the suggested exit action. As of 12:35, the actual trading price of TYO:7203 is still above the stop-loss. Great! Well, are you sure?

As we have been trading the stock since 11:00, every 5 min we recalculate the GARCH model based on the total data available, updated by a new innovation in the price series. Therefore, as of 12:35, our GARCH modeling in Matlab,

% ---ANALYSIS
 
% GARCH(p,q) parameter estimation
model = garch(3,3) % define model
[fit,VarCov,LogL,Par] = estimate(model,retG)
% extract model parameters
parC=Par.X(1) % omega
parG=Par.X(2) % beta (GARCH)
parA=Par.X(3) % alpha (ARCH)
% estimate unconditional volatility
gamma=1-parA-parG
VL=parC/gamma;
volL=sqrt(VL)
% redefine model with estimated parameters
model=garch('Constant',parC,'GARCH',parG,'ARCH',parA)
 
% infer 5-min variance based on GARCH model
sig2=infer(model,retG); % vector

runs,

model = 
    GARCH(3,3) Conditional Variance Model:
    --------------------------------------  
    Distribution: Name = 'Gaussian'
               P: 3
               Q: 3
        Constant: NaN
           GARCH: {NaN NaN NaN} at Lags [1 2 3]
            ARCH: {NaN NaN NaN} at Lags [1 2 3]
____________________________________________________________
   Diagnostic Information
 
Number of variables: 7
 
Functions 
Objective:                            @(X)OBJ.nLogLikeGaussian(X,V,E,Lags,...)
Gradient:                             finite-differencing
Hessian:                              finite-differencing (or Quasi-Newton)
 
Constraints
Nonlinear constraints:                do not exist
 
Number of linear inequality constraints:    1
Number of linear equality constraints:      0
Number of lower bound constraints:          7
Number of upper bound constraints:          7
 
Algorithm selected
   sequential quadratic programming
____________________________________________________________
   End diagnostic information
                                                          Norm of First-order
 Iter F-count            f(x) Feasibility  Steplength        step  optimality
    0       8   -2.545544e+04   0.000e+00                           1.392e+09
    1      60   -2.545544e+04   0.000e+00   8.812e-09   8.813e-07   1.000e+02
 
Local minimum found that satisfies the constraints.
 
Optimization completed because the objective function is non-decreasing in 
feasible directions, to within the selected value of the function tolerance,
and constraints are satisfied to within the selected value of the constraint tolerance.

and returns the best estimates of GARCH model’s parameters:

   GARCH(1,1) Conditional Variance Model:
    ----------------------------------------
    Conditional Probability Distribution: Gaussian
 
                                  Standard          t     
     Parameter       Value          Error       Statistic 
    -----------   -----------   ------------   -----------
     Constant    2.39895e-07   6.87508e-08        3.48934
     GARCH{1}            0.9     0.0224136        40.1542
      ARCH{1}           0.05     0.0037863        13.2055
 
fit = 
    GARCH(1,1) Conditional Variance Model:
    --------------------------------------  
    Distribution: Name = 'Gaussian'
               P: 1
               Q: 1
        Constant: 2.39895e-07
           GARCH: {0.9} at Lags [1]
            ARCH: {0.05} at Lags [1]

defining the GARCH(1,1) model of TYO:7203 on 3-Dec-2010 12:35 as follows:
$$
\sigma_t^2 = 2.39895\times 10^{-7} + 0.05r_{t-1}^2 + 0.9\sigma_{t-1}^2 \ \ .
$$ where the long-run volatility is $\sqrt{V}=0.0022$ (about $0.22\%$) at $\gamma=0.05$. Please note that, due to some computational mind tricks, we allowed ourselves to define the raw template of the GARCH model as GARCH(3,3), which Matlab's solver adjusted down to the resulting GARCH(1,1).
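As a sanity check (a sketch only, reusing parC, parA, parG, retG and sig2 from the code above), the one-step-ahead variance forecast follows directly from the GARCH(1,1) recursion and should agree with the output of the forecast function used below:

% one-step-ahead GARCH(1,1) variance forecast computed 'by hand' (sketch)
sig2_next=parC+parA*retG(end)^2+parG*sig2(end);
sig_next=sqrt(sig2_next)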

Based on the model and data available till $t-1=$ 12:35, we get a forecast of $\sigma_t$ value at $t=$ 12:40 to be

% forecast variance 1-step (5-min) ahead, sigma^2_t
sig2_t=forecast(model,1,'Y0',retG,'V0',sig2); % scalar
sig_t=sqrt(sig2_t) % scalar

$$
\sigma_t=0.002 \ (0.2\%) \ .
$$
Plotting the return series for our trade, at 12:35 we have:

% update a plot of 5-min returns
figure(3);
x=1:length(retG); x=x-idx; x=x*5;
plot(x,100*(exp(retG)-1),'ko-');  
xlim([-20 110]);
ylim([-0.8 0.8]);
xlabel('Minutes since 03-Dec-2010 11:00');
ylabel('R_t [%]');

[Figure 4: 5-min returns since the trade entry at 11:00]

Estimating the mean value of the return series by taking into account the last 12 data points (60 min),

% estimate mean return over passing hour (dt*12=60 min)
mu=mean(retG(end-12:end))

we get $\mu=-2.324\times 10^{-4}$ (in log terms), which allows us to define the model for the return at $t=$ 12:40 to be:
$$
R_t = \mu + \sigma_t\epsilon_t = -0.02324 + 0.2\epsilon \ \ (\mbox{in %}) \ , \ \ \ \epsilon\sim N(0,1)\ .
$$ The beauty of God's fingerprint denoted by $\epsilon$ we can understand better by running a simple Monte-Carlo simulation and drawing 1000 rvs $r_t=\mu+\sigma_t\epsilon$, which illustrate 1000 possible values of the stock return as predicted by the GARCH model!

% simulate 1000 rvs r_t = mu + sig_t*eps, i.e. ~ N(mu,sig_t)
r_t=mu+sig_t*randn(1000,1);

One of them could be the one realized at 12:40, but we don't know which one! However, as discussed earlier, we find the 5% quantile $Q$ of the $N(\mu,\sigma_t)$ distribution and mark this value with a red filled circle in the plot.

tmp=[(length(retG)-idx)*5+5*ones(length(r_t),1) 100*(exp(r_t)-1)];
hold on; plot(tmp(:,1),tmp(:,2),'x');
 
Q=norminv(0.05,mu,sig_t); 
Q=100*(exp(Q)-1);
 
hold on; plot((length(retG)-idx)*5+5,Q,'or','MarkerFaceColor','r');

Including 1000 possible realisations of $r_t$ (blue markers), the updated Figure 4 now looks like:

[Figure 5: Figure 4 updated with 1000 simulated returns (blue) and the 5% quantile Q (red)]

where we find $Q=-0.3447\%$. The current P&L for the open long position in TYO:7203,

% current P/L for the trade
P_open=3315; % JPY (3-Dec-2010 11:00)
cumRT=100*(cpG(end,2)/P_open-1) %  = r'

returns $r’=-0.3017\%$. Employing a proposed test statistic of $d$,

% test statistics (percent)
d=cumRT+Q

we get $d=r'+Q = -0.6464\%$, which, for the first time since opening the position at 11:00, breaches our stop-loss of $s=-0.5\%$. Therefore, based on the GARCH(1,1) forecast and the logical condition $d\lt s$ being true, our risk management engine sends out, at 12:35, an order to close the position at 12:40. The forecasted closing price,

% forecasted stock price under test (in JPY)
f=P_open*(1+d/100)

is estimated to be 3293.6 JPY, i.e. a loss of -0.6464%.

At 12:40 the order is executed at 3305 JPY per share, i.e. with a realized loss of -0.3017%.

The very last question you ask yourself is: using the above GARCH exit strategy, was luck in my favour or not? In other words, did the algorithm take the right decision or not? We can find that out only by letting time flow. As you come back from a super-quick lunch break, you sit down and look at the recent chart of TYO:7203 at 12:55,

[Figure 6: TYO:7203 at 12:55, after closing the position at 12:40]

and you feel relief, as the price went further down, making the algo's decision the right decision.

For reference, the red filled marker denotes the price at which we closed our long position with a loss of -0.3017% at 12:40, and the open red circle marks the price forecasted under the $d$-statistic as derived at 12:35, added for the completeness of the grand picture and the idea standing behind this post.

Black Swan and Extreme Loss Modeling

When I read Nassim Nicholas Taleb's book The Black Swan, my mind was captured by the beauty of extremely rare events and, concurrently, devastated by the message the book sent: the non-computability of the probability of consequential rare events using scientific methods (owing to the very nature of small probabilities). I rushed to the local library to find out what had been written on the subject. Surprisingly, I discovered the book by Embrechts, Kluppelberg & Mikosch on Modelling Extremal Events for Insurance and Finance, which appeared to me very inaccessible, loaded with heavy mathematical theorems and proofs, and a negligible number of practical examples. I left it on the shelf to gather dust for a longer while, until last week when a fresh drive to decompose the problem came back to me again.

In this article I will try to take you on a short but efficient journey through a part of classical extreme value theory, namely, fluctuations of maxima, and complete the story with easy-to-follow procedures on how one may run simulations of the occurrences of extremely rare losses in financial return series. Having this experience, I will briefly discuss the inclusion of the resulting model in future stock returns.

1. The Theory of Fluctuations of Maxima

Let's imagine we have rich historical data (time-series) of returns for a specific financial asset or portfolio of assets. A good and easy example is the daily rate of return, $R_t$, for a stock traded e.g. at the NASDAQ Stock Market,
$$
R_t = \frac{P_t}{P_{t-1}} - 1 \ ,
$$ where $P_t$ and $P_{t-1}$ denote the stock price on day $t$ and $t-1$, respectively. The longer the time coverage, the more valuable the information that can be extracted. Given the time-series of daily stock returns, $\{R_i\}\ (i=1,…,N)$, we can create a histogram, i.e. the distribution of returns. By a rare event or, more precisely here, a rare loss we will refer to the returns placed in the far left tail of the distribution. As an assumption, we also agree that $R_1,R_2,…$ is a sequence of iid non-degenerate rvs (random variables) with a common distribution function (df) $F$. We define the fluctuations of the sample maxima as:
$$
M_1 = R_1, \ \ \ M_n = \max(R_1,…,R_n) \ \mbox{for}\ \ n\ge 2 \ .
$$ That simply says that for any time-series $\{R_i\}$ there is one maximum corresponding to the rv (random variable) with the most extreme value. Since the main line of this post is the investigation of maximum losses in return time-series, we are entitled to think about negative values (losses) in terms of maxima (and therefore carry over the theoretical understanding) thanks to the identity:
$$
\min(R_1,…,R_n) = -\max(-R_1,…,-R_n) \ .
$$ The distribution function of maximum $M_n$ is given as:
$$
P(M_n\le x) = P(R_1\le x, …, R_n\le x) = P(R_1\le x)\cdots P(R_n\le x) = F^n(x)
$$ for $x\in\Re$ and $n\in\mbox{N}$.
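That identity is easy to verify numerically; a tiny Monte-Carlo sketch with iid standard normal returns, purely illustrative:

% numerical check of P(M_n <= x) = F^n(x) for iid N(0,1) returns (sketch)
n=20; x=1.5; K=1e5;
R=randn(K,n);                      % K samples of n iid returns
Mn=max(R,[],2);                    % sample maxima
[mean(Mn<=x) normcdf(x)^n]         % empirical probability vs F^n(x)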

What the extreme value theory first ‘investigates’ are the limit laws for the maxima $M_n$. The important question here emerges: is there somewhere out there any distribution which satisfies for all $n\ge 2$ the identity in law
$$
\max(R_1,…,R_n) = c_nR + d_n
$$ for appropriate constants $c_n>0$ and $d_n\in\Re$, or, simply speaking, which classes of distributions $F$ are closed under maxima? The theory next defines the max-stable distribution: a random variable $R$ is called max-stable if it satisfies the aforegoing relation for iid $R_1,…,R_n$. If we assume that $\{R_i\}$ is a sequence of iid max-stable rvs, then:
$$
R = c_n^{-1}(M_n-d_n)
$$ and one can say that every max-stable distribution is a limit distribution for maxima of iid rvs. That brings us to the fundamental Fisher-Tippett theorem, which says that if there exist constants $c_n>0$ and $d_n\in\Re$ such that:
$$
c_n^{-1}(M_n-d_n) \rightarrow H, \ \ n\rightarrow\infty\ ,
$$ then $H$ must be of the type of one of the three so-called standard extreme value distributions, namely: Fréchet, Weibull, and Gumbel. In this post we will only be considering the Gumbel distribution $G$ with the corresponding probability density function (pdf) $g$ given as:
$$
G(z;\ a,b) = e^{-e^{-z}} \ \ \mbox{for}\ \ z=\frac{x-a}{b}, \ x\in\Re
$$ and
$$
g(z;\ a,b) = b^{-1} e^{-z}e^{-e^{-z}} \ .
$$ where $a$ and $b$ are the location and scale parameters, respectively. Having defined the extreme value distribution and being now equipped with a better understanding of the theory, we are ready for a test drive over the daily roads of profits and losses in the trading markets. This is the moment which separates men from boys.

2. Gumbel Extreme Value Distribution for S&P500 Universe


As usual, we start with an entrée. Our goal is to find the empirical distribution of maxima (i.e. maximal daily losses) for all stocks belonging to the S&P 500 universe between 3-Jan-1984 and 8-Mar-2011. There were $K=954$ stocks traded within this period and their data can be downloaded here as the sp500u.zip file (23.8 MB). The full list of stock names is provided in the sp500u.lst file. Therefore, performing the data processing in Matlab, first we need to compute a vector storing the daily returns for each stock, and next find the corresponding minimal value $M_n$, where $n$ stands for the length of each return vector:

% Black Swan and Extreme Loss Modeling
%  using Gumbel distribution and S&P500 universe
%
% (c) 2013 QuantAtRisk, by Pawel Lachowicz
 
 
clear all; close all; clc;
tic;
 
%% DATA READING AND PREPROCESSING
 
% read a list of stock names
StockNames=dataread('file',['sp500u.lst'],'%s','delimiter', '\n');
K=length(StockNames); % the number of stocks in the universe
% path to data files
path=['./SP500u'];
 
fprintf('data reading and preprocessing..\n');
for si=1:K
    % --stock name
    stock=StockNames{si};
    fprintf('%4.0f  %7s\n',si,stock);
    % --load data
    n=[path,'/',stock,'.dat'];
    % check for NULL and change to NaN (using 'sed' command
    % in Unix/Linux/MacOS environment)
    cmd=['sed -i ''s/NULL/NaN/g''',' ',n]; [status,result]=system(cmd);
    % construct FTS object for daily data
    FTS=ascii2fts(n,1,2);
    % fill any missing values denoted by NaNs
    FTS=fillts(FTS);
    % extract the close price of the stock
    cp=fts2mat(FTS.CLOSE,0);
    % calculate a vector with daily stock returns and store it in
    % the cell array
    R{si}=cp(2:end)./cp(1:end-1)-1;
end
 
%% ANALYSIS
 
% find the minimum daily return value for each stock
Rmin=[];
for si=1:K
    Mn=min(R{si},[],1);
    Rmin=[Rmin; Mn];
end

Having that ready, we fit the data with the Gumbel function which (as we believe) describes the distribution of maximal losses in the S&P 500 universe best:

% fit the empirical distribution with Gumbel distribution and 
% estimate the location, a, and scale, b, parameter
[par,parci]=evfit(Rmin);
a=par(1); 
b=par(2);
 
% plot the distribution
x=-1:0.01:1;
hist(Rmin,-1:0.01:0);
h=findobj(gca,'Type','patch');
set(h,'FaceColor',[0.7 0.7 0.7],'EdgeColor',[0.6 0.6 0.6]);
h=findobj(gca,'Type','box');
set(h,'Color','k');
 
% add a plot of Gumbel pdf 
pdf1=evpdf(x,a,b);
y=0.01*length(Rmin)*pdf1;
line(x,y,'color','r'); box on;
xlabel('R_{min}');
ylabel(['f(R_{min}|a,b)']);
text(-1,140,['a = ',num2str(par(1),3)]);
text(-1,130,['b = ',num2str(par(2),3)]); 
xlim([-1 0]);

The maximum likelihood estimates of the parameters $a$ and $b$ and corresponding 95% confidence intervals we can find as follows:

>> [par,parci]=evfit(Rmin)
 
par =
   -0.2265    0.1135
 
parci =
   -0.2340    0.1076
   -0.2190    0.1197

That brings us to a visual representation of our analysis:

[Figure: distribution of maximal daily losses with the fitted Gumbel pdf]

This is a very important result, communicating that the expected value of the extreme daily loss is about -22.6%. However, the left tail of the fitted Gumbel distribution extends up to nearly -98%, although the probability of the occurrence of such a massive daily loss is rather low.

On the other hand, the expected value of -22.6% is surprisingly close to the trading down-movements in the markets on Oct 19, 1987 known as Black Monday when Dow Jones Industrial Average (DJIA) dropped by 508 points to 1738.74, i.e. by 22.61%!

3. Blending Extreme Loss Model with Daily Returns of a Stock

You probably wonder how we can include the results coming from the Gumbel modeling in the prediction of rare losses in the future daily returns of a particular stock. This can be done pretty straightforwardly by combining the best fitted model (pdf) for extreme losses with the stock's pdf. To do it properly we need to employ the concept of mixture distributions. Michael B. Miller in his book Mathematics and Statistics for Financial Risk Management provides a clear idea of this procedure. In our case, the mixture density function $f(x)$ can be denoted as:
$$
f(x) = w_1 g(x) + (1-w_1) n(x)
$$ where $g(x)$ is the Gumbel pdf, $n(x)$ represents the fitted stock pdf, and $w_1$ marks the weight (influence) of $g(x)$ in the resulting overall pdf.

In order to illustrate this process, let’s select one stock from our S&P500 universe, say Apple Inc. (NASDAQ: AAPL), and fit its daily returns with a normal distribution:

% AAPL daily returns (3-Jan-1984 to 11-Mar-2011)
rs=R{18};
figure(2);
hist(rs,50);
h=findobj(gca,'Type','patch');
set(h,'FaceColor',[0.7 0.7 0.7],'EdgeColor',[0.6 0.6 0.6]);
h=findobj(gca,'Type','box');
set(h,'Color','k');
% fit the normal distribution and plot the fit
[muhat,sigmahat]=normfit(rs)
x =-1:0.01:1;
pdf2=normpdf(x,muhat,sigmahat);
y=0.01*length(rs)*pdf2;
hold on; line(x,y,'color','r');
xlim([-0.2 0.2]); ylim([0 2100]);
xlabel('R');

[Figure: distribution of AAPL daily returns with the fitted normal pdf]

where the red line represents the fit with a mean of $\mu=0.0012$ and a standard deviation $\sigma=0.0308$.

We can obtain the mixture distribution $f(x)$ executing a few more lines of code:

% Mixture Distribution Plot
figure(3);
w1=0.001;  % enter your favorite value, e.g. 0.001
w2=1-w1;
pdfmix=w1*(pdf1*0.01)+w2*(pdf2*0.01);  % note: sum(pdfmix)=1 as expected
x=-1:0.01:1;
plot(x,pdfmix);
xlim([-0.6 0.6]);

It is important to note that our modeling is based on the $w_1$ parameter. It can be understood intuitively as follows. Let's say we choose $w_1=0.01$. That means the Gumbel pdf contributes 1% to the overall pdf. In the following section we will see that, if a random variable is drawn from the distribution given by $f(x)$, then $w_1=0.01$ simply means (not exactly, but to a sufficient approximation) that there is a 99% chance of drawing this variable from $n(x)$ and only a 1% chance of drawing it from $g(x)$. The next figure illustrates the dependence of $f(x)$ on $w_1$:

[Figure: dependence of the mixture pdf f(x) on w_1]

It is well visible that a selection of $w_1>0.01$ would be a significant contributor to the left tail, making it fat. This is not what is observed in the empirical distribution of daily returns for AAPL (and, in general, for the majority of stocks), therefore we rather expect $w_1$ to be much, much smaller than 1%.
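The interpretation of $w_1$ can also be cross-checked directly with a sketch in which each draw picks the Gumbel component with probability $w_1$ and the normal component otherwise; this is only an alternative illustration, not the inverse transform method used in the next section (it reuses a, b, muhat and sigmahat fitted above):

% draw from the mixture by random component selection (sketch)
w1=0.001; K=10080;
useG=rand(K,1)<w1;                              % pick the Gumbel component w.p. w1
rv=zeros(K,1);
rv(useG)=evrnd(a,b,sum(useG),1);                % extreme-value component draws
rv(~useG)=normrnd(muhat,sigmahat,sum(~useG),1); % normal component draws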

4. Drawing Random Variables from Mixture Distribution

A short break between the entrée and the main course we fill with a sip of red wine. Having a discrete form of $f(x)$, we would like to be able to draw a random variable from this distribution. Again, this is easy too. Following a general recipe, for instance the one given in Chapter 12.2.2 of Philippe Jorion's book Value at Risk: The New Benchmark for Managing Financial Risk, we wish to use the concept of the inverse transform method. In the first step we use the output (a random variable) coming from a pseudo-random generator drawing its rvs based on the uniform distribution $U(x)$. This rv is always between 0 and 1, and in the last step it is projected onto the cumulative distribution of our interest, $F(x)$, which in our case corresponds to the cumulative distribution of the $f(x)$ pdf. Finally, we read out the corresponding value on the x-axis: an rv drawn from the $f(x)$ pdf. Philippe illustrates that procedure more intuitively:

[Figure: Drawing a Random Variable Process]

This method works smoothly when we know the analytical form of $F(x)$. However, if this is not on the menu, we need to use a couple of technical skills. First, we calculate $F(x)$ based on $f(x)$. Next, we set a very fine grid for the $x$ domain, and we perform interpolation between the given data points of $F(x)$.

% find cumulative pdf, F(x)
figure(4);
s=0;
x=-1;
F=[];
for i=1:length(pdfmix);
    s=s+pdfmix(i);
    F=[F; x s];
    x=x+0.01;
end
plot(F(:,1),F(:,2),'k')
xlim([-1 1]); ylim([-0.1 1.1]);
% perform interpolation of cumulative pdf using very fine grid
xi=(-1:0.0000001:1);
yi=interp1(F(:,1),F(:,2),xi,'linear'); % use linear interpolation method
hold on; plot(xi,yi);

The second sort of difficulty is in finding a good match between the rv drawn from the uniform distribution and the approximated value of our $F(x)$. That is why a very fine grid is required, supplemented with some matching techniques. The following code that I wrote deals with this problem pretty efficiently:

% draw a random variable from f(x) pdf: xi(row)
tF2=round((round(100000*yi')/100000)*100000);
RV=[];
for k=1:(252*40)
    notok=false;
    while(~notok)
        tU=round((round(100000*rand)/100000)*100000);
        [r,c,v]=find(tF2==tU);
        if(~isempty(r))
            notok=true;
        end
    end
    if(length(r)>1)
        rv=round(2+(length(r)-2)*rand);
        row=r(rv); 
    else
        row=r(1);
    end
    % therefore, xi(row) is a number representing a rv
    % drawn from f(x) pdf; we store 252*40 of those
    % new rvs in the following matrix:
    RV=[RV; xi(row) yi(row)];
end
% mark all corresponding rvs on the cumulative pdf
hold on; plot(RV(:,1),RV(:,2),'rx');

Finally, as the main course, we get and verify the distribution of a large number of new rvs drawn from the $f(x)$ pdf. It is crucial to check whether our generating algorithm provides us with a uniform coverage across the entire $F(x)$ plot,

[Figure: cumulative distribution F(x) with the drawn rvs marked]

where, in order to get more reliable (statistically) results, we generate 10080 rvs which correspond to the simulated 1-day stock returns for 252 trading days times 40 years.

5. Black Swan Detection

A -22% collapse in the markets on Oct 19, 1987 served as the day when the name of the Black Swan event took its birth, or at least was reinforced, in the financial community. Are black swans extremely rare? It depends. If you live, for example, in Perth, Western Australia, you can see a lot of them wandering around. So what defines an extremely rare loss in the sense of a financial event? Let's assume, by definition, that by a Black Swan event we will understand a daily loss of 20% or more. If so, using the procedure described in this post, we are tempted to pass from the main course to dessert.

Our modelling concentrates on finding the most proper contribution of $w_1g(x)$ to the resulting $f(x)$ pdf. As the outcome of a few runs of Monte Carlo simulations with different values of $w_1$, we find that for $w_1=[0.0010,0.0005,0.0001]$ we detect in the simulations, respectively, 9, 5, and 2 events (rvs) displaying a one-day loss of 20% or more.
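The counting itself is a one-liner once the simulated rvs are in hand (a sketch, assuming the RV matrix from the code above):

% count the simulated 1-day losses of 20% or more (Black Swan candidates)
nBS=sum(RV(:,1)<=-0.20)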

Therefore, the simulated daily returns for AAPL, assuming $w_1=0.0001$, generate two Black Swan events in the distribution, i.e. one event per 5040 trading days, or one per 20 years:

[Figure: Black Swan events in the simulated distribution of future AAPL returns]

That result agrees quite well with what has been observed so far, i.e. including Black Monday in 1987 and the Flash Crash in intra-day trading on May 6, 2010 for some of the stocks.

Acknowledgements

I am grateful to Peter Urbani from New Zealand for directing my attention towards Gumbel distribution for modeling very rare events.

Simulation of Portfolio Value using Geometric Brownian Motion Model


Having in mind the upcoming series of articles on building a backtesting engine for algo-traded portfolios, today I decided to drop a short post on the simulation of a portfolio's realised profit and loss (P&L). In the future I will use some of the results obtained below for a discussion of the key statistics used in the evaluation of P&L at any time it is required by the portfolio manager.

Assuming that we trade a portfolio of any assets, its P&L can be simulated in a number of ways. One of the quickest methods is the application of the geometric Brownian motion (GBM) model with a drift in time of $\mu_t$ and a process standard deviation of $\sigma_t$ over its total time interval. The model takes the form:
$$
dS_t = \mu_t S_t dt + \sigma_t S_t dz
$$ where $dz\sim N(0,dt)$, i.e. the process has variance equal to $dt$ (the process is Brownian). Let $t$ be the present time and let the portfolio have an initial value of $S_t$ dollars. The target time is $T$, therefore the portfolio evaluation horizon is $\tau=T-t$, split into $N$ time steps. Since the GBM model assumes no correlation between the portfolio values on two consecutive days (in general, over time), by integrating $dS/S$ over a finite interval we get a discrete change of the portfolio value:
$$
\Delta S_t = S_{t-1} (\mu_t\Delta t + \sigma_t\epsilon \sqrt{\Delta t}) \ .
$$ For simplicity, one can assume that both parameters of the model, $\mu_t$ and $\sigma_t$, are constant over time, and that the random variable $\epsilon\sim N(0,1)$. In order to simulate the path of the portfolio value, we go through $N$ iterations following the formula:
$$
S_{t+1} = S_t + S_t(\mu_t\Delta t + \sigma_t \epsilon_t \sqrt{\Delta t})
$$ where $\Delta t$ denotes a local volatility step, defined here as $\sigma_t/\sqrt{N}$, and $t=1,…,N$.
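As a side note, the SDE above can also be integrated exactly when $\mu_t$ and $\sigma_t$ are constant (a standard GBM result, not used in the discretised simulation below), giving the log-normal solution:
$$
S_T = S_t \exp\left[ \left(\mu_t - \tfrac{1}{2}\sigma_t^2\right)\tau + \sigma_t\sqrt{\tau}\,\epsilon \right], \ \ \epsilon\sim N(0,1) \ ,
$$ which may serve as a cross-check of the simulated end-of-horizon values.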

Example

Let's assume that the initial portfolio value is $S_1=\$10,000$ and that it is traded over 252 days. We allow the underlying process to have a drift of $\mu_t=0.05$ and an overall volatility of $\sigma_t=5\%$, both constant over time. Since the simulation in each of the 252 steps depends on $\epsilon$ drawn from the normal distribution $N(0,1)$, we can obtain any number of possible realisations of the simulated portfolio value path.

Coding quickly the above model in Matlab,

mu=0.05;      % drift
sigma=0.05;   % std dev over the total interval
S1=10000;     % initial capital ($)
N=252;        % number of time steps (trading days)
K=1;          % number of simulations
 
dt=sigma/sqrt(N);   % local volatility
 
St=[S1];
for k=1:K
    St=[S1];
    for i=2:N
        eps=randn; 
        S=St(i-1)+St(i-1)*(mu*dt+sigma*eps*sqrt(dt));
        St=[St; S];
    end
    hold on; plot(St);
end
xlim([1 N]);
xlabel('Trading Days')
ylabel('Simulated Portfolio Value ($)');

leads us to one of the possible realisations of the process,

[Figure: one realisation of the simulated portfolio value over 252 trading days]

not too bad at all, with an annual rate of return of about 13%.

The visual inspection of the path suggests no major drawdowns and low volatility, therefore pretty good trading decisions made by the portfolio manager or the trading algorithm. Of course, the simulation does not tell us anything about the model, the strategy involved, etc., but the result we obtained is sufficient for a further discussion of portfolio key metrics and VaR evolution.
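For the record, the realised statistics of the plotted path can be pulled out directly from the St vector generated above (a minimal sketch):

% realised statistics of the simulated path (St from the code above)
r=St(2:end)./St(1:end-1)-1;        % simulated daily returns
retAnnual=St(end)/St(1)-1          % realised return over the 252 days
volAnnual=std(r)*sqrt(252)         % annualised volatility of the path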

VaR and Expected Shortfall vs. Black Swan


It is one of the most fundamental approaches to measuring risk, and it is truly worth revising its calculation. Value-at-Risk (VaR). A magical number quoted at the end of the day by the banks' financial risk managers, portfolio risk managers, and risk managers simply concerned that the expected risk threshold will not, hopefully, be exceeded on the next day. What is so magical about it? By definition, given a specific time horizon and taking into account the last $N$ days of our trading activity, we can always calculate a number which provides a simple answer to the question: if something really unexpected happened on the next day, what is the loss we would not expect to exceed at a given confidence level? Well, welcome to the world of VaR!

More formally, given a daily (or weekly, monthly, etc.) distribution of returns (of, for example, a single stock, a portfolio of assets, etc.), we are interested in finding the value $VaR_\alpha$ for, say, $\alpha=0.05$ (five percent), which tells us that there is a 5% chance that $VaR_{0.05}$ will be exceeded in trading on the next day. This value is located in the left tail of the distribution and, by convention, it is quoted as a positive number (its absolute value). Therefore, we define $VaR_\alpha$ through:
$$
P[X\le VaR_\alpha] = \alpha
$$ where $X$ is a random variable which can, as in our assumed case, represent a 1-day return on a traded asset or a portfolio of assets.

Beautiful! We have the risk threshold provided as a number. But in practice, how should we calculate $VaR_\alpha$? Here some of the shortcomings of VaR come into the spotlight. VaR depends on the time-frame of the daily returns we examine: $VaR_\alpha$ as a number will be different if you include the last 252 trading days in the estimation compared with the last 504 trading days (2 years). Intuitively, the more data (more days, years, etc.) we have, the better the estimation of $VaR_\alpha$, right? Yes and no. Keep in mind that the dynamics of the market change continuously. In the early '80s there was no high-frequency trading (HFT). It came to dominate trading traffic after 2000 and now HFT plays a key role (an influential factor) in making an impact on prices. So it would be a bit unfair to compare $VaR_\alpha$ estimated in the '80s with what is going on right now. The parametric $VaR_\alpha$ assumes that the distribution of returns is normal: not the most realistic assumption, but the most straightforward one for computing $VaR_\alpha$. Should we use $VaR_\alpha$? Yes, but with caution.
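As a teaser of what follows, the parametric (normal) VaR boils down to two sample estimates and one quantile (a minimal sketch, assuming R is a vector of daily returns and the Statistics Toolbox is available):

% parametric 1-day VaR at the 95% confidence level (alpha=0.05)
alpha=0.05;
mu=mean(R); sig=std(R);
VaR=-(mu+sig*norminv(alpha))   % quoted, by convention, as a positive number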

Okay then, having $VaR_\alpha$ calculated, we know how far the loss could reach at the $1-\alpha$ confidence level. The next super important question in risk management that every risk manager should ask, or at least be interested in reporting, is: if the $VaR_\alpha$ event occurs on the next day, what is the loss we might expect to suffer (say, in dollars)? $VaR_\alpha$ is the threshold. Now we are interested in the expected value of the loss given the exceedance. That is the definition of the expected shortfall, and it is based on the concept of conditional probability as follows:
$$
E[X\ |\ X\le VaR_\alpha] = ES \ .
$$ In general, if our daily distribution of returns can be described by a function $f(x)$ representing its probability density function (pdf), then:
$$
ES = \frac{1}{\alpha} \int_{-\infty}^{VaR_\alpha} xf(x)dx \ .
$$
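For completeness, when $f(x)$ is the normal density $N(\mu,\sigma^2)$, the integral above has a well-known closed form:
$$
ES = \mu - \sigma\,\frac{\phi\!\left(\Phi^{-1}(\alpha)\right)}{\alpha} \ ,
$$ where $\phi$ and $\Phi$ denote the standard normal pdf and cdf, respectively.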

Given any data set of returns, $R_t\ (t=1,…,N)$, calculated from our price history,
$$
R_t = \frac{P_{t}}{P_{t-1}} -1
$$ both numbers, $VaR_\alpha$ and $ES$, can in fact be calculated in two ways. The first method is based on the empirical distribution, i.e. using the data as given. Let $H_i$ denote the normalized histogram of $R_t$ (i.e., its discrete integral equals 1) with $M$ bins centred at $h_i$. Then $VaR_\alpha$ is the bin value at which the cumulative sum of the histogram first reaches $\alpha$:
$$
VaR_\alpha = h_{i^\ast} \ \ \mbox{where}\ \ i^\ast = \min\left\{ i:\ \sum_{j=1}^{i} H_j(h_{j+1}-h_j) \ge \alpha \right\} \ .
$$ Similarly for $ES$, we get:
$$
ES = \frac{1}{\alpha}\sum_{i=1}^{i^\ast} h_iH_i(h_{i+1}-h_i) \ .
$$ The second method is based on the same integrations, given the best fit of a normal distribution $f(x)$ to the histogram of $R_t$. As we will see in the practical example below, both approaches return different values, and extra caution should be taken while reporting the final risk measures.
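Before we dive into the histogram-based method used below, note that the empirical $VaR_\alpha$ can also be cross-checked in one line with the sample quantile (a sketch, assuming R holds the daily returns and the Statistics Toolbox is available):

% empirical 1-day 95% VaR as the 5th percentile of the return sample
VaR_emp=-prctile(R,5)    % quoted as a positive number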

Case Study: IBM Daily Returns in 1987

The year 1987 went down in history as one of the most remarkable in stock trading. Why? Simply because of the so-called Black Monday on Oct 19, when markets recorded losses of over 20% in a single trading session. Let's analyse the daily returns and their distribution for IBM stock, as seen by a risk manager after the close of the market on Thu, Dec 31, 1987. The risk manager is interested in the estimation of the 1-day 95% VaR threshold, i.e. $VaR_{0.05}$, and the expected shortfall if the exceedance event were to occur on the next trading day (Jan 4, 1988). Therefore, let's assume that our portfolio is composed solely of IBM stock and that we have allocated a capital of $C$ dollars to it as of Dec 31, 1987.

Using a record of historical trading, IBM.dat, of the form:

IBM
DATES 	      OPEN 	HIGH 	  LOW 	    CLOSE 	VOLUME
03-Jan-1984   0.000000 	0.000000  0.000000  30.437500 	0.000000
04-Jan-1984   0.000000 	0.000000  0.000000  30.968750 	0.000000
05-Jan-1984   0.000000 	0.000000  0.000000  31.062500 	0.000000
06-Jan-1984   0.000000 	0.000000  0.000000  30.875000 	0.000000
...
04-Mar-2011   0.000000 	0.000000  0.000000  161.830000 	0.000000
07-Mar-2011   0.000000 	0.000000  0.000000  159.930000 	0.000000

containing the close price of IBM stock, in Matlab we extract the data for the year 1987, then construct a vector of daily returns $R_t$ on the stock,

% Reading in the data
fn=['IBM.dat'];
ftsD=ascii2fts(fn,1,2);
range=[datenum('02-Jan-1987'),datenum('31-Dec-1987')];
drr=[datestr(range(1)),'::',datestr(range(2))];
fts=ftsD(drr);
data=fts2mat(fts,1);
 
% R_t vector
rdata=data(2:end,5)./data(1:end-1,5)-1;
 
% Plotting
figure(1);
subplot(2,1,1);
plot(data(:,1)-datenum('2-Jan-1987'),data(:,5));
xlim([0 365]); ylabel('IBM Close Price ($)');
subplot(2,1,2);
plot(data(2:end,1)-datenum('2-Jan-1987'),rdata);
xlim([0 365]); ylabel('R_t')
xlabel('t (days)')

and present the data graphically:

[Figure: IBM close price and daily returns R_t over 1987]

Having our data, let’s find $VaR_{0.05}$ and corresponding $ES$,

% Construct the normalized histogram
[H,h]=hist(rdata,100);
hist(rdata,100);
sum=0;
for i=1:length(H)-1
    sum=sum+(H(i)*(h(i+1)-h(i)));
end
figure(2)
H=H/sum;
bar(h,H,'FaceColor',[0.7 0.7 0.7],'EdgeColor',[0.7 0.7 0.7])
 
% Calculate VaR 95 based on empirical distribution
sum=0;
i=1;
while (i<length(H))
    sum=sum+(H(i)*(h(i+1)-h(i)));
    if(sum>=0.05) 
        break;
    end
    i=i+1;
end
VaR=h(i)
hold on; plot(h(i),0,'ro'); % mark VaR in the plot
 
% Fit the normal distribution to R_t data
[muhat,sigmahat] = normfit([rdata]');
Y = normpdf(h,muhat(1),sigmahat(1));
hold on; plot(h,Y,'b-'); % and plot the fit using blue line
 
% Find for the fitted N(muhat,sigmahat) VaR 95%
sum=0;
i=1;
while (i<length(h))
    sum=sum+(Y(i)*(h(i+1)-h(i)));
    if(sum>=0.05) 
        break;
    end
    i=i+1;
end
VaR_fit=h(i)
hold on; plot(h(i),0,'ro','markerfacecolor','r'); % mark VaR_fit
ivar=i;
 
% Calculate the Expected Shortfall 
%  based on fitted normal distribution
sum=0;
i=1;
while (i<=ivar)
    sum=sum+(h(i)*Y(i)*(h(i+1)-h(i)));
    i=i+1;
end
ES_fit=sum/0.05
hold on; plot(ES_fit,0,'ko','markerfacecolor','k');
 
% Add numbers to the plot
text(-0.23,25,['1-Day VaR and Expected Shortfall']);
text(-0.23,23.5,['as for 31/12/1987']);
text(-0.23,20,['VaR = ',num2str(VaR*-1,3),' (',num2str(VaR*-100,2),'%)']);
text(-0.23,18,['ES  = ',num2str(ES_fit*-1,3),' (',num2str(ES_fit*-100,2),'%)']);
text(-0.23,13,['\mu = ',num2str(muhat,2)]);
text(-0.23,11,['\sigma = ',num2str(sigmahat,2)]);
xlabel('R_t');

marking all the numbers in the resulting plot:

[Figure: normalized histogram of R_t with the fitted normal pdf; VaR and ES marked]

In the plot, the filled red and black markers correspond to $VaR_{0.05}$ and $ES$, respectively, calculated using the fitted normal distribution. Both values are displayed in the plot as well. In addition, the plot contains a red open-circle marker denoting the value of $VaR_{0.05}$ obtained from the discrete integration of the empirical histogram. Please note how the choice of the method influences the value of VaR.

The derived numbers tell the risk manager that in 95% of cases one should feel confident that the value of $VaR_{0.05}$ will not be exceeded in trading of IBM on Jan 4, 1988 (the next trading day). $C\times VaR_{0.05} = 0.028C$ provides the loss amount (in dollars) at that threshold, and if such an undesired event did occur, the expected loss (in dollars) we would potentially suffer would be $C\times ES = 0.058C$.

Now it is easy to understand why the Black Swan event of Oct 19, 1987 falls into the category of the least likely moves in the markets. It places itself in the far left tail of both the empirical and the fitted distributions, denoting an impressive loss of 23.52% for IBM on that day. If we assume that the daily price changes are normally distributed, then the probability of a drop of at least 22 standard deviations is
$$
P(Z\le -22) = \Phi(-22) \approx 1.4 \times 10^{-107} \ .
$$ This is an astronomically rare event!
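One can verify that number directly in Matlab (Statistics Toolbox):

% tail probability of a 22-sigma drop under normality
normcdf(-22)    % approx 1.4e-107, in agreement with the value quoted above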

They say that lightning does not strike twice, but if a 1-day drop of the same magnitude had occurred again on Jan 4, 1988, our long position in IBM would have made us suffer not a loss of $0.058C$ dollars but approximately $0.23C$. Ouch!

Coskewness and Cokurtosis Computation for Portfolio Managers


Suppose you are responsible for the management of your company's current portfolio $P$ consisting of $N$ securities (e.g. equities) traded actively on NASDAQ. As a portfolio manager you are probably interested in portfolio performance, new asset allocation, and risk control. Therefore, your basic set of statistics should include the portfolio expected return, $\mu_P$, and the portfolio standard deviation, $\sigma_P$, as the two key numbers you want to calculate every time you evaluate your portfolio performance.

Coskewness, $s_P$, and cokurtosis, $k_P$, are two additional multivariate higher moments (co-moments) important in the asset allocation process and portfolio management. Similarly to the fundamental meaning of a sample's skewness and kurtosis, the coskewness and cokurtosis provide a portfolio manager with the ability to test the same portfolio under different compositions in order to facilitate the changes required to be introduced (e.g. the elimination of some allocations in badly performing securities causing a more pronounced negative coskewness).

In this post I will provide the basic matrix algebra covering the computational solution for the four portfolio statistics, with a Matlab code example as a practical implementation of the mathematics in action.

Expected Return on Portfolio

Let $P$ be our portfolio containing $N$ securities $K_i$ such that $P=\{K_1,K_2,…,K_N\}$ and $i=1,…,N$. Each security has a trading history for the last $T$ days (or months, or years) from which we compute its $T-1$ rates of return $r_t$ for $t=1,…,T-1$. Therefore, let us denote by $K_i$ a column vector of dimensions $(T-1,1)$ such that:
$$
K_i = [r_1,r_2,…,r_{T-1}]^T\ ,
$$
and portfolio matrix, $P$, containing all securities as:
$$
P =
[K_1\ |\ K_2\ |\ …\ |\ K_N ]
=
\left[
\begin{array}{cccc}
r_{1,1} & r_{1,2} & … & r_{1,N} \\
r_{2,1} & r_{2,2} & … & r_{2,N} \\
… & … & … & … \\
r_{T-1,1} & r_{T-1,2} & … & r_{T-1,N}
\end{array}
\right] \ .
$$ Our portfolio can be described in terms of its weights:
$$
w_i = \frac{y_iS_i(0)}{V(0)}
$$ where the numerator is the product of the number of shares and the stock purchase price, taken in ratio to the amount $V(0)$ initially invested in the portfolio. Rewriting it as a one-row matrix we have:
$$
w = [\ w_1\ \ w_2\ \ … \ \ w_N\ ]
$$ where, for obvious reasons, the sum of the individual weights is one, i.e. $1=uw^T$ with $u$ being a row vector of ones. For each security we can calculate its expected return $\mu_i=E(K_i)$ and form a vector:
$$
m = [\ \mu_1\ \ \mu_2\ \ … \ \ \mu_N\ ] \ .
$$
Finally, the expected return on portfolio $P$ is given as:
$$
\mu_P = mw^T \ ,
$$ which follows from:
$$
\mu_P = E(R_P) = E\left(\sum_{i=1}^{N} w_iK_i \right) = \sum_{i=1}^{N} w_i\mu_i = mw^T \ .
$$

Portfolio Variance


By evaluating the variance, we are able to show that the portfolio variance can be expressed as:
$$
\sigma_P^2 = w M_2 w^T
$$
where $M_2$ is the covariance matrix $(N,N)$ with the individual covariances denoted by $c_{ij}=Cov(K_i,K_j)$. At this point, each element of $M_2$ could be calculated as $c_{ij}=\rho_{ij}\sigma_i\sigma_j$, i.e. with a knowledge of coefficient of correlation, $\rho_{ij}$, between the securities $i$ and $j$, and corresponding standard deviations calculated from the history provided by $P$ matrix.

Just as an educational remark, let me remind you that elements $c_{ij}$ can be obtained in the following way:
$$
c_{ij} = \frac{1}{(T-1)-1} \sum_{t=1}^{T-1} (K_{t,i}-\mu_i)(K_{t,j}-\mu_j) \ ,
$$ therefore
$$
M_2 =
\left[
\begin{array}{cccc}
c_{11} & c_{12} & … & c_{1N} \\
c_{21} & c_{22} & … & c_{2N} \\
… & … & … & … \\
c_{N1} & c_{N2} & … & c_{NN}
\end{array}
\right] .
$$
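As a practical aside, once the return matrix $P$ of dimensions $(T-1,N)$ sits in the Matlab workspace, $M_2$ is available in one line:

% sample covariance matrix of the (T-1) x N return matrix P
M2=cov(P);    % uses the 1/((T-1)-1) normalisation, as in the formula above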

Portfolio Coskewness

It can be calculated as the product:
$$
s_P = w M_3 (w^T\otimes w^T)
$$
where the symbol of $\otimes$ denotes the Kronecker product between $w$ and $w$, and
$$
M_3 = E[(r-\mu)(r-\mu)^T\otimes(r-\mu)^T] = \{a_{ijk}\} \ , \\
a_{ijk} = E[(r_i-\mu_i)(r_j-\mu_j)(r_k-\mu_k)] \ \ \mbox{for}\ \ i,j,k=1,…,N.
$$
The coskewness matrix $M_3$ of dimensions $(N,N^2)$ can be represented by $N$ matrixes $A_{ijk}$ of dimensions $(N,N)$ such that:
$$
M_3 = [\ A_{1jk}\ A_{2jk}\ …\ A_{Njk}\ ]
$$
where $j,k=1,…,N$, with one $(N,N)$ block for each value of the index $i$. The individual elements of the matrix can be obtained as:
$$
a_{ijk} = \frac{1}{T-1} \sum_{t=1}^{T-1} (K_{t,i}-\mu_i)(K_{t,j}-\mu_j)(K_{t,k}-\mu_k) \ .
$$

Portfolio Cokurtosis

Cokurtosis can be obtained by calculation of the following products:
$$
k_P = w M_4 (w^T\otimes w^T\otimes w^T)
$$ where
$$
M_4 = E[(r-\mu)(r-\mu)^T\otimes(r-\mu)^T\otimes(r-\mu)^T] = \{b_{ijkl}\} \ , \\
b_{ijkl} = E[(r_i-\mu_i)(r_j-\mu_j)(r_k-\mu_k)(r_l-\mu_l)] \ \ \mbox{for}\ \ i,j,k,l=1,…,N.
$$
The cokurtosis matrix $M_4$ of dimensions $(N,N^3)$ can be represented by $N^2$ matrixes $B_{ijkl}$ of dimensions $(N,N)$ such that:
$$
M_4 = [\ B_{11kl}\ B_{12kl}\ …\ B_{1Nkl}\ |\ B_{21kl}\ B_{22kl}\ …\ B_{2Nkl}\ |\ …\ |\ B_{N1kl}\ B_{N2kl}\ …\ B_{NNkl}\ ]
$$
where $k,l=1,…,N$. The individual elements of the matrix can be obtained as:
$$
b_{ijkl} = \frac{1}{T-1} \sum_{t=1}^{T-1} (K_{t,i}-\mu_i)(K_{t,j}-\mu_j)(K_{t,k}-\mu_k)(K_{t,l}-\mu_l) \ .
$$
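Before jumping to the loop-based example below, it is worth noting that both co-moment matrices can also be built in a vectorised way via row-wise Kronecker products (a sketch, assuming $P$ is the $(T-1,N)$ return matrix and $w$ the $1\times N$ weight vector; the block ordering of the flattened tensors differs from the one above, but since the co-moments are symmetric in all indices the contracted statistics $s_P$ and $k_P$ are unchanged):

[Tm1,N]=size(P);                          % Tm1 = T-1 observations, N securities
D=P-repmat(mean(P),Tm1,1);                % demeaned returns
M2v=(D'*D)/(Tm1-1);                       % covariance matrix, N x N
DD2=kron(D,ones(1,N)).*repmat(D,1,N);     % row-wise Kronecker product, Tm1 x N^2
M3v=(D'*DD2)/Tm1;                         % co-skewness matrix, N x N^2
DD3=kron(D,ones(1,N^2)).*repmat(DD2,1,N); % row-wise triple product, Tm1 x N^3
M4v=(D'*DD3)/Tm1;                         % co-kurtosis matrix, N x N^3
sP=w*M3v*kron(w',w')                      % portfolio coskewness
kP=w*M4v*kron(kron(w',w'),w')             % portfolio cokurtosis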

Example

Below I present a simple Matlab code which computes the four key portfolio statistics discussed above, based on $N=3$ securities traded within a given $P$. The choice of the portfolio content is not limited to equities: $P$ can contain any asset class under management, or even a mixture of them. Since the expected portfolio return and risk, $\mu_P$ and $\sigma_P$, respectively, are assumed to be expressed in percent, their transformation into a physical unit of money is straightforward.

% Program demonstrates the calculation of the four key portfolio statistics:
% the expected portfolio return (muP), portfolio standard deviation (sigP),
% portfolio co-skewness (sP), and portfolio co-kurtosis (kP).
%
% (c) 2013, QuantAtRisk.com, by Pawel Lachowicz
%
% Assumptions: * portfolio comprises 252 trading days, i.e. T=252-1 values
%                of daily returns
%              * daily returns are simulated from the normal distribution
%                with zero mean and unit standard deviation (in practice
%                one can use real market data)
%              * N=3, number of securities in portfolio P
%
% Remark: the code can be easily modified for the need of any N securities
%         considered in your portfolio P; it is not difficult to note that
%         the calculation of all four statistics can be performed by risk
%         managers who manage the portfolio P of portfolios P_i such that
%         P=[P_1 P_2 ... P_N]
 
clear all; clc; close all;
 
T=252;
 
% Simulated returns for 3 securities X,Y,Z expressed in [%]
X=random('normal',0,1,T-1,1);
Y=random('normal',0,1,T-1,1);
Z=random('normal',0,1,T-1,1);
 
% Portfolio matrix
P=[X Y Z];
 
% Portfolio weights for all securities (sum=1)
w=[0.2 0.5 0.3];
 
% m-vector containing the expected returns of securities
m=mean([P(:,1) P(:,2) P(:,3)]);
 
% Computation of M2, the covariance matrix
S=[];
for i=1:size(P,2)
    for j=1:size(P,2)
        u=0;
        for t=1:T-1
            u=u+((P(t,i)-mean(P(:,i)))*(P(t,j)-mean(P(:,j))));
        end
        S(i,j)=u/((T-1)-1);
    end
end
M2=S; %
 
% Computation of M3, the co-skewness matrix
M3=[];
for i=1:size(P,2)
    S=[];
    for j=1:size(P,2)
        for k=1:size(P,2)
            u=0;
            for t=1:T-1 
                u=u+((P(t,i)-mean(P(:,i)))*(P(t,j)-mean(P(:,j))) ...
                    *(P(t,k)-mean(P(:,k))));
            end
            S(j,k)=u/(T-1); 
        end
    end
    M3=[M3 S];
end
 
% Computation of M4, the co-kurtosis matrix
M4=[];
for i=1:size(P,2)
    for j=1:size(P,2)
        S=[];
        for k=1:size(P,2)
            for l=1:size(P,2)
                u=0;
                for t=1:T-1 
                    u=u+((P(t,i)-mean(P(:,i)))*(P(t,j)-mean(P(:,j)))* ...
                        (P(t,k)-mean(P(:,k)))*(P(t,l)-mean(P(:,l))));
                end
                S(k,l)=u/(T-1);
            end
        end
        M4=[M4 S];
    end
end
 
% Results
muP=m*w'                           % expected return on the portfolio [%]
sigP=sqrt(w*M2*w')                 % portfolio standard deviation [%]
sP=w*M3*kron(w',w')                % portfolio coskewness
kP=w*M4*kron(kron(w',w'),w')       % portfolio cokurtosis

Trend Identification for FX Traders (Part 2)


In my previous post I provided an introduction to trading model invention and design. We made use of FX data for the AUDUSD pair sampled hourly and split the data into weekly time-series.

Today, we will work a bit harder on formulating the very first rules for the model. This step will require engaging our creativity in understanding what the data look like, as well as building a basic set of rules which help us perform an attractive time-series classification. Our objective is to invent a method that classifies the most recent week of the FX pair's time-series as being either in a downtrend or in an uptrend.

The most naive way of classifying the directional information contained in any time-series is its slope: a straight line fit to the data. Let's use it as our starting point. Instead of fitting all data points for a given week, we find the median values of the first and the last 12 data points, both in Time $(x1,x2)$ and in Pair Ratio $(y1,y2)$, as computed in the code below:

% FX time-series analysis
% (c) Quant at Risk, 2012
%
% Part 2: Classification of weekly time-series
 
close all; scrsz = get(0,'ScreenSize');
h=figure('Position',[70 scrsz(4)/2 scrsz(3)/1.1 scrsz(4)/2],'Toolbar','none');
fprintf('\nuptrend/downtrend identification.. ');
% for viewing use the loop
hold off;
set(0,'CurrentFigure',h);
 
% pre-define variables
trend=zeros(nw,1);
slope=zeros(nw,1);
midp={};  % middle points
endp={};  % end points (median based on last 12 points)
 
for i=1:nw  %--- a loop over total number of weeks available
 
    % reading time-series for a current week 
    w=week{i}; 
    x=w(:,1); y=w(:,2);
 
    % plot the time-series
    hold on; plot(x,y,'k');
 
    % linear trend estimation
    x1=median(x(1:12)); x2=median(x(end-11:end));
    y1=median(y(1:12)); y2=median(y(end-11:end));
 
    % define end-point of the time-series and mark it on the plot
    endp{i}=[x2 y2];
    hold on; plot(endp{i}(1),endp{i}(2),'b*');
 
    % find slope
    m=(y2-y1)/(x2-x1);
    slope(i)=m;
    xl=x1:dt:x2;       
    yl=m*xl-m*x2+y2;   % the line representing the slope
    hold on; plot(xl,yl,'b:');
 
    % find middle point of the line and mark it on the plot
    mx=mean(xl);
    my=mean(yl);
    midp{i}=[mx my];
    hold on; plot(midp{i}(1),midp{i}(2),'bo');

As an example of the code execution, for the first two weeks we plot slopes, mid-points and end-points:

We assume that our classification procedure will be based solely on the information provided by the end-points and slopes. We make an exception for the classification of the first two weeks. For the first week, the distinction between an uptrend and a downtrend is made based on the position of the first and last point:

    % Time-Series Classification
 
    if(i==1)
        ybeg=y(1); yend=y(end);
        if(ybeg<yend)
            trend(i)=+1; % uptrend
            hold on; plot(x,y,'k');
        else
            trend(i)=-1; % downtrend
            hold on; plot(x,y,'r');
        end
    end

where we mark the result of the classification with a sign of $+1$ or $-1$ in the corresponding element of the vector $trend$, and plot the week in black or red, denoting an uptrend or a downtrend, respectively.

For the second week, our rules of classification are enriched by additional information about the end-points of the current and the previous week:

    if(i==2)
        % week(current-1)
        tmp=week{i-1};
        x1=tmp(:,1); y1=tmp(:,2);
        y1b=y1(1); y1e=y1(end);
        % week(current)
        y0b=y(1); y0e=y(end);
        if(y0e>y1e)
            trend(i)=+1; % uptrend
            hold on; plot(x,y,'k');
        else
            trend(i)=-1; % downtrend
            hold on; plot(x,y,'r');
        end
    end

For weeks number 3 and higher we do our creative research on the data to define a specific set of rules. We allow the rules to take into account the information from up to two weeks prior to the current one and combine it all together. The following code represents an attractive solution, subject to improvement:

    if(i>2)
        % week(current-2)
        mid2=midp{i-2}(2);
        end2=endp{i-2}(2);
        slp2=slope(i-2);
        % week(current-1)
        mid1=midp{i-1}(2);
        end1=endp{i-1}(2);
        slp1=slope(i-1);
        % week(current)
        mid0=midp{i}(2);
        end0=endp{i}(2);
        slp0=slope(i);
        if((mid0>mid1))                     % up-trend
            if((mid0>mid2)&&(end0>end1))
                trend(i)=+1;
                hold on; plot(x,y,'k');    % strong up-trend
            elseif((mid0>mid2)&&(end0>end2))
                trend(i)=+1;
                hold on; plot(x,y,'k');    % weak up-trend
            elseif((mid0<mid2)&&(end0<end2)&&(slp0<0))
                trend(i)=-1;
                hold on; plot(x,y,'r');    % turns into possible down-trend
            elseif((mid0<mid2)&&(end0<end2)&&(slp0>0))
                trend(i)=+1;
                hold on; plot(x,y,'k');    % turns into possible up-trend
            else
                trend(i)=+1;                
                hold on; plot(x,y,'k');    % turns into possible up-trend
            end
        elseif(mid0<mid1)                  % down-trend
            if((mid0<mid2)&&(end0<end1)&&(end0<end2))
                trend(i)=-1;
                hold on; plot(x,y,'r');    % weak down-trend
            elseif((mid0<mid2)&&(end0<end2)&&(end0>end1))
                trend(i)=+1;
                hold on; plot(x,y,'k');    % possible up-trend
            elseif((mid0<mid2)&&(end0>end2))
                trend(i)=+1;
                hold on; plot(x,y,'k');    % turns into possible up-trend
            elseif((mid0>mid2)&&(end0<end1)&&(end0<end2))
                trend(i)=-1;
                hold on; plot(x,y,'r');
            elseif((mid0>mid2)&&(end0>end2))
                trend(i)=+1;
                hold on; plot(x,y,'k');    % turns into possible up-trend
            elseif((mid0>mid2)&&(end0>end1))
                trend(i)=+1;
                hold on; plot(x,y,'k');    % turns into possible up-trend
            else
                trend(i)=-1;                 
                hold on; plot(x,y,'r');
            end
        end
    end
end

Since one picture is worth millions of lines of code, below we present three examples of our model in action. The last plot corresponds to the recent Global Financial Crisis and shows how the weeks in uptrends of 2009 followed those in downtrends a year before.

It is straightforward to note that our rules perform very intuitively and stay in general agreement with market sentiment.
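A simple sanity check of the classification outcome can be obtained by counting the weeks in each state (a sketch, assuming the $trend$ vector produced by the code above):

% summary of the weekly classification
fprintf('uptrend weeks:   %4d\n',sum(trend==+1));
fprintf('downtrend weeks: %4d\n',sum(trend==-1));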

Accessing a SQL Server database in Matlab and the use of FTS objects


Feeding our models with data for backtesting purposes requires pulling data from a server. If we do our modelling in Matlab under Linux and an MS SQL Server database is our target, the quickest way to establish good connectivity is to make use of the Java Database Connectivity (JDBC) driver. Firstly, download the current driver (.tar.gz) and then follow the installation steps provided by Mathworks.

In the following example we will try to connect to our database called 'XDataBase' and retrieve the stock prices of Apple Inc. (NASDAQ ticker: AAPL) from Feb 2, 1998 onward.

ticker=['AAPL'];
start=['02-Feb-1998'];
 
% --Connect to your local SQL Server and fetch the data for a ticker
%
% define connection
conn=database('XDataBase','matlab','m4tl4b','com.microsoft.sqlserver.jdbc.SQLServerDriver','jdbc:sqlserver://163.118.2.7:1533;database=XDataBase');
 
% specify return format
setdbprefs('DataReturnFormat','structure');
 
% define your SQL query, as for example:
QueryString = ['SELECT sec.tic, dp.datadate, dp.prchd As PriceH, dp.prcld As PriceL, dp.prccd As PriceC, dp.prccd/dp.ajexdi As PriceCA, dp.cshtrd*dp.ajexdi As Vol FROM  Xpressfeed.dbo.security AS sec INNER JOIN Xpressfeed.dbo.sec_dprc AS dp ON sec.gvkey = dp.gvkey AND sec.iid = dp.iid AND dp.datadate >=''',start,''' WHERE (sec.tic IN (''',ticker,'''))'];
 
% execute connection, fetch data, and assign them to a Matlab variable
cur = exec(conn, QueryString);
myData = fetch(cur);
secData = myData.data;

In the above example, our structure variable secData is composed of the arrays secData.PriceH, secData.PriceL, secData.PriceC, secData.PriceCA, and secData.Vol, containing information about AAPL's High, Low, Close, and Adjusted Close prices and the Volume traded, respectively.

Making use of Matlab's Financial Toolbox, one can store all these data in a useful text-like file format. We do it by constructing a Financial Time-Series (FTS) object:

% verify the existence of data for a ticker
skip=false;
try
    n=length(secData.PriceC);  % a random test
catch
    skip=true;
end
if(~skip)
    dates=secData.datadate;
    dates=datenum(dates);
    data=[zeros(n,1) secData.PriceH secData.PriceL secData.PriceC secData.PriceCA secData.Vol];
    colheads={'DATES';'OPEN';'HIGH';'LOW';'CLOSE';'CLOSEA';'VOLUME'};
    desc=ticker;
    filename=[ticker,'.dat'];
    % save AAPL's extracted Price and Volume data into AAPL.dat file
    fts2ascii(filename,dates,data,colheads,desc);
 
    % construct FTS object by reading the data from the file
    fts=ascii2fts(filename,1,2);
    data=fts2mat(fts,1);
    % and plot the Adjusted Close Price
    scrsz = get(0,'ScreenSize'); 
    hh=figure('Position',[70 scrsz(4)/2 scrsz(3)/1.1 scrsz(4)/2],'Toolbar','none');
    plot(data(:,1),data(:,5)); 
    xlim([min(data(:,1)) max(data(:,1))]); 
    title(ticker);
    xlabel('Time [days]'); ylabel('Adjusted Close Price [US$]');
end

As the result of executing the code above, we were able to retrieve the AAPL stock data and save them into the AAPL.dat file. The first ten lines of the file are as follows:

$ head -10 AAPL.dat
AAPL
DATES 	OPEN 	HIGH 	LOW 	CLOSE 	CLOSEA 	VOLUME
02-Feb-1998   0.000000   18.500000   17.375000   17.688000   4.422000   22740000
03-Feb-1998   0.000000   18.625000   17.688000   18.313000   4.578250   14380000
04-Feb-1998   0.000000   18.500000   18.000000   18.250000   4.562500   6078400
05-Feb-1998   0.000000   18.500000   18.000000   18.313000   4.578250   8508000
06-Feb-1998   0.000000   18.688000   18.250000   18.500000   4.625000   7228000
09-Feb-1998   0.000000   19.500000   18.375000   19.188000   4.797000   17668000
10-Feb-1998   0.000000   19.563000   19.063000   19.438000   4.859500   15072000
11-Feb-1998   0.000000   19.500000   18.875000   19.000000   4.750000   7560000

The call to ascii2fts constructs the FTS object:

>> whos fts
  Name         Size             Bytes  Class    Attributes
 
  fts       3366x6             190016  fints

whereas fts2mat transforms the object's data into matrix format, which we may use to plot AAPL's Adjusted Close Prices.

It is worth recalling that the Financial Toolbox allows us to express a date (e.g. 02-Feb-1998) as a continuous number:

>> datenum('02-Feb-1998')
ans =
      729788

or reversely

>> datestr(ans)
ans =
02-Feb-1998

Trend Identification for FX Traders


When you think about the invention of a new model for algorithmic trading, there are only three key elements you need to start your work with: creativity, data, and a programming tool. Assuming that the last two are already in your possession, all that remains is seeking and finding a great new idea! With no offense, that's the hardest part of the game.

To be successful in discovering new trading solutions you have to be completely open-minded, relaxed, and well oriented in the information pertaining to your topic. Personally, after many years of programming and playing with digital signal processing techniques, I have discovered that the most essential ingredient of well-grounded research is the data itself. The more I, literally, stared at time-series changing their properties, the more I was able to capture subtle differences, often overlooked by myself before, and with the aid of intuition and scientific experience some new ideas simply popped up.

Here I would like to share with you a part of this process.

In the Extracting Time-Series from Tick-Data article I outlined one of many possible ways of extracting FX time-series from very fine data sets. As a final product we obtained two files, namely:

audusd.bid.1h
audusd.ask.1h

corresponding to the Bid and Ask prices of the Forex AUDUSD pair's trading history between Jan 2000 and May 2010. Each file contains two columns of numbers: Time (Modified Julian Day) and Price. The time resolution has been selected to be 1 hour.

FOREX trading lasts from Monday to Friday, continuously for 24 hours, therefore the data contain regular gaps corresponding to weekends. As the data coverage is more abundant compared to, for example, the much shorter trading windows of equities or ETFs around the world, it provides us with a better understanding of trading directions within every week's time frame. Keeping that in mind, we might be interested in looking at the directional information conveyed by the data as a seed of a potential new FX model.

For now, let's focus solely on the initial pre-processing of the Bid and Ask time-series and on splitting each week into a common cell array.

% FX time-series analysis
% (c) Quant at Risk, 2012
%
% Part 1: Separation of the weeks
 
close all; clear all; clc;
 
% --analyzed FX pair
pair=['audusd'];
 
% --data
n=['./',pair,'/',pair];     % a common path to files
na=[n,'.ask.1h']; 
nb=[n,'.bid.1h'];
d1=load(na); d2=load(nb);   % loading the Ask and Bid data
d=(d1+d2)/2;                % blending
clear d1 d2

For the sake of simplicity, in the blending step above we decided to use the simple average of the Bid and Ask 1-hour prices for our further research. Next, we create a weekly template, $x$, for our data classification, and we find the total number of weeks available for analysis:

% time constraints from the data
t0=min(d(:,1));
tN=max(d(:,1));
t1=t0-1;
 
% weekly template for data classification
x=t1:7:tN+7;
 
% total number of weeks
nw=length(x)-1;
 
fprintf(upper(pair));
fprintf(' time-series: %3.0f weeks (%5.2f yrs)\n',nw,nw/52);

which in our case returns the encouraging information:

AUDUSD time-series: 539 weeks (10.37 yrs)

The core of the programming exercise is to split all 539 weeks and save them into a cell array, $week$. As we will see in the code section below, for some reasons we may want to assure ourselves that each week contains the same number of points, therefore any data missing from our FX data provider will be interpolated. To do that efficiently, we use the following function, which makes use of Piecewise Cubic Hermite Interpolating Polynomial (pchip) interpolation for filling gapped data points in the series:

function [x2,y2]=gapinterpol(x,y,dt)
    % specify the axis
    x_min=x(1);
    x_max=x(length(x));
    x2=(x_min:dt:x_max);
    % interpolate gaps
    y2=pchip(x,y,x2);
end

We realise the separation of the weeks in our program as follows:

week={};  % an empty cell array
avdt=[]; 
 
for i=1:nw
    % split FX signal according to week
    [r,c,v]=find(d(:,1)>x(i) & d(:,1)<x(i+1));
    x1=d(r,1); y1=d(r,2);
    % interpolate gaps, use 1-hour bins
    dt=1/24;
    [x2,y2]=gapinterpol(x1,y1,dt);
    % check the average sampling time, should equal to dt
    s=0;
    for j=1:length(x2)-1
        s=s+(x2(j+1)-x2(j));
    end
    tmp=s/(length(x2)-1);
    avdt=[avdt; tmp];
    % store the week signal in a cell array
    tmp=[x2; y2]; tmp=tmp';
    week{i}=tmp;
end
fprintf('average sampling after interpolation = %10.7f [d]\n',max(avdt));

where as a check-up we get:

average sampling after interpolation =  0.0416667 [d]

which corresponds to the expected value of $1/24$ of a day to a sufficient approximation.
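The same check can be written more compactly with diff (for the last processed week, using the x2 vector left over from the loop above):

% alternative sampling-time check
avdt2=mean(diff(x2))    % should also be close to 1/24 of a day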

A quick visual verification of our signal processing,

scrsz = get(0,'ScreenSize');
h=figure('Position',[70 scrsz(4)/2 scrsz(3)/1.1 scrsz(4)/2],'Toolbar','none');
hold off;
for i=1:nw
    w=week{i};
    x=w(:,1); y=w(:,2);
    % plot weekly signal
    hold on; plot(x,y,'k');
end
xlim([0 100]);

uncovers our desired result:

[Figure: AUD/USD weekly time-series]

Bank format of numbers in Matlab

The following Matlab function allows displaying any double-type number using a nice bank format that includes comma separators and two-digit precision (without rounding). The number $num$ is converted into a new variable of string type.

function [str]=bankformat(num)
    % truncate to three decimal places (no rounding)
    num2=floor(num*1000)/1000;
    r=int32(100*(num-floor(num)));
    str = num2str(num2);
    % ensure a decimal part is present
    k=find(str == '.');
    if(isempty(k))
        str=[str,'.00'];
    end
    % insert comma separators every three digits of the integer part
    FIN = min(length(str),find(str == '.')-1);
    for i = FIN-2:-3:2
        str(i+1:end+1) = str(i:end);
        str(i) = ',';
    end
    % pad with a trailing zero if needed
    x=mod(r,10);
    if(x==0)
        str=[str,'0'];
    end
    % trim the decimal part to exactly two digits
    k=find(str == '.');
    d=length(str)-k;
    if(d>2)
        str=str(1:end-(d-2));
    end
end

Example:

>> n=70149506.5;
>> s=bankformat(n);
>> fprintf('\nInvested capital: £%s\n',s)
 
Invested capital: £70,149,506.50