Quantitative Analysis, Risk Management, Modelling, Algo Trading, and Big Data Analysis

Rebinning Tick-Data for FX Algo Traders

If you work or intend to work with FX data in order to build and backtest your own FX models, the historical tick-data offered by Pepperstone.com is probably the best place to kick off your algorithmic experience. As of now, they offer tick-data sets for the 15 most frequently traded currency pairs, going back to May 2009. Some of the unzipped files (one month of data) exceed 400 MB in size, i.e. they store 8.5+ million lines with tick resolution for both bid and ask prices. The good news is that you can download them all free of charge and their quality is regarded as very high. The bad news is that the data are made available with a three-month delay.

Rebinning that tick-data, however, is a different story and the subject of this post. As an example, we will see how efficiently you can turn Pepperstone's tick-data set(s) into a 5-min time-series. We will make use of bash scripting (Linux/OS X) supplemented with data processing in Python.

Data Structure

You can download Pepperstone’s historical tick-data from here, month by month, pair by pair. Their inner structure follows the same pattern, namely:

$ head AUDUSD-2014-09.csv 
AUD/USD,20140901 00:00:01.323,0.93289,0.93297
AUD/USD,20140901 00:00:02.138,0.9329,0.93297
AUD/USD,20140901 00:00:02.156,0.9329,0.93298
AUD/USD,20140901 00:00:02.264,0.9329,0.93297
AUD/USD,20140901 00:00:02.265,0.9329,0.93293
AUD/USD,20140901 00:00:02.265,0.93289,0.93293
AUD/USD,20140901 00:00:02.268,0.93289,0.93295
AUD/USD,20140901 00:00:02.277,0.93289,0.93296
AUD/USD,20140901 00:00:02.278,0.9329,0.93296
AUD/USD,20140901 00:00:02.297,0.93288,0.93296

The columns, from left to right, represent, respectively: the pair name, the date and tick-time, the bid price, and the ask price.
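
For orientation, a single record can be unpacked in Python like this (a minimal sketch; the sample line is copied from the listing above):

line = "AUD/USD,20140901 00:00:01.323,0.93289,0.93297"
pair, stamp, bid, ask = line.split(",")
date, ticktime = stamp.split(" ")
print("%s %s %s %.5f %.5f" % (pair, date, ticktime, float(bid), float(ask)))
# AUD/USD 20140901 00:00:01.323 0.93289 0.93297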

Pre-Processing

Here, for each .csv file, we aim to split the date into year, month, and day separately, and to remove commas and colons so that the raw data are ready to be read in as a matrix (array) by any other programming language (e.g. Matlab or Python). A matrix is a mathematically intuitive data structure, and direct reference to any of its columns lets a backtesting engine run at full thrust.

Let’s play with AUDUSD-2014-09.csv data file. Working in the same directory where the file is located we begin with writing a bash script (pp.scr) that contains:

1   # pp.scr
2   # Rebinning Pepperstone.com Tick-Data for FX Algo Traders
3   # (c) 2014 QuantAtRisk, by Pawel Lachowicz
4   
5   clear
6   echo "..making a sorted list of .csv files"
7   for i in $1-*.csv; do echo ${i##$1-} $i ${i%.csv};
8   done | sort -n | awk '{print $2}' > $1.lst
9   
10  python pp.py
11  head AUDUSD.pp

that you run in Terminal:

$ chmod +x pp.scr
$ ./pp.scr AUDUSD

where the first command makes the script executable (you need to perform this task only once). Lines #7-8 of our script look for all .csv data files in the local directory whose names start with the AUDUSD- prefix and write their sorted list into the AUDUSD.lst file. Since we work with the AUDUSD-2014-09.csv file only, the AUDUSD.lst file will contain:

$ cat AUDUSD.lst 
AUDUSD-2014-09.csv

as expected. Next, we utilise the power and flexibility of Python in the following way:

# pp.py
import csv

fnlst="AUDUSD.lst"
fnout="AUDUSD.pp"

# open the output file once, so consecutive monthly files
# from the list are appended rather than overwritten
with open(fnout,'w') as f:
    writer=csv.writer(f,delimiter=" ")

    for lstline in open(fnlst,'r').readlines():
        fncur=lstline[:-1]  # the file name, stripped of its trailing newline

        i=1 # counts the number of lines with tick-data
        for line in open(fncur,'r').readlines():
            if(i<=5200): # replace with (i>0) to process an entire file
                year=line[8:12]
                month=line[12:14]
                day=line[14:16]
                hh=line[17:19]
                mm=line[20:22]
                ss=line[23:29]
                bidask=line[30:] # bid,ask (incl. trailing newline; cleaned below)
                writer.writerow([year,month,day,hh,mm,ss,bidask])
                i+=1

This is a pretty efficient way to open a really big file and process its information line by line. For display purposes, we told the computer to process only the first 5,200 lines. The output of lines #10-11 of pp.scr is the following:

2014 09 01 00 00 01.323 "0.93289,0.93297
"
2014 09 01 00 00 02.138 "0.9329,0.93297
"
2014 09 01 00 00 02.156 "0.9329,0.93298
"
2014 09 01 00 00 02.264 "0.9329,0.93297
"
2014 09 01 00 00 02.265 "0.9329,0.93293
"

since we allowed Python to save bid and ask information as one string (due to a variable number of decimal digits). In order to clean this mess we continue:

13  # pp.scr (continued)
14  echo "..removing token: comma"
15  sed 's/,/ /g' AUDUSD.pp > $1.tmp
16  rm AUDUSD.pp
17  
18  echo "..removing token: double quotes"
19  sed 's/"/ /g' $1.tmp > $1.tmp2
20  rm $1.tmp
21  
22  echo "..removing empty lines"
23  sed -i '/^[[:space:]]*$/d' $1.tmp2
24  mv $1.tmp2 AUDUSD.pp
25  
26  echo "head..."
27  head AUDUSD.pp
28  echo "tail..."
29  tail AUDUSD.pp

which brings us to the pre-processed data (a side note: on OS X the BSD version of sed requires an empty suffix argument for in-place editing, i.e. sed -i '' ...):

..removing token: comma
..removing token: double quotes
..removing empty lines
head...
2014 09 01 00 00 01.323  0.93289 0.93297
2014 09 01 00 00 02.138  0.9329 0.93297
2014 09 01 00 00 02.156  0.9329 0.93298
2014 09 01 00 00 02.264  0.9329 0.93297
2014 09 01 00 00 02.265  0.9329 0.93293
2014 09 01 00 00 02.265  0.93289 0.93293
2014 09 01 00 00 02.268  0.93289 0.93295
2014 09 01 00 00 02.277  0.93289 0.93296
2014 09 01 00 00 02.278  0.9329 0.93296
2014 09 01 00 00 02.297  0.93288 0.93296
tail...
2014 09 02 00 54 39.324  0.93317 0.93321
2014 09 02 00 54 39.533  0.93319 0.93321
2014 09 02 00 54 39.543  0.93318 0.93321
2014 09 02 00 54 39.559  0.93321 0.93321
2014 09 02 00 54 39.784  0.9332 0.93321
2014 09 02 00 54 39.798  0.93319 0.93321
2014 09 02 00 54 39.885  0.93319 0.93325
2014 09 02 00 54 39.886  0.93319 0.93321
2014 09 02 00 54 40.802  0.9332 0.93321
2014 09 02 00 54 48.829  0.93319 0.93321

Personally, I love this part: you learn how to do simple but necessary text-file operations with single lines of Unix/Linux commands. Good luck to those who try to repeat the same in Microsoft Windows in under 30 seconds.

Rebinning: 5-min Data

Rebinning has many schools; for some people it is an art. We just want to get the job done. I opt for simplicity and for understanding the data we deal with. Imagine we have two adjacent 5-min bins with a tick history of trading:

[Figure: tick history across two adjacent 5-min bins]
We want to derive the closest possible (or fairest) price estimate every 5 min, denoted in the figure above by a red marker. The old-school approach is to average over a number (larger than 5) of tick data points to the left and to the right of the bin boundary. That creates an under- or overestimation of the mid-price.

If we trade live, every 5 min we receive information on the last tick before the minute hits a multiple of 5, and we wait for the first tick after it (blue markers). Taking the average of their prices (the mid-price) makes the most sense. The precision we look at here is sometimes $10^{-5}$. It is of little significance if our position is small, but if it is not, the mid-price may start playing a crucial role.
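
As a concrete sketch, the mid-price estimate can be computed from the two straddling ticks as follows (the second tick's values below are illustrative, not taken from the data above):

def bin_price(b1, a1, b2, a2):
    # b1/a1: bid/ask of the last tick before the 5-min boundary
    # b2/a2: bid/ask of the first tick after the boundary
    bm = (b1 + b2)/2.0    # mid bid
    am = (a1 + a2)/2.0    # mid ask
    return (bm + am)/2.0  # the bin's price estimate (red marker)

print("%.6f" % bin_price(0.93289, 0.93297, 0.93291, 0.93299))  # 0.932940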

The cons of the old-school approach: the possibly high volatility among all the tick-data within the last 5 minutes is neglected.

The following Python code (pp2.py) performs 5-min rebinning for our pre-processed AUDUSD-2014-09 file:

# pp2.py
import numpy as np

def convert(data):
    # turn a list of numeric strings into a float numpy array
    tempDATA = []
    for i in data:
        tempDATA.append([float(j) for j in i.split()])
    return np.array(tempDATA).T

fname="AUDUSD.pp"

with open(fname) as f:
    data = f.read().splitlines()

i=1
for d in data:
    fields=[s for s in d.split(' ')]
    # remove empty elements from the list
    dd=[x for x in fields if x]
    tmp=convert(dd)
    if(i==1):
        a=tmp
        i+=1
    else:
        a = np.vstack([a, tmp])
        i+=1

N=i-1

# print the first line
tmp=np.array([a[1][0],a[1][1],a[1][2],a[1][3],a[1][4],0.0,(a[1][6]+a[1][7])/2])
print("%.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f" %
             (tmp[0],tmp[1],tmp[2],tmp[3],tmp[4],tmp[5],tmp[6]))
m=tmp

# check the boundary conditions (5 min bins)
for i in xrange(2,N-1):
    if( (a[i-1][4]%5!=0.0) and (a[i][4]%5==0.0)):

        # BLUE MARKER No. 1: the last tick before the 5-min boundary (row i-1)
        b1=a[i-1][6]
        b2=a[i][6]
        a1=a[i-1][7]
        a2=a[i][7]
        # mid-prices, and the new date for the 5-min bin
        bm=(b1+b2)/2
        am=(a1+a2)/2
        Ym=a[i][0]
        Mm=a[i][1]
        Dm=a[i][2]
        Hm=a[i][3]
        MMm=a[i][4]
        Sm=0.0        # set seconds to zero

        # RED MARKER: print and store the rebinned 5-min record
        print("%.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f" %
              (Ym,Mm,Dm,Hm,MMm,Sm,(bm+am)/2))
        tmp=np.array([Ym,Mm,Dm,Hm,MMm,Sm,(bm+am)/2])
        m=np.vstack([m, tmp])

        # BLUE MARKER No. 2: the first tick after the boundary (row i)

which you run from the pp.scr file as:

31  # pp.scr (continued)
32  
33  python pp2.py > AUDUSD.dat

in order to get 5-min rebinned FX time-series as follows:

$ head AUDUSD.dat
2014  9  1  0  0  0.000   0.932935
2014  9  1  0  5  0.000   0.933023
2014  9  1  0 10  0.000   0.932917
2014  9  1  0 15  0.000   0.932928
2014  9  1  0 20  0.000   0.932937
2014  9  1  0 25  0.000   0.933037
2014  9  1  0 30  0.000   0.933075
2014  9  1  0 35  0.000   0.933070
2014  9  1  0 40  0.000   0.933092
2014  9  1  0 45  0.000   0.933063
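
For cross-checking, a similar 5-min grid can be obtained with pandas; here is a minimal hedged sketch, assuming the AUDUSD.pp file produced above (note that resample("5min").last() takes the last tick inside each bin, so its values will differ slightly from our two-tick boundary estimate):

import pandas as pd

cols = ["y", "m", "d", "hh", "mm", "ss", "bid", "ask"]
df = pd.read_csv("AUDUSD.pp", sep=r"\s+", names=cols)
ts = pd.to_datetime(dict(year=df.y, month=df.m, day=df.d,
                         hour=df.hh, minute=df.mm, second=df.ss.astype(int)))
mid = pd.Series(((df.bid + df.ask)/2.0).values, index=ts)
print(mid.resample("5min").last().head())  # one mid-price per 5-min bin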

That concludes our efforts. Happy rebinning!

Gap-on-Open Profitable Trading Strategy


After a longer while, QuantAtRisk is back in business. As an algo trader I have always been tempted to test a gap-on-open trading strategy. There are various reasons for that, but the most popular one is endlessly discussed: good/bad news on a stock, after which the stock price skyrockets/plunges over the following days. When we approach such price patterns, we talk about triggers or triggered events. The core of the algorithm's activity is trigger identification and taking the proper action: going long or short. That's it. In both cases we want to make money.

In this post we will design the initial conditions of our gap-on-open trading strategy, acting as the triggers, and we will backtest a realistic scenario of betting our money on those stocks that opened higher on the next trading day. Our goal is to find the optimal holding period for such trades to be closed with a profit.

Portfolio

Our strategy can be backtested using any $N$-asset portfolio. Here, for simplicity, let us use a random subset of 10 stocks (portfolio.lst) drawn from the current Dow Jones Index:

AXP   CSCO   DIS   IBM   JNJ   KO   NKE   PG   UTX   XOM

In Matlab, we fetch the stock prices from Google Finance data provider accessible via Quandl.com’s Matlab API (see this post for its setup in Matlab). We commence writing our main backtesting code as follows:

1   % Gap on Open Trading Strategy
2   %  Fetching stock prices via Quandl and Strategy Backtesting
3   %
4   % (c) 2014 by Pawel Lachowicz, QuantAtRisk.com
5   
6   
7   clear all; close all; clc;
8   
9   fname=['portfolio.lst'];
10  
11  % Model's parameter #1 (years)
12  parm1=1;
13  ndays=parm1*365;
14  lday=datenum('2014-08-05');
15  % fetching stock data
16  [Top,Thp,Tlp,Tcp,N,ntdays]=FetchQuandl(fname,ndays,lday);

where we use a pre-designed function, FetchQuandl, to import four separate daily price-series per stock: open (Top), high (Thp), low (Tlp), and close (Tcp) prices:

function [Top,Thp,Tlp,Tcp,N,ntdays]=FetchQuandl(fname,ndays,lday)
    % Read the list of Dow Jones components
    fileID = fopen(fname);
    tmp = textscan(fileID,'%s');
    fclose(fileID);
    components=tmp{1};  % a list as a cell array
 
    % Read in the list of tickers and internal codes from Quandl.com
    [~,text,~] = xlsread('QuandlStockCodeListUS.xlsx');
    quandlc=text(:,1);    % again, as a list in a cell array
    quandlcode=text(:,3); % corresponding Quandl's Price Code
 
    % fetch stock data for the last 'ndays'
    date2=datestr(lday,'yyyy-mm-dd');       % to
    date1=datestr(lday-ndays,'yyyy-mm-dd'); % from
 
    Top={}; Thp={}; Tlp={}; Tcp={};
    % scan all tickers and fetch the data from Quandl.com
    for i=1:length(components)
        for j=1:length(quandlc)
            if(strcmp(components{i},quandlc{j}))
                fprintf('%4.0f %s\n',i,quandlc{j});
                fts=0;
                [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ...
                              'authcode','PutHereYourQuandlCode',...
                              'start_date',date1,'end_date',date2);
                cp=fts2mat(fts.Close,1); Tcp{i}=cp;     % close price-series
                op=fts2mat(fts.Open,1);  Top{i}=op;     % open price-series
                hp=fts2mat(fts.High,1);  Thp{i}=hp;     % high price
                lp=fts2mat(fts.Low,1);   Tlp{i}=lp;     % low price
                %Rcp{i}=cp(2:end,2)./cp(1:end-1,2)-1;   % return-series cp
            end
        end
    end
    N=length(components);
    ntdays=length(Tcp{1});
end

Please note that in line #12 we specified the number of years, i.e. how far back in time our backtest should extend (converted into calendar days in line #13) from the last day specified in line #14.

Trading Model

First, let us design the trading strategy. We scan the four price-series of each stock concurrently. We define the strategy's trigger as follows:
$$
O_t > C_{t-1} \quad \textrm{and} \quad L_t > H_{t-1} \ ,
$$
i.e. the stock's open price on day $t$ must be higher than its close price on day $t-1$, and its lowest price on day $t$ must be higher than its highest price on day $t-1$. Having that, we make a BUY LONG decision! We buy that stock on the next trading day at its market (close) price. This approach should remove the slippage bias effectively (see more on slippage in stock trading here).
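
A compact sketch of the trigger in Python (NumPy), assuming op, hp, lp, cp are equally long arrays of daily Open/High/Low/Close prices (the toy numbers are illustrative only):

import numpy as np

def gap_on_open_triggers(op, hp, lp, cp):
    # day t fires if O(t) > C(t-1) and L(t) > H(t-1)
    fired = (op[1:] > cp[:-1]) & (lp[1:] > hp[:-1])
    return np.where(fired)[0] + 1  # indices of trigger days

op = np.array([10.0, 10.6, 10.4]); cp = np.array([10.2, 10.5, 10.3])
hp = np.array([10.3, 10.7, 10.5]); lp = np.array([ 9.9, 10.4, 10.2])
print(gap_on_open_triggers(op, hp, lp, cp))  # [1]: 10.6 > 10.2 and 10.4 > 10.3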

Now, we run the backtest over each stock and each open trade. We select the second parameter (parm2) to be the number of days we hold the stock. In the following piece of code, let us allow the stock to be sold between 1 and 21 calendar days later ($\pm$ weekends or public holidays):

18  % pre-defined matrix for backtest final results
19  results=[];
20  
21  for parm2=0:20
22  
23      cR=[];
24      for i=1:N
25          % just for the purpose of plotting the price-series
26          if(i==1)
27              % open (blue color)
28              plot(Top{i}(:,1),Top{i}(:,2),'b')
29              hold on
30              % close (red color)
31              plot(Tcp{i}(:,1),Tcp{i}(:,2),'r')
32              hold on
33              % high (green color)
34              plot(Thp{i}(:,1),Thp{i}(:,2),'g')
35              %
36              xlabel('Days');
37              ylabel('AXP Stock Prices [US$]');
38          end
39  
40          Tbuy=[];
41          for t=2:ntdays
42              % define indicators
43              ind1=Tcp{i}(t-1,2);  % cp on day (t-1)
44              ind2=Thp{i}(t-1,2);  % hp on day (t-1)
45              ind3=Top{i}(t,2);    % op on day (t)
46              ind4=Tlp{i}(t,2);    % lp on day (t)
47              % detect trigger
48              if(ind1<ind3)&&(ind2<ind4)
49                  % plotting only for AXP
50                  if(i==1)
51                      hold on;
52                      plot(Top{i}(t,1),Top{i}(t,2),'o');
53                  end
54                  % date of a trigger
55                  tday=Top{i}(t,1);
56                  nextbusdate=busdate(tday,1); % find the next trading date
57                  Tbuy=[Tbuy; nextbusdate];
58              end
59          end
60          Tsell=busdate(Tbuy+parm2,1);

Here, in lines #57 and #60, we constructed the Tbuy and Tsell time arrays storing the dates of those trades. Now, we will use them to look up the prices at each trade's open and close and to derive the profit and loss for each stock:

62          R=[];
63          for k=1:length(Tbuy)
64              j=find(Tbuy(k)==Tcp{i}(:,1));
65              pbuy=Tcp{i}(j,2);
66              j=find(Tsell(k)==Tcp{i}(:,1));
67              psell=Tcp{i}(j,2);
68              ret=(psell/pbuy-1); % return per trade
69              R=[R; ret];
70          end
71  
72          compR=prod(R+1)-1;  % compound return per stock
73          cR=[cR; compR];
74  
75      end
76  
77      results=[results cR];
78  
79  end

In the inner loop (lines #24 to #75, tracking the stocks in the portfolio; index $i$, here 1 to 10) we capture all trades per stock (lines #63-70) and calculate a multi-period compound return (line #72), as if we had traded that stock solely using our model.
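
In other words, for $K$ closed trades with single-trade returns $r_k = p_k^{\rm sell}/p_k^{\rm buy} - 1$, line #72 computes the compound return per stock:
$$
r_C = \prod_{k=1}^{K} (1+r_k) - 1 \ .
$$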

For instance, for stock $i=1$ (AXP) from our portfolio, our code displays the 1-year price-series:

[Figure: AXP open/close/high price-series over one year; days meeting the trigger criteria are marked with open circles]

Now re-run the backtest, making a gentle substitution in line #24 so that it reads:

24      for i=1:1

and add a few extra lines of code:

81  figure(2)
82  stem((0:20),100*results)
83  xlabel('Holding Period [days]');
84  ylabel('AXP: Compound Return [%]');

we obtain an appealing result:

[Figure: AXP compound return vs. holding period of 0-20 days]

The chart reveals that for AXP, over the past 251 trading days (counting back from Aug 4, 2014), we had 16 triggers, therefore 16 trades, and, surprisingly, regardless of the holding period, the compound return over all closed trades was highly positive (profitable).

This is not the case if we consider, for example, $i=4$, the IBM stock:

[Figure: IBM compound return vs. holding period]

This result indicates that for different holding periods (and different stocks, of course) extra trading indicators should be applied to limit the losses (e.g. profit targets).

If we traded the whole portfolio using our gap-on-open model, we would end up with a very encouraging result:

[Figure: compound return vs. holding period, averaged over the 10-stock portfolio]

where for each holding period we display the compound return averaged over the 10 stocks. Taking into account the global up-trend in the US stock markets between August 2013 and August 2014, this strategy is worth considering with further modifications (e.g. short or both long and short triggers, FX time-series, etc.).

Someone wise once said: Sometimes you win. Sometimes you learn. In algo trading we all learn to win.

In the next post…
Marginal Value-at-Risk for Portfolio Managers


Hacking Google Finance in Real-Time for Algorithmic Traders

Forecasting risk in algorithmic stock trading is of paramount importance for everyone. You should always look for ways to detect sudden price changes and take immediate action to protect your investments.

Imagine you opened a new long position in NASDAQ:NVDA last Wednesday, buying 1,500 shares at the market price of USD 16.36. The next day the price closes the session down at USD 15.75. You are down 3.73%, or almost a grand, in one day. If you can handle that, fine, but what if the drop were steeper? Another terrorist attack, an unforeseen political event, a North Korea nuclear strike? Then what? You need to react!

If you have information, you have options in your hands. In this post we will see how one can take the real-time stock prices displayed on the Google Finance website and fetch and record them on your own computer. Having them, you can build your own warning system for sudden price swings (risk management) or run the code in the background for the whole trading session (for any stock, index, etc. accessible through Google Finance) and capture asset prices at an intraday sampling rate (e.g. every 10 min, 30 min, 1 h, and so on). From this point, only your imagination can stop you from using the collected data.

Hacking with Python

If you ever dreamt of becoming a hacker, this is your chance to shine! I got my inspiration from reading Violent Python: A Cookbook for Hackers, Forensic Analysts, Penetration Testers and Security Engineers by TJ O'Connor. A powerful combination of the beauty and the beast.

The core of our code is a small function that does the job: for a specified Google-style ticker (query), it fetches the data directly from the server and returns the most current price of the asset:

# Hacking Google Finance in Real-Time for Algorithmic Traders
# 
# (c) 2014 QuantAtRisk.com, by Pawel Lachowicz
 
import urllib, time, os, re, csv
 
def fetchGF(googleticker):
    url="http://www.google.com/finance?&q="
    txt=urllib.urlopen(url+googleticker).read()
    k=re.search('id="ref_(.*?)">(.*?)<',txt)
    if k:
        tmp=k.group(2)
        q=tmp.replace(',','')
    else:
        q="Nothing found for: "+googleticker
    return q
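
A quick sanity check of the function in an interactive session (the printed quote is illustrative; you get whatever Google Finance reports at that moment):

print(fetchGF("NASDAQ:AAPL"))  # e.g. '543.71'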

Just make sure that the Google ticker is correctly specified (as we will see below). Next, let's display our local time on the screen and force a change of the process's time zone to the one corresponding to New York City, NY. We make the latter change because we would like to track the intraday prices of stock(s) traded on the NYSE or NASDAQ. However, if you are tracking the FTSE 100 index, the Universal Time (UTC) of London is advisable as an input parameter.

# display time corresponding to your location
print(time.ctime())
print
 
# Set local time zone to NYC
os.environ['TZ']='America/New_York'
time.tzset()
t=time.localtime() # a time.struct_time holding the current NY time
print(time.ctime())
print

Having that, let us define a helper function, combine, which glues a time stamp and the fetched quote together into a Python list:

def combine(ticker):
    quote=fetchGF(ticker) # use the core-engine function
    t=time.localtime()    # grasp the moment of time
    output=[t.tm_year,t.tm_mon,t.tm_mday,t.tm_hour,  # build a list
            t.tm_min,t.tm_sec,ticker,quote]
    return output

As an input, we define the Google ticker of our interest:

ticker="NASDAQ:AAPL"

for which we open a new text file where all queries will be saved in real-time:

# define file name of the output record
fname="aapl.dat"
# remove the file if it exists
os.path.exists(fname) and os.remove(fname)

Eventually, we construct the final loop over the trading session. Here, we fetch the last data point at about 16:00:59 New York time. The key parameter in the game is the freq variable, which specifies the intraday sampling (in seconds). From my tests on a private Internet connection, I found the optimal sampling to be 600 sec (10 min). For shorter intervals, Google Finance detected the overly frequent queries sent from my IP address. The same test succeeded from a different IP location, so feel free to play with your local network to find the lowest sampling period available for your geolocation.

freq=600 # fetch data every 600 sec (10 min)
 
with open(fname,'a') as f:
    writer=csv.writer(f,dialect="excel")
    while(t.tm_hour<=16):
        if(t.tm_hour==16):
            while(t.tm_min<1):
                data=combine(ticker)
                print(data)
                writer.writerow(data) # save data in the file
                time.sleep(freq)
                t=time.localtime()    # refresh the time stamp
            else:
                break
        else:
            data=combine(ticker)
            print(data)
            writer.writerow(data) # save data in the file
            time.sleep(freq)
            t=time.localtime()    # refresh the time stamp

To see how the above code works in practice, I conducted a test on Jan/9 2014, starting at 03:31:19 Sydney/Australia time, corresponding to 11:31:19 New York time. Setting the sampling frequency to 600 sec, I was able to fetch the data in the following form:

Thu Jan  9 03:31:19 2014
 
Wed Jan  8 11:31:19 2014
 
[2014, 1, 8, 11, 31, 19, '543.71']
[2014, 1, 8, 11, 41, 22, '543.66']
[2014, 1, 8, 11, 51, 22, '544.22']
[2014, 1, 8, 12, 1, 23, '544.80']
[2014, 1, 8, 12, 11, 24, '544.32']
[2014, 1, 8, 12, 21, 25, '544.86']
[2014, 1, 8, 12, 31, 27, '544.47']
[2014, 1, 8, 12, 41, 28, '543.76']
[2014, 1, 8, 12, 51, 29, '543.86']
[2014, 1, 8, 13, 1, 30, '544.00']
[2014, 1, 8, 13, 11, 31, 'Nothing found for: NASDAQ:AAPL']
[2014, 1, 8, 13, 21, 33, '543.32']
[2014, 1, 8, 13, 31, 34, '543.84']
[2014, 1, 8, 13, 41, 36, '544.26']
[2014, 1, 8, 13, 51, 37, '544.10']
[2014, 1, 8, 14, 1, 39, '544.30']
[2014, 1, 8, 14, 11, 40, '543.88']
[2014, 1, 8, 14, 21, 42, '544.29']
[2014, 1, 8, 14, 31, 45, '544.15']
...

As you can see, each record was displayed on the screen (via the print(data) call) in the form of a Python list. It is important to note that the time we capture and associate with the fetched asset price (query) is the computer's system time, so please don't expect the regular time intervals one gets from a verified market-data provider. We are hacking in real-time! If you think about the data themselves, though, this time precision is not of great importance: as long as we fetch the data every freq seconds, it is sufficient to build a risk-management system or even to measure the rolling volatility of an asset. Your trading model will benefit anyway.

Note also that if our Internet connection fails, or there are disturbances of a different kind, a query returns no price, as visible in the example above.
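
To turn these queries into the warning system mentioned at the beginning, a minimal hedged sketch could compare consecutive fetches against a threshold; the alert_move function and the 1% threshold below are illustrative assumptions built on top of fetchGF, not part of the original code:

def alert_move(ticker, threshold=0.01, freq=600):
    # warn when the relative price change between two consecutive
    # fetches exceeds the threshold (0.01 = 1%)
    last = float(fetchGF(ticker))
    while True:
        time.sleep(freq)
        quote = fetchGF(ticker)
        try:
            price = float(quote)
        except ValueError:
            continue   # a failed query ("Nothing found for: ..."); skip it
        r = price/last - 1
        if abs(r) > threshold:
            print("ALERT %s: %+.2f%% move (%.2f -> %.2f)" %
                  (ticker, 100*r, last, price))
        last = price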

Looks exciting? Give me High Five! and say Hell Yeah!

Code Modification: Portfolio of Assets

The presented Python code can be easily modified if you wish to fetch data for a couple of assets concurrently every freq seconds. Simply extend and amend all the lines from the ticker definition onwards, for example in the following form:

tickers=["NASDAQ:AAPL","NASDAQ:GOOG","NASDAQ:BIDU","NYSE:IBM", \
         "NASDAQ:INTC","NASDAQ:MSFT","NYSEARCA:SPY"]
 
# define the name of an output file
fname="portfolio.dat"
# remove the file if it exists
os.path.exists(fname) and os.remove(fname)
 
freq=600 # fetch data every 600 sec (10 min)
 
with open(fname,'a') as f:
    writer=csv.writer(f,dialect="excel")
    while(t.tm_hour<=16):
        if(t.tm_hour==16):
            while(t.tm_min<1):
                for ticker in tickers:
                    data=combine(ticker)
                    print(data)
                    writer.writerow(data)
                time.sleep(freq)
                t=time.localtime()    # refresh the time stamp
            else:
                break
        else:
            for ticker in tickers:
                data=combine(ticker)
                print(data)
                writer.writerow(data)
            time.sleep(freq)
            t=time.localtime()    # refresh the time stamp

That's it! For the sake of real-time verification, here is a screenshot of how it works:

Thu Jan  9 07:01:43 2014
 
Wed Jan  8 15:01:43 2014
 
[2014, 1, 8, 15, 1, 44, 'NASDAQ:AAPL', '543.55']
[2014, 1, 8, 15, 1, 44, 'NASDAQ:GOOG', '1140.30']
[2014, 1, 8, 15, 1, 45, 'NASDAQ:BIDU', '182.65']
[2014, 1, 8, 15, 1, 45, 'NYSE:IBM', '187.97']
[2014, 1, 8, 15, 1, 46, 'NASDAQ:INTC', '25.40']
[2014, 1, 8, 15, 1, 47, 'NASDAQ:MSFT', '35.67']
[2014, 1, 8, 15, 1, 47, 'NYSEARCA:SPY', '183.43']
[2014, 1, 8, 15, 11, 48, 'NASDAQ:AAPL', '543.76']
[2014, 1, 8, 15, 11, 49, 'NASDAQ:GOOG', '1140.06']
[2014, 1, 8, 15, 11, 49, 'NASDAQ:BIDU', '182.63']
[2014, 1, 8, 15, 11, 50, 'NYSE:IBM', '187.95']
[2014, 1, 8, 15, 11, 51, 'NASDAQ:INTC', '25.34']
[2014, 1, 8, 15, 11, 52, 'NASDAQ:MSFT', '35.67']
[2014, 1, 8, 15, 11, 53, 'NYSEARCA:SPY', '183.34']
...

where we can see that we were able to grab the prices of 6 stocks and 1 ETF (an Exchange-Traded Fund tracking the S&P 500 Index) every 10 min.

Reflection

You may wonder whether such hacking is legal or not. The best answer I find in the words of Gordon Gekko: Someone reminded me I once said "Greed is good". Now it seems it's legal.

Hacking Google Finance in Python

PART II
    Hacking Google Finance in Real-Time for Algo Traders. (2) Pre-Market Trading.

Pre-Processing of Asset Price Series for Portfolio Optimization

Portfolio optimization is a significant component of Matlab's Financial Toolbox. It provides a ready-to-use solution for finding the optimal weights of the assets we consider for trading, derived from historical asset performance. From a practical point of view, we can include it in our algorithmic trading strategy and backtest its applicability under different initial conditions. That is the subject of my upcoming post. However, before we can enjoy the view from the peak, we need to climb the mountain first.

In Matlab, the portfolio is created as a dedicated object of the same name. It doesn't read raw stock data; we need to feed that beast. Two major ingredients satisfy the input: a vector of expected asset returns and a covariance matrix. Matlab helps us estimate these moments, but first we need to deliver the asset data in a digestible form.

In this post we will see how one can quickly download stock data from the Internet based on our own stock selection, and pre-process it for solving the portfolio optimization problem in Matlab.

Initial Setup for Portfolio Object

Let's say that at any point in time you have your own list of stocks you wish to buy. For simplicity, let's also assume the list contains stocks traded on the NYSE or NASDAQ. Since you have been a great fan of this game, you are now almost ready to buy what you jotted down in your ShoppingList.lst. Here is an example of 10 tech stocks:

AAPL   AOL   BIDU   GOOG   HPQ   IBM   INTC   MSFT   NVDA   TXN

They will constitute your portfolio of stocks. The problem of portfolio optimization requires a look back in time at the returns each stock obtained in trading. Based on them, the Return Proxy and Risk Proxy can be found.

Matlab requires as an input the return matrix $R$ of dimensions $(N-1)\times M$, where $N$ stands for the number of historical prices (e.g. daily, monthly, etc.) and $M$ for the number of stocks in our portfolio. We will see how it works in the next post. For now, let's focus solely on the creation of this matrix.
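
In terms of the close prices $p_{i,j}$ of stock $j$ on day $i$, each element of $R$ is a simple return:
$$
R_{i,j} = \frac{p_{i+1,j}}{p_{i,j}} - 1, \qquad i=1,\dots,N-1, \quad j=1,\dots,M \ .
$$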

In the article Create a Portfolio of Stocks based on Google Finance Data fed by Quandl I discussed Quandl.com as an attractive data provider for US stocks. Here, we will follow this solution making use of Quandl resources to pull out the stock price series for our shopping list. Ultimately, we aim at building a function, here: QuandlForPortfolio, that does the job for us:

% Pre-Processing of Asset Price Series for Portfolio Optimization in Matlab
%  (c) 2013, QuantAtRisk.com, by Pawel Lachowicz
 
clear all; close all; clc;
 
% Input Parameters
n=1*365;
tickers='ShoppingList.lst';
qcodes='QuandlStockCodeListUS.xlsx';
 
[X,Y,R,AssetList] = QuandlForPortfolio(n,tickers,qcodes);

We call this function with three input parameters. The first one, $n$, denotes the number of calendar days from today (counting backwards) for which we wish to retrieve the stock data. Usually, 365 calendar days correspond to about 250$-$252 trading days. The second parameter is the path/file name of our stock list (the stocks to be taken into account in the portfolio optimisation process), while the last input defines the path/file name of the file storing the stocks' tickers and the associated Quandl Price Codes (see here for more details).

Feeding the Beast

The QuandlForPortfolio Matlab function is an extended version of the previously discussed solution. It contains an important correcting procedure for the data fetched from the Quandl servers. First, let's have a closer look at the function itself:

1   % Function assists in fetching Google Finance data from the Quandl.com
2   %  server for a given list of tickers of stocks traded on NYSE or
3   %  NASDAQ. Data are retrieved for last 'n' days with daily sampling.
4   %
5   % INPUT
6   %   n       : number of calendar days from 'today' (e.g. 365 would
7   %             correspond to about 252 business days)
8   %   tickers : a path/file name of a text file listing tickers
9   %   qcodes  : a path/file name of Excel workbook (.xlsx) containing a list
10  %              of tickers and Quandl Price Codes in the format of
11  %              [Ticker,Stock Name,Price Code,Ratios Code,In Market?]
12  % OUTPUT
13  %   X0        : [Nx1] column vector with days
14  %   Y0        : [NxM] matrix with Close Prices for M stocks
15  %   R0        : [(N-1)xM] matrix of Returns
16  %   AssetList : a list of tickers (cell array)
17  %
18  % (c) 2013, QuantAtRisk.com, by Pawel Lachowicz
19  
20  function [X0,Y0,R0,AssetList0] = QuandlForPortfolio(n,tickers,qcodes)
21      fileID = fopen(tickers);
22      tmp = textscan(fileID, '%s');
23      fclose(fileID);
24      AssetList=tmp{1};  % a list as a cell array
25  
26      % Read in the list of tickers and internal Quandl codes
27      %
28      [~,text,~] = xlsread(qcodes);
29      quandlc=text(:,1); % again, as a list in a cell array
30      quandlcode=text(:,3); % corresponding Quandl's Price Code
31  
32      date1=datestr(today-n,'yyyy-mm-dd'); % from
33      date2=datestr(today,'yyyy-mm-dd');   % to
34  
35      % Fetch the data from Quandl.com
36      %
37      QData={};
38      for i=1:length(AssetList)
39          for j=1:length(quandlc)
40              if(strcmp(AssetList{i},quandlc{j}))
41                  fprintf('%4.0f %s\n',i,quandlc{j});
42                  fts=0;
43                  [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ...
44                     'authcode','x',...
45                     'start_date',date1,'end_date',date2,'collapse','daily');
46                  QData{i}=fts;
47              end
48          end
49      end
50  
51      % Post-Processing of Fetched Data
52      %
53      % create a list of days across all tickers
54      TMP=[];
55      for i=1:length(QData)
56          tmp=fts2mat(QData{i},1);
57          tmp=tmp(:,1);
58          TMP=[TMP; tmp];
59      end
60      ut=unique(TMP);
61      % use that list to find these days that are not present
62      %  among all data sets
63      TMP=[];
64      for i=1:length(QData)
65          tmp=fts2mat(QData{i},1);
66          tmp=tmp(:,1);
67          TMP=[TMP; setdiff(ut,tmp)];
68      end
69      ut=unique(TMP);
70      % finally, extract Close Prices from FTS object and store them
71      %  in Y0 matrix, plus corresponding days in X0
72      X0=[];
73      Y0=[];
74      for i=1:length(QData)
75          tmp=fts2mat(QData{i},1);
76          cp=[];
77          for j=1:size(tmp,1)
78              [r,~,~]=find(ut==tmp(j,1));
79              if(isempty(r))
80                  cp=[cp; tmp(j,5)]; % column 5 corresponds to Close Price
81                  if(i<2)
82                      % create a time column vector listing days
83                      % common among all data sets
84                      X0=[X0; tmp(j,1)];
85                  end
86              end
87          end
88          Y0=[Y0 cp];
89      end
90      % transform Close Prices into Returns, R(i)=cp(i)/cp(i-1)-1
91      R0=tick2ret(Y0);
92      AssetList0=AssetList';
93  end

The main bottleneck comes from the fact that Matlab's portfolio object demands an equal number of historical returns ($N-1$) in the matrix $R$ for all $M$ assets. We design the function so that it establishes a common timeframe for all stocks on our shopping list. Of course, we assume that all stocks were traded in the markets for roughly the last $n$ days.

Now, the timeframe of the last $n$ days should be understood as a first approximation. We fetch the data from Quandl (numeric date, Open, High, Low, Close, Volume) and save it in the cell array QData (lines #37-49) for each stock separately, as FTS objects (Financial Time-Series objects; see the Financial Toolbox). However, it may occur that not every stock fetched holds the same amount of data. That is why we need to investigate on which days and for which stocks we miss data. We achieve that by scanning each FTS object and creating a unique list of all days for which we have data (lines #54-60).

Next, we loop again over the same data sets, but now we compare that list with the list of all dates for each stock individually (lines #63-69), capturing (line #67) those dates that are missing. Their complete list is stored as a vector in line #69. Given that, we are eventually able to compile the full data set (e.g. Close Prices; line #80) for all stocks in our portfolio, ensuring that we include only those dates for which we have prices across all $M$ assets (lines #70-91).
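
The same date-alignment logic, sketched in Python for clarity (dates_per_stock, a list of per-stock lists of date numbers, is a hypothetical stand-in for the fetched FTS date columns):

# keep only those days present in every stock's history
def common_dates(dates_per_stock):
    union = set().union(*map(set, dates_per_stock))  # all days seen anywhere
    missing = set()                                  # days absent from some stock
    for d in dates_per_stock:
        missing |= (union - set(d))
    return sorted(union - missing)                   # days common to all stocks

print(common_dates([[1, 2, 3, 4], [2, 3, 4], [1, 2, 4]]))  # -> [2, 4]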

Beast Unleashed

We test our data pre-processing simply by running the block of code listed above, engaging the QuandlForPortfolio function, and we check the results in Matlab's command window as follows:

>> whos X Y R AssetList
  Name             Size            Bytes  Class     Attributes
 
  AssetList        1x10             1192  cell                
  R              250x10            20000  double              
  X              251x1              2008  double              
  Y              251x10            20080  double

which confirms the correctness of the dimensions, as expected.

At this stage, the aforementioned function can be used two-fold. First, for portfolio optimisation, we look back over the last $n$ calendar days counted from the most current one (today). The second usage is handy too: we take the stocks on our shopping list and fetch, say, their last $n=7\times365$ days of data. If all stocks were traded over the past 7 years, we should be able to collect a rich data set. If not, the function will adjust the beginning and end dates to meet the initial time constraints required for the construction of the $R$ matrix. In the latter case, we can use the 7-year data sample for direct backtesting of algo models utilising Portfolio Optimization.

Stay tuned as we will rock this land in the next post!

Any Questions?

Share them across QuantCove.com – the official Forum of QuantAtRisk.

Download Intraday and Tick-Data from QaR

As of this week, you have the opportunity to access random sets of historical intraday trading and tick-data across different markets. Simply scroll down the Download menu option at the top of the screen. In time, more data sets will be added.

Happy backtesting!

Sincerely Yours,
QaR Team

Anxiety Detection Model for Stock Traders based on Principal Component Analysis

Everybody would agree on one thing: nervousness among traders may lead to massive stock sell-offs, an avalanche of falling prices, and huge losses. We have witnessed this sort of behaviour many times. Anxiety is the feeling of uncertainty, an inborn human instinct that triggers a self-defence mechanism against high risk. The apex of anxiety is fear and panic. When fear spreads through the markets, everything goes south. It is a dream goal for all traders to capture nervousness and fear just by looking at or studying market behaviour, trading patterns, the bid-ask spread, or the flow of orders. The major problem is that we know very well how important a factor human behaviour is in this game, yet it is not observed directly. It has been puzzling me for a few years, since I finished reading Your Money and Your Brain: How the New Science of Neuroeconomics Can Help Make You Rich by Jason Zweig while sitting on the beach of one of the Gili Islands, Indonesia, in December 2009. A perfect spot, far away from the trading charts.

So, is there a way to disentangle the emotional part involved in trading from all other factors (e.g. the application of technical analysis, the consequences of bad news, IPOs, etc.) which are somehow easier to deduce? In this post I will make a quantitative attempt at solving this problem. Although the solution will not take a final, closed form, my goal is to deliver inspiration for quants and traders interested in the subject by putting a simple idea into practice: the application of Principal Component Analysis.

1. Principal Component Analysis (PCA)

Called by many one of the most valuable results of applied linear algebra, Principal Component Analysis delivers a simple, non-parametric method of extracting relevant information from often confusing data sets. Real-world data usually hold some relationships among their variables and, as a good first approximation, we may suspect them to be of linear (or close to linear) form. Linearity is one of the stringent but powerful assumptions standing behind PCA.

Imagine we observe the daily change of prices of $m$ stocks (being a part of your portfolio or a specific market index) over the last $n$ days. We collect the data in $\boldsymbol{X}$, an $m\times n$ matrix. Each column of $\boldsymbol{X}$ is an $m$-dimensional vector lying in a vector space spanned by an orthonormal basis, therefore it is a linear combination of this set of unit-length basis vectors: $\boldsymbol{BX} = \boldsymbol{X}$, where the basis $\boldsymbol{B}$ is the identity matrix $\boldsymbol{I}$. Within the PCA approach we ask a simple question: is there another basis, a linear combination of the original one, that better represents our data set? In other words, we look for a transformation matrix $\boldsymbol{P}$ acting on $\boldsymbol{X}$ in order to deliver its re-representation:
$$
\boldsymbol{PX} = \boldsymbol{Y} \ .
$$ The rows of $\boldsymbol{P}$ become a set of new basis vectors for expressing the columns of $\boldsymbol{X}$. This change of basis makes the row vectors of $\boldsymbol{P}$ in this transformation the principal components of $\boldsymbol{X}$. But how to find a good $\boldsymbol{P}$?

Consider for a moment what we can do with a set of $m$ observables spanning $n$ days. It is no mystery that many stocks, over different periods of time, co-vary, i.e. their price movements are closely correlated and follow the same direction. The statistical method for measuring the mutual relationship among $m$ vectors (correlation) is the calculation of a covariance matrix. For our data set $\boldsymbol{X}$:
$$
\boldsymbol{X}_{m\times n} =
\left[
\begin{array}{c}
\boldsymbol{x_1} \\
\boldsymbol{x_2} \\
\vdots \\
\boldsymbol{x_m}
\end{array}
\right]
=
\left[
\begin{array}{cccc}
x_{1,1} & x_{1,2} & \dots & x_{1,n} \\
x_{2,1} & x_{2,2} & \dots & x_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m,1} & x_{m,2} & \dots & x_{m,n}
\end{array}
\right]
$$
the covariance matrix takes the following form:
$$
cov(\boldsymbol{X}) \equiv \frac{1}{n-1} \boldsymbol{X}\boldsymbol{X}^{T}
$$ where we multiply $\boldsymbol{X}$ by its transpose, and the factor $(n-1)^{-1}$ ensures the variance estimate is unbiased. The diagonal elements of $cov(\boldsymbol{X})$ are the variances corresponding to each row of $\boldsymbol{X}$, whereas the off-diagonal terms of $cov(\boldsymbol{X})$ represent the covariances between different rows (prices of the stocks). Please note that this multiplication ensures that $cov(\boldsymbol{X})$ is a square symmetric matrix of size $m\times m$.

All right, but what does this have in common with our PCA method? PCA looks for a way to optimise $cov(\boldsymbol{X})$ by reducing redundancy. Sounds a bit enigmatic? I bet! All we need to understand is that PCA wants to 'force' all off-diagonal elements of the covariance matrix to zero (in the best possible way). The guys in the Department of Statistics will tell you the same thing as: removing redundancy diagonalises $cov(\boldsymbol{X})$. But how, how?!

Let's come back to our previous notation of $\boldsymbol{PX}=\boldsymbol{Y}$: $\boldsymbol{P}$ transforms $\boldsymbol{X}$ into $\boldsymbol{Y}$. We also noted that:
$$
\boldsymbol{P} = [\boldsymbol{p_1},\boldsymbol{p_2},\dots,\boldsymbol{p_m}]
$$ was the new basis we were looking for. PCA assumes that all basis vectors $\boldsymbol{p_k}$ are orthonormal, i.e. $\boldsymbol{p_i}\cdot\boldsymbol{p_j}=\delta_{ij}$, and that the directions with the largest variances are the most principal. So, PCA first selects a normalised direction in the $m$-dimensional space along which the variance of $\boldsymbol{X}$ is maximised. That is the first principal component, $\boldsymbol{p_1}$. In the next step, PCA looks for another direction along which the variance is maximised. However, because of the orthonormality condition, it searches only among directions perpendicular to all previously found ones. In consequence, we obtain an orthogonal matrix $\boldsymbol{P}$ (with orthonormal rows). Good stuff, but still sounds complicated?

The goal of PCA is to find a $\boldsymbol{P}$, where $\boldsymbol{Y}=\boldsymbol{PX}$, such that $cov(\boldsymbol{Y})=(n-1)^{-1}\boldsymbol{YY}^T$ is diagonalised.

We can evolve the notation of the covariance matrix as follows:
$$
(n-1)cov(\boldsymbol{Y}) = \boldsymbol{YY}^T = \boldsymbol{(PX)(PX)}^T = \boldsymbol{PXX}^T\boldsymbol{P}^T = \boldsymbol{P}(\boldsymbol{XX}^T)\boldsymbol{P}^T = \boldsymbol{PAP}^T
$$ where we made a quick substitution of $\boldsymbol{A}=\boldsymbol{XX}^T$. It is easy to prove that $\boldsymbol{A}$ is symmetric. It takes a longer while to find a proof for the following two theorems: (1) a matrix is symmetric if and only if it is orthogonally diagonalisable; (2) a symmetric matrix is diagonalised by a matrix of its orthonormal eigenvectors. Just check your favourite algebra textbook. The second theorem provides us with a right to denote:
$$
\boldsymbol{A} = \boldsymbol{EDE}^T
$$ where $\boldsymbol{D}$ is a diagonal matrix and $\boldsymbol{E}$ is a matrix of eigenvectors of $\boldsymbol{A}$. That brings us to the end of the rainbow.

We select the matrix $\boldsymbol{P}$ to be one in which each row $\boldsymbol{p_i}$ is an eigenvector of $\boldsymbol{XX}^T$; therefore
$$
\boldsymbol{P} = \boldsymbol{E}^T .
$$

Given that, we see that $\boldsymbol{E}=\boldsymbol{P}^T$; thus we find $\boldsymbol{A}=\boldsymbol{EDE}^T = \boldsymbol{P}^T\boldsymbol{DP}$, which leads us to a magnificent relationship between $\boldsymbol{P}$ and the covariance matrix:
$$
(n-1)cov(\boldsymbol{Y}) = \boldsymbol{PAP}^T = \boldsymbol{P}(\boldsymbol{P}^T\boldsymbol{DP})\boldsymbol{P}^T
= (\boldsymbol{PP}^T)\boldsymbol{D}(\boldsymbol{PP}^T) =
(\boldsymbol{PP}^{-1})\boldsymbol{D}(\boldsymbol{PP}^{-1})
$$ or
$$
cov(\boldsymbol{Y}) = \frac{1}{n-1}\boldsymbol{D},
$$ i.e. the choice of $\boldsymbol{P}$ diagonalises $cov(\boldsymbol{Y})$, where we also silently used the matrix-algebra theorem stating that the inverse of an orthogonal matrix is its transpose ($\boldsymbol{P}^{-1}=\boldsymbol{P}^T$). Fascinating, right?! Let's now see how one can use all that complicated machinery in the quest for human emotions among the endless rivers of market numbers bombarding our senses every day.
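
Before applying this machinery to market data, a short numerical sanity check in Python (NumPy), on toy random data rather than the stock prices used below, illustrates the punchline: taking the rows of $\boldsymbol{P}$ to be the eigenvectors of $cov(\boldsymbol{X})$ diagonalises $cov(\boldsymbol{Y})$:

import numpy as np

np.random.seed(1)
m, n = 4, 500
X = np.random.randn(m, n)               # m variables observed over n days
X -= X.mean(axis=1, keepdims=True)      # centre each row
C = X.dot(X.T)/(n - 1)                  # cov(X), an m x m symmetric matrix
w, E = np.linalg.eigh(C)                # C = E D E^T with D = diag(w)
P = E.T                                 # rows of P = orthonormal eigenvectors
Y = P.dot(X)
CY = Y.dot(Y.T)/(n - 1)                 # cov(Y) = P cov(X) P^T = D
print(np.round(CY, 10))                 # diagonal: eigenvalues of C on the
                                        # diagonal, zeros elsewhere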

2. Covariances of NASDAQ, Eigenvalues of Anxiety


We will try to build a simple quantitative model for detecting nervousness in the trading markets using PCA.

By simplicity I mean the following model assumption: no matter what the data conceal, the 1st Principal Component (1-PC) of the PCA solution captures the complicated relationships among a subset of stocks driven by a latent factor, which we attribute to the common behaviour of traders (human and pre-programmed algos). It is a pretty reasonable assumption, far stronger than, for instance, the influence of Saturn's gravity on annual silver-price fluctuations. Since PCA does not tell us what its 1-PC means in reality, it is our job to seek meaningful explanations. Therefore, a human factor fits the frame as a trial value very well.

Let's consider the NASDAQ-100 index. It is composed of 100 technology stocks. You can find the most current list here: nasdaq100.lst, downloadable as a text file. As usual, we will perform all calculations in the Matlab environment. Let's start with data collection and pre-processing:

1   % Anxiety Detection Model for Stock Traders
2   %  making use of the Principal Component Analysis (PCA)
3   %  and utilising publicly available Yahoo! stock data
4   %
5   % (c) 2013 QuantAtRisk.com, by Pawel Lachowicz
6   
7   clear all; close all; clc;
8   
9   
10  % Reading a list of NASDAQ-100 components
11  nasdaq100=(dataread('file',['nasdaq100.lst'], '%s', 'delimiter', '\n'))';
12  
13  % Time period we are interested in
14  d1=datenum('Jan 2 1998');
15  d2=datenum('Oct 11 2013');
16  
17  % Check and download the stock data for a requested time period
18  stocks={};
19  for i=1:length(nasdaq100)
20      try
21          % Fetch the Yahoo! adjusted daily close prices between selected
22          % days [d1;d2]
23          tmp = fetch(yahoo,nasdaq100{i},'Adj Close',d1,d2,'d');
24          stocks{i}=tmp;
25          disp(i);
26      catch err
27          % no full history available for requested time period
28      end
29  end

where, first, we check whether, for the given list of NASDAQ-100 components, the full data history (adjusted close prices) is available from the Yahoo! server (please refer to my previous post, Yahoo! Stock Data in Matlab and a Model for Dividend Backtesting, for more information on the connectivity).

The cell array stocks becomes populated with two-dimensional matrices: the time-series corresponding to stock prices (time, price). Since the Yahoo! database does not contain the full history for all stocks of our interest, we may expect different time spans. For the purpose of demonstrating the PCA method, we apply additional screening of the downloaded data, i.e. we require the data to span the period defined by the $d1$ and $d2$ variables and, additionally, to have the same (maximal available) number of data points (observations, trials). We achieve that by:

31  % Additional screening
32  d=[];
33  j=1;
34  data={};
35  for i=1:length(nasdaq100)
36      d=[d; i min(stocks{i}(:,1)) max(stocks{i}(:,1)) size(stocks{i},1)];
37  end
38  for i=1:length(nasdaq100)
39      if(d(i,2)==d1) && (d(i,3)==d2) && (d(i,4)==max(d(:,4)))
40          data{j}=sortrows(stocks{i},1);
41          fprintf('%3i %1s\n',i,nasdaq100{i})
42          j=j+1;
43      end
44  end
45  m=length(data);

The temporary matrix $d$ holds, respectively: the index of the stock as read from the nasdaq100.lst file, the first and last day numbers of the available data, and the total number of data points in the time-series:

>> d
d =
      1      729757      735518        3970
      2      729757      735518        3964
      3      729757      735518        3964
      4      729757      735518        3969
     ..          ..          ..          ..
     99      729757      735518        3970
    100      729757      735518        3970

Our screening method saves the data of $m=21$ selected stocks into the data cell array, corresponding to the following companies from our list:

  1 AAPL
  7 ALTR
  9 AMAT
 10 AMGN
 20 CERN
 21 CHKP
 25 COST
 26 CSCO
 30 DELL
 39 FAST
 51 INTC
 64 MSFT
 65 MU
 67 MYL
 74 PCAR
 82 SIAL
 84 SNDK
 88 SYMC
 96 WFM
 99 XRAY
100 YHOO

Okay, some people say that seeing is believing. All right, let's see how it works. Recall that we demanded our stock data span the period between 'Jan 2 1998' and 'Oct 11 2013'. We found 21 stocks meeting those criteria. Now, let's pick a random date, say, Jul 2 2007, and extract for all 21 stocks their price history over the preceding 90 calendar days. We save their prices (skipping the time columns) into the $Z$ matrix as follows:

t=datenum('Jul 2 2007');
Z=[];
for i=1:m
    [r,c,v]=find((data{i}(:,1)<=t) & (data{i}(:,1)>t-90));
    Z=[Z data{i}(r,2)];
end

and we plot them all together:

plot(Z)
xlim([1 length(Z)]);
ylabel('Stock price (US$)');
xlabel('T-90d');

[Figure: the 21 stock price-series over the 90-day window]

It's easy to deduce that the top line corresponds to Apple, Inc. (AAPL) adjusted close prices.

A data-processing step left unspoken so far is that we need to transform our time-series into a comparable form. We do that by subtracting the average value from each series and dividing it by its standard deviation. Why? Simply so they can be compared on an equivalent footing. We call that step the normalisation or standardisation of the time-series under investigation:

[N,M]=size(Z);
X=(Z-repmat(mean(Z),[N 1]))./repmat(std(Z),[N 1]);
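
In symbols, each price series $z_j$ (a column of $Z$) is transformed as
$$
x_{i,j} = \frac{z_{i,j}-\bar{z}_j}{\sigma_j} \ ,
$$
where $\bar{z}_j$ and $\sigma_j$ denote the sample mean and standard deviation of the $j$-th series.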

This represents the matrix $\boldsymbol{X}$ I discussed in the theoretical part of this post. Note that the dimensions are reversed in Matlab. The normalised time-series,

% Display normalized stock prices
plot(X)
xlim([1 length(Z)]);
ylabel('(Stock price-Mean)/StdDev');
xlabel('T-90d');

look like:
[Figure: the 21 normalised price-series]
For a given matrix $\boldsymbol{X}$, its covariance matrix,

% Calculate the covariance matrix, cov(X)
CovX=cov(X);
imagesc(CovX);

as for the data spanning 90 calendar days back from Jul 2 2007, looks like:

[Figure: 21x21 covariance matrix displayed as a heat map]

where the colour coding runs from the maximal values (most reddish) down to the minimal values (most bluish). The diagonal of the covariance matrix simply tells us that, for normalised time-series, the variances are equal to 1, as expected.

Going one step forward, based on the given covariance matrix, we look for the matrix $\boldsymbol{P}$ whose columns are the corresponding eigenvectors:

% Find P
[P,~]=eigs(CovX,5);
imagesc(P);
set(gca,'xticklabel',{1,2,3,4,5},'xtick',[1 2 3 4 5]);
xlabel('Principal Component')
ylabel('Stock');
set(gca,'yticklabel',{'AAPL', 'ALTR', 'AMAT', 'AMGN', 'CERN', ...
 'CHKP', 'COST', 'CSCO', 'DELL', 'FAST', 'INTC', 'MSFT', 'MU', ...
 'MYL', 'PCAR', 'SIAL', 'SNDK', 'SYMC', 'WFM', 'XRAY', 'YHOO'}, ...
 'ytick',[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21]);

which results in $\boldsymbol{P}$ displayed as:

[Figure: eigenvector matrix, stocks vs. the first five principal components]

where we computed the PCA for five principal components in order to illustrate the process. Since the colour coding is the same as in the previous figure, a visual inspection of the 1-PC indicates negative values for at least 16 out of the 21 stocks. That simply means that over the last 90 days the global dynamics of those stocks were directed south, in favour of traders holding short positions in them.

It is important to note at this very moment that the 1-PC does not represent 'price momentum' itself. That would be too easy. It represents the latent variable responsible for the common behaviour in the stock dynamics, whatever it is. Based on our model assumption (see above), we suspect it may indicate a human factor latent in the trading.

3. Game of Nerves

The last figure communicates an additional message: there is a remarkable coherence of the entries of the 1-PC and pretty random patterns for the remaining four principal components. One may check that, in the case of our data sample, this feature is maintained over many years. That allows us to limit our interest to the 1-PC only.

It's getting exciting, isn't it? Let's come back to our main code. Having a pretty good grasp of the algebra of PCA at work, we may now track the 1-PC over any time period of interest, below spanned between the dates defined by the $t1$ and $t2$ variables:

47  % Select time period of your interest
48  t1=datenum('July 1 2006');
49  t2=datenum('July 1 2010');
50  
51  results=[];
52  for t=t1:t2
53      tmp=[];
54      A=[]; V=[];
55      for i=1:m
56          [r,c,v]=find((data{i}(:,1)<=t) & (data{i}(:,1)>t-60));
57          A=[A data{i}(r,2)];
58      end
59      [N,M]=size(A);
60      X=(A-repmat(mean(A),[N 1]))./repmat(std(A),[N 1]);
61      CovX=cov(X);
62      [V,D]=eigs(CovX,1);
63      % Find all negative entries of the 1st Principal Component
64      [r,c,v]=find(V(:,1)<0);
65      % Extract them into a new vector
66      neg1PC=V(r,1);
67      % Calculate the percentage of negative entries relative
68      % to all values available
69      ratio=length(neg1PC)/m;
70      % Build a new time-series of the 'ratio' change over the required
71      % time period (spanned between t1 and t2)
72      results=[results; t ratio];
73  end

We build our anxiety-detection model based on the change in the number of negative entries of the 1st Principal Component (relative to their total number, here equal to 21). As a result, we generate a new time-series tracing this variable over the $[t1;t2]$ time period. We plot the results in one figure, contrasted with the NASDAQ-100 Index, in the following way:

75
76
77
78
79
80
81
82
83
84
85
% Fetch NASDAQ-100 Index from Yahoo! data-server
nasdaq = fetch(yahoo,'^ndx','Adj Close',t1,t2,'d');
% Plot it
subplot(2,1,1)
plot(nasdaq(:,1),nasdaq(:,2),'color',[0.6 0.6 0.6]);
ylabel('NASDAQ-100 Index');
% Add a plot corresponding to a new time-series we've generated
subplot(2,1,2)
plot(results(:,1),results(:,2),'color',[0.6 0.6 0.6])
% overplot a 30-day moving average based on the same data
hold on; plot(results(:,1),moving(results(:,2),30),'b')

leading us to:
anxiety-fig05
I use a 30-day moving average (the solid blue line) in order to smooth the results (moving.m). Please note that in the find() condition inside the loop above I also replaced the earlier window of 90 days with 60 days. It is simply more reasonable to examine the market dynamics with the PCA over the past two months than over longer periods (but it’s a matter of taste and needs).
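
A side note: moving.m is an external helper function. If you don’t have it at hand, a minimal stand-in could look as follows (my sketch; the original may handle the warm-up period differently):

function y=moving(x,n)
    % trailing n-point moving average; the first n-1 entries are
    % averaged over however many points are available so far
    y=zeros(size(x));
    for k=1:length(x)
        y(k)=mean(x(max(1,k-n+1):k));
    end
end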

Eventually, we construct the core element of the model: we detect nervousness among traders when the percentage of negative loadings of the 1st Principal Component increases over (at least) five consecutive days:

% Model Core
x1=results(:,1);
y1=moving(results(:,2),30);
tmp=[];
% Find moments of time where the percentage of negative 1-PC
% loadings increases over time (minimal requirement of
% five consecutive days); note i starts at 6 so that y1(i-5) exists
for i=6:length(x1)
    if(y1(i)>y1(i-1))&&(y1(i-1)>y1(i-2))&&(y1(i-2)>y1(i-3))&& ...
      (y1(i-3)>y1(i-4))&&(y1(i-4)>y1(i-5))
        tmp=[tmp; x1(i)];
    end
end
% When found, match those dates against the NASDAQ-100 series
z=[];
for i=1:length(tmp)
    for j=1:length(nasdaq)
        if(tmp(i)==nasdaq(j,1))
            z=[z; nasdaq(j,1) nasdaq(j,2)];
        end
    end
end
subplot(2,1,1); 
hold on; plot(z(:,1),z(:,2),'r.','markersize',7);

We overplot the results of the model with red markers on top of the NASDAQ-100 Index:
anxiety-fig06
Our simple model takes us into completely new territory: the unexplored space of latent variables. Firstly, it does not predict the future; that still (unfortunately) remains unknown. However, what it delivers is a fresh look at past market dynamics. Secondly, it is easy to read from the plot that the results cluster into three subgroups.

The first subgroup corresponds to trading activity with further negative consequences (see the events of 2007-2009 and the avalanche of falling prices); here, the negative dynamics of the trailing 60 calendar days simply continued. The second subgroup covers those periods when anxiety led to negative dynamics among stock traders but, due to other factors (e.g. financial, global, political), the stocks surged, dragging the Index up. The third subgroup (less frequent) corresponds to instances of relatively flat changes of the Index, revealing a typical pattern of psychological hesitation about the trading direction.

No matter how we might interpret the results, the human factor in trading is evident. Hopefully, the PCA approach captures it. If not, all we are left with is our best friend: a trader’s intuition.

Acknowledgments

This article is dedicated to Dr. Dariusz Grech of the Physics and Astronomy Department of the University of Wroclaw, Poland, for his superbly important and mind-blowing lectures on linear algebra in the 1998/99 academic year.

Yahoo! Stock Data in Matlab and a Model for Dividend Backtesting


Within the evolution of MathWorks’ MATLAB programming environment, we finally received, in the most recent version labelled 2013a, a long-awaited command-line facility for pulling stock data directly from the Yahoo! servers. What does that mean for quants and algo traders? Honestly, a lot. Now, with just a few commands, we can have nearly everything we want. However, please keep in mind that Yahoo! data are free, and therefore their precision does not always match the quality of, e.g., data downloaded from Bloomberg resources. Anyway, for pure backtesting of your models, this step introduces a big leap in dealing with daily stock data. As usual, we can get open, high, low, close, and adjusted close prices of stocks, supplemented with traded volume, the dates, and the values of dividends.

In this post I present a short example of how one can retrieve the data of SPY (an ETF tracking the performance of the S&P 500 index) using Yahoo! data in the new Matlab 2013a, and I show a simple code for testing which buy-hold-sell period for SPY (or any other dividend-paying stock) would have turned a profit most of the time.

The beauty of the new Yahoo! feature in Matlab 2013a has been fully described in the official article Request data from Yahoo! data servers, where you can find all the details required to build this functionality into your Matlab programs.

Model for Dividends

It is a well-known opinion (based on many years of market observations) that one may expect a drop in a stock’s price within a short timeframe (e.g. a few days) after the day the stock’s dividend has been announced. And probably every quant, sooner or later, is tempted to verify that hypothesis. That’s your homework (a sketch to get you started follows below). Today, however, let’s look at a slightly differently defined problem based on the ever-working reversed rule: what goes down must go up. Let’s consider an exchange-traded fund, the SPDR S&P 500 ETF Trust, listed on NYSE as SPY.
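
As a hint for that homework, a minimal sketch (mine, not part of the original model) could measure the average return a few days after each dividend date. It reuses the adjClose and div variables fetched in the code below and relies on the fact that Yahoo! returns the series newest-first:

% sketch: average 3-day return following each dividend date
post=[];
for i=1:size(div,1)
    r=find(adjClose(:,1)==div(i,1));
    if(~isempty(r) && r>3)  % r-3 = three trading days later (descending order)
        post=[post; adjClose(r-3,2)/adjClose(r,2)-1];
    end
end
fprintf('mean 3-day post-dividend return: %.3f%%\n',100*mean(post));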

First, let’s pull out the Yahoo! data of adjusted Close prices of SPY from Jan 1, 2009 up to Aug 27, 2013:

% Yahoo! Stock Data in Matlab and a Model for Dividend Backtesting
% (c) 2013 QuantAtRisk.com, by Pawel Lachowicz
 
close all; clear all; clc;
 
date_from=datenum('Jan 1 2009');
date_to=datenum('Aug 27 2013');
 
stock='SPY';
 
adjClose = fetch(yahoo,stock,'adj close',date_from,date_to);
div = fetch(yahoo,stock,date_from,date_to,'v');  % dividend dates and amounts
returns=(adjClose(2:end,2)./adjClose(1:end-1,2)-1);
 
% plot adjusted Close price of SPY and mark days when dividends
% have been announced
plot(adjClose(:,1),adjClose(:,2),'color',[0.6 0.6 0.6])
hold on;
plot(div(:,1),min(adjClose(:,2))+10,'ob');
ylabel('SPY (US$)');
xlabel('Jan 1 2009 to Aug 27 2013');

and visualize them:

spy-1

Having the data ready for backtesting, let’s look for the most profitable buy-hold-sell window for SPY, assuming that we buy SPY one day after a dividend has been announced (at the closing price) and hold it for $dt$ days (tested here between 1 and 40 trading days).

% find the most profitable period of holding SPY (long position)
neg=[];
for dt=1:40
 
buy=[]; sell=[];
for i=1:size(div,1)
    % find the date when the dividend has been announced
    [r,c,v]=find(adjClose(:,1)==div(i,1));
    % mark the corresponding SPY price with a blue circle marker
    hold on; plot(adjClose(r,1),adjClose(r,2),'ob');
    % Yahoo! returns the series newest-first, so row r-1 is one trading
    % day AFTER the dividend date: buy long at that day's close price
    buy=[buy; adjClose(r-1,1) adjClose(r-1,2)];
    % sell SPY 'dt' trading days after the purchase, again at the
    % close price (row r-1-dt in the descending series)
    sell=[sell; adjClose(r-1-dt,1) adjClose(r-1-dt,2)];
end
 
% calculate profit-and-loss of each trade (excluding transaction costs)
PnL=sell(:,2)./buy(:,2)-1;
% summarize the results
neg=[neg; dt sum(PnL<0) sum(PnL<0)/length(PnL)];
 
end

If we now sort the results according to the percentage of negative returns (column 3 of the neg matrix), we get:

>> sortrows(neg,3)
 
ans =
   18.0000    2.0000    0.1111
   17.0000    3.0000    0.1667
   19.0000    3.0000    0.1667
   24.0000    3.0000    0.1667
    9.0000    4.0000    0.2222
   14.0000    4.0000    0.2222
   20.0000    4.0000    0.2222
   21.0000    4.0000    0.2222
   23.0000    4.0000    0.2222
   25.0000    4.0000    0.2222
   28.0000    4.0000    0.2222
   29.0000    4.0000    0.2222
   13.0000    5.0000    0.2778
   15.0000    5.0000    0.2778
   16.0000    5.0000    0.2778
   22.0000    5.0000    0.2778
   27.0000    5.0000    0.2778
   30.0000    5.0000    0.2778
   31.0000    5.0000    0.2778
   33.0000    5.0000    0.2778
   34.0000    5.0000    0.2778
   35.0000    5.0000    0.2778
   36.0000    5.0000    0.2778
    6.0000    6.0000    0.3333
    8.0000    6.0000    0.3333
   10.0000    6.0000    0.3333
   11.0000    6.0000    0.3333
   12.0000    6.0000    0.3333
   26.0000    6.0000    0.3333
   32.0000    6.0000    0.3333
   37.0000    6.0000    0.3333
   38.0000    6.0000    0.3333
   39.0000    6.0000    0.3333
   40.0000    6.0000    0.3333
    5.0000    7.0000    0.3889
    7.0000    7.0000    0.3889
    1.0000    9.0000    0.5000
    2.0000    9.0000    0.5000
    3.0000    9.0000    0.5000
    4.0000    9.0000    0.5000

which indicates that the optimal holding period for a long position in SPY is 18 days. We can mark all trades (with the 18-day holding period) in the chart:

spy-2

where the trade open and close prices (according to our model described above) have been marked in the plot with black and red circle markers, respectively. Only 2 out of the 18 trades turned out to be negative, with losses of 2.63% and 4.26% (see the snippet after the next figure). The complete distribution of profits and losses from all trades can be obtained in the following way:

figure(2);
hist(PnL*100,length(PnL))
ylabel('Number of trades')
xlabel('Return (%)')

returning

spy-3
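
A small caveat here: after the scan over dt, the PnL vector holds the result of the last iteration (dt=40), so to reproduce the two quoted losses one should re-run the trade loop for the optimal period first, e.g.:

% recompute PnL for the optimal 18-day holding period
dt=18; buy=[]; sell=[];
for i=1:size(div,1)
    r=find(adjClose(:,1)==div(i,1));
    buy=[buy; adjClose(r-1,2)];
    sell=[sell; adjClose(r-1-dt,2)];
end
PnL=sell./buy-1;
100*PnL(PnL<0)   % the losing trades, in percent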

Let’s make some money!

The above Matlab code delivers a simple application of the newest built-in connectivity with the Yahoo! server and the ability to download the stock data of our interest. We have tested the optimal holding period for SPY from the beginning of 2009 until now (a global uptrend). The same code can easily be used and/or modified to verify any period and any stock for which dividends have been paid in the past. This fairly simple approach, though not too frequent in trading, provides us with some extra insight into how we can beat the market, assuming that the future remains more or less the same as the past. So, let’s make some money!

Trend Identification for FX Traders (Part 2)


In my previous post I provided an introduction to trading-model invention and design. We made use of FX data for the AUDUSD pair sampled hourly, splitting the data into weekly time-series.

Today, we will work a bit harder on formulating the very first rules for the model. This step requires engaging our creativity in understanding what the data look like, as well as in building a basic set of rules that helps us perform an attractive time-series classification. Our objective is to invent a method for classifying the past week’s time-series of an FX pair as being either in a downtrend or in an uptrend.

The most naive way of classifying the directional information contained in any time-series is its slope: a straight line fitted to the data. Let’s use it as our starting point. Instead of fitting all data points for a given week, we find median values of the first and the last 12 data points, both in Time $(x1,x2)$ and in Pair Ratio $(y1,y2)$, as specified in the linear trend estimation block below:

% FX time-series analysis
% (c) Quant at Risk, 2012
%
% Part 2: Classification of weekly time-series
 
close all; scrsz = get(0,'ScreenSize');
h=figure('Position',[70 scrsz(4)/2 scrsz(3)/1.1 scrsz(4)/2],'Toolbar','none');
fprintf('\nuptrend/downtrend identification.. ');
% for viewing use the loop
hold off;
set(0,'CurrentFigure',h);
 
% pre-define variables
trend=zeros(nw,1);
slope=zeros(nw,1);
midp={};  % middle points
endp={};  % end points (median based on last 12 points)
 
for i=1:nw  %--- a loop over total number of weeks available
 
    % reading time-series for a current week 
    w=week{i}; 
    x=w(:,1); y=w(:,2);
 
    % plot the time-series
    hold on; plot(x,y,'k');
 
    % linear trend estimation
    x1=median(x(1:12)); x2=median(x(end-11:end));
    y1=median(y(1:12)); y2=median(y(end-11:end));
 
    % define end-point of the time-series and mark it on the plot
    endp{i}=[x2 y2];
    hold on; plot(endp{i}(1),endp{i}(2),'b*');
 
    % find slope
    m=(y2-y1)/(x2-x1);
    slope(i)=m;
    xl=x1:dt:x2;       
    yl=m*xl-m*x2+y2;   % a line representing the slope
    hold on; plot(xl,yl,'b:');
 
    % find middle point of the line and mark it on the plot
    mx=mean(xl);
    my=mean(yl);
    midp{i}=[mx my];
    hold on; plot(midp{i}(1),midp{i}(2),'bo');

As an example of the code execution, for the first two weeks we plot the slopes, mid-points, and end-points.

We assume that our classification procedure will be based solely on the information provided by the end-points, mid-points, and slopes. We make an exception for the classification of the first two weeks. For the first week, the distinction between uptrend and downtrend is made based on the positions of the first and last points:

    % Time-Series Classification
 
    if(i==1)
        ybeg=y(1); yend=y(end);
        if(ybeg<yend)
            trend(i)=+1; % uptrend
            hold on; plot(x,y,'k');
        else
            trend(i)=-1; % downtrend
            hold on; plot(x,y,'r');
        end
    end

where we mark the result of the classification with a sign of $+1$ or $-1$ in the corresponding element of the trend vector, and plot the series in black or red, denoting an uptrend or a downtrend, respectively.

For the second week, our classification rules are enriched by additional information: a comparison of the end-points of the current and the previous week:

    if(i==2)
        % week(current-1)
        tmp=week{i-1};
        x1=tmp(:,1); y1=tmp(:,2);
        y1b=y1(1); y1e=y1(end);
        % week(current)
        y0b=y(1); y0e=y(end);
        if(y0e>y1e)
            trend(i)=+1; % uptrend
            hold on; plot(x,y,'k');
        else
            trend(i)=-1; % downtrend
            hold on; plot(x,y,'r');
        end
    end

For week number 3 and beyond, we do our creative research over the data to define a specific set of rules. We allow the model to take into account information from the two weeks prior to the current one and combine it all together. The following code represents one attractive solution, subject to improvement:

    if(i>2)
        % week(current-2)
        mid2=midp{i-2}(2);
        end2=endp{i-2}(2);
        slp2=slope(i-2);
        % week(current-1)
        mid1=midp{i-1}(2);
        end1=endp{i-1}(2);
        slp1=slope(i-1);
        % week(current)
        mid0=midp{i}(2);
        end0=endp{i}(2);
        slp0=slope(i);
        if((mid0>mid1))                     % up-trend
            if((mid0>mid2)&&(end0>end1))
                trend(i)=+1;
                hold on; plot(x,y,'k');    % strong up-trend
            elseif((mid0>mid2)&&(end0>end2))
                trend(i)=+1;
                hold on; plot(x,y,'k');    % weak up-trend
            elseif((mid0<mid2)&&(end0<end2)&&(slp0<0))
                trend(i)=-1;
                hold on; plot(x,y,'r');    % turns into possible down-trend
            elseif((mid0<mid2)&&(end0<end2)&&(slp0>0))
                trend(i)=+1;
                hold on; plot(x,y,'k');    % turns into possible up-trend
            else
                trend(i)=+1;                
                hold on; plot(x,y,'k');    % turns into possible up-trend
            end
        elseif(mid0<mid1)                  % down-trend
            if((mid0<mid2)&&(end0<end1)&&(end0<end2))
                trend(i)=-1;
                hold on; plot(x,y,'r');    % weak down-trend
            elseif((mid0<mid2)&&(end0<end2)&&(end0>end1))
                trend(i)=+1;
                hold on; plot(x,y,'k');    % possible up-trend
            elseif((mid0<mid2)&&(end0>end2))
                trend(i)=+1;
                hold on; plot(x,y,'k');    % turns into possible up-trend
            elseif((mid0>mid2)&&(end0<end1)&&(end0<end2))
                trend(i)=-1;
                hold on; plot(x,y,'r');
            elseif((mid0>mid2)&&(end0>end2))
                trend(i)=+1;
                hold on; plot(x,y,'k');    % turns into possible up-trend
            elseif((mid0>mid2)&&(end0>end1))
                trend(i)=+1;
                hold on; plot(x,y,'k');    % turns into possible up-trend
            else
                trend(i)=-1;                 
                hold on; plot(x,y,'r');
            end
        end
    end
end

Since one picture is worth millions of lines of code, below we present three examples of our model in action. The last plot corresponds to the recent Global Financial Crisis and shows how the uptrend weeks of 2009 followed the downtrend weeks of the year before.

It is straightforward to note that our rules perform very intuitively and stay in general agreement with market sentiment.

Trend Identification for FX Traders


When you think about inventing a new model for algorithmic trading, there are only three key elements you need to start your work: creativity, data, and a programming tool. Assuming that the last two are already in your possession, all that remains is seeking and finding a great new idea! No offense, but that’s the hardest part of the game.

To be successful in discovering new trading solutions you have to be completely open-minded, relaxed, and fully oriented within the information pertaining to your topic. Personally, after many years of programming and playing with digital signal processing techniques, I have discovered that the most essential aspect of well-grounded research is the data itself. The more I literally stared at time-series changing their properties, the more I was able to capture subtle differences, often overlooked by myself before, and with the aid of intuition and scientific experience some new ideas simply popped up.

Here I would like to share with you a part of this process.

In the Extracting Time-Series from Tick-Data article I outlined one of many possible ways of extracting FX time-series from very fine data sets. As a final product we obtained two files, namely:

audusd.bid.1h
audusd.ask.1h

corresponding to the Bid and Ask prices of the Forex AUDUSD pair’s trading history between Jan 2000 and May 2010. Each file contains two columns of numbers: Time (Modified Julian Day) and Price. The time resolution has been selected to be 1 hour.

Forex trading runs continuously for 24 hours a day, from Monday to Friday; therefore the data contain regular gaps corresponding to weekends. Since the coverage is more abundant compared to, for example, the much shorter trading windows of equities or ETFs around the world, it provides us with a better understanding of trading directions within each weekly time frame. Keeping that in mind, we might be interested in the directional information conveyed by the data as the seed of a potential new FX model.

For now, let’s focus solely on the initial pre-processing of the Bid and Ask time-series and on splitting the weeks into a common cell array.

% FX time-series analysis
% (c) Quant at Risk, 2012
%
% Part 1: Separation of the weeks
 
close all; clear all; clc;
 
% --analyzed FX pair
pair=['audusd'];
 
% --data
n=['./',pair,'/',pair];     % a common path to files
na=[n,'.ask.1h']; 
nb=[n,'.bid.1h'];
d1=load(na); d2=load(nb);   % loading Ask and Bid data
d=(d1+d2)/2;                % blending into a mid price
clear d1 d2

For the sake of simplicity, in the blending step above we decided to use a simple average of the Bid and Ask 1-hour prices for our further research. Next, we create a weekly template, $x$, for our data classification, and we find the total number of weeks available for analysis:

% time constraints from the data
t0=min(d(:,1));
tN=max(d(:,1));
t1=t0-1;
 
% weekly template for data classification
x=t1:7:tN+7;
 
% total number of weeks
nw=length(x)-1;
 
fprintf(upper(pair));
fprintf(' time-series: %3.0f weeks (%5.2f yrs)\n',nw,nw/52);

which in our case returns the welcome information:

AUDUSD time-series: 539 weeks (10.37 yrs)

The core of the programming exercise is to split all 539 weeks and save them into the cell array week. As we will see in the code section below, we want to make sure that each week contains the same number of points, therefore any data missing from our FX data provider are interpolated. To do that efficiently, we use the following function, which applies Piecewise Cubic Hermite Interpolating Polynomial (PCHIP) interpolation to fill gapped data points in the series:

function [x2,y2]=gapinterpol(x,y,dt)
    % build a uniform time grid spanning the data
    x_min=x(1);
    x_max=x(end);
    x2=(x_min:dt:x_max);
    % interpolate the gaps with a piecewise cubic Hermite polynomial
    y2=pchip(x,y,x2);
end
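
A quick usage example on a toy series shows the gap filling at work (my example; the values are arbitrary):

% an hourly toy series with a 3-hour gap between the 3rd and 4th points
x=[0 1/24 2/24 6/24 7/24];
y=[1.00 1.01 1.02 0.99 0.98];
[x2,y2]=gapinterpol(x,y,1/24);  % x2 is now uniformly sampled every hour
length(x2)                      % returns 8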

We realize the separation of weeks in our program as follows:

week={};  % an empty cell array
avdt=[]; 
 
for i=1:nw
    % split FX signal according to week
    [r,c,v]=find(d(:,1)>x(i) & d(:,1)<x(i+1));
    x1=d(r,1); y1=d(r,2);
    % interpolate gaps, use 1-hour bins
    dt=1/24;
    [x2,y2]=gapinterpol(x1,y1,dt);
    % check the average sampling time; it should equal dt
    s=0;
    for j=1:length(x2)-1
        s=s+(x2(j+1)-x2(j));
    end
    tmp=s/(length(x2)-1);
    avdt=[avdt; tmp];
    % store the week signal in a cell array
    tmp=[x2; y2]; tmp=tmp';
    week{i}=tmp;
end
fprintf('average sampling after interpolation = %10.7f [d]\n',max(avdt));

where as a check-up we get:

average sampling after interpolation =  0.0416667 [d]

which corresponds to the expected value of $1/24$ day to a sufficient approximation.

A quick visual verification of our signal processing,

scrsz = get(0,'ScreenSize');
h=figure('Position',[70 scrsz(4)/2 scrsz(3)/1.1 scrsz(4)/2],'Toolbar','none');
hold off;
for i=1:nw
    w=week{i};
    x=w(:,1); y=w(:,2);
    % plot weekly signal
    hold on; plot(x,y,'k');
end
xlim([0 100]);

uncovers our desired result:

AUD/USD
