Quantitative Analysis, Risk Management, Modelling, Algo Trading, and Big Data Analysis

## Rebinning Tick-Data for FX Algo Traders

If you work or intend to work with FX data in order to build and backtest your own FX models, the Historical Tick-Data of Pepperstone.com is probably the best place to kick off your algorithmic experience. As for now, they offer tick-data sets of 15 most frequently traded currency pairs since May 2009. Some of the unzip’ed files (one month data) reach over 400 MB in size, i.e. storing 8.5+ millions of lines with a tick resolution for both bid and ask “prices”. A good thing is you can download them all free of charge and their quality is regarded as very high. A bad thing is there is 3 month delay in data accessibility.

Dealing with a rebinning process of tick-data up, that’s a different story and the subject of this post. We will see how efficiently you can turn Pepperstone’s Tick-Data set(s) into 5-min time-series as an example. We will make use of scripting in bash (Linux/OS X) supplemented with data processing in Python.

Data Structure

You can download Pepperstone’s historical tick-data from here, month by month, pair by pair. Their inner structure follows the same pattern, namely:

$head AUDUSD-2014-09.csv AUD/USD,20140901 00:00:01.323,0.93289,0.93297 AUD/USD,20140901 00:00:02.138,0.9329,0.93297 AUD/USD,20140901 00:00:02.156,0.9329,0.93298 AUD/USD,20140901 00:00:02.264,0.9329,0.93297 AUD/USD,20140901 00:00:02.265,0.9329,0.93293 AUD/USD,20140901 00:00:02.265,0.93289,0.93293 AUD/USD,20140901 00:00:02.268,0.93289,0.93295 AUD/USD,20140901 00:00:02.277,0.93289,0.93296 AUD/USD,20140901 00:00:02.278,0.9329,0.93296 AUD/USD,20140901 00:00:02.297,0.93288,0.93296 The columns, from left to right, represent respectively: a pair name, the date and tick-time, the bid price, and the ask price. Pre-Processing Here, for each .csv file, we aim to split the date into year, month, and day separately, and remove commas and colons to get raw data ready to be read in as a matrix (array) using any other programming language (e.g. Matlab or Python). The matrix is mathematically intuitive data structure therefore making direct reference to any specific column of it makes any backtesting engine running with its full thrust. Let’s play with AUDUSD-2014-09.csv data file. Working in the same directory where the file is located we begin with writing a bash script (pp.scr) that contains: 1 2 3 4 5 6 7 8 9 10 11 # pp.scr # Rebinning Pepperstone.com Tick-Data for FX Algo Traders # (c) 2014 QuantAtRisk, by Pawel Lachowicz clear echo "..making a sorted list of .csv files" for i in$1-*.csv; do echo ${i##$1-} $i${i##.csv}; done | sort -n | awk '{print $2}' >$1.lst   python pp.py head AUDUSD.pp

that you run in Terminal:

$chmod +x pp.scr$ ./pp.scr AUDUSD

where the first command makes sure the script becomes executable (you need to perform this task only once). Lines #7-8 of our script, in fact, look for all .csv data files in the local directory starting with AUDUSD- prefix and create their list in AUDUSD.lst file. Since we work with AUDUSD-2014-09.csv file only, the AUDUSD.lst file will contain:

$cat AUDUSD.lst AUDUSD-2014-09.csv as expected. Next, we utilise the power and flexibility of Python in the following way: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 # pp.py import csv fnlst="AUDUSD.lst" fnout="AUDUSD.pp" for lstline in open(fnlst,'r').readlines(): fncur=lstline[:-1] #print(fncur) with open(fnout,'w') as f: writer=csv.writer(f,delimiter=" ") i=1 # counts a number of lines with tick-data for line in open(fncur,'r').readlines(): if(i<=5200): # replace with (i>0) to process an entire file #print(line) year=line[8:12] month=line[12:14] day=line[14:16] hh=line[17:19] mm=line[20:22] ss=line[23:29] bidask=line[30:] writer.writerow([year,month,day,hh,mm,ss,bidask]) i+=1 It is a pretty efficient way to open really a big file and process its information line by line. Just for further purpose of display, in the code we told computer to process only first 5,200 of lines. The output of lines #10-11 of pp.scr is the following: 2014 09 01 00 00 01.323 "0.93289,0.93297 " 2014 09 01 00 00 02.138 "0.9329,0.93297 " 2014 09 01 00 00 02.156 "0.9329,0.93298 " 2014 09 01 00 00 02.264 "0.9329,0.93297 " 2014 09 01 00 00 02.265 "0.9329,0.93293 " since we allowed Python to save bid and ask information as one string (due to a variable number of decimal digits). In order to clean this mess we continue: 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 # pp.scr (continued) echo "..removing token: comma" sed 's/,/ /g' AUDUSD.pp >$1.tmp rm AUDUSD.pp   echo "..removing token: double quotes" sed 's/"/ /g' $1.tmp >$1.tmp2 rm $1.tmp echo "..removing empty lines" sed -i '/^[[:space:]]*$/d' $1.tmp2 mv$1.tmp2 AUDUSD.pp   echo "head..." head AUDUSD.pp echo "tail..." tail AUDUSD.pp

what brings us to pre-processed data:

..removing token: comma ..removing token: double quotes ..removing empty lines head... 2014 09 01 00 00 01.323 0.93289 0.93297 2014 09 01 00 00 02.138 0.9329 0.93297 2014 09 01 00 00 02.156 0.9329 0.93298 2014 09 01 00 00 02.264 0.9329 0.93297 2014 09 01 00 00 02.265 0.9329 0.93293 2014 09 01 00 00 02.265 0.93289 0.93293 2014 09 01 00 00 02.268 0.93289 0.93295 2014 09 01 00 00 02.277 0.93289 0.93296 2014 09 01 00 00 02.278 0.9329 0.93296 2014 09 01 00 00 02.297 0.93288 0.93296 tail... 2014 09 02 00 54 39.324 0.93317 0.93321 2014 09 02 00 54 39.533 0.93319 0.93321 2014 09 02 00 54 39.543 0.93318 0.93321 2014 09 02 00 54 39.559 0.93321 0.93321 2014 09 02 00 54 39.784 0.9332 0.93321 2014 09 02 00 54 39.798 0.93319 0.93321 2014 09 02 00 54 39.885 0.93319 0.93325 2014 09 02 00 54 39.886 0.93319 0.93321 2014 09 02 00 54 40.802 0.9332 0.93321 2014 09 02 00 54 48.829 0.93319 0.93321

Personally, I love that part as you can learn how to do simple but necessary text file operations by typing single lines of Unix/Linux commands. Good luck for those who try to repeat the same in Microsoft Windows not spending more than 30 sec for doing it.

Rebinning: 5-min Data

The rebinning has many schools. It’s the art for some people. We just want to have the job done. I opt for simplicity and understanding of the data we deal with. Imagine we have two adjacent 5 min bins with a tick history of trading:

We want to derive the closest possible (or most fair) price estimation every 5 min, denoted in the above painting by a red marker. The old-school approach is to take the average over a number (larger than 5) of tick data points from the left and from the right. That creates the under- or overestimation of the mid-price.

If we trade live, every 5 min we receive an information on the last tick point before the minute hits 5 and we wait for the next tick point after 5 (blue markers). Taking the average of their prices (mid-price) makes most of sense. The precision we look at here is sometimes $10^{-5}$. It is not much of significance if our position is small, but if it is not, the mid-price may start playing a crucial role.

The cons of the old-school approach: a possible high volatility among all tick-data within last 5 minutes that we neglect.

The following Python code (pp2.py) performs 5-min rebinning for our pre-processed AUDUSD-2014-09 file:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 # pp2.py import csv import numpy as np   def convert(data): tempDATA = [] for i in data: tempDATA.append([float(j) for j in i.split()]) return np.array(tempDATA).T   fname="AUDUSD.pp"   with open(fname) as f: data = f.read().splitlines()   #print(data)   i=1 for d in data: list=[s for s in d.split(' ')] #print(list) # remover empty element in the list dd=[x for x in list if x] #print(dd) tmp=convert(dd) #print(tmp) if(i==1): a=tmp i+=1 else: a = np.vstack([a, tmp]) i+=1   N=i-1 #print("N = %d" % N)   # print the first line tmp=np.array([a[1][0],a[1][1],a[1][2],a[1][3],a[1][4],0.0,(a[1][6]+a[1][7])/2]) print("%.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f" % (tmp[0],tmp[1],tmp[2],tmp[3],tmp[4],tmp[5],tmp[6])) m=tmp   # check the boundary conditions (5 min bins) for i in xrange(2,N-1): if( (a[i-1][4]%5!=0.0) and (a[i][4]%5==0.0)):   # BLUE MARKER No. 1 # (print for i-1) #print(" %.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f %10.6f" % # (a[i-1][0],a[i-1][1],a[i-1][2],a[i-1][3],a[i-1][4],a[i-1][5],a[i-1][6],a[i-1][7])) b1=a[i-1][6] b2=a[i][6] a1=a[i-1][7] a2=a[i][7] # mid-price, and new date for 5 min bin bm=(b1+b2)/2 am=(a1+a2)/2 Ym=a[i][0] Mm=a[i][1] Dm=a[i][2] Hm=a[i][3] MMm=a[i][4] Sm=0.0 # set seconds to zero   # RED MARKER print("%.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f" % (Ym,Mm,Dm,Hm,MMm,Sm,(bm+am)/2)) tmp=np.array([Ym,Mm,Dm,Hm,MMm,Sm,(bm+am)/2]) m=np.vstack([m, tmp])   # BLUE MARKER No. 2 # (print for i) #print(" %.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f %10.6f" % # (a[i][0],a[i][1],a[i][2],a[i][3],a[i][4],a[i][5],a[i][6],a[i][7]))

what you run in pp.scr file as:

31 32 33 # pp.scr (continued)   python pp2.py > AUDUSD.dat

in order to get 5-min rebinned FX time-series as follows:

$head AUDUSD.dat 2014 9 1 0 0 0.000 0.932935 2014 9 1 0 5 0.000 0.933023 2014 9 1 0 10 0.000 0.932917 2014 9 1 0 15 0.000 0.932928 2014 9 1 0 20 0.000 0.932937 2014 9 1 0 25 0.000 0.933037 2014 9 1 0 30 0.000 0.933075 2014 9 1 0 35 0.000 0.933070 2014 9 1 0 40 0.000 0.933092 2014 9 1 0 45 0.000 0.933063 That concludes our efforts. Happy rebinning! ## Gap-on-Open Profitable Trading Strategy After a longer while, QuantAtRisk is back to business. As an algo trader I have been always tempted to test a gap-on-open trading strategy. There were various reasons standing behind it but the most popular one was always omni-discussed: good/bad news on the stock. And what? The stock price skyrocketed/dropped down on the following days. When we approach such price patterns, we talk about triggers or triggered events. The core of the algorithm’s activity is the trigger identification and taking proper actions: to go long or short. That’s it. In both cases we want to make money. In this post we will design the initial conditions for our gap-on-open trading strategy acting as the triggers and we will backtest a realistic scenario of betting our money on those stocks that opened higher on the next trading day. Our goal is to find the most optimal holding period for such trades closed with a profit. Portfolio Our strategy can be backtested using any$N$-asset portfolio. Here, for simplicity, let us use a random subset of 10 stocks (portfolio.lst) being a part of a current Dow Jones Index: AXP CSCO DIS IBM JNJ KO NKE PG UTX XOM In Matlab, we fetch the stock prices from Google Finance data provider accessible via Quandl.com’s Matlab API (see this post for its setup in Matlab). We commence writing our main backtesting code as follows: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 % Gap on Open Trading Strategy % Fetching stock prices via Quandl and Strategy Backtesting % % (c) 2014 by Pawel Lachowicz, QuantAtRisk.com clear all; close all; clc; fname=['portfolio.lst']; % Model's parameter #1 (years) parm1=1; ndays=parm1*365; lday=datenum('2014-08-05'); % fetching stock data [Top,Thp,Tlp,Tcp,N,ntdays]=FetchQuandl(fname,ndays,lday); where we use a pre-designed function of FetchQuandl to import 4 separate price-series of each stock’s open (Top), high (Thp), low (Tlp), and close (Tcp) daily prices: function [Top,Thp,Tlp,Tcp,N,ntdays]=FetchQuandl(fname,ndays,lday) % Read the list of Dow Jones components fileID = fopen(fname); tmp = textscan(fileID,'%s'); fclose(fileID); components=tmp{1}; % a list as a cell array % Read in the list of tickers and internal codes from Quandl.com [~,text,~] = xlsread('QuandlStockCodeListUS.xlsx'); quandlc=text(:,1); % again, as a list in a cell array quandlcode=text(:,3); % corresponding Quandl's Price Code % fetch stock data for last ‘ndays’ date2=datestr(lday,'yyyy-mm-dd'); % from date1=datestr(lday-ndays,'yyyy-mm-dd'); % to Rop={}; Tcp={}; % scan all tickers and fetch the data from Quandl.com for i=1:length(components) for j=1:length(quandlc) if(strcmp(components{i},quandlc{j})) fprintf('%4.0f %s\n',i,quandlc{j}); fts=0; [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ... 'authcode','PutHereYourQuandlCode',... 'start_date',date1,'end_date',date2); cp=fts2mat(fts.Close,1); Tcp{i}=cp; % close price-series op=fts2mat(fts.Open,1); Top{i}=op; % open price-series hp=fts2mat(fts.High,1); Thp{i}=hp; % high price lp=fts2mat(fts.Low,1); Tlp{i}=lp; % low price %Rcp{i}=cp(2:end,2)./cp(1:end-1,2)-1; % return-series cp end end end N=length(components); ntdays=length(Tcp{1}); end Please note that in line #12 we specified number of years, i.e. how far our backtest should be extended backward in time (or number of calendar days; see line #13) from the day specified in line #14 (last day). Trading Model First, let us design the trading strategy. We scan concurrently four price-series for each stock separately. We define the strategy’s trigger as follows: i.e. if a stock open price on day$t$was higher than the close price on the day$t-1$and the lowest prices on day$t$was higher than the highest price on day$t-1$. Having that, we make a BUY LONG decision! We buy that stock on the next day at its market price (close price). This approach should remove the slippage bias effectively (see more on slippage in stock trading here). Now, we run the backtest on each stock and each open trade. We select the second parameter (parm2) to be a number of days, i.e. how long we hold the stock. In the following piece of code, let us allow to sell the stock after/between 1 to 21 calendar days ($\pm$weekend or public holidays time period): 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 % pre-defined matrix for backtest final results results=[]; for parm2=0:20 cR=[]; for i=1:N % just for a purpose of plotting of price-series if(i==1) % open (blue color) plot(Top{i}(:,1),Top{i}(:,2),'') hold on % close (red color) plot(Tcp{i}(:,1),Tcp{i}(:,2),'r') hold on % high (green color) plot(Thp{i}(:,1),Thp{i}(:,2),'g') % xlabel('Days'); ylabel('AXP Stock Prices [US$]'); end   Tbuy=[]; for t=2:ntdays % define indicators ind1=Tcp{i}(t-1,2); % cp on (t-1)day ind2=Thp{i}(t-1,2); % hp on (t-1)day ind3=Top{i}(t,2); % op on (t)day ind4=Tlp{i}(t,2); % lp on (t)day % detect trigger if(ind1<ind3)&&(ind2<ind4) % plotting only for AXP if(i==1) hold on; plot(Top{i}(t,1),Top{i}(t,2),'o'); end % date of a trigger tday=Top{i}(t,1); nextbusdate=busdate(tday,1); % find next trading date Tbuy=[Tbuy; nextbusdate]; end end Tsell=busdate(Tbuy+parm2,1);

Here, in lines #57 and #60 we constructed time array storing physical information on those days. Now, we will use them to check the price on trade’s open and close and derive profit and loss for each stock:

62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 R=[]; for k=1:length(Tbuy) j=find(Tbuy(k)==Tcp{i}(:,1)); pbuy=Tcp{i}(j,2); j=find(Tsell(k)==Tcp{i}(:,1)); psell=Tcp{i}(j,2); ret=(psell/pbuy-1); % return per trade R=[R; ret]; end   compR=prod(R+1)-1; % compound return per stock cR=[cR; compR];   end   results=[results cR];   end

In the inner loop (lines #24 to #75, i.e. tracking a number of stocks in portfolio; index $i$, here 1 to 10) we capture all trades per stock (lines #63-70) and calculate a multi-period compound return (line #72) as if we were trading that stock solely using our model.

For instance, for stock $i=1$ (AXP) from our portfolio, our code displays 1-year price-series:
where days meeting our trigger criteria have been denoted by open-circle markers. If you now re-run the backtest making a gentle substitution in line #24 now to be:

24 for i=1:1

we can find that by running through some extra lines of code as defined:

81 82 83 84 figure(2) stem((0:20),100*results) xlabel('Holding Period [days]'); ylabel('AXP: Compound Return [%]');

we obtain an appealing result:

The chart reveals that for AXP, over past 251 days (since Aug/4 2014 backwards), we had 16 triggers therefore 16 trades and, surprisingly, regardless of holding period, the compound return from all closed trades was highly positive (profitable).

This is not the case if we consider, for example, $i=4$, IBM stock:
This result points that for different holding periods (and different stocks of course) certain extra trading indicators should be applied to limit the losses (e.g. profit targets).

If we traded a whole portfolio using our gap-on-open model, we would end up with very encouraging result:
where for each holding period we displayed the averaged over 10 stocks compound return. Taking into account the global up-trend in the US stock markets between August of 2013 and 2014, this strategy is worth its consideration with any further modifications (e.g. considering short or both long and short triggers, FX time-series, etc.).

Someone wise once said: Sometimes you win. Sometimes you learn. In algo trading we all learn to win.

In next post…
Marginal Value-at-Risk for Portfolio Managers

Forecasting risk in algorithmic stock trading is of paramount importance for everyone. You should always look for the ways how to detect sudden price changes and take immediate actions to protect your investments.

Imagine you opened a new long position last Wednesday for NASDAQ:NVDA buying 1500 shares at the market price of USD16.36. On the next day price goes down to USD15.75 at the end of the session. You are down 3.87% or almost a grand in one day. If you can handle that, it’s okay but if the drop were more steep? Another terrorist attack, unforeseen political event, North Korea nuclear strike? Then what? You need to react!

If you have information, you have options in your hands. In this post we will see how one can use real-time data of stock prices displayed on Google Finance website, fetch and record them on your computer. Having them, you can build your own warning system for sudden price swings (risk management) or run the code in the background for a whole trading session (for any stock, index, etc. accessible through Google Finance) and capture asset prices with an intraday sampling (e.g. every 10min, 30min, 1h, and so on). From this point only your imagination can stop you from using all collected data.

Hacking with Python

If you ever dreamt of becoming a hacker, this is your chance to shine! I have got my inspiration after reading the book of Violent Python: A Cookbook for Hackers, Forensic Analysts, Penetration Testers and Security Engineers by TJ O’Connor. A powerful combination of the beauty and the beast.

The core of our code will be contained in a small function which does the job. For a specified Google-style ticker (query), it fetches the data directly from the server returning the most current price of an asset:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 # Hacking Google Finance in Real-Time for Algorithmic Traders # # (c) 2014 QuantAtRisk.com, by Pawel Lachowicz   import urllib, time, os, re, csv   def fetchGF(googleticker): url="http://www.google.com/finance?&q=" txt=urllib.urlopen(url+googleticker).read() k=re.search('id="ref_(.*?)">(.*?)<',txt) if k: tmp=k.group(2) q=tmp.replace(',','') else: q="Nothing found for: "+googleticker return q

Just make sure that a Google ticker is correctly specified (as will see below). Next, let’s display on the screen our local time and let’s force a change of the system time to the one corresponding to New York City, NY. The latter assumption we make as we would like to track the intraday prices of stock(s) traded at NYSE or NASDAQ. However, if you are tracking FTSE 100 index, the Universal Time (UTC) of London is advisable as an input parameter.

18 19 20 21 22 23 24 25 26 27 # display time corresponding to your location print(time.ctime()) print   # Set local time zone to NYC os.environ['TZ']='America/New_York' time.tzset() t=time.localtime() # string print(time.ctime()) print

Having that, let us define a side-function combine which we will use to glue all fetched data together into Python’s list variable:

29 30 31 32 33 34 def combine(ticker): quote=fetchGF(ticker) # use the core-engine function t=time.localtime() # grasp the moment of time output=[t.tm_year,t.tm_mon,t.tm_mday,t.tm_hour, # build a list t.tm_min,t.tm_sec,ticker,quote] return output

As an input, we define Google ticker of our interest:

36 ticker="NASDAQ:AAPL"

for which we open a new text file where all queries will be saved in real-time:

39 40 41 42 # define file name of the output record fname="aapl.dat" # remove a file, if exist os.path.exists(fname) and os.remove(fname)

Eventually, we construct the final loop over trading time. Here, we fetch the last data at 16:00:59 New York time. The key parameter in the game is freq variable where we specify the intraday sampling (in seconds). From my tests, using a private Internet provider, I have found that the most optimal sampling was 600 sec (10 min). Somehow, for shorter time intervals, Google Finance detected too frequent queries sent from my IP address. This test succeed from a different IP location, therefore, feel free to play with your local Internet network to find out what is the lowest available sampling time for your geolocation.

43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 freq=600 # fetch data every 600 sec (10 min)   with open(fname,'a') as f: writer=csv.writer(f,dialect="excel") #,delimiter=" ") while(t.tm_hour<=16): if(t.tm_hour==16): while(t.tm_min<01): data=combine(ticker) print(data) writer.writerow(data) # save data in the file time.sleep(freq) else: break else: for ticker in tickers: data=combine(ticker) print(data) writer.writerow(data) # save data in the file time.sleep(freq)   f.close()

To see how the above code works in practice, I conducted a test on Jan/9 2014, starting at 03:31:19 Sydney/Australia time, corresponding to 11:31:19 New York time. Setting the sampling frequency to 600 sec, I was able to fetch the data in the following form:

Thu Jan 9 03:31:19 2014   Wed Jan 8 11:31:19 2014   [2014, 1, 8, 11, 31, 19, '543.71'] [2014, 1, 8, 11, 41, 22, '543.66'] [2014, 1, 8, 11, 51, 22, '544.22'] [2014, 1, 8, 12, 1, 23, '544.80'] [2014, 1, 8, 12, 11, 24, '544.32'] [2014, 1, 8, 12, 21, 25, '544.86'] [2014, 1, 8, 12, 31, 27, '544.47'] [2014, 1, 8, 12, 41, 28, '543.76'] [2014, 1, 8, 12, 51, 29, '543.86'] [2014, 1, 8, 13, 1, 30, '544.00'] [2014, 1, 8, 13, 11, 31, 'Nothing found for: NASDAQ:AAPL'] [2014, 1, 8, 13, 21, 33, '543.32'] [2014, 1, 8, 13, 31, 34, '543.84'] [2014, 1, 8, 13, 41, 36, '544.26'] [2014, 1, 8, 13, 51, 37, '544.10'] [2014, 1, 8, 14, 1, 39, '544.30'] [2014, 1, 8, 14, 11, 40, '543.88'] [2014, 1, 8, 14, 21, 42, '544.29'] [2014, 1, 8, 14, 31, 45, '544.15'] ...

As you can notice, they were displayed on the screen (line #59 in the code) in the form of Python’s list. It is important to note that the time we make an effort to capture and associate it with fetched asset price (query) is the computer’s system time, therefore please don’t expect regular time intervals as one may get from a verified market data providers. We are hacking in real-time! However, if you think about the data themselves, this time precision is not of great importance. As long as we fetch the data every freq seconds, that sufficiently allows us to build a risk management system or even to measure a rolling volatility of an asset. Your trading model will benefit anyway.

Have also a note that if our Internet connection fails or there are some disturbances of a different kind, we will miss the data in a sent query as visible in the example above.

Looks exciting? Give me High Five! and say Hell Yeah!

Code Modification: Portfolio of Assets

The presented Python code can be very easily modified if you wish to try fetching data for a couple of assets concurrently every freq seconds. Simply extend and amend all the lines starting at row #36, for example in the following form:

36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 tickers=["NASDAQ:AAPL","NASDAQ:GOOG","NASDAQ:BIDU","NYSE:IBM", \ "NASDAQ:INTC","NASDAQ:MSFT","NYSEARCA:SPY"]   # define the name of an output file fname="portfolio.dat" # remove a file, if exist os.path.exists(fname) and os.remove(fname)   freq=600 # fetch data every 600 sec (10 min)   with open(fname,'a') as f: writer=csv.writer(f,dialect="excel") #,delimiter=" ") while(t.tm_hour<=16): if(t.tm_hour==16): while(t.tm_min<01): #for ticker in tickers: data=combine(ticker) print(data) writer.writerow(data) time.sleep(freq) else: break else: for ticker in tickers: data=combine(ticker) print(data) writer.writerow(data) time.sleep(freq)   f.close()

That’s it! For the sake of real-time verification, here is a screenshot how does it work:

Thu Jan 9 07:01:43 2014   Wed Jan 8 15:01:43 2014   [2014, 1, 8, 15, 1, 44, 'NASDAQ:AAPL', '543.55'] [2014, 1, 8, 15, 1, 44, 'NASDAQ:GOOG', '1140.30'] [2014, 1, 8, 15, 1, 45, 'NASDAQ:BIDU', '182.65'] [2014, 1, 8, 15, 1, 45, 'NYSE:IBM', '187.97'] [2014, 1, 8, 15, 1, 46, 'NASDAQ:INTC', '25.40'] [2014, 1, 8, 15, 1, 47, 'NASDAQ:MSFT', '35.67'] [2014, 1, 8, 15, 1, 47, 'NYSEARCA:SPY', '183.43'] [2014, 1, 8, 15, 11, 48, 'NASDAQ:AAPL', '543.76'] [2014, 1, 8, 15, 11, 49, 'NASDAQ:GOOG', '1140.06'] [2014, 1, 8, 15, 11, 49, 'NASDAQ:BIDU', '182.63'] [2014, 1, 8, 15, 11, 50, 'NYSE:IBM', '187.95'] [2014, 1, 8, 15, 11, 51, 'NASDAQ:INTC', '25.34'] [2014, 1, 8, 15, 11, 52, 'NASDAQ:MSFT', '35.67'] [2014, 1, 8, 15, 11, 53, 'NYSEARCA:SPY', '183.34'] ...

where we can see that we were able to grab the prices of 6 stocks and 1 ETF (Exchange Trading Fund tracking S&P500 Index) every 10 min.

Reflection

You may wonder whether hacking is legal or not? The best answer I find in the words of Gordon Gekko: Someone reminded me I once said “Greed is good”,

## Pre-Processing of Asset Price Series for Portfolio Optimization

Portfolio Optimization is a significant component of Matlab’s Financial Toolbox. It provides us with ready-to-use solution in finding optimal weights of assets that we consider for trading deriving them based on the historical asset performance. From a practical point of view, we can include it in our algorithmic trading strategy and backtest its applicability under different initial conditions. This is a subject of my next up-coming post. However, before we can enjoy the view from the peak, we need to climb the mountain first.

In Matlab, the portfolio is created as a dedicated object of the same name. It doesn’t read the raw stock data. We need to feed that beast. Two major ingredients satisfy the input: a vector of the expected asset returns and a covariance matrix. Matlab helps us to estimate these moments but first we need to deliver asset data in a digestable form.

In this post we will see how one can quickly download the stock data from the Internet based on our own stock selection and pre-process them for solving portfolio optimization problem in Matlab.

Initial Setup for Portfolio Object

Let’s say that at any point of time you have your own list of stocks you wish to buy. For simplicity let’s also assume that the list contains stocks traded on NYSE or NASDAQ. Since you have been a great fun of this game, now you are almost ready to buy what you jotted down on your ShoppingList.lst. Here, an example of 10 tech stocks:

AAPL AOL BIDU GOOG HPQ IBM INTC MSFT NVDA TXN

They will constitute your portfolio of stocks. The problem of portfolio optimization requires a look back in time in the space of returns obtained in trading by each stock. Based on them the Return Proxy and Risk Proxy can be found.

The return matrix $R$ of dimensions $(N-1)\times M$ where $N$ stands for number of historical prices (e.g. derived daily, or monthly, etc.) and $M$ for the number of stocks in our portfolio, is required by Matlab as an input. We will see how does it work in next post. For now let’s solely focus on creation of this matrix.

In the article Create a Portfolio of Stocks based on Google Finance Data fed by Quandl I discussed Quandl.com as an attractive data provider for US stocks. Here, we will follow this solution making use of Quandl resources to pull out the stock price series for our shopping list. Ultimately, we aim at building a function, here: QuandlForPortfolio, that does the job for us:

% Pre-Processing of Asset Price Series for Portfolio Optimization in Matlab % (c) 2013, QuantAtRisk.com, by Pawel Lachowicz   clear all; close all; clc;   % Input Parameters n=1*365; tickers='ShoppingList.lst'; qcodes='QuandlStockCodeListUS.xlsx';   [X,Y,R,AssetList] = QuandlForPortfolio(n,tickers,qcodes);

We call this function with three input parameters. The first one, $n$, denotes a number of calendar days from today (counting backwards) for which we wish to retrieve the stock data. Usually, 365 days will correspond to about 250$-$252 trading days. The second parameter is a path/file name to our list of stock (desired to be taken into account in the portfolio optimisation process) while the last input defines the path/file name to the file storing stocks’ tickers and associated Quandl Price Codes (see here for more details).

Feeding the Beast

The QuandlForPortfolio Matlab function is an extended version of the previously discussed solution. It contains an important correcting procedure for the data fetched from the Quandl servers. First, let’s have a closer look on the function itself:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 % Function assists in fetching Google Finance data from the Quandl.com % server for a given list of tickers of stocks traded on NYSE or % NASDAQ. Data are retrieved for last 'n' days with daily sampling. % % INPUT % n : number of calendar days from 'today' (e.g. 365 would % correspond to about 252 business days) % tickers : a path/file name of a text file listing tickers % qcodes : a path/file name of Excel workbook (.xlsx) containing a list % of tickers and Quandl Price Codes in the format of % [Ticker,Stock Name,Price Code,Ratios Code,In Market?] % OUTPUT % X0 : [Nx1] column vector with days % Y0 : [NxM] matrix with Close Prices for M stocks % R0 : [(N-1)xM] matrix of Retruns % AssetList : a list of tickers (cell array) % % (c) 2013, QuantAtRisk.com, by Pawel Lachowicz   function [X0,Y0,R0,AssetList0] = QuandlForPortfolio(n,tickers,qcodes) fileID = fopen(tickers); tmp = textscan(fileID, '%s'); fclose(fileID); AssetList=tmp{1}; % a list as a cell array   % Read in the list of tickers and internal Quandl codes % [~,text,~] = xlsread(qcodes); quandlc=text(:,1); % again, as a list in a cell array quandlcode=text(:,3); % corresponding Quandl's Price Code   date1=datestr(today-n,'yyyy-mm-dd'); % from date2=datestr(today,'yyyy-mm-dd'); % to   % Fetch the data from Quandl.com % QData={}; for i=1:length(AssetList) for j=1:length(quandlc) if(strcmp(AssetList{i},quandlc{j})) fprintf('%4.0f %s\n',i,quandlc{j}); fts=0; [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ... 'authcode','x',... 'start_date',date1,'end_date',date2,'collapse','daily'); QData{i}=fts; end end end   % Post-Processing of Fetched Data % % create a list of days across all tickers TMP=[]; for i=1:length(QData) tmp=fts2mat(QData{i},1); tmp=tmp(:,1); TMP=[TMP; tmp]; end ut=unique(TMP); % use that list to find these days that are not present % among all data sets TMP=[]; for i=1:length(QData) tmp=fts2mat(QData{i},1); tmp=tmp(:,1); TMP=[TMP; setdiff(ut,tmp)]; end ut=unique(TMP); % finally, extract Close Prices from FTS object and store them % in Y0 matrix, plus corresponding days in X0 X0=[]; Y0=[]; for i=1:length(QData) tmp=fts2mat(QData{i},1); cp=[]; for j=1:size(tmp,1) [r,~,~]=find(ut==tmp(j,1)); if(isempty(r)) cp=[cp; tmp(j,5)]; % column 5 corresponds to Close Price if(i<2) % create a time column vector listing days % common among all data sets X0=[X0; tmp(j,1)]; end end end Y0=[Y0 cp]; end % transform Close Prices into Returns, R(i)=cp(i)/cp(i-1)-1 R0=tick2ret(Y0); AssetList0=AssetList'; end

The main bottleneck comes from the fact that Matlab’s portfolio object demands an equal number of historical returns ($N-1$) in the matrix of $R$ for all $M$ assets. We design the function in the way that it sets the common timeframe for all stocks listed on our shopping list. Of course, we ensure that all stocks were traded in the markets for about $n$ last days (rough estimation).

Now, the timeframe of $n$ last days should be understood as a first approximation. We fetch the data from Quandl (numeric date, Open, High, Low, Close, Volume) and save them in the cell array QData (lines #37-49) for each stock separately as FTS objects (Financial Time-Series objects; see Financial Toolbox). However, it may occur that not every stock we fetched displays the same amount of data. That is why we need to investigate for what days and for what stocks we miss the data. We achieve that by scanning each FTS object and creating a unique list of all days for which we have data (lines #54-60).

Next, we loop again over the same data sets but now we compare that list with a list of all dates for each stock individually (lines #63-69), capturing (line #67) those dates that are missing. Their complete list is stored as a vector in line #69. Eventually, given that, we are able to compile the full data set (e.g. Close Prices; here line #80) for all stocks in our portfolio ensuring that we will include only those dates for which we have prices across all $M$ assets (lines #70-91).

Beast Unleashed

We test our data pre-processing simply by running the block of code listed above engaging QuandlForPortfolio function and we check the results in the Matlab’s command window as follows:

>> whos X Y R AssetList Name Size Bytes Class Attributes   AssetList 1x10 1192 cell R 250x10 20000 double X 251x1 2008 double Y 251x10 20080 double

what confirms the correctness of dimensions as expected.

At this stage, the aforementioned function can be used two-fold. First, we are interested in the portfolio optimisation and we look back at last $n$ calendar days since the most current one (today). The second usage is handy too. We consider our stocks on the shopping list and fetch for their last, say, $n=7\times365$ days with data. If all stocks were traded over past 7 years we should be able to collect a reach data set. If not, the function will adjust the beginning and end date to meet the initial time constrains as required for $R$ matrix construction. For the former case, we can use 7-year data sample for direct backtesting of algo models utilizing Portfolio Optimization.

Stay tuned as we will rock this land in the next post!

Any Questions?

Share them across QuantCove.com – the official Forum of QuantAtRisk.

Happy backtesting!

Sincerely Yours,
QaR Team

## Anxiety Detection Model for Stock Traders based on Principal Component Analysis

So, is there a way to disentangle the emotional part involved in trading from all other factors (e.g. the application of technical analysis, bad news consequences, IPOs, etc.) which are somehow easier to deduce? In this post I will try to make a quantitative attempt towards solving this problem. Although the solution will not have the final and closed form, my goal is to deliver an inspiration for quants and traders interested in the subject by putting a simple idea into practice: the application of Principal Component Analysis.

1. Principal Component Analysis (PCA)

Called by many as one of the most valuable results from applied linear algebra, the Principal Component Analysis, delivers a simple, non-parametric method of extracting relevant information from often confusing data sets. The real-world data usually hold some relationships among their variables and, as a good approximation, in the first instance we may suspect them to be of the linear (or close to linear) form. And the linearity is one of stringent but powerful assumptions standing behind PCA.

Imagine we observe the daily change of prices of $m$ stocks (being a part of your portfolio or a specific market index) over last $n$ days. We collect the data in $\boldsymbol{X}$, the matrix $m\times n$. Each of $n$-long vectors lie in an $m$-dimensional vector space spanned by an orthonormal basis, therefore they are a linear combination of this set of unit length basic vectors: $\boldsymbol{BX} = \boldsymbol{X}$ where a basis $\boldsymbol{B}$ is the identity matrix $\boldsymbol{I}$. Within PCA approach we ask a simple question: is there another basis which is a linear combination of the original basis that represents our data set? In other words, we look for a transformation matrix $\boldsymbol{P}$ acting on $\boldsymbol{X}$ in order to deliver its re-representation:
$$\boldsymbol{PX} = \boldsymbol{Y} \ .$$ The rows of $\boldsymbol{P}$ become a set of new basis vectors for expressing the columns of $\boldsymbol{X}$. This change of basis makes the row vectors of $\boldsymbol{P}$ in this transformation the principal components of $\boldsymbol{X}$. But how to find a good $\boldsymbol{P}$?

Consider for a moment what we can do with a set of $m$ observables spanned over $n$ days? It is not a mystery that many stocks over different periods of time co-vary, i.e. their price movements are closely correlated and follow the same direction. The statistical method to measure the mutual relationship among $m$ vectors (correlation) is achieved by the calculation of a covariance matrix. For our data set of $\boldsymbol{X}$:
$$\boldsymbol{X}_{m\times n} = \left[ \begin{array}{cccc} \boldsymbol{x_1} \\ \boldsymbol{x_2} \\ … \\ \boldsymbol{x_m} \end{array} \right] = \left[ \begin{array}{cccc} x_{1,1} & x_{1,2} & … & x_{1,n} \\ x_{2,1} & x_{2,2} & … & x_{2,n} \\ … & … & … & … \\ x_{m,1} & x_{m,2} & … & x_{m,n} \end{array} \right]$$
the covariance matrix takes the following form:
$$cov(\boldsymbol{X}) \equiv \frac{1}{n-1} \boldsymbol{X}\boldsymbol{X}^{T}$$ where we multiply $\boldsymbol{X}$ by its transposed version and $(n-1)^{-1}$ helps to secure the variance to be unbiased. The diagonal elements of $cov(\boldsymbol{X})$ are the variances corresponding to each row of $\boldsymbol{X}$ whereas the off-diagonal terms of $cov(\boldsymbol{X})$ represent the covariances between different rows (prices of the stocks). Please note that above multiplication assures us that $cov(\boldsymbol{X})$ is a square symmetric matrix $m\times m$.

All right, but what does it have in common with our PCA method? PCA looks for a way to optimise the matrix of $cov(\boldsymbol{X})$ by a reduction of redundancy. Sounds a bit enigmatic? I bet! Well, all we need to understand is that PCA wants to ‘force’ all off-diagonal elements of the covariance matrix to be zero (in the best possible way). The guys in the Department of Statistics will tell you the same as: removing redundancy diagonalises $cov(\boldsymbol{X})$. But how, how?!

Let’s come back to our previous notation of $\boldsymbol{PX}=\boldsymbol{Y}$. $\boldsymbol{P}$ transforms $\boldsymbol{X}$ into $\boldsymbol{Y}$. We also marked that:
$$\boldsymbol{P} = [\boldsymbol{p_1},\boldsymbol{p_2},…,\boldsymbol{p_m}]$$ was a new basis we were looking for. PCA assumes that all basis vectors $\boldsymbol{p_k}$ are orthonormal, i.e. $\boldsymbol{p_i}\boldsymbol{p_j}=\delta_{ij}$, and that the directions with the largest variances are the most principal. So, PCA first selects a normalised direction in $m$-dimensional space along which the variance in $\boldsymbol{X}$ is maximised. That is first principal component $\boldsymbol{p_1}$. In the next step, PCA looks for another direction along which the variance is maximised. However, because of orthonormality condition, it looks only in all directions perpendicular to all previously found directions. In consequence, we obtain an orthonormal matrix of $\boldsymbol{P}$. Good stuff, but still sounds complicated?

The goal of PCA is to find such $\boldsymbol{P}$ where $\boldsymbol{Y}=\boldsymbol{PX}$ such that $cov(\boldsymbol{Y})=(n-1)^{-1}\boldsymbol{XX}^T$ is diagonalised.

We can evolve the notation of the covariance matrix as follows:
$$(n-1)cov(\boldsymbol{Y}) = \boldsymbol{YY}^T = \boldsymbol{(PX)(PX)}^T = \boldsymbol{PXX}^T\boldsymbol{P}^T = \boldsymbol{P}(\boldsymbol{XX}^T)\boldsymbol{P}^T = \boldsymbol{PAP}^T$$ where we made a quick substitution of $\boldsymbol{A}=\boldsymbol{XX}^T$. It is easy to prove that $\boldsymbol{A}$ is symmetric. It takes a longer while to find a proof for the following two theorems: (1) a matrix is symmetric if and only if it is orthogonally diagonalisable; (2) a symmetric matrix is diagonalised by a matrix of its orthonormal eigenvectors. Just check your favourite algebra textbook. The second theorem provides us with a right to denote:
$$\boldsymbol{A} = \boldsymbol{EDE}^T$$ where $\boldsymbol{D}$ us a diagonal matrix and $\boldsymbol{E}$ is a matrix of eigenvectors of $\boldsymbol{A}$. That brings us at the end of the rainbow.

We select matrix $\boldsymbol{P}$ to be a such where each row $\boldsymbol{p_1}$ is an eigenvector of $\boldsymbol{XX}^T$, therefore
$$\boldsymbol{P} = \boldsymbol{E}^T .$$

Given that, we see that $\boldsymbol{E}=\boldsymbol{P}^T$, thus we find $\boldsymbol{A}=\boldsymbol{EDE}^T = \boldsymbol{P}^T\boldsymbol{DP}$ what leads us to a magnificent relationship between $\boldsymbol{P}$ and the covariance matrix:
$$(n-1)cov(\boldsymbol{Y}) = \boldsymbol{PAP}^T = \boldsymbol{P}(\boldsymbol{P}^T\boldsymbol{DP})\boldsymbol{P}^T = (\boldsymbol{PP}^T)\boldsymbol{D}(\boldsymbol{PP}^T) = (\boldsymbol{PP}^{-1})\boldsymbol{D}(\boldsymbol{PP}^{-1})$$ or
$$cov(\boldsymbol{Y}) = \frac{1}{n-1}\boldsymbol{D},$$ i.e. the choice of $\boldsymbol{P}$ diagonalises $cov(\boldsymbol{Y})$ where silently we also used the matrix algebra theorem saying that the inverse of an orthogonal matrix is its transpose ($\boldsymbol{P^{-1}}=\boldsymbol{P}^T$). Fascinating, right?! Let’s see now how one can use all that complicated machinery in the quest of looking for human emotions among the endless rivers of market numbers bombarding our sensors every day.

2. Covariances of NASDAQ, Eigenvalues of Anxiety

We will try to build a simple quantitative model for detection of the nervousness in the trading markets using PCA.

By its simplicity I will understand the following model assumption: no matter what the data conceal, the 1st Principal Component (1-PC) of PCA solution links the complicated relationships among a subset of stocks triggered by a latent factor attributed by us to a common behaviour of traders (human and pre-programmed algos). It is a pretty reasonable assumption, much stronger than, for instance, the influence of Saturn’s gravity on the annual silver price fluctuations. Since PCA does not tell us what its 1-PC means in reality, this is our job to seek for meaningful explanations. Therefore, a human factor fits the frame as a trial value very well.

Let’s consider the NASDAQ-100 index. It is composed of 100 technology stocks. The most current list you can find here: nasdaq100.lst downloadable as a text file. As usual, we will perform all calculations using Matlab environment. Let’s start with data collection and pre-processing:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 % Anxiety Detection Model for Stock Traders % making use of the Principal Component Analsis (PCA) % and utilising publicly available Yahoo! stock data % % (c) 2013 QuantAtRisk.com, by Pawel Lachowicz   clear all; close all; clc;     % Reading a list of NASDAQ-100 components nasdaq100=(dataread('file',['nasdaq100.lst'], '%s', 'delimiter', '\n'))';   % Time period we are interested in d1=datenum('Jan 2 1998'); d2=datenum('Oct 11 2013');   % Check and download the stock data for a requested time period stocks={}; for i=1:length(nasdaq100) try % Fetch the Yahoo! adjusted daily close prices between selected % days [d1;d2] tmp = fetch(yahoo,nasdaq100{i},'Adj Close',d1,d2,'d'); stocks{i}=tmp; disp(i); catch err % no full history available for requested time period end end

where, first, we try to check whether for a given list of NASDAQ-100’s components the full data history (adjusted close prices) are available via Yahoo! server (please refer to my previous post of Yahoo! Stock Data in Matlab and a Model for Dividend Backtesting for more information on the connectivity).

The cell array stocks becomes populated with two-dimensional matrixes: the time-series corresponding to stock prices (time,price). Since the Yahoo! database does not contain a full history for all stocks of our interest, we may expect their different time spans. For the purpose of demonstration of the PCA method, we apply additional screening of downloaded data, i.e. we require the data to be spanned between as defined by $d1$ and $d2$ variables and, additionally, having the same (maximal available) number of data points (observations, trials). We achieve that by:

31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 % Additional screening d=[]; j=1; data={}; for i=1:length(nasdaq100) d=[d; i min(stocks{i}(:,1)) max(stocks{i}(:,1)) size(stocks{i},1)]; end for i=1:length(nasdaq100) if(d(i,2)==d1) && (d(i,3)==d2) && (d(i,4)==max(d(:,4))) data{j}=sortrows(stocks{i},1); fprintf('%3i %1s\n',i,nasdaq100{i}) j=j+1; end end m=length(data);

The temporary matrix of $d$ holds the index of stock as read in from nasdaq100.lst file, first and last day number of data available, and total number of data points in the time-series, respectively:

>> d d = 1 729757 735518 3970 2 729757 735518 3964 3 729757 735518 3964 4 729757 735518 3969 .. .. .. .. 99 729757 735518 3970 100 729757 735518 3970

Our screening method saves $m=21$ selected stock data into data cell array corresponding to the following companies from our list:

 1 AAPL 7 ALTR 9 AMAT 10 AMGN 20 CERN 21 CHKP 25 COST 26 CSCO 30 DELL 39 FAST 51 INTC 64 MSFT 65 MU 67 MYL 74 PCAR 82 SIAL 84 SNDK 88 SYMC 96 WFM 99 XRAY 100 YHOO

Okay, some people say that seeing is believing. All right. Let’s see how it works. Recall the fact that we demanded our stock data to be spanned between ‘Jan 2 1998′ and ‘Oct 11 2013′. We found 21 stocks meeting those criteria. Now, let’s assume we pick up a random date, say, Jul 2 2007 and we extract for all 21 stocks their price history over last 90 calendar days. We save their prices (skipping the time columns) into $Z$ matrix as follows:

t=datenum('Jul 2 2007'); Z=[]; for i=1:m [r,c,v]=find((data{i}(:,1)<=t) & (data{i}(:,1)>t-90)); Z=[Z data{i}(r,2)] end

and we plot them all together:

plot(Z) xlim([1 length(Z)]); ylabel('Stock price (US$)'); xlabel('T-90d'); It’s easy to deduct that the top one line corresponds to Apple, Inc. (AAPL) adjusted close prices. The unspoken earlier data processing methodology is that we need to transform our time-series into the comparable form. We can do it by subtracting the average value and dividing each of them by their standard deviations. Why? For a simple reason of an equivalent way of their mutual comparison. We call that step a normalisation or standardisation of the time-series under investigation: [N,M]=size(Z); X=(Z-repmat(mean(Z),[N 1]))./repmat(std(Z),[N 1]); This represents the matrix$\boldsymbol{X}$that I discussed in a theoretical part of this post. Note, that the dimensions are reversed in Matlab. Therefore, the normalised time-series, % Display normalized stock prices plot(X) xlim([1 length(Z)]); ylabel('(Stock price-Mean)/StdDev'); xlabel('T-90d'); look like: For a given matrix of$\boldsymbol{X}$, its covariance matrix, % Calculate the covariance matrix, cov(X) CovX=cov(X); imagesc(CovX); as for data spanned 90 calendar day back from Jul 2 2007, looks like: where the colour coding goes from the maximal values (most reddish) down to the minimal values (most blueish). The diagonal of the covariance matrix simply tells us that for normalised time-series, their covariances are equal to the standard deviations (variances) of 1 as expected. Going one step forward, based on the given covariance matrix, we look for the matrix of$\boldsymbol{P}$whose columns are the corresponding eigenvectors: % Find P [P,~]=eigs(CovX,5); imagesc(P); set(gca,'xticklabel',{1,2,3,4,5},'xtick',[1 2 3 4 5]); xlabel('Principal Component') ylabel('Stock'); set(gca,'yticklabel',{'AAPL', 'ALTR', 'AMAT', 'AMGN', 'CERN', ... 'CHKP', 'COST', 'CSCO', 'DELL', 'FAST', 'INTC', 'MSFT', 'MU', ... 'MYL', 'PCAR', 'SIAL', 'SNDK', 'SYMC', 'WFM', 'XRAY', 'YHOO'}, ... 'ytick',[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21]); which results in$\boldsymbol{P}$displayed as: where we computed PCA for five principal components in order to illustrate the process. Since the colour coding is the same as in the previous figure, a visual inspection of of the 1-PC indicates on negative numbers for at least 16 out of 21 eigenvalues. That simply means that over last 90 days the global dynamics for those stocks were directed south, in favour of traders holding short-position in those stocks. It is important to note in this very moment that 1-PC does not represent the ‘price momentum’ itself. It would be too easy. It represents the latent variable responsible for a common behaviour in the stock dynamics whatever it is. Based on our model assumption (see above) we suspect it may indicate a human factor latent in the trading. 3. Game of Nerves The last figure communicates an additional message. There is a remarkable coherence of eigenvalues for 1-PC and pretty random patterns for the remaining four principal components. One may check that in the case of our data sample, this feature is maintained over many years. That allows us to limit our interest to 1-PC only. It’s getting exciting, isn’t it? Let’s come back to our main code. Having now a pretty good grasp of the algebra of PCA at work, we may limit our investigation of 1-PC to any time period of our interest, below spanned between as defined by$t1$and$t2$variables: 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 % Select time period of your interest t1=datenum('July 1 2006'); t2=datenum('July 1 2010'); results=[]; for t=t1:t2 tmp=[]; A=[]; V=[]; for i=1:m [r,c,v]=find((data{i}(:,1)<=t) & (data{i}(:,1)>t-60)); A=[A data{i}(r,2)]; end [N,M]=size(A); X=(A-repmat(mean(A),[N 1]))./repmat(std(A),[N 1]); CovX=cov(X); [V,D]=eigs(CovX,1); % Find all negative eigenvalues of the 1st Principal Component [r,c,v]=find(V(:,1)<0); % Extract them into a new vector neg1PC=V(r,1); % Calculate a percentage of negative eigenvalues relative % to all values available ratio=length(neg1PC)/m; % Build a new time-series of 'ratio' change over required % time period (spanned between t1 and t2) results=[results; t ratio]; end We build our anxiety detection model based on the change of number of eigenvalues of the 1st Principal Component (relative to the total their numbers; here equal 21). As a result, we generate a new time-series tracing over$[t1;t2]$time period this variable. We plot the results all in one plot contrasted with the NASDAQ-100 Index in the following way: 75 76 77 78 79 80 81 82 83 84 85 % Fetch NASDAQ-100 Index from Yahoo! data-server nasdaq = fetch(yahoo,'^ndx','Adj Close',t1,t2,'d'); % Plot it subplot(2,1,1) plot(nasdaq(:,1),nasdaq(:,2),'color',[0.6 0.6 0.6]); ylabel('NASDAQ-100 Index'); % Add a plot corresponding to a new time-series we've generated subplot(2,1,2) plot(results(:,1),results(:,2),'color',[0.6 0.6 0.6]) % add overplot 30d moving average based on the same data hold on; plot(results(:,1),moving(results(:,2),30),'b') leading us to: I use 30-day moving average (a solid blue line) in order to smooth the results (moving.m). Please note, that line in #56 I also replaced the earlier value of 90 days with 60 days. Somehow, it is more reasonable to examine with the PCA the market dynamics over past two months than for longer periods (but it’s a matter of taste and needs). Eventually, we construct the core model’s element, namely, we detect nervousness among traders when the percentage of negative eigenvalues of the 1st Principal Component increases over (at least) five consecutive days: 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 % Model Core x1=results(:,1); y1=moving(results(:,2),30); tmp=[]; % Find moments of time where the percetage of negative 1-PC % eigenvalues increases over time (minimal requirement of % five consecutive days for i=5:length(x1) if(y1(i)>y1(i-1))&&(y1(i-1)>y1(i-2))&&(y1(i-2)>y1(i-3))&& ... (y1(i-3)>y1(i-4))&&(y1(i-4)>y1(i-5)) tmp=[tmp; x1(i)]; end end % When found z=[]; for i=1:length(tmp) for j=1:length(nasdaq) if(tmp(i)==nasdaq(j,1)) z=[z; nasdaq(j,1) nasdaq(j,2)]; end end end subplot(2,1,1); hold on; plot(z(:,1),z(:,2),'r.','markersize',7); The results of the model we over-plot with red markers on top of the NASDAQ-100 Index: Our simple model takes us into a completely new territory of unexplored space of latent variables. Firstly, it does not predict the future. It still (unfortunately) remains unknown. However, what it delivers is a fresh look at the past dynamics in the market. Secondly, it is easily to read out from the plot that results cluster into three subgroups. The first subgroup corresponds to actions in the stock trading having further negative consequences (see the events of 2007-2009 and the avalanche of prices). Here the dynamics over any 60 calendar days had been continued. The second subgroup are those periods of time when anxiety led to negative dynamics among stock traders but due to other factors (e.g. financial, global, political, etc.) the stocks surged dragging the Index up. The third subgroup (less frequent) corresponds to instances of relative flat changes of Index revealing a typical pattern of psychological hesitation about the trading direction. No matter how we might interpret the results, the human factor in trading is evident. Hopefully, the PCA approach captures it. If not, all we are left with is our best friend: a trader’s intuition. Acknowledgments An article dedicated to Dr. Dariusz Grech of Physics and Astronomy Department of University of Wroclaw, Poland, for his superbly important! and mind-blowing lectures on linear algebra in the 1998/99 academic year. ## Yahoo! Stock Data in Matlab and a Model for Dividend Backtesting Within the evolution of Mathworks’ MATLAB programming environment, finally, in the most recent version labelled 2013a we received a longly awaited line-command facilitation for pulling stock data directly from the Yahoo! servers. What does that mean for quants and algo traders? Honestly, a lot. Now, simply writing a few commands we can have nearly all what we want. However, please keep in mind that Yahoo! data are free therefore not always in one hundred percent their precision remains at the level of the same quality as, e.g. downloaded from Bloomberg resources. Anyway, just for pure backtesting of your models, this step introduces a big leap in dealing with daily stock data. As usual, we have a possibility of getting open, high, low, close, adjusted close prices of stocks supplemented with traded volume and the dates plus values of dividends. In this post I present a short example how one can retrieve the data of SPY (tracking the performance of S&P500 index) using Yahoo! data in a new Matlab 2013a and I show a simple code how one can test the time period of buying-holding-and-selling SPY (or any other stock paying dividends) to make a profit every time. The beauty of Yahoo! new feature in Matlab 2013a has been fully described in the official article of Request data from Yahoo! data servers where you can find all details required to build the code into your Matlab programs. Model for Dividends It is a well known opinion (based on many years of market observations) that one may expect the drop of stock price within a short timeframe (e.g. a few days) after the day when the stock’s dividends have been announced. And probably every quant, sooner or later, is tempted to verify that hypothesis. It’s your homework. However, today, let’s look at a bit differently defined problem based on the omni-working reversed rule: what goes down, must go up. Let’s consider an exchange traded fund of SPDR S&P 500 ETF Trust labelled in NYSE as SPY. First, let’s pull out the Yahoo! data of adjusted Close prices of SPY from Jan 1, 2009 up to Aug 27, 2013 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 % Yahoo! Stock Data in Matlab and a Model for Dividend Backtesting % (c) 2013 QuantAtRisk.com, by Pawel Lachowicz close all; clear all; clc; date_from=datenum('Jan 1 2009'); date_to=datenum('Aug 27 2013'); stock='SPY'; adjClose = fetch(yahoo,stock,'adj close',date_from,date_to); div = fetch(yahoo,stock,date_from,date_to,'v') returns=(adjClose(2:end,2)./adjClose(1:end-1,2)-1); % plot adjusted Close price of and mark days when dividends % have been announced plot(adjClose(:,1),adjClose(:,2),'color',[0.6 0.6 0.6]) hold on; plot(div(:,1),min(adjClose(:,2))+10,'ob'); ylabel('SPY (US$)'); xlabel('Jan 1 2009 to Aug 27 2013');

and visualize them:

Having the data ready for backtesting, let’s look for the most profitable period of time of buying-holding-and-selling SPY assuming that we buy SPY one day after the dividends have been announced (at the market price), and we hold for $dt$ days (here, tested to be between 1 and 40 trading days).

23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 % find the most profitable period of holding SPY (long position) neg=[]; for dt=1:40   buy=[]; sell=[]; for i=1:size(div,1) % find the dates when the dividends have been announced [r,c,v]=find(adjClose(:,1)==div(i,1)); % mark the corresponding SPY price with blue circle marker hold on; plot(adjClose(r,1),adjClose(r,2),'ob'); % assume you buy long SPY next day at the market price (close price) buy=[buy; adjClose(r-1,1) adjClose(r-1,2)]; % assume you sell SPY in 'dt' days after you bought SPY at the market % price (close price) sell=[sell; adjClose(r-1-dt,1) adjClose(r-1-dt,2)]; end   % calculate profit-and-loss of each trade (excluding transaction costs) PnL=sell(:,2)./buy(:,2)-1; % summarize the results neg=[neg; dt sum(PnL<0) sum(PnL<0)/length(PnL)];   end

If we now sort the results according to the percentage of negative returns (column 3 of neg matrix), we will be able to get:

>> sortrows(neg,3)   ans = 18.0000 2.0000 0.1111 17.0000 3.0000 0.1667 19.0000 3.0000 0.1667 24.0000 3.0000 0.1667 9.0000 4.0000 0.2222 14.0000 4.0000 0.2222 20.0000 4.0000 0.2222 21.0000 4.0000 0.2222 23.0000 4.0000 0.2222 25.0000 4.0000 0.2222 28.0000 4.0000 0.2222 29.0000 4.0000 0.2222 13.0000 5.0000 0.2778 15.0000 5.0000 0.2778 16.0000 5.0000 0.2778 22.0000 5.0000 0.2778 27.0000 5.0000 0.2778 30.0000 5.0000 0.2778 31.0000 5.0000 0.2778 33.0000 5.0000 0.2778 34.0000 5.0000 0.2778 35.0000 5.0000 0.2778 36.0000 5.0000 0.2778 6.0000 6.0000 0.3333 8.0000 6.0000 0.3333 10.0000 6.0000 0.3333 11.0000 6.0000 0.3333 12.0000 6.0000 0.3333 26.0000 6.0000 0.3333 32.0000 6.0000 0.3333 37.0000 6.0000 0.3333 38.0000 6.0000 0.3333 39.0000 6.0000 0.3333 40.0000 6.0000 0.3333 5.0000 7.0000 0.3889 7.0000 7.0000 0.3889 1.0000 9.0000 0.5000 2.0000 9.0000 0.5000 3.0000 9.0000 0.5000 4.0000 9.0000 0.5000

what simply indicates at the most optimal period of holding the long position in SPY equal 18 days. We can mark all trades (18 day holding period) in the chart:

where the trade open and close prices (according to our model described above) have been marked in the plot by black and red circle markers, respectively. Only 2 out of 18 trades (PnL matrix) occurred to be negative with the loss of 2.63% and 4.26%. The complete distribution of profit and losses from all trades can be obtained in the following way:

47 48 49 50 figure(2); hist(PnL*100,length(PnL)) ylabel('Number of trades') xlabel('Return (%)')

returning

Let’s make some money!

The above Matlab code delivers a simple application of the newest build-in connectivity with Yahoo! server and the ability to download the stock data of our interest. We have tested the optimal holding period for SPY since the beginning of 2009 till now (global uptrend). The same code can be easily used and/or modified for verification of any period and any stock for which the dividends had been released in the past. Fairly simple approach, though not too frequent in trading, provides us with some extra idea how we can beat the market assuming that the future is going to be/remain more or less the same as the past. So, let’s make some money!

## Trend Identification for FX Traders (Part 2)

In my previous post I provided an introduction to the trading model invention and design. We made use of FX data of AUDUSD pair sampled hourly and splitting data into weekly time-series.

Today, we will work a bit harder over formulation of the very first rules for the model. This step will require an engagement of our creativity in understanding what data are like as well as how we can make a basic set of rules which would help us to perform an attractive time-series classification. Our objective is to invent a method which will be helpful in classification of last week FX pair’s time-series to be either in the downtrend or in the uptrend.

The most naive way of classification of directional information contained in any time-series is its slope: a straight line fit to the data. Let’s use it as our starting point. Instead of fitting all data points for a given week, we find median values for the first and the last 12 data points both in Time $(x1,x2)$ and Pair Ratio $(y1,y2)$ as specified in lines 92 to 94:

65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 % FX time-series analysis % (c) Quant at Risk, 2012 % % Part 2: Classification of weekly time-series   close all; scrsz = get(0,'ScreenSize'); h=figure('Position',[70 scrsz(4)/2 scrsz(3)/1.1 scrsz(4)/2],'Toolbar','none'); fprintf('\nuptrend/downtrend identification.. '); % for viewing use the loop hold off; set(0,'CurrentFigure',h);   % pre-define variables trend=zeros(nw,1); slope=zeros(nw,1); midp={}; % middle points endp={}; % end points (median based on last 12 points)   for i=1:nw %--- a loop over total number of weeks available   % reading time-series for a current week w=week{i}; x=w(:,1); y=w(:,2);   % plot the time-series hold on; plot(x,y,'k');   % linear trend estimation x1=median(x(1:12)); x2=median(x(end-11:end)); y1=median(y(1:12)); y2=median(y(end-11:end));   % define end-point of the time-series and mark it on the plot endp{i}=[x2 y2]; hold on; plot(endp{i}(1),endp{i}(2),'b*');   % find slope m=(y2-y1)/(x2-x1); slope(i)=m; xl=x1:dt:x2; yl=m*xl-m*x2+y2; % a line of representing the slope hold on; plot(xl,yl,'b:');   % find middle point of the line and mark it on the plot mx=mean(xl); my=mean(yl); midp{i}=[mx my]; hold on; plot(midp{i}(1),midp{i}(2),'bo');

As an example of the code execution, for the first two weeks we plot slopes, mid-points and end-points:

We assume that our classification procedure will be based solely on the information provided for end-points and slopes. The exception we make for the classification of the first two weeks. For the first week the distinction between uptrend and downtrend is made based on the position of the first and last point:

113 114 115 116 117 118 119 120 121 122 123 124 % Time-Series Classification   if(i==1) ybeg=y(1); yend=y(end); if(ybeg<yend) trend(i)=+1; % uptrend hold on; plot(x,y,'k'); else trend(i)=-1; % downtrend hold on; plot(x,y,'r'); end end

where we mark the result of classification with a sign $+1$ or $-1$ in a vector element of $trend$, and plot it with black and red color denoting uptrend and downtrend, respectively.

For the second week, our rules of classification are enriched by additional information about the end-point of a current and a previous week:

126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 if(i==2) % week(current-1) tmp=week{i-1}; x1=tmp(:,1); y1=tmp(:,2); y1b=y1(1); y1e=y1(end); % week(current) y0b=y(1); y0e=y(end); if(y0e>y1e) trend(i)=+1; % uptrend hold on; plot(x,y,'k'); else trend(i)=-1; % downtrend hold on; plot(x,y,'r'); end end

For weeks number 3 and higher we do our creative research over the data to define a specific set of rules. We allow to take into account the information from two weeks prior to the current one and combine them all together. The following code represents an attractive solution, subject to improvement:

142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 if(i>2) % week(current-2) mid2=midp{i-2}(2); end2=endp{i-2}(2); slp2=slope(i-2); % week(current-1) mid1=midp{i-1}(2); end1=endp{i-1}(2); slp1=slope(i-1); % week(current) mid0=midp{i}(2); end0=endp{i}(2); slp0=slope(i); if((mid0>mid1)) % up-trend if((mid0>mid2)&&(end0>end1)) trend(i)=+1; hold on; plot(x,y,'k'); % strong up-trend elseif((mid0>mid2)&&(end0>end2)) trend(i)=+1; hold on; plot(x,y,'k'); % weak up-trend elseif((mid0<mid2)&&(end0<end2)&&(slp0<0)) trend(i)=-1; hold on; plot(x,y,'r'); % turns into possible down-trend elseif((mid0<mid2)&&(end0<end2)&&(slp0>0)) trend(i)=+1; hold on; plot(x,y,'k'); % turns into possible up-trend else trend(i)=+1; hold on; plot(x,y,'k'); % turns into possible up-trend end elseif(mid0<mid1) % down-trend if((mid0<mid2)&&(end0<end1)&&(end0<end2)) trend(i)=-1; hold on; plot(x,y,'r'); % weak down-trend elseif((mid0<mid2)&&(end0<end2)&&(end0>end1)) trend(i)=+1; hold on; plot(x,y,'k'); % possible up-trend elseif((mid0<mid2)&&(end0>end2)) trend(i)=+1; hold on; plot(x,y,'k'); % turns into possible up-trend elseif((mid0>mid2)&&(end0<end1)&&(end0<end2)) trend(i)=-1; hold on; plot(x,y,'r'); elseif((mid0>mid2)&&(end0>end2)) trend(i)=+1; hold on; plot(x,y,'k'); % turns into possible up-trend elseif((mid0>mid2)&&(end0>end1)) trend(i)=+1; hold on; plot(x,y,'k'); % turns into possible up-trend else trend(i)=-1; hold on; plot(x,y,'r'); end end end end

Since one picture is worth millions of lines of code, below we present three examples of our model in action. The last plot corresponds to the latest Global Financial Crisis and shows how weeks in uptrends of 2009 followed these in downtrend a year before.

It is straightforward to note that the performance of our rules works very intuitively and stays in general agreement with the market sentiments.

## Trend Identification for FX Traders

When you think about an invention of a new model for algorithmic trading, there are only three key elements you need to start your work with: creativity, data, and programming tool. Assuming that the last two are already in your possession, all what remains is seeking and finding a great new idea! With no offense, that’s the hardest part of the game.

To be successful in discovering new trading solutions you have to be completely open-minded, relaxed and full of spatial orientation with the information pertaining to your topic. Personally, after many years of programming and playing with the digital signal processing techniques, I have discovered that the most essential aspect of well grounded research is data itself. The more, literally, I starred at time-series changing their properties, the more I was able to capture subtle differences, often overlooked by myself before, and with the aid of intuition and scientific experience some new ideas simply popped up.

Here I would like to share with you a part of this process.

In Extracting Time-Series from Tick-Data article I outlined one of many possible ways of the FX time-series extraction from the very fine data sets. As a final product we have obtained two files, namely:

audusd.bid.1h audusd.ask.1h

corresponding to Bid and Ask prices for Forex AUDUSD pair’s trading history between Jan 2000 and May 2010. Each file contained two columns of numbers: Time (Modified Julian Day) and Price. The time resolution has been selected to be 1 hour.

FOREX trading lasts from Monday to Friday, continuously for 24 hours. Therefore the data contain regular gaps corresponding to weekends. As the data coverage is more abundant comparing to, for example, much shorter trading windows of equities or ETFs around the world, that provides us with a better understanding of trading directions within every week time frame. Keeping that in mind, we might be interested in looking at directional information conveyed by the data as a seed of a potential new FX model.

As for now, let’s solely focus on initial pre-processing of Bid and Ask time-series and splitting each week into a common cell array.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 % FX time-series analysis % (c) Quant at Risk, 2012 % % Part 1: Separation of the weeks   close all; clear all; clc;   % --analyzed FX pair pair=['audusd'];   % --data n=['./',pair,'/',pair]; % a common path to files na=[n,'.ask.1h']; nb=[n,'.bid.1h']; d1=load(na); d2=load(na); % loading data d=(d1+d2)/2; % blending clear d1 d2

For a sake of simplicity, in line 16, we decided to use a simple average of Bid and Ask 1-hour prices for our further research. Next, we create a weekly template, $x$, for our data classification, and we find the total number of weeks available for analysis:

19 20 21 22 23 24 25 26 27 28 29 30 31 % time constraints from the data t0=min(d(:,1)); tN=max(d(:,1)); t1=t0-1;   % weekly template for data classification x=t1:7:tN+7;   % total number of weeks nw=length(x)-1;   fprintf(upper(pair)); fprintf(' time-series: %3.0f weeks (%5.2f yrs)\n',nw,nw/52);

what in our case returns a positive information:

AUDUSD time-series: 539 weeks (10.37 yrs)

The core of programming exercise is to split all 539 weeks and save them into a cell array of $week$. As we will see in the code section below, for some reasons we may want to assure ourselves that each week will contain the same number of points, therefore any missing data from our FX data provider will be interpolated. To do that efficiently, we use the following function which makes use of Piecewise Cubic Hermite Interpolating Polynomial interpolation for filling gapped data point in the series:

function [x2,y2]=gapinterpol(x,y,dt); % specify axis x_min=x(1); x_max=x(length(x)); x2=(x_min:dt:x_max); % inperpolate gaps y2=pchip(x,y,x2); end

The separation of weeks we realize in our program by:

33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 week={}; % an empty cell array avdt=[];   for i=1:nw % split FX signal according to week [r,c,v]=find(d(:,1)>x(i) & d(:,1)<x(i+1)); x1=d(r,1); y1=d(r,2); % interpolate gaps, use 1-hour bins dt=1/24; [x2,y2]=gapinterpol(x1,y1,dt); % check the average sampling time, should equal to dt s=0; for j=1:length(x2)-1 s=s+(x2(j+1)-x2(j)); end tmp=s/(length(x2)-1); avdt=[avdt; tmp]; % store the week signal in a cell array tmp=[x2; y2]; tmp=tmp'; week{i}=tmp; end fprintf('average sampling after interpolation = %10.7f [d]\n',max(avdt));

where as a check-up we get:

average sampling after interpolation = 0.0416667 [d]

what corresponds to the expected value of $1/24$ day with a sufficient approximation.

A quick visual verification of our signal processing,

54 55 56 57 58 59 60 61 62 63 scrsz = get(0,'ScreenSize'); h=figure('Position',[70 scrsz(4)/2 scrsz(3)/1.1 scrsz(4)/2],'Toolbar','none'); hold off; for i=1:nw w=week{i}; x=w(:,1); y=w(:,2); % plot weekly signal hold on; plot(x,y,'k'); end xlim([0 100]);

uncovers our desired result: