Quantitative Analysis, Risk Management, Modelling, Algo Trading, and Big Data Analysis

## Rebinning Tick-Data for FX Algo Traders

If you work or intend to work with FX data in order to build and backtest your own FX models, the Historical Tick-Data of Pepperstone.com is probably the best place to kick off your algorithmic experience. As for now, they offer tick-data sets of 15 most frequently traded currency pairs since May 2009. Some of the unzip’ed files (one month data) reach over 400 MB in size, i.e. storing 8.5+ millions of lines with a tick resolution for both bid and ask “prices”. A good thing is you can download them all free of charge and their quality is regarded as very high. A bad thing is there is 3 month delay in data accessibility.

Dealing with a rebinning process of tick-data up, that’s a different story and the subject of this post. We will see how efficiently you can turn Pepperstone’s Tick-Data set(s) into 5-min time-series as an example. We will make use of scripting in bash (Linux/OS X) supplemented with data processing in Python.

Data Structure

You can download Pepperstone’s historical tick-data from here, month by month, pair by pair. Their inner structure follows the same pattern, namely:

 $head AUDUSD-2014-09.csv AUD/USD,20140901 00:00:01.323,0.93289,0.93297 AUD/USD,20140901 00:00:02.138,0.9329,0.93297 AUD/USD,20140901 00:00:02.156,0.9329,0.93298 AUD/USD,20140901 00:00:02.264,0.9329,0.93297 AUD/USD,20140901 00:00:02.265,0.9329,0.93293 AUD/USD,20140901 00:00:02.265,0.93289,0.93293 AUD/USD,20140901 00:00:02.268,0.93289,0.93295 AUD/USD,20140901 00:00:02.277,0.93289,0.93296 AUD/USD,20140901 00:00:02.278,0.9329,0.93296 AUD/USD,20140901 00:00:02.297,0.93288,0.93296 The columns, from left to right, represent respectively: a pair name, the date and tick-time, the bid price, and the ask price. Pre-Processing Here, for each .csv file, we aim to split the date into year, month, and day separately, and remove commas and colons to get raw data ready to be read in as a matrix (array) using any other programming language (e.g. Matlab or Python). The matrix is mathematically intuitive data structure therefore making direct reference to any specific column of it makes any backtesting engine running with its full thrust. Let’s play with AUDUSD-2014-09.csv data file. Working in the same directory where the file is located we begin with writing a bash script (pp.scr) that contains:  1 2 3 4 5 6 7 8 9 10 11 # pp.scr # Rebinning Pepperstone.com Tick-Data for FX Algo Traders # (c) 2014 QuantAtRisk, by Pawel Lachowicz clear echo "..making a sorted list of .csv files" for i in$1-*.csv; do echo ${i##$1-} $i${i##.csv}; done | sort -n | awk '{print $2}' >$1.lst   python pp.py head AUDUSD.pp

that you run in Terminal:

 $chmod +x pp.scr$ ./pp.scr AUDUSD

where the first command makes sure the script becomes executable (you need to perform this task only once). Lines #7-8 of our script, in fact, look for all .csv data files in the local directory starting with AUDUSD- prefix and create their list in AUDUSD.lst file. Since we work with AUDUSD-2014-09.csv file only, the AUDUSD.lst file will contain:

 $cat AUDUSD.lst AUDUSD-2014-09.csv as expected. Next, we utilise the power and flexibility of Python in the following way:  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 # pp.py import csv fnlst="AUDUSD.lst" fnout="AUDUSD.pp" for lstline in open(fnlst,'r').readlines(): fncur=lstline[:-1] #print(fncur) with open(fnout,'w') as f: writer=csv.writer(f,delimiter=" ") i=1 # counts a number of lines with tick-data for line in open(fncur,'r').readlines(): if(i<=5200): # replace with (i>0) to process an entire file #print(line) year=line[8:12] month=line[12:14] day=line[14:16] hh=line[17:19] mm=line[20:22] ss=line[23:29] bidask=line[30:] writer.writerow([year,month,day,hh,mm,ss,bidask]) i+=1 It is a pretty efficient way to open really a big file and process its information line by line. Just for further purpose of display, in the code we told computer to process only first 5,200 of lines. The output of lines #10-11 of pp.scr is the following:  2014 09 01 00 00 01.323 "0.93289,0.93297 " 2014 09 01 00 00 02.138 "0.9329,0.93297 " 2014 09 01 00 00 02.156 "0.9329,0.93298 " 2014 09 01 00 00 02.264 "0.9329,0.93297 " 2014 09 01 00 00 02.265 "0.9329,0.93293 " since we allowed Python to save bid and ask information as one string (due to a variable number of decimal digits). In order to clean this mess we continue:  13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 # pp.scr (continued) echo "..removing token: comma" sed 's/,/ /g' AUDUSD.pp >$1.tmp rm AUDUSD.pp   echo "..removing token: double quotes" sed 's/"/ /g' $1.tmp >$1.tmp2 rm $1.tmp echo "..removing empty lines" sed -i '/^[[:space:]]*$/d' $1.tmp2 mv$1.tmp2 AUDUSD.pp   echo "head..." head AUDUSD.pp echo "tail..." tail AUDUSD.pp

what brings us to pre-processed data:

 ..removing token: comma ..removing token: double quotes ..removing empty lines head... 2014 09 01 00 00 01.323 0.93289 0.93297 2014 09 01 00 00 02.138 0.9329 0.93297 2014 09 01 00 00 02.156 0.9329 0.93298 2014 09 01 00 00 02.264 0.9329 0.93297 2014 09 01 00 00 02.265 0.9329 0.93293 2014 09 01 00 00 02.265 0.93289 0.93293 2014 09 01 00 00 02.268 0.93289 0.93295 2014 09 01 00 00 02.277 0.93289 0.93296 2014 09 01 00 00 02.278 0.9329 0.93296 2014 09 01 00 00 02.297 0.93288 0.93296 tail... 2014 09 02 00 54 39.324 0.93317 0.93321 2014 09 02 00 54 39.533 0.93319 0.93321 2014 09 02 00 54 39.543 0.93318 0.93321 2014 09 02 00 54 39.559 0.93321 0.93321 2014 09 02 00 54 39.784 0.9332 0.93321 2014 09 02 00 54 39.798 0.93319 0.93321 2014 09 02 00 54 39.885 0.93319 0.93325 2014 09 02 00 54 39.886 0.93319 0.93321 2014 09 02 00 54 40.802 0.9332 0.93321 2014 09 02 00 54 48.829 0.93319 0.93321

Personally, I love that part as you can learn how to do simple but necessary text file operations by typing single lines of Unix/Linux commands. Good luck for those who try to repeat the same in Microsoft Windows not spending more than 30 sec for doing it.

Rebinning: 5-min Data

The rebinning has many schools. It’s the art for some people. We just want to have the job done. I opt for simplicity and understanding of the data we deal with. Imagine we have two adjacent 5 min bins with a tick history of trading:

We want to derive the closest possible (or most fair) price estimation every 5 min, denoted in the above painting by a red marker. The old-school approach is to take the average over a number (larger than 5) of tick data points from the left and from the right. That creates the under- or overestimation of the mid-price.

If we trade live, every 5 min we receive an information on the last tick point before the minute hits 5 and we wait for the next tick point after 5 (blue markers). Taking the average of their prices (mid-price) makes most of sense. The precision we look at here is sometimes $10^{-5}$. It is not much of significance if our position is small, but if it is not, the mid-price may start playing a crucial role.

The cons of the old-school approach: a possible high volatility among all tick-data within last 5 minutes that we neglect.

The following Python code (pp2.py) performs 5-min rebinning for our pre-processed AUDUSD-2014-09 file:

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 # pp2.py import csv import numpy as np   def convert(data): tempDATA = [] for i in data: tempDATA.append([float(j) for j in i.split()]) return np.array(tempDATA).T   fname="AUDUSD.pp"   with open(fname) as f: data = f.read().splitlines()   #print(data)   i=1 for d in data: list=[s for s in d.split(' ')] #print(list) # remover empty element in the list dd=[x for x in list if x] #print(dd) tmp=convert(dd) #print(tmp) if(i==1): a=tmp i+=1 else: a = np.vstack([a, tmp]) i+=1   N=i-1 #print("N = %d" % N)   # print the first line tmp=np.array([a[1][0],a[1][1],a[1][2],a[1][3],a[1][4],0.0,(a[1][6]+a[1][7])/2]) print("%.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f" % (tmp[0],tmp[1],tmp[2],tmp[3],tmp[4],tmp[5],tmp[6])) m=tmp   # check the boundary conditions (5 min bins) for i in xrange(2,N-1): if( (a[i-1][4]%5!=0.0) and (a[i][4]%5==0.0)):   # BLUE MARKER No. 1 # (print for i-1) #print(" %.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f %10.6f" % # (a[i-1][0],a[i-1][1],a[i-1][2],a[i-1][3],a[i-1][4],a[i-1][5],a[i-1][6],a[i-1][7])) b1=a[i-1][6] b2=a[i][6] a1=a[i-1][7] a2=a[i][7] # mid-price, and new date for 5 min bin bm=(b1+b2)/2 am=(a1+a2)/2 Ym=a[i][0] Mm=a[i][1] Dm=a[i][2] Hm=a[i][3] MMm=a[i][4] Sm=0.0 # set seconds to zero   # RED MARKER print("%.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f" % (Ym,Mm,Dm,Hm,MMm,Sm,(bm+am)/2)) tmp=np.array([Ym,Mm,Dm,Hm,MMm,Sm,(bm+am)/2]) m=np.vstack([m, tmp])   # BLUE MARKER No. 2 # (print for i) #print(" %.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f %10.6f" % # (a[i][0],a[i][1],a[i][2],a[i][3],a[i][4],a[i][5],a[i][6],a[i][7]))

what you run in pp.scr file as:

 31 32 33 # pp.scr (continued)   python pp2.py > AUDUSD.dat

in order to get 5-min rebinned FX time-series as follows:

 $head AUDUSD.dat 2014 9 1 0 0 0.000 0.932935 2014 9 1 0 5 0.000 0.933023 2014 9 1 0 10 0.000 0.932917 2014 9 1 0 15 0.000 0.932928 2014 9 1 0 20 0.000 0.932937 2014 9 1 0 25 0.000 0.933037 2014 9 1 0 30 0.000 0.933075 2014 9 1 0 35 0.000 0.933070 2014 9 1 0 40 0.000 0.933092 2014 9 1 0 45 0.000 0.933063 That concludes our efforts. Happy rebinning! ## Ideal Stock Trading Model for the Purpose of Backtesting Only There is only one goal in algorithmic trading: to win the game. To emerge victorious. To feel the sunlight again after the months of walking through the valley of shadows being stuck in the depths of algo drawdowns. An endless quest for the most brilliant minds, to find the way to win over and over, and over again. To discover a perfect solution as your new private weapon in this battleground. Is it possible? If the answer were so simple, we wouldn’t be so engaged in seeking for a golden mean. However, algo trading is a journey, and sometimes in the process of programming and testing of our trading systems, we need to have an ideal trading model ready-to-use straight away! What I mean by the ideal model is a sort of template, a benchmark that will allow us to list a number of successful trades, their starting and closing times, open and close price of the trades being executed, and the realized return coming down to our pocket from each trade. Such a trading model template also allows us to look at the trading data from a different perspective and re-consider and/or apply an additional risk management framework. In fact, the benefits are limitless for the backtesting purposes. In this post we will explore one of the easiest ways in programming a perfect model by re-directing the time-axis backwards. Using an example of the data of a Google, Inc. (GOOG) stock listed at NASDAQ, we will analyse the stock trading history and find all possible trades returning at least 3% over past decade. The results of this strategy I will use within the upcoming (this week!) series of new posts on Portfolio Optimization in Matlab for Algo Traders. Model and Data Let’s imagine we are interested in finding a large number of trades with the expected return from each trade to be at least 3%. We consider GOOG stock (daily close prices) with data spanning 365$\times$10 days back in time since the present day (last 10 years). We will make use of Google Finance data powered by Quandl as described in one of my previous posts, namely, Create a Portfolio of Stocks based on Google Finance Data fed by Quandl. Shall we begin?  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 % Ideal Stock Trading Model for the Purpose of Backtesting Only % % (c) 2013 QuantAtRisk.com, by Pawel Lachowicz clear all; close all; clc; stocklist={'GOOG'}; % Read in the list of tickers and internal code from Quandl.com [ndata, text, alldata] = xlsread('QuandlStockCodeListUS.xlsx'); quandlc=text(:,1); % again, as a list in a cell array quandlcode=text(:,3) % corresponding Quandl's Price Code % fetch stock data for last 10 years date2=datestr(today,'yyyy-mm-dd') % from date1=datestr(today-365*10,'yyyy-mm-dd') % to stockd={}; for i=1:length(stocklist) for j=1:length(quandlc) if(strcmp(stocklist{i},quandlc{j})) fprintf('%4.0f %s\n',i,quandlc{j}); % fetch the data of the stock from Quandl % using recommanded Quandl's command and % saving them directly into FTS object (fts) fts=0; [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ... 'authcode','ENTER_YOUR_CODE',... 'start_date',date1,'end_date',date2); stockd{i}=fts; % entire FTS object in an array's cell end end end The extracted data of GOOG from Google Finance via Quandl we can visualize immediately as follows:  36 37 38 39 40 41 42 % plot the close prices of GOOG cp=fts2mat(stockd{1}.Close,1); plot(cp(:,1),cp(:,2),'color',[0.6 0.6 0.6]) xlim([min(cp(:,1)) max(cp(:,1))]); ylim([min(cp(:,2)) max(cp(:,2))]); xlabel('Nov 2003 - Nov 2013 (days)'); ylabel('GOOG Close Price ($)');

In this piece of code, in the variable matrix of trades (a log-book of all exercised trades) we store the history of all successful trades meeting our earlier assumed criteria. The only uncertainty that we allow to slip into our perfect solution is the one related to an instance when the the close price on the next business day occurs to be lower, generating the realized profit from the trade less than 3%. By plotting all good trades with the ending day of $tN$ set as for Nov 18, 2013, we get a messy picture:

which translates into more intuitive one once we examine the distribution of profits from all trades:

 figure(3); hist(trades(:,5),50); xlabel('Profit/loss (%)'); ylabel('Number of trades');

In this point the most valuable information is contained in the log-book which content we can analyze trade by trade:

 >> format shortg >> trades   trades =   7.3218e+05 7.3218e+05 100.34 109.4 9.0293 7.3218e+05 7.3220e+05 104.87 112 6.7989 7.3221e+05 7.3221e+05 113.97 119.36 4.7293 7.3221e+05 7.3222e+05 117.84 131.08 11.236 7.3222e+05 7.3222e+05 129.6 138.37 6.767 7.3223e+05 7.3224e+05 137.08 144.11 5.1284 7.3224e+05 7.3224e+05 140.49 172.43 22.735 7.3224e+05 7.3225e+05 187.4 190.64 1.7289 ... 7.3533e+05 7.3535e+05 783.05 813.45 3.8823 7.3535e+05 7.3536e+05 809.1 861.55 6.4825 7.3536e+05 7.3537e+05 857.23 915.89 6.843 7.3546e+05 7.3549e+05 856.91 888.67 3.7063 7.3549e+05 7.3553e+05 896.19 1003.3 11.952

where the columns correspond to the open and close time of the trade (a continuous Matlab’s time measure for the financial time-series; see datestr command for getting yyyy-mm-dd date format), open and close price of GOOG stock, and realized profit/loss of the trade, respectively.

Questions? Discuss on Forum.

Just dive directly into Backtesting section on QaR Forum and keep up, never give up.

## Trend Identification for FX Traders

When you think about an invention of a new model for algorithmic trading, there are only three key elements you need to start your work with: creativity, data, and programming tool. Assuming that the last two are already in your possession, all what remains is seeking and finding a great new idea! With no offense, that’s the hardest part of the game.

To be successful in discovering new trading solutions you have to be completely open-minded, relaxed and full of spatial orientation with the information pertaining to your topic. Personally, after many years of programming and playing with the digital signal processing techniques, I have discovered that the most essential aspect of well grounded research is data itself. The more, literally, I starred at time-series changing their properties, the more I was able to capture subtle differences, often overlooked by myself before, and with the aid of intuition and scientific experience some new ideas simply popped up.

Here I would like to share with you a part of this process.

In Extracting Time-Series from Tick-Data article I outlined one of many possible ways of the FX time-series extraction from the very fine data sets. As a final product we have obtained two files, namely:

corresponding to Bid and Ask prices for Forex AUDUSD pair’s trading history between Jan 2000 and May 2010. Each file contained two columns of numbers: Time (Modified Julian Day) and Price. The time resolution has been selected to be 1 hour.

FOREX trading lasts from Monday to Friday, continuously for 24 hours. Therefore the data contain regular gaps corresponding to weekends. As the data coverage is more abundant comparing to, for example, much shorter trading windows of equities or ETFs around the world, that provides us with a better understanding of trading directions within every week time frame. Keeping that in mind, we might be interested in looking at directional information conveyed by the data as a seed of a potential new FX model.

As for now, let’s solely focus on initial pre-processing of Bid and Ask time-series and splitting each week into a common cell array.

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 % FX time-series analysis % (c) Quant at Risk, 2012 % % Part 1: Separation of the weeks   close all; clear all; clc;   % --analyzed FX pair pair=['audusd'];   % --data n=['./',pair,'/',pair]; % a common path to files na=[n,'.ask.1h']; nb=[n,'.bid.1h']; d1=load(na); d2=load(na); % loading data d=(d1+d2)/2; % blending clear d1 d2

For a sake of simplicity, in line 16, we decided to use a simple average of Bid and Ask 1-hour prices for our further research. Next, we create a weekly template, $x$, for our data classification, and we find the total number of weeks available for analysis:

 19 20 21 22 23 24 25 26 27 28 29 30 31 % time constraints from the data t0=min(d(:,1)); tN=max(d(:,1)); t1=t0-1;   % weekly template for data classification x=t1:7:tN+7;   % total number of weeks nw=length(x)-1;   fprintf(upper(pair)); fprintf(' time-series: %3.0f weeks (%5.2f yrs)\n',nw,nw/52);

what in our case returns a positive information:

 AUDUSD time-series: 539 weeks (10.37 yrs)

The core of programming exercise is to split all 539 weeks and save them into a cell array of $week$. As we will see in the code section below, for some reasons we may want to assure ourselves that each week will contain the same number of points, therefore any missing data from our FX data provider will be interpolated. To do that efficiently, we use the following function which makes use of Piecewise Cubic Hermite Interpolating Polynomial interpolation for filling gapped data point in the series:

 function [x2,y2]=gapinterpol(x,y,dt); % specify axis x_min=x(1); x_max=x(length(x)); x2=(x_min:dt:x_max); % inperpolate gaps y2=pchip(x,y,x2); end

The separation of weeks we realize in our program by:

 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 week={}; % an empty cell array avdt=[];   for i=1:nw % split FX signal according to week [r,c,v]=find(d(:,1)>x(i) & d(:,1)

where as a check-up we get:

 average sampling after interpolation = 0.0416667 [d]

what corresponds to the expected value of $1/24$ day with a sufficient approximation.

A quick visual verification of our signal processing,

 54 55 56 57 58 59 60 61 62 63 scrsz = get(0,'ScreenSize'); h=figure('Position',[70 scrsz(4)/2 scrsz(3)/1.1 scrsz(4)/2],'Toolbar','none'); hold off; for i=1:nw w=week{i}; x=w(:,1); y=w(:,2); % plot weekly signal hold on; plot(x,y,'k'); end xlim([0 100]);

uncovers our desired result: