
Rebinning Tick-Data for FX Algo Traders

If you work or intend to work with FX data in order to build and backtest your own FX models, the Historical Tick-Data of Pepperstone.com is probably the best place to kick off your algorithmic experience. As for now, they offer tick-data sets for the 15 most frequently traded currency pairs, reaching back to May 2009. Some of the unzipped files (one month of data) exceed 400 MB in size, i.e. they store 8.5+ million lines of tick-resolution bid and ask prices. The good news is that you can download them all free of charge and their quality is regarded as very high. The bad news is that there is a 3-month delay in data accessibility.

Rebinning that tick-data into a coarser time resolution is a different story, and it is the subject of this post. As an example, we will see how efficiently you can turn Pepperstone’s Tick-Data set(s) into 5-min time-series. We will make use of scripting in bash (Linux/OS X) supplemented with data processing in Python.

Data Structure

You can download Pepperstone’s historical tick-data from here, month by month, pair by pair. Their inner structure follows the same pattern, namely:

$ head AUDUSD-2014-09.csv 
AUD/USD,20140901 00:00:01.323,0.93289,0.93297
AUD/USD,20140901 00:00:02.138,0.9329,0.93297
AUD/USD,20140901 00:00:02.156,0.9329,0.93298
AUD/USD,20140901 00:00:02.264,0.9329,0.93297
AUD/USD,20140901 00:00:02.265,0.9329,0.93293
AUD/USD,20140901 00:00:02.265,0.93289,0.93293
AUD/USD,20140901 00:00:02.268,0.93289,0.93295
AUD/USD,20140901 00:00:02.277,0.93289,0.93296
AUD/USD,20140901 00:00:02.278,0.9329,0.93296
AUD/USD,20140901 00:00:02.297,0.93288,0.93296

The columns, from left to right, represent respectively: a pair name, the date and tick-time, the bid price, and the ask price.
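A quick sanity check in Python makes this layout explicit; a minimal sketch (file name as above) that parses a single tick line into its components:

# minimal sketch: parse one Pepperstone tick line into its fields
with open("AUDUSD-2014-09.csv") as f:
    pair, stamp, bid, ask = f.readline().strip().split(",")

date, ticktime = stamp.split()  # '20140901', '00:00:01.323'
print(pair, date, ticktime, float(bid), float(ask))
# AUD/USD 20140901 00:00:01.323 0.93289 0.93297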

Pre-Processing

Here, for each .csv file, we aim to split the date into separate year, month, and day fields, and to remove commas and colons, so that the raw data are ready to be read in as a matrix (array) by any other programming language (e.g. Matlab or Python). A matrix is a mathematically intuitive data structure, and direct referencing of any of its columns lets a backtesting engine run at full thrust.

Let’s play with the AUDUSD-2014-09.csv data file. Working in the same directory where the file is located, we begin with writing a bash script (pp.scr) that contains:

 1  # pp.scr
 2  # Rebinning Pepperstone.com Tick-Data for FX Algo Traders
 3  # (c) 2014 QuantAtRisk, by Pawel Lachowicz
 4
 5  clear
 6  echo "..making a sorted list of .csv files"
 7  for i in $1-*.csv; do echo ${i##$1-} $i;
 8  done | sort -n | awk '{print $2}' > $1.lst
 9
10  python pp.py
11  head AUDUSD.pp

that you run in Terminal:

$ chmod +x pp.scr
$ ./pp.scr AUDUSD

where the first command makes the script executable (you need to perform this task only once). Lines #7-8 of our script look for all .csv data files in the local directory whose names start with the AUDUSD- prefix and write their sorted list into the AUDUSD.lst file. Since we work with the AUDUSD-2014-09.csv file only, the AUDUSD.lst file will contain:

$ cat AUDUSD.lst 
AUDUSD-2014-09.csv
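as expected. Incidentally, the same sorted list can be built without leaving Python; a sketch using glob (not part of the original pp.scr):

# sketch: build the sorted list of monthly files in Python
import glob

files = sorted(glob.glob("AUDUSD-*.csv"))  # lexicographic sort works for zero-padded YYYY-MM names
with open("AUDUSD.lst", "w") as f:
    f.write("\n".join(files) + "\n")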

Next, we utilise the power and flexibility of Python in the following way:

# pp.py
import csv

fnlst="AUDUSD.lst"
fnout="AUDUSD.pp"

# open the output file once, so that several monthly files
# get appended rather than overwritten
with open(fnout,'w') as f:
    writer=csv.writer(f,delimiter=" ")

    i=1 # counts the number of lines with tick-data
    for lstline in open(fnlst,'r'):
        fncur=lstline[:-1]  # file name, stripped of the trailing newline

        # iterate over the file object itself: memory-friendly for big files
        for line in open(fncur,'r'):
            if(i<=5200): # replace with (i>0) to process an entire file
                year=line[8:12]
                month=line[12:14]
                day=line[14:16]
                hh=line[17:19]
                mm=line[20:22]
                ss=line[23:29]
                bidask=line[30:]
                writer.writerow([year,month,day,hh,mm,ss,bidask])
                i+=1

Iterating over the file object like this is a pretty efficient way to open a really big file and process its content line by line. Just for display purposes, in the code we told the computer to process only the first 5,200 lines. The output of lines #10-11 of pp.scr is the following:

2014 09 01 00 00 01.323 "0.93289,0.93297
"
2014 09 01 00 00 02.138 "0.9329,0.93297
"
2014 09 01 00 00 02.156 "0.9329,0.93298
"
2014 09 01 00 00 02.264 "0.9329,0.93297
"
2014 09 01 00 00 02.265 "0.9329,0.93293
"

since we allowed Python to save the bid and ask information as one string (due to a variable number of decimal digits). In order to clean up this mess we continue:

# pp.scr (continued)
echo "..removing token: comma"
sed 's/,/ /g' AUDUSD.pp > $1.tmp
rm AUDUSD.pp
 
echo "..removing token: double quotes"
sed 's/"/ /g' $1.tmp > $1.tmp2
rm $1.tmp
 
echo "..removing empty lines"
sed -i '/^[[:space:]]*$/d' $1.tmp2
mv $1.tmp2 AUDUSD.pp
 
echo "head..."
head AUDUSD.pp
echo "tail..."
tail AUDUSD.pp

which brings us to the pre-processed data:

..removing token: comma
..removing token: double quotes
..removing empty lines
head...
2014 09 01 00 00 01.323  0.93289 0.93297
2014 09 01 00 00 02.138  0.9329 0.93297
2014 09 01 00 00 02.156  0.9329 0.93298
2014 09 01 00 00 02.264  0.9329 0.93297
2014 09 01 00 00 02.265  0.9329 0.93293
2014 09 01 00 00 02.265  0.93289 0.93293
2014 09 01 00 00 02.268  0.93289 0.93295
2014 09 01 00 00 02.277  0.93289 0.93296
2014 09 01 00 00 02.278  0.9329 0.93296
2014 09 01 00 00 02.297  0.93288 0.93296
tail...
2014 09 02 00 54 39.324  0.93317 0.93321
2014 09 02 00 54 39.533  0.93319 0.93321
2014 09 02 00 54 39.543  0.93318 0.93321
2014 09 02 00 54 39.559  0.93321 0.93321
2014 09 02 00 54 39.784  0.9332 0.93321
2014 09 02 00 54 39.798  0.93319 0.93321
2014 09 02 00 54 39.885  0.93319 0.93325
2014 09 02 00 54 39.886  0.93319 0.93321
2014 09 02 00 54 40.802  0.9332 0.93321
2014 09 02 00 54 48.829  0.93319 0.93321
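At this stage AUDUSD.pp is exactly the whitespace-delimited matrix promised earlier, and it can be read in with a single line, e.g. in Python with NumPy:

# the pre-processed file loads directly as a numeric matrix
import numpy as np

a = np.loadtxt("AUDUSD.pp")  # shape: (number of ticks, 8)
print(a[0])  # 2014. 9. 1. 0. 0. 1.323 0.93289 0.93297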

Personally, I love this part: you can learn how to do simple but necessary text-file operations by typing single lines of Unix/Linux commands. Good luck to those who try to do the same in Microsoft Windows in under 30 seconds.
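That said, if you would rather skip the sed passes altogether, the same cleanup can be performed inside pp.py itself; a sketch of an alternative write loop (same output layout, with bid and ask written as separate fields):

# sketch: write bid and ask as separate fields, making the sed cleanup unnecessary
with open("AUDUSD.pp", "w") as fout:
    for line in open("AUDUSD-2014-09.csv"):
        pair, stamp, bid, ask = line.strip().split(",")
        d, t = stamp.split()
        fout.write("%s %s %s %s %s %s %s %s\n" %
                   (d[0:4], d[4:6], d[6:8], t[0:2], t[3:5], t[6:], bid, ask))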

Rebinning: 5-min Data

Rebinning has many schools; for some people it is an art. We just want to get the job done. I opt for simplicity and for understanding the data we deal with. Imagine we have two adjacent 5-min bins with a tick history of trading:

[Figure: two adjacent 5-min bins of tick trading history; red markers denote the price estimates at the 5-min boundaries, blue markers the last tick before and the first tick after each boundary]
We want to derive the closest possible (or fairest) price estimate every 5 min, denoted in the figure above by the red markers. The old-school approach is to take the average over a number (larger than 5) of tick data points from the left and from the right. That tends to under- or overestimate the mid-price.

If we trade live, every 5 min we receive information on the last tick point before the minute hits a multiple of 5, and we wait for the next tick point after it (the blue markers). Taking the average of their prices (the mid-price) makes the most sense. The precision we are after here is sometimes $10^{-5}$. It is of little significance if our position is small, but if it is not, the mid-price may start playing a crucial role.

The con of the old-school approach: it neglects the possibly high volatility of the tick-data within the last 5 minutes.
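Before going through the full code, here is the boundary rule in miniature: a sketch computing a single red-marker price from the two blue-marker ticks (the numbers below are hypothetical):

# sketch: one 5-min boundary price from the two ticks straddling the boundary
b1, a1 = 0.93289, 0.93297  # bid, ask of the last tick before the boundary
b2, a2 = 0.93291, 0.93298  # bid, ask of the first tick at/after the boundary

bm = (b1 + b2) / 2.0       # mid bid
am = (a1 + a2) / 2.0       # mid ask
price = (bm + am) / 2.0    # the red-marker price assigned to the 5-min mark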

The following Python code (pp2.py) performs 5-min rebinning for our pre-processed AUDUSD-2014-09 file:

# pp2.py
import csv
import numpy as np

def convert(data):
    # convert a list of number-strings into a NumPy row vector
    tempDATA = []
    for i in data:
        tempDATA.append([float(j) for j in i.split()])
    return np.array(tempDATA).T

fname="AUDUSD.pp"

with open(fname) as f:
    data = f.read().splitlines()

i=1
for d in data:
    fields=[s for s in d.split(' ')]
    # remove empty elements from the list
    dd=[x for x in fields if x]
    tmp=convert(dd)
    if(i==1):
        a=tmp
        i+=1
    else:
        a = np.vstack([a, tmp])
        i+=1

N=i-1

# print the first line (the mid-price of tick no. 2, with seconds set to zero)
tmp=np.array([a[1][0],a[1][1],a[1][2],a[1][3],a[1][4],0.0,(a[1][6]+a[1][7])/2])
print("%.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f" %
             (tmp[0],tmp[1],tmp[2],tmp[3],tmp[4],tmp[5],tmp[6]))
m=tmp

# check the boundary conditions (5 min bins); use range, not xrange, in Python 3
for i in range(2,N-1):
    if( (a[i-1][4]%5!=0.0) and (a[i][4]%5==0.0)):

        # BLUE MARKER No. 1: the last tick before the boundary (row i-1)
        b1=a[i-1][6]
        a1=a[i-1][7]
        # BLUE MARKER No. 2: the first tick at/after the boundary (row i)
        b2=a[i][6]
        a2=a[i][7]
        # mid-price, and new date for the 5 min bin
        bm=(b1+b2)/2
        am=(a1+a2)/2
        Ym=a[i][0]
        Mm=a[i][1]
        Dm=a[i][2]
        Hm=a[i][3]
        MMm=a[i][4]
        Sm=0.0        # set seconds to zero

        # RED MARKER
        print("%.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f" %
              (Ym,Mm,Dm,Hm,MMm,Sm,(bm+am)/2))
        tmp=np.array([Ym,Mm,Dm,Hm,MMm,Sm,(bm+am)/2])
        m=np.vstack([m, tmp])

which you run from the pp.scr file as:

# pp.scr (continued)
 
python pp2.py > AUDUSD.dat

in order to get the 5-min rebinned FX time-series as follows:

$ head AUDUSD.dat
2014  9  1  0  0  0.000   0.932935
2014  9  1  0  5  0.000   0.933023
2014  9  1  0 10  0.000   0.932917
2014  9  1  0 15  0.000   0.932928
2014  9  1  0 20  0.000   0.932937
2014  9  1  0 25  0.000   0.933037
2014  9  1  0 30  0.000   0.933075
2014  9  1  0 35  0.000   0.933070
2014  9  1  0 40  0.000   0.933092
2014  9  1  0 45  0.000   0.933063
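A quick visual check of the result; a sketch assuming matplotlib is installed:

# sketch: plot the 5-min rebinned series
import numpy as np
import matplotlib.pyplot as plt

y, m, d, hh, mm, ss, price = np.loadtxt("AUDUSD.dat", unpack=True)
plt.plot(price)
plt.xlabel("5-min bins since 2014-09-01 00:00")
plt.ylabel("AUD/USD mid-price")
plt.show()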

That concludes our efforts. Happy rebinning!

Ideal Stock Trading Model for the Purpose of Backtesting Only


There is only one goal in algorithmic trading: to win the game. To emerge victorious. To feel the sunlight again after months of walking through the valley of shadows, stuck in the depths of algo drawdowns. An endless quest for the most brilliant minds: to find the way to win over and over, and over again. To discover a perfect solution as your new private weapon in this battleground. Is it possible? If the answer were so simple, we wouldn’t be so engaged in seeking a golden mean.

However, algo trading is a journey, and sometimes in the process of programming and testing our trading systems we need an ideal trading model, ready to use straight away! What I mean by an ideal model is a sort of template, a benchmark that allows us to list a number of successful trades: their opening and closing times, the open and close prices of the trades being executed, and the realized return coming down to our pocket from each trade. Such a template also allows us to look at the trading data from a different perspective and to re-consider and/or apply an additional risk-management framework. In fact, for backtesting purposes the benefits are limitless.

In this post we will explore one of the easiest ways of programming a perfect model: re-directing the time-axis backwards. Using the example of Google, Inc. (GOOG) stock listed at NASDAQ, we will analyse the stock’s trading history and find all possible trades returning at least 3% over the past decade.

I will use the results of this strategy within the upcoming (this week!) series of new posts on Portfolio Optimization in Matlab for Algo Traders.

Model and Data

Let’s imagine we are interested in finding a large number of trades, each with an expected return of at least 3%. We consider GOOG stock (daily close prices), with data spanning 365$\times$10 days back from the present day (the last 10 years). We will make use of Google Finance data powered by Quandl, as described in one of my previous posts, Create a Portfolio of Stocks based on Google Finance Data fed by Quandl. Shall we begin?

% Ideal Stock Trading Model for the Purpose of Backtesting Only
%
% (c) 2013 QuantAtRisk.com, by Pawel Lachowicz
 
 
clear all; close all; clc;
 
stocklist={'GOOG'};
 
% Read in the list of tickers and internal code from Quandl.com
[ndata, text, alldata] = xlsread('QuandlStockCodeListUS.xlsx');
quandlc=text(:,1); % again, as a list in a cell array
quandlcode=text(:,3) % corresponding Quandl's Price Code
 
% fetch stock data for the last 10 years
date2=datestr(today,'yyyy-mm-dd')        % end date (today)
date1=datestr(today-365*10,'yyyy-mm-dd') % start date (10 years back)
 
stockd={};
for i=1:length(stocklist)
    for j=1:length(quandlc)
        if(strcmp(stocklist{i},quandlc{j}))
            fprintf('%4.0f %s\n',i,quandlc{j});
            % fetch the stock data from Quandl using the
            % recommended Quandl.get command, saving it
            % directly into a FINTS object (fts)
            fts=0;
            [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ...
                          'authcode','ENTER_YOUR_CODE',...
                          'start_date',date1,'end_date',date2);
            stockd{i}=fts; % entire FTS object in an array's cell
        end
    end
end

We can immediately visualize the GOOG data extracted from Google Finance via Quandl:

% plot the close prices of GOOG
cp=fts2mat(stockd{1}.Close,1);
plot(cp(:,1),cp(:,2),'color',[0.6 0.6 0.6])
xlim([min(cp(:,1)) max(cp(:,1))]);
ylim([min(cp(:,2)) max(cp(:,2))]);
xlabel('Nov 2003 - Nov 2013 (days)');
ylabel('GOOG Close Price ($)');

[Figure 1: GOOG daily close price, Nov 2003 - Nov 2013]

We open a trade on the first day (a long position) and, as time goes by, on each following day we check whether the stock value has increased by 3% or more. If not, we move the current time position one day forward and check again. If the condition is met, we close the open trade on the following trading (business) day at the stock’s market price. We allow a new trade to be opened at the stock’s market price one business day after the day on which the previous trade was closed (exercised). Additionally, we check whether the ‘running’ return of the open trade has fallen to -5% or below. If so, we substitute the open time of the trade with the time at which that condition was fulfilled (and the same for its open price). The latter rule allows us to restart the backtesting process, i.e. the search for a new profitable trade.

t0=cp(1,1);    % starting day for backtesting
tN=cp(end,1);  % last day
 
trades=[]; % open a log-book for all executed trades
status=0; % status meaning: 0-no open trade, 1-open trade
t=t0;
% we loop over time (t) [days]
while(t<tN)
    [r,~,~]=find(t==cp(:,1)); % check the row in cp vector
    if(~isempty(r))
       if(~status)
           topen=t; % time when we open the trade
           popen=cp(r,2); % assuming market price of the stock
           status=1;
       else
           ptmp=cp(r,2); % running close price
           rtmp=ptmp/popen-1; % running return of the open trade
           if(rtmp>0.03) % check 3% profit condition
               % if met, then
               tclose=busdate(t,1); % close time of the trade
                                    %  assumed on the next business day
               t=busdate(tclose,1); % next day in the loop
               if(tclose<=tN)
                   [r,~,~]=find(tclose==cp(:,1));
                   pclose=cp(r,2); % close price
                   ret=pclose/popen-1; % realized profit/loss
                   % save the trade details into log-book
                   trades=[trades; topen tclose popen pclose ret*100];
                   status=0; % change status of trading to not-open
                   % mark the opening of the trade as blue dot marker
                   hold on; plot(topen,popen,'b.','markerfacecolor','b');
                   % mark the end time of the trade
                   hold on; plot(tclose,pclose,'r.','markerfacecolor','r');
               end
           elseif(rtmp<=-0.05) % check an additional condition
               topen=t; % overwrite the time
               popen=cp(r,2); % and the price
               status=1; % sustain the status of the trade as 'open'
           else
               t=t+1;
           end
       end
    else
        t=t+1;
    end
end

In this piece of code, the matrix trades (a log-book of all exercised trades) stores the history of all successful trades meeting our assumed criteria. The only uncertainty we allow to slip into our perfect solution occurs when the close price on the next business day turns out to be lower, making the realized profit from the trade less than 3%. By plotting all good trades, with the ending day $tN$ set to Nov 18, 2013, we get a messy picture:

[Figure 2: GOOG close price with trade openings (blue) and closings (red) marked]

which translates into a more intuitive one once we examine the distribution of profits over all trades:

figure(3);
hist(trades(:,5),50);
xlabel('Profit/loss (%)');
ylabel('Number of trades');

[Figure 3: distribution of profit/loss (%) over all exercised trades]

At this point the most valuable information is contained in the log-book, whose content we can analyse trade by trade:

>> format shortg
>> trades
 
trades =
 
   7.3218e+05   7.3218e+05       100.34        109.4       9.0293
   7.3218e+05   7.3220e+05       104.87          112       6.7989
   7.3221e+05   7.3221e+05       113.97       119.36       4.7293
   7.3221e+05   7.3222e+05       117.84       131.08       11.236
   7.3222e+05   7.3222e+05        129.6       138.37        6.767
   7.3223e+05   7.3224e+05       137.08       144.11       5.1284
   7.3224e+05   7.3224e+05       140.49       172.43       22.735
   7.3224e+05   7.3225e+05        187.4       190.64       1.7289
   ...
   7.3533e+05   7.3535e+05       783.05       813.45       3.8823
   7.3535e+05   7.3536e+05        809.1       861.55       6.4825
   7.3536e+05   7.3537e+05       857.23       915.89        6.843
   7.3546e+05   7.3549e+05       856.91       888.67       3.7063
   7.3549e+05   7.3553e+05       896.19       1003.3       11.952

where the columns correspond, respectively, to the open and close times of the trade (Matlab’s continuous serial date number for financial time-series; see the datestr command for the yyyy-mm-dd date format), the open and close prices of GOOG stock, and the realized profit/loss of the trade.
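If you export the log-book outside Matlab (e.g. with save('trades.txt','trades','-ascii')), the serial date numbers in the first two columns can be converted in Python as well; a sketch using the standard 366-day offset between Matlab datenums and Python ordinals:

# sketch: convert a Matlab datenum to a Python datetime;
# Matlab counts days from year 0, Python ordinals from year 1, hence the shift
from datetime import datetime, timedelta

def datenum_to_datetime(dn):
    return datetime.fromordinal(int(dn)) + timedelta(days=dn % 1) - timedelta(days=366)

print(datenum_to_datetime(732180.0))  # 2004-08-21 00:00:00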

Questions? Discuss on Forum.

Just dive directly into the Backtesting section on the QaR Forum and keep up, never give up.

Trend Identification for FX Traders


When you think about inventing a new model for algorithmic trading, there are only three key elements you need to start your work: creativity, data, and a programming tool. Assuming that the last two are already in your possession, all that remains is seeking and finding a great new idea! With no offense, that’s the hardest part of the game.

To be successful in discovering new trading solutions you have to be completely open-minded, relaxed, and fully oriented in the information pertaining to your topic. Personally, after many years of programming and playing with digital signal processing techniques, I have discovered that the most essential ingredient of well-grounded research is the data itself. The longer I stared at time-series changing their properties, the more I was able to capture subtle differences, often overlooked by me before, and with the aid of intuition and scientific experience some new ideas simply popped up.

Here I would like to share with you a part of this process.

In the Extracting Time-Series from Tick-Data article I outlined one of many possible ways of extracting FX time-series from these very fine data sets. As the final product we obtained two files, namely:

audusd.bid.1h
audusd.ask.1h

corresponding to the Bid and Ask prices of the Forex AUDUSD pair’s trading history between Jan 2000 and May 2010. Each file contains two columns of numbers: Time (Modified Julian Day) and Price. The time resolution has been selected to be 1 hour.

FOREX trading lasts from Monday to Friday, continuously for 24 hours. The data therefore contain regular gaps corresponding to weekends. Since the coverage is more abundant compared with, for example, the much shorter daily trading windows of equities or ETFs around the world, it provides a better picture of trading directions within every week’s time frame. Keeping that in mind, we might be interested in the directional information conveyed by the data as the seed of a potential new FX model.
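Those weekend gaps are easy to locate programmatically; a sketch (assuming the two-column layout described above) that flags any sampling gap longer than, say, two hours:

# sketch: flag gaps (e.g. weekends) in the 1-hour series
import numpy as np

t, p = np.loadtxt("audusd.bid.1h", unpack=True)  # MJD, price
dt = np.diff(t)
gaps = np.where(dt > 2.0 / 24)[0]  # indices where the gap exceeds 2 hours
print("%d gaps; the longest spans %.2f days" % (len(gaps), dt.max()))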

For now, let’s focus solely on the initial pre-processing of the Bid and Ask time-series and on splitting each week into a common cell array.

 1  % FX time-series analysis
 2  % (c) Quant at Risk, 2012
 3  %
 4  % Part 1: Separation of the weeks
 5
 6  close all; clear all; clc;
 7
 8  % --analyzed FX pair
 9  pair=['audusd'];
10
11  % --data
12  n=['./',pair,'/',pair];     % a common path to files
13  na=[n,'.ask.1h'];
14  nb=[n,'.bid.1h'];
15  d1=load(na); d2=load(nb);   % loading ask and bid data
16  d=(d1+d2)/2;                % blending
17  clear d1 d2

For the sake of simplicity, in line 16 we decided to use the simple average of the Bid and Ask 1-hour prices for our further research. Next, we create a weekly template, $x$, for our data classification, and we find the total number of weeks available for analysis:

% time constraints from the data
t0=min(d(:,1));
tN=max(d(:,1));
t1=t0-1;
 
% weekly template for data classification
x=t1:7:tN+7;
 
% total number of weeks
nw=length(x)-1;
 
fprintf(upper(pair));
fprintf(' time-series: %3.0f weeks (%5.2f yrs)\n',nw,nw/52);

which in our case returns positive news:

AUDUSD time-series: 539 weeks (10.37 yrs)

The core of the programming exercise is to split all 539 weeks and save them into the cell array $week$. As we will see in the code section below, we want to ensure that each week contains the same number of points; therefore any data points missing from our FX data provider will be interpolated. To do that efficiently, we use the following function, which applies Piecewise Cubic Hermite Interpolating Polynomial (PCHIP) interpolation to fill the gaps in the series:

function [x2,y2]=gapinterpol(x,y,dt)
    % specify the axis
    x_min=x(1);
    x_max=x(length(x));
    x2=(x_min:dt:x_max);
    % interpolate the gaps
    y2=pchip(x,y,x2);
end
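For completeness, an equivalent sketch in Python, built on SciPy’s PCHIP interpolator:

# sketch: PCHIP gap-filling analogous to gapinterpol.m
import numpy as np
from scipy.interpolate import pchip_interpolate

def gapinterpol(x, y, dt):
    # regular grid with step dt (e.g. 1/24 day); unlike Matlab's colon
    # operator, np.arange may or may not include the right endpoint
    x2 = np.arange(x[0], x[-1] + dt, dt)
    y2 = pchip_interpolate(x, y, x2)
    return x2, y2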

We realize the separation of weeks in our program as follows:

week={};  % an empty cell array
avdt=[]; 
 
for i=1:nw
    % split FX signal according to week
    [r,c,v]=find(d(:,1)>x(i) & d(:,1)<x(i+1));
    x1=d(r,1); y1=d(r,2);
    % interpolate gaps, use 1-hour bins
    dt=1/24;
    [x2,y2]=gapinterpol(x1,y1,dt);
    % check the average sampling time; it should equal dt
    s=0;
    for j=1:length(x2)-1
        s=s+(x2(j+1)-x2(j));
    end
    tmp=s/(length(x2)-1);
    avdt=[avdt; tmp];
    % store the week signal in a cell array
    tmp=[x2; y2]; tmp=tmp';
    week{i}=tmp;
end
fprintf('average sampling after interpolation = %10.7f [d]\n',max(avdt));

where as a check-up we get:

average sampling after interpolation =  0.0416667 [d]

which corresponds to the expected value of $1/24$ day to a sufficient approximation.

A quick visual verification of our signal processing,

scrsz = get(0,'ScreenSize');
h=figure('Position',[70 scrsz(4)/2 scrsz(3)/1.1 scrsz(4)/2],'Toolbar','none');
hold off;
for i=1:nw
    w=week{i};
    x=w(:,1); y=w(:,2);
    % plot weekly signal
    hold on; plot(x,y,'k');
end
xlim([0 100]);

uncovers our desired result:

[Figure: the weekly AUD/USD signals plotted on a common axis]
