Quantitative Analysis, Risk Management, Modelling, Algo Trading, and Big Data Analysis

Financial Time-Series Segmentation Based On Turning Points in Python

The determination of peaks and troughs in any financial time-series always seems to be in high demand, especially in algorithmic trading. A number of numerical methods can be found in the literature. The main difficulty arises when a smart differentiation between a local trend and the “global” sentiment needs to be translated into computer language. In this short post, we refer fully to the publication of Yin, Si, & Gong (2011) on Financial Time-Series Segmentation using Turning Points, wherein the authors proposed an appealing way to simplify the “noisy” character of financial (high-frequency) time-series.

Since the publication presents an easy-to-digest historical introduction to the problem along with pseudo-code for a novel solution, let me skip that part here and refer you to the paper itself (download the .pdf here).

We develop a Python implementation of the pseudo-code as follows. We start with a dataset: the 4-level order-book record of the Hang Seng Index as traded on Jan 4, 2016 (download 20160104_orderbook.csv.zip; 8 MB). The data cover both the morning and afternoon trading sessions:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
 
# Reading Orderbook
data = pd.read_csv('20160104_orderbook.csv')
data['MidPrice0'] = (data.AskPrice0 + data.BidPrice0)/2.  # mid-price
 
# Split Data according to Sessions
delta = np.diff(data.Timestamp)
# find a good separation index
k = np.where(delta > np.max(delta)/2)[0][0] + 1
 
data1 = data[0:k].copy()  # Session 12:15-15:00
data2 = data[k+1:].copy() # Session 16:00-19:15
data2.index = range(len(data2))
 
plt.figure(figsize=(10,5))
plt.plot(data1.Timestamp, data1.MidPrice0, 'r', label="Session 12:15-15:00")
plt.plot(data2.Timestamp, data2.MidPrice0, 'b', label="Session 16:00-19:15")
plt.legend(loc='best')
plt.axis('tight')

revealing:

[Figure: mid-price of the Hang Seng Index order book for both trading sessions]

The Turning Points pseudo-algorithm of Yin, Si, & Gong (2011) can be organised into simple Python functions in a straightforward way, namely:

def first_tps(p):
    tp = []
    for i in range(1, len(p)-1):
        if((p[i] < p[i+1]) and (p[i] < p[i-1])) or ((p[i] > p[i+1]) \
           and (p[i] > p[i-1])):
            tp.append(i)
    return tp
 
def contains_point_in_uptrend(i, p):
    if(p[i] < p[i+1]) and (p[i] < p[i+2]) and (p[i+1] < p[i+3]) and \
              (p[i+2] < p[i+3]) and \
              (abs(p[i+1] - p[i+2]) < abs(p[i] - p[i+2]) + abs(p[i+1] - p[i+3])):
        return True
    else:
        return False
 
def contains_point_in_downtrend(i, p):
    if(p[i] > p[i+1]) and (p[i] > p[i+2]) and (p[i+1] > p[i+3]) and \
           (p[i+2] > p[i+3]) and \
           (abs(p[i+2] - p[i+1]) < abs(p[i] - p[i+2]) + abs(p[i+1] - p[i+3])):
        return True
    else:
        return False
 
def points_in_the_same_trend(i, p, thr):
    if(abs(p[i]/p[i+2]-1) < thr) and (abs(p[i+1]/p[i+3]-1) < thr):
        return True
    else:
        return False
 
def turning_points(idx, p, thr):
    i = 0
    tp = []
    while(i < len(idx)-3):
        if contains_point_in_downtrend(idx[i], p) or \
           contains_point_in_uptrend(idx[i], p) \
              or points_in_the_same_trend(idx[i], p, thr):
            tp.extend([idx[i], idx[i+3]])
            i += 3
        else:
            tp.append(idx[i])
            i += 1
    return tp

The algorithm allows us to specify a number $k$ (or a range) of sub-levels for time-series segmentation. The “deeper” we go, the more distinctive the peaks and troughs that remain. Have a look:

thr = 0.05
sep = 75  # separation for plotting
 
P1 = data1.MidPrice0.values
P2 = data2.MidPrice0.values
tp1 = first_tps(P1)
tp2 = first_tps(P2)
 
plt.figure(figsize=(16,10))
 
plt.plot(data1.Timestamp, data1.MidPrice0, 'r', label="Session 12:15-15:00")
plt.plot(data2.Timestamp, data2.MidPrice0, 'b', label="Session 16:00-19:15")
plt.legend(loc='best')
 
for k in range(1, 10):  # k over a given range of sub-levels
    tp1 = turning_points(tp1, P1, thr)
    tp2 = turning_points(tp2, P2, thr)
    plt.plot(data1.Timestamp[tp1], data1.MidPrice0[tp1]-sep*k, 'k')
    plt.plot(data2.Timestamp[tp2], data2.MidPrice0[tp2]-sep*k, 'k')
 
plt.axis('tight')
plt.ylabel('Price')
plt.xlabel('Timestamp')

[Figure: both sessions with turning-point segmentations for sub-levels k = 1..9, plotted offset below the price series]

It is highly tempting to use the code as a supportive indicator for the confirmation of new trends in a single time-series, or to build a concurrently running decomposition (segmentation at the same sub-level) for two or more parallel time-series (e.g. FX pairs). Enjoy!
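
As a quick sanity check of how aggressively each sub-level prunes the series, one can simply count the surviving turning points per level. A minimal sketch, reusing P1, thr, and the functions defined above:

tp = first_tps(P1)
for k in range(1, 10):
    tp = turning_points(tp, P1, thr)
    print("sub-level %d: %d turning points left" % (k, len(tp)))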

Rebinning Tick-Data for FX Algo Traders

If you work or intend to work with FX data in order to build and backtest your own FX models, the Historical Tick-Data of Pepperstone.com is probably the best place to kick off your algorithmic experience. As of now, they offer tick-data sets of the 15 most frequently traded currency pairs since May 2009. Some of the unzipped files (one month of data) reach over 400 MB in size, i.e. they store 8.5+ million lines with tick resolution for both bid and ask prices. A good thing is that you can download them all free of charge and their quality is regarded as very high. A bad thing is that there is a 3-month delay in data accessibility.

Rebinning that tick-data, however, is a different story and the subject of this post. As an example, we will see how efficiently you can turn Pepperstone’s tick-data set(s) into a 5-min time-series. We will make use of scripting in bash (Linux/OS X) supplemented with data processing in Python.

Data Structure

You can download Pepperstone’s historical tick-data from here, month by month, pair by pair. Their inner structure follows the same pattern, namely:

$ head AUDUSD-2014-09.csv 
AUD/USD,20140901 00:00:01.323,0.93289,0.93297
AUD/USD,20140901 00:00:02.138,0.9329,0.93297
AUD/USD,20140901 00:00:02.156,0.9329,0.93298
AUD/USD,20140901 00:00:02.264,0.9329,0.93297
AUD/USD,20140901 00:00:02.265,0.9329,0.93293
AUD/USD,20140901 00:00:02.265,0.93289,0.93293
AUD/USD,20140901 00:00:02.268,0.93289,0.93295
AUD/USD,20140901 00:00:02.277,0.93289,0.93296
AUD/USD,20140901 00:00:02.278,0.9329,0.93296
AUD/USD,20140901 00:00:02.297,0.93288,0.93296

The columns, from left to right, represent respectively: a pair name, the date and tick-time, the bid price, and the ask price.
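
As a side note, if you prefer to stay in Python end-to-end, the raw file also loads directly into pandas. A minimal sketch; the column names are my own choice, since the file ships without a header:

import pandas as pd

cols = ['Pair', 'DateTime', 'Bid', 'Ask']
tick = pd.read_csv('AUDUSD-2014-09.csv', names=cols)
tick['DateTime'] = pd.to_datetime(tick['DateTime'], format='%Y%m%d %H:%M:%S.%f')
print(tick.head())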

Pre-Processing

Here, for each .csv file, we aim to split the date into year, month, and day separately, and to remove commas and colons, so that the raw data are ready to be read in as a matrix (array) using any other programming language (e.g. Matlab or Python). A matrix is a mathematically intuitive data structure, and direct access to any specific column of it lets a backtesting engine run at full thrust.

Let’s play with the AUDUSD-2014-09.csv data file. Working in the directory where the file is located, we begin by writing a bash script (pp.scr) that contains:

# pp.scr
# Rebinning Pepperstone.com Tick-Data for FX Algo Traders 
# (c) 2014 QuantAtRisk, by Pawel Lachowicz
 
clear
echo "..making a sorted list of .csv files"
for i in $1-*.csv; do echo ${i##$1-} $i ${i%.csv};
done | sort -n | awk '{print $2}' > $1.lst
 
python pp.py
head AUDUSD.pp

that you run in Terminal:

$ chmod +x pp.scr
$ ./pp.scr AUDUSD

where the first command makes sure the script becomes executable (you need to perform this task only once). The for loop in the script looks for all .csv data files in the local directory starting with the AUDUSD- prefix and creates their sorted list in the AUDUSD.lst file. Since we work with the AUDUSD-2014-09.csv file only, the AUDUSD.lst file will contain:

$ cat AUDUSD.lst 
AUDUSD-2014-09.csv

as expected. Next, we utilise the power and flexibility of Python in the following way:

# pp.py
import csv

fnlst="AUDUSD.lst"
fnout="AUDUSD.pp"

# open the output file once, so that consecutive input files
# are appended to it instead of overwriting each other
with open(fnout,'w') as f:
    writer=csv.writer(f,delimiter=" ")

    for lstline in open(fnlst,'r').readlines():
        fncur=lstline[:-1]  # strip the trailing newline
        #print(fncur)

        i=1 # counts the number of lines with tick-data
        for line in open(fncur,'r').readlines():
            if(i<=5200): # replace with (i>0) to process an entire file
                #print(line)
                year=line[8:12]
                month=line[12:14]
                day=line[14:16]
                hh=line[17:19]
                mm=line[20:22]
                ss=line[23:29]
                bidask=line[30:]
                writer.writerow([year,month,day,hh,mm,ss,bidask])
                i+=1

This is a pretty efficient way to open a really big file and process its information line by line. Just for display purposes, in the code we told the computer to process only the first 5,200 lines. The output of the last two lines of pp.scr (python pp.py; head AUDUSD.pp) is the following:

2014 09 01 00 00 01.323 "0.93289,0.93297
"
2014 09 01 00 00 02.138 "0.9329,0.93297
"
2014 09 01 00 00 02.156 "0.9329,0.93298
"
2014 09 01 00 00 02.264 "0.9329,0.93297
"
2014 09 01 00 00 02.265 "0.9329,0.93293
"

since we allowed Python to save bid and ask information as one string (due to a variable number of decimal digits). In order to clean this mess we continue:

# pp.scr (continued)
echo "..removing token: comma"
sed 's/,/ /g' AUDUSD.pp > $1.tmp
rm AUDUSD.pp
 
echo "..removing token: double quotes"
sed 's/"/ /g' $1.tmp > $1.tmp2
rm $1.tmp
 
echo "..removing empty lines"
sed -i '/^[[:space:]]*$/d' $1.tmp2
mv $1.tmp2 AUDUSD.pp
 
echo "head..."
head AUDUSD.pp
echo "tail..."
tail AUDUSD.pp

which brings us to the pre-processed data:

..removing token: comma
..removing token: double quotes
..removing empty lines
head...
2014 09 01 00 00 01.323  0.93289 0.93297
2014 09 01 00 00 02.138  0.9329 0.93297
2014 09 01 00 00 02.156  0.9329 0.93298
2014 09 01 00 00 02.264  0.9329 0.93297
2014 09 01 00 00 02.265  0.9329 0.93293
2014 09 01 00 00 02.265  0.93289 0.93293
2014 09 01 00 00 02.268  0.93289 0.93295
2014 09 01 00 00 02.277  0.93289 0.93296
2014 09 01 00 00 02.278  0.9329 0.93296
2014 09 01 00 00 02.297  0.93288 0.93296
tail...
2014 09 02 00 54 39.324  0.93317 0.93321
2014 09 02 00 54 39.533  0.93319 0.93321
2014 09 02 00 54 39.543  0.93318 0.93321
2014 09 02 00 54 39.559  0.93321 0.93321
2014 09 02 00 54 39.784  0.9332 0.93321
2014 09 02 00 54 39.798  0.93319 0.93321
2014 09 02 00 54 39.885  0.93319 0.93325
2014 09 02 00 54 39.886  0.93319 0.93321
2014 09 02 00 54 40.802  0.9332 0.93321
2014 09 02 00 54 48.829  0.93319 0.93321

Personally, I love this part, as you can learn how to do simple but necessary text-file operations by typing single lines of Unix/Linux commands. Good luck to those who try to repeat the same in Microsoft Windows in under 30 seconds.
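
At this point the file is ready for any numerical environment. In Python, for example, a single call turns it into a matrix (a sketch; np.loadtxt treats any run of whitespace as a delimiter):

import numpy as np

a = np.loadtxt('AUDUSD.pp')  # shape: (number_of_ticks, 8)
print(a[0])  # [2014. 9. 1. 0. 0. 1.323 0.93289 0.93297]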

Rebinning: 5-min Data

Rebinning has many schools; for some people it’s an art. We just want to get the job done. I opt for simplicity and for understanding the data we deal with. Imagine we have two adjacent 5-min bins with a tick history of trading:

[Figure: two adjacent 5-min bins of tick data; a red marker at the bin boundary, blue markers for the last tick before and the first tick after it]
We want to derive the closest possible (or fairest) price estimate every 5 min, denoted in the sketch above by a red marker. The old-school approach is to take the average over a number (larger than 5) of tick data points from the left and from the right. That creates an under- or overestimation of the mid-price.

If we trade live, every 5 min we receive information on the last tick before the minute hits 5, and we wait for the first tick after it (blue markers). Taking the average of their prices (the mid-price) makes the most sense. The precision we look at here is sometimes $10^{-5}$. It is of little significance if our position is small, but if it is not, the mid-price may start playing a crucial role.

The con of the old-school approach: it neglects the possibly high volatility among the tick-data within the last 5 minutes.
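
For comparison, the boundary-pair idea can also be sketched with pandas resampling: take the last tick inside each 5-min bin and the first tick of the following bin, then average their mid-prices. This is only a sketch of the concept, not the pp2.py code below; the column names and bin labels are my own assumptions:

import pandas as pd

cols = ['Pair', 'DateTime', 'Bid', 'Ask']
tick = pd.read_csv('AUDUSD-2014-09.csv', names=cols)
tick['DateTime'] = pd.to_datetime(tick['DateTime'], format='%Y%m%d %H:%M:%S.%f')
df = tick.set_index('DateTime')
mid = (df['Bid'] + df['Ask']) / 2.

last_before = mid.resample('5min').last()             # last tick inside each 5-min bin
first_after = mid.resample('5min').first().shift(-1)  # first tick of the following bin
bars = (last_before + first_after) / 2.               # mid-price estimate at each bin boundary
print(bars.head())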

The following Python code (pp2.py) performs 5-min rebinning for our pre-processed AUDUSD-2014-09 file:

# pp2.py
import csv
import numpy as np
 
def convert(data):
     tempDATA = []
     for i in data:
         tempDATA.append([float(j) for j in i.split()])
     return np.array(tempDATA).T
 
fname="AUDUSD.pp"
 
with open(fname) as f:
    data = f.read().splitlines()
 
#print(data)
 
i=1
for d in data:
    list=[s for s in d.split(' ')]
    #print(list)
    # remove empty elements from the list
    dd=[x for x in list if x]
    #print(dd)
    tmp=convert(dd)
    #print(tmp)
    if(i==1):
        a=tmp
        i+=1
    else:
        a = np.vstack([a, tmp])
        i+=1
 
N=i-1
#print("N = %d" % N)
 
# print the first line
tmp=np.array([a[1][0],a[1][1],a[1][2],a[1][3],a[1][4],0.0,(a[1][6]+a[1][7])/2])
print("%.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f" %
             (tmp[0],tmp[1],tmp[2],tmp[3],tmp[4],tmp[5],tmp[6]))
m=tmp
 
# check the boundary conditions (5 min bins)
for i in xrange(2,N-1):
    if( (a[i-1][4]%5!=0.0) and (a[i][4]%5==0.0)):
 
        # BLUE MARKER No. 1
        # (print for i-1)
        #print(" %.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f %10.6f" %
        #      (a[i-1][0],a[i-1][1],a[i-1][2],a[i-1][3],a[i-1][4],a[i-1][5],a[i-1][6],a[i-1][7]))
        b1=a[i-1][6]
        b2=a[i][6]
        a1=a[i-1][7]
        a2=a[i][7]
        # mid-price, and new date for 5 min bin
        bm=(b1+b2)/2
        am=(a1+a2)/2
        Ym=a[i][0]
        Mm=a[i][1]
        Dm=a[i][2]
        Hm=a[i][3]
        MMm=a[i][4]
        Sm=0.0        # set seconds to zero
 
        # RED MARKER
        print("%.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f" %
              (Ym,Mm,Dm,Hm,MMm,Sm,(bm+am)/2))
        tmp=np.array([Ym,Mm,Dm,Hm,MMm,Sm,(bm+am)/2])
        m=np.vstack([m, tmp])
 
        # BLUE MARKER No. 2
        # (print for i)
        #print(" %.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f %10.6f" %
        #      (a[i][0],a[i][1],a[i][2],a[i][3],a[i][4],a[i][5],a[i][6],a[i][7]))

which you invoke in the pp.scr file as:

# pp.scr (continued)
 
python pp2.py > AUDUSD.dat

in order to get the 5-min rebinned FX time-series as follows:

$ head AUDUSD.dat
2014  9  1  0  0  0.000   0.932935
2014  9  1  0  5  0.000   0.933023
2014  9  1  0 10  0.000   0.932917
2014  9  1  0 15  0.000   0.932928
2014  9  1  0 20  0.000   0.932937
2014  9  1  0 25  0.000   0.933037
2014  9  1  0 30  0.000   0.933075
2014  9  1  0 35  0.000   0.933070
2014  9  1  0 40  0.000   0.933092
2014  9  1  0 45  0.000   0.933063

That concludes our efforts. Happy rebinning!

Hacking Google Finance in Real-Time for Algorithmic Traders

Forecasting risk in algorithmic stock trading is of paramount importance for everyone. You should always look for ways to detect sudden price changes and take immediate action to protect your investments.

Imagine you opened a new long position in NASDAQ:NVDA last Wednesday, buying 1500 shares at the market price of USD 16.36. The next day the price closes the session down at USD 15.75. You are down 3.7%, or almost a grand, in one day. If you can handle that, it’s okay, but what if the drop were steeper? Another terrorist attack, an unforeseen political event, a North Korean nuclear strike? Then what? You need to react!

If you have information, you have options in your hands. In this post we will see how one can use the real-time stock prices displayed on the Google Finance website, fetching and recording them on your computer. Having them, you can build your own warning system for sudden price swings (risk management) or run the code in the background for a whole trading session (for any stock, index, etc. accessible through Google Finance) and capture asset prices with intraday sampling (e.g. every 10 min, 30 min, 1 h, and so on). From that point, only your imagination can stop you from making use of all the collected data.

Hacking with Python

If you ever dreamt of becoming a hacker, this is your chance to shine! I got my inspiration after reading the book Violent Python: A Cookbook for Hackers, Forensic Analysts, Penetration Testers and Security Engineers by TJ O’Connor. A powerful combination of the beauty and the beast.

The core of our code is contained in a small function which does the job. For a specified Google-style ticker (query), it fetches the data directly from the server, returning the most current price of an asset:

# Hacking Google Finance in Real-Time for Algorithmic Traders
# 
# (c) 2014 QuantAtRisk.com, by Pawel Lachowicz
 
import urllib, time, os, re, csv
 
def fetchGF(googleticker):
    url="http://www.google.com/finance?&q="
    txt=urllib.urlopen(url+googleticker).read()
    k=re.search('id="ref_(.*?)">(.*?)<',txt)
    if k:
        tmp=k.group(2)
        q=tmp.replace(',','')
    else:
        q="Nothing found for: "+googleticker
    return q
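
Note that urllib.urlopen is Python 2 syntax. A minimal Python 3 variant of the same function could look as follows; the URL and the id="ref_..." markup are taken from the code above and may no longer exist on Google’s side:

import re
import urllib.request

def fetchGF3(googleticker):
    # same scraping idea as fetchGF above, ported to Python 3
    url = "http://www.google.com/finance?&q=" + googleticker
    txt = urllib.request.urlopen(url).read().decode('utf-8', errors='ignore')
    k = re.search(r'id="ref_(.*?)">(.*?)<', txt)
    if k:
        return k.group(2).replace(',', '')
    return "Nothing found for: " + googleticker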

Just make sure that the Google ticker is correctly specified (as we will see below). Next, let’s display our local time on the screen and force a change of the process’s time zone to the one corresponding to New York City, NY. We make the latter assumption as we would like to track the intraday prices of stock(s) traded on NYSE or NASDAQ. However, if you are tracking the FTSE 100 index, the Universal Time (UTC) of London is advisable as an input parameter.

# display time corresponding to your location
print(time.ctime())
print
 
# Set local time zone to NYC
os.environ['TZ']='America/New_York'
time.tzset()
t=time.localtime() # string
print(time.ctime())
print

Having that, let us define a side-function, combine, which we will use to glue all the fetched data together into a Python list variable:

def combine(ticker):
    quote=fetchGF(ticker) # use the core-engine function
    t=time.localtime()    # grasp the moment of time
    output=[t.tm_year,t.tm_mon,t.tm_mday,t.tm_hour,  # build a list
            t.tm_min,t.tm_sec,ticker,quote]
    return output

As an input, we define the Google ticker of interest:

ticker="NASDAQ:AAPL"

for which we open a new text file where all queries will be saved in real-time:

# define file name of the output record
fname="aapl.dat"
# remove a file, if exist
os.path.exists(fname) and os.remove(fname)

Eventually, we construct the final loop over trading time. Here, we fetch the last data point at 16:00:59 New York time. The key parameter in the game is the freq variable, where we specify the intraday sampling (in seconds). From my tests, using a private Internet provider, I found that the optimal sampling was 600 sec (10 min). Somehow, for shorter time intervals, Google Finance detected too-frequent queries sent from my IP address. The same test succeeded from a different IP location, therefore feel free to play with your local Internet network to find out the lowest available sampling time for your geolocation.

freq=600 # fetch data every 600 sec (10 min)
 
with open(fname,'a') as f:
    writer=csv.writer(f,dialect="excel") #,delimiter=" ")
    while(t.tm_hour<=16):
        if(t.tm_hour==16):
            while(t.tm_min<1):
                data=combine(ticker)
                print(data)
                writer.writerow(data) # save data in the file
                time.sleep(freq)
                t=time.localtime() # refresh the clock reading
            else:
                break
        else:
            data=combine(ticker)
            print(data)
            writer.writerow(data) # save data in the file
            time.sleep(freq)
            t=time.localtime() # refresh the clock reading

To see how the above code works in practice, I conducted a test on Jan 9, 2014, starting at 03:31:19 Sydney/Australia time, corresponding to 11:31:19 New York time. Setting the sampling frequency to 600 sec, I was able to fetch the data in the following form:

Thu Jan  9 03:31:19 2014
 
Wed Jan  8 11:31:19 2014
 
[2014, 1, 8, 11, 31, 19, '543.71']
[2014, 1, 8, 11, 41, 22, '543.66']
[2014, 1, 8, 11, 51, 22, '544.22']
[2014, 1, 8, 12, 1, 23, '544.80']
[2014, 1, 8, 12, 11, 24, '544.32']
[2014, 1, 8, 12, 21, 25, '544.86']
[2014, 1, 8, 12, 31, 27, '544.47']
[2014, 1, 8, 12, 41, 28, '543.76']
[2014, 1, 8, 12, 51, 29, '543.86']
[2014, 1, 8, 13, 1, 30, '544.00']
[2014, 1, 8, 13, 11, 31, 'Nothing found for: NASDAQ:AAPL']
[2014, 1, 8, 13, 21, 33, '543.32']
[2014, 1, 8, 13, 31, 34, '543.84']
[2014, 1, 8, 13, 41, 36, '544.26']
[2014, 1, 8, 13, 51, 37, '544.10']
[2014, 1, 8, 14, 1, 39, '544.30']
[2014, 1, 8, 14, 11, 40, '543.88']
[2014, 1, 8, 14, 21, 42, '544.29']
[2014, 1, 8, 14, 31, 45, '544.15']
...

As you can notice, the records were displayed on the screen in the form of Python lists. It is important to note that the time we capture and associate with the fetched asset price (query) is the computer’s system time, therefore please don’t expect the regular time intervals one may get from verified market-data providers. We are hacking in real-time! However, if you think about the data themselves, this time precision is not of great importance. As long as we fetch the data every freq seconds, that is sufficient to build a risk-management system or even to measure the rolling volatility of an asset. Your trading model will benefit anyway.
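
For instance, once a session’s records sit on disk, a rolling volatility is only a few lines of pandas. A sketch, assuming the file holds rows exactly as combine() builds them (year, month, day, hour, minute, second, ticker, price):

import pandas as pd

cols = ['Y', 'M', 'D', 'h', 'm', 's', 'Ticker', 'Price']
df = pd.read_csv('aapl.dat', names=cols)
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')  # failed queries become NaN
df = df.dropna(subset=['Price'])

r = df['Price'].pct_change()  # simple returns between consecutive samples
vol = r.rolling(12).std()     # rolling std over 12 samples (~2 hours at 10-min sampling)
print(vol.tail())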

Note also that if our Internet connection fails, or there are disturbances of a different kind, we will miss the data for a sent query, as visible in the example above.

Looks exciting? Give me High Five! and say Hell Yeah!

Code Modification: Portfolio of Assets

The presented Python code can be very easily modified if you wish to try fetching data for a couple of assets concurrently every freq seconds. Simply extend and amend the code from the ticker definition onwards, for example in the following form:

tickers=["NASDAQ:AAPL","NASDAQ:GOOG","NASDAQ:BIDU","NYSE:IBM", \
         "NASDAQ:INTC","NASDAQ:MSFT","NYSEARCA:SPY"]
 
# define the name of an output file
fname="portfolio.dat"
# remove a file, if exist
os.path.exists(fname) and os.remove(fname)
 
freq=600 # fetch data every 600 sec (10 min)
 
with open(fname,'a') as f:
    writer=csv.writer(f,dialect="excel") #,delimiter=" ")
    while(t.tm_hour<=16):
        if(t.tm_hour==16):
            while(t.tm_min<1):
                for ticker in tickers:
                    data=combine(ticker)
                    print(data)
                    writer.writerow(data)
                time.sleep(freq)
                t=time.localtime() # refresh the clock reading
            else:
                break
        else:
            for ticker in tickers:
                data=combine(ticker)
                print(data)
                writer.writerow(data)
            time.sleep(freq)
            t=time.localtime() # refresh the clock reading

That’s it! For the sake of real-time verification, here is a screenshot of how it works:

Thu Jan  9 07:01:43 2014
 
Wed Jan  8 15:01:43 2014
 
[2014, 1, 8, 15, 1, 44, 'NASDAQ:AAPL', '543.55']
[2014, 1, 8, 15, 1, 44, 'NASDAQ:GOOG', '1140.30']
[2014, 1, 8, 15, 1, 45, 'NASDAQ:BIDU', '182.65']
[2014, 1, 8, 15, 1, 45, 'NYSE:IBM', '187.97']
[2014, 1, 8, 15, 1, 46, 'NASDAQ:INTC', '25.40']
[2014, 1, 8, 15, 1, 47, 'NASDAQ:MSFT', '35.67']
[2014, 1, 8, 15, 1, 47, 'NYSEARCA:SPY', '183.43']
[2014, 1, 8, 15, 11, 48, 'NASDAQ:AAPL', '543.76']
[2014, 1, 8, 15, 11, 49, 'NASDAQ:GOOG', '1140.06']
[2014, 1, 8, 15, 11, 49, 'NASDAQ:BIDU', '182.63']
[2014, 1, 8, 15, 11, 50, 'NYSE:IBM', '187.95']
[2014, 1, 8, 15, 11, 51, 'NASDAQ:INTC', '25.34']
[2014, 1, 8, 15, 11, 52, 'NASDAQ:MSFT', '35.67']
[2014, 1, 8, 15, 11, 53, 'NYSEARCA:SPY', '183.34']
...

where we can see that we were able to grab the prices of 6 stocks and 1 ETF (an Exchange-Traded Fund tracking the S&P 500 Index) every 10 min.

Reflection

You may wonder whether hacking is legal or not. The best answer I find in the words of Gordon Gekko: Someone reminded me I once said “Greed is good”. Now, it seems, it’s legal.

PART II
    Hacking Google Finance in Real-Time for Algo Traders. (2) Pre-Market Trading.

Retrieve the Data of Fund Performance utilizing Google and Python

Do you remember my post on Get the Data of Fund Performance directly into Excel utilizing VBA and Google? If not, have a look, as this time we will do the same but in Python. In short, given a list of APIR codes (each referring to a different investment option’s performance), we want to fetch the publicly available data from the InvestSMART.com.au website. Previously we built the code in Excel/VBA. It works, but it is slow, as the process makes use of Excel’s plug-ins for establishing connections with external websites and downloading the data into Excel workbooks. We can improve it significantly using the Python language.

The second reason for doing it in Python is purely educational. We will see below how we can read data from a CSV file, send a query from Python directly to Google Search, retrieve a list of responses, get the URL of the page we are interested in, download it as HTML code, parse it to get the content of the webpage without HTML ornaments, scan the content for the required information (in our case: the most recent fund performance figures available online), and finally save it all to a file.

This is our roadmap for today. Are you ready for the journey?

In Search of Sunrise

To get the job done, we will need to utilize some existing Python modules. The most important one is the google module, available for download here. We will also need the mechanize module, a library for stateful programmatic web browsing in Python. If you need to install BeautifulSoup, here is all you are looking for. Okay then, let’s roll:

# Retrieve the Data of Fund Performance utilizing Google and Python
#  
# (c) 2014 QuantAtRisk.com, by Pawel Lachowicz
 
import mechanize, csv, os, unicodedata
from google import search as gs
from bs4 import BeautifulSoup
 
# remove file, if exists
os.path.exists("investsmart.csv") and os.remove("investsmart.csv")

where the last two lines refer to the comma-separated-values file in which we will save all the fetched data. It’s good practice to remove it if it already exists in the local directory. As you will see in a moment, we do this at the beginning because whatever we retrieve from the web and filter for content will be appended to the investsmart.csv output file in a loop.

We feed our algo with a list of APIR codes stored in a text file, codes.csv, of the exemplary structure:

AMP1232AU
AMP1524AU
AMP0167AU
AMP1196AU
...

In order to read each line from this file, we commence with:

with open('codes.csv', 'rb') as f:
    reader = csv.reader(f) # reads the lines as a list
    for irow in reader:
        apircode=irow[0]
        print(apircode)
 
        # define query for Google Search
        phrase="investsmart com au "+apircode

where the for loop iterates through the file line-by-line, reading in the APIR code as the first element of the irow list. Having the code in the apircode variable, we define a query for the search engine and open an inner operational loop, here referring to the output file:

        with open('investsmart.csv', 'a') as csvfile:
            writer=csv.writer(csvfile, dialect="excel")
 
            i=0
            # Search in Google for InvestSMART APIR code
            for url in gs(phrase, stop=1):
                i+=1
                if(i==1):
                    link=url # store the first URL from Google results
                    print("%s\n" % link)

The whole action starts where we execute the procedure of sending the query to Google (the gs command). As you look at this creature, it is, in fact, a masterpiece: beauty and Python’s elegance all in one. Not only do we hit the search engine with our phrase, but we also filter the incoming data from Google Search as a list of URLs and extract the first URL found by Google. Next, we browse through that first link,

            br=mechanize.Browser()  # open a virtual browser
            resp=br.open(link)      
            content=resp.get_data()
            #print content

and download the HTML code of the website. To get a better feeling for what the content variable holds for the first APIR code from our codes.csv input file, just uncomment and execute the print line.

For AMP1232AU investment option, the content displays the website HTML code starting from:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
 
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]--> 
<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8" lang="en"> <![endif]--> 
<!--[if IE 8]>    <html class="no-js lt-ie9" lang="en"> <![endif]--> 
<!--[if gt IE 8]><!--> <html class="no-js" lang="en"> <!--<![endif]--> 
 
<html> 
<head> 
<title>AMP SigSup - AMP Super Cash - Managed fund Profile - Managed funds - InvestSMART - managed funds, shares and investment news</title> 
 
<meta http-equiv="X-UA-Compatible" content="IE=9" /> 
<meta name="verify-v1" content="xgkff+3TBcugNz7JE2NiJoqkiVs1PHybWgFkaBuhblI=" /> 
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" /> 
<meta name="title" content="Discount access and research on more than 1,000 top performing managed funds" /> 
<meta name="description" content="Rebate of entry fees on the majority of Australian managed funds. Research on managed funds and shares." /> 
 
...

and containing (somewhere in the middle) the information we would like to extract:

...
 
<!-- ****************************** --> 
<!-- ***   Performance Table    *** --> 
<!-- ****************************** --> 
<br /><br /> 
<table class="Container" width="100%" cellpadding="0" cellspacing="0" border="0"> 
  <tr> 
    <td class="ContainerHeader" align="left"><b>Fund Performance</b> (as at 30th Nov 2013)&nbsp;<a href="javascript:PopupWindow('/education/GlossaryPopup.asp?id=133', '_blank', 700, 450)" class="glossaryLnk" title="More information...">&nbsp;</a></td> 
    <td class="ContainerHeader" align="right">NOTE : returns for periods greater than 1 year are annualised</td> 
  </tr> 
  <tr> 
    <td colspan="2" width="100%" align="left" valign="top" class="ContainerBody"> 
      <table width="100%" cellpadding="0" cellspacing="0" border="0" class="DataTable"> 
        <tr class="DataTableHeader"> 
          <td align="left"  valign="top" nowrap>&nbsp;</td> 
          <td align="right" valign="top" nowrap>1 Month<br />%</td> 
          <td align="right" valign="top" nowrap>3 Month<br />%</td> 
          <td align="right" valign="top" nowrap>6 Month<br />%</td> 
          <td align="right" valign="top" nowrap>1 Year<br />% p.a.</td> 
          <td align="right" valign="top" nowrap>2 Year<br />% p.a.</td> 
          <td align="right" valign="top" nowrap>3 Year<br />% p.a.</td> 
          <td align="right" valign="top" nowrap>5 Year<br />% p.a.</td> 
          <!--<td align="right" valign="top" nowrap>7 Year<br />% p.a.</td>--> 
          <td align="right" valign="top" nowrap>10 Year<br />% p.a.</td> 
        </tr> 
 
 
        <tr class="DataTableRow" onMouseOver="this.className = 'DataTableRowHighlight';" onMouseOut ="this.className = 'DataTableRow';"> 
          <td align="left"  valign="top"><b>Total Return</b></td> 
          <td align="right" valign="top"><b>0.12</b></td> 
          <td align="right" valign="top"><b>0.37</b></td> 
          <td align="right" valign="top"><b>0.8</b></td> 
          <td align="right" valign="top"><b>1.77</b></td> 
          <td align="right" valign="top"><b>2.24</b></td> 
          <td align="right" valign="top"><b>2.62</b></td> 
          <td align="right" valign="top"><b>-</b></td> 
          <!--<td align="right" valign="top"><b>-</b></td>--> 
          <td align="right" valign="top"><b>-</b></td> 
        </tr>
 
...

As we can see, the HTML code is pretty nasty, and digging the numbers out of the “Total Return” subsection seems troublesome. Well, not for Python. Just have a look at how easily you can strip the HTML of all those Christmas-tree ornaments:

            # transform code into beautiful soup ;-)
            soup = BeautifulSoup(content)
 
            # perform post-processing for stripped data
            outputline=[]
            row=0
            roww=0
            for string in soup.stripped_strings:
                if(string=="Fund Performance"):
                    print(string)
                    j=1
                    row=1
                if(row!=0): 
                    row+=1
                if(row==3):
                    updated=string
                    print(string)
                if(string=="Total Return") and (j==1):
                    print(string)
                    roww=1
                    j=0
                if(roww!=0):
                    roww+=1
                if(roww>=3) and (roww<11):
                    if(roww==3):
                        print(string)
                        s=string.lstrip('(').rstrip(')').lstrip('as at ')
                    else:
                        s=string
                    outputline.append(s)
                    print(string)

Here we make use of BeautifulSoup, a Python library designed for quick-turnaround projects like screen-scraping. We can scan the transformed data line-by-line in a loop where the .stripped_strings generator makes all the HTML bumpers and stickers vanish, leaving us with pure text! From that point, the remaining lines of Python code are designed to extract specific information, print it on the screen, and append it to the outputline list. For the AMP1232AU investment option from our query, Python displays:

AMP1232AU
http://www.investsmart.com.au/managed-funds/profile.asp?function=print&FundID=16654
 
Fund Performance
(as at 30th Nov 2013)
Total Return
0.12
0.12
0.37
0.8
1.77
2.24
2.62
-
-
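
If .stripped_strings is new to you, a toy example shows what it does; the HTML snippet here is made up purely for illustration:

from bs4 import BeautifulSoup

html = "<table><tr><td><b>Total Return</b></td><td>0.12</td><td>0.37</td></tr></table>"
soup = BeautifulSoup(html)
for s in soup.stripped_strings:  # yields text nodes with tags and whitespace stripped
    print(s)
# Total Return
# 0.12
# 0.37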

If Google returns, as the first link, something completely different from the InvestSMART webpage for the requested APIR code,

            # check validity of desired data for APIR code
            if(len(outputline)>0):
                rowout=[]
                outputline.insert(0,apircode) # add the APIR code at the front
                outputline.append(updated)    # append the 'as at' date
                for item in outputline:
                    rowout.append(item.encode('ascii')) # convert <type:unicode>
                                                        # to <type:string>
                print(rowout) # show final version of the list on the screen
                if(len(rowout)>0):
                    writer.writerow(rowout) # if non-zero, save/append the list
                                            # with data in "investsmart.csv" 
                                            # output file, otherwise ignore
                                            # this query's results
            csvfile.close()
 
f.close()

the outputline list should be of zero length. With a few extra lines of post-processing, the output file stores the subject of our small endeavour, i.e.:

pawels-mbp:Py pawel$ cat investsmart.csv 
AMP1232AU,0.12,0.37,0.8,1.77,2.24,2.62,-,-,(as at 30th Nov 2013)
AMP1524AU,-2.47,0.14,-1.58,8.63,14.09,9.42,-,-,(as at 30th Nov 2013)
AMP0167AU,0.39,1.96,3.18,8.45,7.72,5.82,5.09,4.89,(as at 30th Nov 2013)
...

which ends our journey In Search of Sunrise. The total time for scanning 1000 APIR codes is cut down by 60% by making use of Python instead of VBA. That brings to my mind an Air Austral commercial I saw a year ago: Arrive fast, then slow down.

Pre-Processing of Asset Price Series for Portfolio Optimization

Portfolio Optimization is a significant component of Matlab’s Financial Toolbox. It provides us with a ready-to-use solution for finding the optimal weights of the assets we consider for trading, deriving them from historical asset performance. From a practical point of view, we can include it in our algorithmic trading strategy and backtest its applicability under different initial conditions. That is the subject of my upcoming post. However, before we can enjoy the view from the peak, we need to climb the mountain first.

In Matlab, the portfolio is created as a dedicated object of the same name. It doesn’t read raw stock data; we need to feed that beast. Two major ingredients satisfy the input: a vector of the expected asset returns and a covariance matrix. Matlab helps us estimate these moments, but first we need to deliver the asset data in a digestible form.
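
For intuition, both ingredients are one-liners once the return matrix exists. A Python sketch of the same two moments Matlab estimates, run here on simulated returns purely for illustration:

import numpy as np

np.random.seed(7)
R = np.random.normal(0.0005, 0.01, size=(250, 10))  # simulated daily returns: 250 days, 10 assets
mu = R.mean(axis=0)              # vector of expected (mean) asset returns
Sigma = np.cov(R, rowvar=False)  # 10x10 covariance matrix of asset returns
print(mu.shape, Sigma.shape)     # (10,) (10, 10)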

In this post we will see how one can quickly download the stock data from the Internet based on our own stock selection and pre-process them for solving portfolio optimization problem in Matlab.

Initial Setup for Portfolio Object

Let’s say that at any point in time you have your own list of stocks you wish to buy. For simplicity, let’s also assume that the list contains stocks traded on NYSE or NASDAQ. Since you have been a great fan of this game, you are now almost ready to buy what you jotted down in your ShoppingList.lst. Here is an example of 10 tech stocks:

AAPL   AOL   BIDU   GOOG   HPQ   IBM   INTC   MSFT   NVDA   TXN

They will constitute your portfolio of stocks. The problem of portfolio optimization requires a look back in time at the space of returns each stock obtained in trading. Based on them, the Return Proxy and Risk Proxy can be found.

The return matrix $R$ of dimensions $(N-1)\times M$, where $N$ stands for the number of historical prices (e.g. daily, monthly, etc.) and $M$ for the number of stocks in our portfolio, is required by Matlab as an input. We will see how it works in the next post. For now, let’s focus solely on the creation of this matrix.
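
In plain terms, if $P$ is the $N\times M$ matrix of close prices, then $R_{i,j} = P_{i+1,j}/P_{i,j} - 1$. A tiny Python sketch:

import numpy as np

P = np.array([[100.0, 50.0],
              [101.0, 49.0],
              [102.0, 49.5]])  # N=3 close prices for M=2 stocks
R = P[1:] / P[:-1] - 1.0       # (N-1) x M matrix of simple returns
print(R)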

In the article Create a Portfolio of Stocks based on Google Finance Data fed by Quandl I discussed Quandl.com as an attractive data provider for US stocks. Here, we will follow this solution making use of Quandl resources to pull out the stock price series for our shopping list. Ultimately, we aim at building a function, here: QuandlForPortfolio, that does the job for us:

% Pre-Processing of Asset Price Series for Portfolio Optimization in Matlab
%  (c) 2013, QuantAtRisk.com, by Pawel Lachowicz
 
clear all; close all; clc;
 
% Input Parameters
n=1*365;
tickers='ShoppingList.lst';
qcodes='QuandlStockCodeListUS.xlsx';
 
[X,Y,R,AssetList] = QuandlForPortfolio(n,tickers,qcodes);

We call this function with three input parameters. The first one, $n$, denotes the number of calendar days from today (counting backwards) for which we wish to retrieve the stock data. Usually, 365 days will correspond to about 250$-$252 trading days. The second parameter is the path/file name of our list of stocks (those to be taken into account in the portfolio-optimisation process), while the last input defines the path/file name of the file storing the stocks’ tickers and associated Quandl Price Codes (see here for more details).

Feeding the Beast

The QuandlForPortfolio Matlab function is an extended version of the previously discussed solution. It contains an important correcting procedure for the data fetched from the Quandl servers. First, let’s have a closer look at the function itself:

% Function assists in fetching Google Finance data from the Quandl.com
%  server for a given list of tickers of stocks traded on NYSE or
%  NASDAQ. Data are retrieved for last 'n' days with daily sampling.
%
% INPUT
%   n       : number of calendar days from 'today' (e.g. 365 would
%             correspond to about 252 business days)
%   tickers : a path/file name of a text file listing tickers
%   qcodes  : a path/file name of Excel workbook (.xlsx) containing a list
%              of tickers and Quandl Price Codes in the format of
%              [Ticker,Stock Name,Price Code,Ratios Code,In Market?]
% OUTPUT
%   X0        : [Nx1] column vector with days
%   Y0        : [NxM] matrix with Close Prices for M stocks
%   R0        : [(N-1)xM] matrix of Returns
%   AssetList : a list of tickers (cell array)
%
% (c) 2013, QuantAtRisk.com, by Pawel Lachowicz
 
function [X0,Y0,R0,AssetList0] = QuandlForPortfolio(n,tickers,qcodes)
    fileID = fopen(tickers);
    tmp = textscan(fileID, '%s');
    fclose(fileID);
    AssetList=tmp{1};  % a list as a cell array
 
    % Read in the list of tickers and internal Quandl codes
    %
    [~,text,~] = xlsread(qcodes);
    quandlc=text(:,1); % again, as a list in a cell array
    quandlcode=text(:,3); % corresponding Quandl's Price Code
 
    date1=datestr(today-n,'yyyy-mm-dd'); % from
    date2=datestr(today,'yyyy-mm-dd');   % to
 
    % Fetch the data from Quandl.com
    %
    QData={};
    for i=1:length(AssetList)
        for j=1:length(quandlc)
            if(strcmp(AssetList{i},quandlc{j}))
                fprintf('%4.0f %s\n',i,quandlc{j});
                fts=0;
                [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ...
                   'authcode','x',...
                   'start_date',date1,'end_date',date2,'collapse','daily');
                QData{i}=fts;
            end
        end
    end
 
    % Post-Processing of Fetched Data
    %
    % create a list of days across all tickers
    TMP=[];
    for i=1:length(QData)
        tmp=fts2mat(QData{i},1);
        tmp=tmp(:,1);
        TMP=[TMP; tmp];
    end
    ut=unique(TMP);
    % use that list to find these days that are not present
    %  among all data sets
    TMP=[];
    for i=1:length(QData)
        tmp=fts2mat(QData{i},1);
        tmp=tmp(:,1);
        TMP=[TMP; setdiff(ut,tmp)];
    end
    ut=unique(TMP);
    % finally, extract Close Prices from FTS object and store them
    %  in Y0 matrix, plus corresponding days in X0
    X0=[];
    Y0=[]; 
    for i=1:length(QData)
        tmp=fts2mat(QData{i},1);
        cp=[];
        for j=1:size(tmp,1)
            [r,~,~]=find(ut==tmp(j,1));
            if(isempty(r))
                cp=[cp; tmp(j,5)]; % column 5 corresponds to Close Price
                if(i<2)
                    % create a time column vector listing days
                    % common among all data sets
                    X0=[X0; tmp(j,1)];
                end
            end
        end
        Y0=[Y0 cp];
    end
    % transform Close Prices into Returns, R(i)=cp(i)/cp(i-1)-1
    R0=tick2ret(Y0);
    AssetList0=AssetList';
end

The main bottleneck comes from the fact that Matlab’s portfolio object demands an equal number of historical returns ($N-1$) in the matrix $R$ for all $M$ assets. We design the function so that it sets a common timeframe for all stocks on our shopping list. Of course, we assume that all stocks were traded in the markets for roughly the last $n$ days.

Now, the timeframe of the last $n$ days should be understood as a first approximation. We fetch the data from Quandl (numeric date, Open, High, Low, Close, Volume) and save them, for each stock separately, in the cell array QData as FTS objects (Financial Time-Series objects; see the Financial Toolbox). However, it may happen that not every stock fetched provides the same amount of data. That is why we need to investigate for which days and for which stocks we are missing data. We achieve that by scanning each FTS object and creating a unique list of all days for which we have data.

Next, we loop again over the same data sets, but now we compare that list with the list of all dates for each stock individually, capturing those dates that are missing; their complete list is stored as a vector. Eventually, given that, we are able to compile the full data set (e.g. Close Prices) for all stocks in our portfolio, ensuring that we include only those dates for which we have prices across all $M$ assets.
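
If it helps to see the logic, the same alignment is a one-line inner join in pandas: keep only the dates present in every per-stock series. A sketch with toy data, not a translation of the Matlab code:

import pandas as pd

d1 = pd.Series([1.0, 2.0, 3.0],
               index=pd.to_datetime(['2013-01-01', '2013-01-02', '2013-01-03']))
d2 = pd.Series([9.0, 8.0],
               index=pd.to_datetime(['2013-01-02', '2013-01-03']))

Y = pd.concat([d1, d2], axis=1, join='inner')  # rows = dates common to all series
print(Y)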

Beast Unleashed

We test our data pre-processing simply by running the block of code listed above, engaging the QuandlForPortfolio function, and we check the results in Matlab’s command window as follows:

>> whos X Y R AssetList
  Name             Size            Bytes  Class     Attributes
 
  AssetList        1x10             1192  cell                
  R              250x10            20000  double              
  X              251x1              2008  double              
  Y              251x10            20080  double

which confirms that the dimensions are as expected.

At this stage, the aforementioned function can be used in two ways. First, we are interested in the portfolio optimisation and we look back at the last $n$ calendar days counting from the most current one (today). The second usage is handy too. We take the stocks on our shopping list and fetch their last, say, $n = 7\times365$ days of data. If all stocks were traded over the past 7 years, we should be able to collect a rich data set. If not, the function will adjust the beginning and end dates to meet the initial time constraints required for the construction of the $R$ matrix. In the former case, we can use the 7-year data sample for direct backtesting of algo models utilizing Portfolio Optimization.

Stay tuned as we will rock this land in the next post!

Any Questions?

Share them across QuantCove.com – the official Forum of QuantAtRisk.

Create a Portfolio of Stocks based on Google Finance Data fed by Quandl

There is an old saying: Be careful what you wish for. Many quants and traders wished for a long time for better-quality publicly available data. It wasn’t an easy task to accomplish, but with a minimum of effort it has become possible to get all you truly dreamt of. When I started my adventure with daily backtesting of my algo strategies, I initially relied on Yahoo! data. Yeah, easily accessible but of low quality. Since the inception of Google Finance, the same dream has been reborn: more precise data, covering more aspects of trading, platforms, markets, instruments. But how to get it all?


Quandl.com provides us with access to an enormous database. Just visit the webpage to find out what I mean, as I can barely put my thoughts together in one sentence. In fact, the portal’s accessibility via different programming languages (e.g. Matlab, C++, Python, etc.) makes that database more alive than ever. The space of parameters quants get access to becomes astronomical in number!

Hallelujah! Let’s ride this horse. Enough of watching online charts change the direction of our investing moods with one-minute time resolution. We were born to be smarter than the stochastic processes taking control of our daily strategies. In this short post, making use of the Matlab R2013a platform, I want to show you how quickly you can construct any portfolio of US stocks taken from the Dow Jones Index universe (as an example). We will use Google Finance stock data fed by Quandl.com.

Google Finance Stock Data for Model Feeding

Similar to Yahoo! Finance’s resources, Quandl.com has its own system of coding the names of all US stocks. Therefore, we need to know it before taking any further step. The .csv file containing all the codes for Quandl’s stock data can be obtained via the API for Stock Data webpage. The file contains a header of the form:

Ticker      Stock Name      Price Code      Ratios Code      In Market?

linking the Google Finance ticker with Quandl’s code (Price Code). Doing some extra work on data filtering, I created a pre-processed set of all tickers, excluding those not active in the markets and removing all invalid data. You can download it here as an Excel workbook: QuandlStockCodeListUS.xlsx.

The second list we need is a list of the potential stocks we want to deal with in our portfolio. As of today, you can download the current list of the Dow Jones Index’s components here: dowjones.lst. Let’s start with it:

% Create a Portfolio from Dow Jones Stocks based on Google Finance
%  Data fed by Quandl.com
%
% (c) 2013 QuantAtRisk.com, by Pawel Lachowicz
 
 
close all; clear all; clc;
 
% Read the list of Dow Jones components
fileID = fopen('dowjones.lst');
tmp = textscan(fileID, '%s');
fclose(fileID);
dowjc=tmp{1};  % a list as a cell array

The guys from Mathworks suggest making it a good practice to use the textscan command (lines 10-12) instead of dataread for reading data from text files. The latter will be removed in the next Matlab release.

Next, we need to import the current list of tickers for US stocks and their corresponding codes in Quandl’s database. We will make use of my pre-processed data set, QuandlStockCodeListUS.xlsx, mentioned earlier, in the following way:

% Read in the list of tickers and internal code from Quandl.com
[ndata, text, alldata] = xlsread('QuandlStockCodeListUS.xlsx');
quandlc=text(:,1); % again, as a list in a cell array
quandlcode=text(:,3); % corresponding Quandl Price Code (used in Quandl.get below)

For the simplicity of our example, we will consider all 30 stocks as the complete sample set of stocks in our portfolio. Portfolio construction is not solely a selection of stocks; it is usually tied to a time period of past performance for each individual stock. In the following move, we will collect daily stock returns over the last 365 calendar days (corresponding to about 252 trading days).

This framework is pretty handy. Why? If you build or run a live model, most probably you wish to update your database when the US markets are closed. You fetch Quandl.com for data. Now, if your model employs risk management or portfolio optimisation as an integrated building block, it’s very likely you are interested in risk(-return) estimation, e.g. by calculating Value-at-Risk (VaR) or similar risk measures. Of course, the choice of time-frame is up to you, based on your strategy.

Having that said,

% fetch stock data for the last 365 days
date2=datestr(today,'yyyy-mm-dd')     % to
date1=datestr(today-365,'yyyy-mm-dd') % from

which employs Matlab’s improved datestr command with the format as a parameter. It’s really useful when it comes to fetching data from Quandl.com, as the date format they like to see in your code is yyyy-mm-dd. The above two commands return the following values in Matlab:

date2 =
  2013-11-03
date1 =
  2012-11-03

as of the time of writing this article (string type).

Get an Access to Quandl’s Resources


Excellent! Let’s catch some fish in the pond! Before doing so, we need to make sure that Quandl’s software is installed and linked to our Matlab console. To download the Quandl Matlab package, just visit this website and follow the installation procedure at the top of the page. Add the proper path in the Matlab environment too. Best practice is to register on Quandl; that allows you to raise the daily number of data calls from 50 to 500. If you think you will need more data requests per day, you can send a request from the same webpage.

When registered, in your Profile section on Quandl, you will gain access to your individual Authorization Token, for example:

[Screenshot: your individual Quandl Authorization Token]

You will need to replace ‘yourauthenticationtoken’ in the Matlab code (see line 35) with your token in order to get the data.

Fetching Stock Data

From this point on, the road stops winding. For all 30 stocks belonging to the current composition of the Dow Jones Index, we get their Open, High, Low, and Close prices (as retrieved from Google Finance) in the following way:

% create an empty array for storing stock data
stockd={};
% scan the Dow Jones tickers and fetch the data from Quandl.com
for i=1:length(dowjc)
    for j=1:length(quandlc)
        if(strcmp(dowjc{i},quandlc{j}))
            fprintf('%4.0f %s\n',i,quandlc{j});
            % fetch the data of the stock from Quandl
            % using the recommended Quandl command and
            % saving them directly into Matlab's FTS object (fts)
            fts=0;
            [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ...
                          'authcode','yourauthenticationtoken',...
                          'start_date',date1,'end_date',date2);
            stockd{i}=fts; % entire FTS object in an array's cell
        end
    end
end
% use 'save' command to save all actual variables from memory on the disc,
%  if you need it for multiple code running
% save data

The code in lines 35 to 37 is an example of how you call Quandl for data and fetch it into a FINTS object (the financial time-series object from the Financial Toolbox in Matlab). So easy. If you wish to change the time frequency, apply a transformation, or get the data in .csv format instead of FINTS, everything is available as a parameter when calling the Quandl.get() command (explore all the possibilities here).

Having that, you can easily plot the Google Finance data, for instance, of IBM stock straight away:

% display a High-Low chart for IBM
highlow(stockd{11})
title('IBM');
ylabel('Stock Price ($)');

to get:

[Figure: High-Low chart of IBM]

or prepare a return-risk plot for your further portfolio optimisation:

ret=[];
for i=1:length(stockd)
    % extract the Close Price
    tmp=fts2mat(stockd{i}.Close,0);
    % calculate a vector of daily returns
    r=tmp(2:end)./tmp(1:end-1)-1;
    ret=[ret r];
end
% calculate the annualized mean returns for the 30 stocks
% (the mean scales with the number of periods, not its square root)
mret=mean(ret)*250;
% and corresponding standard deviations
msig=std(ret)*sqrt(250);
 
% plot return-risk diagram
p=100*[msig; mret]'
plot(p(:,1),p(:,2),'b.');
xlim([0 30]);
xlabel('Annualized Risk of Stock (%)')
ylabel('Annualized Mean Return (%)');

providing us with:

[Figure: annualized mean return vs. annualized risk for the 30 Dow Jones stocks]

Forum: Further Discussions

You can discuss the current post with other quants within a new Quant At Risk Forum (Data Handling and Pre-Processing section).

Rebinning of Financial Time-Series

Working with financial time-series, especially in trading and the subsequent analysis of data trends, we often wish to rebin (resample) our data into a new time-series which provides information on the average value of the underlying data records, otherwise characterized by high volatility in time.

The following algorithm in Matlab does the job pretty well:

% Function rebins any time-series [x,y] where x is a time vector
%   and y stores time-series values from the current (regular or
%   irregular sampling). The new time-series sampling is dt.
%
%  Input: x  : time vector [1xN]
%         y  : data vector [1xN]
%         dy : data error vector [1xN]
%         dt : new rebinning time (dt > current sampling)
%
% (c) 2013, QuantAtRisk.com, by Pawel Lachowicz
%
% Example: x=1:1:1001; % assume 1001 trading days
%          y=rand(1,1001); % simulate pseudo data
%          dy=0.1*rand(1,1001); % simulate pseudo data errors (if required)
%          dt=50; % rebin data from 1 day sampling down to 50 day intervals
%          [rx,ry]=rebin(x,y,dy,dt);
%      or  [rx,ry]=rebin(x,y,[],dt);
%          plot(x,y,'color',[0.7 0.7 0.7]); % plot original time-series
%          hold on; plot(rx,ry,'ro-'); % plot the down-rebinned time-series
 
function [rx,ry]=rebin(x,y,dy,dt)
    if(isempty(dy))
        dy=ones(1,length(y));
    end
    rdata=[];
    j=1;
    k=1;
    t2=x(1);
    while(j<=length(x))
        i=j;
        if((x(i)+dt)>x(end))
            break
        else
            t2=t2+dt;
        end
        i=j;
        sa=0;
        wa=0;
        ea=0;
        il=0;
        while(x(i)<t2)
            sa=sa+(y(i)/dy(i)^2);
            wa=wa+(1/dy(i)^2);
            i=i+1;
            il=il+1;
        end
        ry=sa/wa;
        rx=t2-dt;
        if(il>=1)
            rdata=[rdata; rx ry];
        end
        j=j+il;
    end
    rx=rdata(:,1);
    ry=rdata(:,2);
end

Example

As an example, we will use the daily rate of return of AAPL (Apple Inc.) trading from Jan 1984 to Apr 2011 (download Matlab’s aaplr.mat file). Below, a short set of command lines allows us to execute the rebinning process on the return series,

clear all; close all; clc;
 
load aaplr.mat
plot(aaplR)
 
x=1:length(aaplR); % 6855 days
y=aaplR;
plot(x,y,'color',[0.7 0.7 0.7]);
 
% rebin the data with dt=25 days step (one trading month)
dt=25;
dy=[]; % we have no information on data errors
 
[rx,ry]=rebin(x,y,dy,dt);
 
% overplot results and display first 500 days
hold on; plot(rx,ry,'ro-');
xlim([0 500]);
ylim([-0.125 0.15]);

and plot both time-series, original and rebinned with a new bin time of 25 days for the first 500 days of trading:

[Figure: original (grey) and 25-day rebinned (red) AAPL return series over the first 500 days]

where the red line denotes the new rebinned time-series with a bin time of 25 trading days. The function computes the error-weighted mean over all data points falling into the interval $\langle t,t+dt \rangle$, i.e. $\bar{y} = \left(\sum_i y_i/\sigma_i^2\right) / \left(\sum_i 1/\sigma_i^2\right)$, where $\sigma_i$ denotes the error $dy_i$ of $y_i$. If we do not specify the input data error vector $dy$, as in the example above, all weights are equal and we get a simple mean as a consequence.
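As a quick sanity check of the weighting, note that calling rebin with an empty error vector must give exactly the same result as calling it with unit errors; a minimal sketch on synthetic data:

% equal weights reduce the weighted mean to a simple mean
x=1:100; y=rand(1,100);
[rx1,ry1]=rebin(x,y,[],10);           % empty dy -> unit weights inside rebin
[rx2,ry2]=rebin(x,y,ones(1,100),10);  % explicit unit errors
max(abs(ry1-ry2))                     % should return 0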

Accessing a SQL Server database in Matlab and the use of FTS objects


Feeding our models with data for backtesting purposes requires pulling data from a server. If we do our modelling in Matlab under Linux and an MS SQL Server database is our target, the quickest way to establish good connectivity is to make use of the Java Database Connectivity (JDBC) driver. Firstly, visit the following web page to download the current driver (.tar.gz), and next follow the installation steps provided by Mathworks.
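Before the database() call will work, the driver's .jar file must be visible on Matlab's Java class path. One session-scoped way to achieve that is javaaddpath; the .jar name and location below are assumptions, so adjust them to your installation:

% add the JDBC driver to the dynamic Java class path (hypothetical path)
javaaddpath('/usr/local/jdbc/sqljdbc4.jar');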

In the following example we will try to connect to our database called 'XDataBase' and retrieve the stock prices of Apple Inc. (NASDAQ ticker: AAPL) starting from Feb 2, 1998.

ticker=['AAPL'];
start=['02-Feb-1998'];
 
% --Connect to your local SQL Server and fetch the data for a ticker
%
% define connection
conn=database('XDataBase','matlab','m4tl4b','com.microsoft.sqlserver.jdbc.SQLServerDriver','jdbc:sqlserver://163.118.2.7:1533;database=XDataBase');
 
% specify return format
setdbprefs('DataReturnFormat','structure');
 
% define your SQL query, as for example:
QueryString = ['SELECT sec.tic, dp.datadate, dp.prchd As PriceH, dp.prcld As PriceL, dp.prccd As PriceC, dp.prccd/dp.ajexdi As PriceCA, dp.cshtrd*dp.ajexdi As Vol FROM  Xpressfeed.dbo.security AS sec INNER JOIN Xpressfeed.dbo.sec_dprc AS dp ON sec.gvkey = dp.gvkey AND sec.iid = dp.iid AND dp.datadate >=''',start,''' WHERE (sec.tic IN (''',ticker,'''))'];
 
% execute connection, fetch data, and assign them to a Matlab variable
cur = exec(conn, QueryString);
myData = fetch(cur);
secData = myData.data;

In the above example, our structure variable secData is composed of the secData.PriceH, secData.PriceL, secData.PriceC, secData.PriceCA, and secData.Vol arrays, containing AAPL's High, Low, Close, and Adjusted Close prices and the Volume traded, respectively.

Making use of Matlab's Financial Toolbox, one can store all these data in a useful text-like file format. We do it by constructing a Financial Time-Series (FTS) object:

% verify the existence of data for a ticker
skip=false;
try
    n=length(secData.PriceC);  % a simple existence test
catch
    skip=true;
end
if(~skip)
    dates=secData.datadate;
    dates=datenum(dates);
    data=[zeros(n,1) secData.PriceH secData.PriceL secData.PriceC secData.PriceCA secData.Vol];
    colheads={'DATES';'OPEN';'HIGH';'LOW';'CLOSE';'CLOSEA';'VOLUME'};
    desc=ticker;
    filename=[ticker,'.dat'];
    % save AAPL's extracted Price and Volume data into AAPL.dat file
    fts2ascii(filename,dates,data,colheads,desc);
 
    % construct FTS object by reading the data from the file
    fts=ascii2fts(filename,1,2);
    data=fts2mat(fts,1);
    % and plot the Adjusted Close Price
    scrsz = get(0,'ScreenSize'); 
    hh=figure('Position',[70 scrsz(4)/2 scrsz(3)/1.1 scrsz(4)/2],'Toolbar','none');
    plot(data(:,1),data(:,6));  % column 6 holds CLOSEA, the adjusted close
    xlim([min(data(:,1)) max(data(:,1))]); 
    title(ticker);
    xlabel('Time [days]'); ylabel('Adjusted Close Price [US$]');
end

Executing the code from the dates extraction down to the fts2ascii call retrieves the AAPL stock data and saves them into the AAPL.dat file. The first ten lines of the file are as follows:

$ head -10 AAPL.dat
AAPL
DATES 	OPEN 	HIGH 	LOW 	CLOSE 	CLOSEA 	VOLUME
02-Feb-1998   0.000000   18.500000   17.375000   17.688000   4.422000   22740000
03-Feb-1998   0.000000   18.625000   17.688000   18.313000   4.578250   14380000
04-Feb-1998   0.000000   18.500000   18.000000   18.250000   4.562500   6078400
05-Feb-1998   0.000000   18.500000   18.000000   18.313000   4.578250   8508000
06-Feb-1998   0.000000   18.688000   18.250000   18.500000   4.625000   7228000
09-Feb-1998   0.000000   19.500000   18.375000   19.188000   4.797000   17668000
10-Feb-1998   0.000000   19.563000   19.063000   19.438000   4.859500   15072000
11-Feb-1998   0.000000   19.500000   18.875000   19.000000   4.750000   7560000

The ascii2fts call constructs the FTS object:

>> whos fts
  Name         Size             Bytes  Class    Attributes
 
  fts       3366x6             190016  fints

whereas the fts2mat call transforms the object's data into a matrix format (with serial dates in column 1) that we may use to plot AAPL's Adjusted Close Prices.

It is worth recalling that the Financial Toolbox allows us to express a date (e.g. 02-Feb-1998) as a continuous number:

>> datenum('02-Feb-1998')
ans =
      729788

or conversely

>> datestr(ans)
ans =
02-Feb-1998
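Note that datenum also accepts an explicit format string, which comes in handy for full timestamps; a minimal sketch (the timestamp below is arbitrary):

% the fractional part of a date number encodes the time of day
t = datenum('1998-02-02 09:30:00','yyyy-mm-dd HH:MM:SS');  % = 729788 + 9.5/24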

Extracting Time-Series from Tick-Data


Tick-data provided in the .csv (comma-separated values) file format may sometimes be a real problem to handle efficiently, especially when the total size starts to be counted in hundreds of GB. If your goal is to extract a time-series with, say, only an hourly time resolution, this article will provide you with some constructive guidelines on how to do it smartly in the Linux environment.

First, let's have a closer look at the data. Say we have a collection of 2148 .csv files hosting the FX trading history of the AUDUSD pair, covering nearly 10 years between 2000 and 2010. Each file is 7.1 MB in size, which leaves us with approximately 15 GB of data to process. Looking into a randomly selected file, we can identify the header and the data themselves:

$ head -10 audusd_216.csv 
Ticks,TimeStamp,Bid Price,Bid Size,Ask Price,Ask Size
632258349000000015,2004-07-19 11:55:00.000,0.7329,1000000,0.7334,1000000
632258349000000016,2004-07-19 11:55:00.000,0.7329,1000000,0.7334,1000000
632258349000000017,2004-07-19 11:55:00.000,0.7329,1000000,0.7333,1000000
632258349000000018,2004-07-19 11:55:00.000,0.7327,1000000,0.7333,1000000
632258349000000019,2004-07-19 11:55:00.000,0.7327,1000000,0.7333,1000000
632258349000000020,2004-07-19 11:55:00.000,0.7328,1000000,0.7333,1000000
632258349000000021,2004-07-19 11:55:00.000,0.7328,1000000,0.7334,1000000
632258349600000000,2004-07-19 11:56:00.000,0.7328,1000000,0.7334,1000000
632258349600000001,2004-07-19 11:56:00.000,0.7328,1000000,0.7336,1000000

Our aim will be to extract the Bid and Ask Price time-series. We will make use of a few standard Linux tools, e.g. sed and awk, supplemented with extra f77 codes. This also demonstrates how useful shell programming can be, and gives us an opportunity to explore the somewhat enigmatic syntax of these programs. Generally, we will be writing a shell script, executable for any FX pair name, e.g. gbpnzd, eurjpy, and so on.

In the first step of the script we create a sorted list of all files. This is tricky in Linux, as the standard command 'ls -lotr' returns the desired list but also all the details on file size, attributes, etc., which we simply do not want. The for loop below solves the problem,

#!/bin/bash
#
# Extracting Time-Series from Tick-Data .csv files
# (c) Quant at Risk, 2012
#
# Exemplary usage: ./script.src audusd
 
echo "..making a sorted list of .csv files"
for i in $1_*.csv; do echo ${i##$1_} $i ${i##.csv}; 
done | sort -n | awk '{print $2}' > $1.lst

and a newly created file $1.lst (note: $1 corresponds in the shell script to the parameter the script was called with, e.g. audusd; therefore $1.lst physically means audusd.lst) contains the list:

audusd_1.csv
audusd_2.csv
audusd_3.csv
...
audusd_2148.csv

We create one data file from all 2148 pieces by creating and executing an in-line script:

echo "..creating one data file"
awk '{print "cat",$1," >> tmp.lst"}' $1.lst > tmp.cmd
chmod +x tmp.cmd
./tmp.cmd
rm tmp.cmd
mv tmp.lst $1.tmp

Now, $1.tmp is a 15 GB file and we may wish to remove some unnecessary comments and tokens:

echo "..removing comments"
sed 's/Ticks,TimeStamp,Bid Price,Bid Size,Ask Price,Ask Size//g' $1.tmp > $1.tmp2
rm $1.tmp
 
echo "..removing empty lines"
sed '/^$/d' $1.tmp2 > $1.tmp
rm $1.tmp2
 
echo "..removing token ,"
sed 's/,/ /g' $1.tmp > $1.tmp2
rm $1.tmp
 
echo "..removing token :"
sed 's/:/ /g' $1.tmp2 > $1.tmp
rm $1.tmp2
 
echo "..removing token -"
sed 's/-/ /g' $1.tmp > $1.tmp2
rm $1.tmp
 
echo "..removing column with ticks and ask/bid size"
awk '{print $2,$3,$4,$5,$6,$7,$8,$10}' $1.tmp2 > $1.tmp
rm $1.tmp2
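At this point each line of $1.tmp holds exactly the eight fields expected by the f77 code below, i.e. YYYY MM DD HH MM SS.SSS BID ASK; for instance, the first tick of the sample file above reduces to:

2004 07 19 11 55 00.000 0.7329 0.7334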

In order to convert the time information into a continuous measure of time, we modify the f77 code getmjd.for (see the Continuous Measure of Time section below) for our task as follows:

c Extracting Time-Series from Tick-Data .csv files
c (c) Quant at Risk, 2012
c
c Program name: fx_getmjd.for
c Aim: removes ticks and converts the trade time into MJD time [day]
c Input data format: YYYY MM DD HH MM SS.SSS BID ASK
 
      implicit double precision (a-z)
      integer y,m,d,hh,mm,jd
      character zb*50
 
      call getarg(1,zb)
      open(1,file=zb)
      do i=1.d0, 500.d6
        read(1,*,end=1) y,m,d,hh,mm,ss,bid,ask
        jd= d-32075+1461*(y+4800+(m-14)/12)/4+367*(m-2-(m-14)/12*12)
     _    /12-3*((y+4900+(m-14)/12)/100)/4
        mjd=(jd+(hh-12.d0)/24.d0+mm/1440.d0+ss/86400.d0)-2400000.5d0
        mjd=mjd-51544.d0 ! T_0 = 2000.01.01 00:00
        abr=ask/bid      ! ask/bid ratio, written out as an extra column
        write(*,2) mjd,bid,ask,abr
      enddo
    1 close(1)
    2 format(F15.8,F8.4,F8.4,F12.6)
 
      end

and, after compiling it in the usual way (f77 fx_getmjd.for -o fx_getmjd), execute it in our script:

echo "..changing a date to MJD"
fx_getmjd $1.tmp > $1.dat
rm $1.tmp

In the aforementioned f77 code, we set the zero time point (MJD = 0.00) at Jan 1, 2000 00:00, hence the subtraction of 51544 days, the MJD of that epoch. Since that day, our time is expressed as a single column measuring the time progress in days, with the fractional part tracking hours and minutes.
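The same day-counting can be cross-checked in Matlab; a minimal sketch, taking the timestamp of the first tick from the sample file above:

% days elapsed since the adopted zero point of 2000-01-01 00:00
t = datenum('2004-07-19 11:55:00','yyyy-mm-dd HH:MM:SS') - datenum('2000-01-01');
% t should agree with the first column produced by fx_getmjd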

We may split the data into two separate time-series containing Bid and Ask Prices at the tick-data level:

echo "..splitting into bid/ask/abr files"
awk '{print $1,$2}' $1.dat > $1.bid
awk '{print $1,$3}' $1.dat > $1.ask

A quick inspection of both files reveals that we deal with nearly $500\times 10^6$ lines! Before we reach our chief aim, i.e. rebinning the series with a 1-hour time resolution, we unfortunately need to separate the input into 5 parts, each of at most $100\times 10^6$ lines. This number may vary depending on the RAM available and, if it is sufficient, this step can even be skipped. We proceed:

echo "..spliting bid/ask/abr into separate files"
fx_splitdat $1 1
fx_splitdat $1 2
fx_splitdat $1 3
fx_splitdat $1 4
fx_splitdat $1 5

where fx_splitdat.for code is given as follows:

c Extracting Time-Series from Tick-Data .csv files
c (c) Quant at Risk, 2012
c
c Program name: fx_splitdat.for
c Exemplary usage: ./fx_splitdat audusd [1,2,3,4,5]
 
      implicit double precision (a-z)
      integer      nc
      character*6  zb
      character*10 zbask,zbbid
      character*16 zb1ask,zb2ask,zb3ask,zb4ask,zb5ask
      character*16 zb1bid,zb2bid,zb3bid,zb4bid,zb5bid
      character*1  par2,st
 
c zb- name of length equal 6 characters only
      call getarg(1,zb)
      call getarg(2,par2)  ! case
 
      write(st,'(a1)') par2
      read(st,'(i1)') nc
 
      zbask=zb(1:6)//'.ask'
      zbbid=zb(1:6)//'.bid'
 
      zb1ask=zb(1:6)//'.ask.part1'
      zb2ask=zb(1:6)//'.ask.part2'
      zb3ask=zb(1:6)//'.ask.part3'
      zb4ask=zb(1:6)//'.ask.part4'
      zb5ask=zb(1:6)//'.ask.part5'
 
      zb1bid=zb(1:6)//'.bid.part1'
      zb2bid=zb(1:6)//'.bid.part2'
      zb3bid=zb(1:6)//'.bid.part3'
      zb4bid=zb(1:6)//'.bid.part4'
      zb5bid=zb(1:6)//'.bid.part5'
 
      open(11,file=zbask)
      open(12,file=zbbid)
 
      if(nc.eq.1) then
        open(21,file=zb1ask)
        open(22,file=zb1bid)
        do i=1.d0, 100.d6
          read(11,*,end=1) mjd_ask,dat_ask
          read(12,*,end=1) mjd_bid,dat_bid
          if((i>=1.0).and.(i<100000001.d0)) then
            write(21,2) mjd_ask,dat_ask
            write(22,2) mjd_bid,dat_bid
          endif
        enddo              
      endif
 
      if(nc.eq.2) then
        open(31,file=zb2ask)
        open(32,file=zb2bid)
        do i=1.d0, 200.d6
          read(11,*,end=1) mjd_ask,dat_ask
          read(12,*,end=1) mjd_bid,dat_bid
          if((i>=100000001.d0).and.(i<200000001.d0)) then
            write(31,2) mjd_ask,dat_ask
            write(32,2) mjd_bid,dat_bid
          endif
        enddo              
      endif
 
      if(nc.eq.3) then
        open(41,file=zb3ask)
        open(42,file=zb3bid)
        do i=1.d0, 300.d6
          read(11,*,end=1) mjd_ask,dat_ask
          read(12,*,end=1) mjd_bid,dat_bid
          if((i>=200000001.d0).and.(i<300000001.d0)) then
            write(41,2) mjd_ask,dat_ask
            write(42,2) mjd_bid,dat_bid
          endif
        enddo              
      endif
 
      if(nc.eq.4) then
        open(51,file=zb4ask)
        open(52,file=zb4bid)
        do i=1.d0, 400.d6
          read(11,*,end=1) mjd_ask,dat_ask
          read(12,*,end=1) mjd_bid,dat_bid
          if((i>=300000001.d0).and.(i<400000001.d0)) then
            write(51,2) mjd_ask,dat_ask
            write(52,2) mjd_bid,dat_bid
          endif
        enddo              
      endif
 
      if(nc.eq.5) then
        open(61,file=zb5ask)
        open(62,file=zb5bid)
        do i=1.d0, 500.d6
          read(11,*,end=1) mjd_ask,dat_ask
          read(12,*,end=1) mjd_bid,dat_bid
          if((i>=400000001.d0).and.(i<500000001.d0)) then
            write(61,2) mjd_ask,dat_ask
            write(62,2) mjd_bid,dat_bid
          endif
        enddo              
      endif
 
    1 close(11)
      close(12)
    2 format(F15.8,F8.4)
 
      stop
      end

and compiling it as usual:

f77 fx_splitdat.for -o fx_splitdat

Finally, we can extract the rebinned Bid and Ask Price time-series with a bin time of 1 hour, i.e. $dt = 1/24 \approx 0.041666667$ d, making use of the following f77 code:

c Extracting Time-Series from Tick-Data .csv files
c (c) Quant at Risk, 2012
c
c Program name: fx_rebin.for
c Exemplary usage: ./fx_rebin audusd.ask.part1 0.041666667
 
      implicit double precision (a-z)
      integer dim
      parameter (dim=100000000)
 
      double precision f(dim), mjd(dim), step
      character*50 par1, par2, st
 
      call getarg(1,par1)  ! file name
      call getarg(2,par2)  ! bining [d]
 
      write(st,'(a20)') par2
      read(st,'(f20.0)') step
 
c reading data
      open(1,file=par1)
      do i=1,100.d6
        read(1,*,end=1) 
     _       mjd(i),f(i)
      enddo
    1 close(1)
      n=i-1.d0
 
c main loop
      j=1.d0
      t2=dint(mjd(j))
      do while (j.lt.n)
        i=j
        if ((mjd(i)+step).gt.(mjd(n))) then
          print*
          stop
        else
          t2=t2+step
        endif
        i=j
        il=0.d0
        s=0.d0
        do while (mjd(i).lt.t2)
          s=s+f(i)
          i=i+1.d0
          il=il+1.d0    ! how many points in the segment
        enddo
        av=s/il
        day=t2-step
        if (il.ge.1.d0) then 
          write(*,3) day,av
        endif
        j=j+il
      enddo
 
    2 format(f30.7,2f30.6)
    3 format(f20.8,f8.4)
 
   10 stop
      end

executed in our script for all five parts of both tick-data time-series:

echo "..rebinning with dt = 1 h"
dt=0.041666667
fx_rebin $1.ask.part1 $dt > $1.ask.part1.1h
fx_rebin $1.ask.part2 $dt > $1.ask.part2.1h
fx_rebin $1.ask.part3 $dt > $1.ask.part3.1h
fx_rebin $1.ask.part4 $dt > $1.ask.part4.1h
fx_rebin $1.ask.part5 $dt > $1.ask.part5.1h
fx_rebin $1.bid.part1 $dt > $1.bid.part1.1h
fx_rebin $1.bid.part2 $dt > $1.bid.part2.1h
fx_rebin $1.bid.part3 $dt > $1.bid.part3.1h
fx_rebin $1.bid.part4 $dt > $1.bid.part4.1h
fx_rebin $1.bid.part5 $dt > $1.bid.part5.1h
 
echo "..appending rebinned files"
cat $1.ask.part1.1h $1.ask.part2.1h $1.ask.part3.1h $1.ask.part4.1h $1.ask.part5.1h >$1.ask.1h.tmp
cat $1.bid.part1.1h $1.bid.part2.1h $1.bid.part3.1h $1.bid.part4.1h $1.bid.part5.1h > $1.bid.1h.tmp
rm *part*
 
echo "..removing empty lines in rebinned files"
sed '/^$/d' $1.ask.1h.tmp > $1.ask.1h
rm $1.ask.1h.tmp
sed '/^$/d' $1.bid.1h.tmp > $1.bid.1h
rm $1.bid.1h.tmp
 
echo "..done!"

As the final product we obtain two files, say,

audusd.bid.1h
audusd.ask.1h

of the content:

$ tail -5 audusd.bid.1h 
       3772.62500304   0.9304
       3772.66666970   0.9283
       3772.70833637   0.9266
       3772.75000304   0.9263
       3772.79166970   0.9253

where the first column is time (MJD) with a 1-hour resolution, and the second column contains a simple price average of all tick prices falling into each 1-hour time bin. As a dessert, we plot both Bid and Ask Price time-series, which accomplishes our efforts:

[Figure: AUD/USD rebinned Bid/Ask time-series with 1-hour resolution]
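If you wish to reproduce such a plot in Matlab, a minimal sketch (with the file names as produced by the script) could be:

% load the rebinned series: two columns, time [d] and average price
bid = load('audusd.bid.1h');
ask = load('audusd.ask.1h');
plot(bid(:,1),bid(:,2),'b'); hold on;
plot(ask(:,1),ask(:,2),'r');
xlabel('Time since 2000-01-01 [days]'); ylabel('AUD/USD');
legend('Bid','Ask');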

Continuous Measure of Time

Dealing with financial data sets in raw format, we sometimes find ourselves in need of converting a specific date and time into a continuous measure of time expressed by a single number.

The following Fortran code reads the first six columns of a text file, containing year, month, day, hour, minute, and second, respectively. As output, it returns the Modified Julian Date (MJD), a representation often used where the time axis needs to be expressed in a computer-friendly format.

c     getmjd.for
      implicit double precision (a-z)
      integer y,m,d,hh,mm,jd
      character zb*50
 
      call getarg(1,zb) ! read the text file name as an argument
      open(1,file=zb)
      do i=1.d0, 1.d6
        read(1,*,end=1) y,m,d,hh,mm,ss
        jd= d-32075+1461*(y+4800+(m-14)/12)/4+367*(m-2-(m-14)/12*12)
     _    /12-3*((y+4900+(m-14)/12)/100)/4
        mjd=(jd+(hh-12.d0)/24.d0+mm/1440.d0+ss/86400.d0)-2400000.5d0
        write(*,'(F20.8)') mjd
      enddo                             
    1 close(1)
 
      end

Exemplary Data

2012 1 1 12 53 01
2012 1 2 13 28 57
2012 1 3 14 53 06
...

Compilation and Execution

f77 getmjd.for -o getmjd
./getmjd data.dat

Output

      55927.53681713
      55928.56177083
      55929.62020833
...
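Since a Matlab date number and the MJD differ only by a constant offset (the date number of the MJD epoch, 17-Nov-1858), the first output value above can be cross-checked with a one-liner:

% the difference of date numbers yields the MJD directly
datenum(2012,1,1,12,53,1) - datenum(1858,11,17)  % = 55927.53681713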

Bank format of numbers in Matlab

The following Matlab function displays any double-type number using a nice bank format that includes comma separators and two-digit precision (without rounding). The number num is converted into a new variable of string type.

function [str]=bankformat(num)
    % truncate to three decimal places (no rounding)
    num2=floor(num*1000)/1000;
    % two-digit fractional part, used for padding below
    r=int32(100*(num-floor(num)));
    str = num2str(num2);
    k=find(str == '.');
    if(isempty(k))
        str=[str,'.00'];  % integer input: append .00
    end
    % insert comma separators every three digits before the decimal point
    FIN = min(length(str),find(str == '.')-1);
    for i = FIN-2:-3:2
        str(i+1:end+1) = str(i:end);
        str(i) = ',';
    end
    x=mod(r,10);
    if(x==0)
        str=[str,'0'];  % pad a trailing zero, e.g. .5 -> .50
    end
    % trim to exactly two digits after the decimal point
    k=find(str == '.');
    d=length(str)-k;
    if(d>2)
        str=str(1:end-(d-2));
    end
end

Example:

>> n=70149506.5;
>> s=bankformat(n);
>> fprintf('\nInvested capital: £%s\n',s)
 
Invested capital: £70,149,506.50