Quantitative Analysis, Risk Management, Modelling, Algo Trading, and Big Data Analysis

Retrieve the Data of Fund Performance utilizing Google and Python

Do you remember my post on Get the Data of Fund Performance directly into Excel utilizing VBA and Google? If not, have a look, as this time we will do the same but in Python. In short, given a list of APIR codes (each referring to a different investment option), we want to fetch the publicly available performance data from the InvestSMART.com.au website. Previously we built the code in Excel/VBA. It works, but it is slow, as the process relies on Excel's plug-ins to establish connections with external websites and download the data into Excel workbooks. We can speed it up significantly using Python.

The second reason for doing it in Python is purely educational. We will see below how to read data from a CSV file, send a query from Python directly to Google Search, retrieve a list of responses, pick out the URL of the page we are interested in, download its HTML code, parse it to get the content of the webpage without the HTML ornaments, scan the content for the required information (in our case: the most recent fund performance figures available online), and finally save everything to a file.

This is our roadmap for today. Are you ready for the journey?

In Search of Sunrise

To get the job done, we will need to utilize some existing Python modules. The most important one is the google module, available for download here. We will also need the mechanize module, a stateful programmatic web-browsing library for Python. If you need to install BeautifulSoup, here is all you are looking for. Okay then, let's roll:

# Retrieve the Data of Fund Performance utilizing Google and Python
#  
# (c) 2014 QuantAtRisk.com, by Pawel Lachowicz
 
import mechanize, csv, os, unicodedata
from google import search as gs
from bs4 import BeautifulSoup
 
# remove file, if exists
os.path.exists("investsmart.csv") and os.remove("investsmart.csv")

where the last line refers to the comma-separated-values file where we will save all fetched data. It is good practice to remove it if it already exists in the local directory. As you will see in a moment, we do this at the beginning because everything we retrieve from the web and filter for content will be appended to the output file investsmart.csv in a loop.
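The exists-and-remove construct is a short-circuit trick: Python evaluates os.remove() only when os.path.exists() returns True, so a fresh run with no file present raises no exception. A quick sketch on a throwaway temporary file (the file name is made up for the demo):

```python
import os
import tempfile

# Short-circuit delete: os.remove() runs only when the file exists,
# so the very first run (no file yet) raises no exception.
path = os.path.join(tempfile.gettempdir(), "investsmart_demo.csv")

with open(path, "w") as f:        # create a file to delete
    f.write("dummy\n")

os.path.exists(path) and os.remove(path)   # file is removed here
os.path.exists(path) and os.remove(path)   # harmless no-op now

print(os.path.exists(path))  # False
```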

We feed our algo with a list of APIR codes stored in a text file, codes.csv, of the exemplary structure:

AMP1232AU
AMP1524AU
AMP0167AU
AMP1196AU
...
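Since each line holds a single code, csv.reader will hand every line back as a one-element list. A minimal in-memory sketch of what that read loop yields (io.StringIO stands in for the codes.csv file):

```python
import csv
import io

# csv.reader yields each line as a list of fields, so the APIR code
# sits at index [0] of every row; an in-memory sample replaces codes.csv.
sample = "AMP1232AU\nAMP1524AU\nAMP0167AU\n"

codes = []
for irow in csv.reader(io.StringIO(sample)):
    codes.append(irow[0])

print(codes)  # ['AMP1232AU', 'AMP1524AU', 'AMP0167AU']
```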

In order to read each line from this file, we commence with:

with open('codes.csv', 'rb') as f:  # the input list of APIR codes
    reader = csv.reader(f) # reads the lines as a list
    for irow in reader:
        apircode=irow[0]
        print(apircode)
 
        # define query for Google Search
        phrase="investsmart com au "+apircode

where the for loop iterates through the file line-by-line, reading in each APIR code as the first element of the irow list. Having the code stored in the variable apircode, we define a query phrase for the search engine and open an inner operational loop, here referring to the output file:

        with open('investsmart.csv', 'a') as csvfile:
            writer=csv.writer(csvfile, dialect="excel")
 
            i=0
            # Search in Google for InvestSMART APIR code
            for url in gs(phrase, stop=1):
                i+=1
                if(i==1):
                    link=url # store the first URL from Google results
                    print("%s\n" % link)

The whole action starts when we send the query to Google (the gs command). As you look at this creature, it is, in fact, masterpiece, beauty, and Python's elegance all in one. Not only do we hit the search engine with our phrase, but we also iterate over the incoming list of URLs and extract the URL corresponding to the first hit found by Google. Next, we browse through that first link,

            br=mechanize.Browser()  # open a virtual browser
            resp=br.open(link)      
            content=resp.get_data()
            #print(content)

and download the HTML code of the website with resp.get_data(). To get a better feeling for what the variable content holds for the first APIR code from our codes.csv input file, just uncomment the print statement and execute the code.
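As a side note, mechanize is a Python 2 era library; the same "open the link, grab the raw bytes" step can be sketched with the standard library alone in modern Python. Here a data: URL stands in for the live InvestSMART page, so the sketch runs offline:

```python
import urllib.request

# urlopen() plays the role of mechanize's br.open(); resp.read()
# corresponds to resp.get_data().  The data: URL is a stand-in page.
link = "data:text/html,<html><title>demo</title></html>"
resp = urllib.request.urlopen(link)
content = resp.read()   # raw bytes of the page's HTML

print(content)
```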

For the AMP1232AU investment option, content holds the website's HTML code, starting from:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
 
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]--> 
<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8" lang="en"> <![endif]--> 
<!--[if IE 8]>    <html class="no-js lt-ie9" lang="en"> <![endif]--> 
<!--[if gt IE 8]><!--> <html class="no-js" lang="en"> <!--<![endif]--> 
 
<html> 
<head> 
<title>AMP SigSup - AMP Super Cash - Managed fund Profile - Managed funds - InvestSMART - managed funds, shares and investment news</title> 
 
<meta http-equiv="X-UA-Compatible" content="IE=9" /> 
<meta name="verify-v1" content="xgkff+3TBcugNz7JE2NiJoqkiVs1PHybWgFkaBuhblI=" /> 
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" /> 
<meta name="title" content="Discount access and research on more than 1,000 top performing managed funds" /> 
<meta name="description" content="Rebate of entry fees on the majority of Australian managed funds. Research on managed funds and shares." /> 
 
...

and containing (somewhere in the middle) the information we would like to extract:

...
 
<!-- ****************************** --> 
<!-- ***   Performance Table    *** --> 
<!-- ****************************** --> 
<br /><br /> 
<table class="Container" width="100%" cellpadding="0" cellspacing="0" border="0"> 
  <tr> 
    <td class="ContainerHeader" align="left"><b>Fund Performance</b> (as at 30th Nov 2013)&nbsp;<a href="javascript:PopupWindow('/education/GlossaryPopup.asp?id=133', '_blank', 700, 450)" class="glossaryLnk" title="More information...">&nbsp;</a></td> 
    <td class="ContainerHeader" align="right">NOTE : returns for periods greater than 1 year are annualised</td> 
  </tr> 
  <tr> 
    <td colspan="2" width="100%" align="left" valign="top" class="ContainerBody"> 
      <table width="100%" cellpadding="0" cellspacing="0" border="0" class="DataTable"> 
        <tr class="DataTableHeader"> 
          <td align="left"  valign="top" nowrap>&nbsp;</td> 
          <td align="right" valign="top" nowrap>1 Month<br />%</td> 
          <td align="right" valign="top" nowrap>3 Month<br />%</td> 
          <td align="right" valign="top" nowrap>6 Month<br />%</td> 
          <td align="right" valign="top" nowrap>1 Year<br />% p.a.</td> 
          <td align="right" valign="top" nowrap>2 Year<br />% p.a.</td> 
          <td align="right" valign="top" nowrap>3 Year<br />% p.a.</td> 
          <td align="right" valign="top" nowrap>5 Year<br />% p.a.</td> 
          <!--<td align="right" valign="top" nowrap>7 Year<br />% p.a.</td>--> 
          <td align="right" valign="top" nowrap>10 Year<br />% p.a.</td> 
        </tr> 
 
 
        <tr class="DataTableRow" onMouseOver="this.className = 'DataTableRowHighlight';" onMouseOut ="this.className = 'DataTableRow';"> 
          <td align="left"  valign="top"><b>Total Return</b></td> 
          <td align="right" valign="top"><b>0.12</b></td> 
          <td align="right" valign="top"><b>0.37</b></td> 
          <td align="right" valign="top"><b>0.8</b></td> 
          <td align="right" valign="top"><b>1.77</b></td> 
          <td align="right" valign="top"><b>2.24</b></td> 
          <td align="right" valign="top"><b>2.62</b></td> 
          <td align="right" valign="top"><b>-</b></td> 
          <!--<td align="right" valign="top"><b>-</b></td>--> 
          <td align="right" valign="top"><b>-</b></td> 
        </tr>
 
...

As we can see, the HTML code is pretty nasty, and digging the numbers out of the "Total Return" row seems troublesome. Well, not for Python. Just have a look at how easily you can strip the HTML of all those Christmas-tree ornaments:

            # transform code into beautiful soup ;-)
            soup = BeautifulSoup(content)
 
            # perform post-processing for stripped data
            outputline=[]
            j=0     # becomes 1 once "Fund Performance" has been found
            row=0
            roww=0
            for string in soup.stripped_strings:
                if(string=="Fund Performance"):
                    print(string)
                    j=1
                    row=1
                if(row!=0): 
                    row+=1
                if(row==3):
                    updated=string
                    print(string)
                if(string=="Total Return") and (j==1):
                    print(string)
                    roww=1
                    j=0
                if(roww!=0):
                    roww+=1
                if(roww>=3) and (roww<11):
                    if(roww==3):
                        print(string)
                        s=string.lstrip('(').rstrip(')').lstrip('as at ')
                    else:
                        s=string
                    outputline.append(s)
                    print(string)

Here we make use of BeautifulSoup, a Python library designed for quick turnaround projects like screen-scraping. We can then scan the transformed data string-by-string in the for loop: the generator soup.stripped_strings makes all the HTML bumpers and stickers vanish, leaving us with pure text! From that point, the remaining lines of Python code are designed to extract the specific information, print it on the screen, and append it to the Python list outputline. For the AMP1232AU investment option from our query, Python displays:

AMP1232AU
http://www.investsmart.com.au/managed-funds/profile.asp?function=print&FundID=16654
 
Fund Performance
(as at 30th Nov 2013)
Total Return
0.12
0.12
0.37
0.8
1.77
2.24
2.62
-
-
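The sequence of strings above is exactly why the counter logic works: stripped_strings delivers each cell's text as a separate, whitespace-trimmed string, with "Total Return" immediately followed by the performance numbers. A self-contained sketch on a cut-down version of the performance table:

```python
from bs4 import BeautifulSoup

# stripped_strings on a fragment modelled after the performance table:
# the tags vanish and each cell's text arrives as a separate string,
# which is what the row/roww counters in the listing rely on.
html = """
<table class="DataTable">
  <tr><td><b>Total Return</b></td>
      <td><b>0.12</b></td><td><b>0.37</b></td><td><b>0.8</b></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
texts = list(soup.stripped_strings)
print(texts)  # ['Total Return', '0.12', '0.37', '0.8']
```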

If Google returns a first link pointing to something completely different from the InvestSMART webpage for the requested APIR code,

            # check the validity of the desired data for this APIR code
            if(len(outputline)>0):
                rowout=[]
                outputline.insert(0,apircode) # add APIR code at the front
                outputline.append(updated)    # append the "as at" date
                for item in outputline:
                    rowout.append(item.encode('ascii')) # convert <type:unicode>
                                                        # to <type:string>
                print(rowout) # show the final version of the list on the screen
                if(len(rowout)>0):
                    writer.writerow(rowout) # if non-empty, save/append the list
                                            # with data to the "investsmart.csv"
                                            # output file; otherwise ignore
                                            # this query's results

# both files are closed automatically by their "with" blocks

the outputline list will be of zero length and nothing is written. With a few extra lines of post-processing, the output file stores the subject of our small endeavour, i.e.:

pawels-mbp:Py pawel$ cat investsmart.csv 
AMP1232AU,0.12,0.37,0.8,1.77,2.24,2.62,-,-,(as at 30th Nov 2013)
AMP1524AU,-2.47,0.14,-1.58,8.63,14.09,9.42,-,-,(as at 30th Nov 2013)
AMP0167AU,0.39,1.96,3.18,8.45,7.72,5.82,5.09,4.89,(as at 30th Nov 2013)
...
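The row format shown above can be reproduced with the csv module alone. A sketch of the final write-out step (an io.StringIO buffer stands in for investsmart.csv; note that the .encode('ascii') conversion in the listing is a Python 2 concern, as the csv module in Python 3 handles unicode strings directly):

```python
import csv
import io

# One assembled outputline becomes one CSV row in the excel dialect.
outputline = ["AMP1232AU", "0.12", "0.37", "0.8", "1.77",
              "2.24", "2.62", "-", "-", "(as at 30th Nov 2013)"]

buf = io.StringIO()
writer = csv.writer(buf, dialect="excel")
writer.writerow(outputline)

row = buf.getvalue().strip()
print(row)  # AMP1232AU,0.12,0.37,0.8,1.77,2.24,2.62,-,-,(as at 30th Nov 2013)
```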

That ends our journey In Search of Sunrise. The total time of scanning 1000 APIR codes is cut down by 60% using Python instead of VBA. That brings to my mind an Air Austral commercial I saw a year ago: Arrive fast, then slow down.
