This is more of a conceptual question than one about the specifics of code (i.e. am I going about this the right way in general, or is there a better technique I could use?). I think my problem represents a broad issue affecting many of the inexperienced people who post on this forum, so an overview and sharing of best practice would also help many people.
My aim is to scrape statistical data from a website. Here is an exemplar page: www.racingpost.com/horses/result_home.sd?race_id=572318&r_date=2013-02-26&popup=yes#results_top_tabs=re_&results_bottom_tabs=ANALYSIS
I have a very basic (if you pardon the pun) knowledge of VB, which I use through Excel, but know nothing about other programming languages or conventions (SQL, HTML, XML etc.). However, I am quite good at writing code to manipulate strings: once I can scrape the data, even in a very noisy form, I am expert at processing it. I am trying to build an automated process that will scrape up to 1000 pages in one hit. In one form or another I have been working on this for years, and the last few weeks have been very frustrating: I have come up with several new methods, each of which took days of work, but each has had one fatal flaw that stopped my progress.
Here are the methods I have tried (all using a VB macro run from Excel):
1) Control Firefox (as a shell application) - this was the poorest. I found that I could not interact with Firefox properly from an Excel VB macro; I mainly tried sending keystrokes.
2) Get the inner text, outer text, inner HTML or outer HTML from Internet Explorer (IE) - this method was by far the most reliable, but the data was at times very hard to parse and did not always contain everything I needed (good for some applications but bad for others).
3) Automated copy and paste from IE - this was tantalisingly close to being perfect, but what ends up on the clipboard differs, for reasons I cannot explain, depending on whether the copy is done manually (i.e. CTRL+A, CTRL+C) or through the automated process: with the former I get the HTML structure (i.e. tables), with the latter only text. The enigma is that the automated copy/paste does give me the right information if I first click inside the IE window with the mouse pointer. However, I was unable to automate that click from a VB macro (I tried SendKeys and various other methods).
4) Automating an Excel web query - I recorded a macro of a query, and this worked flawlessly, giving me the table structure I needed. The snag was that it was very slow: even a single page might take 14 to 16 seconds (some of the other methods were near instantaneous). This method also appears to suffer severe lagging/crashing problems when many refreshes are done (that may be because I don't know how to update the queries with different criteria, or how to dispose of them properly).
5) Loading the page as an XML document - I am investigating this method now. I know next to nothing about XML but have a hunch that the sort of pages I am scraping (see example above) are suitable for it. I have managed to load the pages as an XML object, but at present I am running into difficulties parsing the structure (i.e. the various nodes) to extract text, and keep hitting object errors.
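For reference, a request-and-parse approach in the spirit of method 5 might look something like the sketch below. It is only a minimal sketch under assumptions, not a tested solution for this site: the names FetchHtml, TablesToSheet and Demo are made up for illustration, the URL is a placeholder, and it assumes the tables are present in the raw HTML the server returns rather than being filled in later by scripts on the page. It also swaps the XML parser for the "htmlfile" HTML parser, since real-world web pages are often not well-formed XML, which may be one source of the object errors.

```vb
Option Explicit

' Hypothetical helper: fetch a page's raw HTML over HTTP, with no browser window involved.
Function FetchHtml(ByVal url As String) As String
    Dim http As Object
    Set http = CreateObject("MSXML2.XMLHTTP")   ' late bound, so no library reference is needed
    http.Open "GET", url, False                 ' False = wait for the response before continuing
    http.send
    If http.Status = 200 Then FetchHtml = http.responseText
End Function

' Hypothetical helper: write every <table> found in an HTML string to a worksheet,
' one sheet row per table row, leaving a blank row between tables.
Sub TablesToSheet(ByVal html As String, ByVal ws As Worksheet)
    Dim doc As Object, tbl As Object, tr As Object, td As Object
    Dim r As Long, c As Long

    Set doc = CreateObject("htmlfile")          ' HTML parser; tolerant of markup that is not valid XML
    doc.body.innerHTML = html

    r = 1
    For Each tbl In doc.getElementsByTagName("table")
        For Each tr In tbl.Rows
            c = 1
            For Each td In tr.Cells
                ws.Cells(r, c).Value = td.innerText
                c = c + 1
            Next td
            r = r + 1
        Next tr
        r = r + 1                               ' blank row between tables
    Next tbl
End Sub

Sub Demo()
    ' Placeholder address: substitute the URL of a real results page.
    TablesToSheet FetchHtml("http://www.example.com/"), ActiveSheet
End Sub
```

Because nothing is rendered on screen, a loop over many URLs calling these two routines would avoid the clipboard and window-focus problems of methods 1 and 3, although how fast it runs over 1000 pages would depend on the site.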
For the record, I have posted highly specific questions with code relating to these individual methods without response, so I am trying a broader question. What is the experience of others here? Which of these methods should I focus on? (Bear in mind I am trying to keep everything to Excel VB macros.) I am getting to the point where I might pay someone to code something for me, as this is taking hundreds of hours. Have people had good experiences employing others to write ad hoc code in this manner?
Source: http://www.mrexcel.com/forum/excel-questions/688229-best-method-scraping-data-web-using-vbulletin-macro.html