Thursday 30 May 2013

Web Scraping is actually pretty easy

For some of our clients we have worked on automatically extracting data from, or submitting data to, websites which didn't have an API we could use. This and more is called web scraping. Since our announcement of SellerScout, which relies heavily on it, I have received a list of questions about how we actually do this. So here are some thoughts on how to get started in the interesting world of web scraping.

This article covers some basics, which will work fine for most cases. It is probably not even remotely close to how Google does this or how we do it at SellerScout, because both of those systems work at a much larger scale and have different use cases. Machine learning, text analysis, semantic search algorithms and the like are the kinds of things you might rely on if you want to build something big.

Downloading the web

Spiders are the small applications you are going to be writing. Usually they are self-contained, CLI-friendly scripts with some internal logic for how to extract information from a specific website or websites. As an example, a script might go to a website's homepage, download all the category pages, download the product list for each of them, and extract a list of products in the store.

If you are a Python guy, you might want to look at Twisted or Scrapy, the latter being very easy to use. If it's PHP you are using, a combination of cURL and libxml will let you do the same; I'm not aware of any PHP frameworks for this. For any other language, have a look on Google.
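To make this concrete, here is a minimal Scrapy sketch of that homepage -> categories -> products flow. The example.com address, the "/category/" link pattern and the product selectors are made-up placeholders, not a real site.

```python
# A minimal Scrapy spider sketch of the homepage -> categories -> products
# flow described above. All URLs and selectors are hypothetical placeholders.
# Run with: scrapy runspider store_spider.py -o products.json
import scrapy


class StoreSpider(scrapy.Spider):
    name = "store"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Follow every category link found on the homepage.
        for href in response.xpath('//a[contains(@href, "/category/")]/@href').getall():
            yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # Yield one item per product listed on the category page.
        for product in response.xpath('//div[@class="product"]'):
            yield {
                "name": product.xpath(".//h2/text()").get(),
                "price": product.xpath('.//span[@class="price"]/text()').get(),
            }
```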

Depending on your task you will need to support different functionality. If the website is for logged-in users only, you should configure cURL to use a cookie jar and initialize the scraping with a request to the login page. If you need to extract thousands of documents, have some logic to pause and resume the script, so that if it crashes it can start from the last completed document rather than from the beginning. In any case, try to replicate natural user behaviour on the site.
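As an illustration of the cookie jar and pause/resume ideas (using Python's requests library instead of cURL), here is a minimal sketch; the login URL, form fields, credentials and document URLs are hypothetical.

```python
# A minimal sketch of a resumable, logged-in scrape. All URLs, form field
# names and credentials below are hypothetical placeholders.
import json
import os

import requests

CHECKPOINT = "progress.json"
urls = ["http://example.com/doc/%d" % i for i in range(1, 1001)]

session = requests.Session()  # keeps cookies between requests, like a cookie jar
session.post("http://example.com/login", data={"user": "me", "password": "secret"})

# Resume from where the previous run stopped, if it crashed mid-way.
done = set()
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        done = set(json.load(f))

for url in urls:
    if url in done:
        continue
    html = session.get(url).text
    # ... extract and store the data from `html` here ...
    done.add(url)
    with open(CHECKPOINT, "w") as f:
        json.dump(sorted(done), f)
```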

Is it legal? It depends. There is no strict answer, and it varies with what data you are trying to extract. Some data can be copyrighted, for example original texts, so if you are scraping them and showing them on your own website - you are being a bad person. Stop! Ideally you should discuss this with your lawyer, which we did, and get some guidance on how to proceed.

Getting blocked

One of the decisions you will need to make is how you are going to identify the spider - you can either replicate a normal browser's headers or introduce the spider by its name (e.g. 'googlebot'). The first will probably let you stay undetected, while the latter is considered the correct way. From my experience, for anything small Firefox headers will work just fine.
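With Python's requests library the two options look roughly like this; the User-Agent strings and URLs below are only illustrative examples.

```python
# The two identification options, sketched with the requests library.
# Both User-Agent strings and all URLs are illustrative placeholders.
import requests

# Option 1: replicate a normal browser's headers to blend in.
browser_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-GB,en;q=0.5",
}

# Option 2: introduce the spider by name, ideally with a contact address.
bot_headers = {"User-Agent": "mybot/1.0 (+http://example.com/bot-info)"}

html = requests.get("http://example.com/", headers=browser_headers).text
```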

Websites might still decide to block you though, and it's something you should be prepared for. If you are identifying the spider by name, you should respect robots.txt and stop crawling if that file denies you. However, the most likely blocking mechanism is blocking your IP address, which is going to happen if you are being stupid. Really stupid.
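Checking robots.txt is straightforward, for instance with Python's standard-library robotparser; the bot name and URLs here are hypothetical.

```python
# Checking robots.txt before fetching a URL, using the standard library.
# The bot name and URLs are hypothetical placeholders.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("http://example.com/robots.txt")
robots.read()

url = "http://example.com/category/books"
if robots.can_fetch("mybot", url):
    print("allowed to download", url)
else:
    print("robots.txt denies", url, "- skipping it")
```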

You see, when people browse the web they request one page every 3 or 4 seconds, so if you have a list of 1000 URLs to download and you just start iterating over them and issuing requests… Well, you are easy to catch just by looking at the access logs. Don't do this. Instead, keep a queue of URLs to download and issue requests with a random delay in the range of 1 to 5 seconds. It's going to take longer, but it will help you avoid problems.
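A minimal sketch of such a queue with a randomised delay, assuming Python and the requests library; the URL list is a placeholder.

```python
# A download queue with a random 1-5 second delay between requests.
# The URL list is a hypothetical placeholder.
import random
import time
from collections import deque

import requests

queue = deque("http://example.com/page/%d" % i for i in range(1, 1001))

while queue:
    url = queue.popleft()
    html = requests.get(url).text
    # ... process `html`, possibly appending newly discovered URLs to the queue ...
    time.sleep(random.uniform(1, 5))
```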

This doesn't scale though, you might say. And in fact you are right, because a 5-second delay between requests limits the amount of content you can download per day. Luckily for you, I have a tip here too - use proxy servers. It's going to require writing a request scheduler, but if you need to download the same 1000 URLs you might as well distribute them over a list of proxies, each with its own delay. The amount of content you can download increases linearly with the number of proxies you have.
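One way to sketch such a scheduler in Python, assuming the requests library and a hypothetical list of proxy addresses, is to give each proxy its own slice of the URL list and its own delay:

```python
# Distributing the same URL list over several proxies, each with its own
# polite delay. The proxy addresses and URLs are hypothetical placeholders.
import random
import threading
import time

import requests

PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"]
urls = ["http://example.com/page/%d" % i for i in range(1, 1001)]


def crawl_through(proxy, my_urls):
    for url in my_urls:
        requests.get(url, proxies={"http": proxy, "https": proxy})
        time.sleep(random.uniform(1, 5))  # each proxy keeps its own delay


# Proxy i gets every len(PROXIES)-th URL, so the work is split evenly.
threads = [
    threading.Thread(target=crawl_through, args=(proxy, urls[i::len(PROXIES)]))
    for i, proxy in enumerate(PROXIES)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
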
Extracting the data

Once you have the HTML you want to process (to extract links to follow or to extract the actual data), you might wonder how to actually do it. There are a couple of ways and libraries for this, but if you want to keep it simple, XPath or CSS-like queries will work just fine. If you feel like it, and believe me sometimes there is no other way, you might go with regular expressions, but those have problems I'm going to talk about in just a second.
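For instance, with Python's lxml the XPath approach looks roughly like this; the HTML snippet and the selectors are illustrative only.

```python
# Extracting data with XPath via lxml. The HTML snippet and the selectors
# are illustrative only.
from lxml import html

page = html.fromstring("""
<html><body>
  <div class="product"><h2>Blue mug</h2><span class="price">4.99</span></div>
  <div class="product"><h2>Red mug</h2><span class="price">5.49</span></div>
</body></html>
""")

for product in page.xpath('//div[@class="product"]'):
    name = product.xpath(".//h2/text()")[0]
    price = product.xpath('.//span[@class="price"]/text()')[0]
    print(name, price)

# The same with CSS-like queries (needs the cssselect package):
#   for product in page.cssselect("div.product"): ...
```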

I tend to go with XPath because it's very easy to write and debug. Furthermore, there are various browser extensions which let you build those queries and test them on the actual website. I have worked on spiders for over 100 different websites and XPath has worked fine every time, as long as…

The problem you will need to solve is how to process invalid HTML or XHTML markup. From my experience, I have yet to see a website where all pages are 100% valid. The more invalid the markup is, the harder those problems are to fix. There are libraries though, most famously BeautifulSoup, which will try to process invalid markup. They do have performance implications, but keep them in mind, because you won't be able to issue XPath queries against invalid syntax.
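As a small illustration, BeautifulSoup will happily repair markup that a strict parser would reject; the broken snippet below is made up.

```python
# Repairing broken markup with BeautifulSoup before querying it.
# The malformed HTML below (unquoted attributes, unclosed tags) is made up.
from bs4 import BeautifulSoup

broken = "<div class=product><h2>Blue mug</h2><span class=price>4.99<br></div>"

soup = BeautifulSoup(broken, "html.parser")  # tolerant parser rebuilds a usable tree
product = soup.find("div", class_="product")
print(product.h2.get_text())                            # Blue mug
print(product.find("span", class_="price").get_text())  # 4.99
```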

Now let's get back to regular expressions. In theory they might look awesome, because they can extract data even if the HTML markup is invalid; the problem is that they soon get complicated and are very easy to break. XPath works on a DOM tree, so if the website structure changes your queries simply stop working completely. Regexps, on the other hand, might still work but produce very unpredictable results.

Conclusion

We, as a company, have a lot of experience scraping data from the web, and it's actually very, very easy. As long as you follow the logical rules and don't try to over-complicate the data extraction, you could easily extract all the news items, products or blog posts of a site in 30 or so lines of spider code. I could talk about this for hours or days, so I might write more on it soon, because this is just the tip of the iceberg.



Source: http://blog.webspecies.co.uk/2011-07-27/web-scrapping-is-actually-pretty-easy.html

Monday 27 May 2013

An introduction to data scraping with Scraperwiki

Last week I spent a day playing with the screen scraping website Scraperwiki with a class of MA Online Journalism students and a local blogger or two, led by Scraperwiki’s own Anna Powell-Smith. I thought I might take the opportunity to try to explain what screen scraping is through the functionality of Scraperwiki, in journalistic terms.

It’s pretty good.
Why screen scraping is useful for journalists

Screen scraping can cover a range of things but for journalists it, initially, boils down to a few things:

    Getting information from somewhere
    Storing it somewhere that you can get to it later
    And in a form that makes it easy (or easier) to analyse and interrogate

So, for instance, you might use a screen scraper to gather information from a local police authority website, and store it in a lovely spreadsheet that you can then sort through, average, total up, filter and so on – when the alternative may have been to print off 80 PDFs and get out the highlighter pens, Post-Its and back-of-a-fag-packet calculations.
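To make that concrete, a minimal sketch in Python might scrape an HTML table straight into a CSV file you can open as a spreadsheet; the URL, table id and column layout below are hypothetical.

```python
# Scraping a (hypothetical) HTML table of figures into a CSV file that can
# be sorted, totalled and filtered in a spreadsheet.
import csv

import requests
from lxml import html

page = html.fromstring(requests.get("http://example.com/crime-figures").content)

with open("crime_figures.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["area", "offences"])
    for row in page.xpath('//table[@id="figures"]//tr[td]'):
        cells = [cell.text_content().strip() for cell in row.xpath("./td")]
        writer.writerow(cells[:2])
```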

But those are just the initial aspects of screen scraping. Screen scraping tools like Scraperwiki or scripts you might write yourself offer further benefits that are also worth outlining:

    Scheduling a scraper to run at regular intervals (Adrian Holovaty compares this to making regular virtual trips to the local police station)
    Re-formatting data to clarify it, filter it, or make it compatible with other sets of data (for example, converting lat-long coordinates to postcodes, or feet to metres)
    Visualising data (for example as a chart, or on a map)
    Combining data from more than one source (for example, scraping a list of company directors and comparing that against a list of donors)

If you can think of any more, let me know.


Source: http://onlinejournalismblog.com/2010/07/07/an-introduction-to-data-scraping-with-scraperwiki/

Friday 24 May 2013

Best method for scraping data from Web using VB macro?

This is something of a conceptual question rather than on the specifics of code (ie am I going about this the right way in general or is there a better technique I could use?). I think that my problem represents a broad issue affecting many of the inexperienced people who post on this forum so an overview and sharing of best practice would also help many people.

My aim is to scrape statistical data from a website (here is an exemplar page: www.racingpost.com/horses/result_home.sd?race_id=572318&r_date=2013-02-26&popup=yes#results_top_tabs=re_&results_bottom_tabs=ANALYSIS).

I have a very basic (if you pardon the pun) knowledge of VB, which I use through Excel, but know nothing about other programming languages or conventions (SQL, HTML, XML etc.). However, I am quite good at writing code to manipulate strings - that is, once I can scrape the data, even if it is in a very noisy form, I am expert at processing it. I am trying to build an automated process that will scrape up to 1000 pages in one hit. In one form or another, I have been working on this for years, and the last few weeks have been very frustrating in that I have come up with several new methods which have taken days of work but have each had one fatal flaw that has stopped my progress.

Here are the methods I have tried (all using a VB macro run from Excel):
1) Controlling Firefox (as a shell application) - this was the poorest; I found that I could not interact with Firefox properly using a VB Excel macro - I tried mainly using keystrokes etc.
2) Getting the inner text, outer text, inner HTML or outer HTML from Internet Explorer (IE) - this method was by far the most reliable, but the data was at times very hard to parse and did not always contain everything I needed (good for some applications but bad for others).
3) Automated copy and paste from IE - this was tantalisingly close to being perfect but is given to throwing up inexplicable errors, whereby the type of information copied to the clipboard differs depending on whether it is done manually (i.e. CTRL+A, CTRL+C) or through the automated process (with the former I could get the HTML structure, i.e. tables; with the latter only text). The enigma here was that I could get the automated copy/paste to give me the right info IF I FIRST CLICKED INSIDE THE IE WINDOW USING THE MOUSE POINTER - however, I was unable to automate that click from a VB macro (I tried SendKeys and various other methods).
4) Automating an Excel web query - I recorded a macro of a query, and this worked flawlessly, giving me the table structure I needed. The snag was that it was very, very slow - even a single page might take 14 to 16 seconds (some of the other methods were near instantaneous). This method also appears to encounter severe lagging/crashing problems when many refreshes are done (that may be because I don't know how to update the queries with different criteria, or dispose of them properly).
5) Loading the page as an XML document - I am investigating this method now. I know next to nothing about XML but have a hunch that the sort of pages I am scraping (see example above) are suitable for it. I have managed to load the pages as an XML object, but at present I seem to be running into difficulties trying to parse the structure (i.e. the various nodes) to extract text - I keep running into object errors.

For the record, I have posted highly specific questions with code relating to these individual methods without response, so I am trying a broader question. What is the experience of others here? Which of these methods should I focus on? (Bear in mind I am trying to keep everything to Excel VB macros.) I am getting to the point where I might look to get someone to code something for me and pay them (as this is taking hundreds of hours) - have people had good experiences employing others to write ad hoc code in this manner?


Source: http://www.mrexcel.com/forum/excel-questions/688229-best-method-scraping-data-web-using-vbulletin-macro.html

Friday 17 May 2013

Scrape data from Store Locators

One of the many applications of web scraping is to capture the locations of stores from a retailer's store locator site. For many years, retail store planners wanting to relocate or open new branches often had to rely on finding a connection within the retail organization whose data they needed. Moreover, if the data list exchanged many hands, the final end user ended up paying thousands of dollars. With many retail chains now putting up-to-date information about their stores on a store locator engine, web scraping allows us to capture this data quickly, accurately and at a fraction of the cost of what you might pay to obtain it by finding a source within the organization.

The information you can generally scrape is:

    Name of the store / Store ID
    Address
    City
    State
    Zip / Postal Code
    Phone Number
    Facilities Available
    Store Hours
    Latitude & Longitude

Utilizing our experience in custom web scraping, we can provide you with all of the above information from a store locator site in Excel, CSV or any other format, as required. For a quote on your web scraping project, send us your request along with the targeted store locator site address, the geographical region / area, and the format in which the data is needed. Our expert team of programmers will evaluate the site and write scripts that will scrape the data anonymously and with 100% accuracy.
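Purely as an illustration of the kind of output described above (not our actual production scripts), a sketch of pulling a locator's results into CSV might look like this; the JSON endpoint and its field names are hypothetical, and many locators serve HTML that would need XPath-style extraction instead.

```python
# An illustrative sketch: fetch a (hypothetical) store-locator JSON endpoint
# and write the usual fields out as CSV.
import csv

import requests

FIELDS = ["store_id", "name", "address", "city", "state", "zip",
          "phone", "facilities", "hours", "latitude", "longitude"]

stores = requests.get(
    "http://example.com/store-locator/api?postcode=90210&radius=50"
).json()

with open("stores.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for store in stores:
        writer.writerow(store)
```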

Source: http://blog.website-scraping.com/scrape-data-from-store-locators/

Monday 6 May 2013

Website Data Scraping Services Helps Even Internet Solutions

We are regarded worldwide as one of the most credible and reliable outsourcing providers of data scraping services for web businesses. We offer high-quality, industry-specific information, along with manual and automated website scraping services, at the lowest possible rates.

Our data scraping services come from an Indian outsourcing company skilled in data entry, data processing, research and website data scraping. We have offered a wide range of services - data entry, data conversion, document scanning and data scraping - at lower prices than the rest of the industry since 2005.

Our services cover the following areas: data entry, data mining, web search, data conversion, data processing, web scraping, web harvesting, data collection and email.

We maintain a high quality standard across our web search, data mining and web scraping services. Quality is the standard throughout our web research, data mining and data entry project process.

The industry information usually scraped covers lawyers, doctors, hospitals, students, schools, universities, chiropractors, dentists, hotels, real estate, pubs, bars, nightclubs, restaurants and IT businesses. The most common sources for scraping business databases and email IDs are online business directories, Twitter, Facebook, other social networking sites and Google search.

As one of the more credible and reliable data services providers in the world, we offer data processing, data scraping, database development, web data mining, data extraction and business scraping. We have already worked with many popular online business directories, and we are skilled enough to build a database of publicly available business information from any directory from scratch.

Data Scraping Services is one of the best and most reliable web data scraping businesses in the world. Data Scraping Services provides a wide range of offerings: web data scraping, website data scraping, web data mining, web data extraction, and both automated and manual data scraping services.

Within a short time we have worked for various offshore clients, many of them more than once (many current customers have stayed with us since their first project). We can be a one-stop solution provider for all your data entry, offline data mining, data capture, web data extraction and scraping requirements. Try us for your project requirements to get quality results quickly.

Data Scraping Services is a most reliable India-based company offering data scraping solutions for offshore clients. Try Data Scraping Services to accomplish your web search, data mining, data conversion, document scanning and web data scraping work. Get a free quote and sample work by sending us your information and data scraping requirements.

Forward thinking like this is what sets us apart from our competitors. It is not only our company; it is our passion. We have put a lot of time and effort into making it easy and affordable for you to build and manage your own data capture scripts.

Scraping software turns the web into your own personal database with phenomenal ease. Once you know the tools, you will find there are few restrictions, and with a comprehensive suite of them you will consider scraping indispensable after your first experience.

Source: http://horst6.co.nu/website-data-scraping-services-helps-even-internet-solutions/