Monday, October 19, 2009

Screen-scraping from Wikipedia to Google Books

Most people will have had this experience: you type a term into a search engine and up comes a host of results that look like encyclopedia entries. But when you click through, they all have exactly the same text.

This practice is known generally as data scraping. For online data, it's web scraping or web harvesting. When you're reading webpages and taking the text from them, that's screen scraping, but it's also possible to use RSS feeds, databases, and other sources to get text. Why? Well, once you have text, you can get people reading it, and once you have readers, you can get money from advertising.
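To make "taking the text from webpages" concrete, here is a minimal sketch in Python using only the standard library. The sample page, the `TextExtractor` class and the `scrape_text` function are all made up for illustration; a real scraper would fetch the page over HTTP first and would need to cope with far messier markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, ignoring scripts and styles."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # greater than 0 while inside <script> or <style>
        self._chunks = []      # pieces of visible text, in page order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def scrape_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical page standing in for a downloaded one.
SAMPLE_PAGE = """
<html><head><style>p { color: black; }</style></head>
<body><h1>Emma</h1><p>Emma Woodhouse, handsome, clever, and rich.</p>
<script>trackVisitor();</script></body></html>
"""

if __name__ == "__main__":
    print(scrape_text(SAMPLE_PAGE))
```

Once the text is out of the page, wrapping it in a new template with ads alongside is the easy part.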

Thanks to Google and other less reputable ad brokers, it's easy to stick ads on your webpage and make some money. For this reason, many people thought Google would never remove screen-scraping sites from their search results, but recently they seem to have taken action and such sites have fallen down the rankings.

Perhaps the leading example is answers.com, which reproduces Wikipedia pages, though other sites do the same - wapedia reformats Wikipedia for mobiles; astrology site astrotheme.com combines star signs with Wikipedia biographies. Sites such as fullbooks.com display out-of-copyright books - their edition of Emma serves Google ads asking "Looking For Rich Women" and offering "Inside A Boyfriends Mind". Anyone can display an out-of-copyright text, and because of the licensing of Wikipedia content, it can be reproduced as long as you credit or link back to the source. Wikipedia does not show advertising, but many of the sites reproducing its content do.

Some companies attempt to increase their value by adding extra functionality. The now-defunct LJ Find, for instance, scraped LiveJournal via its RSS feeds and offered a search facility - something the site itself didn't offer - as well as displaying the entire contents of people's journals with ads alongside. Other sites offer fully searchable novels, or attempt to package content from multiple sites onto one page.
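I don't know how LJ Find actually worked, but the general idea of turning RSS feeds into a search facility can be sketched in a few lines of Python. Everything here is an assumption for illustration: the sample feed, the example.com links, and the `read_entries`, `build_index` and `search` functions.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical RSS 2.0 feed standing in for one fetched from a journal site.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example journal</title>
  <item><title>Reading Emma</title><link>http://example.com/1</link>
    <description>Notes on Jane Austen and Emma.</description></item>
  <item><title>Ad brokers</title><link>http://example.com/2</link>
    <description>How scraped pages earn money from adverts.</description></item>
</channel></rss>"""


def read_entries(feed_xml):
    """Pull the title, link and description out of each <item> in the feed."""
    root = ET.fromstring(feed_xml)
    entries = []
    for item in root.iter("item"):
        entries.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "text": item.findtext("description", default=""),
        })
    return entries


def build_index(entries):
    """Map each lower-cased word to the set of links whose entry contains it."""
    index = defaultdict(set)
    for entry in entries:
        combined = (entry["title"] + " " + entry["text"]).lower()
        for word in re.findall(r"[a-z0-9']+", combined):
            index[word].add(entry["link"])
    return index


def search(index, term):
    """Return the links of entries containing the term, in sorted order."""
    return sorted(index.get(term.lower(), set()))


if __name__ == "__main__":
    index = build_index(read_entries(SAMPLE_FEED))
    print(search(index, "Emma"))
```

A feed is much easier to work with than a rendered page, because the text already arrives split into titles, links and descriptions.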

It's in this context that I come to Google Books. While Google have removed many of the web-scraping sites from their search results, the world's leading ad broker has another way of making money from other people's content. They have digitised a huge number of books and now serve them up online, with appropriate adverts displayed both when you search for a book and in a side panel when you read it. I'll be interested to see whether anyone can screen-scrape the books that Google have scraped from the world's libraries and put up their own advertisements - or even better, ads supplied by Google.
