This is the first installment in what will hopefully become a series.
Here at screen-scraper we handle a variety of projects for a myriad of clients. All of our work is centered around our core software, screen-scraper, but it is often complemented by third-party software such as PHP, Tomcat, Lucene, Google Web Toolkit, and MySQL, along with our own set of custom-built code.
- ScrapbookFinds.com: Our in-house scrapbooking comparison shopping site. Since 2006 we have been scraping many scrapbooking supply websites for product data. As we scrape, the data is added to a MySQL database, where we categorize it and scrub it for duplicates. When you search the site, Lucene quickly finds the results related to your query.
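As a rough illustration of the storage step, here is a minimal sketch of writing one scraped product into MySQL with JDBC. The table and column names (products, site, title, price, part_number) and the connection details are hypothetical placeholders, not our actual schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class ProductWriter {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; substitute your own host, schema, and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/scrapbookfinds", "scraper", "secret")) {

            // One row per scraped product; duplicates are scrubbed in a later pass.
            String sql = "INSERT INTO products (site, title, price, part_number) VALUES (?, ?, ?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "examplescrapbooksupply.com");
                ps.setString(2, "Cardstock Paper Pack, 12x12, 50 Sheets");
                ps.setBigDecimal(3, new java.math.BigDecimal("8.99"));
                ps.setString(4, "CSP-1250");
                ps.executeUpdate();
            }
        }
    }
}
```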
Challenges:
Data normalization is the process of identifying a single product that is found on more than one site. Each site may describe that product using different characteristics in, say, the title, description, or part number. Finding likeness despite those differences is a common challenge for us. We handle data normalization with Lucene, whose ability to index and tokenize disparate data lets us find the commonality.
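To give a feel for the approach, here is a minimal sketch, assuming a recent Lucene release (the exact API has shifted between versions). It indexes product titles scraped from two sites and then searches with a title from a third, so near-duplicates of the same product surface as the top-scoring hits. The field name and sample titles are made up for illustration.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class ProductMatcher {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new ByteBuffersDirectory();

        // Index titles already scraped from other sites.
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            addTitle(writer, "ACME Cardstock Paper Pack 12x12 - 50 Sheets");
            addTitle(writer, "12 x 12 Cardstock Pack (50 sheets) by ACME");
            addTitle(writer, "Deluxe Ribbon Assortment, Pastel");
        }

        // A title just scraped from a new site: tokenization makes the overlap visible.
        String candidate = "ACME 12x12 cardstock, pack of 50";
        QueryParser parser = new QueryParser("title", analyzer);
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(parser.parse(QueryParser.escape(candidate)), 3);
            for (ScoreDoc hit : hits.scoreDocs) {
                Document doc = searcher.doc(hit.doc);
                System.out.println(hit.score + "  " + doc.get("title"));
            }
        }
    }

    private static void addTitle(IndexWriter writer, String title) throws Exception {
        Document doc = new Document();
        doc.add(new TextField("title", title, Field.Store.YES));
        writer.addDocument(doc);
    }
}
```

The two cardstock entries score well above the ribbon entry for the candidate title, which is the signal we use to decide that two listings are the same product.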
We mitigate changes to a site by monitoring the number of records returned each time it is scraped. If the current record count drops below 80% of the previous total, we know to review the logs for errors and warnings issued by screen-scraper.
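A minimal sketch of that 80% rule, assuming the record counts are already on hand (the class and method names here are hypothetical, not part of screen-scraper):

```java
public class ScrapeMonitor {
    private static final double THRESHOLD = 0.80;

    /** Returns true when the new total looks suspiciously low and the logs should be reviewed. */
    public static boolean needsReview(long previousCount, long currentCount) {
        if (previousCount == 0) {
            return true; // nothing to compare against, so err on the side of a manual look
        }
        return currentCount < previousCount * THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(needsReview(10_000, 7_500)); // true  -> dropped below 80%
        System.out.println(needsReview(10_000, 9_200)); // false -> within tolerance
    }
}
```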
Technology used: screen-scraper, MySQL, Lucene
Stay tuned for more to come…