Our solution consists of a number of services that communicate with each other through pre-defined interfaces. Individual modules can be updated without breaking the overall system. This modularity also supports scalability and facilitates quality checks and KPI-based evaluation of the individual parts of the solution.
The individual modules communicate with each other through simple, lightweight XML-based interfaces. At each stage, the input and output data are human-readable, and the solution has a small storage and network footprint.
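As a minimal sketch of such an interface, the following builds and parses a small XML record using only the standard library. The element names (`page`, `url`, `status`, `extracted`, `item`) are illustrative assumptions, not the project's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record exchanged between two modules; field names are
# illustrative, not the actual interface definition.
def make_record(url, status, items):
    root = ET.Element("page")
    ET.SubElement(root, "url").text = url
    ET.SubElement(root, "status").text = status
    extracted = ET.SubElement(root, "extracted")
    for item in items:
        ET.SubElement(extracted, "item").text = item
    return ET.tostring(root, encoding="unicode")

def parse_record(xml_text):
    root = ET.fromstring(xml_text)
    return {
        "url": root.findtext("url"),
        "status": root.findtext("status"),
        "items": [i.text for i in root.findall("extracted/item")],
    }

message = make_record("https://example.com", "ok", ["phone", "email"])
print(parse_record(message))
```

Because the payload is plain XML text, either side of the interface can be replaced independently as long as it produces or consumes the same elements.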
Humans can inspect the results at each stage. We use only human-readable formats, typically compressed with standard tools such as gzip to minimize storage and bandwidth consumption. Partial results can therefore be checked quickly without technical support.
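The pattern of gzip-compressed, line-oriented intermediate files can be sketched as below. The file name and record layout are hypothetical; only the technique (plain text plus standard gzip compression) is taken from the text above.

```python
import gzip
import tempfile

# Each stage writes plain text compressed with gzip, so intermediate results
# stay human-readable while keeping storage and bandwidth low. The record
# layout below (URL, status, item count) is purely illustrative.
def write_stage_output(path, lines):
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")

def read_stage_output(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

path = tempfile.NamedTemporaryFile(suffix=".txt.gz", delete=False).name
write_stage_output(path, ["https://example.com\tok\t12 items"])
print(read_stage_output(path))
```

The same files can be inspected directly from the command line with standard tools such as `zcat` or `gunzip -c`, which is what makes quick manual spot checks possible without any project-specific tooling.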
Our solution is capable of running on many servers in parallel. By adding hardware (processor capacity, RAM, hard-drive space and network bandwidth), processing throughput can be scaled without changing any code. Even on a single machine, it is feasible to process half a million websites (crawling, running the extractors and combining the results) in less than two weeks.
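The parallelization pattern can be sketched as follows, assuming a simple per-site worker function (the real crawl-and-extract step is not shown here, so `process_site` is a hypothetical stand-in). Since crawling is I/O-bound, a thread pool on one machine already gives useful overlap, and the same partitioning of the URL list extends to many machines. For scale, half a million sites in 14 days corresponds to roughly 500000 / (14 * 24 * 3600), i.e. about 0.4 sites per second of sustained end-to-end throughput.

```python
from concurrent.futures import ThreadPoolExecutor

def process_site(url):
    # Stand-in for crawling, extraction and combination of results;
    # the real per-site work is hypothetical here.
    return url, len(url)

def run_batch(urls, workers=8):
    # Throughput scales with the worker count (and, across machines,
    # with how the URL list is partitioned), not with code changes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_site, urls))

print(run_batch(["https://a.example", "https://b.example"]))
```

Scaling out then amounts to splitting the input URL list across servers and concatenating the per-stage output files afterwards.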