Technology

Modularity



Our solution consists of a number of services that communicate with each other through pre-defined interfaces. Individual modules can be updated without affecting the rest of the system. This modularity also makes the solution easier to scale and facilitates quality checks and KPI-based evaluation of its individual parts.
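
To illustrate the idea, here is a minimal Python sketch of modules that share one pre-defined interface. The `Module` contract, the two example modules and `run_pipeline` are hypothetical names invented for this example; they are not the actual services of our solution. Because every module fulfils the same contract, one module can be swapped without touching the others, and keeping each intermediate result allows each part to be checked on its own.

```python
from typing import Protocol


class Module(Protocol):
    """Hypothetical contract every module fulfils: text in, text out."""

    def process(self, document: str) -> str: ...


class StripWhitespace:
    """Example module: collapse runs of whitespace into single spaces."""

    def process(self, document: str) -> str:
        return " ".join(document.split())


class Lowercase:
    """Example module: convert the text to lower case."""

    def process(self, document: str) -> str:
        return document.lower()


def run_pipeline(modules: list[Module], document: str) -> list[str]:
    """Run the modules in order and keep every intermediate result."""
    results = []
    for module in modules:
        document = module.process(document)
        results.append(document)
    return results
```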



Simple interfaces



The individual modules communicate with each other through simple, lightweight XML-based interfaces. At each stage, the input and output data is readable by humans. Our solution has a small storage and network footprint.
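
As a rough illustration of what such an interface could look like, the following Python sketch serializes one module's output as XML and parses it again in the next module. The element names (`result`, `site`, `field`) and both function names are assumptions made for this example only; they do not describe the actual schema.

```python
import xml.etree.ElementTree as ET


def write_result(url: str, fields: dict[str, str]) -> str:
    """Serialize one module's output as a small XML document (hypothetical schema)."""
    root = ET.Element("result")
    site = ET.SubElement(root, "site", {"url": url})
    for name, value in fields.items():
        field = ET.SubElement(site, "field", {"name": name})
        field.text = value
    return ET.tostring(root, encoding="unicode")


def read_result(xml_text: str) -> tuple[str, dict[str, str]]:
    """Parse the XML produced by write_result back into plain Python values."""
    root = ET.fromstring(xml_text)
    site = root.find("site")
    fields = {f.get("name"): f.text or "" for f in site.findall("field")}
    return site.get("url"), fields
```

The resulting text can be opened in any editor, which is what keeps the data readable at every stage.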



Human-readable intermediate results



At each stage, humans can inspect the results. We use only human-readable formats, typically compressed with standard tools such as gzip to minimize storage and bandwidth consumption. Intermediate results can be checked quickly without technical support.
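
The following Python sketch shows one way to store and reload a gzip-compressed text file using only the standard library. The function names and the file name in the usage note are hypothetical and serve only as an example.

```python
import gzip
from pathlib import Path


def save_intermediate(path: Path, text: str) -> None:
    """Write a human-readable result as gzip-compressed UTF-8 text."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(text)


def load_intermediate(path: Path) -> str:
    """Read a gzip-compressed result back as plain text."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return handle.read()
```

A file written this way, for example `stage1.xml.gz`, can also be viewed directly on the command line with `gzip -dc stage1.xml.gz`.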




Extensibility / scalability



Our solution can run on many servers in parallel. By adding hardware (processor capacity, RAM, hard-drive space and network bandwidth), it is possible to scale the processing throughput without changing the program code. Even on a single machine, it is feasible to process half a million websites (crawling, running the extractors and combining the results) in less than two weeks.
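
To put the single-machine figure into perspective: half a million websites in 14 days corresponds to an average of roughly 1,490 websites per hour. The Python sketch below computes this rate and shows one possible way to split a list of websites across several servers. Assigning sites by checksum is an assumption made for this example, not a description of how our solution distributes work, and all function names are hypothetical.

```python
import zlib


def required_rate(num_sites: int, days: float) -> float:
    """Average number of sites per hour needed to finish within the given time."""
    return num_sites / (days * 24)


def shard_for(url: str, num_servers: int) -> int:
    """Map a site to one of num_servers shards using a stable checksum."""
    return zlib.crc32(url.encode("utf-8")) % num_servers


def partition(urls: list[str], num_servers: int) -> list[list[str]]:
    """Split the sites into one work list per server."""
    shards: list[list[str]] = [[] for _ in range(num_servers)]
    for url in urls:
        shards[shard_for(url, num_servers)].append(url)
    return shards
```

In this sketch, the number of servers is a plain parameter, so adding a server changes the work lists but not the code.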