Modularity
Our solution consists of a number of services that communicate with each
other through pre-defined interfaces. Individual modules can be updated
without affecting the rest of the system. This modularity also supports
scalability and facilitates quality checks and KPI-based evaluation of
the individual parts of the solution.
Simple interfaces
The individual modules communicate with each other through simple,
lightweight XML-based interfaces. At each stage, the input and output
data are human-readable. Our solution has a small storage and network
footprint.
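
As an illustration of what a lightweight XML-based exchange between two modules might look like, the following sketch builds one record and parses it back. The element and attribute names (`record`, `field`, `url`, `name`) and the helper functions are hypothetical; the actual schema used by the solution is not specified here.

```python
import xml.etree.ElementTree as ET


def build_record(url: str, fields: dict[str, str]) -> str:
    """Serialize one extracted record as a small XML document.

    The <record>/<field> layout is an assumption made for this sketch.
    """
    root = ET.Element("record", attrib={"url": url})
    for name, value in fields.items():
        field = ET.SubElement(root, "field", attrib={"name": name})
        field.text = value
    # encoding="unicode" returns a str rather than bytes
    return ET.tostring(root, encoding="unicode")


def parse_record(xml_text: str) -> tuple[str, dict[str, str]]:
    """Read a record produced by build_record back into Python values."""
    root = ET.fromstring(xml_text)
    fields = {f.get("name"): (f.text or "") for f in root.findall("field")}
    return root.get("url"), fields


# One module writes the message, the next one reads it.
message = build_record(
    "http://example.com",
    {"company": "Example Ltd", "city": "Zurich"},
)
url, fields = parse_record(message)
print(message)
```

Because the message is plain text, it can be opened in any editor or browser at each stage, which is the property the description above relies on.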
Human-readable intermediate results
At each stage, humans can inspect the results. We use only human-readable
formats (typically compressed with standard tools such as gzip to
minimize storage and bandwidth consumption). Intermediate results can be
checked quickly without technical support.
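
A minimal sketch of how a gzip-compressed, human-readable result file could be written and then spot-checked, using only Python's standard library. The file name, the line contents and the helper functions are hypothetical and serve only to illustrate the idea.

```python
import gzip
import os
import tempfile


def write_result(path: str, lines: list[str]) -> None:
    """Write text lines to a gzip-compressed file (text mode, UTF-8)."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for line in lines:
            fh.write(line + "\n")


def preview_result(path: str, limit: int = 5) -> list[str]:
    """Return the first `limit` lines of a gzip-compressed text file."""
    preview = []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for index, line in enumerate(fh):
            if index >= limit:
                break
            preview.append(line.rstrip("\n"))
    return preview


# Hypothetical intermediate result of one stage.
with tempfile.TemporaryDirectory() as tmp:
    result_path = os.path.join(tmp, "stage1_result.xml.gz")
    write_result(result_path, ["<record url='http://example.com'/>"] * 10)
    for line in preview_result(result_path, limit=3):
        print(line)
```

On the command line, standard tools such as `zcat` or `zless` give the same kind of quick look without any custom code.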
Extensibility / scalability
Our solution can run on many servers in parallel. By adding hardware
(processor capacity, RAM, hard-drive space and network bandwidth), it is
possible to scale the processing throughput without changing the program
code. Even on a single machine, it is feasible to process (crawl, run the
extractors and combine the results) half a million websites in less than
2 weeks.
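
To put the single-machine figure in perspective, the short calculation below converts "half a million websites in 2 weeks" into a sustained rate and projects the duration for several machines. The projection assumes that throughput scales linearly with the number of machines, which is an idealization made for this sketch and not a measured result; the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sites_per_second(sites: int, days: float) -> float:
    """Sustained rate needed to process `sites` websites in `days` days."""
    return sites / (days * SECONDS_PER_DAY)


def days_needed(sites: int, machines: int, rate_per_machine: float) -> float:
    """Projected duration, ASSUMING throughput scales linearly with machines."""
    return sites / (machines * rate_per_machine * SECONDS_PER_DAY)


# Single-machine figure from the text: 500,000 websites in (at most) 14 days.
baseline = sites_per_second(500_000, 14)
print(f"{baseline:.2f} sites per second")       # roughly 0.41
print(f"{1 / baseline:.1f} seconds per site")   # roughly 2.4

# Idealized projection for four machines at the same per-machine rate.
print(f"{days_needed(500_000, 4, baseline):.1f} days")
```

Since the text states "less than 2 weeks", the computed rate is a lower bound on the actual single-machine throughput.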