Ausschreibung/GraphServ-Updater/Spec


The GraphServ-Updater is a program that feeds updates from wiki databases to GraphServ. The purpose is to provide an up-to-date representation of the category structure of all Wikimedia wikis on the Toolserver.

The updater is a standalone program that is intended to be run as a background service on the Toolserver.

Functionality

The general idea of the updater is this: load data from one kind of database (MySQL) and push it into another kind of database (GraphServ). Then determine what has changed in the first database and push those changes to the second.

In more detail, the following functionality shall be provided by the updater:

Importing a Wiki's Category Structure

  1. Connect to the wiki's database copy.
  2. Connect to the desired GraphServ instance.
  3. From the wiki database, load the category structure, represented as a set of pairs of page IDs, where the first ID is the parent and the second ID is the child. The updater should not assume anything about the structure: it may or may not be hierarchical, acyclic, connected, etc. This structure can easily be derived from the categorylinks table in the wiki's database, joined with the page table to resolve category names to page IDs.
  4. Optionally, create the target graph if it doesn't exist (command line option).
  5. Clear the target graph if it already exists.
  6. Use the add-arcs command (see the GraphCore spec) on GraphServ to import the category structure loaded before. Note that the structure does not need to be kept in memory. It can be streamed from MySQL to GraphServ arc by arc.

If the MySQL host is the same as the GraphServ host, a temporary file may be used to transfer the bulk data: use MySQL's SELECT ... INTO OUTFILE to write a CSV file containing the category structure, and GraphCore's piping facility to use that file as the input for add-arcs. This should be controlled by a command line option.
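The import steps above can be sketched as follows. This is only an illustration under several assumptions: it uses the classic MediaWiki columns `cl_from`, `cl_to`, `page_id`, `page_namespace` and `page_title`; it uses an in-memory SQLite database as a stand-in for the MySQL replica so that the sketch is runnable (a MySQL driver would typically use `%s` instead of `?` as placeholder); and the `parent,child` line format written by `stream_arcs` is a guess at what add-arcs accepts, to be checked against the GraphCore spec. The names `iter_arcs`, `stream_arcs` and `demo_database` are hypothetical.

```python
import io
import sqlite3

CATEGORY_NS = 14  # MediaWiki's category namespace

# One arc per row: (parent category page ID, child category page ID).
ARC_QUERY = """
SELECT parent.page_id, child.page_id
FROM categorylinks
JOIN page AS child ON child.page_id = cl_from
JOIN page AS parent ON parent.page_title = cl_to
                   AND parent.page_namespace = ?
WHERE child.page_namespace = ?
"""


def iter_arcs(conn):
    """Yield (parent_id, child_id) pairs one at a time, without building a list."""
    cur = conn.cursor()
    cur.execute(ARC_QUERY, (CATEGORY_NS, CATEGORY_NS))
    for parent_id, child_id in cur:
        yield parent_id, child_id


def stream_arcs(arcs, out):
    """Write each arc as a 'parent,child' line (assumed format); return the count."""
    count = 0
    for parent_id, child_id in arcs:
        out.write(f"{parent_id},{child_id}\n")
        count += 1
    return count


def demo_database():
    """Tiny stand-in database: Physics (2) is in Science (1); Atom (3) is an article."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
        CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
        INSERT INTO page VALUES (1, 14, 'Science'), (2, 14, 'Physics'), (3, 0, 'Atom');
        INSERT INTO categorylinks VALUES (2, 'Science'), (3, 'Physics');
    """)
    return conn


if __name__ == "__main__":
    buffer = io.StringIO()  # stands in for the connection to GraphServ
    stream_arcs(iter_arcs(demo_database()), buffer)
    print(buffer.getvalue(), end="")
```

Because `iter_arcs` is a generator, arcs flow from the query to the output one by one. Whether the database driver itself buffers the full result set depends on the driver and cursor type, so that would need to be checked for the MySQL library actually used.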

Updating a Wiki's Category Structure

  1. Connect to the wiki's database copy.
  2. Connect to the desired GraphServ instance and select the appropriate graph (according to the command line)
  3. Determine the set of modified categories:
    • From the wiki's recentchanges table, determine all categories (pages in the category namespace) that have been modified since a given reference time.
      • The reference time is given on the command line, either as a timestamp or as an ID to be compared with the rc_id field.
    • From the wiki's recentchanges and templatelinks tables, determine all categories that transclude a page that has been modified since the given reference time.
  4. Determine the "parents" (categories) of each modified category page using the categorylinks table. Do this in a single query, rather than using one query per page.
  5. For each modified category, call the replace-predecessors command (see the GraphCore spec) on GraphServ to replace the stale set of parent categories with the fresh one.
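Steps 4 and 5 above might look roughly like the sketch below. It is an illustration, not the reference client: `replace_predecessors` is a hypothetical method name on a hypothetical client object (the actual client API and command syntax are defined by the GraphServ client library and the GraphCore spec), `RecordingClient` is a test stand-in, and the SQL assumes the classic `cl_from` / `cl_to` columns with `%s` placeholders as used by common MySQL drivers.

```python
def parents_query(n):
    """Build a single query that fetches the parents of n categories at once."""
    placeholders = ", ".join(["%s"] * n)
    return (
        "SELECT cl_from, parent.page_id "
        "FROM categorylinks "
        "JOIN page AS parent ON parent.page_title = cl_to "
        "AND parent.page_namespace = 14 "
        f"WHERE cl_from IN ({placeholders})"
    )


def group_parents(modified_ids, rows):
    """Map every modified category to its fresh set of parents.

    Categories without any row still get an empty set, so that their
    stale parents are removed from the graph as well.
    """
    parents = {cat_id: set() for cat_id in modified_ids}
    for child_id, parent_id in rows:
        parents.setdefault(child_id, set()).add(parent_id)
    return parents


def apply_update(client, modified_ids, rows):
    """Issue one replace-predecessors call per modified category."""
    for cat_id, parent_ids in sorted(group_parents(modified_ids, rows).items()):
        client.replace_predecessors(cat_id, sorted(parent_ids))


class RecordingClient:
    """Stand-in for a GraphServ client; records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def replace_predecessors(self, node, predecessors):
        self.calls.append((node, predecessors))


if __name__ == "__main__":
    client = RecordingClient()
    # Category 5 now has parents 1 and 2; category 7 has lost all its parents.
    apply_update(client, [5, 7], [(5, 1), (5, 2)])
    print(client.calls)
```

Seeding the dictionary with every modified ID matters: a category whose last parent was removed returns no rows from the query, yet its predecessors in the graph still have to be replaced with the empty set.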

Maintaining Multiple Category Structures

  1. Load a list of wikis for which to maintain the category structure
    • may be given as a file specified on the command line
    • may be derived from the list of graphs known to the GraphServ instance
    • may be derived from the list of databases in MySQL
    • may be taken from the Toolserver's meta-database.
  2. For each wiki, perform the category structure update as described in the section above. Wait a configurable amount of time between wikis.
    • The reference time (or rather, reference rc_id) for each wiki is remembered from the previous pass, so the next pass will start where the last one left off.
    • The reference time for each wiki should be stored persistently, perhaps in a file, so that it survives a restart of the updater. (Note: it may become possible to store this in the graph itself.)
    • More generally, the internal state should be stored persistently in a way that allows for crash recovery, resuming the updating process where it left off. Redundant operations are permissible, but re-processing large amounts of data should be avoided.
    • It shall be possible to force the reference time for the first pass to some timestamp from the command line.
  3. Optionally, repeat this indefinitely, waiting a configurable amount of time between passes.
  • It can be assumed that all wiki databases reside on the same host, so a single MySQL connection can be used for all of them.
  • Similarly, it can be assumed that all category graphs reside on the same GraphServ instance, so a single connection can be reused.
  • If there is no corresponding graph for a given wiki, this should be reported, but it should not cause the updater to fail.
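One possible way to keep the per-wiki reference persistent and crash-safe is sketched below. All names (`StateStore`, `run_pass`, the JSON file layout) are hypothetical, and signalling a missing graph with `LookupError` is merely a convention chosen for this example. The state file is rewritten through a temporary file and an atomic rename, so a crash should leave either the old or the new state on disk, not a half-written file.

```python
import json
import logging
import os
import tempfile
import time


class StateStore:
    """Persist the last processed rc_id per wiki in a small JSON file."""

    def __init__(self, path):
        self.path = path
        self.state = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.state = json.load(f)

    def get(self, wiki, default=0):
        return self.state.get(wiki, default)

    def set(self, wiki, rc_id):
        self.state[wiki] = rc_id
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)  # atomic rename over the old state file


def run_pass(wikis, store, update_wiki, forced_reference=None, delay=0.0):
    """Run one update pass; update_wiki(wiki, reference) returns the new reference."""
    for wiki in wikis:
        reference = store.get(wiki) if forced_reference is None else forced_reference
        try:
            new_reference = update_wiki(wiki, reference)
        except LookupError as exc:  # e.g. no graph exists for this wiki
            logging.warning("skipping %s: %s", wiki, exc)
            continue
        store.set(wiki, new_reference)  # saved per wiki, so a crash loses little
        time.sleep(delay)
```

Saving after every wiki rather than once per pass means that, after a crash, at most the wiki being processed at that moment is updated again, which falls under the redundant operations the spec permits.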

Implementation Constraints

Configuration

  • The configuration file should use ini-file syntax.
  • Any information needed to connect to the Wiki's database, given the wiki name, shall be read from the configuration file.
    • To avoid having to provide information for each wiki, information from the Toolserver's meta-database may be used. This way, only access to that meta-database needs to be given in the config file.
  • The connection info for contacting GraphServ shall be read from the configuration file.
    • If the GraphServ instances are located on the same hosts as the databases, this should also be noted in the configuration file, to avoid having to specify GraphServ's location for every wiki individually.
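As an illustration of the ini-file approach, the sketch below reads such a file with Python's standard `configparser` module. The section names, key names, host names and port number are all made up for this example; the spec does not prescribe them.

```python
import configparser

# Hypothetical configuration; every name and value here is a placeholder.
EXAMPLE_CONFIG = """
[meta-database]
host = sql.example.org
user = updater
password = secret
database = toolserver

[graphserv]
same-host-as-database = no
host = graphserv.example.org
port = 6666
"""


def load_config(text):
    """Parse ini-style configuration text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def graphserv_address(config, db_host):
    """Return (host, port) for GraphServ, reusing the database host if configured."""
    section = config["graphserv"]
    if section.getboolean("same-host-as-database", fallback=False):
        host = db_host
    else:
        host = section["host"]
    return host, section.getint("port")


if __name__ == "__main__":
    cfg = load_config(EXAMPLE_CONFIG)
    print(graphserv_address(cfg, "db1.example.org"))
```

In a real deployment the text would come from a file (`parser.read(path)`), and the file should be readable only by the updater's user, since it contains database credentials.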

Command line

  • The database name of the wiki to import the category structure for is given on the command line.
  • Optionally, the name of the target graph is also given. If not, the target graph is assumed to have the same name as the database.
  • Options for "create graph", "use data file", etc. (see above).
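A command line along these lines could be parsed with Python's standard `argparse` module, as sketched below. The program name and the option names `--create-graph` and `--use-data-file` are assumptions made for this example; the spec only names the options informally.

```python
import argparse


def build_parser():
    """Build the argument parser; all option names are hypothetical."""
    parser = argparse.ArgumentParser(prog="graphserv-updater")
    parser.add_argument("wiki_db", help="database name of the wiki to import")
    parser.add_argument("graph", nargs="?",
                        help="target graph name (defaults to the database name)")
    parser.add_argument("--create-graph", action="store_true",
                        help="create the target graph if it does not exist")
    parser.add_argument("--use-data-file", action="store_true",
                        help="transfer bulk data through a temporary file")
    return parser


def parse_args(argv):
    """Parse argv and apply the 'graph defaults to database name' rule."""
    args = build_parser().parse_args(argv)
    if args.graph is None:
        args.graph = args.wiki_db
    return args


if __name__ == "__main__":
    print(parse_args(["dewiki", "--create-graph"]))
```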

Platform

  • The updater is required to run on recent versions of Linux and Solaris, in both 32-bit and 64-bit variants.
  • The updater should be designed for long term operation. Care must be taken to avoid memory leakage and involuntary retention of objects.
  • There are no hard requirements for the language of implementation, but since the updater acts mostly as "glue" and does no heavy processing, maintainability and portability are more important than execution speed. Because of that, it is suggested to use a cross-platform scripting language like Python.

GraphServ API

For communication with GraphServ, a client-side API abstraction (client library) should be used. A reference implementation exists in PHP. For other languages, that implementation would have to be ported, at least in part.

Documentation

High level documentation shall be created during the course of implementation. It shall describe:

  • The overall functionality
  • Implementation architecture
  • Command line usage
  • Configuration options
  • Setup considerations

The documentation shall be maintained in the same code repository as the source code, and thus versioned along with the code. The documentation shall be written in reStructuredText format (a lightweight plain-text markup language similar to wiki text but more suitable for documentation).

The source code shall be documented in a way that makes clear:

  • the purpose of modules, classes and functions
  • the reason for implementing things the way they are implemented

Tests

After the initial implementation, a testing phase shall confirm that the software conforms to the specification and meets all operational requirements.

High-level test cases shall be implemented to assert correct operation. In addition, manual test runs and performance tests will be performed in the operational environment. Any problem discovered during testing shall be covered by a corresponding test case, to avoid regressions.