Preliminary Description of the CUSP Schema
Carl Johnson, January 15, 1999
Introduction
CUSP seems to have come to mean a variety of things to different people. Recently I have come to realize just how much it is seen as a timing interface, which I suppose makes sense, since that is what you see when you look at it. My view, not surprisingly, is radically different. To me CUSP is a relational database together with a mechanism to track the processing of data as it is assimilated. In concept it is not restricted even to seismology, and I have used it, in a modern form, to organize image processing in remote sensing and medical applications. Applied to regional network processing it can be configured to ensure that all events are analyzed, located, reviewed, and archived. The alternative, and still ubiquitous, approach is to rely very heavily on analysts to initiate all phases of processing for each event; I have referred to this as the "cattle prod" approach. It breaks down with complexity, during times of high seismicity, and as the number of analysts involved in processing grows. In other words, it does not scale well.
This is not to say that the timing interface is not important, because it is central. But it is still only a small part of the effort of providing the services that CUSP users have come to expect. The central theme in the original design was to separate the work that can best be done by a computer from that requiring a trained analyst. The analyst is seen as a critical resource that should not be expended on mindless drudgery involved in tracking and bookkeeping.
CUSP was developed on a proprietary and very limited relational database at a time when computers were slow and centralized and when DBMS seats started at $50,000. Needless to say, times have changed. Mounting CUSP on an "off the shelf" DBMS makes a great deal of sense, and eliminates much of the need for long term support which has, if anything, been CUSP’s nemesis. While modules, such as particular timing programs, will still be proprietary and mission specific, the substrate will be commercial. I strongly believe that the process sequencer, which is the heart of CUSP, is still a very viable and valid concept, and should be retained.
CUSP State and Transition Tables
The CUSP schema organizes processing into a network of discrete states and result-oriented state transitions. Provisions are made for shadow tasking and rank based prioritization.
I guess CUSP has come to mean a variety of different things and some have called it a philosophy. The original idea and central theme in the CUSP system comprised a set of relational tables that would track the sequences of processes required for an object (local event), and ensure that all necessary tasks are completed in the correct order. The word CUSP was an acronym for Caltech / USGS Seismic Processing system; that is, it really should be referred to as the CUSP system, not just as CUSP. The name was also a play on words, referring to a point in time at which a change occurs.
An event, represented as a single long integer, could simultaneously exist in one or more states such as "Time", "Archive", "Delete". In this example, the "Time" state was associated with interactive waveform picking and relocation, "Archive" represented the process of placing event data in successive files on magnetic tape, and "Delete" actualized the removal of all disk resident files, particularly waveform data. At any given time only a few activities would be occurring; for example, file deletions followed archiving, for fairly obvious reasons. Similarly, event timing can only occur when an analyst is present, and it was generally considered polite if deletion was blocked until timing was completed. To accomplish this each state assigned to an event was accompanied by a rank, an integer ranging from 1 to 10000. For a given event only the instance with the lowest rank could be executed. Thus if a particular event was posted to "Time" with a rank of 100, and "Delete" with a rank of 1000, then deletions would be postponed until timing was complete. In the original implementation states could lead to the scheduling of new states depending on the numerical result of an operation. Consequently the processing trajectory for a given object could vary considerably, and exceptional conditions could be handled in a natural way by providing parallel processing tracks.
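The rank gating described above can be sketched in a few lines. This is a minimal illustration, assuming an in-memory list of postings rather than database rows; the function and variable names are mine, not part of the schema.

```python
# Minimal sketch of CUSP-style rank gating. Each posting is an
# (entity_id, state, rank) triple, mirroring a State table row.
def runnable_states(postings):
    """Return, for each entity, the single lowest-rank state posting.

    Only that posting is eligible to execute; higher-rank states for the
    same entity are deferred until it completes and is removed.
    """
    best = {}
    for entity_id, state, rank in postings:
        if entity_id not in best or rank < best[entity_id][2]:
            best[entity_id] = (entity_id, state, rank)
    return list(best.values())

# The example from the text: "Time" at rank 100 blocks "Delete" at rank 1000.
postings = [(4102, "Time", 100), (4102, "Delete", 1000)]
print(runnable_states(postings))   # only the "Time" posting is runnable
```

Once the "Time" row is removed on completion, "Delete" becomes the lowest-rank posting and is free to run.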
For our purposes the application of this technology would focus on a variety of issues requiring tracking of deferred processing. The basic concept is far more general than its original CUSP form, permitting the tracking of any objects that can be identified by a single serial number. Deferred processing will arise whenever system resources are inadequate to accommodate real-time processing. Examples might include picking phases for incoming snippets, reassociation of phases from multiple sources, or batch transfer of archival material to a data management center.
For the remainder of the discussion refer to the entity relationship diagram shown in Figure 1. The objects manipulated by the CUSP logic are called entities, and each one that exists is represented as a row in the Entity table. Each entity is identified by a unique Id. In the original implementation this was the "CUSP Id", which uniquely identified a seismic event. For our purposes this could be any kind of object that can be manipulated programmatically, such as a row in a table, or a collection of entities associated with some kind of container entity. The column Type designates the type of the entity, and Path is provided for some sort of access path to the entity. For example, Type might indicate that the entity was a row in a database, and the Path would be access information required for retrieving it. For a more familiar example, Type might indicate a local file, and Path would then be the path in a directory structure. These two columns can be used in any way desired in the design of a DBMS based scheduler. The only non-key attribute of a State table row (discussed below) is Result. If a state is ready to be executed the value of Result is set to 0. A nonzero value for Result indicates that a process completed with a result code for which a state transition had not been defined (see discussion of the Transition table below).
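A sketch of the Entity table in SQL may make this concrete. The column names (Id, Type, Path) follow the text; the SQL types, the use of sqlite3, and the sample values are assumptions of mine.

```python
import sqlite3

# Hedged sketch of the Entity table; types and sample data are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Entity (
        Id   INTEGER PRIMARY KEY,  -- the unique "CUSP Id"
        Type INTEGER,              -- kind of object (file, table row, ...)
        Path TEXT                  -- access information for retrieving it
    )""")
conn.execute("INSERT INTO Entity VALUES (?, ?, ?)", (4102, 1, "/events/4102"))
row = conn.execute("SELECT Path FROM Entity WHERE Id = 4102").fetchone()
print(row[0])
```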
The central table here is the State table, which contains an entry for each object in each state. That is, if an object is posted to three states there will be three rows representing this in the State table. The primary key is a concatenation of the columns Id, Rank, and State in the order given. This order is chosen to facilitate a round-robin scheduler that continuously scans the State table and instantiates processing tasks as required. As noted above, only that state with the lowest rank for a given entity can be processed. With the compound key expressed in this manner, the logic needed to implement a scheduler could be a simple finite-state machine. As shown, one and only one entity is associated with each State table row while any number of State table rows can be associated with any entity.
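The compound key and the scheduler's scan can be sketched as follows. The key order (Id, Rank, State) comes from the text; the SQL details, the Result column placement, and the sample data are assumptions.

```python
import sqlite3

# Sketch of the State table and the query a round-robin scheduler might
# issue on each pass. Primary key order (Id, Rank, State) follows the text.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE State (
        Id     INTEGER,
        Rank   INTEGER,
        State  TEXT,
        Result INTEGER DEFAULT 0,
        PRIMARY KEY (Id, Rank, State)
    )""")
conn.executemany("INSERT INTO State (Id, Rank, State) VALUES (?, ?, ?)",
                 [(4102, 100, "Time"), (4102, 1000, "Delete"),
                  (4103, 500, "Archive")])
# For each entity, only the row with the lowest rank is eligible, and only
# when its Result is 0 (ready to run).
rows = conn.execute("""
    SELECT Id, State FROM State s
    WHERE Result = 0
      AND Rank = (SELECT MIN(Rank) FROM State WHERE Id = s.Id)
    ORDER BY Id""").fetchall()
print(rows)   # [(4102, 'Time'), (4103, 'Archive')]
```

Because the primary key leads with Id and Rank, the eligible row for each entity is the first one encountered in key order, which is what makes a simple sequential scan sufficient.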
Whenever a process associated with a state completes for a given entity, a result code is emitted and examined by the scheduler. The scheduler uses the combination of the current state and the result code as a key to locate an entry in the Transition table. If an entry is found, then the original state entry is removed from the State table and a new one is posted using the state and rank from the attributes NewState and Rank respectively. The Result column of the new row is set to zero. If no entry is found, the original State table entry is retained with Result set to the value of the result code.
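The transition step can be sketched with in-memory dictionaries standing in for the tables. The (state, result) to (NewState, Rank) mapping mirrors the Transition table; the particular transitions and function names are illustrative, not part of the schema.

```python
# Illustrative Transition table contents: (state, result) -> (NewState, Rank).
transitions = {
    ("Time", 0): ("Archive", 500),      # normal completion of timing
    ("Archive", 0): ("Delete", 1000),   # archived, safe to delete
}

def on_completion(states, entity_id, state, rank, result):
    """Apply the scheduler's bookkeeping when a process exits.

    `states` stands in for the State table: keys are (Id, Rank, State),
    values are the Result column.
    """
    if (state, result) in transitions:
        new_state, new_rank = transitions[(state, result)]
        del states[(entity_id, rank, state)]          # remove old posting
        states[(entity_id, new_rank, new_state)] = 0  # new posting, Result = 0
    else:
        # No transition defined: retain the row, record the result code.
        states[(entity_id, rank, state)] = result
    return states

states = {(4102, 100, "Time"): 0}
on_completion(states, 4102, "Time", 100, 0)
print(states)   # {(4102, 500, 'Archive'): 0}
```

An unrecognized result code (say, 7 from "Archive") would leave the row in place with Result = 7, flagging it for attention rather than silently advancing.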
The Serial table is used as a source of unique numbers. The rows in the Serial table represent pools of unique numbers identified by the Pool column. The value in the Serial column is the last number assigned in that sequence. The serial number increases by 1 each time a new value is requested. The pool used by the CUSP API is named "entity".
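A minimal sketch of the Serial table and its allocation step, again assuming sqlite3; the "entity" pool name comes from the text, everything else is illustrative. A production implementation would wrap the update and read in a single transaction so concurrent requesters cannot collide.

```python
import sqlite3

# Hedged sketch of the Serial table: one row per pool of unique numbers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Serial (Pool TEXT PRIMARY KEY, Serial INTEGER)")
conn.execute("INSERT INTO Serial VALUES ('entity', 0)")

def next_serial(conn, pool):
    """Advance the pool by 1 and return the newly assigned number."""
    conn.execute("UPDATE Serial SET Serial = Serial + 1 WHERE Pool = ?", (pool,))
    return conn.execute("SELECT Serial FROM Serial WHERE Pool = ?",
                        (pool,)).fetchone()[0]

print(next_serial(conn, "entity"))   # 1
print(next_serial(conn, "entity"))   # 2
```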
Finally, the Action table provides the information required to actually instantiate a process. So far schedulers have required that the Command attribute contain the directory path to an executable. The command is executed with the first command line parameter set to the Id of the entity requiring processing. The result code emitted by this process upon termination becomes the value of Result in the State table. From these components rather elaborate processing strategies can be devised and constructed.
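The invocation step might look like the following sketch. The command shown is a stand-in that simply exits with a fixed code, not a real CUSP module; in practice Command would name a processing executable taken from the Action table.

```python
import subprocess
import sys

# Sketch of how a scheduler might instantiate the process named in an
# Action row: run Command with the entity Id as its first argument and
# treat the exit status as the result code.
command = [sys.executable, "-c", "import sys; sys.exit(3)"]  # stand-in Command
entity_id = 4102
proc = subprocess.run(command + [str(entity_id)])
result = proc.returncode   # becomes Result in the State table
print(result)              # 3
```

The scheduler would then feed `result` back through the Transition table lookup described above, either advancing the entity to a new state or parking it with the nonzero result code.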