Earthworm v5

DBMS Design

1.0 Introduction

This document addresses the functional requirements and current design for the upcoming phase of the Earthworm project (Version 5). This project and the development of its architecture have been ongoing for over 10 years and have evolved through the major development phases summarized below.

 

1.1 RTP Replacement

The initial objective was to provide an updated and improved rapid response system which could acquire various types of telemetry data and generate alarms. The main requirements were reliability, speed, and longevity.

 

1.2 Integration of Regional Networks

As the system was adopted by a growing number of regional networks, numerous additional requirements were added, including real-time linkage of multiple processing centers, event storage in various formats, review and modification of events, and elaborate notification schemes. These were initially met by adopting existing legacy codes and ad hoc software modules, but it became clear that these solutions were short-term measures and could not address the underlying needs. In addition, new goals, such as strong motion data integration, use of late-arriving data, and a growing suite of end-user applications, made it clear that a DBMS-centric view was required to address these issues.

 

1.3 Experimental DBMS

The objectives of this phase were to evaluate contemporary DBMSs, select a suitable product, and to demonstrate our ability to utilize it by producing a functioning demonstration system. An initial exploration of the products, contemporary literature, and consultants made it clear that DBMSs had become extremely complex, and that an extensive body of knowledge surrounded the design, implementation, and maintenance of such DBMSs. Furthermore, the products and the associated practices were evolving rapidly. The objectives of this effort were thus to create a development team with the requisite skills and performance standards, and to focus on the underlying techniques and methods, rather than end products.

 

1.4 Production DBMS.

The current phase is aimed at producing a production system to meet a number of objectives, including:

* support of regional and global processing,

* integration of strong motion data and processing,

* a geographically distributed system allowing processing to survive outages at any one center,

* support of multiple nodes at various processing centers,

* scalable configurations suitable for small regional installations as well as large networks,

* linkage to data management centers.

The main tasks of this phase have been the development of an adequate software architecture surrounding the DBMS, and the design and implementation of an adequate schema. The first version of these efforts is nearly complete, and will be integrated into the Earthworm version 5 release.

 

2.0 Functional Issues

 

The issues discussed here arose during discussion with a large number of individuals including network operators, scientists, commercial end users, and field operations personnel.

Some areas, it was felt, were sufficiently outside the scope of this project that their needs were not considered directly. Most significant in this regard was the archiving of historical data other than instrumental information. Several institutions, notably the IRIS/DMC and the regional archives at the University of California at Berkeley and the California Institute of Technology, have well-evolved archiving capabilities, and there seemed to be no need to duplicate this functionality. Supplying data to such archiving centers in appropriate formats was nevertheless deemed important, and to this extent such facilities are addressed as "End Users" with respect to design issues. We also examined the documentation supporting "Early Bird" as an example of an operational system for the rapid dissemination of preliminary summary information. The underlying concepts of the CUSP system for the automated processing of regional networks were also considered. CUSP has been an important contributor to regional analysis for the past 20 years, and although the current implementation is clearly ageing, the basic concepts are sound.

The following list represents what emerged as the most important issues during early design discussions. Not all of these are explicitly solved in the resulting design, and many are substantially out of scope with respect to current mandates. However, we have tried to make sure that the design elements supporting mandated functionality do not preclude addressing some of these other functional issues as time and resources permit.

 

2.1 General

* Automated event processing.

* Utilization of modern DBMS technology.

* Unification of regional and teleseismic processing.

* Utilization of existing Earthworm software.

* Assimilation of disparate data formats.

 

2.2 Regional Networks

* Data sharing among regional networks.

* Automated processing and reporting.

* Historical access to local data.

* Visual timing interface.

* Unassociated waveform data.

* Instrumentation (infrastructure) history.

 

2.3 Global Networks

* Report tracking and control.

* Rapid association of disparate data types.

* Integration of regional networks.

* Modernization of the earthquake review process.

* Integration of global reading groups.

 

2.4 Strong Motion Arrays

* Impact on definition of 'seismic event'.

* Source accreditation.

* Data exclusivity and access control.

 

2.5 End Users.

* Real-time WWW access.

* Integration of GIS products.

* Real-time waveform feeds.

* Centroid moment tensors.

* Simple access to instrument response parameters.

 

 

3.0 Design Considerations

 

3.1 Distributed Model.

A number of approaches were considered with respect to distributed database design. Tightly coupled architectures, such as replication or local 'snapshots', were rejected as being too vendor specific and too near the 'bleeding edge' of the commercial products. In addition, there were indications that DBMS-level replication may be too schema-specific and result in premature hardening of the schema. We decided to implement a more 'confederate' approach in which central analysis nodes operate in a more autonomous fashion, using existing, proven Earthworm software to facilitate communications. It was also felt that this approach would more easily facilitate a gradual evolution of capability as DBMS technology advances, encourage development of new and innovative approaches at regional sites, and more easily accommodate other network integration schemes.

 

3.2 Processing versus Data Warehousing.

In designing large-scale DBMS architectures there is a clear distinction that must be made early on between warehousing and transactional structures. In the seismological community, warehousing schemata are appropriate at data archiving facilities, such as the IRIS/DMC and the regional archives at Caltech and Berkeley. Such schemata universally involve a large amount of denormalization in order to permit rapid access to large volumes of data that change little after they are entered into the DBMS. The transactional approach is appropriate when the database is rapidly evolving, which in the seismological community is represented by real-time analysis systems; in such systems the overhead of repairing denormalization is an unacceptable burden and seriously degrades overall performance. In the commercial world such architectures support the needs of the banking industry and distributed car rental agencies. In the Earthworm 5 DBMS design we have chosen a transactional architecture, since our mission is intrinsically real-time and transaction oriented. It is expected that the archiving functions will reside at the present repositories, and that our role will terminate with the production of a final SEED volume for delivery to such centers. There will be some installations where limited warehousing will be expected, such as some of the smaller regional networks, but this application is not a primary driver of the DBMS design, which is intrinsically transactional in nature.

 

3.3 Applications Interface.

In the exploratory stages of the DBMS effort, we discovered several methods by which an applications program could interact with the database. They involved various software packages by which an application program (e.g. interactive re-pickers, locators, acquisition modules, etc.) could connect to the DBMS and directly read and write data to and from the tables of the schema. We eventually realized that, as simple and appealing as this approach was, it had several serious problems:

a) Schema dependence. That is, it tied the applications to the schema: if the schema changed, all applications would cease to function and would have to be recoded to the new schema. This is a serious problem because, on one hand, given the absence of an established, accepted schema and the dynamic nature of our objectives, it was clear that the schema was going to evolve. On the other, it was a stated objective that the DBMS was to play a central role in network operations - that is, support many applications. This combination made it clear that some form of isolation between schema and applications was needed.

b) Connection control. First, the cost of a DBMS license has traditionally been tied to the number of concurrent connections which the DBMS could support. Second, there was the problem we had encountered in our earlier development of servers, wherein an ill-written client could mismanage the connections it established with the server, with the result that other critical applications would be unable to connect.

c) Concurrency control. As we were forced into increasingly complex schema designs, access procedures became an issue. The problem was that many applications would be accessing the same DBMS, and would be updating and reading the same set of tables. As the schema became more complex, the number of tables which had to be accessed (or changed) in order to reach some data increased, and the sequence of locks which had to be applied became delicate and elaborate. It appeared that as the system grew, the likelihood that a mis-written application would corrupt the database would grow rapidly. There was also the possibility that the classic 'deadly embrace' deadlock might arise between applications which each held locks on some tables, and could not proceed before gaining access to the other's table.

d) Application Complexity. Several of our objectives led to the requirement that applications be easy to write. One of the strengths of the Earthworm effort is its relative accessibility: we have made efforts to minimize the time and skill level required to enhance the system, permitting institutions which could not maintain professional programming staff to modify and enhance the system to meet their needs. Inclusion of a DBMS led to two problems in this area: the need to understand the software layers which permitted direct table access made applications considerably more difficult to write, and the author of the application had to be familiar with the details of the schema. In addition, debugging such applications would be an intricate and lengthy process.

e) Access control. Discussions with potential clients, particularly in the area of strong motion, resulted in requirements for various levels of access for various classes of users and applications. For example, some commercial partners were willing to contribute sensitive data for inclusion in calculations of summary data, but not for direct publication on public web sites. While the DBMS does offer some forms of access control, it was not clear that these would be sufficient, and relying on each application to obey such restrictions seemed risky.

The solution to these problems which we adopted (much to the relief of our DBMS experts), was what appears to be common commercial practice; that is, to insert a layer of software between the applications and the DBMS - the applications program interface (API). This consists of a set of routines which can be called from an applications program to access the DBMS. Each performs a seismically coherent function (e.g. 'getPick', 'putPick', 'createEvent') and presents and receives data via a set of standardized data structures. Internally, they address the issues mentioned above.
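The API concept described above can be sketched as follows. This is a minimal illustration, not the actual Earthworm implementation: the routine names ('getPick', 'putPick', 'createEvent') follow the examples in the text, while the `EventAPI` class, the `Pick` structure, and the in-memory backing store are hypothetical stand-ins for the real DBMS layer.

```python
class Pick:
    """Standardized data structure exchanged across the API."""
    def __init__(self, pick_id, station, arrival_time, phase):
        self.pick_id = pick_id
        self.station = station
        self.arrival_time = arrival_time
        self.phase = phase

class EventAPI:
    """Applications see only these calls; the schema behind them can change."""
    def __init__(self):
        self._events = {}   # hypothetical backing store standing in for the DBMS
        self._picks = {}
        self._next_id = 1

    def createEvent(self):
        # seismically coherent function: create a new event container
        event_id = self._next_id
        self._next_id += 1
        self._events[event_id] = []
        return event_id

    def putPick(self, event_id, pick):
        # internally this layer would handle connections, locking,
        # and access control; here it just writes to the store
        self._picks[pick.pick_id] = pick
        self._events[event_id].append(pick.pick_id)

    def getPick(self, pick_id):
        return self._picks[pick_id]

api = EventAPI()
evt = api.createEvent()
api.putPick(evt, Pick(101, "MNV", 123456.78, "P"))
print(api.getPick(101).station)   # -> MNV
```

Because applications call only these routines, porting the layer beneath them to a different schema or DBMS vendor leaves application code untouched.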

Another advantage of the API concept is that the layer under the API can be replaced with a different implementation, as long as the API calls function as before. In our case, this means that all applications using the DBMS could operate with a different DBMS by porting the API. This is of value since we have observed differences in interface protocols between Oracle versions, and it provides a migration path to another DBMS vendor if that should become necessary.

 

3.4 Globally Unique Identifiers.

A problem that arises with distributed database architectures regards the labeling, or more precisely the uniqueness, of primary keys. Historically, the venerable CUSP ID suffered from problems of universality once cooperating networks began to exchange large volumes of data on a routine basis and archive seismic data at central repositories. The reason for this is that the same data may reside at any number of satellite databases, and one would like the labeling to be identical at these various sites. This matters for data exchange, where it is desired to indicate that a particular piece of information represents a refinement of something that was received previously - the summary information for a hypocenter is one example. Most proposed approaches ignore this issue, and those that address it use some form of identifier blocking, such as assigning blocks of identifiers to the various sites, or requiring all sites to obtain identifiers from a single issuing site. Clearly these approaches suffer from scalability issues and the difficulties associated with a central issuing authority. The approach currently taken, after considerable internal discussion and compromise, is to use a 16-digit number. The first four digits identify the originating Earthworm DBMS (permitting up to 10,000 such distributed processing centers), and the remaining 12 digits are a serial number. This number need only be unique with respect to a particular table. For example, a hypocenter and a phase pick may share the same number without being considered to be in any way related.
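The 16-digit scheme above (a 4-digit site number followed by a 12-digit serial) can be sketched arithmetically. The function names here are illustrative, not part of the design.

```python
def make_id(site, serial):
    """Compose a globally unique id: 4-digit site prefix + 12-digit serial."""
    if not (0 <= site < 10_000):
        raise ValueError("site number must fit in 4 digits")
    if not (0 <= serial < 10**12):
        raise ValueError("serial must fit in 12 digits")
    return site * 10**12 + serial

def split_id(ident):
    """Recover (site, serial) from a composed identifier."""
    return divmod(ident, 10**12)

ident = make_id(42, 7)
print(ident)            # 42000000000007
print(split_id(ident))  # (42, 7)
```

Since the site prefix is assigned once per Earthworm DBMS, each center can mint serials locally with no central issuing authority, which is the scalability property the text calls for.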

 

3.5 Extensibility Considerations.

There can be little doubt that as time passes different types of data and summary products will arise. As an example, for regional networks, hypocenters are generally based solely on picked arrival times of various phases. In teleseismic processing other types of data are also used to support hypocenters, such as reading groups, two- and three-component beams for a single station, and spatial array beams. Other kinds of summary data have similar problems. We have addressed this general problem by defining a LINK layer between the SUMMARY and DATUM layers of the database, as discussed in greater detail below. Other approaches also have linking tables, but ours differs in that link tables have a particular well-defined structure: a fixed portion consisting of two globally unique keys, and a variable portion following, which may contain arbitrary information related to a particular link. For a hypocenter-phase pick link, such information might include the take-off angle or phase identification, both of which depend on the linkage. This particular architecture is also critical to facilitating rapid association and dissociation during real-time processing, and is common to many commercial OLTP (on-line transaction processing) systems, of which real-time earthquake processing is an example.
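The LINK-layer record structure described above can be sketched as follows: a fixed portion of two globally unique keys plus a variable portion of link-dependent attributes. The class and function names, and the use of an in-memory list in place of a DBMS table, are illustrative.

```python
class Link:
    """One row of a link table: fixed key pair + variable link attributes."""
    def __init__(self, summary_id, datum_id, **attrs):
        self.summary_id = summary_id   # e.g. a hypocenter's globally unique id
        self.datum_id = datum_id       # e.g. a phase pick's globally unique id
        self.attrs = attrs             # link-dependent extras (take-off angle, ...)

links = []   # stands in for the hypocenter-phase link table

def associate(hypo_id, pick_id, **attrs):
    links.append(Link(hypo_id, pick_id, **attrs))

def dissociate(hypo_id, pick_id):
    # dropping a link row detaches the datum without touching either object
    links[:] = [l for l in links
                if not (l.summary_id == hypo_id and l.datum_id == pick_id)]

associate(1001, 2001, phase="Pn", takeoff_angle=96.0)
associate(1001, 2002, phase="Sg")
dissociate(1001, 2002)
print(len(links))   # 1
```

Note that association and dissociation touch only the link table, never the summary or datum rows themselves, which is what makes rapid re-association cheap during real-time processing.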

 

3.6 Institutional Conformability.

The issue here regards a means of ensuring that developments at diverse centers can be shared as easily as possible. The resources available for capability development in the seismological community have traditionally been rather spare, and redundant application development should be avoided as much as possible. Cooperation at the application development level is critical if we as a community are to keep pace with the rapidly evolving technological landscape. It is also critical that creativity and innovation not be preempted by a centralized developing agency. Although this concern is almost universally shared, the approaches to addressing it do not seem to be so well focused.

The approach that we have taken is to try to achieve conformability not at the level of the underlying schemata, a task that can probably never be effectively completed, but at the level of the application programming interface, or API as it is sometimes called. In the development of commercial products, where compatibility is clearly required for product marketability, this approach has been adopted almost universally. Because of the high cost of developing and maintaining software, "programming to the metal" can seldom be justified except in the most extreme real-time situations. Instead, one of only a few application programming interfaces is used, and low-level drivers are made available to map these interfaces in an efficient manner onto rapidly evolving underlying capabilities.

A simple example from seismological application development should help to illustrate this point. Consider a simple function to retrieve the location of a seismic station given its official name; such functions can easily be standardized. The details of the underlying file structures or database schemata supporting this request should in general be of no particular interest to the applications developer as long as the function performs as expected. While the efficiency of such an implementation might vary depending on ancillary design considerations, the functionality will not. For this reason we have adopted a design philosophy that achieves conformability at the API level and not at the level of the underlying schema architecture. An example of the success of this approach is presented by the IRIS/DMC, where the database engine and database schema were completely modernized with minimal changes needed at the application level. We are hopeful that standardization at the level of the API can be achieved within the seismological community through its national organizations, without becoming distracted by issues of schema design and mission-oriented process flow requirements.
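The station-location example above can be made concrete. In this sketch the function name, the driver mechanism, and the station coordinates are all invented for illustration; the point is only that the application is written against the standardized call, while the backing "driver" can be swapped without touching application code.

```python
def file_driver(name):
    # one possible backing implementation: a hard-coded table standing
    # in for a flat station file; a DBMS-backed driver could replace it
    stations = {"PAS": (34.148, -118.172), "BRK": (37.874, -122.261)}
    return stations[name]

_driver = file_driver   # the only line that changes when the backend changes

def get_station_location(name):
    """The standardized call that applications are written against."""
    return _driver(name)

print(get_station_location("PAS"))   # (34.148, -118.172)
```

Replacing `file_driver` with a schema-aware driver reproduces, in miniature, the IRIS/DMC experience cited above: the engine and schema change, the applications do not.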

 

3.7 State Driven event processing.

A proven method of automating event processing and analysis is by means of a state-driven analysis sequencer. The problem addressed is the amount of human activity required to process the massive amounts of data resulting from the rather large number of seismic sensor sites already installed, as well as an expected increase in modernized instrumentation during the coming decade. Without a rather sophisticated mechanism for automating analysis, processing falls far behind, and detailed information required during earthquake response sometimes becomes available years after it is needed. Nearly all of the current implementation schemes analyzed provide only rudimentary hooks for processing automation.

In the Earthworm 5 DBMS design this capability is provided by a set of 6 tables shown in the State-Driven ERD. The logic is the same as that used by CUSP, and leverages the experience obtained over the years from applying this approach within the context of regional network processing. Here the logic sequencer is used to make sure that all hypocenters requiring human review are analyzed, and that certain processes such as archiving and final magnitude calculations are deferred until previously required levels of analysis are completed. The much more common alternative is sometimes referred to as the "cattle prod" approach, which requires analysts to be keenly cognizant of processing rules and sequences, and to explicitly schedule required activity at the appropriate time. The net result, particularly considering the experience level of routine analysts, is lost events and delayed completion of required products. A state-driven approach is also valuable when a large number of analysts are working as a team, since it prevents redundant tasking and improves efficiency of execution by eliminating the mundane tasks of work assignment and completion management.

The design provides a list of objects with associated processing requirements (e.g. waveform repicking and mandatory reporting) and associated states. An analyst responsible for waveform repicking can then easily retrieve an event to analyze based on a prioritized schedule and without regard to other details of processing requirements. Following successful analysis, an event is automatically scheduled for subsequent required operations, which can easily be modified by administrative personnel without extensive retraining of the analysis staff. The logic as implemented provides for multiple branching paths, so that a particular event might be scheduled for inclusion in a variety of final report products based on such things as event magnitude or geographical criteria. Object states are carried in the State table shown in the diagram, the Action table provides specific information on how a state transition is implemented, and the Transition table is where a head analyst programs the automation sequencing. States can be blocked by higher-priority states as controlled by the Rank parameter in the State table. Thus final magnitude calculations are blocked until any required human analysis is completed. By this device, hypocenters with a preliminary magnitude can be finalized without requiring any further human intervention, allowing networks to process to a much lower magnitude threshold than would be the case without such a mechanism for automating processing activities.
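The rank-blocking rule described above can be sketched in a few lines. This is a simplified model, not the actual table design: it assumes a smaller Rank number means higher priority, and the tuple layout and function names are invented for illustration.

```python
# Simplified State table: (object_id, state, rank, done).
# Assumption: lower rank number = higher priority, and lower-priority
# actions are masked until all higher-priority actions have completed.
state_table = [
    (1, "relocate", 1, False),
    (1, "catalog",  2, False),   # masked until relocation completes
    (1, "finalmag", 2, False),   # scheduled simultaneously with cataloging
]

def runnable(obj_id):
    """Return the actions currently eligible to run for an object."""
    rows = [r for r in state_table if r[0] == obj_id and not r[3]]
    if not rows:
        return []
    top = min(r[2] for r in rows)
    return [r[1] for r in rows if r[2] == top]

def complete(obj_id, state):
    """Mark an action done, unmasking any lower-priority actions."""
    for i, r in enumerate(state_table):
        if r[0] == obj_id and r[1] == state:
            state_table[i] = (r[0], r[1], r[2], True)

print(runnable(1))        # ['relocate']
complete(1, "relocate")
print(runnable(1))        # ['catalog', 'finalmag']
```

This mirrors the text's example: cataloging and final magnitude calculation are scheduled up front but remain masked until relocation succeeds.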

 

3.8 Aggregation of Disparate Summary Information.

The need for this design element arose out of a consideration of the greater scope mandated for the Earthworm 5 DBMS. For regional network processing, the centralizing element is the hypocenter, and subsidiary objects such as magnitudes and focal mechanisms are subordinate to a hypocenter in a relational sense. In other words, the Hypocenter table gives identity to the highest-level objects in the system and ties together the rest of the derived products. A complexity is encountered even by single regional networks when more than one hypocenter might exist for a particular physical event. This problem is often addressed in the various approaches reviewed by aggregating hypocenters into an associated list by some relational device and then designating one as "preferred". At the NEIC this problem is made more complex by the necessity of including contributed hypocenters from regional processing centers, which in general are based on subsets of all the phase arrival information available from adjacent networks and independently telemetered stations. In some cases the local hypocenter might be more appropriate for inclusion in some catalogs, such as for small to moderate earthquakes in Southern California, where local understanding of crustal structure would undoubtedly lead to more precise locations. On the other hand, for larger events with an associated moment tensor, a global network location might be more appropriate and "preferred" for some applications.

Other considerations and requirements also arose during discussions with those involved in processing other kinds of seismic data, where a hypocenter is not necessarily a centralizing theme or even of particular interest. For example, volcano seismic observatories frequently detect discrete, very-long-period tremor events for which any kind of location is difficult or impossible, yet it is clear that the signals were caused by some common physical process. Another consideration arose during interviews with those involved in strong motion seismology and earthquake engineering, where the concept of discrete events is often blurred. Particularly in the area of seismic engineering, a group of digital accelerograms might be organized with respect to a sequence of events, where the calculation of a hypocenter for an embedded strong event would be futile. Here the focus is primarily on characterizing the wave field as it affects buildings, and the hypocenter, or the initiation point of a complex rupture, is not of particular interest. A strong motion event then might include multiple hypocenters or none at all.

With respect to NEIC processing requirements, information arrives in a wider variety than that experienced by regional centers, and such information needs to be aggregated and organized, possibly for a considerable length of time, before even an initial preliminary location can be achieved. Such information comprises contributed hypocenters from regional processing centers without supporting phase or waveform data, reading groups from remote sites, single-station or local-array beams, and unpicked waveform snippets, to name a few examples. Some of these, such as unpicked waveforms, might need to be scheduled for automated analysis before a centralizing preliminary hypocenter could be achieved.

The schema design in the Core Schema ERD, discussed in greater detail below, provides for these needs by introducing a new layer in the schema hierarchy designated the "Bind". It lies between the more traditional catalog layer and the summary layer, which includes such objects as hypocenters and magnitudes. An object in the bind layer can be thought of as a container that might hold an arbitrary collection of summary information and data, such as tentative phase associations pending processing by a general associator. There is actually only a single table in the bind layer as shown. As an identifiable system object, however, an object in the Bind table can be scheduled through the automated state-driven scheduler. One simple example would be scheduling preliminary association triggered by the addition of an associatable datum such as a phase pick or single-station beam. An object in the bind layer might also represent a collection of tremor signals or all strong motion records during an arbitrary period of time. This aspect of the architecture also supports the need for extensibility, in that new ways of collecting seismic observations can easily be added, with required automated processing capabilities, without impacting pre-existing structures and processes.
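The container role of the Bind layer can be sketched as follows. The class, method names, and the (table, id) reference scheme are illustrative assumptions; the point is that a Bind object aggregates heterogeneous references before any centralizing hypocenter exists.

```python
class Bind:
    """Container aggregating references to arbitrary summary and datum rows."""
    def __init__(self, bind_id):
        self.bind_id = bind_id
        self.members = []          # (table_name, globally_unique_id) pairs

    def add(self, table, object_id):
        # anything can be bound: picks, contributed hypocenters, beams, ...
        self.members.append((table, object_id))

    def of_type(self, table):
        return [oid for t, oid in self.members if t == table]

b = Bind(5001)
b.add("Pick", 2001)          # tentative phase association
b.add("Hypocenter", 1001)    # contributed regional solution, no phases
b.add("Pick", 2002)
print(b.of_type("Pick"))     # [2001, 2002]
```

Because the Bind object has its own identity, it can itself be handed to the state-driven scheduler, e.g. to trigger preliminary association when a new associatable datum is added.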

 

 

4.0 Schema Design

 

Even though Oracle has become the clear choice of the seismic community, the community consensus unfortunately does not extend further to issues regarding schema design. This can be attributed to two reasons: First, seismic schemas are in a fairly early developmental stage and given their complexity and significance, one can expect schema development to undergo a lengthy evolutionary period. Second, the mission of regional networks is changing and growing, and the schema to support this is going to change and grow in complexity.

 

During the design phase of the project we considered a rather large number of approaches from an equally large number of organizations. A report commissioned by the USGS Western Regional Office in Menlo Park was particularly instrumental in facilitating this phase of the project. Database schemata from the UCSB Berkeley Consortium, the original CSS schema, the revised schema being used by the IDC, and the regional-centric USGS/PGE schema were analyzed. In addition, two important data exchange formats were examined: SEED Version 2.3 and the CNSS Composite Catalog Format Version 1.0.1. Both of these formats have been accepted by the CNSS, and were valuable in making sure that the resulting table structures were sufficiently complete that the creation of such products could be easily implemented. The relational schema developed by the IRIS/DMC was not reviewed because it was not completed during the design phase of this project.

 

Almost without exception the various schemata analyzed were relational and claimed to be CSS 3.0 compliant. This latter claim, however, would appear to be most often honored in the breach. While there is considerable similarity in the functions of some of the tables in the various schemata, none are related in the same manner as CSS 3.0 and all adopt different table naming conventions as well as primary and secondary key structures. All of the schemata examined also had a distinctly parochial flavor resulting primarily from the limited scope of the intended application. For example, the CSS/IDC schema is clearly targeted at its primary mission of verification, and is weak when considered as a teleseismic or regional earthquake processing substrate. Similarly, the TriNet/USGS/Berkeley schema has a definite regional network flavor. This should not in any way be taken as a criticism of these efforts, since it would seem that they all are well defined and considered within the functional scope of the projects that developed them. Also these various efforts have been an extremely important source of ideas with respect to the design of the Earthworm 5 DBMS capability.

 

The ensuing discussion progresses through decreasing layers of abstraction. The various layers depict a commonality of function that should make the overall architecture fairly easy to comprehend. As discussed, this is an extension of the layers of abstraction that were found useful in comparing and contrasting various schemata and architectures during the design phase of Earthworm V.

The highest level is represented by the 'CUSP' sequencing schema, which is a completely general approach to automated database process flow control, in this case controlling a real-time, evolving seismological database. Following this is a discussion of what we refer to as the Core Schema, which is an attempt to define a common framework and representation for seismological data. In the NCEDC/USGS/TriNet schema this entire level is generally referred to as the "Parametric Schema". For our purposes we have further divided it into several layers, each containing tables or relations with a distinct commonality of function. The design attempts to reduce representational aspects of seismological attributes to a single, generic form to facilitate the development of an application-independent interface. For example, there are almost as many ways of describing or characterizing phase onsets as there are analytical institutions. Regional networks use an onset descriptor that is some variation of IPU2, denoting an impulsive arrival of a P wave with a vertical first motion and a relative accuracy of two. Teleseismic processing stations use an entirely different method of onset characterization, which emphasizes branches of the travel-time curve. In either case, all such descriptors identify a particular seismic phase (possibly unknown) that was "picked" at a particular time within some measurement accuracy, which for stochastic purposes must be mapped into something akin to a standard error in the data. Unfortunately, the meanings of the components of the description vary widely. In particular, a "2" in the regional example means something quite different if the measurement is made from a helicorder, a develocorder, or a digital record, and in addition tends to be highly subjective, varying from analyst to analyst. It is inconceivable that the burden of sorting all this out should be placed at the application level, although this is often the case. To make matters worse, the definition of the meaning of a phase description has evolved through time even at a single network as the precision of timing has steadily improved. For this reason, the portion of the schema referred to as the Core represents a considerable abstraction from the encoding as originally defined by an analyst. Assigning the actual precision of the measurement is then placed at the Datum level, where the loading of data and the plethora of site-dependent and historical differences can be more easily considered. Thereby, that portion of the schema sitting underneath the applications programming interface is kept as generic as possible. We have rejected the "everything but the kitchen sink" approaches taken to the extreme by SUDS, and to a lesser extent by the USGS/NCEDC/TriNet design, as imposing too great a burden on the application development process.
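The mapping from a site- and era-dependent onset descriptor to the generic Core form might look like the following sketch. The regular expression, the calibration table, and all sigma values are invented for illustration; the point is that the same weight code maps to different standard errors depending on the recording medium, and that this translation happens at load time, below the API.

```python
import re

# hypothetical per-medium calibration of weight codes to timing
# standard errors, in seconds (values invented for illustration)
WEIGHT_TO_SIGMA = {
    "digital":      {0: 0.02, 1: 0.05, 2: 0.1, 3: 0.3},
    "develocorder": {0: 0.1,  1: 0.2,  2: 0.5, 3: 1.0},
}

def decode_onset(descriptor, medium):
    """Map e.g. 'IPU2' to the generic (phase, first_motion, sigma_seconds)."""
    m = re.fullmatch(r"([IE])([PS])([UD]?)(\d)", descriptor)
    if not m:
        raise ValueError("unrecognized onset descriptor")
    onset, phase, motion, weight = m.groups()
    sigma = WEIGHT_TO_SIGMA[medium][int(weight)]   # medium-dependent meaning of the digit
    return phase, motion or None, sigma

print(decode_onset("IPU2", "digital"))        # ('P', 'U', 0.1)
print(decode_onset("IPU2", "develocorder"))   # ('P', 'U', 0.5)
```

With this translation done at the Datum-loading level, applications above the API see only a phase name and a standard error, never the historical encoding.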

 

4.1 State Sequencing Schema

This level consists of 6 tables, as shown in the Entity Relationship Diagram (ERD). At this level of abstraction there is essentially no indication that what is being automated has anything at all to do with seismology. The Serial table is just a DBMS-interlocked pool of unique identifiers, although during the actual implementation it appears that this functionality is provided more effectively by services intrinsic to the Oracle DBMS, and hence it will probably not be implemented in this form. The Entity table, in a general implementation of the state sequencer, would provide a translation between the single-key objects being controlled and whatever key structure is used in the DBMS being controlled. For our purposes it provides a translation between the single numerical key used by the sequencer and the primary key of the object being sequenced. Specifically, Type refers to a particular table, and Path references a single globally unique id.

The State table is the heart of the sequencer and contains one entry for every activity that must occur, generally several for a particular object. The State attribute indicates what action is scheduled, and the Rank denotes a relative priority. The structure of the primary key facilitates rapid scheduling, while ensuring that such processing occurs in the proper sequence. A single object (a row in another core table) might be represented by several rows in the State table, but no processing of a lower-ranking action will occur until all actions at a higher rank have been completed successfully. Consequently, several required processes can be scheduled simultaneously, such as relocation and cataloging, and cataloging will be masked until relocation is successful. The Action table is just a glue relation that describes how a particular action is carried out, and may contain a system command in the Command attribute and any specialization information in the Parameter column. For example, the command might cause a location to occur, and the parameter would carry specialized information such as which velocity model to use. The Param entry in the State table references any of possibly several parameters that are presented to the processing program as a group, so that a single Param entry might group together information such as a particular velocity model as well as a set of stations.

The Transaction table is where the sequencing logic is "programmed" by a site operations manager by means of a simple, visual interface. When an activity (an instantiation of an application program) completes, the activity and the result are used as a partial key that automatically determines what happens next. For example, a location program completing with insufficient convergence would return a result that would cause an analyst to re-examine the phase picks. Relocation would then be automatically rescheduled by the same device once the analyst had successfully corrected the data, or possibly decided that the event should be discarded. This device allows the site operations manager to reprogram the processing trajectories of various objects controlled by the DBMS as mission procedures and requirements evolve.
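The completion-driven scheduling described above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: the table and column names (and the convention that a lower Rank number means higher priority) are assumptions, and the Transaction table is named Trans here only to avoid the SQL keyword.

```python
# Sketch of the CUSP state sequencing logic: State rows schedule pending
# actions by Rank, and the Trans table maps (action, result) pairs to
# whatever action comes next. All names are illustrative assumptions.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE State (idEntity INTEGER, Rank INTEGER, State TEXT,
                    PRIMARY KEY (idEntity, Rank, State));
CREATE TABLE Trans (State TEXT, Result TEXT, NextState TEXT, NextRank INTEGER,
                    PRIMARY KEY (State, Result));
""")
# "Program" the sequencing logic: a failed location sends the event to an
# analyst for review; a successful review reschedules location; a
# successful location schedules cataloging at a lower priority (Rank 2).
db.executemany("INSERT INTO Trans VALUES (?,?,?,?)", [
    ("locate", "no_convergence", "review", 1),
    ("review", "ok",             "locate", 1),
    ("locate", "ok",             "catalog", 2),
])

def next_action(event_id):
    """Return the highest-priority pending action for an object, if any."""
    row = db.execute("SELECT State FROM State WHERE idEntity=? "
                     "ORDER BY Rank LIMIT 1", (event_id,)).fetchone()
    return row[0] if row else None

def complete(event_id, action, result):
    """An activity finished: retire its State row, then schedule whatever
    the transaction logic says should happen next."""
    db.execute("DELETE FROM State WHERE idEntity=? AND State=?",
               (event_id, action))
    nxt = db.execute("SELECT NextState, NextRank FROM Trans "
                     "WHERE State=? AND Result=?", (action, result)).fetchone()
    if nxt:
        db.execute("INSERT OR IGNORE INTO State VALUES (?,?,?)",
                   (event_id, nxt[1], nxt[0]))

# A new event is scheduled for location, which fails to converge.
db.execute("INSERT INTO State VALUES (1, 1, 'locate')")
complete(1, "locate", "no_convergence")
```

After the failed location, `next_action(1)` yields "review"; once the analyst's review completes with "ok", relocation is rescheduled automatically, exactly as in the prose example.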

 

4.2 Core Schema

During the survey of other contemporary seismic schema designs, it was discovered that the various schemata could easily be recast into a canonical form consisting of four functional layers. In the report these were designated Catalog, Summary, Link, and Data. Such a division greatly facilitated the comparison of the various approaches and made it far easier to understand structural differences and the underlying reasons for them. In the discussion of the Core Schema we have followed this same pattern with the inclusion of two additional schema layers designated Infrastructure and External.

The Core Schema represents the overall structure of the informational aspects of the database. As shown, it is incomplete, representing only those tables and relations involved with regional seismic network processing. Similar diagrams exist depicting tables emphasizing global network processing and applications to strong motion data. The various layers actually represent successive levels of abstraction, with the most fundamental data on the right and the most derived layer (the Catalog Layer) on the left. In general, data enters the system on the right and flows toward the left under the guidance of state-driven sequencing logic.

 

4.2.1 Catalog Layer

This is the highest layer of abstraction in the "Core Schema", and essentially consists of lists of things of interest to particular groups. The most familiar of these is the earthquake catalog, which contains the final "preferred" location in summary form. As discussed previously, there may actually be more than one "preferred" location depending on the target audience, and these would be represented in separate catalogs. Other kinds of lists that are supported at this layer and represented by separate tables might be lists of "strong motion" events, tremor events, or interim reports such as the PDE and the NEIC. This is a highly extensible layer, and additional tables are expected to be added, and possibly deleted, frequently. A particular event is only listed once in any one table, although it may, and in general will, appear simultaneously in others.

 

4.2.2 Event Layer

There is only one table in the Event layer, but it plays a pivotal role in supporting extensibility. Its purpose is to provide an identification and a reference point to a collection of objects in the Bind layer. The most familiar kind of event is a "Quake", which would have the tiEvent attribute set to "Quake". The tiEvent attribute also identifies the table by name in the Catalog layer to which the event is assigned. However, as mentioned above, we wish to process and control collections of seismological observations that are associated with discrete physical events but that might not necessarily be associated with a single hypocenter, or any hypocenter at all for that matter. Such an event might be a collection of volcanic tremor records or a set of strong motion records that may be associated with more than one hypocenter. Since the primary key is a globally unique id (GUID), it is capable of being sequenced and controlled by the state sequencer logic. The value of the unique key in the idBind attribute identifies the particular group of objects in the database that are associated with the event.

There exists the possibility within the scope of the design for a particular physical event to have multiple identities. For example, a group of tremor records might on some occasions possess a hypocenter and also be considered an instance of a long-period volcanic earthquake. In such cases, two records in the Event table would share a common value for the idBind attribute, and would be referenced by records in each of two tables in the Catalog level.

 

4.2.3 Bind Layer

The Bind layer is distinctive to the Earthworm 5 DBMS design and reflects the need to associate and aggregate summary and data objects that are substantially more diverse than those encountered in regional seismic network processing. There is only one table in the Bind layer, comprising three attributes, which together form a single primary key. An "object" consists of all rows of this table with a common value for the idBind attribute; a single "object" thereby consists of several rows in the Bind table. The second and third attributes (columns) identify a particular and arbitrary object elsewhere in the Core schema. The tiCore attribute identifies a particular Core schema table, and the idCore attribute refers to the globally unique ID in that table. For example, a small event located with 4 phases would contain 5 rows in the Bind table, all with a common value for the idBind attribute. One of these rows would have "Origin" as its tiCore attribute, with a specific value of idOrigin to reference a particular row in the Origin table. The remaining 4 rows would have a value of "Pick" as their tiCore attributes and unique values of idPick as their idCore values. This composite key consisting of 3 attributes is guaranteed to be globally as well as locally unique. Rows in the Bind table reference specific rows in tables in both the Summary and Datum layers. As an example, an event might initially consist of a collection of phase "Picks" from the Datum layer, and later, after a successful location is achieved, a row in the Origin table in the Summary layer as well. In general, an earthquake event will reference several rows in the Origin table, one of which is the "preferred" origin.
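The 5-row example above can be made concrete with a small sketch. The table and column names follow the text, but the specific id values and the helper function are illustrative assumptions, not part of the design.

```python
# Sketch of the Bind layer: an "object" is every row sharing an idBind
# value, and each row points at one Core-schema row through the
# (tiCore, idCore) pair, so the 3-column key is globally unique.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE Bind (idBind INTEGER, tiCore TEXT, idCore INTEGER,
              PRIMARY KEY (idBind, tiCore, idCore))""")

# A small event located with 4 phases: one Origin row plus four Pick
# rows, all bound together by idBind = 17 (an arbitrary GUID stand-in).
rows = [(17, "Origin", 301)] + [(17, "Pick", p) for p in (401, 402, 403, 404)]
db.executemany("INSERT INTO Bind VALUES (?,?,?)", rows)

def members(id_bind, ti_core=None):
    """List the Core-schema rows aggregated into one Bind object,
    optionally restricted to a single kind of object."""
    q, args = "SELECT tiCore, idCore FROM Bind WHERE idBind=?", [id_bind]
    if ti_core:
        q += " AND tiCore=?"
        args.append(ti_core)
    return db.execute(q, args).fetchall()
```

Querying `members(17)` returns all 5 rows of the object, while `members(17, "Pick")` returns just the 4 phase picks; this is the kind of aggregation GLINT maintains as objects merge or change identity.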

 

The maintenance of the Bind table is the sole responsibility of an application program known as the Global Integrator, or GLINT, although objects will initially be aggregated by various data-population processes representing a preliminary logical association. GLINT will be responsible for merging objects, deleting objects, and changing the identity of objects as additional information becomes available. For example, an object originally identified as a collection of waveforms might, as a result of the application of a picking application, be reassigned as an earthquake and sequenced as such through the sequencing logic for that kind of object.

 

4.2.4 Summary

The Summary layer is present in all of the schemata investigated, although it is not necessarily designated as such. Such summary tables, as shown in the ERD, include earthquake origin tables and various kinds of mechanisms, such as focal mechanisms and moment tensors. A common feature of all tables in the Summary layer is the use of a GUID as the primary key. Because of this, all summary objects are capable of process control through the CUSP sequencing logic. In addition, all tables in the Summary layer contain entries that are distillations of collections of data from the Datum layer, although these connections are not expressed directly but through information contained in the Link layer. It should be noted that the association is not in general homogeneous. For example, a given Origin might be linked to phase picks, three-component station beams, or spatial array beams.

4.2.5 Link

As with other layers discussed, tables in the Link layer have a common column structure. In all cases they contain a composite primary key, each component of which is in itself a primary key in another table. The first is a primary key to a summary object, and the second is a primary key to a datum object. The remaining attributes are referred to as link summary attributes and express information uniquely associated with the kinds of data and summary objects being linked. For example, for an OriginPick entry, such information would include the travel-time branch used by a location program (which in general may be more specific than, or at variance with, that assigned by an analyst), takeoff angles, origin-station distance, and station azimuth. A summary table entry in general will be linked to more than one kind of data used to support it. For example, an origin table entry might be referenced by entries in the OriginPick table as well as the OriginRay table. This design element supports extensibility in that new kinds of data can be used to support old kinds of summaries without changing the structure of tables in either the Summary or Datum layers.

Information in the Link layer tables is populated during two stages of processing. As a specific example, consider the processing involved in associating phase picks with hypocenters. In the first stage, the primary key is constructed by the global associator (GLASS), and the body of the link is subsequently updated every time a location program relocates the event.
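The two-stage population can be sketched as follows. The OriginPick table name comes from the text; the link-body columns, id values, and function names are illustrative assumptions.

```python
# Sketch of two-stage Link population: the associator (GLASS) creates
# the composite key, and each relocation fills in or overwrites the
# body of the link without touching the Summary or Datum rows.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE OriginPick (
    idOrigin INTEGER, idPick INTEGER,
    Branch TEXT, Distance REAL, Azimuth REAL,
    PRIMARY KEY (idOrigin, idPick))""")

def associate(id_origin, id_pick):
    # Stage 1: GLASS builds the composite key; the link body is empty.
    db.execute("INSERT INTO OriginPick (idOrigin, idPick) VALUES (?,?)",
               (id_origin, id_pick))

def relocate(id_origin, id_pick, branch, dist_km, azimuth):
    # Stage 2: the location program updates the link summary attributes
    # (travel-time branch, origin-station distance, azimuth) on each run.
    db.execute("""UPDATE OriginPick SET Branch=?, Distance=?, Azimuth=?
                  WHERE idOrigin=? AND idPick=?""",
               (branch, dist_km, azimuth, id_origin, id_pick))

associate(301, 401)
relocate(301, 401, "Pg", 42.5, 135.0)
link = db.execute("SELECT Branch, Distance FROM OriginPick").fetchone()
```

Because only the ephemeral link row is updated on each relocation, the heavily shared summary and datum rows stay read-only, which is the locking argument developed in the next paragraph.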

This design element specifically supports the transactional nature of the Earthworm 5 design criteria. The more robust schemata examined during the design phase, especially those designed to support real-time systems, all possess similar logic. The reasons for this, of course, are well founded. The Link layer tables are highly ephemeral, particularly in the minutes following a seismic event. Tables in this layer are modified continuously, but generally only by a single process. If the link information were embedded in the summary and datum tables, as would be appropriate for a data-warehousing design, then those tables would be constantly updated. Since summary and datum tables are also referenced by a variety of other processes, the result would be data lock conditions and data flow bottlenecks. Although there are two applications that process data in the link tables, that is, an associator and a summarizer (e.g., a location program), the location processes involved in association are controlled by the associator, so data locks need not occur. In the case of final location calculations controlled by the event sequencer, the links being processed are not in general part of the association process, so again record-level locking does not occur.

 

4.2.6 Datum

Tables in the Datum layer for the most part carry primary observational data, and there is one table for each kind of data that is being collected. Information in these tables is static, and is not modified during subsequent processing. Each table in the Datum layer also has a primary key which is a GUID, and as such is controllable by the event sequencer. The GUID is also important with respect to the exchange of data and maintaining the identity of data between cooperating nodes in the Earthworm 5 distributed architecture.

There is more complexity in the structure and relations of this layer than is shown in the associated ERD. For example, one kind of object in this layer is a waveform snippet, which might be referenced by a phase pick from which it is derived. A three-component beam references three or six waveform snippets, while a spatial beam would in general reference an arbitrary number of waveform snippets. An exponential coda decay function, from which magnitudes of small local earthquakes are often calculated, references an undefined number of coda segments, which are themselves also Datum layer elements.

Objects in the Datum layer, as with objects in the Summary layer, also possess an optional composite foreign key consisting of idSource, tiExternal, and xidExternal. In the Datum layer this reference is particularly important because, as discussed in the general introduction to this section, the representation and interpretation of the same data from different sources vary widely. The Core schema attempts to render all such information into a common, generic form during data population, but the translation is undoubtedly not always accurate, so it appears necessary to also preserve the data in its original representation. This may be less necessary than we currently believe, but for now it is our working assumption.

 

 

4.3 Infrastructure Schema

The infrastructure schema, along with the CUSP and Core schemata, completes the EW5 specification. The infrastructure schema comprises station location, configuration, and response information. The current design was intended to satisfy a number of goals including performance, flexibility, and generality. The need for performance is obvious, as key tables in the infrastructure may be accessed hundreds of thousands of times a day in routine operation. The requirement for flexibility and generality arises from several sources. First, it is necessary to consider the needs for identifying the station-component-channel definitions of a wide variety of fixed seismic network configurations, as well as the needs of strong motion and portable seismic networks. In addition, response information reflects physical hardware, firmware, and software that continue to evolve at a rapid pace.

Although one of the primary requirements of the infrastructure schema is that it must be straightforward to generate SEED volume headers, the schema is actually hardware-centric. There are a number of reasons for this. The most important reason for a hardware-centric design is that each seismic network will have primary responsibility for tracking the equipment and related responses for the stations it installs and operates. Building responses from the hardware up is a convenient, intuitive way of maintaining response information. It also has the added advantage of imposing real-world constraints on the responses generated. For example, all like channels from a typical data logger use the same sequence of FIR filters, and all channels, at any sample rate, derived from the same seismic sensor must share the same set of instrumental poles and zeros. Finally, a hardware-centric approach makes it much more intuitive to determine exactly what has changed at each station-epoch boundary.

 

4.3.1 Data Channel Identification.

Station-component-channel information is defined through the Site, Comp, and Chan tables respectively. Each table has both static and time-dependent components, recognizing that a station-component-channel may exist under one nomenclature even though its characteristics may have changed with time. In general, the SEED recommendations for nomenclature have been followed (e.g., network, station, component-channel, and location codes). However, these identifiers have been generalized to support a range of possibilities that are not currently convenient in SEED, including free-field strong motion instruments, strong motion building arrays, and portable seismic networks. Note that the Site, Comp, and Chan tables have been purposely denormalized to accelerate performance.

 

4.3.2 SEED Structure.

For each epoch, each channel points to a row in the Device table that may point to another row of the Device table, etc. A sequence of Device table entries is referred to as a device chain. Each Device table entry points to a row of a table associated with a particular type of hardware/software unit (e.g., sensor, amplifier, digitizer, filter, decimator, etc.). Thus, rows of the device table may be thought of as being one-to-one with SEED blockettes. However, to economically accommodate sequences of cascaded units, the Device chains are specified in reverse order (i.e., from output to input). This allows channels derived from the same component through a cascade of FIR filters to point to different starting places in the same device chain.
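The reverse-order chaining can be illustrated with a toy structure. The Device table is modeled here as a simple dictionary, and the unit names and row ids are invented for illustration only.

```python
# Sketch of a device chain stored in reverse order (output to input):
# each Device row points to the next stage via idNext (None terminates
# at the sensor). Two channels decimated from the same sensor enter
# the same chain at different starting rows.
devices = {
    1: {"unit": "FIR_100sps", "idNext": 2},   # extra decimation stage
    2: {"unit": "FIR_200sps", "idNext": 3},
    3: {"unit": "digitizer",  "idNext": 4},
    4: {"unit": "sensor",     "idNext": None},
}

def response_chain(start_row):
    """Walk a device chain from a channel's starting row back to the
    sensor, collecting the hardware/software units in order."""
    units, row = [], start_row
    while row is not None:
        units.append(devices[row]["unit"])
        row = devices[row]["idNext"]
    return units

# A 100 sps channel enters at row 1; a 200 sps channel skips the extra
# FIR stage by entering at row 2, sharing the rest of the chain.
chain_100 = response_chain(1)
chain_200 = response_chain(2)
```

This shows why output-to-input ordering is economical: the two cascaded channels share rows 2 through 4 rather than duplicating the chain.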

 

Effective date-time ranges for a response are stored at the lowest level. That is, a device chain doesn't have a date-time range, but a sensor, for example, does. This organization recognizes that an epoch begins at the date-time that equipment was physically repaired or replaced. Thus the epochs for a channel are defined by the union of the date-time ranges of all units pointed to by the associated device chain. The epoch date-time ranges are saved in the Chan table for quick reference. Again, the response information has been purposely denormalized to accelerate performance. For each channel epoch, the raw response is approximated by the sensor poles and zeros and the overall channel gain for the purpose of computing magnitudes. This accelerator table, ChanTransferFunction, is accessed directly from the Chan table and will be generated automatically by application software as raw response information is updated.
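The epoch derivation described above, where a channel's epoch boundaries come from the union of its units' date-time ranges, can be sketched as a small function. The representation of ranges as (start, end) pairs, and the use of years instead of full date-times, are simplifying assumptions.

```python
# Sketch of channel epoch derivation: an epoch boundary falls wherever
# any unit in the channel's device chain was repaired or replaced, so
# the channel's epochs are cut at the union of all unit boundaries.
def channel_epochs(unit_ranges):
    """unit_ranges: (start, end) pairs, one per unit in the device chain.
    Returns the channel's epochs as consecutive boundary pairs."""
    boundaries = sorted({t for rng in unit_ranges for t in rng})
    return list(zip(boundaries, boundaries[1:]))

# A sensor in service 2000-2010 whose digitizer was swapped in 2005:
# the channel therefore has two epochs, split at the 2005 swap.
epochs = channel_epochs([(2000, 2010), (2000, 2005), (2005, 2010)])
```

The swap of a single unit splits the channel into two epochs even though the sensor itself never changed, which is precisely the behavior cached in the Chan table for quick reference.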

 

4.3.3 Modules.

Modern digital seismograph systems result in networks which may have to track tens of thousands of channel epochs, each with its own device chain. In reality, most of this complexity is often hidden inside modules (e.g., data loggers) which change infrequently and usually as a whole. In addition, one network may have many modules that are essentially identical. This factor has been exploited in the infrastructure schema to simplify the maintenance of device chains. Thus, a Device table row may point to a row in the Module table rather than to a hardware/software unit table. The behavior of the module is itself defined by another device chain (e.g., a sequence of FIR filters). Because the module typically defines the behavior of many channels, it then points to a row in the Plexor table specific to the desired channel. The plexor entry allows an additional device chain to be associated only with the channel in question (e.g., a gain). A module can be thought of as a device chain subroutine. Although it is not necessary to ever use modules, one could simplify the maintenance of response information by constructing a set of modules to describe the internals of a common data logger. This set of modules could then be used unchanged for all similar data loggers throughout the network, with varying elements such as channel gains being centralized in the Plexor table.
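The "device chain subroutine" idea can be sketched as a flattening step. Everything here is an illustrative assumption: the module and unit names, the dictionary representation of the Module and Plexor tables, and the ordering of the per-channel Plexor stages relative to the module internals.

```python
# Sketch of module expansion: a device-chain entry may name a module
# (a reusable "subroutine" of units); the module's shared internals are
# inlined, followed by the channel-specific stages from the Plexor.
def expand(chain, modules, plexor, channel):
    """Flatten a device chain, inlining any module for the given channel."""
    units = []
    for entry in chain:
        if entry.startswith("module:"):
            name = entry.split(":", 1)[1]
            units.extend(modules[name])            # shared internals
            units.extend(plexor[(name, channel)])  # channel-specific tail
        else:
            units.append(entry)
    return units

# One module describes a common data logger; the Plexor centralizes the
# per-channel variations (here, different gains for HHZ and HHN).
modules = {"logger": ["FIR_stage1", "FIR_stage2", "digitizer"]}
plexor = {("logger", "HHZ"): ["gain_x1"], ("logger", "HHN"): ["gain_x2"]}
chain_hhz = expand(["module:logger", "sensor"], modules, plexor, "HHZ")
```

The same module definition serves every similar data logger in the network; only the small Plexor entries differ per channel.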

 

4.3.4 Inventory Control.

Networks generally have a statutory responsibility to track their inventory of seismic equipment. The infrastructure schema accommodates this need by providing a site configuration table. Site configuration information also makes the maintenance of channel response information much more intuitive. The idea is to develop applications that use the inventory of equipment installed at each site to create the associated device chains. Note that inventory control information will probably not be available for stations acquired from cooperating networks. In this case, the device chain would have to be constructed directly from response information provided by the cooperating network. For example, device chains could be built from a dataless SEED volume. However, this approach is not recommended because it would be analogous to an implicit numerical algorithm (e.g., determining which FIR filter a SEED blockette represents would require comparing the coefficients of the FIR filter to the coefficients of all known FIR filters).

 

4.3.5 Historical Configuration "Snapshots"

Because epoch date-time ranges are saved for each channel, constructing historical response and/or configuration snapshots is straightforward. This capability supports such requirements as generating dataless SEED volumes, providing response information for historical data accessed via AutoDRM, etc. It also makes it possible to smoothly handle the impact of field equipment changes on routine processing, particularly for events that may be revised some time after they are recorded.


Questions? Issues? Subscribe to the Earthworm List (earthw).