This document presents an assessment of functional requirements and current design for Phase 5 of the Earthworm project. The early stages of this project have been ongoing for over 10 years, and the major development phases are summarized below. Phase 5 focuses on adding distributed database functionality to facilitate the distribution of data among the many cooperating institutions, to simplify the management of the operation and maintenance of the distribution systems, and to provide support for data analysis at multiple nodes and at several central coordination sites.
The issues discussed here arose during discussions with a large number of individuals, including network operators, scientists, commercial end users, and field operations personnel.
Some areas were felt to be sufficiently outside the scope of this project that their needs were not considered in a direct manner. Topical array deployments, such as aftershock monitoring and response to an impending volcanic crisis, were not addressed directly. The archiving of historical data other than instrumental information was also considered to be out of scope. Several institutions, notably the IRIS/DMC and the regional archives at the University of California at Berkeley and the California Institute of Technology, are well evolved, and there seemed to be no need to duplicate this functionality. To this end, supplying data to data archiving centers in appropriate formats was deemed important, and to this extent such facilities are addressed as "End Users" with respect to design issues. We also examined the documentation supporting the "Early Bird" as an example of an operational system for the rapid dissemination of preliminary summary information. Members of the development team also have substantial knowledge of the CUSP system for the automated processing of regional networks. The CUSP system has been an important contributor to regional analysis for the past 20 years, although it is clear that at this time it is getting a bit "long in the tooth".
The following section presents a reasonably complete list of what emerged as the most important issues during early design discussions. Not all of these are explicitly solved in the resulting design, and many are substantially out of scope with respect to current mandates. However, we have tried to ensure that the design elements supporting mandated functionality do not preclude addressing some of these other functional issues as time and resources permit.
Although the choice of database structure is central and critical, there is almost no issue regarding which course to pursue. Early on it became clear that we wished to use a modern, commercial database engine as the foundation. Proprietary data structures and "roll your own" databases have almost universally met with disaster. Fortunately there is at this time almost a complete consensus within the seismological community to use the Oracle relational DBMS, and as such this was also our choice.
Unfortunately the community consensus does not extend further to issues regarding schema design. During the design phase of the project we considered a rather large number of approaches from an equally large number of organizations. A report commissioned by the USGS Western Regional Office in Menlo Park [HYPERLINK DAVE’S REPORT] was particularly instrumental in facilitating this phase of the project. Database schemata from the UCSB Berkeley Consortium, the original CSS schema, the revised schema being used by the IDC, and the regional-centric USGS/PGE schema were analyzed. In addition, two important data exchange formats were examined: SEED Version 2.3 and the CNSS Composite Catalog Format Version 1.0.1. Both of these formats have been accepted by the CNSS, and they were valuable in ensuring that the resulting table structures were sufficiently complete that the creation of such products could be easily implemented. The relational schema developed by the IRIS/DMC was not reviewed because it was not completed during the design phase of this project.
Almost without exception the various schemata analyzed were relational and claimed to be CSS 3.0 compliant. This latter claim, however, would appear to be most often honored in the breach. While there is considerable similarity in the functions of some of the tables in the various schemata, none are related in the same manner as CSS 3.0 and all adopt different table naming conventions as well as primary and secondary key structures. All of the schemata examined also had a distinctly parochial flavor resulting primarily from the limited scope of the intended application. For example, the CSS/IDC schema is clearly targeted at its primary mission of verification, and is weak when considered as a teleseismic or regional earthquake processing substrate. Similarly, the TriNet/USGS/Berkeley schema has a definite regional network flavor. This should not in any way be taken as a criticism of these efforts, since it would seem that they all are well defined and considered within the functional scope of the projects that developed them. Also these various efforts have been an extremely important source of ideas with respect to the design of the Earthworm 5 DBMS capability.
The approach currently taken, after some considerable internal discussion and compromise, is to use a 13-digit number. The first four digits identify the originating Earthworm DBMS (permitting up to 10,000 such distributed processing centers), and the remaining nine digits are a serial number (permitting 1,000,000,000 unique local identifiers). This number need only be unique with respect to a particular table. For example, a hypocenter and a phase pick may share the same number without being considered to be in any way related. [ARGHHH – I don’t think 1,000,000 is enough – can we go to a 16 digit number with 4 and 12, or 5 and 11?]
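The identifier arithmetic described above can be sketched in a few lines. The split into a 4-digit center id and a 9-digit serial number follows the text; the function names are illustrative only.

```python
# Sketch of the proposed 13-digit identifier: the first four digits identify
# the originating Earthworm DBMS node, the remaining nine are a serial number
# unique within that node (per table). Function names are hypothetical.

def make_id(center: int, serial: int) -> int:
    """Compose a 13-digit identifier from a center id and a serial number."""
    assert 0 <= center < 10_000, "up to 10,000 distributed processing centers"
    assert 0 <= serial < 1_000_000_000, "nine-digit local serial number"
    return center * 1_000_000_000 + serial

def split_id(ident: int) -> tuple[int, int]:
    """Recover (center, serial) from a composed identifier."""
    return divmod(ident, 1_000_000_000)

# Example: serial 42 issued by processing center 123
ident = make_id(123, 42)
assert ident == 123000000042
assert split_id(ident) == (123, 42)
```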
The approach that we have taken is to try to achieve conformability not at the level of the underlying schemata, a task that can probably never be effectively completed, but at the level of the application programming interface, or API. In the development of commercial software, where compatibility is clearly required for product marketability, this approach has been adopted almost universally. Because of the high cost of developing and maintaining software, "programming to the metal" can seldom be justified except in the most extreme real-time situations. Instead, one of only a few application programming interfaces is used, and low-level drivers are made available to map these capabilities in an efficient manner to rapidly evolving hardware capabilities.
A simple example from seismological application development should help to illustrate this point. It is not difficult to imagine a simple function to retrieve the location of a seismic station given its official name, and such functions can be easily standardized. The details of the underlying file structures or database schemata supporting this request should in general be of no particular interest to the applications developer as long as the function performs as expected. While the efficiency of such an implementation might vary depending on ancillary design considerations, the functionality will not. For this reason we have adopted a design philosophy that achieves conformability at the API level and not at the level of the underlying schema architecture. An example of the success of this approach is presented by the IRIS/DMC where the database engine and database schema were completely modernized with minimal changes needed at the application level. We are hopeful that standardization at the level of the API can be achieved within the seismological community through its national organizations without becoming distracted by issues of schemata design and mission oriented process flow requirements.
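The station-location example might look like the following sketch. The function name, the returned tuple, and the in-memory table standing in for the real schema are all assumptions for illustration; the station entry is invented.

```python
# Illustrative sketch of API-level conformability: applications call a
# standardized function, while the backing store (here a plain dict standing
# in for any schema or DBMS) can be replaced without touching application
# code. The station name and coordinates below are invented for the example.

_STATIONS = {
    "ABC": (34.15, -118.17, 295.0),  # hypothetical (lat, lon, elevation_m)
}

def get_station_location(name: str) -> tuple[float, float, float]:
    """Return (latitude, longitude, elevation_m) for an official station name."""
    try:
        return _STATIONS[name]
    except KeyError:
        raise LookupError(f"unknown station: {name}") from None

lat, lon, elev = get_station_location("ABC")
```

Swapping `_STATIONS` for an Oracle query, a flat file, or a different schema changes nothing at the application level, which is precisely the conformability being argued for.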
In the Earthworm 5 DBMS design this capability is provided by a set of 6 tables shown in [CUSP SCHEMA HYPERLINK HERE]. The logic is exactly the same as that used by CUSP and leverages the experience obtained over the years from the application of this approach within the context of regional network processing. Here the logic sequencer is used to ensure that all hypocenters requiring human review are analyzed, and that certain processes such as archiving and final magnitude calculations are deferred until previously required levels of analysis are completed. The much more common alternative is sometimes referred to as the "cattle prod" approach, requiring analysts to be very cognizant of processing rules and sequences and to explicitly schedule required activity at the appropriate time. The net result, particularly considering the experience level of routine analysts, is lost events and delayed completion of required products. A sequencing mechanism is also valuable when a large number of analysts are working as a team, since it prevents redundant tasking and improves efficiency of execution by eliminating the mundane tasks of work assignment and completion management.
The design provides a list of objects with associated processing requirements (e.g. waveform repicking and mandatory reporting) and associated states. An analyst responsible for waveform repicking can then easily retrieve an event to analyze based on a prioritized schedule and without regard to other details of processing requirements. Following successful analysis, an event is automatically scheduled for subsequent required operations, which can easily be modified by administrative personnel without extensive retraining of the analysis staff. The logic as implemented provides for multiple branching paths, so that a particular event might be scheduled for inclusion in a variety of final report products based on such things as event magnitude or geographical criteria. Object states are carried in the State table shown in the diagram, the Action table provides specific information on how a state transition is implemented, and the Transition table is where a head analyst programs the automation sequencing. States can be blocked by higher-priority states as controlled by the Rank parameter in the State table. Thus final magnitude calculations are blocked until any required human analysis is completed. By this device, hypocenters with a preliminary magnitude can be finalized without requiring any further human intervention, allowing networks to process to a much lower magnitude threshold than would be the case without such a mechanism for automating processing activities.
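The Rank-based blocking just described can be illustrated with a small sketch. The pending-state table is represented here as plain tuples, and the state names and rank ordering (lower number = higher priority) are assumptions for illustration.

```python
# Minimal sketch of rank-based state blocking: each object carries several
# pending states, and a state is runnable only while no incomplete state of
# higher rank remains. State names and the rank convention are invented.

pending = [
    # [object_id, state, rank, done] -- lower rank number = higher priority
    [1001, "FinalAnalysis", 1, False],
    [1001, "FinalMagnitude", 2, False],
    [1001, "Catalog", 2, False],
]

def runnable(rows, object_id):
    """States for object_id not blocked by an incomplete higher-rank state."""
    mine = [r for r in rows if r[0] == object_id]
    out = []
    for _, state, rank, done in mine:
        if done:
            continue
        if not any(r[2] < rank and not r[3] for r in mine):
            out.append(state)
    return out

assert runnable(pending, 1001) == ["FinalAnalysis"]
pending[0][3] = True  # required human analysis completes...
assert runnable(pending, 1001) == ["FinalMagnitude", "Catalog"]  # ...unblocking rank 2
```

Note how the two rank-2 activities become schedulable simultaneously once the blocking analysis is done, mirroring the deferral of final magnitudes behind human review.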
Other considerations and requirements also arose during discussions with those involved in processing other kinds of seismic data, where a hypocenter is not necessarily a centralizing theme or even of particular interest. For example, volcano seismic observatories frequently detect discrete very-long-period tremor events for which any kind of location is difficult or impossible, yet it is clear that the signals were caused by some common physical process. Tremor signals resulting from the passage of fluid magma through a conduit system are one example of signals for which a hypocentral representation is not only inappropriate but misleading. Another consideration arose during interviews with those involved in strong motion seismology and earthquake engineering, where the concept of discrete events is often blurred. Particularly in the area of seismic engineering, a group of digital accelerograms might be organized with respect to a sequence of events, where the calculation of a hypocenter for an embedded strong event would be futile. Here the focus is primarily on characterizing the wave field as it affects buildings, and the hypocenter, or the initiation point of a complex rupture, is not of particular interest. A strong motion event then might include multiple hypocenters or none at all.
With respect to NEIC processing requirements, information arrives in a wider variety than that experienced by regional centers, and such information needs to be aggregated and organized, possibly for a considerable length of time, before even an initial preliminary location can be achieved. Such information comprises contributed hypocenters from regional processing centers without supporting phase or waveform data, reading groups from remote sites, single-station or local-array data, and unpicked waveform snippets, to name a few examples. Some of these, such as unpicked waveforms, might need to be scheduled for automated analysis before a centralizing preliminary hypocenter could be achieved.
The schema design in [CORE SCHEMA HYPERLINK], discussed in greater detail below, provides for these needs by introducing a new layer in the schema hierarchy designated the "Bind". It lies between the more traditional catalog layer and the summary layer, which includes such objects as hypocenters and magnitudes. An object in the bind layer could be thought of as a container that might contain an arbitrary collection of summary information and data, such as tentative phase associations pending processing by a general associator. There is actually only a single table represented in the bind layer as shown. As an identifiable system object, however, objects in the Bind table can be scheduled through the automated state driven scheduler. One simple example would be scheduling preliminary association triggered by the addition of an associatable datum such as a phase pick or single station beam. An object in the bind layer might also represent a collection of tremor signals or all strong motion records during an arbitrary period of time. This aspect of the architecture also supports the need for extensibility in that new ways of collecting seismic observations can easily be added with required automated processing capabilities without impacting pre-existing structures and processes.
The ensuing discussion progresses through decreasing layers of abstraction. The various layers exhibit a commonality of function that should make the overall architecture fairly easy to comprehend. As discussed, this is an extension of the layers of abstraction that were found useful in comparing and contrasting various schemata and architectures during the design phase of Earthworm 5.
The highest level is represented by the "CUSP Schema", which is a completely general approach to automated database process flow control. That it is controlling a real-time, evolving seismological database is of no concern at this level of abstraction. Following this is a discussion of what we refer to as the Core Schema, which is an attempt to define a common framework and representation for seismological data. In the NCEDC/USGS/TriNet schema this entire level is generally referred to as the "Parametric Schema". For our purposes we have further divided it into 7 layers, each containing tables or relations with a distinct commonality of function. It should be noted that it is primarily this layer that forms the foundation of the API, or application programming interface. The design attempts to reduce representational aspects of seismological attributes to a single, generic form to facilitate the development of an application-independent interface. For example, there are almost as many ways of describing or characterizing phase onsets as there are analytical institutions. Regional networks use an onset descriptor that is some variation of IPU2, to denote an impulsive arrival of a P wave with a vertical first motion and a relative accuracy of two. Teleseismic processing stations use an entirely different method for onset characterization, one which emphasizes branches of the travel-time curve. In either case, all such descriptors refer to a particular seismic phase (possibly unknown) that was "picked" at a particular time within some measurement accuracy, which for stochastic purposes must be mapped into something akin to a standard error in the data. Unfortunately, what is meant by components of the description varies widely. In particular, a "2" in the regional example means something quite different if the measurement is made from a helicorder, a develocorder, or a digital record, and in addition tends to be highly subjective, varying from analyst to analyst.
It is inconceivable that the burden of sorting all this out should be placed at the application level, although this is often the case. To make matters worse, the definition of the meaning of a phase description has evolved through time even at a single network as the precision of timing has steadily improved. For this reason, the portion of the schema referred to as the Core represents a considerable abstraction from the encoding as originally defined by an analyst. Assigning the actual precision of the measurement is then placed at the Datum level, where the loading of data and the plethora of site-dependent and historical differences can be more easily considered. Thereby, the portion of the schema sitting underneath the application programming interface is kept as generic as possible. We have rejected the "everything but the kitchen sink" approaches taken to the extreme by SUDS, and to a lesser extent by the USGS/NCEDC/TriNet design, as imposing too great a burden on the application development process.
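As a sketch of the kind of transcription that would live with the Datum-level loading code, a regional onset descriptor such as IPU2 could be decoded roughly as follows. The weight-to-standard-error table is entirely hypothetical; it is exactly the sort of site- and era-dependent mapping the text argues belongs with the loading applications rather than the generic Core.

```python
# Decode a regional onset descriptor like "IPU2" into a generic form:
# onset quality, phase, first motion, and a pick standard error in seconds.
# The weight-to-sigma mapping below is an invented placeholder; in practice
# it would vary by site, recording medium, and era.
import re

WEIGHT_TO_SIGMA = {0: 0.02, 1: 0.05, 2: 0.10, 3: 0.20, 4: 0.50}  # assumed

def parse_regional_onset(code: str):
    """Decode e.g. 'IPU2' -> (onset, phase, first_motion, sigma_seconds)."""
    m = re.fullmatch(r"([IE])([PS])([UDC]?)(\d)", code)
    if not m:
        raise ValueError(f"unrecognized onset descriptor: {code}")
    onset, phase, motion, weight = m.groups()
    return onset, phase, motion or None, WEIGHT_TO_SIGMA[int(weight)]

assert parse_regional_onset("IPU2") == ("I", "P", "U", 0.10)
```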
This of course results in a potential loss of information content, which for historical and accuracy purposes must be preserved. Information in the original format is contained at the "External Level", and transcriptions to the generic forms used at higher levels of abstraction are carried out by the database-populating applications. These need to be crafted as each potential source of data is integrated into the Core Schema, but it is also precisely here that these issues can most easily be resolved. In a sense, the schema at this level embraces all sources of data that have been assimilated, with their intrinsic relational structures remaining intact. As a result, and as a specific example, the generic core can be made to mimic any other relational schema, such as that developed by the USGS/NCEDC/TriNet effort. For purposes of external interfacing in either direction, the appearance of these other representations can be presented. The higher levels of abstraction then serve only to integrate, not to replace, those developed by other parties.
This level consists of 6 tables, as shown in the Entity Relationship Diagram (ERD) [HYPERLINK HERE]. At this level of abstraction there is essentially no indication that what is being automated has anything at all to do with seismology. The Serial table is just a DBMS-interlocked pool of unique identifiers, although during the actual implementation it appears that this functionality is provided more effectively by services intrinsic to the Oracle DBMS, and hence will probably not be implemented in this form. The Entity table in a general implementation of the CUSP sequencer would provide a translation between the single-key objects being controlled and whatever key structure is used in the DBMS being controlled. For our purposes it provides a translation between the single numerical key used by the sequencer and the primary key of the object being sequenced. Specifically, Type refers to a particular table, and Path references a single globally unique id.
The State table is the heart of the CUSP sequencer and contains one entry for every activity that must occur, generally several for a particular object. The State attribute indicates what action is scheduled, and the Rank denotes a relative priority. The structure of the primary key facilitates rapid scheduling while ensuring that such processing occurs in the proper sequence. A single object (row in another core table) might be represented by several rows in the State table, but no processing of a lower-ranking action will occur until all actions at a higher rank have been completed successfully. Consequently, several required processes can be scheduled simultaneously, such as relocation and cataloging, and cataloging will be masked until relocation is successful. The Action table is just a glue relation that describes how a particular action is carried out, and may contain a system command in the Command attribute and any specialization information in the Parameter column. For example, the command might cause a location to occur and the parameter would supply specialized information such as what velocity model to use. The actual entry references any of possibly several parameters that are presented to the processing program as a group, so that a single Param entry in the State table might group together information such as a particular velocity model as well as a set of stations.
The Transition table is where the sequencing logic is "programmed" by a site operations manager by means of a simple, visual interface. When an activity (an instantiation of an application program) completes, the activity and the result are used as a partial key that automatically determines what happens next. For example, a location program completing with insufficient convergence would return a result that would cause an analyst to re-examine the phase picks. Relocation would then be automatically rescheduled by the same device once the analyst successfully corrected the data, or possibly decided that the event should be discarded. This device allows the site operations manager to reprogram the processing trajectories of various objects controlled by the DBMS as mission procedures and requirements evolve.
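The interplay of the Action and Transition tables might be sketched as follows, using an in-memory SQLite database purely as a stand-in for Oracle. The column names and sample rows are assumptions based on the prose, not the actual Earthworm 5 DDL, and the command strings are invented.

```python
# Sketch of Action/Transition sequencing: Action describes how an activity is
# carried out; Transition maps (completed activity, result) to the next
# activity. Table and column names loosely follow the prose; all sample
# values are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Action (
    State     TEXT PRIMARY KEY,  -- activity name
    Command   TEXT,              -- system command that carries it out
    Parameter TEXT               -- specialization, e.g. velocity model
);
CREATE TABLE Transition (
    State  TEXT,                 -- activity that just completed
    Result TEXT,                 -- its exit status
    Next   TEXT,                 -- activity to schedule next
    PRIMARY KEY (State, Result)
);
""")
db.execute("INSERT INTO Action VALUES ('Locate', 'run_locator', 'model=regional')")
db.execute("INSERT INTO Transition VALUES ('Locate', 'NoConverge', 'Review')")
db.execute("INSERT INTO Transition VALUES ('Locate', 'OK', 'Catalog')")

def next_state(state: str, result: str):
    """What is scheduled next when an activity completes with a given result."""
    row = db.execute("SELECT Next FROM Transition WHERE State=? AND Result=?",
                     (state, result)).fetchone()
    return row[0] if row else None

# A location that fails to converge routes the event back to an analyst.
assert next_state("Locate", "NoConverge") == "Review"
```

Reprogramming a processing trajectory is then just an update to the Transition rows, with no change to the applications themselves.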
During the survey of other contemporary seismic schema designs [HYPERLINK DAVE’S REPORT] it was discovered that the various schemata could easily be recast into a canonical form consisting of four functional layers. In the report these were designated Catalog, Summary, Link, and Data. Such a division greatly facilitated the comparison of the various approaches and made it far easier to understand structural differences and the underlying reasons for them. In the discussion of the Core Schema [HYPERLINK CORE SCHEMA] we have followed this same pattern with the inclusion of four additional schema layers designated Event, Bind, Infrastructure, and External.
The Core Schema represents the overall structure of the informational aspects of the database. As shown, it is incomplete, representing only those tables involved with regional seismic network processing and relations. Similar diagrams exist depicting tables emphasizing global network processing and applications to strong motion data. The various levels represent successive levels of abstraction, with the most fundamental data on the right (External Layer) and the most derived layer on the left (Catalog Layer). In general, data enters the system on the right and flows toward the left under the guidance of state-driven sequencing logic.
This is the highest layer of abstraction in the "Core Schema", and essentially consists of lists of things of interest to particular groups. The most familiar of these is the earthquake catalog, which contains the final "preferred" location in summary form. As discussed previously, there may actually be more than one "preferred" location depending on the target audience, and these would be represented in separate catalogs. Other kinds of lists supported at this layer and represented by separate tables might be lists of "strong motion" events, tremor events, or interim reports such as the PDE and the NEIC. This is a highly extensible layer, and additional tables are expected to be added, and possibly deleted, frequently. A particular event is listed only once in any one table, although it may, and in general will, appear simultaneously in others.
There is only one table in the Event layer, but it plays a pivotal role in supporting extensibility. Its purpose is to provide an identification and a reference point for a collection of objects in the Bind layer. The most familiar kind of event is a "Quake", which would have the tiEvent attribute set to "Quake". The tiEvent attribute also identifies the table by name in the Catalog layer to which the event is assigned. However, as mentioned above, we wish to process and control collections of seismological observations that are associated with discrete physical events but that might not necessarily be associated with a single hypocenter, or any hypocenter at all for that matter. Such an event might be a collection of volcanic tremor records or a set of strong motion records that may be associated with more than one hypocenter. Since the primary key is a globally unique id (GUID), it is capable of being sequenced and controlled by the CUSP sequencer logic. The value of the unique key in the idBind attribute identifies the particular group of objects in the database that are associated with the event.
There exists the possibility within the scope of the design for a particular physical event to have multiple identities. For example, a group of tremor records might on some occasions possess a hypocenter and also be considered an instance of a long-period volcanic earthquake. In such cases, two records in the Event table would share a common value for the idBind attribute, and would be referenced by records in each of two tables in the Catalog level.
The Bind layer is particular to the Earthworm 5 DBMS design and reflects the need to associate and aggregate summary and data objects that are substantially more diverse than those encountered in regional seismic network processing. There is only one table in the Bind layer, comprising three attributes which together form a single composite primary key. An "object" consists of all rows of this table with a common value for the idBind attribute; a single "object" thereby consists of several rows in the Bind table. The second and third attributes (columns) identify a particular and arbitrary object elsewhere in the Core schema: the tiCore attribute identifies a particular Core schema table, and the idCore attribute refers to the globally unique ID in that table. For example, a small event located with 4 phases would contain 5 rows in the Bind table, all with a common value for the idBind attribute. One of these rows would have "Origin" as its tiCore attribute, with a specific value of idOrigin to reference a particular event in the Origin table. The remaining 4 rows would have a value of "Pick" as their tiCore attributes and unique values of idPick as their idCore values. This composite key consisting of 3 attributes is guaranteed to be globally as well as locally unique. Rows in the Bind table reference specific rows in tables in both the Summary and Datum layers. As an example, an event might initially consist of a collection of phase "Picks" from the Datum layer, and later, after a successful location is achieved, a row in the Origin table in the Summary layer as well. In general, an earthquake event will reference several rows in the Origin table, one of which is "preferred".
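The worked example above (a small event located with 4 phase picks) can be laid out concretely as follows. The attribute names follow the prose; the specific id values are invented.

```python
# The 4-pick event example: 5 rows in the Bind table sharing one idBind
# value, one row pointing at the Origin table and four at the Pick table.
# All id values are invented for illustration.

bind_rows = [
    # (idBind, tiCore,   idCore)
    (7001,    "Origin",  5001),   # the hypocenter in the Origin table
    (7001,    "Pick",    6001),   # four phase picks in the Pick table
    (7001,    "Pick",    6002),
    (7001,    "Pick",    6003),
    (7001,    "Pick",    6004),
]

def bind_object(rows, id_bind):
    """Collect the composite object: all (table, id) pairs for one idBind."""
    return [(t, i) for b, t, i in rows if b == id_bind]

obj = bind_object(bind_rows, 7001)
assert len(obj) == 5
assert sum(1 for table, _ in obj if table == "Pick") == 4
```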
** There is a design discrepancy between my diagrams and DKs that needs to be resolved. In DKs diagrams idBind is a unique id, which means that only one row can have this value. In the core design all rows with a common value of idBind represent a single composite object which is in fact the "Bag". The issues involve process flow control, and it is urgent that we address this level of design functionality before we proceed much further.
The maintenance of the Bind table is the sole responsibility of an application program known as the Global Integrator, or GLINT, although objects will initially be aggregated by various data population processes representing a preliminary logical association. GLINT will be responsible for merging objects, deleting objects, and changing the identity of objects as additional information becomes available. For example, an object initially identified as a collection of waveforms might, as a result of the application of a picking application, be reassigned as an earthquake and sequenced as such through the CUSP logic for that kind of object.
The Summary layer is common to and present in all of the schemata investigated, although it is not necessarily designated as such. Such summary tables, as shown in the ERD [CORE SCHEMA HYPERLINK HERE], include earthquake origin tables, various kinds of mechanisms, focal mechanisms, and moment tensors. A common feature of all tables in the Summary layer is the use of a GUID as the primary key. Because of this, all summary objects are capable of process control through the CUSP sequencing logic. In addition, all tables in the Summary layer contain entries that are distillations of collections of data from the Datum layer, although these connections are not expressed directly but through information contained in the Link layer. It should be noted that the association is not in general homogeneous. For example, a given Origin might be linked to phase picks, three-component station beams, or spatial array beams.
Three other attributes are also common to all tables in the Summary layer: idSource, tiExternal, and xidExternal, which together reference the source of the information in the External layer tables. They are only present if the information was entered into the system from an external source and is maintained in an external DBMS format. For summary information calculated by processing elements of the Earthworm 5 DBMS they are not present, and with reference to the Oracle DBMS architecture they have a minimal impact with respect to space. Correctly referencing a contributed epicenter from a regional network not using the Earthworm 5 DBMS would be one of many applications for these fields. The idSource identifies the specific external database being referenced, tiExternal is a table name in the external schema, and xidExternal is a primary key in the referenced external table.
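The three-attribute external reference might be sketched as follows; the row contents, id values, and external table name are all invented for illustration.

```python
# Sketch of the external-reference triple on a Summary-layer row: present
# only when the row was imported from an external source. Attribute names
# follow the prose; all values below are invented.

origin_row = {
    "idOrigin": 5002,
    "lat": 61.5, "lon": -150.0, "depth": 35.0,  # summary content (invented)
    # reference back to the contributing external database:
    "idSource": 17,              # which external database
    "tiExternal": "origin",      # table name in the external schema
    "xidExternal": "1998abcd",   # primary key in that external table
}

def external_ref(row):
    """Return the (source, table, key) triple, or None for locally computed rows."""
    keys = ("idSource", "tiExternal", "xidExternal")
    if all(k in row for k in keys):
        return tuple(row[k] for k in keys)
    return None

assert external_ref(origin_row) == (17, "origin", "1998abcd")
```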
As with the other layers discussed, tables in the Link layer have a common column structure. In all cases they contain a composite primary key, each component of which is itself a primary key in another table: the first is a primary key to a summary object, and the second is a primary key to a datum object. The remaining attributes are referred to as link summary attributes and express information uniquely associated with the kinds of data and summary objects being linked. For example, for an OriginPick entry, such information would include the travel-time branch used by a location program (which in general may be more specific than, or at variance with, that assigned by an analyst), takeoff angles, origin-station distance, and station azimuth. A summary table entry in general will be linked to more than one kind of data used to support it. For example, an origin table entry might be referenced by entries in the OriginPick table as well as the OriginRay table. This design element supports extensibility in that new kinds of data can be used to support old kinds of summaries without changing the structure of tables in either the Summary or Datum layers.
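An OriginPick row of the kind just described might be sketched as a record keyed by the pair of primary keys it joins, carrying the link summary attributes alongside. The field names and sample values are illustrative assumptions, not the actual table definition.

```python
# Sketch of a Link-layer row: keyed by the (summary, datum) primary-key pair,
# with link-summary attributes specific to Origin-Pick associations. All
# field names and values are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class OriginPick:
    idOrigin: int     # primary key of the Origin row (summary side)
    idPick: int       # primary key of the Pick row (datum side)
    ttBranch: str     # travel-time branch used by the locator
    takeoff: float    # takeoff angle, degrees
    distance: float   # origin-station distance, km
    azimuth: float    # station azimuth, degrees

link = OriginPick(5001, 6001, "Pg", 96.0, 12.4, 215.0)
# The composite (idOrigin, idPick) pair acts as the primary key.
assert (link.idOrigin, link.idPick) == (5001, 6001)
```

A separate OriginRay record type would carry a different set of link summary attributes, which is the per-link-type structure argued for in the design note below.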
Information in the Link layer tables is populated during two stages of processing. As a specific example, consider the processing involved in associating phase picks with hypocenters. In the first stage, the primary key is constructed by the global associator (GLASS), and the body of the link is subsequently updated every time a location program relocates the event.
This design element specifically supports the transactional nature of the Earthworm 5 design criteria. The more robust schemata examined during the design phase, especially those designed to support real-time systems, all possess similar logic. The reasons for this, of course, are well founded. The Link layer tables are highly ephemeral, particularly in the minutes following a seismic event. Tables in this layer are modified continuously, but generally only by a single process. If the link information were embedded in the summary and datum tables, as would be appropriate for a data warehousing design, then these tables would be constantly updated. Since summary and datum tables are also referenced by a variety of other processes, the result would be data lock conditions and data flow bottlenecks. Although there are two applications that process data in the link tables, namely an associator and a summarizer (e.g. a location program), the location processes involved in association are controlled by the associator, so data locks need not occur. In the case of final location calculations controlled by the CUSP sequencer, links are being processed that are not in general part of the association process, so again record-level locking does not occur.
** There is another design discrepancy in the current implementation of objects in the Link layer. My intention, and I think it is well considered, is that a particular link table associates rows in only one kind of summary table with data in only one kind of datum table. The MagLink table illustrates the problem: it has an idDatum key but no referential information indicating which table is referenced. In addition, link summary information will differ depending upon what kind of data is associated; Origin-Ray associations, for example, have considerably different link summary attributes than Origin-Pick links. We need to be able to associate new kinds of data with old kinds of summaries, and this is best accomplished if each link type has a unique identity and structure; otherwise the integration of new kinds of data requires the addition of columns to existing link tables. Also, I see no need for GUIDs at the link level, since I do not foresee the need to reference these objects as controllable entities in themselves – they simply express relationships. The current approach also complicates application programs, which will have to be modified to address changes in link tables. This seriously inhibits the extensibility of the design.
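The remedy argued for above can be sketched as follows (all table and column names here are hypothetical): instead of one MagLink table with an untyped idDatum, each summary/datum pairing gets its own link table with an explicit foreign key and its own link summary attributes, so that new datum kinds add tables rather than columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Magnitude (idMag  INTEGER PRIMARY KEY, value     REAL);
CREATE TABLE Amp       (idAmp  INTEGER PRIMARY KEY, amplitude REAL);
CREATE TABLE Coda      (idCoda INTEGER PRIMARY KEY, duration  REAL);

-- one link table per datum kind, each with a typed reference and its own
-- link summary attributes; integrating a new datum kind means adding a new
-- table like these, leaving existing link tables untouched
CREATE TABLE MagAmp (
    idMag INTEGER REFERENCES Magnitude(idMag),
    idAmp INTEGER REFERENCES Amp(idAmp),
    weight REAL,                      -- attributes specific to Mag-Amp links
    PRIMARY KEY (idMag, idAmp));
CREATE TABLE MagCoda (
    idMag  INTEGER REFERENCES Magnitude(idMag),
    idCoda INTEGER REFERENCES Coda(idCoda),
    qFixed INTEGER,                   -- attributes specific to Mag-Coda links
    PRIMARY KEY (idMag, idCoda));
""")
```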
Tables in the Datum layer for the most part carry primary observational data, and there is one table for each kind of data being collected. Information in these tables is static and is not modified during subsequent processing. Each table in the Datum layer also has a primary key which is a GUID, and as such is controllable by the CUSP sequencer. The GUID is also important with respect to the exchange of data and to maintaining the identity of data between cooperating nodes in the Earthworm 5 distributed architecture.
There is more complexity in the structure and relations shown than is present in the associated ERD [HYPERLINK]. For example, one kind of object in this layer is a waveform snippet, which might be referenced by a phase pick derived from it. A three-component beam references three or six waveform snippets, while a spatial beam would in general reference an arbitrary number of waveform snippets. An exponential coda decay function, from which magnitudes of small local earthquakes are often calculated, references an undefined number of coda segments, which are themselves also Datum layer elements.
Objects in the Datum layer, like objects in the Summary layer, also possess an optional composite foreign key consisting of idSource, tiExternal, and xidExternal. This reference is particularly important in the Datum layer because, as discussed in the general introduction to this section, the representation and interpretation of the same data from different sources varies widely. The Core schema attempts to render all such information into a common, generic form during data population, but the translation is undoubtedly not always accurate, so it appears necessary to also preserve the data in its original representation. This may be less necessary than we currently believe, but for now it is our working assumption.
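A minimal sketch of a Datum-layer table carrying this composite foreign key might look like the following (the Pick table and all column types shown are assumptions for illustration; only the idSource/tiExternal/xidExternal column names come from the text above):

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Pick (
    idPick      TEXT PRIMARY KEY,  -- GUID, controllable by the CUSP sequencer
    phase       TEXT,
    arrivalTime REAL,
    -- optional composite foreign key back to the original external
    -- representation, so a generically rendered row can still be traced
    -- to its source form
    idSource    INTEGER,           -- which contributing institution/schema
    tiExternal  INTEGER,           -- which table in that external schema
    xidExternal TEXT               -- key of the original row in that table
)""")
conn.execute("INSERT INTO Pick VALUES (?, ?, ?, ?, ?, ?)",
             (str(uuid.uuid4()), "P", 123.4, 3, 17, "EV2001-042"))
```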
** NOTA BENE: Somehow I seem to have lost track of the reasoning behind preserving external representations. I know at one point it seemed critical, but now I’m not sure why, and it certainly introduces a lot of complexity if it’s not required. If we are serving an ephemeral processing role, and not a data center role, then after the passage of some amount of time we would "forget" the data anyway. Do we plan on casting external data into the SEED format when it is expunged from the system, and if so, how can we do this in anything like a general way? It might be that we should preserve waveform data in miniSEED, but is there any reason to preserve the integrity of other kinds of low-level data? I know I’ve argued loud and strong for this in this document, but I’m not quite sure I believe my own arguments anymore =) "Help me Obiwan! You’re my only hope!" Surely it is more than just a mechanism for indefinitely deferring important design decisions.
The most notable thing that can be said of the contents of this layer is "Here there be dragons"! This layer is most important during the integration of noncompliant data formats and representations into those of the Core schema. For the most part this layer is identical to that of the contributor's database, and the intention is to replicate external schemata, at least the parts we wish to reference, without substantive changes. For control and referencing purposes we add to each Summary and Datum table in the external representation a single globally unique key so that its rows can be referenced by the Earthworm 5 DBMS Summary and Datum layers. This should in no way affect external queries, since the original primary key is preserved; we are merely adding a secondary key for our convenience.
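The convention described above can be sketched as follows. The replicated contributor table and its columns here are hypothetical (loosely patterned on a regional-network arrival table); the point is only that the original primary key is preserved and our globally unique key is purely additive.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ext_arrival (   -- replica of a contributor table
    arid INTEGER PRIMARY KEY,                -- original primary key, preserved
    sta TEXT, phase TEXT, time REAL)""")

# our addition: a single secondary, globally unique key so the row can be
# referenced from the Earthworm 5 Summary and Datum layers; external queries
# against the original columns are unaffected
conn.execute("ALTER TABLE ext_arrival ADD COLUMN ewGuid TEXT")
conn.execute("CREATE UNIQUE INDEX ext_arrival_guid ON ext_arrival(ewGuid)")

conn.execute("INSERT INTO ext_arrival VALUES (?, ?, ?, ?, ?)",
             (1001, "PAS", "P", 123.4, str(uuid.uuid4())))
```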
Another significant purpose of the External layer is to present an expected interface to installations using substantially different schema designs. We do not expect, nor would it be reasonable for us to anticipate, that everyone would or should adopt our approach. On the other hand, it seems equally clear that it will be necessary to exchange information with centers using very different architectures. Certainly standardized exchange formats such as the CNSS format for regional network data and SEED 2.3 for teleseismic data address this issue to some extent. There are, however, many instances in which another institution would like to be able to access the contents of an Earthworm 5 DBMS in essentially real time. We can imagine only two possible solutions to this conundrum. One is to adopt an external schema for our processing, although none of those we have examined adequately supports our needs; moreover, there is far more than one alternative, and little likelihood of standardization in the foreseeable future. The other solution is to shadow the contents of an Earthworm 5 DBMS within the structure of at least a few external schemata. While this is itself a rather daunting task, and should not be undertaken lightly, it does seem the least troublesome approach. The situation is not quite as bad as it might seem, since it is unthinkable to allow direct external access to a DBMS supporting a real-time, critical mission; even cooperating nodes of the Earthworm 5 distributed architecture forego this luxury, as was discussed under the topic of the "Confederacy Model" in the functional requirements section. Consequently, the only real loss is the development time required to support continuously updating a shadow DBMS using an alternative schema. It may be, although it is certainly not our decision to make, that other mechanisms of exchanging data in real time would be more appropriate and less odious to all concerned than the two alternatives discussed here.
However, it is presently our intention to provide whatever "face" is necessary to support external critical requirements.