Universal Data Analytics as Semantic Spacetime (Part 1)

Part 1: New Horizons for Sensory Learning

Every volume of space, each snapshot in time, embodies a microcosm of information bursting with meaning. What’s it all for? In our modern world of smart technology, information can be offered to us in the form of services that are engineered for human consumption. In the world of (say) biology, by contrast, the rich information forms an ad hoc catalogue of diseases, cures, foodstuffs, and more, reflecting the diverse possibilities that we observe and measure as we encounter them. Joined together, such snapshots recorded in space and time form a larger map of the world around us: a map of spacetime with an interpreted meaning, known as semantics.

Neck deep in observations and interactions, we humans wade through a river of information as we go about our business. Some information is intentional — and can be harnessed to govern, even enhance our lives. Sometimes it appears to be just unintentional background noise that we mine for secrets about the universe — encoded rules and principles about phenomena that we strive to decrypt. The difference? Interpretation.

Figure 1: Every volume of space in time bursts with information, some natural and intrinsic, some artificial and intended for smart services.

We model information in order to find a way to use it. But modelling isn’t easy. A model has to find hidden simplicity in what might first appear ad hoc.

In this post (the first of a series on the subject of data analysis and the Semantic Spacetime model), I’ll try to explain a little about modelling using some practical tools. The environment may be overflowing with data in the form of numbers, images, and text, but it also expresses complex, context-dependent semantics. We don’t always pay as close attention to this meaning as to raw numbers or descriptions, perhaps because the need to interpret implies that the world isn’t straightforwardly objective, which has traditionally been something of a taboo notion.

During the last decade, technology has begun to improve in its support of capturing and modelling the many meanings of data, as well as the raw numbers and strings of observations themselves. We’ve gone from “focused data” to “big data” to “machine learning” and “AI”. All this implies different kinds of reasoning. New kinds of data models have been developed, along with new kinds of databases, with the potential for making realtime learning of richly annotated data a reality. We can make use of these.

A lot has been written on a theoretical level, but I’ll do my best to illustrate some of these developments from a practical and pedestrian viewpoint, to show how we can approach data analytics, including semantics, in a simple way that’s based on universal principles. It’s a way to cut through all the complexity in the fields and adopt a surprisingly simple picture of the world that can save a lot of time and effort. I’ll use a practical version of what I’ve already discussed in my book Smart Spacetime (SST), as well as in a number of papers belonging to the Semantic Spacetime Project.

Smart homes, cities, ecosystems — all patterns in spacetime!

In bygone days, it was only scientists who collected or cared about data. The intrepid scientist would perform highly constrained and isolated experiments, controlling the conditions for repeatability. He or she would then collect numbers into columns (by quill into a trusty lab notebook, or today by smart device into data stores), compute averages and error functions, then finally present a Gaussian average picture of the phenomenon in a paper for publication. It’s an approach that has carried us for centuries, but which is also less important today. The ultimate goal of such an isolated experiment was to look for patterns, as Alfred North Whitehead put it:

“to see the general in the particular and the eternal in the transitory”

Today, fewer problems are about such isolated experiments, or even about discovering natural phenomena. We’re using data more widely and more casually: building intentional systems that use data for feedback and control in realtime, as well as for documenting histories and processes that we already understand, but perhaps can’t fully comprehend without assistance. We look for diagnostics, causal patterns, and forensic traces amongst implied relationships. So, observations are no longer just to look for scientific repeatability in experiments (i.e. statistically invariant patterns). They are also about human-cybernetic system control. We might be searching for a single customer record in a purchase history, or for a path through a log of supply chain transactions or financial dealings, whether to find isolated anomalies or simply to calm an inflamed relationship. Anomalous phenomena, customer relations, supply chain delivery, forensic analysis, biological mutation, social justice: these are all aspects of one giant bio-information network that governs the world around us.

Today, data are (or “is” if you prefer — take your grammatical pick) more often collected directly and automatically, by soft or hard sensors, even by filling in online forms — as part of some semi-controlled environment. They are then fed into files or some kind of structured storage, with various degrees of sophistication. The “measurements” yield representations that are numerical, symbolic, image-based, matrix valued, and so on. And each cluster, document, or “tuple” of data has its own significance, which may be lost unless it can be captured and explained as part of a model that captures that importance. This is what we refer to as data semantics.
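
To make the idea of data semantics a little more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the Observation type, its field names, and the threshold values are all hypothetical and not part of any particular tool. It shows how the same raw number can mean different things once the context that gives it significance is kept alongside it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """A raw value kept together with the annotations that give it meaning."""
    value: float   # the raw measurement
    unit: str      # what the number counts
    source: str    # which sensor or form produced it (hypothetical names)
    context: str   # the setting in which it should be interpreted


# Hypothetical limits, chosen only for illustration.
LIMITS = {"server room": 30.0, "body temperature": 38.0}


def interpret(obs: Observation) -> str:
    """Interpret a reading relative to its context, not just its number."""
    limit = LIMITS.get(obs.context)
    if limit is None or obs.unit != "celsius":
        return "unknown"
    return "too high" if obs.value > limit else "normal"


room = Observation(37.2, "celsius", "probe-1", "server room")
body = Observation(37.2, "celsius", "probe-2", "body temperature")
print(interpret(room), "/", interpret(body))
```

The bare number 37.2 carries no verdict by itself. Only the annotated tuple can be interpreted, which is the sense in which the semantics have to travel with the data.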

Data formats

Databases, or more precisely database systems, are the tools we use for managing data: not just for storing it but also for filtering and interpreting it. There are a few basic model types to choose from, and they influence our usage significantly. Conceptually, databases aren’t as different as their designers and vendors like to claim (for market differentiation), but the devilry is in the implementation, and in the usage. Some common cases:

Key-value data: these are the simplest scratch-pad representations of tables or ledgers. All databases are basically key-value stores of some kind with more or less complicated “values”. The “rows” are labelled by keys which are names of some kind, and can be generalized to URIs to add more structure. Even simple disk storage is a key-value store, with a structured index on top. In programming, key-value stores are associative arrays or maps. Key-value stores can store tabular data, like time-series and event logs, when the keys are timestamps, or ledgers in which keys are entities and values are financial transactions. They’re used as configuration data, or for keeping “settings”, as in the Windows Registry or etcd. KV stores are very convenient to use in machine learning of scalar, vector, and matrix data, like the monitoring of signal traces, and accumulated counting of histograms, measurement by measurement (a set of name/number key-value pairs is exactly a histogram). Keys represent a symbolic interface to the world, and to avoid explosive growth, they should be used with convergent learning models.
Relational, column, or tabular data (SQL): this is the kind of database that most people are now taught in college. It’s the simplest generalization of a key-value store, based on “tables”, which are basically templates for data types or schemas. Each element in a table is a named “column” (as in a spreadsheet) and each instance is a row, as in a ledger. The use of rigid schemas is associated with the three normal forms, and the transactional properties of updates during reading and writing are a huge issue amongst different implementations. The same is true for all kinds of databases.
Document representations: like relational or tabular data, these are another generalization of key-value stores, often called NoSQL databases, as they were the first notable deviation from the SQL model. The values in a key-value store benefit from stronger and more detailed semantic elaboration, like putting a folder or document in an archive, rather than just a number on a ledger. Because such documents are harder to find, we also make key-value indexes to organize them by different criteria. For instance, in collecting data for personnel files or digital twins, it’s helpful to encapsulate related aspects of a person or device under a single reference.
Graph representations: these are a relatively new trend, used for expressing structural relationships, i.e. data of a different kind, especially those that are non-linear and not just serial. For example, an “org-chart” would be represented as a graph. A map of services could be a network. Functional relationships between components and processes, like circuit diagrams and flowcharts, can all be described by graphs. Ad hoc networks, like mobile communications during emergency operations, form real-time changing graph connections to nearby mobile devices. The locations of nodes within a Euclidean or spherical coordinate embedding can be used to represent geo-spatial data. The hub-spoke pattern (see figure 2) is a common configuration that links all the data models in this list: a hub acts as a primary key for referring to its satellite nodes as data (like network address prefixes), a hub binds independent items into a single entity (a document), and it acts as a cluster focal point for anchoring related items in space or time as a graph.
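
As a rough illustration of the four cases above, the following Python sketch stores the same handful of observations in each form, using only the standard library. Everything in it is an assumption made for the example: the sensor names, the “reading” table, the “house” hub, and the CONTAINS link type are hypothetical, and an in-memory SQLite table and plain dictionaries stand in for real database systems.

```python
import sqlite3
from collections import Counter

# Hypothetical observations: (timestamp, sensor name, reading)
observations = [
    (1, "thermo-1", 20.5),
    (2, "thermo-2", 19.0),
    (3, "thermo-1", 21.5),
]

# 1. Key-value: names as keys, running counts as values (a histogram).
histogram = Counter(sensor for _, sensor, _ in observations)

# 2. Relational: a fixed schema of named columns, one row per observation.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE reading (t INTEGER, sensor TEXT, value REAL)")
db.executemany("INSERT INTO reading VALUES (?, ?, ?)", observations)
mean_thermo_1 = db.execute(
    "SELECT AVG(value) FROM reading WHERE sensor = ?", ("thermo-1",)
).fetchone()[0]

# 3. Document: everything about one device gathered under a single reference.
documents = {}
for t, sensor, value in observations:
    doc = documents.setdefault(sensor, {"name": sensor, "readings": []})
    doc["readings"].append({"t": t, "value": value})

# 4. Graph: a hub node linked to its satellite nodes (the hub-spoke pattern).
edges = {("house", "CONTAINS", sensor) for _, sensor, _ in observations}
satellites = sorted(dst for src, _, dst in edges if src == "house")

print(histogram, mean_thermo_1, documents["thermo-1"], satellites)
```

In each case the sensor name plays the role of the hub: it is a key in the histogram, a column value to select on in the table, the reference for a document, and a node in the graph.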

There are always choices to be made: should we use a time-series of documents or a document of time-series? Data modelling is an art, and it’s important to get it right, because it forms the foundation for algorithmic efficiency. In the Semantic Spacetime (SST) model, we try to make use of all of these data formats at the right moment.
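
The choice between a time-series of documents and a document of time-series can be sketched in a few lines of Python. This is only an illustration with made-up samples and hypothetical device names, using dictionaries in place of a real store. The point is that the two layouts hold the same data but make different questions cheap.

```python
# Hypothetical samples: (timestamp, device, temperature)
samples = [(1, "a", 20.0), (1, "b", 18.0), (2, "a", 21.0), (2, "b", 18.5)]

# Time-series of documents: one snapshot document per timestamp.
series_of_docs = {}
for t, device, temp in samples:
    series_of_docs.setdefault(t, {})[device] = temp

# Document of time-series: one document per device, holding its own history.
doc_of_series = {}
for t, device, temp in samples:
    doc_of_series.setdefault(device, []).append((t, temp))

# "What did everything look like at t=2?" is a single lookup in the first layout.
snapshot_at_2 = series_of_docs[2]

# "How did device a evolve?" is a single lookup in the second layout.
history_of_a = doc_of_series["a"]

print(snapshot_at_2, history_of_a)
```

Answering the snapshot question from the second layout, or the history question from the first, means scanning every entry, which is one small way in which the model shapes algorithmic efficiency.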

Key coordinates