Universal Data Analytics as Semantic Spacetime (Part 5)

Part 5. Graph spacetime, association types, and causal semantics

Data analysis inevitably involves relationships between sources and the signals they give rise to — with, of course, their associated semantics. So far we looked at only simple tabular associations (so-called documents), typical of what’s usually stored in databases. But we can do much better than this when complex causation is involved. Charting relationships qualitatively and quantitatively, as a graph (the mathematical name for a network), may offer a more natural and complete way of representing processes. The graph format has the additional benefit of visually mapping to the trajectories of unfolding processes (such as fluid motion, state transitions, or even logical flow charts).

Types are not enough…

I recall an episode from many years ago, during the early development of Promise Theory, in which a PhD graduate from the University of Oslo came to discuss my approach and tell me about his problem. The student had worked on Object Oriented Type-Analysis for his thesis and had been unable to resolve a final issue. His advisor had identified the case as a paradoxical problem in OO modelling, and they had spent some years working on it. After hearing about the problem, I sketched it on the blackboard, using Promise Theory, and showed him how I would have done it. His mouth fell open as he saw that the answer was so obvious: a simple shift in thought, from the imposed obligation of a rigid type model to autonomous promises, made the issue unambiguous. Freed of a restraining doctrine, we had solved the problem immediately. It was one of many incidents that pushed me towards network modelling without the artificial logical structures that are traditional in mathematics, and towards ideas like Semantic Spacetime.

I share this story because we sometimes imagine that modelling choices are easy, and that analysis will be more or less the same no matter what static data model we choose. Any model is surely as good as any other; it’s just a matter of style, right? But this is not so. Sometimes we set ourselves up for success or failure by the choices we make at the start. That’s also true about the choice of tools. Happily, Go with ArangoDB has proven to be the clearest and most efficient version of these ideas I’ve been able to produce so far!

In this fifth post of the series, I want to show how we can represent completely general semantic relationships, between events in a process, using the semantic spacetime tools and a graph representation. Using Go and ArangoDB, this has never been easier, but you might have to tune your thinking to a different frequency to avoid some pitfalls.

In computer coding, the concept of data types is by now quite ubiquitous and is used with a specific purpose in mind: to constrain programmers, by being intentionally inflexible! Data types, like double-entry book-keeping in accounts or categories, are designed to highlight errors of reasoning by forcing a discipline onto users. Types have to match in all transactions, else syntax error!

Graph spacetime

Suppose you want to model relationships between a number of individuals. You can’t do that with a one-dimensional histogram. With a one-dimensional key-value table, at best you could count the number of interactions each individual experiences (with everyone else) and store it in a key-value pair. It would be natural to store that in the node, because it’s information only about the node itself. Interactions, on the other hand, are two-dimensional tables: matrices or 2-tensors. If we make a table with rows and columns labelled by individuals’ names, we get a so-called adjacency matrix.

          A  B  C  D  E  F  G  H
       A  0  -  -  -  -  -  -  -
 ^     B  -  0  -  3  -  -  -  -
 |     C  -  -  -  -  -  -  -  -
 |     D  -  1  -  0  -  -  -  -
row    E  -  -  -  -  0  -  -  -
 |     F  -  -  -  -  -  0  -  -
 |     G  -  -  -  -  -  -  0  -
 v     H  -  -  -  -  -  -  -  0

          <--- column --->

         matrix(r,c) = value
         matrix(2,4) = 3
         matrix(4,2) = 1
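
The table can be sketched in Go as a two-dimensional slice indexed by row and column. This is only a minimal illustration, not code from the series: the names `names`, `NewAdjacency`, `Index`, and `TableMatrix` are hypothetical, and the dashes in the table are assumed to mean "no link", stored here as zero. Go slices count from 0, whereas the matrix(r,c) notation in the table counts from 1.

```go
package main

import "fmt"

// names labels both the rows and the columns of the table (assumed order A..H).
var names = []string{"A", "B", "C", "D", "E", "F", "G", "H"}

// NewAdjacency returns an n x n matrix of zeros, i.e. a graph with no links.
func NewAdjacency(n int) [][]int {
	m := make([][]int, n)
	for r := range m {
		m[r] = make([]int, n)
	}
	return m
}

// Index finds the position of a node name, or -1 if the name is unknown.
func Index(name string) int {
	for i, n := range names {
		if n == name {
			return i
		}
	}
	return -1
}

// TableMatrix fills in the two non-zero entries shown in the table.
func TableMatrix() [][]int {
	m := NewAdjacency(len(names))
	m[Index("B")][Index("D")] = 3 // matrix(2,4) = 3 in the table's 1-based notation
	m[Index("D")][Index("B")] = 1 // matrix(4,2) = 1 in the table's 1-based notation
	return m
}

func main() {
	m := TableMatrix()
	fmt.Println("B -> D:", m[Index("B")][Index("D")])
	fmt.Println("D -> B:", m[Index("D")][Index("B")])
}
```

A dense matrix like this stores a value for every possible pair of nodes, even when most pairs have no link, which is one reason to store links separately from the nodes.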

Now there is a reason to store information separately from the nodes, because some of it belongs to both nodes, or to the connection between them. Shared information belongs to the directed path between the end-points.

This is also a key-value structure, of course. The labels are still the keys, but now they are two-dimensional key-pairs (k1, k2), which act as (y,x) coordinates formed from the nodes. The values belonging to the paths between them are the numbers in the table itself. This matrix is completely equivalent to a graph or network. The values within the table, at these coordinates, represent links or edges from row node to column node. If the edges have no arrowhead or directionality, the matrix must be symmetrical, because A connected with B implies B connected with A. But edges don’t have to be undirected: a one-way street connects A to B but not vice versa. Also, the diagonal values are often zero, because we don’t usually imagine needing to connect a node to itself. Every node is next to itself, right? But this is also a failure of imagination, perpetuated by our prejudices about space. If we think of the arrows as processes, not distances, then any node can talk to itself or change itself. Indeed, this kind of “self-pumping” turns out to be very important in directed processes like recurrent networks for keeping nodes alive.
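
The key-pair view can be sketched in Go with a map whose keys are (row, column) pairs. This is a hedged illustration rather than the representation used later in the series: the types `Pair` and `Links` and the functions `IsSymmetric` and `SelfLinks` are hypothetical names, and a missing key is assumed to mean "no link" (value zero).

```go
package main

import (
	"fmt"
	"sort"
)

// Pair is a two-dimensional key (row node, column node), i.e. one directed link.
type Pair struct {
	From, To string
}

// Links maps each directed key-pair to the value carried by that link.
type Links map[Pair]int

// IsSymmetric reports whether every link has a reverse link of equal value,
// which is what a graph without arrowheads would require.
func IsSymmetric(l Links) bool {
	for p, v := range l {
		if l[Pair{From: p.To, To: p.From}] != v {
			return false
		}
	}
	return true
}

// SelfLinks lists the nodes that link to themselves (non-zero diagonal entries).
func SelfLinks(l Links) []string {
	var nodes []string
	for p, v := range l {
		if p.From == p.To && v != 0 {
			nodes = append(nodes, p.From)
		}
	}
	sort.Strings(nodes)
	return nodes
}

func main() {
	l := Links{
		Pair{From: "B", To: "D"}: 3,
		Pair{From: "D", To: "B"}: 1,
	}
	fmt.Println("symmetric:", IsSymmetric(l))

	// A hypothetical self-link, not present in the table.
	l[Pair{From: "C", To: "C"}] = 1
	fmt.Println("self-links:", SelfLinks(l))
}
```

Unlike the full matrix, the map only stores the links that exist, so the "empty" entries take up no space at all.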

In the table above, we see that B influences D with a value of 3, but D only influences B with a value of 1. Perhaps B called D 3 times, but D only called B once. Clearly, we can use this as a kind of histogram too, by interpreting the numbers as frequencies, e.g. conversation frequency. To get the one-dimensional summary, we just have to sum over the columns of the “ledger”. This is how a statistician might look at a graph, but it’s far from the whole story.
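
Summing the "ledger" can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names `RowSums` and `ColSums` are hypothetical, the matrix is assumed to be square, and the row totals are read as outgoing interactions while the column totals are read as incoming ones.

```go
package main

import "fmt"

// RowSums adds up each row across its columns: the total outgoing value per node.
func RowSums(m [][]int) []int {
	sums := make([]int, len(m))
	for r, row := range m {
		for _, v := range row {
			sums[r] += v
		}
	}
	return sums
}

// ColSums adds up each column down its rows: the total incoming value per node.
// It assumes a square matrix, so every row has len(m) entries.
func ColSums(m [][]int) []int {
	sums := make([]int, len(m))
	for _, row := range m {
		for c, v := range row {
			sums[c] += v
		}
	}
	return sums
}

func main() {
	names := []string{"A", "B", "C", "D", "E", "F", "G", "H"}
	m := make([][]int, len(names))
	for r := range m {
		m[r] = make([]int, len(names))
	}
	m[1][3] = 3 // B -> D
	m[3][1] = 1 // D -> B

	outgoing := RowSums(m)
	incoming := ColSums(m)
	for i, n := range names {
		fmt.Printf("%s: out=%d in=%d\n", n, outgoing[i], incoming[i])
	}
}
```

The summary throws away who talked to whom: it keeps only how much each node talked or listened.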

We’re conditioned to think about space as Euclidean space, i.e. a world of a few things or locations that’s full of inert padding. Graphs are a more direct representation of processes, without all the “wasted” space in between. We might draw graphs embedded in Euclidean space (as a picture) in order to satisfy our senses, but we don’t have to. The matrix shows that the empty space of the page has no meaning: all the information is in the matrix. The empty space is not represented. Classically, scientists have focused on just putting numbers in the matrix, but in the modern era, we need much more detailed symbolic information to make sense of processes. Graphs and their fraught relationship to Euclidean space are at the heart of machine learning and statistics.

Graphs are good at representing highly inhomogeneous processes (data with types, which includes ontologies as well as control information). In a Euclidean space, every point or place is just the same as the next, and we can only compare relative locations. This assumption of translational invariance in data is a big topic in mathematics and physics (group theory), but it’s founded on a prejudice that the world is ultimately homogeneous at its root. This isn’t a useful point of view in a semantic spacetime, e.g. in biology, technology, sociology, etc. Euclidean space was invented to model the homogeneity and isotropy of spacetimes: the fact that in mostly empty space nothing much happens except at special places, which is an assumption about the representation, not the process. Historically, we then called those special places “bodies” or “particles” (matter) to distinguish “nothing happening” from “something happening”. Graphs work differently: nodes provide a kind of material location at every vertex, so the idea of matter is redundant. This means graphs can have different properties at every single node or point.

It’s hard to imagine different properties at different absolute locations in Euclidean space, partly because there is an assumption of its emptiness (featurelessness or typelessness), and partly because there isn’t a natural way to describe boundaries on an arbitrary scale in Euclidean space, so we tend to treat it as featureless. But this is a mistake: if you don’t assign meaning to the quadrants on a page (e.g. upper right is good, lower left is bad, as in marketing graphics), no conclusion has any intrinsic meaning, only relative meaning. But we know this isn’t natural in the world around us: absolute properties model context, or boundary information. If you’re in New York, Times Square is obviously different from Central Park not only in size, but in its features. Our inability to escape Euclidean ideas is tied to our sensory and cognitive model of distance. This is the semantic trap that the field of statistics easily falls into: letting representation dictate interpretation. But in physics, we’ve rediscovered the importance of graphs, especially in quantum physics (path integrals, lattices, etc.): events form trajectories, where transitions between states are the important feature, not distances. Trajectories were once treated as secondary patterns in physics, but in fact they are the fundamental description of processes. The empty spaces we introduce in between steps are just scaffolding that we end up integrating away.

As we start out modelling phenomena with graphs, all these issues of pride and prejudice will get in the way all the time. Is a graph a spacetime? Is Euclidean space more fundamental or is a graph more fundamental? Semantic spacetime is a model which allows us to see these issues clearly. And Machine Learning is an area where the tension between Euclidean and graph representations reaches a crescendo, so, as you’d expect, I’ll be discussing this more in the rest of the series.

Graphs: associations, links, or “edges” vs types in Go
