Semantic Spacetime and Data Analytics (Part 12)

Appendix 1: Exploring the digital barrier reef with CAIDA and ArangoDB

We probably think we know the Internet, but how many can really say they understand how it works? What do you think of when you hear the word? Your favourite websites? Social media? Perhaps the WiFi router supplied by your service provider. Understanding all its moving parts is not for the faint-hearted, yet a few researchers do this for a living (if you call that living!). As in all research, there are both ambiguities and controversies involved, so we're always looking for a new approach. A Semantic Spacetime (SST) approach presents new opportunities to unravel those issues, enabled by the new hybrid database technologies for document-graph representation. In this Appendix to the main series, I want to showcase an application of SST to Internet data analytics.

The Internet isn’t so much a web as it is a “living network” — a busy ecosystem of ongoing processes, all of which are connected together to form a large cooperative graph. The graph might appear straightforward, as any snapshot of something in time seems to be, but we shouldn’t mistake a snapshot for the real thing.

Figure 1: A local sample of Internet structure from CAIDA data.

Understanding the Internet takes a lot more than collecting and mapping out the names of websites, domains, IP addresses, or even seeing the physical connections between them. There’s a physical layer, to be sure, but there are also multiple virtual layers in which the actual links between one user and another fluctuate and multiply wildly in real time, much faster than a human could comprehend. There is virtual motion, and also semantic circuitry to expose. The resulting ambiguity is a bit like quantum mechanics: the best picture we can describe is a probabilistic one, an equilibration of underlying flows, summarized by an average "field", which each and every observer has to probe for themselves to form a current picture. There are slowly moving trends too: addresses change hands, companies come and go, names change, technology is replaced. The Internet is a patchwork of overlapping stories occurring across a multitude of timescales.

CAIDA in Semantic Spacetime

The Center for Applied Internet Data Analysis (CAIDA for short) is a small research wing, located in a pleasant forest grove, deep within the La Jolla campus of the University of California San Diego (UCSD). I’ve visited it only once in person, some 20 years ago, and it felt like a holiday resort with sea views and forest walks! Yet, from this idyllic vantage point, for the past 25 years CAIDA has been researching, mapping, and visualizing the Internet, in spite of the odds being stacked against them. You may have seen beautiful orchid-like maps as well as deep insights into the murky processes that drive it. These are some of the women and men, led by Kimberly (kc) Claffy, who try to answer the big questions. Getting a grip on data about the Internet is fraught with ambiguity and difficulty. Today the Internet is controlled largely by corporate America’s tech giants, who are not too forthcoming when it comes to independent research. So much for the rumours of no central control. Ingenuity is needed to get around the obstacles.

Figure 2: All socio-technical networks are multi-scale graphs of complex overlapping processes.

On one level, the Internet is a graph (indeed, it’s several, depending on how you look at it), but how should one capture its nuances and model it? That’s a long-term project with enormous scope, so we have to begin with the elementary questions: how to capture its main structural and semantic relationships.

Mapping out the Internet is a bit like taking satellite imagery of the Earth: we can see some things in plain sight, but we can’t see the insides of buildings or underground tunnels. Similarly, the Internet hides some information, for reasons of security and privacy as well as for reasons of scale. Some information has to be based on inference. Probing the data requires sophisticated machine learning: not the headline-grabbing Deep Learning kind, but rather the more widespread kind that learns about changing signals by automated sampling and reasoning, and is used to run the biggest investigative and predictive endeavours of our time, from particle physics to supply chain management.

CAIDA publishes a portion of their results as monthly public sets that capture a snapshot tomography of the inferred structure based on complex and heuristically-guided algorithms. This is the ITDK project (for Internet Topology Data Kit). It’s a small part of the whole picture, but one we can use as an example of semantic spacetime modelling, because it offers a recognizable picture of the Internet — something roughly analogous to the wavefunction for the Internet.

Semantic Spacetime Again

One answer to organizing network information, qualitatively and quantitatively, is to use Semantic Spacetime — as highlighted in the main body of this series. Semantic Spacetime models processes as graphs (something like scaffolding built from Feynman diagrams, but generalized to capture more aspects of a process). Semantic Spacetime retains special labels that distinguish the meaning of locations and their relationships, with respect to different flows and processes, mapping between different causal influences.

Figure 3: Probing the local structure of the Internet using traceroute (see part 7 of this series).

In Part 7 of the main series about Semantic Spacetime and Data Analytics, I showed how network probing, using traceroute, could reveal the structure of a network, thread by thread and step by step (see figure 4). It’s a laborious undertaking that CAIDA has specialized in — sewing together endless probes into a kind of snapshot, accumulated from an ensemble of millions of independent processes. ITDK is summarized methodically as a published history of effective snapshots for the entire Internet.
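To make the idea of sewing probes together concrete, here is a toy sketch in Python. It is not CAIDA's method (ITDK relies on much more elaborate inference than this), and the function name `stitch_probes`, the sample addresses, and the convention of marking a silent hop as `None` are all my own assumptions for illustration. Each probe is just an ordered list of hops, and the snapshot is the accumulated count of how often each hop-to-hop link was seen.

```python
from collections import Counter


def stitch_probes(probes):
    """Merge traceroute-like hop lists into directed edges with observation counts.

    `probes` is an iterable of hop sequences; a hop that did not answer is
    given as None (an assumed convention for this sketch).
    """
    edges = Counter()
    for hops in probes:
        prev = None
        for hop in hops:
            if hop is None:
                # A silent hop breaks the chain: we cannot claim that the
                # hops on either side of it are directly adjacent.
                prev = None
                continue
            if prev is not None and prev != hop:
                edges[(prev, hop)] += 1
            prev = hop
    return edges


# Hypothetical probes using documentation-range addresses.
probes = [
    ["10.0.0.1", "10.0.1.1", "192.0.2.7"],
    ["10.0.0.1", "10.0.1.1", None, "198.51.100.9"],
    ["10.0.0.1", "10.0.2.1", "192.0.2.7"],
]

graph = stitch_probes(probes)
nodes = {node for edge in graph for node in edge}

for (src, dst), count in sorted(graph.items()):
    print(f"{src} -> {dst} seen {count} time(s)")
```

The counts are a crude stand-in for the probabilistic picture: a link seen in many independent probes is more firmly part of the snapshot than one seen only once. Note that the destination behind the silent hop never enters the graph at all, which is a small reminder of how much of the real structure has to be inferred rather than observed.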

Figure 4: Building up a traceroute as radial slices taken over multi-slit maps forms a superposition of possible boundary conditioned paths.

Where are space and time in this picture? The Internet’s nodes and organizations are different flavours of space: the succession of summaries describes one kind of (adiabatic equilibrium) time, while the individual probes map out transitions that form the proper times of millions of independent observer processes. As usual, it's spacetime all the way down.

The result is effectively a giant non-equilibrium field of fluctuating causal influence, quite similar to a quantum mechanical wavefunction in a number of ways (see the discussion in part 7). The main difference is that the individual causal pathways remain in pinpoint graphical form, rather than being smeared out as a functional Euclidean embedding, or a field. But, while embedding functions are a popular approach for approximating inferences in modern machine learning, the Internet is too complex an ecosystem to be represented as a purely quantitative field. Its richness of semantics is what holds the keys to its understanding on multiple levels.

Isn’t it just a graph database?