### Part 10: Big data, coarse graining, and “effective” approximations

In the previous post, I looked at how large scale properties of graphs could be calculated from smaller scales, and how the graph itself reveals the computation! Before we end this series of posts, we need to finish addressing the elephant in the room, which is scale, in all its interpretations! We already saw in part 8 what happens when we zoom into a graph of even moderate size, when it has dense interconnectivity. We lose our way, and find it hard to track causality. For some, the solution is to use one of several inscrutable Artificial Neural Network techniques being developed in the ML research community, which mistakenly (in my opinion) try to replace graphs with Euclidean embeddings. But there’s a lot we can do without throwing away causal explanation. In this post, I want to show how the concept of semantic spacetime helps us to stay in control of a more traditional and causal alternative: we can apply classic methods from statistical mechanics to integrate out degrees of freedom, leaving a controllable approximation over a scale parameter.

### Big data — why do we want it?

The fact is that accumulating Big Data mainly became popular when disk prices fell below a certain threshold of affordability in the tech industry. Supply drove demand more than direction. We can blame tech marketing for some gratuitous over-usage of storage during those times but, even if Big Data was advocated indiscriminately in the past, today there’s a higher level of sophistication involved in the use of large caches of data. Perhaps the same can’t be said of CPU time.

Let’s recall the two main reasons for using data:

1. To seek answers about empirically gathered data, i.e. to determine (repeatable) causal relationships.
2. To combine data through models in order to generate composite outcomes (e.g. for modern “AI” syntheses) like an expert system.

The former is a traditional view of scientific inquiry: high fidelity signal analysis, as “objective” as can be. In the latter, there’s a kind of artistic license, like making photo-fit mug shots from body parts. We don’t have full causal integrity in the story between input and output, but we apply superpositions over possible “worlds” or “alternative states” to compose an outcome to be judged on the merits of later applicability. Not every data set (large or small) will contain all the information we need to make it meaningful by itself. Semantics are not always apparent from observable dynamics. Nonetheless, we might still be able to combine one source with another, to “imagine” possible ideas, as we do in “AI” methods; again, a bit like a photo-fit face pieced together to identify a criminal suspect.

### Methodological Challenges

Scale is about more than just the amount of data, or the number of computers needed to crunch it, though that’s its conventional usage in Computer Science. In Natural Sciences, scale can also refer to standard intervals on a measuring stick or other meter. I’ve discussed this before in my books In Search of Certainty and Smart Spacetime. The two definitions are closely related, because we can’t solve the problem of crunching bulk data without understanding how the number of countable intervals between relative measures that data represent affects the questions we’re trying to answer.

Figure 1: Read more detail about the motivations and theoretical background for Semantic Spacetime.

Anyone who has had to deal with “large data” will know that comprehending more than a certain amount of it is impossible, even unhelpful, because we can only tell a decent story about a few variables or patterns at a time. Approximations are more useful than high dimensional precision for story telling. So, we look for ways to project out certain “viewpoints” within a problem, perhaps summing up small contributions and replacing many things with one effective thing, or taking the different angles one at a time. As we do this, it’s important to understand the integrity of the “chain of evidence”, i.e. how the final answer emerged from the data, so that we understand its claim to being an honest story. If we lose that, then it’s hard to trust the result.

Chain of evidence is something graphs do well, but not something Euclidean space representations (tuple spaces) do well. Yet Euclidean methods are more familiar to us, and there’s a tendency to want to switch to a Euclidean picture at the first available opportunity. This is one of the challenges facing modern Machine Learning methods.

The causal path of a calculation may or may not be apparent in the results we see. Answers may be independent of the particular way something was calculated. This happens for so-called memoryless processes. Ironically, this makes it easier to trace them, because they have to expose all their information externally. Physicists traditionally prefer memoryless processes, because with less contextual dependency they feel more “objective”, but they are relatively rare in systems of greater complexity. All cognitive (learning) processes are based on cumulative path-dependent learning. The context dependence is an important discriminator or memory key.

With many calculation steps and many variables in play, one can easily lose track of whether techniques are answering the question we intended. Physicists have studied this problem in detail, employing detailed analyses of approximation techniques known as the renormalization group. Techniques like neural networks (e.g. Deep Learning) have a “coolness” factor, but it can be hard to understand how they calculate. Why should we care? Because the two use-cases above require very different measures of integrity.

The tools I’ve chosen to adopt, in this series, are just those of plain old procedural computer programming and data representation in ArangoDB’s multi-model database. They don’t imply a particular approach, so we need to find one — but that’s not too hard if we trust the structure and meaning inherent in graphs. Sticking with these tools may not always be the most efficient way to approach large data problems today, but I believe we should begin with understanding cart horses rather than jumping at passing zebras.

Let’s look at how the classical methods of approximating high dimensional problems with statistical mechanics can be used together with semantic spacetime to simplify computations, without losing sight of their intent. For any physicists out there, this is very like what’s known in the trade as Effective Field Theory.

### Controlling scale with tunable parameters

To try to make a story concrete, let’s examine a public data set, derived from a “real world problem” — one that has enough size and detail to think about the issues. The Amazon Product relational map is available from The Open Graph Benchmark project at Stanford University. The data set isn’t fully raw data, with full insight into the sales processes; it’s something of a hybrid, used for validating machine learning algorithms. But that, in itself, is useful to the extent that it forces us to think about the problem and to find compromises.

The Amazon data set consists of files, line by line, representing products purchased by online users, with only indirect supplemental information about the types of product and nothing explicit. We find:

- For each node, a coarse category label assigning it to one of 50 or so product categories, e.g. 1) Books, 2) Beauty Products, etc. In the four SST relations, this is a CONTAINS type.
- For each node, a feature vector which is the output of a machine-learned pre-classification of product descriptions into a 100 dimensional Euclideanized “word space”, where similar points are about similar topics. In the four SST relations, this is a NEAR=“IS LIKE” kind, with implicit links of a graph. However, the data are not presented as a graph, but rather as a set of tuples or points in a 100 dimensional vector space.
- For each pair of products co-purchased (on any occasion), an idempotent edge or link is present, forming a graph of all such time-integrated events. The links are unweighted edges, so the relative probabilities or frequencies are not recorded over the integration. In the four semantics, this is a NEAR=“COACTIVATED” kind of relationship. This is like a bioinformatic network of co-present proteins in some process, hence making the connection to bioinformatics noted in the previous post.

We might try to tell a story about the features. The feature vector data represents a second data set, and a kind of “hidden” graph of “similarity links” between similar feature vectors and its dense web of co-purchases. But it’s not explicit. We don’t need to expose this secondary graph explicitly, although had the data been cached and represented as a graph in the first place, the problem would have been simpler.

Suppose we want to make recommendations based on a product. We either want alternative product recommendations, i.e. similar products from the same category (books by Tolkien or Lewis), or products that complement one another, from any category (chocolates with roses). In the absence of explicit information about dependencies (batteries not included), we have to infer dependencies or complements by looking at what people purchased together. This is a long shot, but worth a try, and of course it’s the use-case for this test.

The graph edges have limitations. They are an aggregate, flattened projection of the co-purchased products averaged over the timeline of the dataset. Time (a sequence or some kind of FOLLOWS relation) is therefore absent from the data. There is an imposed assumption buried in this, which may lead to a false and fictitious result: co-purchasing is not a transitive relation. If A and B were purchased together on one occasion, and B and C were purchased together on another, it does not imply that A and C were ever purchased together. Moreover, this is not evidence of future purchasing needs: if I buy a toy, I need to buy a battery. If I buy a battery, I don’t need to buy a toy to go with it. If someone bought the toy and a torch with batteries on different occasions, it doesn’t mean it’s of any interest to buy the torch if you buy the toy, or vice versa. Many users of graphs don’t make use of this directedness, because it makes some computations harder! They should!
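The non-transitivity argument can be made concrete with a small sketch. This is not code from the repository: the `Graph` type and the `TwoHopOnly` helper are hypothetical names, and the three products are the toy, battery, and torch from the text. The function lists products reachable in two hops that were never directly co-purchased, i.e. exactly the inferences we would get wrong by assuming transitivity.

```go
package main

import (
	"fmt"
	"sort"
)

// Graph is an undirected, unweighted co-purchase graph stored as adjacency sets.
// (Hypothetical type for illustration, not part of the SST toolkit.)
type Graph map[string]map[string]bool

// AddEdge records that a and b were co-purchased at least once.
// Adding the same pair again changes nothing, so the edge is idempotent.
func (g Graph) AddEdge(a, b string) {
	if g[a] == nil {
		g[a] = make(map[string]bool)
	}
	if g[b] == nil {
		g[b] = make(map[string]bool)
	}
	g[a][b] = true
	g[b][a] = true
}

// HasEdge reports whether a and b were ever co-purchased.
func (g Graph) HasEdge(a, b string) bool {
	return g[a][b]
}

// TwoHopOnly returns the products that share a co-purchased neighbour with
// start but were never co-purchased with start directly, in sorted order.
func TwoHopOnly(g Graph, start string) []string {
	seen := make(map[string]bool)
	for mid := range g[start] {
		for end := range g[mid] {
			if end != start && !g.HasEdge(start, end) {
				seen[end] = true
			}
		}
	}
	result := make([]string, 0, len(seen))
	for name := range seen {
		result = append(result, name)
	}
	sort.Strings(result)
	return result
}

func main() {
	g := Graph{}
	g.AddEdge("toy", "battery")   // bought together on one occasion
	g.AddEdge("battery", "torch") // bought together on another occasion

	fmt.Println(g.HasEdge("toy", "torch")) // the data never say this
	fmt.Println(TwoHopOnly(g, "toy"))      // candidates for a false inference
}
```

Note that the sketch can only expose the missing transitivity. The directedness (toy needs battery, but not the reverse) cannot be recovered at all, because the edges in the data carry no direction.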

Suppose now we want to answer the question of what products to recommend given a first choice. Then analysis of the whole graph is of no interest, because that’s an entirely local process. Having a general idea of trends isn’t too revealing either. Probably, we could guess the answer. Similarly, if we’re interested in which products by name are considered similar, the whole graph is a distraction and there’s no need to render it as a plot.

With these aggregations, all we can say is that certain product categories are likely to be bought together with others. That’s the information which is inherent in the graph edges. What the Amazon strategy does to try to work around this is to use the description data to coarse grain and reduce the product basis at the outset. So there are only really two questions we can answer: possible likelihoods for co-purchasing different kinds of product (without necessarily understanding what they are in detail), and a meta-question about the efficiency of representing data by graph or as Euclidean vectors.

### Implementing analysis with Arangolangs

The repository on GitHub for this post contains the following programs, which you should execute in the following order.

1. amazon-load-data.go: load the graph into the ArangoDB SST model.
2. amazon-category-level-graph.go: count link nodes between type hubs.
3. amazon-coarse-product-nodes.go: count link nodes using knowledge of products inside which hubs.
4. amazon-coarse-product-nodes-v2.go: count link nodes based on correlating products by feature vector, then compare to type hubs.

The data can be loaded with the program amazon-load-data.go. The data files are line by line, so we use the line numbers as node labels.

```
% go run amazon-load-data.go
Reading nodes
5000 …
10000 …
15000 …
20000 …
25000 …
30000 …
35000 …
40000 …
45000 …
```

We use the different collections in the Semantic Spacetime toolkit to read the products into Nodes, then we use Hubs for the 50 or so product categories. We can immediately connect product nodes to their intended categories (see figure 2, left hand side).

Figure 2: We read in products (blue Nodes) and their product category (green Hubs). The nodes contain feature vectors based on product descriptions. One way to group nodes by original purchase context would be to join them to purchase events (white Fragments). However, this information is not in the data — it has been flattened, so we need to work around this.

The SST model uses:

- Nodes as purchases, i.e. products.
- Hubs as product categories (CONTAINment of a Node in a category).
- Fragments as coarse grained product category inferences (summarizing feature vector NEARness).

As graphs, different purchase events could also simply be hubs that united several products, but we don’t have that information, just like the event trails in the passport example or the traceroute map.

What could we do with the data in this SST graph? We might:

- Use the COACTIVATION graph to read off and suggest the linkage of products used together (referring later to the independent feature graph), but without ranking.
- Use the CONTAINS graph to find intentional product classifications, added by hand.
- Use the feature vector with a Euclidean distance function to renormalize the embedding of the graph in its imaginary Euclidean space of intentional product descriptions, in order to find other products that are similar to a given product. We can then check whether the product categories inferred by distance correlate with the product descriptions (the agreement is more than 50%).

Combining all these, we can see trends and patterns of co-purchasing. However, because the edge weights are missing, we have to rely on summation over different products, rather than different purchasing events, together with a renormalization for extracting significance.
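Here is a minimal sketch of that summation, under assumptions of my own: `Edge`, `CategoryPair`, `CoarseGrain`, and `Normalize` are hypothetical names, every edge is assumed to refer to a product with a known category, and the normalization (each category pair as a fraction of all edges) is just one simple choice, not necessarily the renormalization used in the repository. Each unweighted product-level edge contributes one count to the pair of categories it joins, so the weights that are missing at the product level reappear at the category level.

```go
package main

import "fmt"

// Edge is an unweighted, undirected co-purchase link between two products,
// identified by their index (line number) in the data.
type Edge struct{ A, B int }

// CategoryPair is an unordered pair of categories, stored with Low <= High.
type CategoryPair struct{ Low, High int }

// String formats the pair for printing.
func (p CategoryPair) String() string {
	return fmt.Sprintf("(%d,%d)", p.Low, p.High)
}

// CoarseGrain sums product-level edges into category-level counts.
// category[i] is the category of product i; every edge endpoint must be
// a valid index into category.
func CoarseGrain(edges []Edge, category []int) map[CategoryPair]int {
	weights := make(map[CategoryPair]int)
	for _, e := range edges {
		ca, cb := category[e.A], category[e.B]
		if ca > cb {
			ca, cb = cb, ca
		}
		weights[CategoryPair{ca, cb}]++
	}
	return weights
}

// Normalize converts counts into fractions of the total number of edges.
func Normalize(weights map[CategoryPair]int) map[CategoryPair]float64 {
	total := 0
	for _, w := range weights {
		total += w
	}
	fractions := make(map[CategoryPair]float64, len(weights))
	if total == 0 {
		return fractions
	}
	for pair, w := range weights {
		fractions[pair] = float64(w) / float64(total)
	}
	return fractions
}

func main() {
	// Made-up sample: products 0 and 1 in category 1, products 2 and 3 in category 2.
	category := []int{1, 1, 2, 2}
	edges := []Edge{{0, 1}, {0, 2}, {1, 3}, {2, 3}}

	weights := CoarseGrain(edges, category)
	fractions := Normalize(weights)
	cross := CategoryPair{1, 2}
	fmt.Println(cross, weights[cross], fractions[cross])
}
```

The design choice worth noting is that the sum runs over products, not purchasing events: a pair of products bought together a thousand times counts the same as a pair bought together once, which is why some renormalization is needed before reading significance into the totals.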

The format of the data presents challenges. There are two data sets, both tied to nodes. In one set, we use the nodes to form a point to point graph. In the other, we use the nodes as random samples in a Euclidean continuum. This is a common approach in a staged machine learning pipeline; however, it leads to subtle trouble for the analysis.