The Versioning System¶

After having an understanding of how the data needed for versioning is stored into the helper collections, we can go one level higher and look at the versioning system uses that data to perform the supported versioning operations. We will analyse the data structures built from the data stored in the auxiliary collections and illustrate with examples how they interact with the main tracked collection.

The Version Tree¶

The main concept behind this library is the version tree. This is not an actual data structure, and it is not physically implemented by any class or collection, but the version tree is the combination between the log tree and the set of per-document delta trees. We will proceed in explaining its components, since they have a direct correspondence in the tracking collections we have already seen in the previous section, and then we will come back to it. The only two points to remember so far are:

The version tree has the same nodes as the log tree

The edges of the version tree are the sets of deltas that need to be applied in between versions (nodes) to transform some or all of the documents of the collection.

A log tree. — Fig. 4 A log tree of a collection with two branches.¶

Before doing that, let us create a new database with structure presented in Fig. 4, and consider that the collection has three documents:

\(\mathcal D_1\) that is modified in each version of the collection, except in \(\texttt{1\_b}\)

\(\mathcal D_2\), that is inserted in \(\texttt{1\_m}\), and modified in \(\texttt{2\_m}\) and in \(\texttt{1\_b}\)

\(\mathcal D_3\) that was inserted once in \(\texttt{2\_m}\) on the main branch and once in \(\texttt{1\_b}\), on the b branch.

As a notation, the tags \(\texttt{0\_m}\), \(\texttt{1\_b}\), etc, written inside of each node represent a quicker way of referring to the versions (0, 'main'), (1, 'b') and so on.

Consider that each document contains a single field 'v' that is initially set to one and the it is incremented each time the document is modified. This setup can be reproduced using the following piece of code:

from versioned_collection import VersionedCollection
from pymongo import MongoClient
from bson.objectid import ObjectId

client = MongoClient("mongodb://localhost:27017")
db = client['versioned_collection_test']

def setup_database(database, collection_name):

    collection = VersionedCollection(db, collection_name)

    DOCUMENT1 = dict()
    DOCUMENT2 = dict()
    DOCUMENT3 = dict()

    def _increase_v_and_update(doc):
        doc['v'] += 1
        collection.find_one_and_replace(
            {'_id': doc['_id']}, doc
        )

    DOCUMENT1['v'] = 1
    DOCUMENT2['v'] = 1
    DOCUMENT3['v'] = 1

    collection.insert_one(DOCUMENT1)
    collection.init("0_m")

    _increase_v_and_update(DOCUMENT1)
    collection.insert_one(DOCUMENT2)
    collection.register("1_m")

    _increase_v_and_update(DOCUMENT1)
    _increase_v_and_update(DOCUMENT2)
    collection.insert_one(DOCUMENT3)
    collection.register("2_m")

    for i in range(2, 5):
        _increase_v_and_update(DOCUMENT1)
        collection.register(f"{i}_m")

    collection.checkout(1, 'main')
    _increase_v_and_update(DOCUMENT1)
    collection.register('0_b', 'b')

    _increase_v_and_update(DOCUMENT2)
    _increase_v_and_update(DOCUMENT3)
    collection.register("1_b")

    collection.checkout(4, 'main')
    return collection

collection = setup_database(db, 'versioning')

We will use this setup to explain the concepts of log tree and delta trees in the following subsections.

https://imgflip.com/i/nwpud — Fig. 5 source¶

The Log Tree¶

The Log Tree (see Fig. 4) stores the log or history of versions, each branch of the tree representing a versioning branch, therefore the log tree can be seen as a tree describing how the versions of the target collection have evolved in time. Each node of the tree corresponds to a registered version and is identified by the name of the branch on which it is located and the version ID for that branch. When a new branch is created, the version of the first node on that branch will be 0.

In versioned_collection branches are pointers to the version nodes of the log tree, as shown in Fig. 6. Here, we are checked out at version \(\texttt{3\_m}\) (remember that this is just notation for (3, 'main')), so the HEAD pointer, represented by the metadata collection points to that version, the 'main' branch pointer points to the latest version on the 'main' branch and the 'b' branch pointer points to the latest version on the 'b' branch. Each time a new version is added on a branch, the branch pointers will advance to point to the newly added version, and similarly when versions are removed.

Version registration is an atomic operation, so there can be a single operation that registers a new version running at any time. This means that the versions represented by the nodes are a partially ordered set. This makes sense since we’ve seen that the versions can be organised in the log tree, and the tree induces a partial order on its set of nodes.

Let \(\mathcal V\) be the set of all versions registered. The set \(\mathcal V\) can be seen as the nodes of the log tree \(\mathcal L\) (or the nodes of the version tree since they have the same nodes). Formally, \(\mathcal L\) is a collection of (finite) sequences of elements of \(\mathcal V\) in the form \(\langle v_0, v_1, \dots , v_n \rangle\) such that \(v_0 < v, \forall v \in \mathcal V \setminus \ \{v\}\) and each \(v_n\) is a version to which a branch pointer is pointing. Moreover, we have the property that if we have a sequence of length \(n\) \(\langle v_0, v_1, \dots , v_{n-1} \rangle \in \mathcal L\) and if \(0 \leq m < n\), then the truncated sequence \(\langle v_0, v_1, \dots , v_{m-1} \rangle\) also belongs to \(\mathcal L\).

Based on the ordering property of nodes of the log tree, we now define four related operations:

The first operation we define is the predecessor, denoted by \(pred\):

\[\begin{gather*} pred : \mathcal V \rightarrow \texttt{Set(} \mathcal V \texttt{)} \\ pred(v) = \{s \in \mathcal L \setminus \{v\} \mid s[-1] = v \} \end{gather*}\]

, which takes a version in \(\mathcal V\) and returns the set of versions that were registered before (on the path to the root of the tree).

Similarly, we define the successor operation:

\[\begin{gather*} succ : \mathcal V \rightarrow \texttt{Set(} \mathcal V \texttt{)} \\ succ(v) = \bigcup_{v' \in \mathcal V} pred(v') \setminus pred(v) \ \text{ s.t. } \forall s \in \mathcal L, s[-1] = v' \wedge v \in s \end{gather*}\]

This simply means that the predecessors of a version node are all versions registered in the log tree on the path from the given version to the root of the tree, and the successors of a version node are all node that are part of the subtree of \(\mathcal L\) rooted in the given version, excluding that version. For example, using the data in Fig. 6, we have that \(succ(\texttt{2\_m}) = \{ \texttt{3\_m}, \texttt{4\_m} \}\) and \(pred(\texttt{0\_b}) = \{ \texttt{1\_m}, \texttt{0\_m} \}\).

One useful notion from trees that automatically extends to log trees is the idea of lowest common ancestor of two nodes in the tree. This can be written as follows:

\[LCA(v_1, v_2) = \min_{v} \left ( v \in pred(v_1) \cap pred(v_2) \right )\]

Deltas and per-document Delta Trees¶

The Deltas¶

Deltas are documents of the __deltas_<target_collection_name> collection that store the forward and backward differences between two versions of a document. For each edge of Version Tree, there exists a non empty set of Deltas, that can be applied to some or all of the documents in the source version to change the state of the collection to the target version.

We’ve seen previously that the documents deltas collection contain the timestamp when the version for which the delta document was created, therefore deltas corresponding to a document of the collection form a partial order. This is intuitive and easy to understand since deltas are strictly linked with the nodes of the version tree, i.e., a delta transforms a document to bring it into the state it should be for the version the delta was registered. Since the total delta tree of a document forms a tree, and the deltas are partially ordered, then for each node n in the delta tree, it is guaranteed that any successor node has a timestamp larger than the timestamp of the node n. This is a similar concept to the ordering notions of version nodes in the log tree.

Before we move on, we will define some notation that will make easier to understand the concepts around delta trees.

Let \(\mathscr C\) be a versioned collection and let \(\mathscr C(v)\) be the state of the collection in version \(v \in \mathcal V\). Let \(\delta_{v | d}^{\mathcal D}\) represent a delta object, where \(\mathcal D \in \mathscr C(v), v \in \mathcal V\) and \(d\) is a direction in time. This means that \(\delta_{v | d}^{\mathcal D}\) is an object that acts on a document and transforms it to reflect the state of the collection at version \(v\). Deltas behave differently when applied forward (advance in time, when \(d = 1\)) or backwards (go back in time, when \(d = -1\)). By default, we think of the version tree as a tree with the root up, and the leaves down, so the default direction in time is forward. We do that because deltas are objects that ‘sit’ between versions, for example, if we have \(v, v' \in V\) such that \(v' \in succ(v)\) and we have a two versions \(\mathscr D \in \mathscr C(v), \mathscr D' \in \mathscr C(v')\) of the same document \(\mathcal D\) then \(\mathscr D + \delta_{v' | 1}^{\mathcal D} = \mathscr D'\) and \(\mathscr D' + \delta_{v' | -1}^{\mathcal D} = \mathscr D\). Notice that the delta object remained the same, but only the direction of application changed. Because of this we will usually omit the direction because it can be deduced from the context, when we have a ordered delta tree for a document.

When defining a delta \(\delta_v^{\mathcal D}\) we do not know what was the previous version in which \(\mathcal D\) existed, and we do not care, as long as the deltas are applied in the correct order, from a valid state of the versioned collection. This is reflected in the state of the deltas collection, where each document only records the version for which the delta was registered.

Example

Let’s look again at the example set in Fig. 4. The deltas registered by each register operation are presented in Fig. 7. As a shorthand notation, we used \(\delta_v^{\mathcal D\{1,2,...\}}\), for the versions where deltas were registered for multiple documents, so we have as many deltas as the numbers in the curly braces (one for each modified document). The first thing we notice is that deltas sit in between the nodes of version tree, so they represent the transitions that need to be applied to specific documents in the collection to change their state (and the state of the collection implicitly, since a collection is made out of documents).

Deltas visualised on the log tree. — Fig. 7 The deltas that modify the documents between two collection versions, displayed on the version tree.¶

Document Delta Trees¶

Now since we’ve briefly understood how deltas work, we can look at how deltas registered for a document are related. If we look at the deltas corresponding to a single document, we notice that they form a tree structure. For instance, for document \(\mathcal D_1\), we notice that it was modified in all versions in \(succ(\texttt{0\_m}) \setminus \{ \texttt{1\_b} \}\). If we want to go from version \(\texttt{0\_m}\) to version \(\texttt{4\_m}\), we’ll have to iteratively apply the deltas registered for the 'main' branch starting with \(\delta_{\texttt{1\_m}}^{\mathcal D_1}\), then apply \(\delta_{\texttt{2\_m}}^{\mathcal D_1}\) and so on up to (and including) \(\delta_{\texttt{4\_m}}^{\mathcal D_1}\). Similarly, if we are at version \(\texttt{0\_m}\) and want to go to version \(\texttt{1\_b}\), we’ll have to apply \(\delta_{\texttt{1\_m}}^{\mathcal D_1}\), \(\delta_{\texttt{0\_b}}^{\mathcal D_1}\) and \(\delta_{\texttt{1\_b}}^{\mathcal D_1}\) in order. We notice how this induces a tree structure on the deltas, which we will call the document delta tree or per-document delta tree, where the deltas, which are the transitions between versions, become nodes.

Delta tree nodes. — Fig. 8 Delta trees for \(\mathcal D_1\) (left) and \(\mathcal D_2\) (middle) and \(\mathcal D_3\) (right)¶

In addition to that, we observe how this document delta tree is sparser than the log tree (see Fig. 8). For \(\mathcal D_1\), there is no delta between versions \(\texttt{0\_b}\) and \(\texttt{1\_b}\), and for \(\mathcal D_3\) we only have two deltas: one between \(\texttt{1\_m}\) and \(\texttt{2\_m}\) and the other one between \(\texttt{0\_b}\) and \(\texttt{1\_b}\). These deltas are trees on their own (trees with a single node), but they are not connected. If \(\mathcal D_3\) had been modified, for instance, between \(\texttt{3\_m}\) and \(\texttt{4\_m}\), then a new delta \(\delta_{\texttt{4\_m}}^{\mathcal D_3}\) would have been registered between those versions, and \(\delta_{\texttt{2\_m}}^{\mathcal D_3}\) and \(\delta_{\texttt{4\_m}}^{\mathcal D_3}\) would have formed a tree. Again that tree would be sparser, since it jumps over version \(\texttt{3\_m}\), meaning that the version of \(\mathcal D_3\) in \(\texttt{2\_m}\) and \(\texttt{3\_m}\) stays the same and it is modified only when the collection changes its state to \(\texttt{4\_m}\).

We know that all deltas that modify a document can be grouped together by the identifier of the document \(\mathcal D\) that they modify and denote this collection as \(\Delta^\mathcal D\) which is defined as follows:

\[\Delta^\mathcal D = \{ \delta_v^\mathcal D \mid \forall v \in \mathcal V \ \text{ s.t. } \mathcal D \in \mathscr C(v)\}\]

Since the deltas can be ordered then we can construct the document delta tree, or the per-document delta tree \(\widetilde{\Delta^\mathcal D}\) that groups the deltas in \(\Delta^\mathcal D\) into a tree. In general, the number of nodes of \(\widetilde{\Delta^\mathcal D}\) may be larger than the cardinality of \(\Delta^\mathcal D\) and we will explain later why this is the case.

We saw that deltas can be grouped by the document they modify, but there is another useful way of doing this, and that is grouping them by the version then were registered in. Let \(\Delta_v\) represent the set of deltas that need to be applied to the right documents to change the version of the collection to version \(v\):

\[\Delta_v = \{ \delta_v^\mathcal D \mid \forall \mathcal D \in \mathscr C \ \text{ s.t. } \mathcal D \in \mathscr C(v) \} \]

Note that :

\[\{\delta_v^\mathcal D\} = \Delta_v \cap \Delta^\mathcal D, \ \ \forall v \in \mathcal V, \forall \mathcal D \in \mathscr C\]

As we did with the log tree nodes, we can define the notion predecessor and successor set by overloading the \(prev\) and \(succ\) operators on deltas:

\[\begin{gather*} pred : \mathcal \Delta^\mathcal D \rightarrow \ \texttt{Set(} \mathcal \Delta^\mathcal D \texttt{)} \\ pred(\delta_v^\mathcal D) = \{ \delta_{v'}^\mathcal D \in \Delta^\mathcal D \ \mid \ \forall v' \in \mathcal V \text{ s.t. } v' \in pred(v) \} \\ \\ succ : \mathcal \Delta^\mathcal D \rightarrow \ \texttt{Set(} \mathcal \Delta^\mathcal D \texttt{)} \\ succ(\delta_v^\mathcal D) = \{ \delta_{v'}^\mathcal D \in \Delta^\mathcal D \mid \ \forall v' \in \mathcal V \text{ s.t. } v' \in succ(v) \} \end{gather*} \]

Note that in the above definitions we restrict the deltas added to the returned sets to only those deltas that are part of \(\Delta^\mathcal D\), since there may not exist a delta for all registered versions, for a document. We’ve seed this above, when we said that delta trees are sparser than the log trees.

Based on this operators, we define the notion of a delta tree rooted in version \(v\) as:

\[\Lambda_v^\mathcal D = \begin{cases} \{\delta_v^\mathcal D \} \cup \ succ(\delta_v^\mathcal D) & \ \text{ if } \ \exists \delta_v^\mathcal D \in \Delta^\mathcal D \text{ s.t. } \ pred(\delta_v^\mathcal D) = \varnothing\\ \texttt{undefined} & \text{ otherwise} \end{cases}\]

Even though \(\Lambda_v^\mathcal D\) is just a set of deltas, we refer to it as a tree, since the its elements can be structured as the nodes of a tree. Note that \(\Lambda_v^\mathcal D\) always exists, since if there exist at least one delta in \(\Delta^\mathcal D\), then we can build a delta tree with it, otherwise the document was never modified. For example, looking at Fig. 8, \(\Lambda_v^{\mathcal D_1}\) is the tree rooted in \(\delta_{\texttt{1\_m}}^{\mathcal D_3}\) that contains 5 nodes including the root, and \(\Lambda_v^{\mathcal D_2}\) is the tree rooted in \(\delta_{\texttt{1\_m}}^{\mathcal D_2}\) that has two children: \(\delta_{\texttt{2\_m}}^{\mathcal D_2}\) and \(\delta_{\texttt{1\_b}}^{\mathcal D_2}\).

Unconnected Document Delta Trees¶

In Fig. 8 we placed the delta nodes on top of the versions of the collection on which they were registered, but a better way of thinking about deltas is shown in Fig. 7, since deltas really belong in the the version tree. The reason why that way of visualising delta trees is problematic since we know that all deltas of a document form a tree, but the figure for \(\mathcal D_3\) does not look like a tree since the two deltas do not seem to be related, or in other words are unconnected. The greyed out nodes that represent versions are added just for context, but they give us a hint that version \(\texttt{1\_m}\) somehow relates the deltas \(\delta_{\texttt{2\_m}}^{\mathcal D_3}\) and \(\delta_{\texttt{1\_b}}^{\mathcal D_3}\).

We distinguish two situations that can happen when a delta is created for a document in the collection:

It’s the first time the document is ever modified. This means that it is freshly added to the collection, or it was part of the initial state of the collection and now it’s finally modified (in any way). Under this scenario, a delta is created the first time for that document, so it has no parent.

The document was modified before, therefore there exists a delta registered for it, and the newly added delta is linked to the previous delta.

The links between deltas can be seen in the documents of the deltas collection (fields next and prev). Based on this, we define a new operator \(prev\) as follows:

\[\begin{gather*} prev : \Delta^\mathcal D \rightarrow \ \Delta^\mathcal D \cup \{ \texttt{null} \} \\ \\ prev(\delta_v^\mathcal D) = \begin{cases} \delta_{v'}^\mathcal D & \text{ where } \ v' \in pred(v) \wedge \ \delta_{v'}^\mathcal D \in \Delta^\mathcal D \wedge \ dist(\delta_v^\mathcal D, \delta_{v'}^\mathcal D) = 1 \\ \texttt{null} & \text{ otherwise } \end{cases} \end{gather*}\]

, where \(dist\) is recursively defined as:

\[dist(\delta_v^\mathcal D, \delta_{v'}^\mathcal D) = \begin{cases} \infty & \text{if } \delta_v^\mathcal D \in \Lambda_v^\mathcal D, \ \delta_{v'}^\mathcal D \in \Lambda_{v'}^\mathcal D, \ \Lambda_v^\mathcal D \neq \Lambda_{v'}^\mathcal D \\ 0 & \text{if } v = v' \\ | pred(\delta_v^\mathcal D) \cap succ(\delta_{v'}^\mathcal D) | + 1 & \text{if } v' \in pred(v) \\ | pred(\delta_{v'}^\mathcal D) \cap succ(\delta_v^\mathcal D) | + 1 & \text{if } v \in pred(v') \\ dist(\delta_v^\mathcal D, \delta_{\tilde v}^\mathcal D) + \ dist(\delta_{v'}^\mathcal D, \delta_{\tilde v}^\mathcal D) & \text{otherwise, where } \tilde v = LCA(v,v') \end{cases}\]

The above definition for \(prev\) is quite intuitive and says that the previous delta of a given delta \(\delta_v^\mathcal D\) is \(\texttt{null}\), i.e., it does not exist, if the given delta is the root of a delta tree, or it is a delta in the set of document deltas, that is applied immediately before to transform the document such that \(\mathcal D \in \mathscr C(v)\).

We observe that the distance between two deltas is properly defined only for deltas that are part of the same delta tree rooted in a common version that precedes the two versions for which the argument deltas are registered.

If the two versions are the same, then the deltas are the same, so the distance between them is 0.

If the deltas are on the same tree branch (not versioning branch) then the distance between them is the number of deltas registered in between them (intuitive, right?).

Otherwise the two deltas are registered on two different branches, therefore the distance between them is the sum of distances between each delta and the lowest common ancestor of the versions of the two deltas.

Any deltas that have infinite distance between them are part of different unconnected document delta trees. This type of object arises, the unconnected delta tree, when the collection has branches and the same object is added independently to the collection on both branches, therefore two separate (root) deltas are registered for that document.

We have already seen this in Fig. 8, for document \(\mathcal D_3\). In this case we have two seemingly unrelated delta trees rooted in \(\texttt{2\_m}\), i.e., \(\Lambda_{\texttt{2\_m}}^{\mathcal D_3}\) and \(\texttt{1\_b}\), i.e., \(\Lambda_{\texttt{1\_b}}^{\mathcal D_3}\). If we modify \(\mathcal D_3\) again on either of the two existing branches, the new created deltas will be added as the children of the delta tree for that branch. This means that we can group the unconnected delta trees into the set of all unconnected per-document delta trees:

\[\Lambda^\mathcal D = \{ \Lambda_v^\mathcal D \mid \forall v \in \mathcal V \ \exists \delta_v^\mathcal D \in \Delta^\mathcal D \text{ s.t. } \ prev(\delta_v^\mathcal D) = \texttt{null} \}\]

The ‘complete’ per-document Delta Tree with unconnected components¶

When we talked about document delta trees, we said that for a document \(\mathcal D\) and from the set of deltas action on the document, which we called \(\Delta^\mathcal D\), we can create a document delta tree \(\widetilde {\Delta^\mathcal D}\) that can have more nodes than elements in \(\Delta^\mathcal D\). If \(| \Lambda^\mathcal D | = 1\), i.e., we have a single document delta tree, then \(\widetilde {\Delta^\mathcal D}\) can be built only from the elements of \(\Delta^\mathcal D\), so we call it complete. In fact the set \(\Delta^\mathcal D\) is complete in the sense that it contains all the necessary elements to build a delta tree for document \(\mathcal D\). If \(| \Lambda^\mathcal D | > 1\), then \(\mathcal D\) has at least two unconnected delta trees. Remember that we want to create a single delta tree for \(\mathcal D\), so we can somehow link the delta trees together. We can do that by creating a new tree, which has an identity delta as the root, and the roots of each tree in \(\Lambda^\mathcal D\) will be connected as the new tree (as the children of the identity delta). By doing that we can navigate between the leaves of one of the trees to the leaves of the other delta tree, by going up to the root of the first tree, stepping into the new root (the identity delta) and the going down to the desired root. We define the identity delta \(\tilde{\delta}_v^\mathcal D\) as the delta with the following property:

\[\delta_{v | d}^\mathcal D + \tilde{\delta}_{v' | d' }^\mathcal D = \ \tilde{\delta}_{v' | d'}^\mathcal D + \delta_{v | d}^\mathcal D = \ \delta_{v | d}^\mathcal D, \quad \ \forall v, v' \in \mathcal V, \ \forall \delta_{v | d}^\mathcal D \in \Delta_v^\mathcal D, \ \forall d \in \{1, -1\}\]

We now know what should the new root of the complete document delta tree looks like, and that is should be the identity delta. The next question is where should it placed into the version tree?

Delta objects are strictly linked to the versions of the version tree, since a delta is a transition applied to a document to change it so that it aligns with the state of the collection in that version (this is how we defined deltas previously). Looking back at the example in Fig. 8, we see that \(\mathcal D_3\) was modified in versions \(\texttt{2\_m}\) and \(\texttt{1\_b}\). Remember that we overloaded the \(pred\) and \(succ\) operators for deltas and the way we defined what deltas precede each delta is strictly related to the notion of predecessor and successor of versions in the log tree \(\mathcal L\), which is the intuitive notion of predecessor and successor sets on a tree. Since we don’t want to break those definitions and our way of ordering deltas based on the nodes of \(\mathcal L\), where we put the new root of the complete document delta tree matters. Therefore the new delta tree root should be linked to the version that is the lowest common ancestor of the versions for which the roots of the unconnected trees are registered.

The complete delta tree for D3. — Fig. 9 The delta tree for document \(\mathcal D_3\)¶

Using the example in Fig. 8, Fig. 9 shows the complete delta tree \(\widetilde {\Delta^{\mathcal D_3}}\), where \(\texttt{1\_m} = LCA(\texttt{2\_m}, \texttt{1\_b})\). For completeness, Algorithm 1 shows how to compute a complete document delta tree.

             \begin{algorithm}
 \caption{Build a complete delta tree}
 \begin{algorithmic}
 \PROCEDURE{build-complete-delta-tree}{$\Lambda^\mathcal D$}
 \IF{$| \Lambda^\mathcal D | = 1$}
     \Return $\Lambda^\mathcal D$
 \ENDIF
 \WHILE{$| \Lambda^\mathcal D | > 1$}
     \STATE overlaps = []
     \STATE pairs = []
     \FOR{$ \Lambda_v^\mathcal D \in \Lambda^\mathcal D$ }
         \FOR{$ \Lambda_{v'}^\mathcal D \in \Lambda^\mathcal D $}
             \STATE overlaps.append($ | (prev(v) \cap pred(v') |$)
             \STATE pairs.append($v, v'$)
         \ENDFOR
     \ENDFOR
     \STATE i = \textbf{argmax} overlaps
     \STATE $v, v'$ = pairs[i]
     \STATE $\tilde v = LCA(v, v')$
     \STATE pop $\Lambda_{v'}^\mathcal D$ from $\Lambda^\mathcal D$
     \STATE pop $\Lambda_{v}^\mathcal D$ from $\Lambda^\mathcal D$
     \STATE $\Lambda_{\tilde v}^\mathcal D = \Lambda_{v}^\mathcal D \cup \Lambda_{v'}^\mathcal D \cup \{ \tilde{\delta}_{\tilde v}^\mathcal D \} $
     \STATE add $\Lambda_{\tilde v}^\mathcal D$ to $\Lambda^\mathcal D$
 \ENDWHILE
 \State $\widetilde{\Delta^\mathcal D} = \Lambda^\mathcal D$
 \Return $\widetilde{\Delta^\mathcal D}$
 \ENDPROCEDURE
 \end{algorithmic}
 \end{algorithm}

Now we have all the components needed to properly understand the Version Tree. Fig. 8 shows the version tree for our toy example. Let \(\mathscr C\) pe the 'versioning' collection, then:

the nodes of tree are elements of \(\Delta^\mathcal D,\ \forall \mathcal D \in \mathscr C\), and

each edge of the version tree that connects two adjacent versions \(v\) and \(v'\), where \(v' < v\), is the set \(\Delta_v\).

We also understand that elements of \(\Delta^\mathcal D\) can be grouped into a set of unconnected delta trees \(\Lambda^\mathcal D\), which can be transformed into a complete document delta tree \(\widetilde{\Delta^\mathcal D}\).

This concludes the more ‘theoretical’ part of this document. There are several arguments why we included it in the tutorial in the first place:

It makes easier to explain how versioning works. We defined a set of notations that allow to quickly refer to elaborate concepts using just a few symbols.

It provides complete picture of the data structures used in versioning, so it is an alternative way of analysing the code

Detecting changes and registering new versions¶

To detect and compute the changes of made to the target collection, we listen and process the Change Streams to detect the modified documents. Those documents are tagged as modified, and tracker documents are added to the modified collection.

We have already seen this before in the section where we discussed the structure of the tracking collections. The missing piece of the puzzle is how to compute the changes, i.e, the deltas between the versions of the modified documents.

The Replica Collection¶

The Change Streams are a powerful concept, but they have certain limitations. They are efficient and fast, but they do not offer full support for a versioning application. A change stream event linked to a document can return either the whole updated/ modified document, or the changes made in the forward direction, i.e., we get information about something inserted or deleted into a document, but when a filed is updated we will only know the new value.

Note

The above is true for MongoDB versions < 6.0. MongoDB 6.0 adds a fullDocumentBeforeChange field, which can be used to simplify the versioning algorithm. Instead of storing a whole replica collection, we can only have a partial replica, that contains only the modified documents. The tradeoff between fast queries and fast versioning will still exist.

This versioned_collection library stores only the changes made to a document and not the whole document. For efficiency purposes, it needs a mechanism of knowing the state of the collection before changes were made to it. This is the purpose of the replica collection.

Registering a new version¶

At any given time, there are two versions of the target collection \(\mathscr C\) stored into the system: one as the working collection (the original collection) and one as the replica collection.

Version registration — Fig. 10 Registering a new version \(v'\) on the collection \(\mathscr C\) after modifying the collection.¶

Suppose we are at version \(v\) of collection \(\mathscr C\) and there are no changes. Then the replica and the working version of the collection \(\mathscr C\) i.e., \(\mathscr C(v)\) will be the same. Suppose we modify some documents and let’s take a look at Fig. 10. After the changes are made, the collection is in an intermediary state \(\widetilde{\mathscr C}(v)\), and now the replica collection is in state \(\mathscr C(v)\).

To register a version \(v'\), we want to do a couple of things:

add a new node for \(v'\) to \(\mathcal V\), which automatically adds a new sequence ending in \(v'\) to \(\mathcal L\), which basically means that we extend the log tree, so we log a new version into the logs collection;

store the the set of deltas \(\Delta_{v'}\) into the deltas collection;

update the branches collection in case the new version was registered on a new branch, or increase the version the branch tip is pointing to, if registered on the same branch;

update the metadata collection;

remove the tracker documents;

create a new replica.

After registering the new version \(v'\), we notice (second row of Fig. 10) that the replica collection becomes a copy of the working collection (both being in state \(\mathscr C(v')\)).

Navigating between versions¶

Navigating between versions, a.k.a checking out a version is another fundamental operations in versioned_collection. Let’s look once again at Fig. 8, and suppose the current version is \(v = \texttt{4\_m}\) and we want to checkout version \(v' = \texttt{1\_b}\). What we want to do is to traverse the version tree between the two version, and each time we encounter an edge that connects versions \(v_1\) and \(v_2\) we apply the corresponding deltas \(\Delta_{\tilde v}\) where \(\tilde v = v_1\) if \(v_1 \in pred(v_2)\), or \(\tilde v = v_2\) if \(v_2 \in pred(v_1)\). Note that in the first case the direction of the deltas that needs to be backward (or \(d=-1\)) and in the second case the direction should be forward (or \(d=1\)).

Such a traversal of the version tree can be broken down in three steps:

Get the path between versions \(v\) and \(v'\)

Build the document delta tree and compose (merge) the deltas collected in the order of the traversal, for each document modified between the two version

Apply the ‘merged’ deltas to the corresponding documents of \(\mathscr C(v)\) to get to \(\mathscr C(v')\)

Getting the path between two versions¶

Note that there are two possible cases that describe the relative position of two versions \(v\) and \(v'\):

\(v\) and \(v'\) are on the same branch in the version tree, i.e., \(\exists s \in \mathcal L \text{ s.t. } v \in s \ \text{ and } v' \in s.\)

\(v\) and \(v'\) are on different branches.

We can use the way the log tree \(\mathcal L\) is defined to get the sequences that contain the versions of interest to compute the path, but there is a simpler way, that’s also easily implementable, to achieve that, which uses the already defined notion of lowest common ancestor.

Let \(\Pi_{v, v'}\) be the (ordered) path between \(v\) and \(v'\) which is defined as follows:

\[\begin{align*} v^* &= LCA(v,v') \\ \Pi_{v, v'} &= \ [(v, d)] + \ path(v, v^*, d) + P\\ P &= \begin{cases} [(v', d)] & \text{if } C \\ [(v^*, -d)] + path(v^*, v', -d) + [(v', -d)] & \text{otherwise } \end{cases}\\ C &= v \equiv v^* \vee v' \equiv v^*\\ d &= 1 \text{ if } v \equiv v^* \text{ else } -1\\ path(v_1, v_2, d) &= \begin{cases} sorted(succ(v_1) \cap pred(v_2), <) & \text{if } d = 1\\ sorted(pred(v_1) \cap pred(v_2), >) & \text{if } d = -1 \end{cases}\\ \end{align*}\]

where \(sorted(S, op)\) returns a list where \(\forall i, j \leq |S|\) if \(op(i, j)\) then \(op(S[i], S[j])\).

This may seem a convoluted definition, but it has a simple implementation, if we store the level of a node in the log tree. The root node has level 0, it’s direct children have level 1, their children have level 2 and so on. This means associating each sequence in \(\mathcal L\) with its length.

Based on this notion, the path between two versions and their lowest common ancestor can be computed in the same time. Simply check the levels of the nodes corresponding to the two versions: if they are not equal, go up in the tree until the levels are equal, while keeping track of the traversed nodes. If the levels are equal, check if the versions are equal (which means we were in case 1, see above). If the version are not equal, the versions are on different branches, so continue to go up until the levels are equal and then stop. The version where we stopped is the LCA. Since we recorded the path along the way, we just need to put it in the right order, by checking the direction and the order of the nodes based on the position of the versions in the tree.

Recovering the deltas to apply¶

Using the path \(\Pi_{v, v'}\) between versions \(v\) and \(v'\) we can retrieve the set of deltas \(\Delta_{v^*}\) for each version \(v^*\) in \(\Pi_{v,v'}\). We can then use all \(\Delta_{v^*}\) to create the document delta tree from which we extract the deltas.

Observe that since we already have a path between versions, we do not need to build \(\Lambda^\mathcal D,\ \forall \mathcal D \in \mathscr C(v^*)\), \(\forall v^* \in \Pi_{v, v'}\), because we only care about the parts of the delta trees that overlap with our path. That being said, we can compute the partial unconnected document delta tree rooted in a version \(v^*\) that contains a version \(v\) as follows:

\[\Gamma_{v^*, \subset v}^\mathcal D = \left ( \Lambda_{v^*}^\mathcal D \ \setminus \{ \delta_{v^*}^\mathcal D \} \right ) \cap \ pred(\delta_v^\mathcal D) \cup \{\delta_v^\mathcal D\}\]

Having this object, to go from version \(v\) to \(v'\) on path \(\Pi_{v, v'}\), we need to compute the partial trees rooted in the lowest common ancestor of the two versions:

\[\widetilde{\Gamma_{\Pi_{v, v'}}^\mathcal D} = \ \Gamma_{LCA(v, v'), \subset v}^\mathcal D \bigcup \ \Gamma_{LCA(v, v'), \subset v'}^\mathcal D\]

and finally use Algorithm 1 to build the per document partial delta tree:

\[\widetilde{\Delta^\mathcal D} = \ \texttt{build-complete-delta-tree}(\widetilde{\Gamma_{\Pi_{v,v'}}^\mathcal D})\]

In the definition of \(\Gamma_{v^*, \subset v}^\mathcal D\) we excluded since we do not care about it, and it is replaced by the identity delta \(\tilde{\delta}_{v^*}^\mathcal D\) when the (partial) complete delta tree is built.

Algorithm 2 shows the step needed to transform a document between two versions \(v\) to \(v'\) after computing the path in the version tree between versions.

             \begin{algorithm}
 \caption{Compute the new version of a document during checkout}
 \begin{algorithmic}
 \PROCEDURE{transform-document}{$\mathcal D, \Pi_{v, v'}$}
     \STATE $v^* = LCA(v, v')$
     \STATE $\Gamma_{v^*, \subset v}^\mathcal D = \left ( \Lambda_{v^*}^\mathcal D \setminus \{ \delta_{v^*}^\mathcal D \} \right ) \cap pred(\delta_v^\mathcal D) \cup \{\delta_v^\mathcal D\}$
     \STATE $\Gamma_{v^*, \subset v'}^\mathcal D = \left ( \Lambda_{v^*}^\mathcal D \setminus \{ \delta_{v^*}^\mathcal D \} \right ) \cap pred(\delta_{v'}^\mathcal D) \cup \{\delta_{v'}^\mathcal D\}$
     \STATE $\widetilde{\Gamma_{\Pi_{v, v'}}^\mathcal D} = \Gamma_{v^*, \subset v}^\mathcal D \bigcup \Gamma_{v^*, \subset v'}^\mathcal D$
     \STATE $\widetilde{\Delta^\mathcal D} = \texttt{build-complete-delta-tree}(\widetilde{\Gamma_{\Pi_{v,v'}}^\mathcal D})$
     \STATE $\delta^* = \tilde{\delta}$
     \STATE $\tilde v = v$
     \WHILE{$\tilde v \neq v'$}
         \STATE $(\tilde v, d) = \Pi_{v, v'}[\tilde v]$
         \STATE $\delta^* = \delta^* + \delta_{\tilde v | d}^\mathcal D$
         \STATE $\tilde v = next(\widetilde{\Delta^\mathcal D}, \tilde v, d)$
     \ENDWHILE
     \Return $\mathcal D + \delta^*$
 \ENDPROCEDURE
 \end{algorithmic}
 \end{algorithm}

Algorithm 2 is computed for all necessary documents, and in the end, the collection will have the correct state.