Introduction

Networked VR applications require different types of logging, such as:

  1. Debug Logs: expected and exceptional events that occur during a regular session, for post-hoc debugging of high-level application code.
  2. Experiment Logs: application-specific data, such as measurements or questionnaire responses for an experiment.
  3. Network Traces: captures of network traffic, used to investigate reproducible low-level netcode bugs.

(1) and (2) are handled by Ubiq's Event Logging system. (3) has distinct performance implications, so it is handled separately.

Use Case

The Event Logging System collects low- and medium-frequency events from multiple peers. It can log both Ubiq and third-party events, which can then be extracted and analysed.

Events are discrete, but otherwise have very few restrictions. It is up to the user to ensure that event logging in their application doesn't negatively affect performance.

Overview

Events are generated by LogEmitter instances placed throughout the application. Events generated by these components are passed to a local LogCollector instance. The LogCollector then writes them to disk (or database, or other endpoint), or forwards them to another LogCollector that will.

Data flow diagram of the Event Logging System

LogEmitter

LogEmitter instances are lightweight objects that the application uses to log events. Calls to a LogEmitter are expected to remain in the code permanently, rather than being gated behind pre-processor defines.

The most common emitter types are the ContextLogEmitter, which is designed to work with Components that have a NetworkContext, and the ExperimentLogEmitter, designed for logging measurements in experiments.

public class VoipPeerConnectionManager : MonoBehaviour
{
    private NetworkContext context;
    private ContextLogEmitter debug;

    private void Start()
    {
        // Register with the NetworkScene to obtain a NetworkContext,
        // then bind an emitter to that context.
        context = NetworkScene.Register(this);
        debug = new ContextLogEmitter(context);
    }

    public void ProcessMessage(ReferenceCountedSceneGraphMessage message)
    {
        var msg = JsonUtility.FromJson<Message>(message.ToString());
        switch (msg.type)
        {
            case "RequestPeerConnection":
                // Log the receipt of this message, passing the remote
                // object id as the event argument.
                debug.Log("CreatePeerConnectionForRequest", msg.objectid);
                break;
        }
    }
}

The snippet above demonstrates the creation and use of a ContextLogEmitter. The VoipPeerConnectionManager declares the ContextLogEmitter debug and initialises it once a NetworkContext has been created. The Log method can then be called to log the receipt of a specific message.

Events are recorded as Json objects, for example:

{
    "ticks":637794978937526996,
    "peer":"aa9a2c8c-0c6f11c7",
    "type":"Ubiq.Voip.VoipPeerConnectionManager",
    "objectid":"aa9a2c8c-0c6f11c7",
    "componentid":50,
    "event":"CreatePeerConnectionForRequest",
    "arg1":"9020d814-45c41c19"
}

LogEmitter instances attach to a single LogCollector. The emitter constructors find the closest LogCollector automatically.

LogEmitter methods can be safely called from threads other than the Unity main thread. They should not, however, be called from outside CLR threads.

LogEmitter instances are designed to have zero overhead when logs are not actually written. The Log method has many overloads to avoid boxing, and serialisation runs only when an event will actually be written. Events are only written when there is a listening LogCollector in the scene. If the LogCollector is disabled or absent, or the Collector's EventsFilter ignores the emitter's event type, the emitter does nothing.

Create as many LogEmitter instances as needed. Individual emitters are simple, with few options. Use multiple LogEmitter instances within a class to gain fine-grained control over logging, for example to implement different log levels.

Events and Filters

A LogEmitter can tag events with a flag. A number of flags are pre-defined in the Ubiq.Logging.EventType enum. The underlying type is sbyte, so additional values can be used in an unchecked context.

[Flags]
enum EventType : sbyte
{
    Application = 1,
    Experiment = 2,
    Debug = 4,
    Info = 8
}

A LogCollector can be set to ignore one or more of these types through its EventsFilter member. If an event's flag is set in the filter, the event is logged; if it is unset, the event is not logged. Filtering applies only to local events: LogCollectors forward and write all external events regardless of the filter. Filtered events are not cached, but are lost permanently.
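The filter is a simple bitwise test of the event's tag against the filter. The following Python model illustrates the rule (EventType mirrors the C# enum above; should_log is an illustrative helper, not part of Ubiq):

```python
from enum import IntFlag

class EventType(IntFlag):
    """Mirrors the flags in Ubiq.Logging.EventType."""
    Application = 1
    Experiment = 2
    Debug = 4
    Info = 8

def should_log(events_filter: EventType, tag: EventType) -> bool:
    """An event is logged only if its flag is set in the filter."""
    return bool(events_filter & tag)

# A filter that keeps Experiment and Info events and discards Debug.
f = EventType.Experiment | EventType.Info
assert should_log(f, EventType.Experiment)    # logged
assert not should_log(f, EventType.Debug)     # lost permanently
```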

LogCollector writes events of different types to different streams/files. Prefixes are defined for the entries in Ubiq.Logging.EventType; other tags are displayed as their numerical value.

The LogCollector EventsFilter member and the Tag member of any LogEmitter can be changed at any time.

ComponentLogEmitter and ContextLogEmitter have their type set to Debug by default. The ExperimentLogEmitter and InfoLogEmitter convenience classes are also provided.

LogCollector

LogCollector instances receive events from local LogEmitter instances and remote LogCollector instances. They do one of three things with these events:

  1. Cache them (Buffering Mode)
  2. Write them to a local file (Writing Mode)
  3. Forward them to another LogCollector (Forwarding Mode)

The behaviour depends on the value of the destination member, which determines where the events should go.

In this way, LogCollector instances can be organised into complex tree arrangements.

Possible Tree Diagram of LogCollector Hierarchy

However, the expected use case is that Peers in a Group forward events directly to one Collector, which also writes them to disk.

Expected LogCollector Hierarchy

LogCollector writes events to files in the Persistent Data folder of whatever platform it is running on. It does this through a set of Ubiq.Logging.LogCollector.IOutputStream instances (one for each event type, created on demand).

The filenames include the event type and a timestamp.

Buffering

When a LogCollector is not Forwarding or Writing, it will cache or Buffer all events received (whether local or remote). If a LogCollector was previously Writing or Forwarding and stops, it will resume Buffering. When a LogCollector begins Writing or Forwarding, it will start with all the events in its buffer.

Events only leave the buffer when being sent to a safe destination (another LogCollector, or a file). In this way logging is lossless, regardless of the initialisation order and even when changing the Active LogCollector.

However, events that leave the buffer are gone permanently. If one LogCollector writes events and another LogCollector later becomes active, the second will only output events received after it became active. The file from the first LogCollector must be retrieved out-of-band to obtain the complete record of events.

This can be avoided by ensuring only one LogCollector becomes active during the lifetime of the Peer Group.

Encodings

Events are written as Json (Utf8 strings).

LogCollector attempts to make the output files Json-compliant by placing events in a top-level Json Array with comma-separated entries. However, if the file is being read live, or the process terminates unexpectedly, the file will not have the closing bracket ( ] ).

In this case the application reading the files should either add the bracket to the end of the stream or perform its own tokenisation.
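For example, a reader can repair a truncated file before parsing it. This is a Python sketch; the helper name and the trailing-comma handling are assumptions about how a partial write may end:

```python
import json

def parse_log(text: str) -> list:
    """Parse a LogCollector output file that may be missing its
    closing bracket (e.g. read live, or after an unexpected exit)."""
    text = text.strip()
    if not text.endswith("]"):
        # Drop a trailing comma left by an interrupted write, then
        # terminate the array the collector never closed.
        text = text.rstrip(",") + "]"
    return json.loads(text)

# A truncated file: opening bracket and two events, no closing bracket.
truncated = '[{"ticks":1,"event":"A"},{"ticks":2,"event":"B"},'
events = parse_log(truncated)
assert [e["event"] for e in events] == ["A", "B"]
```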

Changing the LogCollector

Only one LogCollector in a Peer Group may be in Writing Mode at any time.

The Active Collector can be changed by calling StartCollection on the new Collector. This can be done at any time.

When the Active Collector changes, the Peers must agree on the new Collector. This is a Distributed Agreement problem. LogCollector solves this using the Global Snapshot method [1].

The new Collector broadcasts a message with a logical clock value and atomically updates its cut state. When Peers receive the message, they compare the clock against their own; if it is greater, they update their cut state, otherwise the message is ignored. If a Collector is attempting to become Active and receives a message with the same clock value as its own, a collision has occurred. In this case, the Collector re-initialises its clock to a random value and re-transmits its state. If a message is received with a greater clock value, the Collector updates its own state as if it were any other Peer.

This algorithm assumes that Peers are fully connected and that message passing is reliable, both of which are true in Ubiq. The logical clock ensures that all Peers converge on the same state, regardless of the order in which the messages were received. The Peers have converged when all messages have been delivered.
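The state-update rule can be modelled in a few lines of Python. Peer, on_state and announce are illustrative names, not the Ubiq API, and collision handling via clock re-randomisation is omitted for brevity:

```python
class Peer:
    """Holds one peer's view of the cut: a logical clock and the
    identity of the Active Collector (None while buffering)."""
    def __init__(self):
        self.clock = 0
        self.collector = None

    def on_state(self, clock, collector):
        # Adopt the announced state only if its clock is greater;
        # otherwise the message is ignored.
        if clock > self.clock:
            self.clock = clock
            self.collector = collector

def announce(peers, clock, collector):
    # Reliable broadcast to every peer (Ubiq peers are fully connected).
    for p in peers:
        p.on_state(clock, collector)

peers = [Peer() for _ in range(3)]
# Two collectors volunteer with different clocks. Whatever order the
# messages arrive in, the higher clock wins on every peer, so all
# peers converge on the same Active Collector.
announce(peers, clock=1, collector="A")
announce(reversed(peers), clock=2, collector="B")
assert all(p.collector == "B" for p in peers)
```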

When the Active Collector leaves the Peer Group, the Clock is reset.

Limitations

As long as Peers remain connected, the system is lossless, even during convergence.

However, until convergence is reached, different events may flow to the old and new Active Collectors at the same time, spreading events across the two sets of log files. There is no upper bound on the convergence time, and no mechanism to check whether convergence has been achieved, so StartCollection should be called with care.

To surrender its position as the Active Collector, a Collector may call StopCollection; however, the time between this call and the Peers converging to null is again indeterminate.

When the Active Collector changes, the delay between the previous Collector receiving the new state and other Peers in the Group receiving it may cause the previous Collector to behave as a relay for a time.

Process Failure

If the Active Collector process fails, log events will be lost until the system converges on a new Collector or null.

If the Collector was part of a Room, the Rooms system can detect disconnection in some cases, and Collectors will update their cut state independently. The Collectors in this case will cache until a new Collector volunteers itself.

If a process that was not the Active Collector fails, then that process will simply not emit events. If that process was previously an Active Collector, it is still possible to lose events if the Collector was acting as a relay when the process failed.

For collecting data such as experimental logs, it is recommended to only ever have one collector on a process that can be monitored. For example, the logcollectorservice.

Verification

The LogCollector method GetBufferedEventCount returns the number of Events currently buffered. As LogCollector is multi-threaded, this count may change even while it is being returned. However, if it is known, for example, that an application will not generate any new Events of a particular type, it can be used to check whether all of those events have finished writing.

The LogCollector method Ping will ping the Active Collector.

All Log messages are delivered in order. Therefore these mechanisms can be used together to verify that all log messages for a particular application have been delivered successfully before it exits.

To do this, an application could:

  1. Write the last Event, e.g. a Questionnaire result.
  2. In the same thread, wait until the number of buffered events of that type is zero.
  3. Ping the Active Collector

When a response is received, the Event, and all those preceding it, will have been successfully delivered, and the application can safely exit.

This protocol is implemented in the WaitForTransmitComplete method.
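The shutdown sequence can be modelled with a queue drained by a writer thread. This is a Python sketch of the protocol's logic only; the queue and worker stand in for the LogCollector's buffer, write loop, GetBufferedEventCount and Ping, and none of it is the Ubiq API:

```python
import queue
import threading

log_buffer = queue.Queue()   # stands in for the LogCollector's buffer
written = []                 # stands in for the output file

def writer():
    # Stands in for the LogCollector's write loop, which drains the
    # buffer in order.
    while True:
        event = log_buffer.get()
        written.append(event)
        log_buffer.task_done()

threading.Thread(target=writer, daemon=True).start()

# 1. Write the last Event, e.g. a questionnaire result.
log_buffer.put({"event": "QuestionnaireComplete"})

# 2. Wait until the number of buffered events reaches zero. join()
#    blocks until every queued event has been processed, playing the
#    role of polling GetBufferedEventCount.
log_buffer.join()

# 3. A Ping round-trip to the Active Collector would now confirm
#    delivery; in this local model the drained queue already does.
assert written[-1]["event"] == "QuestionnaireComplete"
```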

However, this does not protect against process failure of an intermediary LogCollector, a condition which is irrecoverable.

Reliability

In order for WaitForTransmitComplete to confirm that an event has been successfully delivered, the integrity of the LogCollector processes must be fully visible. One way to achieve this is to ensure only one LogCollector is ever active, and that that LogCollector never forwards. In this case, if the Ping is received, then the logs must also be successfully delivered as they cannot have taken any other path to the collector.

LogCollector keeps track of this by default, and warns any caller of WaitForTransmitComplete if this condition does not hold.

Analysis

A LogCollector outputs a stream of structured logs in compliant Json. These logs can be fed to a stack such as ELK, processed with third-party tools such as Matlab or Excel, or processed programmatically in languages such as Python.
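As a minimal illustration of programmatic processing, a few lines of Python can summarise a log by event name. The field names follow the example event shown earlier; this is a toy sketch, not a substitute for the examples in the Analysis section:

```python
import json
from collections import Counter

# A small in-memory log in the format shown earlier.
raw = """[
{"ticks":1,"peer":"aa9a2c8c","event":"CreatePeerConnectionForRequest"},
{"ticks":2,"peer":"aa9a2c8c","event":"CreatePeerConnectionForRequest"},
{"ticks":3,"peer":"9020d814","event":"StartCollection"}
]"""

events = json.loads(raw)
# Count how often each event name occurs across all peers.
counts = Counter(e["event"] for e in events)
assert counts["CreatePeerConnectionForRequest"] == 2
assert counts["StartCollection"] == 1
```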

See the Analysis section for examples of how to process the logs.

[1] Kshemkalyani, A. D., & Singhal, M. (2008). Distributed Computing: Principles, Algorithms, and Systems. Cambridge University Press.