Sesam Service Instance

Sesam is a general purpose data integration and processing platform. It is optimised for collecting or receiving data from source systems, transforming data, and providing data for target systems.

Sesam collects raw data from source systems and stores it in datasets. Transformation processes can join data across datasets to create new shapes of data. Data from these datasets is then exposed and delivered to other systems via push or pull. The entire system is driven by the state changes of entities. The technology behind each of these stages, Collect, Store, Transform and Deliver, is simple, consistent and powerful.

Sesam uses simple JSON-based web protocols for moving data between systems, offers a powerful log-based datahub for storage and data processing, and provides simple extension points that allow developers to connect to systems that don't have out-of-the-box adaptors.
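To make this concrete, here is a sketch of a single entity as it might travel over these JSON-based protocols. The business properties ("name", "city") and the id value are made up for illustration; the underscore-prefixed properties are the kind of metadata the datahub uses to track identity, change and deletion.

    {
      "_id": "customer:42",
      "_updated": 1031,
      "_deleted": false,
      "name": "Acme Corp",
      "city": "Oslo"
    }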

For complete insight into how things work, check out the comprehensive API documentation.

[Image: data flowing into the hub and out to new systems]

Data Sources

In Sesam, a data source is a single stream of data that flows from a source system into the datahub, e.g. a database table or the entities exposed by a REST endpoint.

Sesam provides a number of built-in adaptors that can expose a stream of entities from the underlying system. These entities are pulled by the datahub. (It is also possible to push data to the hub, but pull is more robust.)

Data sources always expose a complete set of the data. Optionally, they can return an ordered stream and support the datahub asking only for the entities that have changed since a given point.
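As an illustrative sketch (not the exact wire format), a datahub pull of "only what has changed since offset 1030" might get back an ordered JSON array like the one below, where each entity carries the marker the datahub will use as its next 'since' value, and deletions are flagged rather than omitted:

    [
      { "_id": "customer:42", "_updated": 1031, "name": "Acme Corp" },
      { "_id": "customer:57", "_updated": 1032, "_deleted": true }
    ]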

[Image: example source code]

Datasets

In the datahub data is stored in datasets. A dataset is a log of entities supported by primary and secondary indexes. A dataset sink can write entities to the dataset. The dataset appends the entity to the log if and only if it is new or if it is different from the most recent version of the same entity.

A dataset source exposes entities from a dataset so that they can be streamed through pipes. As the main data structure is a log, the source can read from a specific location in the log.

Datasets have a primary key index as well as dynamic secondary indexes. The secondary indexes are added based on the types of joins performed in DTL.
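Conceptually, the log of a 'customer' dataset might look like the sequence below (the storage format is internal to the datahub; offsets and values are illustrative). The second write of "customer:42" is appended only because its content differs from the earlier version, and the primary key index always points at the latest offset for each "_id":

    { "_updated": 0, "_id": "customer:42", "name": "Acme" }
    { "_updated": 1, "_id": "customer:57", "name": "Globex" }
    { "_updated": 2, "_id": "customer:42", "name": "Acme Corp" }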

[Image: dataset model]

Pipes

A pipe is composed of a source, a transformation chain, a sink, and a pump. It is an atomic unit that makes sure that data flows from the source to the sink at defined intervals. It is a simple way to talk about the flow of data from a source system to a target system. The pipe is also the only way to specify how entities flow from dataset to dataset.

A data source is a component hosted in Sesam that exposes a stream of entities. Typically, this stream of entities will be the rows of data in a SQL database table, the rows in a CSV file, or JSON data from an API.

A transformation chain takes a stream of entities, transforms them, and creates a new stream of entities. Several different transform types are supported; the primary one is the Data Transformation Language transform, which uses DTL to join and transform data into new shapes.

A data sink is a component that can consume entities fed to it by a pipe. The sink is responsible for writing these entities to the target system or dataset, and for handling transactional boundaries and batching of multiple entities where the target system supports them.

[Image: a pipe and its components]

The pump is a scheduler that handles the mechanics of 'pumping' data from a source to a sink. It runs periodically or on a schedule, reading entities from the data source and writing them to the data sink. It is also capable of rescanning the data source from scratch at configurable points in time.
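Putting these pieces together, a pipe is configured as a single JSON object containing its source, transform, sink and pump. The sketch below is indicative rather than authoritative: the dataset names and the 30-second interval are made up, and key names may differ between Sesam versions.

    {
      "_id": "customer-copy",
      "type": "pipe",
      "source": {
        "type": "dataset",
        "dataset": "customer"
      },
      "transform": {
        "type": "dtl",
        "rules": {
          "default": [
            ["copy", "*"]
          ]
        }
      },
      "sink": {
        "type": "dataset",
        "dataset": "customer-copy"
      },
      "pump": {
        "schedule_interval": 30
      }
    }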

Data Transformation Language

The Data Transformation Language is used to construct new data from existing data. DTL transforms should only be applied to data in a dataset. DTL has a simple syntax and model in which the user declares how to construct a new data entity, using commands such as 'add', 'copy', and 'merge'. In general, DTL is applied to the entities in a dataset and the resulting entities are pushed into a sink that writes to a new dataset. The new dataset is then used as the data source for pipes whose sinks write the data to external systems.
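For example, a DTL rule set that copies all properties of the source entity and adds a couple of new ones might look roughly like this (property names and values are illustrative; '_S' refers to the source entity):

    {
      "type": "dtl",
      "rules": {
        "default": [
          ["copy", "*"],
          ["add", "type", "customer"],
          ["add", "full-name", "_S.name"]
        ]
      }
    }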

To see the full capabilities of DTL, check out the DTL Documentation in the developer docs.


Dependency tracking: One of the really smart things Sesam can do is understand complex dependencies in DTL. This is best described with an example. Imagine a dataset of customers and a dataset of addresses. Each address has a property 'customer_id' that is the primary key of the customer entity to which it belongs. A user creates a DTL transform that processes all customers and creates a new 'customer-with-address' structure that includes the address as a property. To do this, they can use the 'hops' function to connect the customer and address. This DTL transform forms part of a pipe, so when a customer entity is updated, added or deleted it will be at the head of the dataset log and will get processed the next time the pump runs. But what if the address changes? The expected result is that the customer containing that address should now be considered changed.

This is a tricky problem, but one that Sesam takes care of automatically. The DTL language allows us to introspect the transform to see which dependencies exist. Once we understand the dependencies, we can create data structures and events that recognise that a change to an address should put the corresponding customer entity at the front of the dataset log. Once it is there, it will be pulled the next time the pump runs, and a new customer entity containing the updated address is exposed.
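A hedged sketch of the customer/address example above: the 'hops' join below declares that the transform reads the 'address' dataset and joins it on 'customer_id', and it is exactly this declaration that lets Sesam derive the dependency and re-process the owning customer when an address changes. Dataset and property names follow the example in the text; the precise syntax may vary between versions.

    {
      "type": "dtl",
      "rules": {
        "default": [
          ["copy", "*"],
          ["add", "address",
            ["hops", {
              "datasets": ["address a"],
              "where": [
                ["eq", "_S._id", "a.customer_id"]
              ]
            }]
          ]
        ]
      }
    }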