Not the OP, but as I understand this project, the idea is that you can represent a pipeline that looks like it's operating on an entire collection of data, but in reality can operate on a delta.
That means that if you have a million items in your input, and you add another million to it, your pipeline only needs to run on one million, even though the output will be two million. In this case, the input is a collection that may be modified at any point, not a linear sequence of "events".
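To make that concrete, here is a minimal sketch of the "operate on a delta" idea, with hypothetical names (IncrementalMap, apply_delta) rather than this project's actual API: the pipeline is written as if it maps over the whole collection, but only added or changed items are actually processed.

```python
def transform(item):
    # some per-item work, e.g. parsing or enrichment
    return {"id": item["id"], "value": item["value"] * 2}

class IncrementalMap:
    def __init__(self, fn):
        self.fn = fn
        self.output = {}  # materialized result, keyed by item id

    def apply_delta(self, delta):
        """delta: dict of id -> item (added/changed) or None (removed)."""
        for key, item in delta.items():
            if item is None:
                self.output.pop(key, None)        # propagate deletion
            else:
                self.output[key] = self.fn(item)  # only changed keys are recomputed
        return self.output

stage = IncrementalMap(transform)
stage.apply_delta({i: {"id": i, "value": i} for i in range(3)})     # initial batch: all items run
stage.apply_delta({i: {"id": i, "value": i} for i in range(3, 6)})  # later batch: only the new items run
```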
I've designed a similar system to ingest data that comes in the form of snapshots. The pipeline has many discrete steps and starts by splitting up the snapshot (e.g. a big monolithic JSON file) into smaller parts, which then get parsed, transformed, split, joined, etc. When a modified file comes in, the pipeline only performs actions on modified items; as data trickles through the pipeline, subsequent steps only run if a step actually changes an item (relative to current state). It significantly reduces processing.
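A rough sketch of that "only run downstream steps when an item actually changed" behaviour, with hypothetical names (Step, fingerprint) and not the real system: each step keeps a hash of its last output per item and short-circuits when a re-run produces the same result.

```python
import hashlib, json

def fingerprint(obj):
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

class Step:
    def __init__(self, fn):
        self.fn = fn
        self.state = {}  # item id -> fingerprint of last output

    def run(self, item_id, item):
        out = self.fn(item)
        fp = fingerprint(out)
        if self.state.get(item_id) == fp:
            return None          # unchanged: downstream steps can be skipped
        self.state[item_id] = fp
        return out

parse = Step(lambda raw: json.loads(raw))
enrich = Step(lambda rec: {**rec, "total": sum(rec["amounts"])})

def ingest(item_id, raw):
    parsed = parse.run(item_id, raw)
    if parsed is None:
        return                   # nothing changed, stop early
    enriched = enrich.run(item_id, parsed)
    if enriched is not None:
        print("updated:", item_id, enriched)

ingest("a", '{"amounts": [1, 2]}')  # runs both steps
ingest("a", '{"amounts": [1, 2]}')  # identical snapshot: pipeline stops after the first step
```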
Pachyderm works on a similar principle, but using files in a virtual file system.