Operators
Operators
Operators transform one or more DataStreams into a new DataStream.
DataStream Transformations
- Map
DataStream → DataStream
Takes one element and produces one element. - FlatMap
DataStream → DataStream
Takes one element and produces zero, one, or more elements. - Filter
DataStream → DataStream
Evaluates a boolean function for each element and retains those for which the function returns true. - KeyBy
DataStream → KeyedStream
Logically partitions a stream into disjoint partitions. All records with the same key are assigned to the same partition. Internally, keyBy() is implemented with hash partitioning. - Reduce
KeyedStream → DataStream
A “rolling” reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value. - Window
KeyedStream → WindowedStream
Windows can be defined on already partitioned KeyedStreams. Windows group the data in each key according to some characteristic (e.g., the data that arrived within the last 5 seconds). - WindowAll
DataStream → AllWindowedStream
Windows can be defined on regular DataStreams. Windows group all the stream events according to some characteristic (e.g., the data that arrived within the last 5 seconds). - Window Apply
WindowedStream → DataStream
AllWindowedStream → DataStream
Applies a general function to the window as a whole. Below is a function that manually sums the elements of a window. - WindowReduce
WindowedStream → DataStream
Applies a functional reduce function to the window and returns the reduced value. - Union
DataStream* → DataStream
Union of two or more data streams creating a new stream containing all the elements from all the streams. Note: If you union a data stream with itself you will get each element twice in the resulting stream. - Window Join
DataStream,DataStream → DataStream
Join two data streams on a given key and a common window. - Interval Join
KeyedStream,KeyedStream → DataStream
Join two elements e1 and e2 of two keyed streams with a common key over a given time interval. - Window CoGroup
DataStream,DataStream → DataStream
Cogroups two data streams on a given key and a common window. - Connect
DataStream,DataStream → ConnectedStream
“Connects” two data streams retaining their types. Connect allowing for shared state between the two streams. - CoMap, CoFlatMap
ConnectedStream → DataStream
Similar to map and flatMap on a connected data stream. - Cache
DataStream → CachedDataStream
Cache the intermediate result of the transformation. Currently, only jobs that run with batch execution mode are supported. The cache intermediate result is generated lazily at the first time the intermediate result is computed so that the result can be reused by later jobs. If the cache is lost, it will be recomputed using the original transformations. - Full Window Partition
DataStream → PartitionWindowedStream
Collects all records of each partition separately into a full window and processes them. The window emission will be triggered at the end of inputs. This approach is primarily applicable to batch processing scenarios. For non-keyed DataStream, a partition contains all records of a subtask. For KeyedStream, a partition contains all records of a key.
Physical Partitioning
Flink also gives low-level control (if desired) on the exact stream partitioning after a transformation, via the following functions.
- Custom Partitioning
DataStream → DataStream
Uses a user-defined Partitioner to select the target task for each element. - Random Partitioning
DataStream → DataStream
Partitions elements randomly according to a uniform distribution. - Rescaling
DataStream → DataStream
Partitions elements, round-robin, to a subset of downstream operations. - Broadcasting
DataStream → DataStream
Broadcasts elements to every partition.
Task Chaining and Resource Groups
Chaining two subsequent transformations means co-locating them within the same thread for better performance. Flink by default chains operators if this is possible (e.g., two subsequent map transformations).
- Start New Chain
Begin a new chain, starting with this operator. The two mappers will be chained, and filter will not be chained to the first mapper. - Disable Chaining
Do not chain the map operator. - Set Slot Sharing Group
Set the slot sharing group of an operation. Flink will put operations with the same slot sharing group into the same slot while keeping operations that don’t have the slot sharing group in other slots. This can be used to isolate slots. The slot sharing group is inherited from input operations if all input operations are in the same slot sharing group. The name of the default slot sharing group is “default”, operations can explicitly be put into this group by calling slotSharingGroup(“default”).
Name And Description
Operators and job vertices in flink have a name and a description. Both name and description are introduction about what an operator or a job vertex is doing, but they are used differently.
The name of operator and job vertex will be used in web ui, thread name, logging, metrics, etc. The name of a job vertex is constructed based on the name of operators in it. The name needs to be as concise as possible to avoid high pressure on external systems.
The description will be used in the execution plan and displayed as the details of a job vertex in web UI. The description of a job vertex is constructed based on the description of operators in it. The description can contain detail information about operators to facilitate debugging at runtime.
📄️ Windowing
Windows
📄️ Joining
Joining