Skip to main content

Operators

Operators
Operators transform one or more DataStreams into a new DataStream.

DataStream Transformations

  • Map
    DataStream → DataStream
    Takes one element and produces one element.
  • FlatMap
    DataStream → DataStream
    Takes one element and produces zero, one, or more elements.
  • Filter
    DataStream → DataStream
    Evaluates a boolean function for each element and retains those for which the function returns true.
  • KeyBy
    DataStream → KeyedStream
    Logically partitions a stream into disjoint partitions. All records with the same key are assigned to the same partition. Internally, keyBy() is implemented with hash partitioning.
  • Reduce
    KeyedStream → DataStream
    A “rolling” reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value.
  • Window
    KeyedStream → WindowedStream
    Windows can be defined on already partitioned KeyedStreams. Windows group the data in each key according to some characteristic (e.g., the data that arrived within the last 5 seconds).
  • WindowAll
    DataStream → AllWindowedStream
    Windows can be defined on regular DataStreams. Windows group all the stream events according to some characteristic (e.g., the data that arrived within the last 5 seconds).
  • Window Apply
    WindowedStream → DataStream
    AllWindowedStream → DataStream
    Applies a general function to the window as a whole. Below is a function that manually sums the elements of a window.
  • WindowReduce
    WindowedStream → DataStream
    Applies a functional reduce function to the window and returns the reduced value.
  • Union
    DataStream* → DataStream
    Union of two or more data streams creating a new stream containing all the elements from all the streams. Note: If you union a data stream with itself you will get each element twice in the resulting stream.
  • Window Join
    DataStream,DataStream → DataStream
    Join two data streams on a given key and a common window.
  • Interval Join
    KeyedStream,KeyedStream → DataStream
    Join two elements e1 and e2 of two keyed streams with a common key over a given time interval.
  • Window CoGroup
    DataStream,DataStream → DataStream
    Cogroups two data streams on a given key and a common window.
  • Connect
    DataStream,DataStream → ConnectedStream
    “Connects” two data streams retaining their types. Connect allowing for shared state between the two streams.
  • CoMap, CoFlatMap
    ConnectedStream → DataStream
    Similar to map and flatMap on a connected data stream.
  • Cache
    DataStream → CachedDataStream
    Cache the intermediate result of the transformation. Currently, only jobs that run with batch execution mode are supported. The cache intermediate result is generated lazily at the first time the intermediate result is computed so that the result can be reused by later jobs. If the cache is lost, it will be recomputed using the original transformations.
  • Full Window Partition
    DataStream → PartitionWindowedStream
    Collects all records of each partition separately into a full window and processes them. The window emission will be triggered at the end of inputs. This approach is primarily applicable to batch processing scenarios. For non-keyed DataStream, a partition contains all records of a subtask. For KeyedStream, a partition contains all records of a key.

Physical Partitioning

Flink also gives low-level control (if desired) on the exact stream partitioning after a transformation, via the following functions.

  • Custom Partitioning
    DataStream → DataStream
    Uses a user-defined Partitioner to select the target task for each element.
  • Random Partitioning
    DataStream → DataStream
    Partitions elements randomly according to a uniform distribution.
  • Rescaling
    DataStream → DataStream
    Partitions elements, round-robin, to a subset of downstream operations.
  • Broadcasting
    DataStream → DataStream
    Broadcasts elements to every partition.

Task Chaining and Resource Groups

Chaining two subsequent transformations means co-locating them within the same thread for better performance. Flink by default chains operators if this is possible (e.g., two subsequent map transformations).

  • Start New Chain
    Begin a new chain, starting with this operator. The two mappers will be chained, and filter will not be chained to the first mapper.
  • Disable Chaining
    Do not chain the map operator.
  • Set Slot Sharing Group
    Set the slot sharing group of an operation. Flink will put operations with the same slot sharing group into the same slot while keeping operations that don’t have the slot sharing group in other slots. This can be used to isolate slots. The slot sharing group is inherited from input operations if all input operations are in the same slot sharing group. The name of the default slot sharing group is “default”, operations can explicitly be put into this group by calling slotSharingGroup(“default”).

Name And Description

Operators and job vertices in flink have a name and a description. Both name and description are introduction about what an operator or a job vertex is doing, but they are used differently.

The name of operator and job vertex will be used in web ui, thread name, logging, metrics, etc. The name of a job vertex is constructed based on the name of operators in it. The name needs to be as concise as possible to avoid high pressure on external systems.

The description will be used in the execution plan and displayed as the details of a job vertex in web UI. The description of a job vertex is constructed based on the description of operators in it. The description can contain detail information about operators to facilitate debugging at runtime.