Dataflow Model Concepts

Post author: xfxia
Post published: August 23, 2023
Post category: Other

This article is translated from: https://cloud.google.com/dataflow/model/programming-model

Overview

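The Dataflow model is designed to simplify large-scale data processing. It lets developers focus on the logic of their data processing rather than on the details of parallel execution.

The Dataflow model provides a set of useful abstractions that encapsulate the low-level details of parallel computation.

Dataflow has four main concepts: Pipelines, PCollections, Transforms, and I/O Sources and Sinks.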

Pipelines

A pipeline encapsulates an entire series of computations: it accepts input data from external sources, transforms that data to provide some useful intelligence, and produces output data.
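To make the read, transform, output shape concrete, here is a minimal sketch in plain Python. It is only a toy illustration and does not use the Dataflow SDK: `ToyPipeline`, `apply`, and `run` are hypothetical names invented for this example, and the steps run sequentially rather than in parallel.

```python
from typing import Callable, Iterable, List


class ToyPipeline:
    """Hypothetical stand-in for a pipeline: an ordered series of steps."""

    def __init__(self) -> None:
        self.steps: List[Callable[[Iterable], Iterable]] = []

    def apply(self, step: Callable[[Iterable], Iterable]) -> "ToyPipeline":
        # Register a step and return self so that calls can be chained.
        self.steps.append(step)
        return self

    def run(self, source: Iterable) -> list:
        # Feed the input through every step in order and collect the output.
        data = source
        for step in self.steps:
            data = step(data)
        return list(data)


pipeline = (
    ToyPipeline()
    .apply(lambda lines: (line.strip() for line in lines))           # clean up
    .apply(lambda lines: (line.upper() for line in lines if line))   # transform
)

# The list stands in for data read from an external source.
result = pipeline.run(["  hello ", "", "dataflow"])
print(result)
```

The point of the sketch is only that the whole computation, from input to output, lives inside one pipeline object; a real runner would decide how to distribute the work.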

PCollections

A PCollection represents a set of data in your pipeline.

PCollections are the inputs and outputs of each step in the pipeline.
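As a rough illustration of data flowing between steps, the following plain-Python sketch models a collection as a small wrapper around a tuple. `ToyCollection` is a hypothetical name used only here; it is not the SDK's PCollection class and has none of its distributed behavior.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyCollection:
    """Hypothetical stand-in for a PCollection: a fixed set of elements."""

    elements: Tuple


# Output of one step...
words = ToyCollection(("a", "bb", "ccc"))

# ...becomes the input of the next step, which produces a new collection.
lengths = ToyCollection(tuple(len(w) for w in words.elements))

print(lengths.elements)
```

Each step reads one collection and produces another, so the original `words` collection is left unchanged.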

Transforms

A transform is a data processing operation, or a step, in your pipeline.

A transform takes one or more PCollections as input, applies a processing function to the elements of those PCollections, and produces an output PCollection.
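The "one or more inputs, one processing function, one output" shape can be sketched in plain Python as follows. `toy_transform` is a hypothetical helper written for this example, with lists standing in for PCollections; it is not a Dataflow SDK call.

```python
from itertools import chain
from typing import Callable, Iterable, List


def toy_transform(fn: Callable, *inputs: Iterable) -> List:
    """Apply fn to every element of every input and return one output list."""
    return [fn(element) for element in chain.from_iterable(inputs)]


evens = [2, 4]
odds = [1, 3]

# Two input collections, one per-element function, one output collection.
squares = toy_transform(lambda x: x * x, evens, odds)
print(squares)
```

Here the processing function sees one element at a time, which is what makes it possible for a real runner to spread the elements across many workers.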

I/O Sources and Sinks

The Dataflow SDK provides data source and data sink APIs for pipeline input and output.

You use the source APIs to read data into your pipeline, and the sink APIs to write output data from your pipeline.
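The division of labor between a source and a sink can be sketched in plain Python. `toy_source` and `toy_sink` are hypothetical helpers made up for this example, and in-memory text streams stand in for external storage; the real source and sink APIs are not shown here.

```python
import io
from typing import Iterable, List, TextIO


def toy_source(stream: TextIO) -> List[str]:
    """Read one element per line from an external text stream."""
    return [line.rstrip("\n") for line in stream]


def toy_sink(elements: Iterable[str], stream: TextIO) -> None:
    """Write each element as one line to an external text stream."""
    for element in elements:
        stream.write(f"{element}\n")


input_stream = io.StringIO("apple\nbanana\n")   # stands in for input storage
output_stream = io.StringIO()                   # stands in for output storage

records = toy_source(input_stream)                      # read into the pipeline
toy_sink([r.upper() for r in records], output_stream)   # write the results out

written = output_stream.getvalue()
print(written)
```

Keeping reading and writing behind separate functions means the processing logic in the middle does not need to know where the data came from or where it is going.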

Original link: https://blog.csdn.net/ghalcyon/article/details/82109306
