In an AWS Glue ETL job, the choice between job bookmarks and overwriting depends on the characteristics of the source data and the requirements of the job.
For append-only sources, such as log files or Kinesis data streams, job bookmarks are the better choice: they ensure that each job run processes only the data that has arrived since the previous run. Here is a code example using bookmarks:
import sys
from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
# create a Glue context and initialize the job (required for bookmarks to work)
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# read in the data; transformation_ctx lets the bookmark track this source,
# and jobBookmarkKeys tells Glue which columns identify new rows
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mysourcetable", transformation_ctx = "datasource0", additional_options = {"jobBookmarkKeys": ["id", "timestamp"], "jobBookmarkKeysSortOrder": "asc"})
# apply your transformations
transformed = datasource0  # placeholder: replace with your transformation logic
# write the data; transformation_ctx lets the bookmark track this sink as well
glueContext.write_dynamic_frame.from_options(frame = transformed, connection_type = "s3", connection_options = {"path": "s3://mybucket/myoutputpath/", "partitionKeys": ["year", "month", "day", "hour"]}, format = "parquet", transformation_ctx = "datasink")
# commit the job so the bookmark state is saved for the next run
job.commit()
For sources where existing data can change and the job should rebuild its output, overwriting is the better choice: it ensures that each job run processes the entire dataset and replaces the previous output. Here is a code example using overwrite:
# read in the full dataset; omitting transformation_ctx means no bookmark
# state is recorded, so every run sees the whole table
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mysourcetable")
# apply your transformations
transformed = datasource0  # placeholder: replace with your transformation logic
# write in overwrite mode via the Spark DataFrame API, replacing the previous output
transformed.toDF().write.mode("overwrite").partitionBy("year", "month", "day", "hour").parquet("s3://mybucket/myoutputpath/")
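If you prefer to keep the write in the DynamicFrame API, an alternative is to clear the target prefix first and then write as usual. A minimal sketch using glueContext.purge_s3_path, assuming the same bucket and path as above (retentionPeriod is in hours; 0 removes all existing files):
# delete everything currently under the output prefix
glueContext.purge_s3_path("s3://mybucket/myoutputpath/", options={"retentionPeriod": 0})
# then write the full dataset with write_dynamic_frame as in the bookmark example, minus the transformation_ctx
glueContext.write_dynamic_frame.from_options(frame = transformed, connection_type = "s3", connection_options = {"path": "s3://mybucket/myoutputpath/", "partitionKeys": ["year", "month", "day", "hour"]}, format = "parquet")
Note that purging before the write means a failed run can leave the output path empty, so this pattern is best paired with retries or a staging location.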