Yes, an AWS Glue job can automatically detect the schema of CSV files in S3 using its built-in format inference. Here is sample code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

input_bucket = "my-input-bucket"
input_prefix = "input-folder/"
output_bucket = "my-output-bucket"
output_prefix = "output-folder/"

# Read the CSV files from S3; Glue infers the schema at read time
datasource0 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://{}/{}".format(input_bucket, input_prefix)]},
    format="csv",
    format_options={"separator": ",", "withHeader": True},
    transformation_ctx="datasource0"
)
# Convert to a Spark DataFrame and print the detected schema
dataframe = datasource0.toDF()
dataframe.printSchema()
## Sample output
# root
# |-- id: string (nullable = true)
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
# |-- city: string (nullable = true)
# Write the data back to S3 as CSV
datasink0 = glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(dataframe, glueContext, "dataframe"),
    connection_type="s3",
    connection_options={"path": "s3://{}/{}".format(output_bucket, output_prefix)},
    format="csv",
    transformation_ctx="datasink0"
)
job.commit()
The code above reads the CSV files under the specified S3 path and prints the resulting DataFrame schema. At this point you can inspect the column names and data types in the printed schema to verify that the detected schema matches your expectations.
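If a column is inferred with the wrong type, you can cast it explicitly before writing. Below is a minimal sketch using Glue's ApplyMapping transform; the column names and types follow the sample schema above and are only illustrative:

# Sketch: cast columns explicitly with ApplyMapping
# (each mapping is (source_name, source_type, target_name, target_type))
mapped = ApplyMapping.apply(
    frame=datasource0,
    mappings=[
        ("id", "string", "id", "int"),        # cast id from string to int
        ("name", "string", "name", "string"),
        ("age", "long", "age", "long"),
        ("city", "string", "city", "string"),
    ],
    transformation_ctx="mapped"
)
mapped.printSchema()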
Note that if your CSV files do not include a header row, you need to change "withHeader": True in the example above to "withHeader": False, as shown below.
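For reference, here is a sketch of the headerless variant, reusing the bucket/prefix variables from the example. Note that Glue assigns generated names to headerless CSV columns (e.g. col0, col1, ...), which you can rename afterwards:

# Sketch: reading CSV files that have no header row
datasource_noheader = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://{}/{}".format(input_bucket, input_prefix)]},
    format="csv",
    format_options={"separator": ",", "withHeader": False},
    transformation_ctx="datasource_noheader"
)
# Rename a generated column name to something meaningful
renamed = datasource_noheader.rename_field("col0", "id")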