Yes, an AWS Glue job can automatically detect the schema of CSV files in S3 using its built-in format inference. Here is sample code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

input_bucket = "my-input-bucket"
input_prefix = "input-folder/"
output_bucket = "my-output-bucket"
output_prefix = "output-folder/"

# Read the CSV files from S3; Glue infers the schema at read time
datasource0 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://{}/{}".format(input_bucket, input_prefix)]},
    format="csv",
    format_options={"separator": ",", "withHeader": True},
    transformation_ctx="datasource0"
)
# Convert to a Spark DataFrame and print the detected schema
dataframe = datasource0.toDF()
dataframe.printSchema()
## Sample output
# root
# |-- id: string (nullable = true)
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
# |-- city: string (nullable = true)
# Write the data back to S3 as CSV
datasink0 = glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(dataframe, glueContext, "dataframe"),
    connection_type="s3",
    connection_options={"path": "s3://{}/{}".format(output_bucket, output_prefix)},
    format="csv",
    transformation_ctx="datasink0"
)
job.commit()
The code above reads the CSV files under the specified S3 path and prints the resulting DataFrame schema. At this point you can inspect the column names and data types in the printed schema to verify that the detected schema matches your expectations.
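If a column is inferred with the wrong type, you can cast it explicitly before writing. Below is a minimal sketch using Glue's ApplyMapping transform; the column names and types follow the sample schema above and are only illustrative:

# Sketch: cast columns explicitly with ApplyMapping
# (each mapping is (source_name, source_type, target_name, target_type))
mapped = ApplyMapping.apply(
    frame=datasource0,
    mappings=[
        ("id", "string", "id", "int"),        # cast id from string to int
        ("name", "string", "name", "string"),
        ("age", "long", "age", "long"),
        ("city", "string", "city", "string"),
    ],
    transformation_ctx="mapped"
)
mapped.printSchema()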
Note that if your CSV files do not include a header row, you need to change "withHeader": True in the example above to "withHeader": False, as shown below.
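For reference, here is a sketch of the headerless variant, reusing the bucket/prefix variables from the example. Note that Glue assigns generated names to headerless CSV columns (e.g. col0, col1, ...), which you can rename afterwards:

# Sketch: reading CSV files that have no header row
datasource_noheader = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://{}/{}".format(input_bucket, input_prefix)]},
    format="csv",
    format_options={"separator": ",", "withHeader": False},
    transformation_ctx="datasource_noheader"
)
# Rename a generated column name to something meaningful
renamed = datasource_noheader.rename_field("col0", "id")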