An AWS Glue DynamicFrame can split the data read from a single file into multiple partitions. The steps are as follows:
from awsglue.dynamicframe import DynamicFrame
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
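# Read the source file (assumes the CSV has a header row that includes the year, month and day columns)
source_path = 's3://mybucket/myfile.csv'
source_dyf = glueContext.create_dynamic_frame_from_options('s3', {'paths': [source_path]}, format='csv', format_options={'withHeader': True})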
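# Columns to partition by
partition_keys = ['year', 'month', 'day']
# Repartition by those columns. DynamicFrame.repartition() only accepts a partition count,
# so convert to a Spark DataFrame, repartition by column, then convert back.
num_partitions = 10
partitioned_dyf = DynamicFrame.fromDF(source_dyf.toDF().repartition(num_partitions, *partition_keys), glueContext, 'partitioned_dyf')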
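The data from the single file is now spread across multiple partitions. To see how many partitions it was split into, check the partition count of the underlying DataFrame:
partition_count = partitioned_dyf.toDF().rdd.getNumPartitions()
print(f'{partition_count} partitions created')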
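At this point, the AWS Glue DynamicFrame has partitioned the data from the single file. Note that this changes only the in-memory Spark partitioning; the source file on S3 is left unchanged.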
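For intuition about what repartitioning by key columns does, here is a minimal pure-Python sketch. It is not how Glue or Spark implement it: the helper names `assign_partition` and `repartition_rows` are hypothetical, the sample rows are made up, and CRC32 is only a stand-in for the hash function Spark uses internally. The idea it illustrates is that each row is hashed on its key column values and the hash, taken modulo the partition count, selects the partition.

```python
import zlib
from collections import defaultdict


def assign_partition(row, keys, num_partitions):
    """Map a row to a partition index based on the values of its key columns."""
    # Join the key values into one string and hash it with CRC32,
    # a stable hash used here only as a stand-in for Spark's internal hash.
    key_bytes = "|".join(str(row[k]) for k in keys).encode("utf-8")
    return zlib.crc32(key_bytes) % num_partitions


def repartition_rows(rows, keys, num_partitions):
    """Group rows by partition index; only non-empty partitions appear in the result."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[assign_partition(row, keys, num_partitions)].append(row)
    return dict(partitions)


# Illustrative rows (made up for this sketch)
rows = [
    {"year": 2023, "month": 1, "day": 5, "value": "a"},
    {"year": 2023, "month": 1, "day": 5, "value": "b"},
    {"year": 2023, "month": 2, "day": 9, "value": "c"},
]
keys = ["year", "month", "day"]
parts = repartition_rows(rows, keys, 10)
print(f"{len(parts)} non-empty partitions out of 10")
```

Rows that share the same key values always land in the same partition, so when there are few distinct key combinations, some of the requested partitions can remain empty.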