You can solve this by adding custom split logic to the job. Here is an example written in Python:
import sys
from pyspark.context import SparkContext
from pyspark.sql import Row
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

inputData = glueContext.create_dynamic_frame.from_catalog(database = "my_database", table_name = "my_table")

# Custom split logic: break each record's oversized string field into
# 1,000-character chunks, emitting one output row per chunk.
# Assumption: the large value lives in a string column named "payload";
# adjust the column name to match your schema.
def split_record(row):
    d = row.asDict()
    value = d.get("payload") or ""
    chunks = [value[i:i + 1000] for i in range(0, len(value), 1000)] or [""]
    return [Row(**{**d, "payload": chunk}) for chunk in chunks]

# Apply the split over the underlying Spark DataFrame, then convert back
# to a DynamicFrame for the Glue sink.
splitRDD = inputData.toDF().rdd.flatMap(split_record)
splitRecords = DynamicFrame.fromDF(glueContext.spark_session.createDataFrame(splitRDD),
                                   glueContext, "splitRecords")

# Write the resulting split records to S3 as CSV
glueContext.write_dynamic_frame.from_options(frame = splitRecords,
    connection_type = "s3",
    connection_options = {"path": "s3://my-bucket/my-prefix/"},
    format = "csv")
job.commit()
This code converts the DynamicFrame to a Spark DataFrame, applies a flatMap that splits each large record into smaller ones, converts the result back to a DynamicFrame, and then uses write_dynamic_frame.from_options() to write the split records to S3. In this example, each record's large field is split into 1,000-character chunks, and the split records are written out in CSV format.
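To see what the chunking expression itself does, here is a small standalone sketch in plain Python (the chunk size is reduced to 4 so the output is easy to read):

def chunk(value, size):
    # Split a string into fixed-size pieces; the last piece may be shorter.
    return [value[i:i + size] for i in range(0, len(value), size)]

print(chunk("abcdefghij", 4))  # ['abcd', 'efgh', 'ij']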
Note that this is only one way to implement custom splitting for AWS Glue in Python. The approach that works in practice may vary depending on data size, data structure, and other factors.
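For example, if the right chunk size depends on your data, you could pass it in as a job parameter instead of hardcoding 1,000. A minimal sketch, assuming a hypothetical CHUNK_SIZE job argument supplied when the job is started (e.g. --CHUNK_SIZE 1000):

import sys
from awsglue.utils import getResolvedOptions

# CHUNK_SIZE is a hypothetical extra argument, not part of the original example.
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'CHUNK_SIZE'])
chunk_size = int(args['CHUNK_SIZE'])

def split_value(value, size=chunk_size):
    # Same chunking rule as above, but with a configurable size.
    return [value[i:i + size] for i in range(0, len(value), size)] or [""]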