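In AWS Glue, null values in data read from a database can cause problems in later processing steps. To avoid this, the Glue script can define replacement values that are used wherever a field is null. For example: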
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
## Create a DynamicFrame using the 'persons' table
persons_dyf = glueContext.create_dynamic_frame.from_catalog(database="mydb", table_name="persons", transformation_ctx="persons_dyf")
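## Replacement values for null fields (example values; adjust them to your data)
default_values = {"name": "unknown", "age": 0, "dob": "unknown"}
## Keep the expected columns and types
persons_dyf_mapped = ApplyMapping.apply(frame = persons_dyf, mappings = [("id", "long", "id", "long"),
("name", "string", "name", "string"),
("age", "long", "age", "long"),
("dob", "string", "dob", "string")],
transformation_ctx = "persons_dyf_mapped")
## Replace nulls with the default values using the Spark DataFrame API
from awsglue.dynamicframe import DynamicFrame
persons_df_filled = persons_dyf_mapped.toDF().fillna(default_values)
persons_dyf_replaced = DynamicFrame.fromDF(persons_df_filled, glueContext, "persons_dyf_replaced")
## Select only those persons who have an age greater than 30
persons_dyf_filtered = Filter.apply(frame = persons_dyf_replaced, f = lambda x: x["age"] > 30)
In the example above, default_values maps each column to the value that should stand in for null. Setting a default to None would replace nothing, and ApplyMapping on its own only renames and casts columns, so the nulls are filled explicitly with fillna after the mapping step. Later operations, such as the age filter, then never see a null value; comparing None with 30 would otherwise raise an error in Python 3.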
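
The idea of substituting defaults for nulls can also be sketched in plain Python, without Glue or Spark, which makes the per-record logic easy to check locally. This is a minimal sketch: the name fill_defaults, the DEFAULT_VALUES mapping, and the sample rows are all hypothetical and not part of any Glue API. On a Spark DataFrame, fillna with a dictionary of column defaults does the equivalent work at scale.

```python
# Hypothetical defaults; choose values that make sense for your data.
DEFAULT_VALUES = {"name": "unknown", "age": 0, "dob": "unknown"}


def fill_defaults(record, defaults=DEFAULT_VALUES):
    """Return a copy of record with missing or None fields set to defaults."""
    filled = dict(record)  # copy so the input record is left unchanged
    for field, default in defaults.items():
        if filled.get(field) is None:
            filled[field] = default
    return filled


# Sample rows standing in for records read from the 'persons' table.
rows = [
    {"id": 1, "name": "Ann", "age": 42, "dob": "1982-03-01"},
    {"id": 2, "name": None, "age": None, "dob": None},
]

filled_rows = [fill_defaults(r) for r in rows]

# With nulls replaced, the age comparison is safe for every record.
older_than_30 = [r for r in filled_rows if r["age"] > 30]
```

Note that a default of 0 for age means records with an unknown age are excluded by the "older than 30" filter. If unknown ages should be handled differently, filter them out explicitly before comparing rather than relying on the default.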