To compare two PySpark DataFrames and modify one of them, you can follow these steps:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data1 = [("Alice", 28), ("Bob", 35), ("Charlie", 42)]
data2 = [("Alice", 30), ("Bob", 35), ("Charlie", 40)]
df1 = spark.createDataFrame(data1, ["Name", "Age"])
df2 = spark.createDataFrame(data2, ["Name", "Age"])
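# PySpark's join adds no _x/_y suffixes, so rename the Age columns before joining.
df3 = df1.withColumnRenamed("Age", "Age_x").join(df2.withColumnRenamed("Age", "Age_y"), ["Name"], "inner")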
from pyspark.sql.functions import when
df3 = df3.withColumn("Age", when(df3.Age_x > df3.Age_y, df3.Age_x).otherwise(df3.Age_y))
In this example, the Age column of df3 is set to the larger of the Age values from df1 and df2.
df3.show()
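As a sanity check on the join-and-compare logic above, here is a minimal plain-Python sketch that runs without Spark. The helper name `max_age_by_name` is hypothetical and not part of PySpark, and the sketch assumes each name appears at most once per dataset.

```python
# Plain-Python mirror of the inner join on Name followed by taking the larger Age.
data1 = [("Alice", 28), ("Bob", 35), ("Charlie", 42)]
data2 = [("Alice", 30), ("Bob", 35), ("Charlie", 40)]


def max_age_by_name(rows_x, rows_y):
    """Inner-join two lists of (name, age) on name and keep the larger age.

    Hypothetical helper for illustration; assumes names are unique per list.
    """
    ages_y = dict(rows_y)  # name -> age lookup for the second dataset
    return [
        (name, max(age_x, ages_y[name]))
        for name, age_x in rows_x
        if name in ages_y  # inner join: keep only names present in both
    ]


expected = max_age_by_name(data1, data2)
print(expected)  # [('Alice', 30), ('Bob', 35), ('Charlie', 42)]
```

The Age column produced by the Spark code should match these values row by row. In PySpark itself, `greatest` from `pyspark.sql.functions` can replace the `when(...).otherwise(...)` expression when all you need is the larger of two columns.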
Complete code example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when
spark = SparkSession.builder.getOrCreate()
data1 = [("Alice", 28), ("Bob", 35), ("Charlie", 42)]
data2 = [("Alice", 30), ("Bob", 35), ("Charlie", 40)]
df1 = spark.createDataFrame(data1, ["Name", "Age"])
df2 = spark.createDataFrame(data2, ["Name", "Age"])
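# PySpark's join adds no _x/_y suffixes, so rename the Age columns before joining.
df3 = df1.withColumnRenamed("Age", "Age_x").join(df2.withColumnRenamed("Age", "Age_y"), ["Name"], "inner")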
df3 = df3.withColumn("Age", when(df3.Age_x > df3.Age_y, df3.Age_x).otherwise(df3.Age_y))
df3.show()
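This lets you compare two PySpark DataFrames and derive an updated one from the result. Because DataFrames are immutable, the change lives in the new DataFrame df3 rather than in df1 or df2.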