要显示BigQuery表中列中的所有重复项,您可以使用以下SQL查询语句:
SELECT column_name, COUNT(column_name) as count
FROM dataset.table
GROUP BY column_name
HAVING count > 1
其中,dataset
是您的数据集名称,table
是您的表名称,column_name
是您要查找重复项的列名称。
这个查询将对指定的列进行分组,并计算每个值的出现次数。然后,使用HAVING
子句过滤出现次数大于1的值,这意味着它们是重复的。
以下是完整的示例代码:
from google.cloud import bigquery
def find_duplicates(project_id, dataset_id, table_id, column_name):
client = bigquery.Client(project=project_id)
query = f"""
SELECT {column_name}, COUNT({column_name}) as count
FROM `{project_id}.{dataset_id}.{table_id}`
GROUP BY {column_name}
HAVING count > 1
"""
query_job = client.query(query)
results = query_job.result()
for row in results:
print(f"{row[column_name]}: {row['count']} occurrences")
# Example usage
project_id = "your-project-id"
dataset_id = "your-dataset-id"
table_id = "your-table-id"
column_name = "your-column-name"
find_duplicates(project_id, dataset_id, table_id, column_name)
确保替换示例代码中的project_id
,dataset_id
,table_id
和column_name
变量为您自己的值。
这个示例使用了Google Cloud的bigquery
库来连接到BigQuery并执行查询。然后,它遍历查询结果,并打印每个重复项及其出现次数。您可以根据需要修改代码以满足您的具体需求。