Yes, AWS Textract can be used to capture a specific portion of a PDF's text. Below is example Python code using AWS Textract:
import boto3
# AWS Textract client
textract = boto3.client('textract')
# S3 bucket name and file name
s3_bucket_name = 'my-s3-bucket'
document_name = 'sample.pdf'
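# Define the bounding box of the specific portion of the document you want to extract.
# Textract reports geometry as ratios of the page width and height (0.0 to 1.0), not pixels.
# Left/Top locate the top-left corner of the box; the values below are illustrative.
bounding_box = {
    'Top': 0.4,
    'Left': 0.1,
    'Width': 0.3,
    'Height': 0.1
}
# Set up the analysis features
feature_types = ['TABLES', 'FORMS']
# Call AWS Textract to analyze the document.
# analyze_document takes no region or block-type arguments, so the whole page is analyzed
# and the region is applied to the returned blocks afterwards.
# The synchronous call handles single-page PDFs; multi-page PDFs need start_document_analysis.
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3_bucket_name,
            'Name': document_name,
        }
    },
    FeatureTypes=feature_types
)
def in_region(box, region):
    """Return True if box lies entirely inside region (both expressed as page ratios)."""
    return (box['Left'] >= region['Left']
            and box['Top'] >= region['Top']
            and box['Left'] + box['Width'] <= region['Left'] + region['Width']
            and box['Top'] + box['Height'] <= region['Top'] + region['Height'])
# Extract the text from the specific region
text = ''
for block in response['Blocks']:
    if block['BlockType'] == 'LINE' and in_region(block['Geometry']['BoundingBox'], bounding_box):
        text += block['Text'] + '\n'
print(text)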
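In this example, we use a file named "sample.pdf" and define the specific portion by the position and size we need. Running the script extracts the text from that location in the PDF document and prints it to the console.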
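Textract reports each block's position in `Geometry.BoundingBox` as ratios of the page width and height (0 to 1), so a region measured in pixels has to be normalized before it can be compared with the returned blocks. The following is a minimal sketch of that step which runs without AWS access. It assumes a page size of 612 x 792 (US Letter at 72 DPI) and uses hand-written sample blocks in place of a real response; `to_ratio_box` and `lines_in_region` are hypothetical helper names, not part of boto3.

```python
def to_ratio_box(left_px, top_px, width_px, height_px, page_w, page_h):
    """Convert a pixel rectangle to Textract-style page ratios (0.0 to 1.0)."""
    return {
        'Left': left_px / page_w,
        'Top': top_px / page_h,
        'Width': width_px / page_w,
        'Height': height_px / page_h,
    }


def lines_in_region(blocks, region):
    """Return the text of LINE blocks whose center point falls inside region."""
    found = []
    for block in blocks:
        if block.get('BlockType') != 'LINE':
            continue
        box = block['Geometry']['BoundingBox']
        center_x = box['Left'] + box['Width'] / 2
        center_y = box['Top'] + box['Height'] / 2
        if (region['Left'] <= center_x <= region['Left'] + region['Width']
                and region['Top'] <= center_y <= region['Top'] + region['Height']):
            found.append(block['Text'])
    return found


# Hand-written sample blocks (not real Textract output), shaped like response['Blocks'].
sample_blocks = [
    {'BlockType': 'LINE', 'Text': 'Invoice total: 42.00',
     'Geometry': {'BoundingBox': {'Left': 0.20, 'Top': 0.52, 'Width': 0.30, 'Height': 0.03}}},
    {'BlockType': 'LINE', 'Text': 'Page header',
     'Geometry': {'BoundingBox': {'Left': 0.10, 'Top': 0.05, 'Width': 0.40, 'Height': 0.03}}},
    {'BlockType': 'WORD', 'Text': 'Invoice',
     'Geometry': {'BoundingBox': {'Left': 0.20, 'Top': 0.52, 'Width': 0.08, 'Height': 0.03}}},
]

# Assumed page size: 612 x 792 (US Letter at 72 DPI).
region = to_ratio_box(100, 400, 300, 100, page_w=612, page_h=792)
print(lines_in_region(sample_blocks, region))  # ['Invoice total: 42.00']
```

Testing the center point rather than full containment is a design choice: it keeps lines that slightly overhang the region's edge, at the cost of occasionally including a line that is only half inside.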