I. To extract the title, date, and content of a PDF file with Python and store them in a MySQL database, follow these steps:
1. Install the required libraries: pdfminer.six (the maintained package that provides pdfminer.high_level), PyPDF2, and mysql-connector-python.
pip install pdfminer.six PyPDF2 mysql-connector-python
2. Import the required libraries and connect to the MySQL database.
import mysql.connector
from mysql.connector import Error
from mysql.connector import errorcode
import PyPDF2
from pdfminer.high_level import extract_text
try:
    connection = mysql.connector.connect(host='localhost', database='database_name', user='username', password='password')
    if connection.is_connected():
        cursor = connection.cursor()
        print("Connected to MySQL database")
except Error as e:
    print("Error while connecting to MySQL", e)
    raise SystemExit(1)  # stop here: the later steps need an open connection
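3. Open the PDF file and extract its title, date, and content.
pdf_file = open('file.pdf', 'rb')
pdf_reader = PyPDF2.PdfReader(pdf_file)  # PdfFileReader was removed in PyPDF2 3.0
title = pdf_reader.metadata.title
date = pdf_reader.metadata['/CreationDate']  # raw PDF date string such as "D:20220320120000"; convert it before storing in a DATE column
content = extract_text('file.pdf')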
4. Insert the extracted information into the MySQL database.
try:
    cursor.execute("INSERT INTO table_name (title, date, content) VALUES (%s, %s, %s)", (title, date, content))
    connection.commit()
    print("Record inserted successfully into MySQL database")
except mysql.connector.Error as error:
    print("Failed to insert record into MySQL database {}".format(error))
finally:
    if connection.is_connected():
        cursor.close()
        connection.close()
        print("MySQL connection is closed")
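Note that you need to replace database_name, username, password, and table_name with your own database details. Also make sure the PDF file is in the same directory as the Python script, or specify the full path to the file.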
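The /CreationDate value read in step 3 is a raw PDF date string of the form D:YYYYMMDDHHmmSS, optionally followed by a time zone, so it usually has to be converted before it fits a DATE column. The following is a minimal sketch of that conversion using only the standard library. The helper name parse_pdf_date is hypothetical and is not part of PyPDF2 or pdfminer; it assumes that a missing month or day should default to 01.

```python
import re
from datetime import date

# Optional "D:" prefix, a 4-digit year, then optional 2-digit month and day.
_PDF_DATE = re.compile(r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?")


def parse_pdf_date(raw):
    """Return a datetime.date for a PDF date string, or None if it cannot be parsed.

    Hypothetical helper for illustration; the time-of-day and time zone
    parts of the string are ignored.
    """
    if not raw:
        return None
    match = _PDF_DATE.match(str(raw).strip())
    if match is None:
        return None
    year, month, day = match.groups()
    try:
        # Assumption: a missing month or day falls back to 1.
        return date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        # For example month 13 or day 40.
        return None
```

With a helper like this, the value passed to the INSERT statement could be parse_pdf_date(date) instead of the raw string, and a None result would be stored as NULL.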
II. Worked example
1. Assumed text content
Title: Sample PDF Document
Date: 2022-03-20
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed at est at lectus viverra malesuada. Pellentesque fermentum dolor vel finibus consequat. Nulla facilisi.
2. Create a table to store the PDF data
CREATE TABLE pdf_data (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    date DATE,
    content TEXT
);
3. Write Python code that parses the PDF and stores the result in the database
import PyPDF2
from datetime import datetime
import mysql.connector

# Open the PDF file
pdf_file = open('sample.pdf', 'rb')

# Read the PDF metadata (PdfFileReader and getDocumentInfo were removed in PyPDF2 3.0)
pdf_reader = PyPDF2.PdfReader(pdf_file)
pdf_info = pdf_reader.metadata
title = pdf_info.title

# Read the PDF content page by page
content = ''
for page in pdf_reader.pages:
    content += page.extract_text()

# Format the date: the value looks like "D:YYYYMMDDHHmmSS", so [2:10] is YYYYMMDD
date_str = pdf_info.get('/CreationDate')[2:10]
date = datetime.strptime(date_str, '%Y%m%d').date()

# Store the data in the MySQL database
cnx = mysql.connector.connect(user='username', password='password', host='localhost', database='pdf_db')
cursor = cnx.cursor()
add_pdf = ("INSERT INTO pdf_data (title, date, content) VALUES (%s, %s, %s)")
pdf_data = (title, date, content)
cursor.execute(add_pdf, pdf_data)
cnx.commit()

# Close the database connection and PDF file
cursor.close()
cnx.close()
pdf_file.close()
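The script above closes the cursor only if every earlier statement succeeds. As a rough sketch, the insert step can be wrapped in a function that always closes the cursor and rolls back on failure. The name store_pdf_record and the INSERT_SQL constant are hypothetical; the function assumes a connection object that offers cursor(), commit(), and rollback(), as a mysql.connector connection does.

```python
from contextlib import closing

# Same statement as in the script; %s is the placeholder style used by mysql.connector.
INSERT_SQL = "INSERT INTO pdf_data (title, date, content) VALUES (%s, %s, %s)"


def store_pdf_record(connection, title, date, content):
    """Insert one row and commit; roll back and re-raise if the insert fails.

    Hypothetical helper for illustration. The caller keeps ownership of
    the connection and is responsible for closing it.
    """
    # closing() calls cursor.close() on exit, even when an exception is raised.
    with closing(connection.cursor()) as cursor:
        try:
            cursor.execute(INSERT_SQL, (title, date, content))
            connection.commit()
        except Exception:
            connection.rollback()
            raise
```

In the script, this would replace the cursor.execute and cnx.commit lines with a single call such as store_pdf_record(cnx, title, date, content).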
4. After the insert succeeds, query the database
SELECT * FROM pdf_data;
The result looks roughly like this:
| id | title | date | content |
|---|---|---|---|
| 1 | Sample PDF Document | 2022-03-20 | Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed at est at lectus viverra malesuada. Pellentesque fermentum dolor vel finibus consequat. Nulla facilisi. |
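The sample row fits the pdf_data table easily, but text extracted from a real PDF may not: title is VARCHAR(255), and a MySQL TEXT column holds at most 65,535 bytes. Depending on the server's SQL mode, an oversized value is either rejected or truncated. The sketch below trims both fields before the insert. The name fit_to_schema is hypothetical, and it assumes the column uses a UTF-8 character set, so that the byte length of the encoded string is what counts.

```python
MAX_TITLE_CHARS = 255    # title VARCHAR(255): limit counted in characters
MAX_TEXT_BYTES = 65535   # content TEXT: limit counted in bytes


def fit_to_schema(title, content):
    """Trim title and content so they fit the pdf_data columns.

    Hypothetical helper for illustration. None is treated as an empty
    string. Assumes the content column stores UTF-8 encoded text.
    """
    title = (title or "")[:MAX_TITLE_CHARS]
    # Cut on bytes, then drop any multi-byte character split by the cut.
    encoded = (content or "").encode("utf-8")[:MAX_TEXT_BYTES]
    content = encoded.decode("utf-8", errors="ignore")
    return title, content
```

If losing the tail of a long document is not acceptable, declaring the column as MEDIUMTEXT or LONGTEXT is the alternative to trimming.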