Chapter 4: Data Parsing with Python

In this chapter, we will cover the following recipes:

Parsing HTML tables
Extracting data from HTML documents
Parsing XML data

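Introduction

Having downloaded web pages in the previous recipes, we can now look at how to process those files and parse them to get the information we need.

Parsing HTML tables

After downloading an HTML page from a server, we have to extract the data we need from it. Python has many modules that can help with this. Here, we use the Python package BeautifulSoup.

Getting ready

As always, make sure that all the required packages are installed. For this script, we need BeautifulSoup and pandas. You can install them with pip: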

pip install bs4 
pip install pandas

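pandas is an open source data analysis library for Python.

How to do it...

We can parse an HTML table from the downloaded page as follows:

As always, we first import the modules the script needs. Here, we import BeautifulSoup to parse the HTML and pandas to handle the parsed data. We also import the urllib2 module (this first version of the script targets Python 2) to fetch the web page from the server: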

import urllib2 
import pandas as pd 
from bs4 import BeautifulSoup

Now we can fetch the HTML page from the server; for this, we use the urllib2 module:

url = "https://www.w3schools.com/html/html_tables.asp" 
try:
    page = urllib2.urlopen(url)
except Exception as e:
    # Without a page there is nothing to parse, so report the error and stop
    print e
    raise SystemExit(1)

Next, we use BeautifulSoup to parse the HTML and get the table from it:

soup = BeautifulSoup(page, "html.parser") 
table = soup.find_all('table')[0]

This gets the first table on the web page.

Now we can use the pandas library to create a DataFrame for the table:

new_table = pd.DataFrame(columns=['Company', 'Contact', 'Country'], index=range(0, 7))

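This creates a DataFrame with three columns and seven rows (index 0 to 6), sized for the header row and six data rows of the sample table. The columns hold the company name, the contact, and the country. If the header row contains th cells rather than td cells, its row in the DataFrame stays empty.

Now we have to parse the data and add it to the DataFrame: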

row_number = 0
for row in table.find_all('tr'):
    column_number = 0
    columns = row.find_all('td')
    for column in columns:
        # Store the cell text at position (row_number, column_number)
        new_table.iat[row_number, column_number] = column.get_text()
        column_number += 1
    row_number += 1
print new_table

This prints the DataFrame.
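
Preallocating the DataFrame works when the size of the table is known in advance. As a minimal sketch of an alternative, the rows can be collected in a list first and the DataFrame built in one step. The `SAMPLE_HTML` string and the `table_to_dataframe` helper are hypothetical names invented for this illustration, and the inline markup stands in for the downloaded page so the example runs without a network connection:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
  <tr><td>Acme Corp</td><td>Jane Doe</td><td>Germany</td></tr>
  <tr><td>Globex</td><td>John Roe</td><td>UK</td></tr>
</table>
"""


def table_to_dataframe(html):
    """Return the first table in html as a DataFrame (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    # Take the column names from the <th> cells of the header row.
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so skip it
            rows.append(cells)
    return pd.DataFrame(rows, columns=headers)


df = table_to_dataframe(SAMPLE_HTML)
print(df)
```

Because the rows are gathered before the DataFrame is created, this version does not depend on a fixed row count and leaves no empty row for the header.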

A DataFrame is a two-dimensional, labeled data structure with columns of potentially different types. It behaves much like a dict of Series objects.
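
The "dict of Series" description can be made concrete with a small sketch. The column names and values here are made up for illustration:

```python
import pandas as pd

# Each column is its own Series and may hold a different type.
df = pd.DataFrame({
    "Company": pd.Series(["Acme Corp", "Globex"]),
    "Employees": pd.Series([120, 45]),
})

print(df.dtypes)            # one dtype per column
print(type(df["Company"]))  # selecting a column returns a Series
```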

This script can be run in Python 3 with a few changes, as follows:

import urllib.request 
import pandas as pd 
from bs4 import BeautifulSoup  
url = "https://www.w3schools.com/html/html_tables.asp" 
try:
    page = urllib.request.urlopen(url)
except Exception as e:
    # Without a page there is nothing to parse, so report the error and stop
    print(e)
    raise SystemExit(1)
soup = BeautifulSoup(page, "html.parser")  
table = soup.find_all('table')[0]  
new_table = pd.DataFrame(columns=['Company', 'Contact', 'Country'], index=range(0, 7))  
row_number = 0 
for row in table.find_all('tr'): 
    column_number = 0 
    columns = row.find_all('td') 
    for column in columns: 
        new_table.iat[row_number, column_number] = column.get_text() 
        column_number += 1 
    row_number += 1  
print(new_table)

The main changes are to the urllib module and the print statement.
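
A failed download leaves nothing to parse, so it can help to handle that case in one place. The following is a minimal Python 3 sketch, assuming a hypothetical helper named `fetch` that is not part of the original script. It returns the page body as bytes, or None when the request fails. The demonstration uses a data: URL and a missing local file so that it runs offline:

```python
import urllib.request


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError as e:  # URLError, HTTPError and socket timeouts are all OSError
        print("Could not fetch", url, "-", e)
        return None


# Offline demonstration: a data: URL succeeds, a missing local file fails.
print(fetch("data:text/plain,hello"))
print(fetch("file:///no/such/file"))
```

In the recipe, the result of `fetch(url)` could be checked for None and then passed to BeautifulSoup, which accepts bytes as well as an open response object.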

You can learn more about the pandas data analysis toolkit at pandas.pydata.org/pandas-docs/stable/.

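Extracting data from HTML documents

We can use the pandas library to export the parsed data to .csv or Excel format.

Getting ready

To use the pandas functions that export parsed data to Excel, we need another dependency, the openpyxl module, so make sure you have installed openpyxl with pip: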

pip install openpyxl

How to do it...

We can export the data from HTML to a .csv or Excel document as follows:

To create a .csv file, we can use the to_csv() method in pandas. We can rewrite the previous recipe as follows:

import urllib.request 
import pandas as pd 
from bs4 import BeautifulSoup  
url = "https://www.w3schools.com/html/html_tables.asp" 
try:
    page = urllib.request.urlopen(url)
except Exception as e:
    # Without a page there is nothing to parse, so report the error and stop
    print(e)
    raise SystemExit(1)
soup = BeautifulSoup(page, "html.parser")  
table = soup.find_all('table')[0]  
new_table = pd.DataFrame(columns=['Company', 'Contact', 'Country'], index=range(0, 7))  
row_number = 0 
for row in table.find_all('tr'): 
    column_number = 0 
    columns = row.find_all('td') 
    for column in columns: 
        new_table.iat[row_number, column_number] = column.get_text() 
        column_number += 1 
    row_number += 1  
new_table.to_csv('table.csv')

This creates a .csv file named table.csv.

Similarly, we can export the data to Excel with the to_excel() method.

Change the last line of the previous script to the following:

new_table.to_excel('table.xlsx')
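
By default, both to_csv() and to_excel() also write the DataFrame index as an extra first column. The following minimal sketch, which uses made-up sample rows in place of the parsed table, passes index=False to leave it out and reads the CSV text back to check the result:

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the parsed table.
df = pd.DataFrame(
    [["Acme Corp", "Jane Doe", "Germany"], ["Globex", "John Roe", "UK"]],
    columns=["Company", "Contact", "Country"],
)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)

# Reading the text back yields the same columns and rows.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.shape)
```

The same index=False argument can be passed to to_excel() when the row numbers are not wanted in the spreadsheet.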

Parsing XML data

Sometimes we get an XML response from the server and need to parse the XML to extract the data. We can use the xml.etree.ElementTree module to parse XML files.

Getting ready

No installation is needed for this recipe: xml.etree.ElementTree is part of the Python standard library.

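How to do it...

Here is how we can parse XML data with the xml module:

First, import the required modules. As this script is written for Python 3, make sure that you import the correct modules: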

from urllib.request import urlopen 
from xml.etree.ElementTree import parse

Now fetch the XML file with the urlopen method from the urllib.request module:

url = urlopen('http://feeds.feedburner.com/TechCrunch/Google')

Now parse the XML file with the parse method from the xml.etree.ElementTree module:

xmldoc = parse(url)

Now iterate over the items and print the details from the XML:

for item in xmldoc.iterfind('channel/item'): 
    title = item.findtext('title') 
    desc = item.findtext('description') 
    date = item.findtext('pubDate') 
    link = item.findtext('link')  
    print(title) 
    print(desc) 
    print(date) 
    print(link) 
    print('---------')
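
The path 'channel/item' is resolved relative to the root element of the document, which for an RSS feed is the rss element. The following minimal sketch shows the same lookup on a small inline document. The `SAMPLE_RSS` string is a hypothetical stand-in for the downloaded feed, and fromstring is used in place of parse because the input is a string rather than a file object:

```python
from xml.etree.ElementTree import fromstring

# Hypothetical, minimal RSS 2.0 document standing in for the downloaded feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
      <pubDate>Mon, 01 Jan 2018 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

root = fromstring(SAMPLE_RSS)  # root is the <rss> element
items = []
for item in root.iterfind("channel/item"):
    items.append({
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        # findtext returns the default when the child element is missing
        "date": item.findtext("pubDate", default="(no date)"),
    })

for entry in items:
    print(entry)
```

Without a default, findtext returns None for a missing element, so feeds with optional fields are worth checking before the values are used.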

This script can be rewritten to run in Python 2 as follows:

from urllib2 import urlopen
from xml.etree.ElementTree import parse
url = urlopen('http://feeds.feedburner.com/TechCrunch/Google')
xmldoc = parse(url)
# Also save a copy of the parsed feed to output.xml
xmldoc.write('output.xml')
for item in xmldoc.iterfind('channel/item'):
    title = item.findtext('title')
    desc = item.findtext('description')
    date = item.findtext('pubDate')
    link = item.findtext('link')
    print title
    print desc
    print date
    print link
    print '---------'

This can also be exported to Excel or .csv, as we did in the previous recipe.
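
As a minimal sketch of such an export, the items can be collected into a list of dicts, turned into a DataFrame, and written with to_csv(). The `SAMPLE_RSS` string, the column selection, and the output path in the temporary directory are assumptions made for this illustration:

```python
import os
import tempfile
from xml.etree.ElementTree import fromstring

import pandas as pd

# Hypothetical inline feed standing in for the downloaded XML.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

root = fromstring(SAMPLE_RSS)
# One dict per <item>, keyed by the column name it should end up in.
records = [
    {"title": item.findtext("title"), "link": item.findtext("link")}
    for item in root.iterfind("channel/item")
]
feed_table = pd.DataFrame(records, columns=["title", "link"])

# Write to the system temporary directory to avoid cluttering the working directory.
out_path = os.path.join(tempfile.gettempdir(), "feed.csv")
feed_table.to_csv(out_path, index=False)
print("Wrote", len(feed_table), "rows to", out_path)
```

Replacing the to_csv() call with to_excel() and an .xlsx path would produce an Excel file instead, provided openpyxl is installed.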
