Should You Use XPath or Regular Expressions for Data Extraction?

Author: 魁首哥

This article looks at whether XPath or regular expressions are the better choice for data extraction. It is meant as a practical reference, so let's walk through it.

XPath and regular expressions are the two most commonly used methods for data extraction. Which one is actually better?

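The test code is shown below. It works on a single HTML document and extracts the same data 100 times with each of three methods: the xpath module of the webscraping library, XPath in the lxml library, and regular expressions. It then reports the time each method takes: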


# coding: utf-8
# xpath_speed_test.py
import re
import time

from lxml import etree
from webscraping import common, download, xpath

TEST_TIMES = 100


def test():
    url = 'http://hotels.ctrip.com/international/washington26363'
    html = download.Download().get(url)
    html = common.to_unicode(html)

    # Time XPath extraction with the webscraping library
    start_time = time.time()
    for i in range(TEST_TIMES):
        for hid, hprice in zip(xpath.search(html, '//div[@class="hlist_item"]/@id'),
                               xpath.search(html, '//div[@class="hlist_item_price"]/span')):
            # print(hid, hprice)
            pass
    end_time = time.time()
    webscraping_xpath_time_used = end_time - start_time
    print('"webscraping.xpath" time used: {} seconds.'.format(webscraping_xpath_time_used))

    # Time XPath extraction with the lxml library
    start_time = time.time()
    for i in range(TEST_TIMES):
        root = etree.HTML(html)
        for hlist_div in root.xpath('//div[@class="hlist_item"]'):
            hid = hlist_div.get('id')
            # Take the first matching price node inside this hotel item
            hprice = hlist_div.xpath('.//div[@class="hlist_item_price"]/span')[0].text
            # print(hid, hprice)
            pass
    end_time = time.time()
    lxml_time_used = end_time - start_time
    print('"lxml" time used: {} seconds.'.format(lxml_time_used))

    # Time extraction with regular expressions
    start_time = time.time()
    for i in range(TEST_TIMES):
        for hid, hprice in zip(re.compile(r'class="hlist_item" id="(\d+)"').findall(html),
                               re.compile(r'¥([\d\.]+)').findall(html)):
            # print(hid, hprice)
            pass
    end_time = time.time()
    re_time_used = end_time - start_time
    print('"re" time used: {} seconds.'.format(re_time_used))


if __name__ == '__main__':
    test()
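
The test above needs a live page and two third-party libraries, so it can be hard to reproduce. As a rough, offline illustration of the same comparison, here is a minimal Python 3 sketch. It rests on several assumptions that are not part of the original test: SAMPLE_HTML is a made-up, well-formed snippet that only imitates the class names used above; the standard library's xml.etree.ElementTree (which supports a limited XPath subset) stands in for lxml and webscraping; and the helper names extract_with_xpath, extract_with_regex and time_extractors are hypothetical. Timings from it say nothing about the libraries tested above.

```python
import re
import timeit
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample that imitates the class names used in the
# test above. It is NOT the real page.
SAMPLE_HTML = (
    '<html><body>'
    '<div class="hlist_item" id="1001">'
    '<div class="hlist_item_price"><span>¥320.5</span></div>'
    '</div>'
    '<div class="hlist_item" id="1002">'
    '<div class="hlist_item_price"><span>¥468</span></div>'
    '</div>'
    '</body></html>'
)

# Patterns are compiled once, outside the timed code.
ID_RE = re.compile(r'class="hlist_item" id="(\d+)"')
PRICE_RE = re.compile(r'¥([\d.]+)')


def extract_with_xpath(html):
    """Parse the document, then pull (id, price) pairs with XPath-style queries."""
    root = ET.fromstring(html)
    results = []
    for item in root.findall('.//div[@class="hlist_item"]'):
        span = item.find('.//div[@class="hlist_item_price"]/span')
        # Strip the currency sign so both extractors return the same values.
        results.append((item.get('id'), span.text.lstrip('¥')))
    return results


def extract_with_regex(html):
    """Pull (id, price) pairs straight from the raw text with two patterns."""
    return list(zip(ID_RE.findall(html), PRICE_RE.findall(html)))


def time_extractors(html, repeat=100):
    """Return elapsed seconds for `repeat` runs of each extractor."""
    return {
        'xpath': timeit.timeit(lambda: extract_with_xpath(html), number=repeat),
        'regex': timeit.timeit(lambda: extract_with_regex(html), number=repeat),
    }


if __name__ == '__main__':
    for name, seconds in time_extractors(SAMPLE_HTML).items():
        print('"{}" time used: {} seconds.'.format(name, seconds))
```

Both extractors return the same pairs on this sample, which makes it easier to check that the timing compares like with like. Note that the XPath version re-parses the document on every call, as the lxml loop above does, while the regex version scans the raw string directly.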