How do I get a string with XPath in Python bs4?

Adriano • 3 years ago • 1388 views

I need to get the string inside an li tag using Python and bs4. I am trying the following code:

from bs4 import BeautifulSoup
from lxml import etree

html_doc = """
<html>
<head>
</head>
<body>
   <div class="container">
      <section id="page">
         <div class="content">   
            <div class="box">  
               <ul>
                  <li>Name: Peter</li>
                  <li>Age: 21</li>
                  <li>Status: Active</li>
               </ul> 
            </div>
         </div>
      </section>
   </div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'lxml')
dom = etree.HTML(str(soup))
print (dom.xpath('/html/body/div/section/div[1]/div[1]/ul/li[3]'))

This returns: [<Element li at 0x7fc640e896c0>]

But the desired result is the text of the li tag, like this: Status: Active

How can I do that? Thanks.

Article URL: http://www.python88.com/topic/130053

Replies [ 2 ] | Latest reply 3 years ago

• Reply 1

balderman 3 years ago

Try the following (no external library needed):

import xml.etree.ElementTree as ET

xml = """
<html>
<head>
</head>
<body>
   <div class="container">
      <section id="page">
         <div class="content">   
            <div class="box">  
               <ul>
                  <li>Name: Peter</li>
                  <li>Age: 21</li>
                  <li>Status: Active</li>
               </ul> 
            </div>
         </div>
      </section>
   </div>
</body>
</html>
"""
root = ET.fromstring(xml)
print(root.find('.//ul')[-1].text)

Output

Status: Active
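
To target the third item explicitly rather than take the last child, ElementTree's limited XPath subset also accepts attribute and position predicates. A minimal sketch, assuming the markup is well-formed XML (ET.fromstring raises ParseError on most real-world HTML); the trimmed xml string is only an illustration:

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the markup from the question.
xml = """
<html>
<body>
   <div class="box">
      <ul>
         <li>Name: Peter</li>
         <li>Age: 21</li>
         <li>Status: Active</li>
      </ul>
   </div>
</body>
</html>
"""

root = ET.fromstring(xml)

# ElementTree supports [@attr='value'] and 1-based [position] predicates.
third = root.find(".//div[@class='box']/ul/li[3]")
# last() selects the final li regardless of how many items there are.
last = root.find(".//div[@class='box']/ul/li[last()]")

print(third.text)
print(last.text)
```

Both lookups land on the same li here; the position form is closer to the li[3] path used in the question.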

• Reply 2

F.Hoque 3 years ago

In XPath, just use text() (a node test that selects the element's text nodes):

from bs4 import BeautifulSoup
from lxml import etree

html_doc = """
<html>
<head>
</head>
<body>
   <div class="container">
      <section id="page">
         <div class="content">   
            <div class="box">  
               <ul>
                  <li>Name: Peter</li>
                  <li>Age: 21</li>
                  <li>Status: Active</li>
               </ul> 
            </div>
         </div>
      </section>
   </div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'lxml')
dom = etree.HTML(str(soup))
print(dom.xpath('/html/body/div/section/div[1]/div[1]/ul/li[3]/text()'))

Output:

 ['Status: Active']

# Or

for li in dom.xpath('/html/body/div/section/div[1]/div[1]/ul/li[3]/text()'):
    txt=li.split()[1]
    print(txt)

Output:

Active
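
Note that split()[1] keeps only the second whitespace-separated word, so a value such as "Not Active" would be cut short. A small sketch, assuming every item follows the "Label: value" pattern; parse_item is a hypothetical helper, not part of bs4 or lxml, and the sample strings are illustrative:

```python
def parse_item(text):
    """Split 'Label: value' at the first colon and trim both parts."""
    label, _, value = text.partition(":")
    return label.strip(), value.strip()


# Sample strings in the same shape as the li text from the question.
items = ["Name: Peter", "Age: 21", "Status: Not Active"]
fields = dict(parse_item(item) for item in items)

print(fields["Status"])
```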

# Or

print(' '.join(dom.xpath('/html/body/div/section/div[1]/div[1]/ul/li[3]/text()')))

Output:

Status: Active

# Or

print(''.join(dom.xpath('//*[@class="box"]/ul/li[3]/text()')))

Output:

Status: Active
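
The BeautifulSoup round trip is not strictly needed for XPath, since lxml can parse the HTML string itself. A minimal sketch, assuming lxml is installed and using a trimmed copy of the markup; the XPath functions string() and normalize-space() return a single string instead of a list:

```python
from lxml import etree

# Trimmed copy of the markup from the question.
html_doc = """
<html><body>
  <div class="box">
    <ul>
      <li>Name: Peter</li>
      <li>Age: 21</li>
      <li>Status: Active</li>
    </ul>
  </div>
</body></html>
"""

dom = etree.HTML(html_doc)

# An element result exposes its text through the .text attribute.
li = dom.xpath('//div[@class="box"]/ul/li[3]')[0]
print(li.text)

# string() returns the concatenated text of the first matching node.
status = dom.xpath('string(//div[@class="box"]/ul/li[3])')
# normalize-space() additionally trims and collapses whitespace.
clean = dom.xpath('normalize-space(//div[@class="box"]/ul/li[3])')
print(status)
print(clean)
```

The .text form is the smallest change to the code in the question: the original xpath call already returned the right element, and only its text was left unread.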
