Question: Using Python and BeautifulSoup to extract specific text from a web page

Hi everyone,

I want to scrape the news headlines from a legal news website and store them in a CSV file. I have only just started learning Python and do not know enough yet, so I would like to ask for your advice.

Because I have several questions, I am posting them separately.

Problem: I cannot correctly extract the text I want.

Part of the page's source code is as follows:

</HR>
<A class="f14 blue001" href="content/2013-11/01/content_4983464.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>Yumen business revamp a number of unlicensed operation of households&nbsp;&nbsp;<SPAN class="f12 black">2013-11-01</SPAN></A> <BR>
<A class="f14 blue001" href="content/2013-11/01/content_4983441.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>Linxia hosted the involved law reform and training class&nbsp;&nbsp;<SPAN class="f12 black">2013-11-01</SPAN></A> <BR>
<A class="f14 blue001" href="content/2013-11/01/content_4983439.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>Suzhou District of Jiuquan City Ma Ying River Sluice road engineering smooth opening&nbsp;&nbsp;<SPAN class="f12 black">2013-11-01</SPAN></A> <BR>
<A class="f14 blue001" href="content/2013-11/01/content_4983401.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>Jiuquan pay close attention to the city of Victoria Day promoting law and petition work reform&nbsp;&nbsp;<SPAN class="f12 black">2013-11-01</SPAN></A> <BR>
<A class="f14 blue001" href="content/2013-10/30/content_4974324.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>Jiuquan Guazhou industrial and commercial bureau to carry out the mass line of educational practice&nbsp;&nbsp;<SPAN class="f12 black">2013-10-30</SPAN></A> <BR>
<A class="f14 blue001" href="content/2013-10/29/content_4971723.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>Guazhou County of Jiuquan City Industrial and commercial bureau to carry out the liquor market concentration and control&nbsp;&nbsp;<SPAN class="f12 black">2013-10-29</SPAN></A> <BR>
<A class="f14 blue001" href="content/2013-10/21/content_4948889.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>Letters and Complaints Bureau of Jiuquan city to open "moral lecture"&nbsp;&nbsp;<SPAN class="f12 black">2013-10-21</SPAN></A> <BR>
<A class="f14 blue001" href="content/2013-10/21/content_4948876.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>To build the new system construction promotes the economic development of Jiuquan&nbsp;&nbsp;<SPAN class="f12 black">2013-10-21</SPAN></A> <BR>
<A class="f14 blue001" href="content/2013-10/18/content_4944212.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>Jiuquan to strengthen the construction of administrative procedures and improve the level of administration according to law&nbsp;&nbsp;<SPAN class="f12 black">2013-10-18</SPAN></A> <BR>
<A class="f14 blue001" href="content/2013-10/16/content_4940043.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>Jiuquan Suzhou Xifeng Xiang to further implement the conflict investigation system&nbsp;&nbsp;<SPAN class="f12 black">2013-10-16</SPAN></A> <BR>


The goal is to extract only the headlines.

My code is as follows:

from bs4 import BeautifulSoup
import re
import urllib2

url = ""  # page URL (left blank in the post)
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

# Collect every <span> element on the page
xinwen = soup.find_all('span')

for xw in xinwen:
    print xw

But when I run it, the headline text is incomplete, and the output contains a lot of <span class="f14 blue001">·</span> elements.

How can I extract the headlines correctly? Thank you.

Started by Derek at December 23, 2016 - 1:04 PM

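from bs4 import BeautifulSoup
import urllib2

d = urllib2.urlopen('')
# Decode the response as UTF-8 before parsing
soup = BeautifulSoup(d.read().decode('utf-8'))
f = open('c:/ldo.txt', 'wb')
# Each headline is an <a> whose class list includes "f14"
for a in soup.find_all('a', attrs={'class': 'f14'}):
    # Remove the bullet and date <span> children so only the headline text remains
    for span in a.find_all('span'):
        span.extract()
    f.write(a.string.strip().encode('utf-8'))
    f.write('\n')
f.close()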
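
The same idea can be sketched in Python 3 without modifying the parsed tree. This is only a minimal illustration, not a tested scraper for the real site: it assumes the page markup looks like the snippet quoted in the question, it parses an inline sample (two entries copied from that snippet) instead of fetching a URL, and the names SAMPLE_HTML and extract_headlines are made up for the example. The headline is taken from the text nodes that sit directly inside each <a>, so the bullet and date spans are skipped rather than removed.

```python
from bs4 import BeautifulSoup, NavigableString

# Two entries copied from the page source quoted in the question (assumed representative).
SAMPLE_HTML = (
    '<A class="f14 blue001" href="content/2013-11/01/content_4983464.htm?node=32245" target=_blank>'
    '<SPAN class="f14 blue001">·</SPAN>'
    'Yumen business revamp a number of unlicensed operation of households&nbsp;&nbsp;'
    '<SPAN class="f12 black">2013-11-01</SPAN></A> <BR>'
    '<A class="f14 blue001" href="content/2013-11/01/content_4983441.htm?node=32245" target=_blank>'
    '<SPAN class="f14 blue001">·</SPAN>'
    'Linxia hosted the involved law reform and training class&nbsp;&nbsp;'
    '<SPAN class="f12 black">2013-11-01</SPAN></A> <BR>'
)


def extract_headlines(html):
    """Return a list of (title, date, href) tuples, one per headline link."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # class_ matches any one of the element's classes, so "f14" matches "f14 blue001".
    for a in soup.find_all("a", class_="f14"):
        # The date lives in the <span> whose class list includes "f12".
        date_span = a.find("span", class_="f12")
        date = date_span.get_text(strip=True) if date_span else ""
        # Keep only the text nodes directly inside <a>; text inside the spans is ignored.
        title = "".join(
            child for child in a.children if isinstance(child, NavigableString)
        ).strip()  # strip() also removes the trailing non-breaking spaces
        rows.append((title, date, a.get("href", "")))
    return rows


rows = extract_headlines(SAMPLE_HTML)
for title, date, href in rows:
    print(date, title, href)
```

Searching for every <span>, as in the question, returns only the bullet and date elements, because the headline text is not inside any span; it is a bare text node inside the link.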

Posted by Malcolm at January 05, 2017 - 1:46 PM
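
The question also asks for the headlines to be stored in a CSV file, which the reply above does not cover (it writes plain text). A minimal Python 3 sketch, assuming the headlines have already been extracted into (title, date, link) tuples; the function name write_headlines_csv and the column names are hypothetical, and the two rows are copied from the page source in the question:

```python
import csv
import io


def write_headlines_csv(rows, fileobj):
    """Write (title, date, link) rows to an open text file object as CSV."""
    writer = csv.writer(fileobj)
    writer.writerow(["title", "date", "link"])  # header row; column names are illustrative
    writer.writerows(rows)


# Sample rows taken from the quoted page source.
rows = [
    ("Yumen business revamp a number of unlicensed operation of households",
     "2013-11-01",
     "content/2013-11/01/content_4983464.htm?node=32245"),
    ('Letters and Complaints Bureau of Jiuquan city to open "moral lecture"',
     "2013-10-21",
     "content/2013-10/21/content_4948889.htm?node=32245"),
]

# Write to an in-memory buffer here so the example has no side effects.
buffer = io.StringIO()
write_headlines_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the same function can be given a file opened with open("headlines.csv", "w", newline="", encoding="utf-8"), where the file name is just an example. The csv module takes care of quoting titles that contain commas or double quotes.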