
Extracting content from the web

Assume that you want to extract content from your web pages. There are two things that you need to extract:

the title, which must be a string;
the content, which, of course, can contain HTML markup.

web2help uses a Python library called BeautifulSoup to parse web pages. To extract content you write an extractor: a (very small) fragment of Python code that tells BeautifulSoup which portion of the web page you are interested in.

The testbed

To help you with writing the extractor, web2help offers a testbed (Project -> Testbed).

Type the URL of a page you want to grab and press Load. The source code of the page is displayed pretty-printed, so that you can analyze its structure.

Then edit the extractor in the lower pane and press Parse: if everything goes fine, in the upper pane you'll see the extracted piece of content. Otherwise web2help will tell you what went wrong.

Writing the extractor

The extractor is the body of a Python function. It receives a parameter named html, which holds the parse tree of your web page. You write code that manipulates html and returns a value representing the extracted content. The returned value should be:

a string, for the title extractor;
(an object representing) an HTML element, or a list of HTML elements, for the content extractor.
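
As a rough illustration of the contract described above, the sketch below wraps two extractor bodies in ordinary functions and calls them on a small page. It assumes the modern bs4 package and its built-in "html.parser" backend; the function names, the sample page and the way the functions are invoked are made up for the example and are not how web2help necessarily calls your code.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample page, used only for this illustration.
PAGE = """
<html><body>
  <h1>Release notes</h1>
  <div id="main"><p>First paragraph.</p></div>
</body></html>
"""

def title_extractor(html):
    # Only this body is what you would type as the title extractor.
    return html.h1.string

def content_extractor(html):
    # Only this body is what you would type as the content extractor.
    return html.find("div", {"id": "main"})

html = BeautifulSoup(PAGE, "html.parser")  # the parse tree passed in as `html`
title = title_extractor(html)              # a string
content = content_extractor(html)          # an element (a subtree)
```

The title extractor yields a string, while the content extractor yields a Tag object that still carries its markup.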

Let's have a look at some examples.

Extracting the title

If the page title is wrapped inside a <h1>...</h1> tag, the extractor can be as simple as:

return html.h1.string

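html.h1 returns the first <h1>...</h1> element found anywhere within html (i.e., the search in the html tree is recursive). .string returns the content of the element as a string, provided that the element contains nothing but text; if the element mixes text and nested tags, .string yields None. (For a text-only heading inside <body>, it would be equivalent to write: return html.h1.contents[0], or return html.body.h1.string.)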
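
The following sketch, which you could try outside web2help, shows how .string behaves on a text-only heading and on a heading with nested markup. It assumes the modern bs4 package; get_text() is a bs4 method and may not be available in older BeautifulSoup releases. Both sample pages are invented for the example.

```python
from bs4 import BeautifulSoup

simple = BeautifulSoup("<body><h1>Plain title</h1></body>", "html.parser")
mixed = BeautifulSoup("<body><h1>Mixed <em>title</em></h1></body>", "html.parser")

# Text-only heading: the three spellings give the same text.
simple_title = simple.h1.string
first_child = simple.h1.contents[0]
via_body = simple.body.h1.string

# Heading that mixes text and a nested tag: .string gives None,
# while get_text() joins all the text found inside the element.
mixed_title = mixed.h1.string
mixed_text = mixed.h1.get_text()
```

If your titles may contain inline markup, returning the joined text rather than .string avoids ending up with None.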

Another example. Assume that the title is placed within a div with id="title". Then type:

return html.find('div', {'id': 'title'}).string
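
Note that find() returns None when nothing matches, so the one-liner above raises an AttributeError on a page without such a div. A slightly more defensive variant might look like the sketch below. It assumes the bs4 package; the extract_title name and the empty-string fallback are choices made for this example, not something web2help requires.

```python
from bs4 import BeautifulSoup

def extract_title(html):
    # Inside web2help only the body of this function would be typed.
    node = html.find("div", {"id": "title"})
    if node is None:
        return ""  # hypothetical fallback when the div is missing
    return node.string

with_title = BeautifulSoup(
    '<html><body><div id="title">My page</div><p>Text</p></body></html>',
    "html.parser",
)
without_title = BeautifulSoup("<p>no title here</p>", "html.parser")

title = extract_title(with_title)
missing = extract_title(without_title)
```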

Extracting the content

Extracting the content can be done similarly, with the exception that you must return a subtree or a collection of subtrees, not a string. Again, let's see some examples.

If your content is placed inside a div with class="content", just type:

return html.find('div', {'class': 'content'})

(since we don't extract a .string, we are returning the whole subtree rooted at our div).
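
To see what "the whole subtree" means in practice, the sketch below runs the same find() call on a small invented page and converts the result back to markup. It assumes the bs4 package and its "html.parser" backend.

```python
from bs4 import BeautifulSoup

page = (
    '<html><body><h1>Title</h1>'
    '<div class="content"><p>Hello <b>world</b></p></div>'
    '</body></html>'
)
html = BeautifulSoup(page, "html.parser")

content = html.find("div", {"class": "content"})  # a Tag, not a string
markup = str(content)                             # the subtree serialized as HTML
inner_bold = content.b.string                     # nested elements stay reachable
```

The returned element keeps its nested tags, so any inline markup inside the div is preserved in the extracted content.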

And now a more complicated example. Assume that both the title (inside h1) and the content are wrapped in the same div. Then, we need to return all the content that follows the title in the same div. In other words, we need to return all the siblings of the h1 node that follow it. We use this fragment of code:

h1Node = html.h1
return h1Node.fetchNextSiblings()

or, in a more compact form:

return html.h1.fetchNextSiblings()

Rather simple, isn't it? Please note that in this case the fetchNextSiblings() function returns not a single node, but a collection (a sequence) of nodes. That makes no difference for web2help.
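
The sibling lookup can be tried on a small invented page, as in the sketch below. It assumes the modern bs4 package, where the method is spelled find_next_siblings(); the fetchNextSiblings() name used above comes from older BeautifulSoup releases.

```python
from bs4 import BeautifulSoup

page = (
    '<div id="main"><h1>Title</h1><p>One</p><p>Two</p>'
    '<ul><li>Three</li></ul></div>'
    '<div id="footer">Footer</div>'
)
html = BeautifulSoup(page, "html.parser")

# All elements that follow the h1 inside the same parent div.
siblings = html.h1.find_next_siblings()
names = [node.name for node in siblings]
```

Only nodes that share the parent of the h1 are returned, so the footer div, which lives outside the wrapping div, is left out.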

BeautifulSoup is your friend