From 6667c9da6bd1f49f7a12931cc316317093188ccb Mon Sep 17 00:00:00 2001
From: Matthew Exon
 However, setting up retriever can be quite tricky since it depends on
-the internal design of the website. This was designed to make life
+the internal design of the website. That was designed to make life
 easy for the website's developers, not for you. You'll need to have
 some familiarity with HTML, and be willing to adapt when the website
 suddenly changes everything without notice.
@@ -43,7 +43,7 @@ A simple case is when the article is wrapped in a "div" element:
 ...
-<div class="main-content">
+<div class="ArticleWrapper">
 <h2>Man Bites Dog</h2>
 <img src="mbd.jpg">
 <p>
@@ -58,7 +58,7 @@ A simple case is when the article is wrapped in a "div" element:
 You then specify the tag "div", attribute "class", and value
-"main-content". Everything else in the page, such as navigation
+"ArticleWrapper". Everything else in the page, such as navigation
 panels and menus and footers and so on, will be discarded. If there
 is more than one section of the page you want to include, specify each
 one on a separate row. If the matching section contains some sections
@@ -76,7 +76,7 @@ articles should be available.
 You can leave the attribute and value blank to include all the
 corresponding elements with the specified tag name. You can also use
-a tag name of "*", which will match any element type with the
+a tag name of just an asterisk ("*"), which will match any element type with the
 specified attribute regardless of the tag.
@@ -120,7 +120,7 @@ To change the URL used to retrieve the page, use the "URL Pattern" and
 "URL Replace" fields. The pattern is a regular expression matching
 part of the URL to replace. In this case, you might use a pattern of
 "/article" and a replace string of "/print/article". A common pattern
-is simply "$", used to add the replace string to the end of the URL.
+is simply a dollar sign ("$"), used to add the replace string to the end of the URL.
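
Note on the last hunk: the "URL Pattern" / "URL Replace" behaviour described in the help text can be illustrated with a minimal sketch. This is not the addon's own code. It uses Python's re module as a stand-in, the helper name rewrite_url and the example.com URLs are hypothetical, and replacing only the first match is an assumption about how the fields behave.

```python
import re


def rewrite_url(url: str, pattern: str, replacement: str) -> str:
    """Replace the first match of the regular expression `pattern` in `url`.

    Hypothetical helper: the real addon may apply the pattern differently
    (for example, replacing every match rather than only the first).
    """
    return re.sub(pattern, replacement, url, count=1)


# Pattern "/article" with replace string "/print/article", as in the help text.
print_url = rewrite_url(
    "http://example.com/article/123", "/article", "/print/article"
)

# Pattern "$" matches the empty position at the end of the URL, so the
# replace string is effectively appended.
appended_url = rewrite_url("http://example.com/article/123", "$", "?print=1")

print(print_url)
print(appended_url)
```

Because "$" matches a position rather than any characters, nothing in the original URL is removed in the second call; the replace string is simply added at the end.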