Just in case anyone is still interested, here are the excerpts from the article I mentioned earlier...
"Abstract
Jedi (Java based Extraction and Dissemination of Information)
is a lightweight tool for the creation of wrappers and
mediators to extract, combine, and reconcile information
from several independent information sources. For wrappers
it uses attributed grammars, which are evaluated with
a fault-tolerant parsing strategy to cope with ambiguous
grammars and irregular sources. For mediation it uses a
simple generic object-model that can be extended with
Java-libraries for specific models such as HTML, XML or
the relational model. This paper describes the architecture
of Jedi, and then focuses on Jedi’s wrapper generator."
"5. Example
In the following we illustrate Jedi’s parsing strategy
along a realistic example taken from a demo located at
“http://www.darmstadt.gmd.de/oasys/projects/jedi/index.html”.
The online demo shows Jedi’s facilities to extract, model
and integrate PSION palmtop computer related product data
from multiple Web sources, and to query and visualize the
extracted data.
Figure 4 presents a screenshot of one source [4] which is
highly irregular, mixing images, natural language text,
forms and the relevant product data arbitrarily.
The code fragment depicted below the figure is the complete
Jedi specification needed to define a grammar to extract
article codes, article descriptions and their price from
this source. Other than in the online demo, the extracted
data is not mapped onto an object model, but directly rewritten
as tagged XML source.
The first rule ’Article’ specifies the source structure of
one article ’record’ and assigns appropriate data portions to
the variables ’code’, ’description’ and ’price’. These are reused
in the code block to write tagged XML code to stdout.
The first two assignment productions can safely contain
a trailing “.*” which is automatically left whenever the
more specific productions ’</B>’ or ’<B>’ match.
The third assignment production either requires a specific
pattern which identifies exactly the price or it requires a
more specific end tag to indicate where the price ends, e.g.:
'<B>' price = ('$' .*) '</B>'
The second rule ’ArticleList’ describes that the source
structure comprises a sequence of ’Article’ records as described
by the first rule.
As can be seen from the screenshot, this rule does not describe
exactly the source as it contains a lot of additional, irrelevant
information that must be filtered out.
Footnote 4: located at http://www.mplanet.com/cgi/Web_store/web_store.cgi?page=psion.html&cart_id=2726135.4533
Figure 4: Snapshot of Source
rule Article is
    '<B>' code = ('MP' .*)
    '</B>' description = .*
    '<B>' price = ('$'[0-9.]+)
    do
        println(
            "<Article>",
            "<Code>", code, "</Code>",
            "<Price>", price, "</Price>",
            "<Description>", description,
            "</Description></Article>"
        );
    end
end

rule ArticleList is
    do
        println("<ArticleList>");
    end
    (list += Article())+
    do
        println("</ArticleList>");
    end
end
Strict parsing approaches will fail when given such a
grammar. Jedi however is able to proceed meaningfully. Its
fault tolerant interpretation of the ’Article’ rule allows to
skip irrelevant portions of the source by the fallback production
associated to the rule.
Finally, the embedded code blocks are evaluated according
to the grammar interpretation described by the most specific
solution path. Code execution will start with the first
’println’ statement of rule ’ArticleList’, it proceeds with the
assignments and code defined in rule ’Article’ as often as
this rule has matched and ends with executing the second
’println’ statement in rule ’ArticleList’. The portions accepted
by fallbacks do not have any side-effects and thus do
not cause any output to be written."
"The strategy has been implemented as part of the Jedi
tool. It offers the extraction language needed to specify context
free grammars for irregularly structured sources which
can be extended by semantic predicates to disambiguate
rules further. Grammar attribution is used to extract relevant
source portions. A fully fledged scripting language and
built-in data modeling means can be used to create flexible
wrappers which rewrite sources directly or instantiate rich
conceptual models for further querying and processing."
Like I said, I don't understand everything in this article by any means, but I think I get the gist of it. The technology described appears to read the HTML of a web page and pull the relevant bits of information out of it (in the example: article codes, descriptions and prices) without any access to the site's database and without needing purpose-made data tags. The grammar simply keys on whatever markup and text patterns happen to surround the data, such as '<B>', 'MP' and '$', and skips everything else. The authors do indicate it isn't perfect, but my impression is that it works to a reasonably high level of confidence in aggregate.
Part of the last line in the final excerpt worries me though: "can be used to create flexible wrappers which rewrite sources directly". This sounds a bit like a bad thing in the wrong hands... but also like it could accomplish the automatic price dropping we've been discussing, I think...
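To get a feel for what the 'Article' rule is doing, here is a rough Python sketch of the same idea. It is not Jedi and not the authors' method: it swaps Jedi's fault-tolerant grammar for a plain regular expression, and the sample page, the ARTICLE pattern and the function names extract_articles and to_xml are all made up for illustration. It assumes, like the grammar in the excerpt, that codes start with 'MP', prices start with '$', and both sit right after a '<B>' tag.

```python
import re
from xml.sax.saxutils import escape

# Made-up page fragment: product data mixed with images, text and a form.
SAMPLE = """
<IMG SRC="logo.gif"><P>Welcome to our store!</P>
<B>MP1001</B> Psion Series 5 leather case <B>$29.95</B>
<HR><FORM><INPUT TYPE="text" NAME="qty"></FORM>
<B>Special offers this week</B>
<B>MP1002</B> Serial link cable <B>$19.50</B>
"""

# Regex stand-in for the 'Article' rule: bold code starting with "MP",
# free text as the description, then a bold price starting with "$".
ARTICLE = re.compile(
    r"<B>\s*(?P<code>MP[^<]*?)\s*</B>"
    r"(?P<description>.*?)"
    r"<B>\s*(?P<price>\$[0-9.]+)",
    re.DOTALL,
)


def extract_articles(html):
    """Return (code, description, price) tuples; non-matching text is skipped."""
    articles = []
    for match in ARTICLE.finditer(html):
        # Drop any tags caught inside the description and collapse whitespace.
        description = re.sub(r"<[^>]+>", " ", match.group("description"))
        description = " ".join(description.split())
        articles.append((match.group("code"), description, match.group("price")))
    return articles


def to_xml(articles):
    """Write the tuples as tagged XML, mirroring the println calls in the excerpt."""
    lines = ["<ArticleList>"]
    for code, description, price in articles:
        lines.append(
            "<Article><Code>{}</Code><Price>{}</Price>"
            "<Description>{}</Description></Article>".format(
                escape(code), escape(price), escape(description)
            )
        )
    lines.append("</ArticleList>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_xml(extract_articles(SAMPLE)))
```

Note that this only reads a copy of the page and writes new XML from it; nothing in the sketch touches the original site. A regex is also much cruder than what the paper describes: if an article had a code but no price, the description would run on until the next price it finds.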