Introduction

Automating a lot of XML and HTML processing is an important goal of Xill. You can crawl and scrape websites, get exactly the parts of content you need from pages, APIs or feeds, and let robots build new XML or clean/change HTML. This article aims to give a complete impression of the possibilities on this subject and to explain how to use all built-in HTML/XML functionality.

Packages

For processing XML, you will need the functions in the XML package. The same functions can be used on HTML, but only if it is also valid XML (XHTML). A page variable that has been acquired by Web.loadPage cannot be used as a parameter in, for instance, XML.xPath. The Web package has its own xPath function, though. So usually you don't mix XML (with the XML package) and HTML (with the Web package), even though the two are very similar.

Scraping webpages or XML documents

For data collection or website migration you might need robots that extract complete pages or specific parts from HTML or XML documents.

Loading HTML or XML from the web

Open a test robot and put the following code in it:

use Web;
use XML;
use System;
use File;

var html_page = Web.loadPage("http://www.google.com");
var page_string = Web.getText(html_page);
System.print(page_string);

var xml_page = Web.loadPage("http://www.omdbapi.com/?t=back+to+the+future&y=&plot=full&r=xml");
var xml_string = Web.getText(xml_page);
var xml = XML.fromString(xml_string); //loading the string into the XML package enables XML functionality like xPath()
System.print(xml_string);

If you run this, you should see the HTML source of the page and the XML about the movie logged in the console. There's a lot more to navigating the web with Xill, like the click() and input() functions. This is beyond the scope of this article, but you can read about it in the web navigation tutorial.

Local writing and loading

In this example, we'll step through ways to save HTML and XML locally and load them back into new variables. Add the code below to your robot from the previous step and follow what it does using the debug functions.

//set paths
var xml_temp_path = "D:/temp/test.xml";
var html_temp_path = "D:/temp/test.html";

//save both examples
File.save(html_temp_path,page_string);
File.save(xml_temp_path,xml_string);

//load html from disk
var page_from_disk = Web.loadPage("file:/" :: html_temp_path);
var page_string_from_disk = Web.getText(page_from_disk);
System.print(page_string_from_disk);

//load xml from disk
var xml_from_disk = XML.fromFile(xml_temp_path);
var xml_string_from_disk = XML.toString(xml_from_disk);
System.print(xml_string_from_disk);

There are some other possibilities though. There is also a Web.download function to directly save things from the web, without loading the contents in a variable first. Try it now if you like.

Extracting information

To extract specific parts of the two markup document types, various functions are built-in. The preferred one is usually the Xpath function, because it's powerful, predictable and fast. If for some reason an Xpath is not enough, you can also convert to string and unleash all the String functions upon it, like regex(), replace() and allMatches().

In order to see what you're doing, we recommend that you use two monitors. That way you can see your code and the source document you're extracting from comfortably next to each other. For web page extraction, the built-in developer tools of any popular browser will usually suffice. Just right-click on a part of the page and select something like 'inspect element'. The preview in the debug panel can suffice for modestly sized documents though.

Xpath

Xpath is used widely by many applications. There are plenty of pages where you can learn about the syntax. To discover how to use a simple Xpath in Xill, add this code to the previous example and run it:

var xillio_page = Web.loadPage("http://www.xillio.com");
var first_h1 = Web.xPath(xillio_page,"//h1[1]/text()");
var actors = XML.xPath(xml,"//movie/@actors");
System.print("The first h1 from the Xillio page has the text: '" :: first_h1 :: "'");
System.print("The actors from the xml doc are: '" :: actors :: "'"); 

As you can see, Xpaths can be used on HTML as well as XML, as long as you take the function from the right package.

Namespaces

Namespaces are used in XML when node names can mean different things in different contexts within the same file, for instance when XML from more than one system resides in one XML structure. Look at the example called 'xmlns-sample.xml'. In that file, most nodes have a namespace, which means Xill's xPath() function will not work without the namespaces parameter. Use this code, set the path of the sample XML and run it:

use XML;
use System;

// set your own path
var sample_files_path = "D:/projects/Xill Developers Platform/tutorial html+xml/";
var xml_doc = XML.fromFile(sample_files_path :: "xmlns-sample.xml");

var namespaces = {
    "hh" : "http://www.w3.org/TR/html4/",
    "ff" : "http://www.w3schools.com/furniture"
};
var fruits = XML.xPath(xml_doc,"//hh:tr/hh:td/text()",namespaces);
var furniture_width = XML.xPath(xml_doc,"//ff:width/text()",namespaces);

System.print(fruits);
System.print(furniture_width);

So, as you can see, you have to actively define all namespaces that you need items from, before you can do Xpaths. We gave the prefixes f and h the new names ff and hh in this example, to point out that they can be different if you like. That is not necessary, but even if you want to use the same prefixes that the document already uses, you have to define them in Xill.

With Xpaths, much more is possible than you might initially think. It is recommended to read up on the possibilities in the built-in documentation or external sources.

Extracting from string

If Xpaths are somehow not practical, you can convert a document to a string and use functions from the String package on it. This is less 'safe' because there is no guarantee that the structure remains intact/valid, and xml nodes are more predictable than text patterns in general. So usually you will want to extract a text node and only then use the String functions on the flat string. For more information on this, read the text editing tutorial.
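The two-step approach described above (first select a text node, then edit the flat string) might look like the minimal sketch below. The sample document and the replacement text are made up for illustration. The sketch only uses calls that appear elsewhere in this article, and it assumes that an Xpath ending in text() with a single match returns a plain string.

```xill
use XML;
use String;
use System;

// Made-up sample document, only for this sketch
var doc = XML.fromString("<movie><plot>Marty travels back to 1955.</plot></movie>");

// Step 1: let the XML package select the text node, so the structure is respected
var plot = XML.xPath(doc, "/movie/plot/text()");

// Step 2: only now use a String function on the flat string
// (the last parameter is false, as in the other String.replace examples in this article)
var changed_plot = String.replace(plot, "1955", "the fifties", false);

System.print(changed_plot);
```

Because the String function only ever sees the extracted text, the XML document itself cannot be damaged by the replacement.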

Crawling

If you want to scrape all pages of a larger website (hundreds of pages or many more), or for instance map navigation paths or find all dead links, you want to build a crawling robot. There is much to say about this subject, but all of the Xill functionality needed to make one is already explained here or elsewhere on this website. At the end of this article, you are given some pointers on how to make a simple crawler.

Building or changing XML/HTML

For web migrations, it's common to clean up or enhance HTML, split or merge fields, and build XML for importing into the new CMS. Let's try a little bit of that.

Editing functions

If for some reason you want to batch-edit HTML, you can either convert to XML and use the relevant functions, or convert to a string and use the text functions. The Web package does not have editing functions.

For web migrations, a relatively complicated case is migrating complete HTML pages to a CMS that builds its pages from separate blocks for body text, images, links, video and other elements. Then you need to split the pages into parts (using String.split()) and make new valid HTML out of the body text parts. One way to re-add closing tags is by using HTML Tidy. This will be included in a function package in a later release, but for the time being you can have Xill call it as an external command line program with System.exec().

There's some functionality available for directly editing XML too. It can only be used on an XML document, so to use it on HTML you need to make sure the HTML is valid XML as well. If it is XHTML, you should be able to convert to the other package like this:

var html_string = Web.getText(html_doc); 
var xml_doc = XML.fromString(html_string);

The editing functions in the XML package are the following:

  • insertNode
  • moveNode
  • removeNode
  • replaceNode
  • setAttribute
  • removeAttribute

The names suggest what you can do with them. The big plus of using these instead of text editing functionality is that you can always be sure you will end up with valid XML. Again, the details of all functions can be read in the built-in documentation, under the XML header.

XML building options

If XML documents need to be generated from scratch, there are several ways to do it. Which one is best depends on the situation and personal preference. Often it is a combination of the simplest way (string concatenation) for the simplest parts of the XML and one of the higher-level ways for the more complicated structures.

Using the XML functions

You could start with an empty xml variable and use the functions above to add all nodes and attributes you wish. For instance:

use XML;
use System;

var xml_doc = XML.fromString("<xml></xml>");
var comments = [
    "comment1",
    "comment2"
];
foreach(index,comment in comments) {
    var comment_node = XML.fromString("<comment>" :: comment :: "</comment>");
    XML.setAttribute(comment_node, "id", index); //also adding an id to the comment
    var parent = XML.xPath(xml_doc,"/xml");
    XML.moveNode(parent,comment_node);
}

System.print(XML.toString(xml_doc)); 

Concatenating strings

This might be the simplest way to build XML, but it can quickly turn robot code into an unmaintainable mess as soon as the XML to be built gets a bit more complex. Example:

use XML;
use System;

var xml_start = "<xml>"; 
var xml_end = "</xml>"; 
var comments = [ 
    "comment1", 
    "comment2" 
]; 
var xml_string = xml_start; 

foreach(index, comment in comments) { 
    xml_string ::= "<comment id='" :: index :: "'>" :: comment :: "</comment>"; 
}

xml_string ::= xml_end; 
var xml = XML.fromString(xml_string); 
System.print(XML.toString(xml));

Using templates

Even though concatenating strings is not the most advanced way of building XML, and no internal validation guarantees that the resulting XML will be valid, it can work well if external templates are used. In this simple case, you'll need only two of them: one for the outer tags, and one for the comments, since there can be any number of them. Just save the XML parts to your hard drive and fill in an identifier in place of the content, like this:

  • <xml>[COMMENTS]</xml> - save this as xml.xml
  • <comment>[COMMENT]</comment> - save this as comment.xml

Then you can load the first template, fill it with a number of comment templates, and fill those with the comments. That makes it much clearer what the XML is going to look like than plain string concatenation such as [open XML tag] :: [open comment tag] :: [comment] :: [close comment tag] (repeat last 3 steps) :: [close xml tag]. The example below keeps both templates inline as strings so that it runs without any files; in a project you would load them from the saved files instead.

use XML;
use String;
use System;

var template_xml = "<xml>[COMMENTS]</xml>";
var template_comment = "<comment>[COMMENT]</comment>";

var comments = [ 
    "comment1", 
    "comment2" 
];

var xml_comments = "";
foreach(comment in comments) {
    xml_comments ::= String.replace(template_comment,"[COMMENT]",comment,false) :: "\r\n";
}

var xml_string = String.replace(template_xml,"[COMMENTS]",xml_comments,false);
System.print(xml_string);

This example has only one node (which you could call a 'field') for a comment, but if you added one author per comment, you'd get two nodes in the comment template. If there were a 1-to-many relation between comment and author, you'd need a separate template for the authors.

This way of building XML has the advantage that you know by looking at the templates how your XML is going to look. Many projects require five or more templates, so it is a good idea to structure this code further with a generic function xmlByTemplate([atomic] template_name,[object] content) that loads the template from disk, replaces the placeholders with the content, and returns the required chunk of XML.
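Such a generic function could be sketched roughly as follows. Treat it as an illustration under a few assumptions, not as a finished routine: it takes the template text itself rather than a template name (loading from disk is left out here), it assumes placeholders are written as [KEY] like in the example above, and it assumes that foreach over an object yields key and value in the same way it yields index and value for a list. The author field and the sample values are made up.

```xill
use String;
use System;

// Hypothetical helper: fills every [KEY] placeholder in the template
// with the matching value from the content object.
function xmlByTemplate(template, content) {
    var result = template;
    foreach(key, value in content) {
        // false: treat the placeholder as plain text, as in the example above
        result = String.replace(result, "[" :: key :: "]", value, false);
    }
    return result;
}

// Made-up template with two fields, only for this sketch
var template_comment = "<comment author='[AUTHOR]'>[COMMENT]</comment>";

var content = {
    "AUTHOR" : "alice",
    "COMMENT" : "comment1"
};

System.print(xmlByTemplate(template_comment, content));
```

Keeping the placeholder names in one object per chunk means that adding a field only requires a change to the template and the content object, not to the helper itself.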

We gave you all the built-in options and some tips. Try different combinations and cleverly design a set of routines; it's up to you how you make your robots maintainable.

Assignment

You can easily decide for yourself if you need some practice. If so, try this!

Build a crawler.
That might be a relatively simple and effective way to test whether you can work with HTML in Xill. You can make it as complicated as you want. Hopefully you can also think of some useful application for yourself. Here are the basic steps:

  1. Choose a website that has more than 25 pages, but is not too complicated.
  2. Start the main part of the robot by loading the home page.
  3. Do an xpath for the urls in a-tags on the page ("//a/@href").
  4. Loop (foreach()) through the urls and pass them to a recursive routine: load each of them, search for new links, give those to the same routine, and so on. Don't load external links! We're not going to download the internet.
  5. After each page load, you can do with it whatever you consider useful or fun. Keep it as simple as possible and only log the url. Or make it more interesting and extract all images, search for dead links, store structured information neatly in a database, or whatever you can think of. If you weren't familiar with xpath yet, try to grab at least some specific fields out of the pages. For instance, you could log all sentences within paragraph tags that contain some chosen word. If you want to scrape specific information, the easiest websites have a lot of semantic HTML, like well-chosen ids and class names (class='price' for a price field, for instance).