Introduction
[Figure: order of visiting folders in breadth-first search (left) versus depth-first search (right)]
Implementing a depth-first crawler in Xill
The heart of the depth-first crawler is the processFolders function, shown below in a stripped-down form. Given a root folder, processFolders() operates as follows:
1. Is the given folder path readable?
2. Are there any subfolders within this folder?
3. If so, for every subfolder, pass the path of the subfolder to another instance of processFolders(), returning to step 1 (this happens for points 1-3 in the depth-first figure above).
4. If not, the bottom of the folder hierarchy is reached (this happens for point 4 in the figure):
   - Save information about the current folder.
   - Save information about all files in the current folder.
   - Processing this folder is finished, so go back up one level and resume execution of the rest of step 3 (for example, return to point 3 and start processing point 5).
function processFolders (folder) {
    if (File.canRead(folder.path)) {
        // descend into every subfolder before handling this folder
        foreach (subfolder in File.iterateFolders(folder.path, false)) {
            processFolders (subfolder);
        }
        // save information about the current folder
        processFolder (folder);
        // save information about every file in the current folder
        foreach (file in File.iterateFiles(folder.path, false)) {
            processFile (file);
        }
    }
}
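To start a crawl, the function is called with an object containing the root path; the call below mirrors the one used in the full robot code at the bottom of this page:

processFolders({"path": "C:\\Program Files", "parent" : "_ROOT_RECORD_"});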
In this way, the crawler goes as deep as possible into all subfolders, storing the properties of every file and folder it encounters and calling itself to descend into deeper-lying folders.
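As an illustration, take a hypothetical root R with subfolders A and B, where A in turn contains subfolder A1. The recursion unfolds as follows, so the folders end up being processed in the order A1, A, B, R:

processFolders(R)
    processFolders(A)
        processFolders(A1)  // no subfolders: process folder A1 and its files
        // back in A: process folder A and its files
    processFolders(B)       // no subfolders: process folder B and its files
    // back in R: process folder R and its files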
Implementing resume functionality
What if your robot crashes at some point, before completing its crawl? It would be a shame to let all that effort go to waste, right? The following two functions keep track of which folders have already been crawled. A global variable completionList maintains key/value pairs of all folders that have been completely crawled, together with their parents. Whenever a folder has been crawled, it is added to the list. Upon adding a folder to this list, all subfolders belonging to it (its children) are removed from the list to keep it at a manageable size. Remember that in our depth-first search, we go as deep as possible into the tree structure and then work our way back up, so by the time a folder is marked as completed, all of its children have already been fully crawled and no longer need entries of their own.
By saving the completionList object to a text file, the crawler robot can survive a crash. The function markBranchCompletion() should be called after returning from a processFolders() call. The function completedBefore() should be used to check whether a folder has already been crawled before processing it; if it returns true, the tested folder path has already been processed, as can be seen in the full robot code at the bottom of this page.
function markBranchCompletion (folder) {
    // record this folder as completed, keyed by path, with its parent as value
    completionList[folder.path] = folder.parent;
    // prune all children of this folder to keep the list small
    var iterable = Collection.duplicate(completionList);
    foreach (key, value in iterable) {
        if (value == folder.path) {
            Collection.remove(completionList, key);
        }
    }
    // persist the list so the crawl can be resumed after a crash
    Stream.write(System.toJSON(completionList, true), File.openWrite("resume.json"));
}

function completedBefore (folderpath) {
    if (completionList == {}) {
        return (false);
    }
    foreach (key, value in completionList) {
        if (key == folderpath) {
            return (true);
        }
    }
    return (false);
}
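To make the pruning concrete, here is what a hypothetical resume.json could look like while crawling a made-up root C:\\data, just after its subfolders C:\\data\\a and C:\\data\\b have been completed but before C:\\data itself is done:

{
    "C:\\data\\a" : "C:\\data",
    "C:\\data\\b" : "C:\\data"
}

Once C:\\data itself completes, markBranchCompletion() adds it and removes both children, leaving only a single entry (assuming C:\\data was the crawl root and therefore has "_ROOT_RECORD_" as its parent):

{
    "C:\\data" : "_ROOT_RECORD_"
}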
Where to go from here?
At this point, the basic crawling functionality is in place. The requirements of your project will dictate what additional actions are performed upon encountering a file or a folder, and what (if any) further analysis is needed upon completing the crawl. Ideas for expansion include:
- Store extended file information, such as mimetype or MD5 hash of a binary
- Crawl multiple roots with one robot (a sketch follows this list)
- Run analysis scripts on results
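As a minimal sketch of crawling multiple roots, the single processFolders() call in the robot code below could be replaced by a loop over a list of root paths (the paths here are hypothetical):

var roots = ["C:\\Program Files", "D:\\Shares"];
foreach (root in roots) {
    processFolders({"path": root, "parent" : "_ROOT_RECORD_"});
}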
Robot code:
use System, File, Collection, String, Stream;

// SET THIS TO FALSE IF YOU WANT TO RESTART CRAWLING FROM THE BEGINNING
var resume = true;

// global vars
var completionList = {};
if (File.exists("resume.json") && resume) {
    completionList = System.parseJSON(File.getText("resume.json"));
}

// fileshare root to crawl
var root = "C:\\Program Files";

// enter main loop
System.print("=== start crawling ===\nresume status: "::resume);
processFolders({"path": root, "parent" : "_ROOT_RECORD_"});
System.print("=== finished crawling ===");

///////////////////
///  functions  ///
///////////////////

function processFolders (folder) {
    if (!completedBefore(folder.path)) {
        if (File.canRead(folder.path)) {
            foreach (subfolder in File.iterateFolders(folder.path, false)) {
                processFolders (subfolder);
            }
            map <processFiles> (File.iterateFiles(folder.path, false));
        }
        // do something with the info of this folder object
        System.print("FOLDER OBJECT FOUND: "::folder.path);
    }
    // mark this folder as completed
    if (resume) {
        markBranchCompletion(folder);
    }
}

function processFiles (document) {
    // do something with the info of this document object
    System.print("  DOCUMENT OBJECT FOUND: "::document, "debug");
}

function markBranchCompletion (folder) {
    completionList[folder.path] = folder.parent;
    var iterable = Collection.duplicate(completionList);
    foreach (key, value in iterable) {
        if (value == folder.path) {
            Collection.remove(completionList, key);
        }
    }
    Stream.write(System.toJSON(completionList, true), File.openWrite("resume.json"));
}

function completedBefore (folderpath) {
    if (completionList == {}) {
        return (false);
    }
    foreach (key, value in completionList) {
        if (key == folderpath) {
            return (true);
        }
    }
    return (false);
}