Web Crawling Interface

int hcrawl (char * url, int depth, int (* action) (struct crawlnode * node))

hcrawl crawls through the URL space starting at url, repeatedly calling action with a crawlnode structure.

Parameters

  • url the URL to start traversing from.
  • depth controls the traversal of the URL space.
    • CRL_INTRASITE stay within the bounds of the "site" as defined by the starting URL. No exterior nodes will be traversed.
    • CRL_EXTRAFIRST traverse as with INTRASITE, but touch each exterior node. This is generally useful for creating site maps, etc.
    • CRL_LEAFVISIT normally, CRL_EXTRAFIRST doesn't attempt to open off-site URLs, for efficiency's sake. LEAFVISIT forces it to open those URLs, which is useful for detecting all broken links within a site.
    • CRL_EXTRASITE loose the hounds.
  • action The routine to be called at each node. It is called with a crawlnode structure which describes the environment and other information about the URL, including the MIME type and the HTTP status (provided the node was actually visited; see CRL_LEAFVISIT). The action routine should return 1 to keep traversing, or 0 to stop. A usage sketch follows this list.
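
For concreteness, here is a minimal usage sketch that counts the nodes within a single site. Only the hcrawl() prototype, the CRL_* depth values, and the callback contract above come from this page; the header name and the assumption that a negative return indicates failure are guesses.

      #include <stdio.h>
      #include "hcrawl.h"   /* assumed header declaring hcrawl(), CRL_* and struct crawlnode */

      static int nodes_seen = 0;

      /* Called once per node; return 1 to keep traversing, 0 to stop. */
      static int count_node(struct crawlnode *node)
      {
          (void) node;      /* crawlnode field names aren't listed here, so just count */
          nodes_seen++;
          return 1;
      }

      int main(void)
      {
          /* CRL_INTRASITE: stay within the site defined by the starting URL */
          if (hcrawl("http://www.mtcc.com/", CRL_INTRASITE, count_node) < 0) {
              fprintf(stderr, "hcrawl failed\n");
              return 1;
          }
          printf("visited %d nodes\n", nodes_seen);
          return 0;
      }
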
Note: hcrawl() is not especially bright about CGI scripts. It won't crawl any script which has parameters, but makes no other assumptions; unfortunately, there doesn't seem to be a way to find out whether the HTML is generated dynamically or not.

Caveat: the URL parser doesn't have any clue about the translation of unqualified names of the form:
      http://www.mtcc.com/
      
to
      http://www.mtcc.com/index.html
      
It assumes that it should be index.html, but that is clearly not always the case, especially with lame M$ "operating systems" which can only deal with .htm. This can cause some nodes to be accidentally revisited, along with other seemingly strange behavior.


© 1997 MTCC
Last modified: Mon Apr 28 12:49:37 PDT 1997