- int hcrawl (char * url, int depth, int (* action) (struct crawlnode * node))
- hcrawl crawls through the URL space starting at url, repeatedly calling action with a crawlnode structure.
Parameters
- url the URL to start traversing from.
- depth controls the traversal of the URL space. It takes one of the following values:
- CRL_INTRASITE stay within the bounds of the "site" as defined by the starting URL. No exterior nodes will be traversed.
- CRL_EXTRAFIRST traverse as with INTRASITE, but touch each exterior node. This is generally useful for creating site maps, etc.
- CRL_LEAFVISIT normally, CRL_EXTRAFIRST doesn't attempt to open off-site URLs for efficiency's sake. LEAFVISIT forces it to open those URLs, which is useful for detecting all broken links within a site.
- CRL_EXTRASITE loose the hounds.
- action The routine to be called at each node. It is called with a crawlnode structure which describes the environment and other information about the URL, including the MIME type and the HTTP status (provided the node was actually visited; see CRL_LEAFVISIT). The action routine returns 1 if traversal should continue, 0 if it should stop. A usage sketch follows this list.
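A minimal usage sketch, assuming the library ships a header declaring hcrawl(), struct crawlnode, and the CRL_* flags; the header name and the crawlnode field names mentioned in the comments are assumptions for illustration, not part of this documentation:

    /* Usage sketch only -- the header name "hcrawl.h" and the crawlnode
     * field names in the commented-out printf are assumptions; the real
     * structure layout comes from the library's header. */
    #include <stdio.h>
    #include "hcrawl.h"   /* assumed: declares hcrawl(), struct crawlnode, CRL_* flags */

    /* Action routine: called at each node; return 1 to keep traversing, 0 to stop. */
    static int report(struct crawlnode *node)
    {
        /* The structure carries at least the MIME type and the HTTP status of
         * the node (when it was actually visited); field names here are
         * hypothetical:
         * printf("%d %s %s\n", node->http_status, node->mime_type, node->url);
         */
        (void)node;       /* silence unused-parameter warning in this stub */
        return 1;         /* keep crawling */
    }

    int main(void)
    {
        /* Stay on the starting site but record each exterior node once --
         * the site-map style of traversal described above. */
        hcrawl("http://www.mtcc.com/", CRL_EXTRAFIRST, report);
        return 0;
    }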
Note: hcrawl() is not especially bright about CGI scripts. It won't crawl any script which has parameters, but it makes no other assumptions. Unfortunately, there doesn't seem to be a way to find out whether the HTML is generated dynamically or not.
Caveat: the URL parser doesn't have any clue about the translation of an unqualified name of the form http://www.mtcc.com/ to http://www.mtcc.com/index.html. It assumes that it should be index.html, but that is clearly not always the case, especially with lame M$ "operating systems" which can only deal with .htm. This can cause some nodes to be accidentally revisited, and other seemingly strange behavior.