biz.neustar.wpm.api
Interface WebBot


public interface WebBot

Downloads web pages using a simple emulated browser, without evaluating JavaScript. Images and content files on the page are downloaded concurrently, much like a real browser. Frames are also downloaded.

Caching can be turned on/off using enableCache(boolean).

The concurrent downloading behaviour can be adjusted by calling emulateBrowser(String).

Cookies are saved and recorded by HttpClient as for normal requests.

Headers can be added or overridden for every download.

Resources are found by parsing the downloaded HTML pages. Parsing can be slow and can miss items that are downloaded by JavaScript or added to the DOM by JavaScript. To enable custom parsing, or pre-baked graphs of items to download, a JavaScript callback function can be called after every item is downloaded.

 var web = require('webbot');

 // Uncomment the following to emulate Webmetrics Fullpage Breakdown
 // web.emulateBrowser("wm");

 // Grab the HttpClient to make more advanced requests
 var c = test.openHttpClient();

 test.beginTransaction();
 test.beginStep("step1", 20000);
   var get = c.newGet("http://www.bbc.co.uk/news");
   var r = web.execute(get);
   r.searchString("news");
 test.endStep();
 test.endTransaction();
 


Nested Class Summary
static interface WebBot.Response
          The list of responses made during an execute() request.
 
Method Summary
 void addHeader(java.lang.String name, java.lang.String value)
          Add a header to each request made.
 void autoAddHeaders(boolean autoAdd)
          Enable/Disable auto adding of some common headers.
 void clearCache()
          Clear any currently cached data.
 void emulateBrowser(java.lang.String browser)
          Emulate the given browser's concurrent downloads, user agent and headers.
 void enableCache(boolean enableCache)
          Enable/Disable the use of the cache.
 WebBot.Response execute(HttpRequest request)
          Perform the given Http request.
 WebBot.Response execute(HttpRequest request, NativeFunction callback)
          Perform the given Http request and call the given callback for every item that is downloaded.
 void loadFrames(boolean loadSubFrames)
          Enable/Disable the loading of sub frames.
 void removeHeader(java.lang.String name)
          Remove a header.
 void resetHeaders()
          Remove all user added headers.
 void setMaxConnections(int maxConnections)
          Set a limit on the maximum number of concurrent connections open at any one time.
 void setMaxConnectionsPerHost(int maxConnectionsPerHost)
          Set a limit on the maximum number of concurrent connections allowed to be open to a single host.
 

Method Detail

emulateBrowser

void emulateBrowser(java.lang.String browser)
Emulate the given browser's concurrent downloads, user agent and headers.

Parameters:
browser - the browser to emulate. Possible values are 'wm', 'chrome', 'firefox', 'ie8', 'ie9'.
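
The page does not say what happens when an unlisted name is passed, so a script may want to check the name first. A minimal sketch, assuming a hypothetical helper `pickBrowserProfile` (not part of the WebBot API) and the list of values given above:

```javascript
// Profile names copied from the parameter description above.
var SUPPORTED_BROWSERS = ['wm', 'chrome', 'firefox', 'ie8', 'ie9'];

// Hypothetical helper: returns a documented profile name, or the fallback
// when the requested name is not in the documented list.
function pickBrowserProfile(name, fallback) {
  var candidate = String(name).toLowerCase();
  return SUPPORTED_BROWSERS.indexOf(candidate) !== -1 ? candidate : fallback;
}

// In a script, with `web` obtained as in the class example:
// web.emulateBrowser(pickBrowserProfile(requestedName, 'chrome'));
```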

setMaxConnections

void setMaxConnections(int maxConnections)
Set a limit on the maximum number of concurrent connections open at any one time.

Default is 35.

Parameters:
maxConnections - the maximum number of concurrent connections

setMaxConnectionsPerHost

void setMaxConnectionsPerHost(int maxConnectionsPerHost)
Set a limit on the maximum number of concurrent connections allowed to be open to a single host.

The default is 6.

Parameters:
maxConnectionsPerHost - the maximum number of concurrent connections per host
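
The two limits combine: each host contributes at most the per-host limit, and the sum is capped by the overall limit. The following standalone sketch estimates that upper bound for a list of pending URLs. The helper names `hostOf` and `maxParallel` are illustrative only, and the calculation is an estimate, not a description of WebBot's actual scheduler.

```javascript
// Defaults quoted from the method descriptions above.
var DEFAULT_MAX_CONNECTIONS = 35;
var DEFAULT_MAX_PER_HOST = 6;

// Extract the host part of an absolute URL; returns '' if none is found.
function hostOf(url) {
  var m = /^[a-z][a-z0-9+.\-]*:\/\/([^\/?#]+)/i.exec(url);
  return m ? m[1].toLowerCase() : '';
}

// Upper bound on simultaneous downloads for a list of pending URLs.
function maxParallel(urls, maxConnections, maxPerHost) {
  var perHost = Object.create(null); // host -> number of pending URLs
  urls.forEach(function (url) {
    var host = hostOf(url);
    perHost[host] = (perHost[host] || 0) + 1;
  });
  var total = 0;
  Object.keys(perHost).forEach(function (host) {
    total += Math.min(perHost[host], maxPerHost);
  });
  return Math.min(total, maxConnections);
}
```

For example, ten pending items on one host and two on another give an upper bound of 6 + 2 = 8 with the default limits.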

loadFrames

void loadFrames(boolean loadSubFrames)
Enable/Disable the loading of sub frames.

Parameters:
loadSubFrames - if true, frames are parsed for further downloads

enableCache

void enableCache(boolean enableCache)
Enable/Disable the use of the cache.

Parameters:
enableCache - if true, items can be cached

clearCache

void clearCache()
Clear any currently cached data.


addHeader

void addHeader(java.lang.String name,
               java.lang.String value)
Add a header to each request made.

Parameters:
name - the header name to add
value - the header value to add

removeHeader

void removeHeader(java.lang.String name)
Remove a header.

Parameters:
name - the header to remove

resetHeaders

void resetHeaders()
Remove all user added headers.


autoAddHeaders

void autoAddHeaders(boolean autoAdd)
Enable/Disable auto adding of some common headers. The following are added by default:
 Accept-Encoding: gzip,deflate
 Pragma: no-cache
 User-Agent: <useragent>
 Accept-Language: en-US,en
 Accept: */*
 
Default is on.

Parameters:
autoAdd - if true, the above headers are automatically added
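
To illustrate how the automatic headers and user-added headers might combine, here is a standalone sketch. `defaultHeaders` and `effectiveHeaders` are hypothetical names, not WebBot methods. The sketch assumes that a user header replaces an automatic header of the same name and that names are compared case-insensitively (as HTTP header names are); the page does not confirm either detail. The Accept value is written as `*/*`, the standard wildcard media range.

```javascript
// Automatic headers as listed above; the user agent depends on the
// emulated browser, so it is passed in.
function defaultHeaders(userAgent) {
  return {
    'Accept-Encoding': 'gzip,deflate',
    'Pragma': 'no-cache',
    'User-Agent': userAgent,
    'Accept-Language': 'en-US,en',
    'Accept': '*/*'
  };
}

// Merge user headers over the automatic ones (assumed behaviour).
function effectiveHeaders(autoAdd, userHeaders, userAgent) {
  var merged = {};
  var canonical = {}; // lower-case name -> name as last written
  function put(name, value) {
    var key = name.toLowerCase();
    if (canonical[key] !== undefined) delete merged[canonical[key]];
    canonical[key] = name;
    merged[name] = value;
  }
  if (autoAdd) {
    var defaults = defaultHeaders(userAgent);
    Object.keys(defaults).forEach(function (n) { put(n, defaults[n]); });
  }
  Object.keys(userHeaders).forEach(function (n) { put(n, userHeaders[n]); });
  return merged;
}
```

Under these assumptions, calling `web.addHeader("Pragma", "cache")` with auto adding enabled would correspond to `effectiveHeaders(true, { 'Pragma': 'cache' }, ua)`, which keeps the other four automatic headers unchanged.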

execute

WebBot.Response execute(HttpRequest request)
Perform the given Http request.

The cache is checked, and if the item is not there it is downloaded. Any HTML that is downloaded is searched for image, CSS and script links, and these are downloaded as well. No JavaScript is executed, however. Frames are also downloaded, and the same process of downloading their images, CSS and scripts applies to them.

Parameters:
request - This object can be obtained via a call to HttpClient.newGet(String), HttpClient.newPost(String), etc.
Returns:
a response with all the items downloaded
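
The resource search described above can be pictured with a small standalone sketch. This is not WebBot's parser: `attrValues` and `findResources` are hypothetical helpers that use simple regular expressions, and they ignore edge cases such as `>` inside attribute values. The sketch shows why a resource requested only from script code is never found.

```javascript
// Collect the value of `attr` from every <tag ...> occurrence in html.
function attrValues(html, tag, attr) {
  var tagRe = new RegExp('<' + tag + '\\b[^>]*>', 'gi');
  var attrRe = new RegExp(
    '\\s' + attr + '\\s*=\\s*(?:"([^"]*)"|\'([^\']*)\'|([^\\s>]+))', 'i');
  var values = [];
  (html.match(tagRe) || []).forEach(function (t) {
    var m = attrRe.exec(t);
    if (m) values.push(m[1] !== undefined ? m[1] : m[2] !== undefined ? m[2] : m[3]);
  });
  return values;
}

// Find image, script, stylesheet and frame links in an HTML string.
function findResources(html) {
  var stylesheets = [];
  (html.match(/<link\b[^>]*>/gi) || []).forEach(function (t) {
    // Only <link> tags whose rel starts with "stylesheet" count as CSS.
    if (/\srel\s*=\s*["']?stylesheet\b/i.test(t)) {
      stylesheets = stylesheets.concat(attrValues(t, 'link', 'href'));
    }
  });
  return {
    images: attrValues(html, 'img', 'src'),
    scripts: attrValues(html, 'script', 'src'),
    stylesheets: stylesheets,
    frames: attrValues(html, 'frame', 'src').concat(attrValues(html, 'iframe', 'src'))
  };
}
```

An image assigned in an inline script, for instance `new Image().src = "/img/late.png"`, appears in none of the returned lists because no JavaScript is evaluated.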

execute

WebBot.Response execute(HttpRequest request,
                        NativeFunction callback)
Perform the given Http request and call the given callback for every item that is downloaded. This enables custom parsing of the HTML responses and custom caching.

Items downloaded will not be added to the cache.

The callback takes the response to a request and returns an array of new requests to make.

 var r = web.execute(get, function(response) {
   if (response.getUrl() === "http://somesite.biz") {
     return [ c.newGet("http://somesite.biz/images/logo.png"),
              c.newGet("http://somesite.biz/css/default.css"),
              c.newGet("http://somesite.biz/script/common.js") ];
   } else {
     return [];
   }
 });
 

Parameters:
request - This object can be obtained via a call to HttpClient.newGet(String), HttpClient.newPost(String), etc.
callback - a JavaScript function that takes a response and generates new requests to make given the response.
Returns:
a response when all the items are downloaded
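
For a pre-baked graph of items to download, the callback can be generated from a lookup table instead of being written by hand. A minimal sketch, assuming a hypothetical helper `graphCallback` (not part of the API); it relies only on `response.getUrl()`, as used in the example above. The duplicate check is a defensive assumption, since the page does not say whether WebBot filters repeated requests itself.

```javascript
// Build a callback for web.execute(request, callback) from a pre-baked
// graph: an object mapping a URL to the URLs to download after it.
// `newGet` creates a request from a URL, e.g. a wrapper around c.newGet.
function graphCallback(graph, newGet) {
  var requested = {}; // URLs already handed back, to avoid duplicates
  return function (response) {
    var url = response.getUrl();
    var children = Object.prototype.hasOwnProperty.call(graph, url) ? graph[url] : [];
    return children.filter(function (child) {
      if (requested[child]) return false;
      requested[child] = true;
      return true;
    }).map(function (child) { return newGet(child); });
  };
}

// Usage in a script, with c, web and get as in the examples above:
// var graph = {
//   'http://somesite.biz': ['http://somesite.biz/css/default.css',
//                           'http://somesite.biz/script/common.js']
// };
// var r = web.execute(get, graphCallback(graph, function (u) { return c.newGet(u); }));
```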

Copyright © 2020 Neustar, Inc. All Rights Reserved.