This document describes how to use the P2P mechanism of Hyper Estraier. If you have never read the user's guide, please read it beforehand.
estseek.cgi is not efficient because it connects to the database on every execution. Moreover, search is impossible while the database is being updated, because estcmd locks the database. To solve these problems, Hyper Estraier provides a server program based on the C/S (client/server) architecture. A resident process keeps the connection to the database and serves operations via the network. The C/S architecture has the following advantages.
Because the protocol between client and server is based on HTTP, popular web browsers can be used as clients. Of course, you can also implement clients in your own way. It is also a good idea to use technologies around web browsers such as JavaScript and Flash.
Distributed processing based on the P2P (peer-to-peer) architecture is supported. If you use 10 servers, each handling one million documents, you can search 10 million documents. Because the servers are equivalent, the network service as a whole keeps working even if one server crashes. Moreover, a reliability value can be set between servers, which can improve search precision.
The node API is provided to hide the protocol between client and server. Using the node API, you can implement client applications without detailed knowledge of networking. This document describes how to use the node API (C language). Interfaces of the node API for Java and Ruby are also provided.
This section describes the P2P architecture of Hyper Estraier.
If you use many indexes, it is inefficient to run a server per index. So, a program called the node master is provided. Although it works as one process and uses one network port, it can handle several indexes. Because each index performs its own service, we can regard a "node master" as an aggregation of several index servers. From this viewpoint, each virtual server handling an index is called a "node server". Each node server has its own URL. A client application knows the URL of a node server but does not know in which node master the node server works.
![[framework]](nodeframe.png)
The term "node" is used in the same sense as the term "peer" in the P2P architecture. One client may connect to the node master itself and manage nodes, while another client connects to a node server and searches or registers documents.
A node server can link to another node server one-sidedly. When a client sends a query to a node server, the node server relays the query to its linked nodes. The responses of the linked nodes and of the first node are merged and sent back to the client. That is, so-called meta search is supported by every node, and it realizes distributed processing in the P2P architecture.
The meta search is performed hierarchically. Because routing loops are detected and suppressed automatically, the behavior is the same as searching a tree-structured network. Due to this mechanism, it is possible to keep adding nodes to the network without limit.
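The hierarchical relay with loop suppression can be pictured with a small simulation. The following is only a sketch and not the routing code of Hyper Estraier: the adjacency matrix `links', the functions `relay' and `meta_search_reach', and the use of a visited array are all assumptions made for illustration.

```c
#include <string.h>

#define MAXNODES 8

/* links[a][b] != 0 means node a has a one-sided link to node b
   (hypothetical model of the network, not a Hyper Estraier structure) */
static int links[MAXNODES][MAXNODES];

/* Ask `node` and let it relay the query up to `depth` more hops.
   `visited` suppresses loops, so every node answers at most once.
   Returns the number of nodes that took part in the search. */
static int relay(int node, int depth, int *visited){
  int i, count;
  if(visited[node]) return 0;   /* loop detected: do not ask the node again */
  visited[node] = 1;
  count = 1;
  if(depth < 1) return count;   /* depth exhausted: do not relay further */
  for(i = 0; i < MAXNODES; i++){
    if(links[node][i]) count += relay(i, depth - 1, visited);
  }
  return count;
}

/* Count the nodes reached from `start` with the given depth of meta search. */
int meta_search_reach(int start, int depth){
  int visited[MAXNODES];
  memset(visited, 0, sizeof(visited));
  return relay(start, depth, visited);
}
```

With links 0 to 1, 1 to 2, and 2 back to 0, a depth of 0 reaches only the first node, and a large depth reaches each of the three nodes exactly once even though the links form a loop.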
![[tree of meta search]](metatree.png)
A reliability value called "credit" is set on each link between nodes. When a node merges the responses collected by meta search, the credit is used to weight the scores. So, documents in the responses of nodes with high credit tend to be ranked high. As applications are responsible for setting links and their credit, it is possible to improve search precision by increasing the credit of frequently used nodes.
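To make the effect of credit concrete, the following sketch ranks merged hits by a weighted score. It is not the formula used by Hyper Estraier: the structure `HIT', the functions, the scores, and above all the assumption that a credit of 10000 leaves a score unchanged are hypothetical and serve only to illustrate weighting.

```c
#include <stdlib.h>

/* A scored hit as seen by the merging node (hypothetical structure). */
typedef struct {
  const char *uri;   /* URI of the document */
  int score;         /* score reported by the node that owns the document */
  int credit;        /* credit of the link the hit arrived through */
} HIT;

/* ASSUMPTION: a credit of 10000 leaves the score unchanged. */
#define BASE_CREDIT 10000

/* Score after weighting by credit. */
static double weighted(const HIT *hit){
  return (double)hit->score * hit->credit / BASE_CREDIT;
}

/* qsort comparator: higher weighted score first */
static int compare_hits(const void *a, const void *b){
  double wa = weighted((const HIT *)a);
  double wb = weighted((const HIT *)b);
  return (wa < wb) - (wa > wb);
}

/* Sort merged hits in descending order of weighted score. */
void rank_hits(HIT *hits, int num){
  qsort(hits, num, sizeof(HIT), compare_hits);
}
```

Under these assumptions, a hit with score 110 that arrives through a link of credit 8000 ranks below a local hit with score 100, and moves above it when the credit of the link is raised to 12000.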
When a client connects to the node master or a node server, authentication with a user name and a password is performed. Users are classified into super users and normal users. The former have permission to manage nodes and users. The latter do not have those permissions. Moreover, permission is granted per node server. Each node has a list of administrators, who can update the index, and a list of normal users, who can only search. Besides, super users of the node master can connect as administrators to node servers in the same node master.
As the concept of P2P seems difficult, let's try to use some commands and learn it by degrees.
For preparation for the node master, create the server root directory which includes configuration files and indexes. Perform the following command and a directory "casket" will be created.
estmaster init casket
Next, start the node master. Perform the following command.
estmaster start casket
To stop the node master, input Ctrl-D on the terminal on which the node master is running or perform the following command on another terminal.
estmaster stop casket
While the node master is running, you can access "http://localhost:1978/masterui" with a web browser and use the administration interface. When you access the URL, a dialog is shown and the user name and the password are required. Input "admin" and "admin". Then, the menu of administration commands is displayed.
If you step into "Manage Master" and select "SHUTDOWN", you can stop the node master. But, leave it for now.
Select "Manage Users". As you are logged in as the user "admin", create a new user account and switch to it. Input the user name, the password, the flags, the full name, and the miscellaneous information into the forms at the bottom of the page. The name and the password can contain alphanumeric characters only. For now, input "clint", "tnilc", "s", "Clint Eastwood", and "Dirty Harry". It is important to set "s" in the flags. It means that the user is a super user.
Now, the user "admin" is no longer needed. As leaving it poses a potential security problem, delete it. Select "DELE" in the line of "admin" and push "delete" in the confirmation step that follows.
Step into "Manage Nodes". Because the user "admin" has been deleted, you are asked for the user name and the password again. Input "clint" and "tnilc" so that you can continue. Next, create new nodes. Input the node name and the label into the form at the bottom of the page. The name can contain alphanumeric characters only. For now, create a node whose name is "test1" and whose label is "First Node". Then create another node whose name is "test2" and whose label is "Second Node".
Now go back to command line operations. As the terminal of the node master is busy showing log messages, open another terminal.
Let's register some documents into the indexes of the nodes. You need to prepare document drafts of the documents to register. Create the following file and save it as "data001.est".
@uri=data001
@title=Material Girl

Living in a material world
And I am a material girl
You know that we are living in a material world
And I am a material girl
To register it into the node "test1", perform the following command. Because administrator permission is needed to update the index, you should specify the user name and the password with the -auth option. Registration finishes in a flash. It has succeeded if no error message is shown.
estcall put -auth clint tnilc http://localhost:1978/node/test1 data001.est
To prepare for the meta search explained later, register another document into the node "test2" as well. Create the following file and save it as "data002.est".
@uri=data002
@title=Liberian Girl

Liberian girl
You came and you changed My world
A love so brand new
Then, perform the following command.
estcall put -auth clint tnilc http://localhost:1978/node/test2 data002.est
It is useful to register documents from remote machines, isn't it? As with the above steps, register some other documents.
All right, let's search for some registered documents. Perform the following command, and information of the corresponding document is shown.
estcall search http://localhost:1978/node/test1 "material world"
By setting a link between the two nodes, you can do meta search. Now, set the link from "test1" to "test2". Specify the URL of the source node, the URL of the destination node, the label to display, and the credit.
estcall setlink -auth clint tnilc http://localhost:1978/node/test1 \
  http://localhost:1978/node/test2 TEST02 8000
Then, search again. This time, set the depth of meta search with the option -dpt.
estcall search -dpt 1 http://localhost:1978/node/test1 "girl"
Though you access the node "test1", the result of "test2" is merged and shown. This is the key feature of meta search in the P2P architecture. If "test1" and "test2" are on separate machines, distributed computing is realized.
If you increase the credit, documents in the result of the destination node are ranked higher. Perform the following command and search again. Then, you will see that the ranking has changed.
estcall setlink -auth clint tnilc http://localhost:1978/node/test1 \
  http://localhost:1978/node/test2 TEST02 12000
You can get the result as XML data. Perform the following command. See estresult.dtd for details of the XML format.
estcall search -dpt 1 -vx http://localhost:1978/node/test1 "girl"
Each node server embeds a search interface for web browsers. Access "http://localhost:1978/node/test1/searchui" to use it.
`estmaster' is provided as a command to manage the node master. This section describes how to use estmaster.
estmaster is an aggregation of sub commands. The name of a sub command is specified by the first argument. Other arguments are parsed according to each sub command. The argument rootdir specifies the server root directory, which contains the configuration files and so on.
All sub commands return 0 if the operation succeeds, otherwise 1. A running node master closes the database and exits when it catches the signal 2 (SIGINT), 3 (SIGQUIT), or 15 (SIGTERM). Moreover, when a running node master catches the signal 1 (SIGHUP), the process restarts and re-reads the configuration files.
A running node server should be stopped by a valid means, either from the command line or via the network. Otherwise, the index may be broken.
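The signal behavior described above follows a common pattern for resident processes. The following is a minimal sketch of that pattern, not the code of estmaster: the flag names and the function `install_handlers' are hypothetical, and the main loop that would poll the flags, close the database, or re-read the configuration is left out.

```c
#include <signal.h>

/* flags set by the handlers and polled by the main loop of the server */
volatile sig_atomic_t shutdown_requested = 0;
volatile sig_atomic_t reload_requested = 0;

/* SIGINT, SIGQUIT, SIGTERM: ask the main loop to close the database and exit */
static void on_terminate(int signum){
  (void)signum;
  shutdown_requested = 1;
}

/* SIGHUP: ask the main loop to restart and re-read the configuration */
static void on_hangup(int signum){
  (void)signum;
  reload_requested = 1;
}

/* Install the handlers; returns 0 on success, -1 on failure. */
int install_handlers(void){
  if(signal(SIGINT, on_terminate) == SIG_ERR) return -1;
  if(signal(SIGQUIT, on_terminate) == SIG_ERR) return -1;
  if(signal(SIGTERM, on_terminate) == SIG_ERR) return -1;
  if(signal(SIGHUP, on_hangup) == SIG_ERR) return -1;
  return 0;
}
```

The handlers only set flags because little else is safe to do inside a signal handler. The actual cleanup belongs in the main loop, which is why killing the process by other means risks breaking the index.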
The server root directory contains the following files and directories.
The prime configuration file can be edited with a text editor. However, the user account file should not be edited while the node master is running.
If you have an index created by estcmd, move it into the node directory and restart the server. The index will then work as a node.
The prime configuration file is composed of lines, each of which holds the name of a variable and its value separated by `:'. By default, the following configuration is there.
hostname: localhost
portnum: 1978
runmode: 1
authmode: 2
maxconn: 30
sessiontimeout: 600
searchtimeout: 8
searchdepth: 5
proxyhost:
proxyport:
loglevel: 2
docroot:
indexfile:
trustednode:
denyuntrusted: 0
cachesize: 64
specialcache:
snipwwidth: 480
sniphwidth: 96
snipawidth: 96
uilprefix: file:///home/mikio/public_html/
uigprefix: http://localhost/
uigsuffix:
uidirindex: index.html
uireplace: //localhost/{{!}}//127.0.0.1/
uireplace: //127.0.0.1:80/{{!}}//127.0.0.1/
uiextattr: @author|Author
uiextattr: @mdate|Modification Date
uismplphrase: 1
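The line format of the configuration above, a name and a value separated by `:', can be read with a few lines of C. The following is a sketch under the assumption that the first colon is the separator and that spaces after it are not part of the value; the function `parse_config_line' is hypothetical and is not how estmaster itself reads the file.

```c
#include <string.h>

/* Split one configuration line of the form "name: value".
   On success, copies the name and the value (leading blanks skipped)
   into the given buffers and returns 1.
   Returns 0 if the line has no colon or a field does not fit. */
int parse_config_line(const char *line, char *name, size_t nsiz,
                      char *value, size_t vsiz){
  const char *sep = strchr(line, ':');
  size_t nlen, vlen;
  if(!sep) return 0;
  nlen = (size_t)(sep - line);
  if(nlen == 0 || nlen >= nsiz) return 0;
  memcpy(name, line, nlen);
  name[nlen] = '\0';
  sep++;                                   /* skip the colon */
  while(*sep == ' ' || *sep == '\t') sep++;
  vlen = strcspn(sep, "\r\n");             /* value ends at the line end */
  if(vlen >= vsiz) return 0;
  memcpy(value, sep, vlen);
  value[vlen] = '\0';
  return 1;
}
```

Splitting at the first colon only matters for values that contain colons themselves, such as the URL in `uigprefix'. A variable with no value, such as `proxyhost', yields an empty string.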
Meaning of each variable is the following.
The user account file is composed of lines and each includes the name, the encrypted password, the flags, the full name, and the miscellaneous information separated by tabs. The character encoding is UTF-8. By default, the following account is there.
admin 21232f297a57a5a743894a0e4a801fc3 s Carolus Magnus Administrator
The password is expressed as an MD5 hash value. In the flags, "s" is for super users, and "b" is for banned users. The flags, the full name, and the miscellaneous information can be omitted.
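As an illustration of the account file format, the following sketch splits one line into its tab-separated fields and interprets the flags. The structure `ACCOUNT', its field sizes, and the function names are hypothetical; the sketch does not compute or verify the MD5 hash.

```c
#include <string.h>

/* Fields of one line of the user account file (sizes are arbitrary). */
typedef struct {
  char name[64];
  char passwd[64];   /* MD5 hash of the password as a hexadecimal string */
  char flags[16];
  char fname[128];   /* full name */
  char misc[256];    /* miscellaneous information */
} ACCOUNT;

/* Copy the tab-delimited field starting at *sp into buf and advance *sp. */
static void next_field(const char **sp, char *buf, size_t size){
  size_t len = strcspn(*sp, "\t\r\n");
  size_t n = len < size - 1 ? len : size - 1;
  memcpy(buf, *sp, n);
  buf[n] = '\0';
  *sp += len;
  if(**sp == '\t') (*sp)++;
}

/* Parse a line; returns 1 if at least the name and the password are present.
   Omitted trailing fields are left as empty strings. */
int parse_account_line(const char *line, ACCOUNT *ac){
  const char *sp = line;
  next_field(&sp, ac->name, sizeof(ac->name));
  next_field(&sp, ac->passwd, sizeof(ac->passwd));
  next_field(&sp, ac->flags, sizeof(ac->flags));
  next_field(&sp, ac->fname, sizeof(ac->fname));
  next_field(&sp, ac->misc, sizeof(ac->misc));
  return ac->name[0] != '\0' && ac->passwd[0] != '\0';
}

/* "s" in the flags marks a super user, "b" a banned user. */
int is_super_user(const ACCOUNT *ac){
  return strchr(ac->flags, 's') != NULL;
}

int is_banned(const ACCOUNT *ac){
  return strchr(ac->flags, 'b') != NULL;
}
```

Applied to the default account line, with its fields separated by tabs, the sketch reports "admin" as a super user who is not banned.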
By accessing the absolute URL "/masterui" of the node master with a web browser, you can use the administration interface. It requires authentication as a super user.
By accessing the URL which is the URL of a node server followed by "/searchui", you can use the search interface.
Communication between nodes and communication between clients and nodes are carried out by a protocol based on HTTP. This section describes the protocol.
The node master and node servers implement HTTP/1.0. As for now, such particular features of HTTP/1.1 as keep-alive connection, chunked encoding, and content negotiation are not supported.
While both GET and POST are allowed as the request method of HTTP, GET is preferred if the command retrieves information, and POST is preferred if the command updates the node master or a node server. As the character encoding of parameters is UTF-8, meta characters and multi-byte characters should be escaped by URL encoding (application/x-www-form-urlencoded). The maximum length of data sent with the GET method is 8000. Authentication information is passed by the basic authentication mechanism of HTTP.
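Escaping a parameter value can be sketched as follows. The function `url_encode' is a hypothetical helper, not part of Hyper Estraier. It encodes a space as "%20" rather than "+", on the assumption that the server decodes percent escapes in the usual way.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Percent-encode the UTF-8 bytes of `src` into `dst` (capacity `size`).
   Unreserved ASCII characters are copied as-is; every other byte,
   including each byte of a multi-byte character, becomes %XX.
   Returns the length written, or -1 if the buffer is too small. */
int url_encode(const char *src, char *dst, size_t size){
  size_t wi = 0;
  const unsigned char *rp;
  if(size < 1) return -1;
  for(rp = (const unsigned char *)src; *rp != '\0'; rp++){
    if(*rp < 0x80 && (isalnum(*rp) || strchr("-_.~", *rp))){
      if(wi + 1 >= size) return -1;   /* room for the byte and the terminator */
      dst[wi++] = (char)*rp;
    } else {
      if(wi + 3 >= size) return -1;   /* room for %XX and the terminator */
      sprintf(dst + wi, "%%%02X", (unsigned int)*rp);
      wi += 3;
    }
  }
  dst[wi] = '\0';
  return (int)wi;
}
```

For example, the phrase "material world" becomes "material%20world". Because escaping can triple the length of a value, checking the encoded length before choosing GET helps to stay within the limit for that method.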
If an operation is done successfully, the status code 200 or 202 is returned. On error, one of the following status codes is returned.
The result of a search or retrieve operation is sent as the message body of the response. As the format of the data is plain text encoded in UTF-8, it can be structured with tabs and line feeds.
To operate the node master, connect to the path "/master" of the server. For example, if the host name is "skyhigh.estraier.go.jp" and the port number is 8888, connect to "http://skyhigh.estraier.go.jp:8888/master". Only super users are allowed to operate the node master. There are some sub commands for operations of the node master. The name of a sub command is specified by the parameter "action". Other parameters vary according to each sub command.
To operate a node server, connect to a path which begins with "/node/" and is followed by the name of the node. For example, if the host name is "skyhigh.estraier.go.jp", the port number is 8888, and the name of the node is "foo", connect to "http://skyhigh.estraier.go.jp:8888/node/foo". There are some sub commands for operations of node servers. The name of a sub command is specified after the node name. Parameters vary according to each sub command.
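The two URL schemes can be summarized in code. The following sketch composes a URL for a sub command of the node master and one for a sub command of a node server; both function names are hypothetical helpers, and the caller is assumed to pass names that need no URL encoding.

```c
#include <stdio.h>

/* Compose "http://HOST:PORT/master?action=ACTION" into buf.
   Returns 1 on success, 0 if the buffer is too small. */
int master_action_url(char *buf, size_t size, const char *host, int port,
                      const char *action){
  int len = snprintf(buf, size, "http://%s:%d/master?action=%s",
                     host, port, action);
  return len >= 0 && (size_t)len < size;
}

/* Compose "http://HOST:PORT/node/NAME/COMMAND" into buf.
   Returns 1 on success, 0 if the buffer is too small. */
int node_command_url(char *buf, size_t size, const char *host, int port,
                     const char *node, const char *command){
  int len = snprintf(buf, size, "http://%s:%d/node/%s/%s",
                     host, port, node, command);
  return len >= 0 && (size_t)len < size;
}
```

The node master takes the sub command as the parameter "action", whereas a node server takes it as the path element after the node name, so the node "foo" receives the search command at ".../node/foo/search".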
Note that while super users have permission to administrate all nodes, an administrator of a node is not necessarily a super user. Moreover, the setting of normal users of each node has meaning only when the authorization mode is 3 (all).
The format of the entity body of the result of the search command is similar to multipart of MIME. The following is an example.
--------[2387AD2E34554FFF]--------
VERSION	0.9
NODE	http://localhost:1978/node/sample1
HIT	2
HINT#1	give	2
DOCNUM	2
WORDNUM	31
TIME	0.001
LINK#0	http://localhost:1978/node/sample1	Sample1	10000	2	31	2731304	2
LINK#1	http://localhost:1978/node/sample2	Sample2	4000	3	125	8524522	1
VIEW	SNIPPET
--------[2387AD2E34554FFF]--------
#nodelabel=Sample Node One
#nodeurl=http://localhost:1978/node/sample1
@id=1
@uri=http://localhost/foo.html

You may my glories and my state dispose,
But not my griefs; still am I king of those.
(
Give	give
 it up, Yo!
Give	give
 it up, Yo!)
--------[2387AD2E34554FFF]--------
#nodelabel=Sample Node One
#nodeurl=http://localhost:1978/node/sample1
@id=2
@uri=http://localhost/bar.html

The faster I go, the behinder I get.
(
Give	give
 it up, Yo!
Give	give
 it up, Yo!)
--------[2387AD2E34554FFF]--------:END
Each line feed is a single LF. The first line defines the border string. The parts are delimited by the border string. The last border string is followed by ":END". The first part is the meta section. The other parts are document sections.
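The part structure can be walked with plain string functions. The following sketch counts the document sections of a response held in memory; the function `count_document_sections' is hypothetical, and reading the response from the network is outside its scope.

```c
#include <string.h>

/* Count the document sections of a search response.
   The first line defines the border string.  Each later line equal to the
   border closes one part, and the border followed by ":END" closes the
   last part.  The first part is the meta section, so it is not counted.
   Returns -1 if the response is malformed. */
int count_document_sections(const char *res){
  const char *line, *eol;
  size_t blen, llen;
  int parts = 0;
  eol = strchr(res, '\n');
  if(!eol || eol == res) return -1;
  blen = (size_t)(eol - res);           /* length of the border string */
  line = eol + 1;
  while(*line != '\0'){
    eol = strchr(line, '\n');
    llen = eol ? (size_t)(eol - line) : strlen(line);
    if(llen >= blen && memcmp(line, res, blen) == 0){
      if(llen == blen){
        parts++;                        /* border: one part ends here */
      } else if(llen == blen + 4 && memcmp(line + blen, ":END", 4) == 0){
        parts++;                        /* final border: stop reading */
        return parts - 1;
      }
    }
    if(!eol) break;
    line = eol + 1;
  }
  return -1;                            /* ":END" was never seen */
}
```

Treating a missing ":END" as an error lets a client notice a response that was cut off, for example by a timeout.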
The format of the meta section is TSV. The meaning of each line is determined by its first field. There are the following kinds.
Each document part expresses the attributes and a snippet of a document. The lines from the top to the first empty line express attributes. Their format is the same as that of a document draft. The format of the snippet is TSV. Each line is a string to be shown. Though most lines have only one field, some lines have two fields. If the second field exists, the first field is to be shown highlighted, and the second field is its normalized form.
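Handling one line of a snippet can be sketched as follows. The function `parse_snippet_line' is a hypothetical helper that separates the text to show from the optional normalized form; how the highlight is rendered is left to the caller.

```c
#include <string.h>

/* Split one snippet line into the text to show and its normalized form.
   Returns 1 if the line has a second field, which means the text should be
   highlighted, and 0 if it is plain text.  `norm` is set to "" for plain
   lines.  Both buffer sizes must be at least 1; longer fields are cut. */
int parse_snippet_line(const char *line, char *text, size_t tsiz,
                       char *norm, size_t nsiz){
  size_t tlen = strcspn(line, "\t\n");
  size_t nlen = 0;
  const char *np = "";
  int highlighted = 0;
  if(line[tlen] == '\t'){
    np = line + tlen + 1;               /* second field: normalized form */
    nlen = strcspn(np, "\t\n");
    highlighted = 1;
  }
  if(tlen >= tsiz) tlen = tsiz - 1;
  if(nlen >= nsiz) nlen = nsiz - 1;
  memcpy(text, line, tlen);
  text[tlen] = '\0';
  memcpy(norm, np, nlen);
  norm[nlen] = '\0';
  return highlighted;
}
```

A line made of "Give", a tab, and "give" yields the text "Give" with the normalized form "give" and is reported as highlighted, whereas a line with a single field is reported as plain text.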
The following pseudo-attributes are added to each result document of the search command and the get_doc command.
Because URL encoding is inefficient for the large data sent with the put_doc command, the raw mode is supported. If the value of "Content-Type" is "text/x-estraier-draft", the entity body is treated as a document draft itself. The following is an example.
POST /node/foo/put_doc HTTP/1.0
Content-Type: text/x-estraier-draft
Content-Length: 138

@uri=http://gogo.estraier.go.jp/sample.html
@title=Twinkle Twinkle Little Star

Twinkle, twinkle, little star,
How I wonder what you are.
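Composing such a request mainly means computing "Content-Length" from the draft. The following sketch builds the request text in a buffer. The function `build_put_doc_request' is hypothetical, and the sketch leaves out the header for basic authentication that a real request to a node needs, as well as the code that sends the request.

```c
#include <stdio.h>
#include <string.h>

/* Build a raw-mode put_doc request for the node `node` whose entity body
   is the document draft `draft`.  Header lines end with CRLF as usual in
   HTTP; the draft is copied unchanged.
   Returns the total length, or -1 if the buffer is too small. */
int build_put_doc_request(char *buf, size_t size, const char *node,
                          const char *draft){
  int len = snprintf(buf, size,
                     "POST /node/%s/put_doc HTTP/1.0\r\n"
                     "Content-Type: text/x-estraier-draft\r\n"
                     "Content-Length: %lu\r\n"
                     "\r\n"
                     "%s",
                     node, (unsigned long)strlen(draft), draft);
  if(len < 0 || (size_t)len >= size) return -1;
  return len;
}
```

For the draft in the example above, with each line ended by a single LF including the last one, the computed length is 138 bytes, which agrees with the "Content-Length" shown there.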
Because implementing HTTP yourself is tedious, the node API is useful. This section describes how to use the node API.
Using the node API, you can implement clients that communicate with node servers without considering such low-level processing as TCP/IP and HTTP. Though the node API has more overhead than the core API, it is valuable because it can be executed on a remote host and can perform parallel processing without distinguishing readers from writers.
In each source of applications of the node API, include `estraier.h', `estnode.h', `cabin.h', and `stdlib.h'.
#include <estraier.h>
#include <estnode.h>
#include <cabin.h>
#include <stdlib.h>
To build an application, perform the following command. It is the same as with the core API.
gcc `estconfig --cflags` -o foobar foobar.c `estconfig --ldflags` `estconfig --libs`
Because the node API uses features of the core API also, if you have never read the programming guide, please read it beforehand.
For preparation to use the node API, initialize the network environment at the beginning of a program. Moreover, the environment should be freed at the end of the program.
The function `est_init_net_env' is used in order to initialize the networking environment.
The function `est_free_net_env' is used in order to free the networking environment.
The type of the structure `ESTNODE' is for abstraction of a connection to a node. A node has its own URL. An entity of `ESTNODE' is never accessed directly, only through a pointer. The term "node connection object" means the pointer and its referent. A node connection object is created by the function `est_node_new' and destroyed by `est_node_delete'. Every created node connection object should be destroyed.
The following is a typical use case of node connection object.
ESTNODE *node;
/* create a node connection object */
node = est_node_new("http://estraier.gov:1978/node/foo");
/* set the proxy, the timeout, and the authentication */
est_node_set_proxy(node, "proxy.qdbm.go.jp", 8080);
est_node_set_timeout(node, 5);
est_node_set_auth(node, "mikio", "oikim");
/* register documents or search for documents here */
/* destroy the object */
est_node_delete(node);
The function `est_node_new' is used in order to create a node connection object.
The function `est_node_delete' is used in order to destroy a node connection object.
The function `est_node_set_proxy' is used in order to set the proxy information of a node connection object.
The function `est_node_set_timeout' is used in order to set timeout of a connection.
The function `est_node_set_auth' is used in order to set the authentication information of a node connection object.
The function `est_node_status' is used in order to get the status code of the last request of a node.
The function `est_node_put_doc' is used in order to add a document to a node.
The function `est_node_out_doc' is used in order to remove a document from a node.
The function `est_node_out_doc_by_uri' is used in order to remove a document specified by URI from a node.
The function `est_node_get_doc' is used in order to retrieve a document in a node.
The function `est_node_get_doc_by_uri' is used in order to retrieve a document specified by URI in a node.
The function `est_node_get_doc_attr' is used in order to retrieve the value of an attribute of a document in a node.
The function `est_node_get_doc_attr_by_uri' is used in order to retrieve the value of an attribute of a document specified by URI in a node.
The function `est_node_uri_to_id' is used in order to get the ID of a document specified by URI.
The function `est_node_name' is used in order to get the name of a node.
The function `est_node_label' is used in order to get the label of a node.
The function `est_node_doc_num' is used in order to get the number of documents in a node.
The function `est_node_word_num' is used in order to get the number of unique words in a node.
The function `est_node_size' is used in order to get the size of the database of a node.
The function `est_node_search' is used in order to search a node for documents corresponding to a condition.
The function `est_node_set_user' is used in order to manage a user account of a node.
The function `est_node_set_link' is used in order to manage a link of a node.
The type of the structure `ESTNODERES' is for abstraction of a search result from a node. A result is composed of a list of corresponding documents and information of hints. An entity of `ESTNODERES' is never accessed directly, only through a pointer. The term "node result object" means the pointer and its referent. A node result object is created by the function `est_node_search' and destroyed by `est_noderes_delete'. Every created node result object should be destroyed.
The type of the structure `ESTRESDOC' is for abstraction of a document in a search result. A result document is composed of some attributes and a snippet. An entity of `ESTRESDOC' is never accessed directly, only through a pointer. The term "result document object" means the pointer and its referent. A result document object is obtained by the function `est_noderes_get_doc', but it should not be destroyed because the entity is managed inside the node result object.
The following is a typical use case of a node result object and a result document object.
ESTNODERES *nres;
CBMAP *hints;
ESTRESDOC *rdoc;
int i;
/* create a node result object */
nres = est_node_search(node, cond, 1);
/* get hints */
hints = est_noderes_hints(nres);
/* show the hints here */
/* scan documents in the result */
for(i = 0; i < est_noderes_doc_num(nres); i++){
  /* get a result document object */
  rdoc = est_noderes_get_doc(nres, i);
  /* show the result document object here */
}
/* destroy the node result object */
est_noderes_delete(nres);
The function `est_noderes_delete' is used in order to delete a node result object.
The function `est_noderes_hints' is used in order to get a map object for hints of a node result object.
The function `est_noderes_doc_num' is used in order to get the number of documents in a node result object.
The function `est_noderes_get_doc' is used in order to refer to a result document object in a node result object.
The function `est_resdoc_uri' is used in order to get the URI of a result document object.
The function `est_resdoc_attr_names' is used in order to get a list of attribute names of a result document object.
The function `est_resdoc_attr' is used in order to get the value of an attribute of a result document object.
The function `est_resdoc_snippet' is used in order to get the snippet of a result document object.
Node connection objects, node result objects, and result document objects cannot be shared between threads. If you use multiple threads, make each thread have its own objects. As long as this precondition is kept, functions of the node API can be treated as thread-safe functions.
The following is the simplest implementation of a gatherer.
#include <estraier.h>
#include <estnode.h>
#include <cabin.h>
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char **argv){
  ESTNODE *node;
  ESTDOC *doc;
  /* initialize the network environment */
  if(!est_init_net_env()){
    fprintf(stderr, "error: network is unavailable\n");
    return 1;
  }
  /* create and configure the node connection object */
  node = est_node_new("http://localhost:1978/node/test1");
  est_node_set_auth(node, "admin", "admin");
  /* create a document object */
  doc = est_doc_new();
  /* add attributes to the document object */
  est_doc_add_attr(doc, "@uri", "http://estraier.gov/example.txt");
  est_doc_add_attr(doc, "@title", "Over the Rainbow");
  /* add the body text to the document object */
  est_doc_add_text(doc, "Somewhere over the rainbow. Way up high.");
  est_doc_add_text(doc, "There's a land that I heard of once in a lullaby.");
  /* register the document object to the node */
  if(!est_node_put_doc(node, doc))
    fprintf(stderr, "error: %d\n", est_node_status(node));
  /* destroy the document object */
  est_doc_delete(doc);
  /* destroy the node object */
  est_node_delete(node);
  /* free the networking environment */
  est_free_net_env();
  return 0;
}
The following is the simplest implementation of a searcher.
#include <estraier.h>
#include <estnode.h>
#include <cabin.h>
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char **argv){
  ESTNODE *node;
  ESTCOND *cond;
  ESTNODERES *nres;
  ESTRESDOC *rdoc;
  int i;
  const char *value;
  /* initialize the network environment */
  if(!est_init_net_env()){
    fprintf(stderr, "error: network is unavailable\n");
    return 1;
  }
  /* create the node connection object */
  node = est_node_new("http://localhost:1978/node/test1");
  /* create a search condition object */
  cond = est_cond_new();
  /* set the search phrase to the search condition object */
  est_cond_set_phrase(cond, "rainbow AND lullaby");
  /* get the result of search */
  nres = est_node_search(node, cond, 0);
  if(nres){
    /* for each document in the result */
    for(i = 0; i < est_noderes_doc_num(nres); i++){
      /* get a result document object */
      rdoc = est_noderes_get_doc(nres, i);
      /* display attributes */
      if((value = est_resdoc_attr(rdoc, "@uri")) != NULL)
        printf("URI: %s\n", value);
      if((value = est_resdoc_attr(rdoc, "@title")) != NULL)
        printf("Title: %s\n", value);
      /* display the snippet text */
      printf("%s", est_resdoc_snippet(rdoc));
    }
    /* delete the node result object */
    est_noderes_delete(nres);
  } else {
    fprintf(stderr, "error: %d\n", est_node_status(node));
  }
  /* destroy the search condition object */
  est_cond_delete(cond);
  /* destroy the node object */
  est_node_delete(node);
  /* free the networking environment */
  est_free_net_env();
  return 0;
}
`estcall' is provided as a client command to manage the node server. This section describes how to use estcall.
estcall is an aggregation of sub commands. The name of a sub command is specified by the first argument. Other arguments are parsed according to each sub command. The argument nurl specifies the URL of a node. The option -proxy specifies the host name and the port number of a proxy server. The option -tout specifies timeout in seconds. The option -auth specifies the user name and the password of authentication information.
All sub commands return 0 if the operation succeeds, otherwise 1.
Operations for the node master itself are not provided as APIs; use the raw sub command for that purpose. For example, the following command shuts down the node master.
estcall raw -auth admin admin \
  'http://localhost:1978/master?action=shutdown'
In order to add a user, perform the following command.
estcall raw -auth admin admin \
  'http://localhost:1978/master?action=useradd&name=mikio&passwd=iloveyou'
In order to use the POST method, perform the following command.
echo -n 'action=useradd&name=mikio&passwd=iloveyou' |
estcall raw -auth admin admin \
-eh 'Content-Type: application/x-www-form-urlencoded' \
'http://localhost:1978/master' -