<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Hello world</title>
</head>
<body>
<script>document.write('Hello,world!');</script>
</body>
</html>
Unfortunately, a conventional web crawler gets nothing from this HTML page: it cannot execute JavaScript, so the generated text never appears in its results.
Seaflower can crawl it.
What is Seaflower?
Seaflower is the world's first DOM crawler for vertical search. It is based on the Firefox browser and runs on Linux systems. It can crawl the dynamic content that JavaScript generates in web pages, and it outputs the DOM data as XML, from which you can extract specific data with XPath.
Conventional web crawlers have some disadvantages, such as:
1. Dynamic content generated by JavaScript cannot be crawled
2. Data in malformed web pages is difficult to extract
3. Pages are crawled by emulating a browser rather than by using a real one
4. Page content is incomplete because content in FRAME/IFRAME is not included
In contrast, Seaflower has the following advantages:
1. It's based on the Firefox browser
2. Data is in XML format
The XML data can be parsed into a DOM (Document Object Model), so you can use XPath to extract content. To get the page title, use /html/head/title. To get all links, use //a/@href, and so on.
3. Web page data is complete and fresh
The web page data returned by Seaflower includes the dynamic data generated by JavaScript and the data in FRAME/IFRAME.
4. Multi-threaded, runs in the background
5. Simple crawl protocol
The crawl protocol is HTTP-like. You can get XML results with a simple GET command.
6. Paging by JavaScript is supported
Seaflower provides the EXEC command to execute JavaScript on a specific URL. By combining the GET, CONTINUE, and NODATA commands, you can fetch page content continually. Seaflower also provides a getNodeByXPath method for JavaScript; to emulate a click on the first input button, just send EXEC getNodeByXPath('/html/body/input[1]').onclick()
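The XPath expressions mentioned in item 2 can be tried without a running Seaflower. The following is a minimal sketch using the JDK's javax.xml.xpath package. The class name XPathExtractDemo, its helper methods, and the sample XML string are made up for illustration, and the sketch assumes the XML has no default namespace; real Seaflower output may differ.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Seaflower.
final class XPathExtractDemo {
    /** Returns the string value of the first node matched by expr. */
    static String selectString(String xml, String expr) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
    }

    /** Returns the values of all nodes (for example attributes) matched by expr. */
    static List<String> selectStrings(String xml, String expr) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Sample XML invented for this sketch; it stands in for crawled output.
        String xml = "<html><head><title>Hello world</title></head>"
                + "<body><a href=\"/a.html\">A</a><a href=\"/b.html\">B</a></body></html>";
        System.out.println(selectString(xml, "/html/head/title")); // Hello world
        System.out.println(selectStrings(xml, "//a/@href"));       // [/a.html, /b.html]
    }
}
```

If the XML declares a namespace on the html element, these unprefixed expressions would match nothing, and a NamespaceContext would have to be set on the XPath object.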
Note: Seaflower isn't a spider; it's a crawling tool. Try seaspider, the cutting-edge spider system for vertical search.
NEW 20080721: Code based on Firefox 3.0.1; fixed some bugs.
20080621: Code based on Firefox 3.0.
20080609: Code based on Firefox 3.0rc2; added the seaflowerctl command; fixed some bugs.
20080524: Code based on Firefox 3.0rc1; uses a thread pool.
service seaflower start
service seaflower stop
service seaflower status
usage: seaflowerctl command
where command is:
  1) list
       list current settings
  2) set [port|rcj|vxsmin|vxsmax|captureWaitTime] value
       set config
  3) proxy [ on <IP> <PORT> | off ]
       set or clear proxy setting
  4) help
       print this help info
Example 1: seaflowerctl list
  List current settings.
Example 2: seaflowerctl set port 4444
  Set Seaflower to listen on port 4444.
Example 3: seaflowerctl set rcj 4444-5555-6666
  Set the registration code to 4444-5555-6666.
Example 4: seaflowerctl set vxsmin 5
  Start 5 virtual browsers when Seaflower starts.
Example 5: seaflowerctl set captureWaitTime 2
  Set the wait time before crawling to 2 seconds.
Example 6: seaflowerctl proxy on 192.168.28.91 8080
  Set the proxy to 192.168.28.91, port 8080.
Example 7: seaflowerctl proxy off
  Clear the proxy.
Note: vxsmax is not used.
usage: crawl [-h host] [-p port] [-w timeToWait] url
Options:
  -h   host which Seaflower listens on, default: localhost
  -p   port which Seaflower listens on, default: 4050
  -w   wait time before crawl (unit: second)
  url  the URL you want to crawl
Seaflower listens on port 4050 by default.
Request: the client sends a snapshot request as follows. Each line ends with <LF>, which stands for the newline character, and the request ends with a blank line.
GET <url> <LF> or EXEC <javascripts> <LF>
WAIT-TIME: <n> <LF>
CONTINUE <LF>
NODATA <LF>
<LF>
Note: <url> is the URL to crawl; <javascripts> is the JavaScript program to run, which must be on one line; <n> is the wait time before the crawl in seconds (default 0). One of GET or EXEC is required; WAIT-TIME, CONTINUE, and NODATA are optional. WAIT-TIME specifies the wait time before the crawl. CONTINUE makes Seaflower keep the socket connection open. NODATA specifies that no data is returned.
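The request format above can be sketched as a small message builder. This is only an illustration under assumptions: the class SeaflowerRequestDemo, the method buildRequest, and the example.com URL are hypothetical, and the sketch follows the documented format literally (uppercase keywords, lines terminated by LF).

```java
// Hypothetical helper class, not part of Seaflower.
final class SeaflowerRequestDemo {
    /**
     * Builds one request message. Exactly one of url and script must be non-null.
     * WAIT-TIME is emitted only when waitSeconds is positive, since 0 is the documented default.
     */
    static String buildRequest(String url, String script, int waitSeconds,
                               boolean keepOpen, boolean noData) {
        if ((url == null) == (script == null)) {
            throw new IllegalArgumentException("exactly one of url or script is required");
        }
        String target = (url != null) ? url : script;
        if (target.indexOf('\n') >= 0 || target.indexOf('\r') >= 0) {
            // The protocol requires the URL or script to fit on one line.
            throw new IllegalArgumentException("url or script must be on one line");
        }
        StringBuilder sb = new StringBuilder();
        sb.append(url != null ? "GET " : "EXEC ").append(target).append('\n');
        if (waitSeconds > 0) {
            sb.append("WAIT-TIME: ").append(waitSeconds).append('\n');
        }
        if (keepOpen) {
            sb.append("CONTINUE\n");
        }
        if (noData) {
            sb.append("NODATA\n");
        }
        sb.append('\n'); // the blank line ends the request
        return sb.toString();
    }

    public static void main(String[] args) {
        // Load a page and keep the connection open for a follow-up command.
        System.out.print(buildRequest("http://example.com/list.html", null, 2, true, false));
        // Click the first input button on the loaded page, for example to turn the page.
        System.out.print(buildRequest(null,
                "getNodeByXPath('/html/body/input[1]').onclick()", 0, true, false));
    }
}
```

Sending a GET with CONTINUE and then an EXEC on the same connection mirrors the paging approach described earlier; whether a further request is needed to fetch the updated page depends on server behavior that this sketch does not assume.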
Response message (OK)
SEAFLOWER/3.8 200 OK <LF>
Crawl-Pid: <pid> <LF>
Content-Length: <length> <LF>
<LF>
<contents>
Note: <pid> is the PID of the crawler process; <length> is the length of the XML data; <contents> is the XML data Seaflower crawled.
Response message (Failed)
SEAFLOWER/3.8 <code> <error> <LF>
<LF>
Note: <code> is the error code, which starts with 4; <error> is the error message.
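Both response forms can be parsed with one routine: read the status line, read headers up to the blank line, then read the body. The sketch below is an illustration only. The class SeaflowerResponseDemo and its members are hypothetical, and it assumes that Content-Length counts bytes of UTF-8 data, which the description above does not state.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of Seaflower.
final class SeaflowerResponseDemo {
    static final class Response {
        final int code;
        final String message;
        final Map<String, String> headers;
        final String body;

        Response(int code, String message, Map<String, String> headers, String body) {
            this.code = code;
            this.message = message;
            this.headers = headers;
            this.body = body;
        }
    }

    /** Reads one line terminated by LF (a trailing CR is dropped); returns null at end of stream. */
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            buf.write(b);
        }
        if (b == -1 && buf.size() == 0) {
            return null;
        }
        String line = new String(buf.toByteArray(), StandardCharsets.UTF_8);
        return line.endsWith("\r") ? line.substring(0, line.length() - 1) : line;
    }

    static Response parse(InputStream in) throws IOException {
        String status = readLine(in);
        if (status == null) {
            throw new EOFException("no status line");
        }
        // Status line: SEAFLOWER/<version> <code> <message>
        String[] parts = status.trim().split(" ", 3);
        if (parts.length < 2) {
            throw new IOException("malformed status line: " + status);
        }
        int code = Integer.parseInt(parts[1]);
        String message = parts.length == 3 ? parts[2] : "";
        Map<String, String> headers = new LinkedHashMap<>();
        String line;
        while ((line = readLine(in)) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
            }
        }
        String body = "";
        String length = headers.get("Content-Length");
        if (length != null) {
            // Assumption: the length is a byte count, so read exactly that many bytes.
            byte[] data = new byte[Integer.parseInt(length)];
            int off = 0;
            while (off < data.length) {
                int n = in.read(data, off, data.length - off);
                if (n == -1) {
                    throw new EOFException("body shorter than Content-Length");
                }
                off += n;
            }
            body = new String(data, StandardCharsets.UTF_8);
        }
        return new Response(code, message, headers, body);
    }

    public static void main(String[] args) throws IOException {
        // Sample response invented for this sketch.
        String raw = "SEAFLOWER/3.8 200 OK\nCrawl-Pid: 123\nContent-Length: 13\n\n<html></html>";
        Response r = parse(new ByteArrayInputStream(raw.getBytes(StandardCharsets.UTF_8)));
        System.out.println(r.code + " " + r.headers.get("Crawl-Pid") + " " + r.body);
    }
}
```

Bounding the read by Content-Length, rather than reading to end of stream, lets the client return even when CONTINUE keeps the connection open.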
package com.syntimes.commons;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.Socket;
import java.net.UnknownHostException;
import java.util.StringTokenizer;
import org.apache.commons.lang.math.NumberUtils;

/**
 * seaflower crawler
 * @author zhsoft88
 * @since 2008-4-13
 */
public class Seaflower {
    public static final int PORT = 4050;

    /**
     * crawl result
     * @author zhsoft88
     * @since 2008-4-13
     */
    public static class SeaflowerResult {
        private int status;
        private String title;
        private String location;
        private String contents;
        private long time;

        public SeaflowerResult(int status,String title,String location,String contents,long time) {
            this.status = status;
            this.title = title;
            this.location = location;
            this.contents = contents;
            this.time = time;
        }
        public long getTime() {
            return time;
        }
        public int getStatus() {
            return status;
        }
        public String getContents() {
            return contents;
        }
        public String getTitle() {
            return title;
        }
        public String getLocation() {
            return location;
        }
        @Override
        public String toString() {
            return "[status="+status+",time="+time+",title="+title+",location="+location+",contents="+contents+"]";
        }
    }
    /**
     * crawl configuration
     * @author zhsoft88
     * @since 2008-4-13
     */
    public static class SeaflowerConf {
        private String url;
        private String exec;
        private int waitTime;
        private boolean cont;
        private boolean nodata;

        public SeaflowerConf() {
        }
        public String getUrl() {
            return url;
        }
        public void setUrl(String url) {
            this.url = url;
        }
        public String getExec() {
            return exec;
        }
        public void setExec(String exec) {
            this.exec = exec;
        }
        public int getWaitTime() {
            return waitTime;
        }
        public void setWaitTime(int waitTime) {
            this.waitTime = waitTime;
        }
        public boolean isContinue() {
            return cont;
        }
        public void setContinue(boolean cont) {
            this.cont = cont;
        }
        public boolean isNodata() {
            return nodata;
        }
        public void setNodata(boolean nodata) {
            this.nodata = nodata;
        }
    }
    private Socket socket;

    public Seaflower() throws UnknownHostException, IOException {
        this("localhost");
    }
    public Seaflower(String host) throws UnknownHostException, IOException {
        this(host,PORT);
    }
    public Seaflower(String host,int port) throws UnknownHostException, IOException {
        socket = new Socket(host,port);
    }

    /**
     * Sends one crawl request and reads the response.
     * @param conf crawl configuration (URL or script, wait time, flags)
     * @return status, title, location, contents and elapsed time of the crawl
     * @throws Exception if the socket is closed or the response cannot be read
     */
    public SeaflowerResult crawl(SeaflowerConf conf) throws Exception {
        if (socket==null) {
            throw new Exception("socket closed");
        }
        long t1 = System.currentTimeMillis();
        // Write the request: one command per line, ended by a blank line.
        BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(socket.getOutputStream()));
        if (conf.getUrl()!=null) {
            bw.write("get "+conf.getUrl()+"\r\n");
        }
        if (conf.getExec()!=null) {
            bw.write("exec "+conf.getExec()+"\r\n");
        }
        if (conf.getWaitTime()!=-1) {
            bw.write("wait-time: "+conf.getWaitTime()+"\r\n");
        }
        if (conf.isContinue()) {
            bw.write("continue\r\n");
        }
        if (conf.isNodata()) {
            bw.write("nodata\r\n");
        }
        bw.write("\r\n");
        bw.flush();
        // Read the status line, e.g. "SEAFLOWER/3.8 200 OK".
        BufferedReader br = new BufferedReader(new InputStreamReader(socket.getInputStream(),"utf-8"));
        String line = br.readLine();
        if (line==null) {
            throw new IOException("connection closed before a status line was received");
        }
        int status = -1;
        StringTokenizer st = new StringTokenizer(line," ");
        st.nextToken();
        status = NumberUtils.toInt(st.nextToken());
        // Read the headers up to the blank line, keeping title and location.
        String tagTitle = "Current-Title: ";
        String tagLocation = "Current-Location: ";
        String title = null;
        String location = null;
        while ((line=br.readLine())!=null) {
            if (line.length()==0) break;
            if (line.startsWith(tagTitle)) {
                title = line.substring(tagTitle.length());
            } else if (line.startsWith(tagLocation)) {
                location = line.substring(tagLocation.length());
            }
        }
        // Read the body until end of stream. With CONTINUE set the server keeps
        // the connection open, so this loop may block until the server closes it.
        StringBuilder sb = new StringBuilder(100);
        int c;
        while ((c=br.read())!=-1) {
            sb.append((char)c);
        }
        if (!conf.isContinue()) {
            socket.close();
            socket = null;
        }
        long t2 = System.currentTimeMillis();
        return new SeaflowerResult(status,title,location,sb.toString(),t2-t1);
    }
}
package com.syntimes.commons.tests;

import java.util.List;
import org.apache.commons.lang.StringUtils;
import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Node;
import com.syntimes.commons.Seaflower;
import com.syntimes.commons.Seaflower.SeaflowerConf;
import com.syntimes.commons.Seaflower.SeaflowerResult;

/**
 * Test of Seaflower: crawl fund data from the ourku.com website
 * @author zhsoft88
 * @since 2008-3-29
 */
public class TestSeaflower {
    /**
     * @param args
     */
    public static void main(String[] args) throws Exception {
        Seaflower s = new Seaflower();
        {
            SeaflowerConf conf = new SeaflowerConf();
            conf.setUrl("http://www.ourku.com/index.html");
            SeaflowerResult result = s.crawl(conf);
            // System.out.println(result);
            Document doc = DocumentHelper.parseText(result.getContents());
            // System.out.println("title="+doc.selectSingleNode("/html/head/title").getText());
            List<Node> list = doc.selectNodes("//div[@id='maininfo_all']/table/tbody/tr");
            if (list==null) return;
            System.out.println("total size="+list.size());
            for (Node no : list) {
                // rank
                Node order = no.selectSingleNode("td[1]");
                if (StringUtils.isBlank(order.getStringValue())) continue;
                // date
                Node date = no.selectSingleNode("td[2]");
                if (date==null) continue;
                // fund code
                Node code = no.selectSingleNode("td[3]");
                // fund name
                Node name = no.selectSingleNode("td[4]");
                // net value per unit
                Node netval = no.selectSingleNode("td[5]");
                // cumulative net value
                Node totalval = no.selectSingleNode("td[6]");
                // growth value
                Node growval = no.selectSingleNode("td[7]");
                // growth rate
                Node growrate = no.selectSingleNode("td[8]");
                System.out.println(order.getStringValue()+","+date.getStringValue()+","+code.getStringValue()+","+name.getStringValue()+","
                    +netval.getStringValue()+","+totalval.getStringValue()+","+growval.getStringValue()+","+growrate.getStringValue());
            }
        }
    }
}