About Seaflower

Want to crawl the body text of the following HTML page?
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Hello world</title>
</head>
<body>
<script>document.write('Hello,world!');</script>
</body>
</html>

Unfortunately, a conventional web crawler can do nothing with this page. It cannot execute JavaScript, so it returns no body text.
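As an illustration, here is a minimal sketch of what a text extractor that does not run JavaScript sees in the page above. The class name StaticBodyText and its regex-based approach are assumptions made for this example; they are not part of Seaflower or of any particular crawler.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: body text of raw HTML when no script is executed (hypothetical helper). */
class StaticBodyText {

    // Matches the body element and captures its inner markup.
    private static final Pattern BODY =
            Pattern.compile("(?is)<body\\b[^>]*>(.*?)</body>");
    // Matches whole script elements, including their source text.
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)<script\\b[^>]*>.*?</script>");
    // Matches any remaining tag.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");

    /** Returns the body text of raw HTML without executing any script. */
    static String bodyText(String html) {
        Matcher m = BODY.matcher(html);
        if (!m.find()) {
            return "";
        }
        // Script source is not page text, so drop it, then strip the other tags.
        String body = SCRIPT.matcher(m.group(1)).replaceAll("");
        return TAG.matcher(body).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String html = "<html>\n<head>\n"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n"
                + "<title>Hello world</title>\n</head>\n<body>\n"
                + "<script>document.write('Hello,world!');</script>\n"
                + "</body>\n</html>";
        // The text written by document.write() never appears, because nothing ran the script.
        System.out.println("body text: [" + bodyText(html) + "]");
    }
}
```

The string 'Hello,world!' only exists in the page after a browser has run the script, which is why a browser-based crawler is needed to capture it.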

With Seaflower, you can get it.

What is Seaflower?

Seaflower is the world's first DOM crawler for vertical search. It is based on the Firefox browser and runs on Linux systems. It can crawl the dynamic content of web pages that is generated by JavaScript, and it outputs the DOM data as XML, from which you can extract specific data with XPath.

Conventional web crawlers have several disadvantages:

1. They cannot crawl dynamic content generated by JavaScript

2. Data in malformed web pages is difficult to extract

3. They emulate a browser instead of using a real one

4. The page content they return is incomplete: it does not include the content of FRAME/IFRAME elements

Seaflower, by contrast, has the following advantages:

1. It is based on the Firefox browser

2. Data is returned in XML format

The XML data can be loaded into a DOM (Document Object Model), so you can use XPath to extract content. To get the title of a web page, use /html/head/title. To get all links, use //a/@href, and so on.

3. Web page data is complete and fresh

The page data returned by Seaflower includes dynamic content generated by JavaScript as well as the content of FRAME/IFRAME elements.

4. Multi-threaded, runs in the background

5. Simple crawl protocol

The crawl protocol is HTTP-like. You can get XML results with a simple GET command.

6. Paging driven by JavaScript is supported

Seaflower provides an EXEC command to execute JavaScript on a specific URL. By combining the GET, CONTINUE and NODATA commands, you can fetch page content continuously. Seaflower also provides a getNodeByXPath method for JavaScript. To emulate a click on the first input button, just EXEC getNodeByXPath('/html/body/input[1]').onclick()
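The two XPath expressions mentioned under advantage 2 can be tried with the XPath API that ships with the JDK. The sketch below is only an illustration: the class name XPathExtract is made up, and the sample XML is hand-written for this example rather than captured from a real Seaflower run, so the actual output format may differ.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Sketch: extracting the title and the links from page XML with XPath (hypothetical helper). */
class XPathExtract {

    /** Parses an XML string into a DOM document. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Returns the page title via /html/head/title. */
    static String title(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("/html/head/title", doc);
    }

    /** Returns every link target via //a/@href. */
    static List<String> links(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hand-written stand-in for Seaflower output; not taken from a real crawl.
        String xml = "<html><head><title>Hello world</title></head>"
                + "<body><a href=\"/a.html\">A</a><a href=\"/b.html\">B</a></body></html>";
        Document doc = parse(xml);
        System.out.println("title = " + title(doc));
        System.out.println("links = " + links(doc));
    }
}
```

The example uses only JDK classes to stay dependency-free. The TestSeaflower.java listing later in this document does the same kind of extraction with dom4j.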

Note: Seaflower is not a spider; it is a crawling tool. Try seaspider, the cutting-edge spider system for vertical search.

Download

seaflower-4.1-1.en_US.fc9.i386.rpm (For Fedora Core 9 Linux)
seaflower-4.1-1.en_US.fc8.i386.rpm (For Fedora Core 8 Linux)
seaflower-4.1-1.en_US.el5.i386.rpm (For RedHat EL 5/CentOS Linux)

20080721: Based on Firefox 3.0.1; fixes some bugs.
20080621: Based on Firefox 3.0.
20080609: Based on Firefox 3.0rc2; adds the seaflowerctl command; fixes some bugs.
20080524: Based on Firefox 3.0rc1; uses a thread pool.


Install

rpm -ivh seaflower*.rpm

Seaflower server management

Configuration management tool - seaflowerctl

usage: seaflowerctl command
where command is:
1) list
list current settings
2) set [port|rcj|vxsmin|vxsmax|captureWaitTime] value
set a configuration value
3) proxy [ on <IP> <PORT> | off ]
set or clear the proxy setting
4) help
print this help info
Example 1: seaflowerctl list
List current settings.
Example 2: seaflowerctl set port 4444
Make Seaflower listen on port 4444.
Example 3: seaflowerctl set rcj 4444-5555-6666
Set the registration code to 4444-5555-6666.
Example 4: seaflowerctl set vxsmin 5
Start 5 virtual browsers when Seaflower starts.
Example 5: seaflowerctl set captureWaitTime 2
Set the wait time before crawling to 2 seconds.
Example 6: seaflowerctl proxy on 192.168.28.91 8080
Set the proxy to 192.168.28.91, port 8080.
Example 7: seaflowerctl proxy off
Clear the proxy.
Note: vxsmax is not used.

Command-line crawl tool - crawl

usage: crawl [-h host] [-p port] [-w timeToWait] url
Options:
-h host that Seaflower listens on, default: localhost
-p port that Seaflower listens on, default: 4050
-w wait time before crawling (unit: seconds)
url the URL you want to crawl

Register

Seaflower is shareware with a 30-day free trial. To keep using it after the trial, please register in time.
Contact zhsoft88@gmail.com (Email/MSN). Price: RMB 3000.00.

Seaflower Protocol
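The protocol specification itself is not reproduced in this text. Judging only from the Seaflower.java client listed below, a request is a sequence of command lines (get, exec, wait-time, continue, nodata) ended by a blank line, and a reply starts with a status line whose second token is the status code, followed by "Name: value" lines such as Current-Title and Current-Location, a blank line, and the XML body. The following sketch mirrors that reading. The class name ProtocolSketch, the example URL and the "PROTO" token in the sample status line are placeholders invented for this example, since the real first token is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of the request and reply layout as inferred from Seaflower.java (hypothetical helper). */
class ProtocolSketch {

    /** Builds request text: one command per line, terminated by a blank line. */
    static String buildRequest(String url, String exec, int waitTime,
                               boolean cont, boolean nodata) {
        StringBuilder sb = new StringBuilder();
        if (url != null) sb.append("get ").append(url).append("\r\n");
        if (exec != null) sb.append("exec ").append(exec).append("\r\n");
        // A negative wait time means "not set", so no wait-time line is written.
        if (waitTime >= 0) sb.append("wait-time: ").append(waitTime).append("\r\n");
        if (cont) sb.append("continue\r\n");
        if (nodata) sb.append("nodata\r\n");
        return sb.append("\r\n").toString();
    }

    /** Returns the second whitespace-separated token as the status code, or -1 if absent. */
    static int parseStatus(String statusLine) {
        String[] tokens = statusLine.trim().split("\\s+");
        if (tokens.length < 2) return -1;
        try {
            return Integer.parseInt(tokens[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Collects "Name: value" lines up to the first empty line. */
    static Map<String, String> parseHeaders(String[] lines) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty()) break;
            int sep = line.indexOf(": ");
            if (sep > 0) headers.put(line.substring(0, sep), line.substring(sep + 2));
        }
        return headers;
    }

    public static void main(String[] args) {
        // Placeholder URL; 2 is the wait time in seconds.
        System.out.print(buildRequest("http://www.example.com/", null, 2, false, false));
        // "PROTO" is a made-up first token; only the position of the status code is known.
        System.out.println(parseStatus("PROTO 200"));
        String[] head = {"Current-Title: Hello world",
                         "Current-Location: http://www.example.com/", ""};
        System.out.println(parseHeaders(head));
    }
}
```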

Example code

Seaflower.java

package com.syntimes.commons;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.Socket;
import java.net.UnknownHostException;
import java.util.StringTokenizer;

import org.apache.commons.lang.math.NumberUtils;

/**
 * seaflower crawler
 * @author zhsoft88
 * @since 2008-4-13
 */
public class Seaflower {

	public static final int PORT = 4050;
	
	/**
	 * crawl result
	 * @author zhsoft88
	 * @since 2008-4-13
	 */
	public static class SeaflowerResult {
		private int status;
		private String title;
		private String location;
		private String contents;
		private long time;
		
		public SeaflowerResult(int status,String title,String location,String contents,long time) {
			this.status = status;
			this.title = title;
			this.location = location;
			this.contents = contents;
			this.time = time;
		}
		public long getTime() {
			return time;
		}
		public int getStatus() {
			return status;
		}
		public String getContents() {
			return contents;
		}
		public String getTitle() {
			return title;
		}
		public String getLocation() {
			return location;
		}
		@Override
		public String toString() {
			return "[status="+status+",time="+time+",title="+title+",location="+location+",contents="+contents+"]";
		}
	}
	
	/**
	 * crawl configuration
	 * @author zhsoft88
	 * @since 2008-4-13
	 */
	public static class SeaflowerConf {
		private String url;
		private String exec;
		// -1 means "not set": crawl() then sends no wait-time line
		private int waitTime = -1;
		private boolean cont;
		private boolean nodata;
		
		public SeaflowerConf() {
		}

		public String getUrl() {
			return url;
		}

		public void setUrl(String url) {
			this.url = url;
		}

		public String getExec() {
			return exec;
		}

		public void setExec(String exec) {
			this.exec = exec;
		}

		public int getWaitTime() {
			return waitTime;
		}

		public void setWaitTime(int waitTime) {
			this.waitTime = waitTime;
		}

		public boolean isContinue() {
			return cont;
		}

		public void setContinue(boolean cont) {
			this.cont = cont;
		}

		public boolean isNodata() {
			return nodata;
		}

		public void setNodata(boolean nodata) {
			this.nodata = nodata;
		}
		
	}

	private Socket socket;
	
	public Seaflower() throws UnknownHostException, IOException {
		this("localhost");
	}
	
	public Seaflower(String host) throws UnknownHostException, IOException {
		this(host,PORT);
	}
	
	public Seaflower(String host,int port) throws UnknownHostException, IOException {
		socket = new Socket(host,port);
	}
	
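	/**
	 * Sends one request built from conf and reads the reply.
	 * Unless conf.isContinue() is set, the socket is closed afterwards
	 * and this instance cannot be used for further crawls.
	 * @param conf commands to send (url, exec, wait time, continue, nodata)
	 * @return status, title, location, XML contents and elapsed time in milliseconds
	 * @throws Exception if the socket is already closed or I/O fails
	 */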
	public SeaflowerResult crawl(SeaflowerConf conf) throws Exception {
		if (socket==null) {
			throw new Exception("socket closed");
		}
		long t1 = System.currentTimeMillis();
		// write with the same charset that is used for reading the reply
		BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(socket.getOutputStream(),"utf-8"));
		if (conf.getUrl()!=null) {
			bw.write("get "+conf.getUrl()+"\r\n");
		}
		if (conf.getExec()!=null) {
			bw.write("exec "+conf.getExec()+"\r\n");
		}
		if (conf.getWaitTime()!=-1) {
			bw.write("wait-time: "+conf.getWaitTime()+"\r\n");
		}
		if (conf.isContinue()) {
			bw.write("continue\r\n");
		}
		if (conf.isNodata()) {
			bw.write("nodata\r\n");
		}
		bw.write("\r\n");
		bw.flush();
		BufferedReader br = new BufferedReader(new InputStreamReader(socket.getInputStream(),"utf-8"));
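		// status line: the second space-separated token is the status code
		String line = br.readLine();
		if (line==null) {
			throw new IOException("connection closed before a status line was received");
		}
		int status = -1;
		StringTokenizer st = new StringTokenizer(line," ");
		if (st.countTokens()>=2) {
			st.nextToken();
			status = NumberUtils.toInt(st.nextToken());
		}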
		String tagTitle = "Current-Title: ";
		String tagLocation = "Current-Location: ";
		String title = null;
		String location = null;
		while ((line=br.readLine())!=null) {
			if (line.length()==0) break;
			if (line.startsWith(tagTitle)) {
				title = line.substring(tagTitle.length());
			} else if (line.startsWith(tagLocation)) {
				location = line.substring(tagLocation.length());
			}
		}
		StringBuilder sb = new StringBuilder(100);
		int c;
		while ((c=br.read())!=-1) {
			sb.append((char)c);
		}
		if (!conf.isContinue()) {
			socket.close();
			socket = null;
		}
		long t2 = System.currentTimeMillis();
		return new SeaflowerResult(status,title,location,sb.toString(),t2-t1);
	}
	
}

TestSeaflower.java

Crawl www.ourku.com and extract specific data
package com.syntimes.commons.tests;

import java.util.List;

import org.apache.commons.lang.StringUtils;
import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Node;

import com.syntimes.commons.Seaflower;
import com.syntimes.commons.Seaflower.SeaflowerConf;
import com.syntimes.commons.Seaflower.SeaflowerResult;

/**
 * Test of Seaflower: crawl data from the ourku.com fund website
 * @author zhsoft88
 * @since 2008-3-29
 */
public class TestSeaflower {

	/**
	 * @param args
	 */
	public static void main(String[] args) throws Exception {
		Seaflower s = new Seaflower();
		{
			SeaflowerConf conf = new SeaflowerConf();
			conf.setUrl("http://www.ourku.com/index.html");
			SeaflowerResult result = s.crawl(conf);
//			System.out.println(result);
			Document doc = DocumentHelper.parseText(result.getContents());
//			System.out.println("title="+doc.selectSingleNode("/html/head/title").getText());
			List<Node> list = doc.selectNodes("//div[@id='maininfo_all']/table/tbody/tr");
			if (list==null) return;
			System.out.println("total size="+list.size());
			for (Node no : list) {
				// rank (skip rows without a rank cell, such as header rows)
				Node  order = no.selectSingleNode("td[1]");
				if (order==null || StringUtils.isBlank(order.getStringValue())) continue;
				// date
				Node date = no.selectSingleNode("td[2]");
				if (date==null) continue;
				// fund code
				Node  code = no.selectSingleNode("td[3]");
				// fund name
				Node  name = no.selectSingleNode("td[4]");
				// net value per unit
				Node  netval = no.selectSingleNode("td[5]");
				// cumulative net value
				Node  totalval = no.selectSingleNode("td[6]");
				// growth amount
				Node  growval = no.selectSingleNode("td[7]");
				// growth rate
				Node  growrate = no.selectSingleNode("td[8]");
				System.out.println(order.getStringValue()+","+date.getStringValue()+","+code.getStringValue()+","+name.getStringValue()+","
					+netval.getStringValue()+","+totalval.getStringValue()+","+growval.getStringValue()+","+growrate.getStringValue());
			}
		}
		
	}

}

Crawl data in next web page
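The example for this section is not included in this text. As a rough sketch only, the requests for paging could be composed as below, using the command lines that Seaflower.java writes and the getNodeByXPath call described earlier. The class name NextPageRequests, the example URL and the XPath of the "next" button are assumptions made for illustration. The sketch also assumes that a request carrying "continue" keeps the session open, so that a following EXEC acts on the page that is already loaded. It only builds the request text and performs no network I/O.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: request texts for loading a page and clicking "next" several times (hypothetical helper). */
class NextPageRequests {

    /** Joins command lines with CRLF and ends the request with a blank line. */
    static String request(String... lines) {
        return String.join("\r\n", lines) + "\r\n\r\n";
    }

    /** Returns one GET request followed by one EXEC request per click. */
    static List<String> plan(String url, String nextXPath, int clicks) {
        List<String> requests = new ArrayList<>();
        // Ask to continue only when at least one click will follow.
        requests.add(clicks > 0 ? request("get " + url, "continue")
                                : request("get " + url));
        String click = "exec getNodeByXPath('" + nextXPath + "').onclick()";
        for (int i = 1; i <= clicks; i++) {
            // The final click omits "continue" so that the session can end.
            requests.add(i < clicks ? request(click, "continue") : request(click));
        }
        return requests;
    }

    public static void main(String[] args) {
        // Placeholder URL and XPath: load a page, then click the first input button twice.
        for (String r : plan("http://www.example.com/list.html", "/html/body/input[1]", 2)) {
            System.out.print(r);
        }
    }
}
```

With the Seaflower class above, the same sequence would correspond to one SeaflowerConf using setUrl and setContinue(true), followed by further SeaflowerConf objects that use setExec.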

(C) 2008 ZHUATANG.COM All rights reserved (website registration pending)

2008-07-21