<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
  <channel>
    <title>dcaoyuan</title>
    <description></description>
    <link>http://dcaoyuan.javaeye.com</link>
    <language>UTF-8</language>
    <copyright>Copyright 2003-2008, JavaEye.com</copyright>
    <docs>http://blogs.law.harvard.edu/tech/rss</docs>
    <generator>JavaEye - The best software development community</generator>
      <item>
        <title>New Scala for NetBeans Available for Testing</title>
        <author>dcaoyuan</author>
        <description>
          <![CDATA[
          <br/>
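          Author: <a href="http://dcaoyuan.javaeye.com">dcaoyuan</a>&nbsp;
          Link: <a href="http://dcaoyuan.javaeye.com/blog/184145" style="color:red;">http://dcaoyuan.javaeye.com/blog/184145</a>&nbsp;
          Published: 2008-04-18
          <br/><br/>
          Notice: This is an original blog article published on the JavaEye website. No website may reproduce it without the author's written permission; violators will be held legally liable.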
          <br/><br/>
          The rewritten Scala for NetBeans can now be tested on NetBeans 6.1 RC or the latest nightly build. You can get it from the NetBeans Update Center as follows:<br />Open "Tools"->"Plugins" and check the "Settings" tab to see whether "Last Development Build" is in the Update Centers list. Its URL is: http://deadlock.netbeans.org/hudson/job/javadoc-nbms/lastSuccessfulBuild/artifact/nbbuild/nbms/updates.xml.gz<br /><br />If you are using a Beta/RC/Release build of NetBeans 6.1, you need to add the "Last Development Build" Update Center manually.<br /><br />Supported features:<br /><ul><li>Syntax highlighting</li><li>Auto-indentation</li><li>Brace completion</li><li>Formatter</li><li>Outline navigator</li><li>Mark occurrences for local variables and functions</li><li>Instant rename for local variables and functions</li><li>Go-to-declaration for local variables and functions</li><li>Scala project</li><li>Basic debugger</li></ul><br />Known issues:<br /><ul><li>Auto-completion is not fully supported yet and is not smart</li><li>There is no recovery from parsing errors yet</li><li>Semantic errors are not checked while editing, but will be reported when you build the project</li><li>Because Scala's grammar reference document is inconsistent, there may be some syntax-related breakage</li></ul><br />In addition, the Fortress editing plugin can be obtained and installed the same way, although that plugin is still quite limited.<br /><br />The Erlang plugin can now also be installed on the same NetBeans 6.1 RC or nightly build, so there is no need to download ErlyBird separately. Indexing performance has also improved considerably; on my machine it takes about 5 minutes.<br /><br />The Erlang plugin will also be rewritten in the future.
          <br/>
          <span style="color:red;">
            <a href="http://dcaoyuan.javaeye.com/blog/184145#comments" style="color:red;">The discussion on this post is worth reading too. View the comments >></a>
          </span>
          ]]>
        </description>
        <pubDate>Fri, 18 Apr 2008 16:17:01 +0800</pubDate>
        <link>http://dcaoyuan.javaeye.com/blog/184145</link>
        <guid>http://dcaoyuan.javaeye.com/blog/184145</guid>
      </item>
      <item>
        <title>ErlyBird 0.16.0 Released</title>
        <author>dcaoyuan</author>
        <description>
          <![CDATA[
          <br/>
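          Author: <a href="http://dcaoyuan.javaeye.com">dcaoyuan</a>&nbsp;
          Link: <a href="http://dcaoyuan.javaeye.com/blog/168781" style="color:red;">http://dcaoyuan.javaeye.com/blog/168781</a>&nbsp;
          Published: 2008-03-06
          <br/><br/>
          Notice: This is an original blog article published on the JavaEye website. No website may reproduce it without the author's written permission; violators will be held legally liable.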
          <br/><br/>
          ErlyBird 0.16.0 Released - An Erlang IDE based on NetBeans<br /><br />I'm pleased to announce ErlyBird 0.16.0, an Erlang IDE based on NetBeans. This is an important feature release, about 25M in size. If you have the latest NetBeans nightly build installed, you can also install the ErlyBird modules via the update center.<br /><br />CHANGELOG:<br /><br />    * The project metadata file has changed, please see Notes<br />    * Instant rename (put the caret on a variable or function name, press CTRL+R)<br />    * Go-To-Declaration for macros that are defined in included header files<br />    * Fixed: Go-To-Declaration into a -include_lib header stopped working once that header file had been opened in the editor<br />    * Fixed: syntax broken for packaged import attribute<br />    * Fixed: syntax broken for wild attribute<br />    * Completion suggestions no longer search other projects<br />    * Tracked GSF changes: reindexing performance is much improved, and ErlyBird can now coexist with other GSF-based language support (Ruby, Groovy, etc.)<br /><br />Java JRE 5.0+ is required.<br /><br />To download, please go to: http://sourceforge.net/project/showfiles.php?group_id=192439<br /><br />To install:<br /><br />   1. Unzip erlybird-bin-0.16.0-ide.zip to a folder of your choice.<br />   2. Make sure 'erl.exe' or 'erl' is on your environment path.<br />   3. Windows users: execute 'bin/erlybird.exe'. *nix users: execute 'bin/erlybird'.<br />   4. Check/set your OTP path. From [Tools]->[Options], click on 'Erlang', then the 'Erlang Installation' tab, and fill in the full path of your 'erl.exe' or 'erl' file. For instance: "C:/erl/bin/erl.exe"<br />   5. The default -Xmx option for the JVM is set to 256M. ErlyBird now works well with less memory, such as -Xmx128M. If you want to increase or decrease it, open the config file located at etc/erlybird.conf and set -J-Xmx in 'default_options'.<br /><br />When you run ErlyBird for the first time, the OTP libs will be indexed. 
The indexing time varies from 10 to 30 minutes depending on your computer.<br /><br />Notes:<br /><br />   1. Since the project metadata format has changed, to open a project created by an older ErlyBird you should modify project.xml, which is located in your project folder at nbproject/project.xml. Change the line:<br /><br />      &lt;type>org.netbeans.modules.languges.erlang.project&lt;/type><br /><br />      to:<br /><br />      &lt;type>org.netbeans.modules.erlang.project&lt;/type><br /><br />   2. If you have a previous version of ErlyBird installed, you should delete the old cache files, which are located at:<br />          * *nix: "${HOME}/.erlybird/dev"<br />          * mac os x: "${HOME}/Library/Application Support/erlybird/dev"<br />          * windows: "C:\Documents and Settings\yourusername\.erlybird\dev" or somewhere similar<br /><br />ErlyBird is still in Alpha status; feedback and bug reports are welcome.
          <br/>
          <span style="color:red;">
            <a href="http://dcaoyuan.javaeye.com/blog/168781#comments" style="color:red;">The discussion on this post is worth reading too. View the comments >></a>
          </span>
          ]]>
        </description>
        <pubDate>Thu, 06 Mar 2008 15:24:59 +0800</pubDate>
        <link>http://dcaoyuan.javaeye.com/blog/168781</link>
        <guid>http://dcaoyuan.javaeye.com/blog/168781</guid>
      </item>
      <item>
        <title>Scala Editor for NetBeans</title>
        <author>dcaoyuan</author>
        <description>
          <![CDATA[
          <br/>
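          Author: <a href="http://dcaoyuan.javaeye.com">dcaoyuan</a>&nbsp;
          Link: <a href="http://dcaoyuan.javaeye.com/blog/146082" style="color:red;">http://dcaoyuan.javaeye.com/blog/146082</a>&nbsp;
          Published: 2007-12-05
          <br/><br/>
          Notice: This is an original blog article published on the JavaEye website. No website may reproduce it without the author's written permission; violators will be held legally liable.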
          <br/><br/>
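          I'm planning to write a NetBeans editor module for Scala as well. It isn't finished yet, but it already supports most syntax checking, code folding, highlighting, and the outline view. I'll release it officially after some more polishing.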
          <br/>
          <span style="color:red;">
            <a href="http://dcaoyuan.javaeye.com/blog/146082#comments" style="color:red;">The discussion on this post is worth reading too. View the comments >></a>
          </span>
          ]]>
        </description>
        <pubDate>Wed, 05 Dec 2007 02:34:44 +0800</pubDate>
        <link>http://dcaoyuan.javaeye.com/blog/146082</link>
        <guid>http://dcaoyuan.javaeye.com/blog/146082</guid>
      </item>
      <item>
        <title>Wide Finder - A Summary of the Erlang Implementations</title>
        <author>dcaoyuan</author>
        <description>
          <![CDATA[
          <br/>
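          Author: <a href="http://dcaoyuan.javaeye.com">dcaoyuan</a>&nbsp;
          Link: <a href="http://dcaoyuan.javaeye.com/blog/139921" style="color:red;">http://dcaoyuan.javaeye.com/blog/139921</a>&nbsp;
          Published: 2007-11-12
          <br/><br/>
          Notice: This is an original blog article published on the JavaEye website. No website may reproduce it without the author's written permission; violators will be held legally liable.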
          <br/><br/>
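          Tim's Wide Finder exercise gave multi-core and parallel programming practice a chance to be compared across many languages on one simple problem, so it attracted a lot of participants, and I think it was a worthwhile effort. I have some time today, so here is a short summary.<br /><br />The top three on the leaderboard are currently OCaml+JoCaml, Erlang, and Python. A C/C++ version should in theory do very well, but none has appeared yet, which if anything suggests that even such a simple parallel task is not that smooth to write in C/C++.<br /><br />As for the Erlang implementations, the process was basically a few beginners like me feeling our way forward (Anders and Steve are beginners too), with timely pointers from the experts at a few key points.<br /><br />Looking back, the important turning points were:<br />1. The min_heap_size setting has a major effect on binary performance, because it involves the initial heap allocation and GC performance. Fortunately it is easy to probe for a suitable value: just add +h Size on the command line.<br />2. When traversing a binary, move only the pointer where possible and avoid extracting sub-binaries unless necessary. This is also critical for binary performance, because each extracted binary has to be allocated on the global heap and later reclaimed by the GC, which costs time.<br /><br />Once these were solved, the Erlangers (including Anders, Pichi, and me) no longer had to worry about binary performance and could already beat Ruby easily. To match the fastest Python implementation (wf-6.py), though, one more problem remained: the search algorithm. Python's fixed-string search is a Boyer-Moore algorithm that Fredrik Lundh wrote in 2006 and that was added to Python 2.5, so he naturally used it in his Wide Finder. Steve noticed this and implemented something similar in Erlang, which made the comparison between Erlang and Python fair.<br /><br />As a result, Anders's wfinder8 runs in 4.42 seconds on the T5120, on par with the 4.38 seconds of Fredrik Lundh's wf-6.<br /><br />However, Fredrik Lundh's wf-6.py and Fernandez's JoCaml version both use operating system processes for parallelism, whereas Erlang schedules its own lightweight processes on operating system threads. So Anders wrote another version, wfinder7_2, which simply runs n nodes, each an operating system process, and merges the results through inter-node communication. This implementation runs in 3.54 seconds, beating Python and trailing only JoCaml's 1.76 seconds.<br /><br />It is also worth mentioning that both the JoCaml and Python versions use memory mapping, which amounts to reading and mapping the entire log file into memory. Tim's T5120 has 8 GB of memory, so this does not cause frequent swapping, but I personally don't like this approach.<br /><br />I prefer code that is concise and straightforward, so my implementation aims for a balance between brevity, readability, and performance. The final version is <a href="http://blogtrader.net/page/dcaoyuan/entry/tim_bray_s_erlang_exercise2" target="_blank">tbray9a.erl</a>, about 115 lines, with a best time of 5.26 seconds.<br /><br />Compared with Python and JoCaml, what I like about Erlang is that its support for parallelism and concurrency is consistent in both syntax and mechanism, and its performance holds up as well. Python uses pipes for inter-process communication, and JoCaml does much the same, which is fine. But in a concurrent scenario Python and JoCaml become awkward: should you use operating system processes or threads? Neither is a good option. In Erlang, concurrency and parallelism are handled by the same mechanism, whether at the level of its own lightweight processes or at the node level, and there is one more level besides: it is distributed as well.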
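          <br /><br />To make point 2 above a little more concrete, here is a minimal sketch of walking a binary by advancing an integer offset instead of binding a new sub-binary at every step. The module and function names (offset_scan, count_newlines/1) are hypothetical and are not taken from any of the Wide Finder entries; the sketch only counts newlines to keep the idea visible.<br /><br />

```erlang
-module(offset_scan).
-export([count_newlines/1]).

%% Hypothetical illustration, not code from any Wide Finder entry.
%% The binary itself is passed along unchanged; only the integer
%% Offset moves, and the parts around the matched byte are ignored
%% with "_" so no sub-binary is bound to a variable.
count_newlines(Bin) when is_binary(Bin) ->
    count_newlines(Bin, 0, 0).

count_newlines(Bin, Offset, Count) when Offset < byte_size(Bin) ->
    case Bin of
        <<_:Offset/binary, $\n, _/binary>> ->
            count_newlines(Bin, Offset + 1, Count + 1);
        _ ->
            count_newlines(Bin, Offset + 1, Count)
    end;
count_newlines(_Bin, _Offset, Count) ->
    Count.
```

          <br />For example, offset_scan:count_newlines(&lt;&lt;"a\nb\nc"&gt;&gt;) returns 2. How much this style gains over binding a Rest binary at each step depends on the Erlang release, so it is worth measuring rather than assuming.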
          <br/>
          <span style="color:red;">
            <a href="http://dcaoyuan.javaeye.com/blog/139921#comments" style="color:red;">The discussion on this post is worth reading too. View the comments >></a>
          </span>
          ]]>
        </description>
        <pubDate>Mon, 12 Nov 2007 13:50:55 +0800</pubDate>
        <link>http://dcaoyuan.javaeye.com/blog/139921</link>
        <guid>http://dcaoyuan.javaeye.com/blog/139921</guid>
      </item>
      <item>
        <title>Prediction 4 Months Ago and Actual Trends Today, by Neural Network</title>
        <author>dcaoyuan</author>
        <description>
          <![CDATA[
          <br/>
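          Author: <a href="http://dcaoyuan.javaeye.com">dcaoyuan</a>&nbsp;
          Link: <a href="http://dcaoyuan.javaeye.com/blog/133014" style="color:red;">http://dcaoyuan.javaeye.com/blog/133014</a>&nbsp;
          Published: 2007-10-17
          <br/><br/>
          Notice: This is an original blog article published on the JavaEye website. No website may reproduce it without the author's written permission; violators will be held legally liable.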
          <br/><br/>
          <p>Today the Shanghai Security Index (<a href="http://finance.yahoo.com/q?s=000001.ss">000001.SS</a>) touched 2100, about 4 months after my previous neural network <a href="http://blogtrader.net/page/dcaoyuan?entry=is_neural_network_useful_to">research on 000001.SS</a>. In that <a href="http://blogtrader.net/page/dcaoyuan?entry=is_neural_network_useful_to">blog post</a> I included a prediction picture; here is a verification picture comparing the actual price trend with the prediction:
</p><p>
The <b>Blue</b> line is the prediction calculated by the neural network.<br />
The <b>Red/Green</b> line is the actual price trend.
</p><p> 
Click on the picture to enlarge it
</p><p>
<a href="/resources/dcaoyuan/nn_000001_SS_060717_real.png"><img src="/resources/dcaoyuan/nn_000001_SS_060717_real.png" height="450" alt="nn" width="600" /></a>
</p>
          <br/>
          <span style="color:red;">
            <a href="http://dcaoyuan.javaeye.com/blog/133014#comments" style="color:red;">The discussion on this post is worth reading too. View the comments >></a>
          </span>
          ]]>
        </description>
        <pubDate>Wed, 17 Oct 2007 22:01:48 +0800</pubDate>
        <link>http://dcaoyuan.javaeye.com/blog/133014</link>
        <guid>http://dcaoyuan.javaeye.com/blog/133014</guid>
      </item>
      <item>
        <title>Learning Coding Parallelization (Was Tim's Erlang Exercise - Round V)</title>
        <author>dcaoyuan</author>
        <description>
          <![CDATA[
          <br/>
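          Author: <a href="http://dcaoyuan.javaeye.com">dcaoyuan</a>&nbsp;
          Link: <a href="http://dcaoyuan.javaeye.com/blog/133015" style="color:red;">http://dcaoyuan.javaeye.com/blog/133015</a>&nbsp;
          Published: 2007-10-17
          <br/><br/>
          Notice: This is an original blog article published on the JavaEye website. No website may reproduce it without the author's written permission; violators will be held legally liable.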
          <br/><br/>
          <p>
<b>Updated Oct 16:</b> After testing my code on different machines, I found that disk I/O performance varies. For some very large files, reading the file in parallel may take longer (typically on non-server machines, which are not equipped with fast disk I/O). So I added another version, <b>tbray4b.erl</b>, in which only the file reading is not parallelized; all other code is the same. If you'd like to test on your machine, please try both.
</p><p>
Well, I think I've learned a lot from doing Tim's exercise, not only about List vs Binary in Erlang, but also about computing in parallel.
Coding <b>Concurrency</b> is fairly easy in Erlang, but coding <b>Parallelization</b> is not only about the language; it is a real problem in itself.</p><p>
I wrote tbray3.erl in <a href="http://blogtrader.net/page/dcaoyuan?entry=the_erlang_way_was_tim">The Erlang Way (Was Tim Bray's Erlang Exercise - Round IV)</a> and got a fairly good result so far on my 2-core MacBook. But things are always a bit complex. As <a href="http://steve.vinoski.net/blog/2007/10/14/one-more-erlang-wide-finder/">Steve</a> pointed out in the comments, when he tried tbray3.erl on his 8-core Linux box:
</p><blockquote>"I ran it in a loop 10 times, and the best time I saw was 13.872 sec, and user/CPU time was only 16.150 sec, so it’s apparently not using the multiple cores very well."
</blockquote>
<p>
I also encountered this issue on my 4-CPU Intel Xeon 2.80GHz Debian box: it runs even worse (8.420s) than my 2-core MacBook (4.483s).
</p><p>
I thought about my code for a while and found that it seems to spawn too many processes for scan_chunk. Since scan_chunk's performance has improved a lot, each process finishes its task very quickly, too quickly relative to the file reading, so the additional CPUs don't get much chance to take part. The 'reading'-'spawning scan process' cycle is now almost sequential, with very few scanning processes alive at the same time. I think I have finally hit the file-reading bound.
</p><p>But wait: as I claimed before, reading a file into memory is very fast in Erlang; for a 200M log file it takes less than 800ms. The elapsed time for tbray3.erl is about 4900ms, far from 800ms, so why do I say file reading is the bound now?
</p><p>The problem is this: since I was suspicious of the performance of traversing a binary byte by byte, I chose to convert the binary to a list to scan the words. Per my testing results, a list is better than a binary when it is not too long, in many cases no longer than several KBytes. And, to keep the code clear and readable, I also chose to split the big binary while reading the file, so I had to read the file in pieces of no more than n KBytes. For a very big file, the reading procedure is broken into tens of thousands of steps, which makes the total file reading time rather long. That's bad.
</p><p>
So I decided to write another version, which reads the file in parallel (<a href="http://blogtrader.net/page/dcaoyuan?entry=reading_file_in_parallel_in">Round III</a>), splits each chunk on the last newline (<a href="http://blogtrader.net/page/dcaoyuan">Round II</a>), and scans the words using pattern matching (<a href="http://blogtrader.net/page/dcaoyuan?entry=the_erlang_way_was_tim">Round IV</a>). And yes, I'll use binary instead of list this time, and try to offset the worse performance of binary traversal through parallelism on multiple cores.
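</p><p>
As a rough illustration of how a parallel read like this can be divided up, the sketch below computes the offset and length that each reader process would pass to file:pread/3. The module and function names (chunk_plan, ranges/2) are hypothetical and exist only for this example; the arithmetic follows the tbray4.erl listing later in this post, where the last reader asks for twice the chunk size so that the remainder left by integer division is not lost.
</p>

```erlang
-module(chunk_plan).
-export([ranges/2]).

%% Hypothetical helper for illustration only.
%% Returns one {Index, Offset, Length} tuple per reader process.
ranges(Size, N) when is_integer(Size), Size >= 0, is_integer(N), N > 0 ->
    ChunkSize = Size div N,
    [{I, ChunkSize * I, length_for(I, N, ChunkSize)}
     || I <- lists:seq(0, N - 1)].

%% The last reader requests 2 * ChunkSize bytes; a positional read
%% that runs past the end of the file just returns the bytes that exist.
length_for(I, N, ChunkSize) when I =:= N - 1 -> ChunkSize * 2;
length_for(_I, _N, ChunkSize) -> ChunkSize.
```

<p>
For a 10-byte file and 3 readers, chunk_plan:ranges(10, 3) returns [{0,0,3},{1,3,3},{2,6,6}]. Doubling the last length covers the remainder only while the chunk size is at least as large as that remainder, which holds comfortably for large log files but not for tiny inputs.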
</p><p>
The result is interesting: it's the first time I achieved around 10 sec on my 2-core MacBook when using binary matching only, and it's also the first time that my dummy 4-CPU Intel Xeon 2.80GHz Debian box got a better result than my MacBook.
</p><p>
(<b>Updated Oct 15</b>: Steve ran the code on his 8-core 2.33 GHz Intel Xeon Linux box, and the best time was 4.920 sec, which is almost exactly a 100% speedup over my 4-core box (although they are two different machines, so the results cannot be compared linearly):
</p><blockquote>
"the best time I saw for your newest version was 4.920 sec on my 8-core Linux box. Fast! However, user time was only 14.751 sec, so I’m not sure it’s using all the cores that well. Perhaps you’re getting down to where I/O is becoming a more significant factor."
</blockquote>
<p>Please see Steve's <a href="http://steve.vinoski.net/blog/2007/10/14/one-more-erlang-wide-finder/">One More Erlang Wide Finder</a> and his widefinder attempts.)
</p><p>
Result on 2.0GHz 2-core MacBook:
</p><p>
</p><pre>
$ time erl -smp -noshell -run tbray4_bin start o1000k.ap 4 -s erlang halt
8900    : 2006/09/29/Dynamic-IDE
2000    : 2006/07/28/Open-Data
1300    : 2003/07/25/NotGaming
800     : 2003/10/16/Debbie
800     : 2003/09/18/NXML
800     : 2006/01/31/Data-Protection
700     : 2003/06/23/SamsPie
600     : 2006/09/11/Making-Markup
600     : 2003/02/04/Construction
600     : 2005/11/03/Cars-and-Office-Suites
Time:   10375.53 ms

real    0m10.788s
user    0m11.216s
sys     0m3.851s
</pre>
<p>
Result on the 4-CPU Intel Xeon 2.80GHz Debian box:
</p><p>
</p><pre>
# When process number is set to 20:
$ time erl -smp -noshell -run tbray4_bin start o1000k.ap 20 -s erlang halt

real    0m9.894s
user    0m20.521s
sys     0m1.668s

# When process number is set to 1:
$ time erl -smp -noshell -run tbray4_bin start o1000k.ap 1 -s erlang halt

real    0m28.193s
user    0m27.218s
sys     0m0.984s

# On a 940M 5 million lines log file:
$ time erl -smp -noshell -run tbray4_bin start o5000k.ap 400 -s erlang halt
44500   : 2006/09/29/Dynamic-IDE
10000   : 2006/07/28/Open-Data
6500    : 2003/07/25/NotGaming
4000    : 2003/10/16/Debbie
4000    : 2003/09/18/NXML
4000    : 2006/01/31/Data-Protection
3500    : 2003/06/23/SamsPie
3000    : 2006/09/11/Making-Markup
3000    : 2003/02/04/Construction
3000    : 2005/11/03/Cars-and-Office-Suites
Time:   66456.95 ms

real    1m6.767s
user    2m7.512s
sys     0m8.489s
</pre>
<p>
</p><p>On the 4-CPU Linux box, comparing ProcNum = 20 with ProcNum = 1, the elapsed time of the parallelized run was only 35% of the un-parallelized one, a speedup of about 185%. The ratio was almost the same as in my pread_file.erl testing on the same machine.
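</p><p>
The percentages above come straight from the two "real" times in the output (28.193 s with one process, 9.894 s with twenty). A small sketch of that arithmetic follows; the module and function names (speedup, summary/2) are hypothetical and are not part of tbray4.erl.
</p>

```erlang
-module(speedup).
-export([summary/2]).

%% Hypothetical helper for the arithmetic only.
%% Serial and Parallel are elapsed times in the same unit.
%% Returns {Ratio, Factor, GainPercent}.
summary(Serial, Parallel) when Serial > 0, Parallel > 0 ->
    Ratio = Parallel / Serial,     % share of the serial time still needed
    Factor = Serial / Parallel,    % how many times faster the parallel run is
    {Ratio, Factor, (Factor - 1) * 100}.
```

<p>
Calling speedup:summary(28.193, 9.894) gives a ratio of roughly 0.35, a factor of roughly 2.85, and a gain of roughly 185%, which matches the figures quoted above.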
</p><p>
It's actually a combination of the code from my four previous blog posts. Although the performance is not as good as tbray3.erl on my MacBook, I'm happy that this version is fully parallelized, from reading the file to scanning the words, so it should scale better than all my previous versions.
</p><p>
The code: tbray4.erl
</p><p>
</p><pre>
<span class="function-name">-module</span>(tbray4).

<span class="function-name">-compile</span>([native]).

<span class="function-name">-export</span>([start/1,
         start/2]).

<span class="function-name">-include_lib</span>(<span class="string">"kernel/include/file.hrl"</span>).

<span class="function-name">start</span>([<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>]) <span class="keyword">when</span> <span class="constant">is_list</span>(<span class="variable-name">ProcNum</span>) -&gt;<span class="function-name"> </span>
    start(<span class="variable-name">FileName</span>, <span class="keyword">list_to_integer</span>(<span class="variable-name">ProcNum</span>)).
<span class="function-name">start</span>(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>) -&gt;
    <span class="variable-name">Start</span> = now(),

    <span class="variable-name">Main</span> = <span class="keyword">self</span>(),
    <span class="variable-name">Counter</span> = <span class="keyword">spawn</span>(<span class="keyword">fun</span> () -&gt;<span class="function-name"> </span>count_loop(<span class="variable-name">Main</span>) <span class="keyword">end</span>),
    <span class="variable-name">Collector</span> = <span class="keyword">spawn</span>(<span class="keyword">fun</span> () -&gt;<span class="function-name"> </span>collect_loop(<span class="variable-name">Counter</span>) <span class="keyword">end</span>),

    pread_file(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>, <span class="variable-name">Collector</span>),

    <span class="comment-delimiter">%% </span><span class="comment">don't terminate, wait here, until all tasks done.
</span>    <span class="keyword">receive</span>
        stop -&gt;<span class="function-name"> </span>io:format(<span class="string">"Time: ~10.2f ms~n"</span>, [timer:now_diff(now(), <span class="variable-name">Start</span>) / 1000])       
    <span class="keyword">end</span>.

<span class="function-name">pread_file</span>(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>, <span class="variable-name">Collector</span>) -&gt;
    <span class="variable-name">ChunkSize</span> = get_chunk_size(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>),
    pread_file_1(<span class="variable-name">FileName</span>, <span class="variable-name">ChunkSize</span>, <span class="variable-name">ProcNum</span>, <span class="variable-name">Collector</span>).       
<span class="function-name">pread_file_1</span>(<span class="variable-name">FileName</span>, <span class="variable-name">ChunkSize</span>, <span class="variable-name">ProcNum</span>, <span class="variable-name">Collector</span>) -&gt;
    [<span class="keyword">spawn</span>(<span class="keyword">fun</span> () -&gt;
                   <span class="variable-name">Length</span> = <span class="keyword">if</span>  <span class="variable-name">I</span> == <span class="variable-name">ProcNum</span> - 1 -&gt;<span class="function-name"> </span><span class="variable-name">ChunkSize</span> * 2; <span class="comment-delimiter">%% </span><span class="comment">last chunk
</span>                                true -&gt;<span class="function-name"> </span><span class="variable-name">ChunkSize</span> <span class="keyword">end</span>,
                   {ok, <span class="variable-name">File</span>} = file:open(<span class="variable-name">FileName</span>, [read, binary]),
                   {ok, <span class="variable-name">Bin</span>} = file:pread(<span class="variable-name">File</span>, <span class="variable-name">ChunkSize</span> * <span class="variable-name">I</span>, <span class="variable-name">Length</span>),
                   {<span class="variable-name">Data</span>, <span class="variable-name">Tail</span>} = split_on_last_newline(<span class="variable-name">Bin</span>),
                   <span class="variable-name">Collector</span> ! {seq, <span class="variable-name">I</span>, <span class="variable-name">Data</span>, <span class="variable-name">Tail</span>},
                   file:close(<span class="variable-name">File</span>)
           <span class="keyword">end</span>) || <span class="variable-name">I</span> &lt;- lists:seq(0, <span class="variable-name">ProcNum</span> - 1)],
    <span class="variable-name">Collector</span> ! {chunk_num, <span class="variable-name">ProcNum</span>}.

<span class="function-name">collect_loop</span>(<span class="variable-name">Counter</span>) -&gt;<span class="function-name"> </span>collect_loop_1([], &lt;&lt;&gt;&gt;, -1, <span class="variable-name">Counter</span>).
<span class="function-name">collect_loop_1</span>(<span class="variable-name">Chunks</span>, <span class="variable-name">PrevTail</span>, <span class="variable-name">LastSeq</span>, <span class="variable-name">Counter</span>) -&gt;
    <span class="keyword">receive</span>
        {chunk_num, <span class="variable-name">ChunkNum</span>} -&gt;
            <span class="variable-name">Counter</span> ! {chunk_num, <span class="variable-name">ChunkNum</span>},
            collect_loop_1(<span class="variable-name">Chunks</span>, <span class="variable-name">PrevTail</span>, <span class="variable-name">LastSeq</span>, <span class="variable-name">Counter</span>);
        {seq, <span class="variable-name">I</span>, <span class="variable-name">Data</span>, <span class="variable-name">Tail</span>} -&gt;
            <span class="variable-name">SortedChunks</span> = lists:keysort(1, [{<span class="variable-name">I</span>, <span class="variable-name">Data</span>, <span class="variable-name">Tail</span>} | <span class="variable-name">Chunks</span>]),
            {<span class="variable-name">Chunks1</span>, <span class="variable-name">PrevTail1</span>, <span class="variable-name">LastSeq1</span>} = 
                process_chunks(<span class="variable-name">SortedChunks</span>, [], <span class="variable-name">PrevTail</span>, <span class="variable-name">LastSeq</span>, <span class="variable-name">Counter</span>),
            collect_loop_1(<span class="variable-name">Chunks1</span>, <span class="variable-name">PrevTail1</span>, <span class="variable-name">LastSeq1</span>, <span class="variable-name">Counter</span>)
    <span class="keyword">end</span>.
    
<span class="function-name">count_loop</span>(<span class="variable-name">Main</span>) -&gt;<span class="function-name"> </span>count_loop_1(<span class="variable-name">Main</span>, dict:new(), undefined, 0).
<span class="function-name">count_loop_1</span>(<span class="variable-name">Main</span>, <span class="variable-name">Dict</span>, <span class="variable-name">ChunkNum</span>, <span class="variable-name">ChunkNum</span>) -&gt;
    print_result(<span class="variable-name">Dict</span>),
    <span class="variable-name">Main</span> ! stop;
<span class="function-name">count_loop_1</span>(<span class="variable-name">Main</span>, <span class="variable-name">Dict</span>, <span class="variable-name">ChunkNum</span>, <span class="variable-name">ProcessedNum</span>) -&gt;
    <span class="keyword">receive</span>
        {chunk_num, <span class="variable-name">ChunkNumX</span>} -&gt;<span class="function-name"> </span>
            count_loop_1(<span class="variable-name">Main</span>, <span class="variable-name">Dict</span>, <span class="variable-name">ChunkNumX</span>, <span class="variable-name">ProcessedNum</span>);
        {dict, <span class="variable-name">DictX</span>} -&gt;
            <span class="variable-name">Dict1</span> = dict:merge(<span class="keyword">fun</span> (<span class="variable-name">_</span>, <span class="variable-name">V1</span>, <span class="variable-name">V2</span>) -&gt;<span class="function-name"> </span><span class="variable-name">V1</span> + <span class="variable-name">V2</span> <span class="keyword">end</span>, <span class="variable-name">Dict</span>, <span class="variable-name">DictX</span>),
            count_loop_1(<span class="variable-name">Main</span>, <span class="variable-name">Dict1</span>, <span class="variable-name">ChunkNum</span>, <span class="variable-name">ProcessedNum</span> + 1)
    <span class="keyword">end</span>.

<span class="function-name">process_chunks</span>([], <span class="variable-name">ChunkBuf</span>, <span class="variable-name">PrevTail</span>, <span class="variable-name">LastSeq</span>, <span class="variable-name">_</span>) -&gt;<span class="function-name"> </span>{<span class="variable-name">ChunkBuf</span>, <span class="variable-name">PrevTail</span>, <span class="variable-name">LastSeq</span>};
<span class="function-name">process_chunks</span>([{<span class="variable-name">I</span>, <span class="variable-name">Data</span>, <span class="variable-name">Tail</span>}=<span class="variable-name">Chunk</span>|<span class="variable-name">T</span>], <span class="variable-name">ChunkBuf</span>, <span class="variable-name">PrevTail</span>, <span class="variable-name">LastSeq</span>, <span class="variable-name">Counter</span>) -&gt;
    <span class="keyword">case</span> <span class="variable-name">LastSeq</span> + 1 <span class="keyword">of</span>
        <span class="variable-name">I</span> -&gt;
            <span class="keyword">spawn</span>(<span class="keyword">fun</span> () -&gt;<span class="function-name"> </span><span class="variable-name">Counter</span> ! {dict, scan_chunk(&lt;&lt;<span class="variable-name">PrevTail</span>/binary, <span class="variable-name">Data</span>/binary&gt;&gt;)} <span class="keyword">end</span>),
            process_chunks(<span class="variable-name">T</span>, <span class="variable-name">ChunkBuf</span>, <span class="variable-name">Tail</span>, <span class="variable-name">I</span>, <span class="variable-name">Counter</span>);
        <span class="variable-name">_</span> -&gt;
            process_chunks(<span class="variable-name">T</span>, [<span class="variable-name">Chunk</span> | <span class="variable-name">ChunkBuf</span>], <span class="variable-name">PrevTail</span>, <span class="variable-name">LastSeq</span>, <span class="variable-name">Counter</span>)
    <span class="keyword">end</span>.

<span class="function-name">print_result</span>(<span class="variable-name">Dict</span>) -&gt;
    <span class="variable-name">SortedList</span> = lists:reverse(lists:keysort(2, dict:to_list(<span class="variable-name">Dict</span>))),
    [io:format(<span class="string">"~b\t: ~s~n"</span>, [<span class="variable-name">V</span>, <span class="variable-name">K</span>]) || {<span class="variable-name">K</span>, <span class="variable-name">V</span>} &lt;- lists:sublist(<span class="variable-name">SortedList</span>, 10)].

<span class="function-name">get_chunk_size</span>(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>) -&gt;
    {ok, #<span class="type">file_info</span>{size=<span class="variable-name">Size</span>}} = file:read_file_info(<span class="variable-name">FileName</span>),
    <span class="variable-name">Size</span> div <span class="variable-name">ProcNum</span>.

<span class="function-name">split_on_last_newline</span>(<span class="variable-name">Bin</span>) -&gt;<span class="function-name"> </span>split_on_last_newline_1(<span class="variable-name">Bin</span>, <span class="keyword">size</span>(<span class="variable-name">Bin</span>)).   
<span class="function-name">split_on_last_newline_1</span>(<span class="variable-name">Bin</span>, <span class="variable-name">Offset</span>) <span class="keyword">when</span> <span class="variable-name">Offset</span> &gt; 0 -&gt;
    <span class="keyword">case</span> <span class="variable-name">Bin</span> <span class="keyword">of</span>
        &lt;&lt;<span class="variable-name">Data</span>:<span class="variable-name">Offset</span>/binary,<span class="string">$\n</span>,<span class="variable-name">Tail</span>/binary&gt;&gt; -&gt;
            {<span class="variable-name">Data</span>, <span class="variable-name">Tail</span>};
        <span class="variable-name">_</span> -&gt;<span class="function-name"> </span>
            split_on_last_newline_1(<span class="variable-name">Bin</span>, <span class="variable-name">Offset</span> - 1)
    <span class="keyword">end</span>;
<span class="function-name">split_on_last_newline_1</span>(<span class="variable-name">Bin</span>, <span class="variable-name">_</span>) -&gt;<span class="function-name"> </span>{<span class="variable-name">Bin</span>, &lt;&lt;&gt;&gt;}.
    
<span class="function-name">scan_chunk</span>(<span class="variable-name">Bin</span>) -&gt;<span class="function-name"> </span>scan_chunk_1(<span class="variable-name">Bin</span>, 0, dict:new()).    
<span class="function-name">scan_chunk_1</span>(<span class="variable-name">Bin</span>, <span class="variable-name">Offset</span>, <span class="variable-name">Dict</span>) <span class="keyword">when</span> <span class="variable-name">Offset</span> =&lt; <span class="keyword">size</span>(<span class="variable-name">Bin</span>) - 34 -&gt;
    <span class="keyword">case</span> <span class="variable-name">Bin</span> <span class="keyword">of</span>
        &lt;&lt;<span class="variable-name">_</span>:<span class="variable-name">Offset</span>/binary,<span class="string">"GET /ongoing/When/"</span>,<span class="variable-name">_</span>,<span class="variable-name">_</span>,<span class="variable-name">_</span>,<span class="string">$x</span>,<span class="string">$/</span>,<span class="variable-name">Y1</span>,<span class="variable-name">Y2</span>,<span class="variable-name">Y3</span>,<span class="variable-name">Y4</span>,<span class="string">$/</span>,<span class="variable-name">M1</span>,<span class="variable-name">M2</span>,<span class="string">$/</span>,<span class="variable-name">D1</span>,<span class="variable-name">D2</span>,<span class="string">$/</span>,<span class="variable-name">Rest</span>/binary&gt;&gt; -&gt;<span class="function-name"> </span>           
            <span class="keyword">case</span> match_until_space_newline(<span class="variable-name">Rest</span>, 0) <span class="keyword">of</span>
                {<span class="variable-name">Rest1</span>, &lt;&lt;&gt;&gt;} -&gt;<span class="function-name"> </span>
                    scan_chunk_1(<span class="variable-name">Rest1</span>, 0, <span class="variable-name">Dict</span>);
                {<span class="variable-name">Rest1</span>, <span class="variable-name">Word</span>} -&gt;<span class="function-name"> </span>
                    <span class="variable-name">Key</span> = &lt;&lt;<span class="variable-name">Y1</span>,<span class="variable-name">Y2</span>,<span class="variable-name">Y3</span>,<span class="variable-name">Y4</span>,<span class="string">$/</span>,<span class="variable-name">M1</span>,<span class="variable-name">M2</span>,<span class="string">$/</span>,<span class="variable-name">D1</span>,<span class="variable-name">D2</span>,<span class="string">$/</span>, <span class="variable-name">Word</span>/binary&gt;&gt;,
                    scan_chunk_1(<span class="variable-name">Rest1</span>, 0, dict:update_counter(<span class="variable-name">Key</span>, 1, <span class="variable-name">Dict</span>))
            <span class="keyword">end</span>;
        <span class="variable-name">_</span> -&gt;<span class="function-name"> </span>scan_chunk_1(<span class="variable-name">Bin</span>, <span class="variable-name">Offset</span> + 1, <span class="variable-name">Dict</span>)
    <span class="keyword">end</span>;
<span class="function-name">scan_chunk_1</span>(<span class="variable-name">_</span>, <span class="variable-name">_</span>, <span class="variable-name">Dict</span>) -&gt;<span class="function-name"> </span><span class="variable-name">Dict</span>.

<span class="function-name">match_until_space_newline</span>(<span class="variable-name">Bin</span>, <span class="variable-name">Offset</span>) <span class="keyword">when</span> <span class="variable-name">Offset</span> &lt; <span class="keyword">size</span>(<span class="variable-name">Bin</span>) -&gt;
    <span class="keyword">case</span> <span class="variable-name">Bin</span> <span class="keyword">of</span>
        &lt;&lt;<span class="variable-name">Word</span>:<span class="variable-name">Offset</span>/binary,<span class="string">$ </span>,<span class="variable-name">Rest</span>/binary&gt;&gt; -&gt;
            {<span class="variable-name">Rest</span>, <span class="variable-name">Word</span>};
        &lt;&lt;<span class="variable-name">_</span>:<span class="variable-name">Offset</span>/binary,<span class="string">$.</span>,<span class="variable-name">Rest</span>/binary&gt;&gt; -&gt;
            {<span class="variable-name">Rest</span>, &lt;&lt;&gt;&gt;};
        &lt;&lt;<span class="variable-name">_</span>:<span class="variable-name">Offset</span>/binary,10,<span class="variable-name">Rest</span>/binary&gt;&gt; -&gt;
            {<span class="variable-name">Rest</span>, &lt;&lt;&gt;&gt;};
        <span class="variable-name">_</span> -&gt;
            match_until_space_newline(<span class="variable-name">Bin</span>, <span class="variable-name">Offset</span> + 1)
    <span class="keyword">end</span>;
<span class="function-name">match_until_space_newline</span>(<span class="variable-name">_</span>, <span class="variable-name">_</span>) -&gt;<span class="function-name"> </span>{&lt;&lt;&gt;&gt;, &lt;&lt;&gt;&gt;}.

</pre>
<p>
=====&gt;
<b>Updated Oct 16:</b> After testing my code on different machines, I found that disk I/O performance varies a lot. For some very large files, reading the file in parallel may take longer (typically on non-server machines, which are not equipped for fast disk I/O). So I wrote another version, tbray4b.erl, in which only the file reading is not parallelized; all other code is the same. Here are the results for this version on a 940M file with 5 million lines, with ProcNum set to 200 and 400:
</p><p>
</p><pre>
# On 2-core MacBook:
$ time erl -smp -noshell -run tbray4b start o5000k.ap 200 -s erlang halt

real    0m50.498s
user    0m49.746s
sys     0m11.979s

# On 4-cpu linux box:
$ time erl -smp -noshell -run tbray4b start o5000k.ap 400 -s erlang halt

real    1m2.136s
user    1m59.907s
sys     0m7.960s
</pre>
<p>
The code: tbray4b.erl
</p><pre>
<span class="function-name">-module</span>(tbray4b).

<span class="function-name">-compile</span>([native]).

<span class="function-name">-export</span>([start/1,
         start/2]).

<span class="function-name">-include_lib</span>(<span class="string">"kernel/include/file.hrl"</span>).

<span class="function-name">start</span>([<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>]) <span class="keyword">when</span> <span class="constant">is_list</span>(<span class="variable-name">ProcNum</span>) -&gt;<span class="function-name"> </span>
    start(<span class="variable-name">FileName</span>, <span class="keyword">list_to_integer</span>(<span class="variable-name">ProcNum</span>)).
<span class="function-name">start</span>(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>) -&gt;
    <span class="variable-name">Start</span> = now(),

    <span class="variable-name">Main</span> = <span class="keyword">self</span>(),
    <span class="variable-name">Counter</span> = <span class="keyword">spawn</span>(<span class="keyword">fun</span> () -&gt;<span class="function-name"> </span>count_loop(<span class="variable-name">Main</span>) <span class="keyword">end</span>),
    <span class="variable-name">Collector</span> = <span class="keyword">spawn</span>(<span class="keyword">fun</span> () -&gt;<span class="function-name"> </span>collect_loop(<span class="variable-name">Counter</span>) <span class="keyword">end</span>),

    read_file(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>, <span class="variable-name">Collector</span>),

    <span class="comment-delimiter">%% </span><span class="comment">don't terminate, wait here, until all tasks done.
</span>    <span class="keyword">receive</span>
        stop -&gt;<span class="function-name"> </span>io:format(<span class="string">"Time: ~10.2f ms~n"</span>, [timer:now_diff(now(), <span class="variable-name">Start</span>) / 1000])       
    <span class="keyword">end</span>.

<span class="function-name">read_file</span>(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>, <span class="variable-name">Collector</span>) -&gt;
    <span class="variable-name">ChunkSize</span> = get_chunk_size(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>),
    {ok, <span class="variable-name">File</span>} = file:open(<span class="variable-name">FileName</span>, [raw, binary]),
    read_file_1(<span class="variable-name">File</span>, <span class="variable-name">ChunkSize</span>, 0, <span class="variable-name">Collector</span>).    
<span class="function-name">read_file_1</span>(<span class="variable-name">File</span>, <span class="variable-name">ChunkSize</span>, <span class="variable-name">I</span>, <span class="variable-name">Collector</span>) -&gt;
    <span class="keyword">case</span> file:read(<span class="variable-name">File</span>, <span class="variable-name">ChunkSize</span>) <span class="keyword">of</span>
        eof -&gt;
            file:close(<span class="variable-name">File</span>),
            <span class="variable-name">Collector</span> ! {chunk_num, <span class="variable-name">I</span>};
        {ok, <span class="variable-name">Bin</span>} -&gt;<span class="function-name"> </span>
            <span class="keyword">spawn</span>(<span class="keyword">fun</span> () -&gt;
                          {<span class="variable-name">Data</span>, <span class="variable-name">Tail</span>} = split_on_last_newline(<span class="variable-name">Bin</span>),
                          <span class="variable-name">Collector</span> ! {seq, <span class="variable-name">I</span>, <span class="variable-name">Data</span>, <span class="variable-name">Tail</span>}
                  <span class="keyword">end</span>),
            read_file_1(<span class="variable-name">File</span>, <span class="variable-name">ChunkSize</span>, <span class="variable-name">I</span> + 1, <span class="variable-name">Collector</span>)
    <span class="keyword">end</span>.

<span class="function-name">collect_loop</span>(<span class="variable-name">Counter</span>) -&gt;<span class="function-name"> </span>collect_loop_1([], &lt;&lt;&gt;&gt;, -1, <span class="variable-name">Counter</span>).
<span class="function-name">collect_loop_1</span>(<span class="variable-name">Chunks</span>, <span class="variable-name">PrevTail</span>, <span class="variable-name">LastSeq</span>, <span class="variable-name">Counter</span>) -&gt;
    <span class="keyword">receive</span>
        {chunk_num, <span class="variable-name">ChunkNum</span>} -&gt;
            <span class="variable-name">Counter</span> ! {chunk_num, <span class="variable-name">ChunkNum</span>},
            collect_loop_1(<span class="variable-name">Chunks</span>, <span class="variable-name">PrevTail</span>, <span class="variable-name">LastSeq</span>, <span class="variable-name">Counter</span>);
        {seq, <span class="variable-name">I</span>, <span class="variable-name">Data</span>, <span class="variable-name">Tail</span>} -&gt;
            <span class="variable-name">SortedChunks</span> = lists:keysort(1, [{<span class="variable-name">I</span>, <span class="variable-name">Data</span>, <span class="variable-name">Tail</span>} | <span class="variable-name">Chunks</span>]),
            {<span class="variable-name">Chunks1</span>, <span class="variable-name">PrevTail1</span>, <span class="variable-name">LastSeq1</span>} = 
                process_chunks(<span class="variable-name">SortedChunks</span>, [], <span class="variable-name">PrevTail</span>, <span class="variable-name">LastSeq</span>, <span class="variable-name">Counter</span>),
            collect_loop_1(<span class="variable-name">Chunks1</span>, <span class="variable-name">PrevTail1</span>, <span class="variable-name">LastSeq1</span>, <span class="variable-name">Counter</span>)
    <span class="keyword">end</span>.
    
<span class="function-name">count_loop</span>(<span class="variable-name">Main</span>) -&gt;<span class="function-name"> </span>count_loop_1(<span class="variable-name">Main</span>, dict:new(), undefined, 0).
<span class="function-name">count_loop_1</span>(<span class="variable-name">Main</span>, <span class="variable-name">Dict</span>, <span class="variable-name">ChunkNum</span>, <span class="variable-name">ChunkNum</span>) -&gt;
    print_result(<span class="variable-name">Dict</span>),
    <span class="variable-name">Main</span> ! stop;
<span class="function-name">count_loop_1</span>(<span class="variable-name">Main</span>, <span class="variable-name">Dict</span>, <span class="variable-name">ChunkNum</span>, <span class="variable-name">ProcessedNum</span>) -&gt;
    <span class="keyword">receive</span>
        {chunk_num, <span class="variable-name">ChunkNumX</span>} -&gt;<span class="function-name"> </span>
            count_loop_1(<span class="variable-name">Main</span>, <span class="variable-name">Dict</span>, <span class="variable-name">ChunkNumX</span>, <span class="variable-name">ProcessedNum</span>);
        {dict, <span class="variable-name">DictX</span>} -&gt;
            <span class="variable-name">Dict1</span> = dict:merge(<span class="keyword">fun</span> (<span class="variable-name">_</span>, <span class="variable-name">V1</span>, <span class="variable-name">V2</span>) -&gt;<span class="function-name"> </span><span class="variable-name">V1</span> + <span class="variable-name">V2</span> <span class="keyword">end</span>, <span class="variable-name">Dict</span>, <span class="variable-name">DictX</span>),
            count_loop_1(<span class="variable-name">Main</span>, <span class="variable-name">Dict1</span>, <span class="variable-name">ChunkNum</span>, <span class="variable-name">ProcessedNum</span> + 1)
    <span class="keyword">end</span>.

<span class="function-name">process_chunks</span>([], <span class="variable-name">ChunkBuf</span>, <span class="variable-name">PrevTail</span>, <span class="variable-name">LastSeq</span>, <span class="variable-name">_</span>) -&gt;<span class="function-name"> </span>{<span class="variable-name">ChunkBuf</span>, <span class="variable-name">PrevTail</span>, <span class="variable-name">LastSeq</span>};
<span class="function-name">process_chunks</span>([{<span class="variable-name">I</span>, <span class="variable-name">Data</span>, <span class="variable-name">Tail</span>}=<span class="variable-name">Chunk</span>|<span class="variable-name">T</span>], <span class="variable-name">ChunkBuf</span>, <span class="variable-name">PrevTail</span>, <span class="variable-name">LastSeq</span>, <span class="variable-name">Counter</span>) -&gt;
    <span class="keyword">case</span> <span class="variable-name">LastSeq</span> + 1 <span class="keyword">of</span>
        <span class="variable-name">I</span> -&gt;
            <span class="keyword">spawn</span>(<span class="keyword">fun</span> () -&gt;<span class="function-name"> </span><span class="variable-name">Counter</span> ! {dict, scan_chunk(&lt;&lt;<span class="variable-name">PrevTail</span>/binary, <span class="variable-name">Data</span>/binary&gt;&gt;)} <span class="keyword">end</span>),
            process_chunks(<span class="variable-name">T</span>, <span class="variable-name">ChunkBuf</span>, <span class="variable-name">Tail</span>, <span class="variable-name">I</span>, <span class="variable-name">Counter</span>);
        <span class="variable-name">_</span> -&gt;
            process_chunks(<span class="variable-name">T</span>, [<span class="variable-name">Chunk</span> | <span class="variable-name">ChunkBuf</span>], <span class="variable-name">PrevTail</span>, <span class="variable-name">LastSeq</span>, <span class="variable-name">Counter</span>)
    <span class="keyword">end</span>.

<span class="function-name">print_result</span>(<span class="variable-name">Dict</span>) -&gt;
    <span class="variable-name">SortedList</span> = lists:reverse(lists:keysort(2, dict:to_list(<span class="variable-name">Dict</span>))),
    [io:format(<span class="string">"~b\t: ~s~n"</span>, [<span class="variable-name">V</span>, <span class="variable-name">K</span>]) || {<span class="variable-name">K</span>, <span class="variable-name">V</span>} &lt;- lists:sublist(<span class="variable-name">SortedList</span>, 10)].

<span class="function-name">get_chunk_size</span>(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>) -&gt;
    {ok, #<span class="type">file_info</span>{size=<span class="variable-name">Size</span>}} = file:read_file_info(<span class="variable-name">FileName</span>),
    <span class="variable-name">Size</span> div <span class="variable-name">ProcNum</span>.

<span class="function-name">split_on_last_newline</span>(<span class="variable-name">Bin</span>) -&gt;<span class="function-name"> </span>split_on_last_newline_1(<span class="variable-name">Bin</span>, <span class="keyword">size</span>(<span class="variable-name">Bin</span>)).   
<span class="function-name">split_on_last_newline_1</span>(<span class="variable-name">Bin</span>, <span class="variable-name">Offset</span>) <span class="keyword">when</span> <span class="variable-name">Offset</span> &gt; 0 -&gt;
    <span class="keyword">case</span> <span class="variable-name">Bin</span> <span class="keyword">of</span>
        &lt;&lt;<span class="variable-name">Data</span>:<span class="variable-name">Offset</span>/binary,<span class="string">$\n</span>,<span class="variable-name">Tail</span>/binary&gt;&gt; -&gt;
            {<span class="variable-name">Data</span>, <span class="variable-name">Tail</span>};
        <span class="variable-name">_</span> -&gt;<span class="function-name"> </span>
            split_on_last_newline_1(<span class="variable-name">Bin</span>, <span class="variable-name">Offset</span> - 1)
    <span class="keyword">end</span>;
<span class="function-name">split_on_last_newline_1</span>(<span class="variable-name">Bin</span>, <span class="variable-name">_</span>) -&gt;<span class="function-name"> </span>{<span class="variable-name">Bin</span>, &lt;&lt;&gt;&gt;}.
    
<span class="function-name">scan_chunk</span>(<span class="variable-name">Bin</span>) -&gt;<span class="function-name"> </span>scan_chunk_1(<span class="variable-name">Bin</span>, 0, dict:new()).    
<span class="function-name">scan_chunk_1</span>(<span class="variable-name">Bin</span>, <span class="variable-name">Offset</span>, <span class="variable-name">Dict</span>) <span class="keyword">when</span> <span class="variable-name">Offset</span> =&lt; <span class="keyword">size</span>(<span class="variable-name">Bin</span>) - 34 -&gt;
    <span class="keyword">case</span> <span class="variable-name">Bin</span> <span class="keyword">of</span>
        &lt;&lt;<span class="variable-name">_</span>:<span class="variable-name">Offset</span>/binary,<span class="string">"GET /ongoing/When/"</span>,<span class="variable-name">_</span>,<span class="variable-name">_</span>,<span class="variable-name">_</span>,<span class="string">$x</span>,<span class="string">$/</span>,<span class="variable-name">Y1</span>,<span class="variable-name">Y2</span>,<span class="variable-name">Y3</span>,<span class="variable-name">Y4</span>,<span class="string">$/</span>,<span class="variable-name">M1</span>,<span class="variable-name">M2</span>,<span class="string">$/</span>,<span class="variable-name">D1</span>,<span class="variable-name">D2</span>,<span class="string">$/</span>,<span class="variable-name">Rest</span>/binary&gt;&gt; -&gt;<span class="function-name"> </span>           
            <span class="keyword">case</span> match_until_space_newline(<span class="variable-name">Rest</span>, 0) <span class="keyword">of</span>
                {<span class="variable-name">Rest1</span>, &lt;&lt;&gt;&gt;} -&gt;<span class="function-name"> </span>
                    scan_chunk_1(<span class="variable-name">Rest1</span>, 0, <span class="variable-name">Dict</span>);
                {<span class="variable-name">Rest1</span>, <span class="variable-name">Word</span>} -&gt;<span class="function-name"> </span>
                    <span class="variable-name">Key</span> = &lt;&lt;<span class="variable-name">Y1</span>,<span class="variable-name">Y2</span>,<span class="variable-name">Y3</span>,<span class="variable-name">Y4</span>,<span class="string">$/</span>,<span class="variable-name">M1</span>,<span class="variable-name">M2</span>,<span class="string">$/</span>,<span class="variable-name">D1</span>,<span class="variable-name">D2</span>,<span class="string">$/</span>, <span class="variable-name">Word</span>/binary&gt;&gt;,
                    scan_chunk_1(<span class="variable-name">Rest1</span>, 0, dict:update_counter(<span class="variable-name">Key</span>, 1, <span class="variable-name">Dict</span>))
            <span class="keyword">end</span>;
        <span class="variable-name">_</span> -&gt;<span class="function-name"> </span>scan_chunk_1(<span class="variable-name">Bin</span>, <span class="variable-name">Offset</span> + 1, <span class="variable-name">Dict</span>)
    <span class="keyword">end</span>;
<span class="function-name">scan_chunk_1</span>(<span class="variable-name">_</span>, <span class="variable-name">_</span>, <span class="variable-name">Dict</span>) -&gt;<span class="function-name"> </span><span class="variable-name">Dict</span>.

<span class="function-name">match_until_space_newline</span>(<span class="variable-name">Bin</span>, <span class="variable-name">Offset</span>) <span class="keyword">when</span> <span class="variable-name">Offset</span> &lt; <span class="keyword">size</span>(<span class="variable-name">Bin</span>) -&gt;
    <span class="keyword">case</span> <span class="variable-name">Bin</span> <span class="keyword">of</span>
        &lt;&lt;<span class="variable-name">Word</span>:<span class="variable-name">Offset</span>/binary,<span class="string">$ </span>,<span class="variable-name">Rest</span>/binary&gt;&gt; -&gt;
            {<span class="variable-name">Rest</span>, <span class="variable-name">Word</span>};
        &lt;&lt;<span class="variable-name">_</span>:<span class="variable-name">Offset</span>/binary,<span class="string">$.</span>,<span class="variable-name">Rest</span>/binary&gt;&gt; -&gt;
            {<span class="variable-name">Rest</span>, &lt;&lt;&gt;&gt;};
        &lt;&lt;<span class="variable-name">_</span>:<span class="variable-name">Offset</span>/binary,10,<span class="variable-name">Rest</span>/binary&gt;&gt; -&gt;
            {<span class="variable-name">Rest</span>, &lt;&lt;&gt;&gt;};
        <span class="variable-name">_</span> -&gt;
            match_until_space_newline(<span class="variable-name">Bin</span>, <span class="variable-name">Offset</span> + 1)
    <span class="keyword">end</span>;
<span class="function-name">match_until_space_newline</span>(<span class="variable-name">_</span>, <span class="variable-name">_</span>) -&gt;<span class="function-name"> </span>{&lt;&lt;&gt;&gt;, &lt;&lt;&gt;&gt;}.

</pre>
<p>
=======
</p><p></p>
          <br/>
          ]]>
        </description>
        <pubDate>Wed, 17 Oct 2007 21:58:50 +0800</pubDate>
        <link>http://dcaoyuan.javaeye.com/blog/133015</link>
        <guid>http://dcaoyuan.javaeye.com/blog/133015</guid>
      </item>
      <item>
        <title>The Erlang Way (Was Tim Bray's Erlang Exercise - Round IV)</title>
        <author>dcaoyuan</author>
        <description>
          <![CDATA[
          <br/>
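          Author: <a href="http://dcaoyuan.javaeye.com">dcaoyuan</a>&nbsp;
          Link: <a href="http://dcaoyuan.javaeye.com/blog/133016" style="color:red;">http://dcaoyuan.javaeye.com/blog/133016</a>&nbsp;
          Published: October 17, 2007
          <br/><br/>
          Notice: This is an original blog article published on the JavaEye website. No website may reproduce it without the author's written permission; violations will be pursued legally.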
          <br/><br/>
          <p>Playing with Tim's Erlang Exercise is so much fun. 
</p><p>I've been coding in Erlang for about 6 months as a newbie. In most cases I parse strings (or lists, whatever) without any regular expressions, since Erlang's pattern matching can usually solve most problems in a straightforward way.</p><p>
</p><p>Tim's log file is also a good example of applying pattern matching the Erlang way. It's a continuous stream of data; after splitting it into line-bounded chunks for parallelization, we can match the whole {GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) } directly on each chunk, with no need to split it into lines any more.
</p><p>
This led to my third solution, which matches the whole 
</p><pre>
{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) } 
</pre>
<p>or the like, using the pattern:
</p><pre>
"GET /ongoing/When/"++[_,_,_,$x,$/,Y1,Y2,Y3,Y4,$/,M1,M2,$/,D1,D2,$/|Rest]
</pre>
<p>and then fetches 
</p><pre>[Y1,Y2,Y3,Y4,$/,M1,M2,$/,D1,D2,$/] ++ match_until_space_newline(Rest, [])</pre>
<p>as the matched key, with no need to split the chunk to lines.</p><p>
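The same match can be tried on a single request string with the following minimal sketch. The module name match_demo and the helper name/2 are made up for this illustration; the tbray3 code further down is what was actually measured.
</p><pre>
-module(match_demo).
-export([key/1]).

%% Returns {ok, Key} when the string starts with an article request,
%% otherwise nomatch.
key("GET /ongoing/When/" ++ [_,_,_,$x,$/,Y1,Y2,Y3,Y4,$/,M1,M2,$/,D1,D2,$/|Rest]) -&gt;
    case name(Rest, []) of
        []   -&gt; nomatch;
        Name -&gt; {ok, [Y1,Y2,Y3,Y4,$/,M1,M2,$/,D1,D2,$/ | Name]}
    end;
key(_) -&gt; nomatch.

%% Collects characters up to the first space. A '.', a newline or the end
%% of input before any space means this is not an article name.
name([$\s|_], Acc) -&gt; lists:reverse(Acc);
name([$.|_], _)    -&gt; [];
name([$\n|_], _)   -&gt; [];
name([C|T], Acc)   -&gt; name(T, [C|Acc]);
name([], _)        -&gt; [].
</pre><p>
Under these assumptions, match_demo:key("GET /ongoing/When/200x/2006/09/29/Dynamic-IDE HTTP/1.1") should return {ok, "2006/09/29/Dynamic-IDE"}, while match_demo:key("GET /ongoing/When/200x/2006/09/29/pic.png HTTP/1.1") returns nomatch because of the '.' in the name.
</p><p>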
But yes, we still need to split each chunk on the last newline to keep the parallelized result exactly accurate.
</p><p>
On my 2-core 2 GHz MacBook, the best time I’ve got is <b>4.483 sec</b>
</p><pre>
# smp enabled:
$ erlc -smp tbray3.erl
$ time erl -smp +P 60000 -noshell -run tbray3 start o1000k.ap -s erlang halt
8900    : &lt;&lt;"2006/09/29/Dynamic-IDE"&gt;&gt;
2000    : &lt;&lt;"2006/07/28/Open-Data"&gt;&gt;
1300    : &lt;&lt;"2003/07/25/NotGaming"&gt;&gt;
800     : &lt;&lt;"2003/10/16/Debbie"&gt;&gt;
800     : &lt;&lt;"2003/09/18/NXML"&gt;&gt;
800     : &lt;&lt;"2006/01/31/Data-Protection"&gt;&gt;
700     : &lt;&lt;"2003/06/23/SamsPie"&gt;&gt;
600     : &lt;&lt;"2006/09/11/Making-Markup"&gt;&gt;
600     : &lt;&lt;"2003/02/04/Construction"&gt;&gt;
600     : &lt;&lt;"2005/11/03/Cars-and-Office-Suites"&gt;&gt;
Time:    4142.83 ms

real    0m4.483s
user    0m5.804s
sys     0m0.615s

# no-smp:
$ erlc tbray3.erl
$ time erl -noshell -run tbray3 start o1000k.ap -s erlang halt

real    0m7.050s
user    0m6.183s
sys     0m0.644s

</pre>
<p>With SMP enabled, the speedup is about 57% (7.050s vs. 4.483s).
</p><p>
On the 2.80GHz 4-CPU Xeon Debian box that I mentioned in a previous blog post, the best result is:
</p><pre>
real    0m8.420s
user    0m11.637s
sys     0m0.452s
</pre>
<p>I've also noticed that adjusting BUFFER_SIZE can balance the time consumed by the parallelized and un-parallelized parts. That is, if the number of cores is increased, we can also increase BUFFER_SIZE a bit, so that the number of chunks decreases (fewer un-parallelized <b>split_on_last_newline/1</b> and <b>file:read/2</b> calls), at the cost of heavier work for the parallelized <b>binary_to_list/1</b> and <b>scan_chunk/1</b> on a <b>longer</b> list. 
</p><p>The best BUFFER_SIZE on my computer is 4096 * 5 bytes, with which the un-parallelized <b>split_on_last_newline/1</b> takes only about 0.226s in this case.
</p><p>
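As a rough illustration of this trade-off, the number of chunks (and so of spawned scan processes) is the file size divided by BUFFER_SIZE, rounded up. The module chunk_count below is a hypothetical helper, not part of tbray3, and the 1000000-byte file size is only an example figure.
</p><pre>
-module(chunk_count).
-export([chunks/2]).

%% Number of reads that return data for a file of Size bytes when it is
%% read BufferSize bytes at a time (ceiling division).
chunks(Size, BufferSize) when Size &gt;= 0, BufferSize &gt; 0 -&gt;
    (Size + BufferSize - 1) div BufferSize.
</pre><p>
For that example size, chunk_count:chunks(1000000, 4096) gives 245 chunks, while chunk_count:chunks(1000000, 4096 * 5) gives 49.
</p><p>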
The code:</p><p>
</p><pre>
<span class="function-name">-module</span>(tbray3).

<span class="function-name">-compile</span>([native]).

<span class="function-name">-export</span>([start/1]).
  
<span class="comment-delimiter">%% </span><span class="comment">The best Bin Buffer Size is 4096 * 1 - 4096 * 5
</span><span class="function-name">-define</span>(<span class="constant">BUFFER_SIZE</span>, (4096 * 5)). 

<span class="function-name">start</span>(<span class="variable-name">FileName</span>) -&gt;
    <span class="variable-name">Start</span> = now(),

    <span class="variable-name">Main</span> = <span class="keyword">self</span>(),
    <span class="variable-name">Collector</span> = <span class="keyword">spawn</span>(<span class="keyword">fun</span> () -&gt;<span class="function-name"> </span>collect_loop(<span class="variable-name">Main</span>) <span class="keyword">end</span>),
 
    {ok, <span class="variable-name">File</span>} = file:open(<span class="variable-name">FileName</span>, [raw, binary]),
    read_file(<span class="variable-name">File</span>, <span class="variable-name">Collector</span>),
    
    <span class="comment-delimiter">%% </span><span class="comment">don't terminate, wait here, until all tasks done.
</span>    <span class="keyword">receive</span>
        stop -&gt;<span class="function-name"> </span>io:format(<span class="string">"Time: ~10.2f ms~n"</span>, [timer:now_diff(now(), <span class="variable-name">Start</span>) / 1000])
    <span class="keyword">end</span>.

<span class="function-name">read_file</span>(<span class="variable-name">File</span>, <span class="variable-name">Collector</span>) -&gt;<span class="function-name"> </span>read_file_1(<span class="variable-name">File</span>, [], 0, <span class="variable-name">Collector</span>).
<span class="function-name">read_file_1</span>(<span class="variable-name">File</span>, <span class="variable-name">PrevTail</span>, <span class="variable-name">I</span>, <span class="variable-name">Collector</span>) -&gt;
    <span class="keyword">case</span> file:read(<span class="variable-name">File</span>, ?<span class="constant">BUFFER_SIZE</span>) <span class="keyword">of</span>
        eof -&gt;
            <span class="variable-name">Collector</span> ! {chunk_num, <span class="variable-name">I</span>},
            file:close(<span class="variable-name">File</span>);
        {ok, <span class="variable-name">Bin</span>} -&gt;<span class="function-name"> </span>
            {<span class="variable-name">Chunk</span>, <span class="variable-name">NextTail</span>} = split_on_last_newline(<span class="variable-name">PrevTail</span> ++ <span class="keyword">binary_to_list</span>(<span class="variable-name">Bin</span>)),
            <span class="keyword">spawn</span>(<span class="keyword">fun</span> () -&gt;<span class="function-name"> </span><span class="variable-name">Collector</span> ! {dict, scan_chunk(<span class="variable-name">Chunk</span>)} <span class="keyword">end</span>),
            read_file_1(<span class="variable-name">File</span>, <span class="variable-name">NextTail</span>, <span class="variable-name">I</span> + 1, <span class="variable-name">Collector</span>)
    <span class="keyword">end</span>.

<span class="function-name">split_on_last_newline</span>(<span class="variable-name">List</span>) -&gt;<span class="function-name"> </span>split_on_last_newline_1(lists:reverse(<span class="variable-name">List</span>), []).
<span class="function-name">split_on_last_newline_1</span>(<span class="variable-name">List</span>, <span class="variable-name">Tail</span>) -&gt;
    <span class="keyword">case</span> <span class="variable-name">List</span> <span class="keyword">of</span>
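        []         -&gt;<span class="function-name"> </span>{<span class="variable-name">Tail</span>, []}; <span class="comment-delimiter">%% </span><span class="comment">no newline found: treat the whole buffer as the chunk</span>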
        [<span class="string">$\n</span>|<span class="variable-name">Rest</span>] -&gt;<span class="function-name"> </span>{lists:reverse(<span class="variable-name">Rest</span>), <span class="variable-name">Tail</span>};
        [<span class="variable-name">C</span>|<span class="variable-name">Rest</span>]   -&gt;<span class="function-name"> </span>split_on_last_newline_1(<span class="variable-name">Rest</span>, [<span class="variable-name">C</span> | <span class="variable-name">Tail</span>])
    <span class="keyword">end</span>.

<span class="function-name">collect_loop</span>(<span class="variable-name">Main</span>) -&gt;<span class="function-name"> </span>collect_loop_1(<span class="variable-name">Main</span>, <span class="bold">dict:new</span>(), undefined, 0).
<span class="function-name">collect_loop_1</span>(<span class="variable-name">Main</span>, <span class="variable-name">Dict</span>, <span class="variable-name">ChunkNum</span>, <span class="variable-name">ChunkNum</span>) -&gt;
    print_result(<span class="variable-name">Dict</span>),
    <span class="variable-name">Main</span> ! stop;
<span class="function-name">collect_loop_1</span>(<span class="variable-name">Main</span>, <span class="variable-name">Dict</span>, <span class="variable-name">ChunkNum</span>, <span class="variable-name">ProcessedNum</span>) -&gt;
    <span class="keyword">receive</span>
        {chunk_num, <span class="variable-name">ChunkNumX</span>} -&gt;<span class="function-name"> </span>
            collect_loop_1(<span class="variable-name">Main</span>, <span class="variable-name">Dict</span>, <span class="variable-name">ChunkNumX</span>, <span class="variable-name">ProcessedNum</span>);
        {dict, <span class="variable-name">DictX</span>} -&gt;<span class="function-name"> </span>
            <span class="variable-name">Dict1</span> = <span class="bold">dict:merge</span>(<span class="keyword">fun</span> (<span class="variable-name">_</span>, <span class="variable-name">V1</span>, <span class="variable-name">V2</span>) -&gt;<span class="function-name"> </span><span class="variable-name">V1</span> + <span class="variable-name">V2</span> <span class="keyword">end</span>, <span class="variable-name">Dict</span>, <span class="variable-name">DictX</span>),
            collect_loop_1(<span class="variable-name">Main</span>, <span class="variable-name">Dict1</span>, <span class="variable-name">ChunkNum</span>, <span class="variable-name">ProcessedNum</span> + 1)
    <span class="keyword">end</span>.
    
<span class="function-name">print_result</span>(<span class="variable-name">Dict</span>) -&gt;
    <span class="variable-name">SortedList</span> = lists:reverse(lists:keysort(2, <span class="bold">dict:to_list</span>(<span class="variable-name">Dict</span>))),
    [io:format(<span class="string">"~b\t: ~p~n"</span>, [<span class="variable-name">V</span>, <span class="variable-name">K</span>]) || {<span class="variable-name">K</span>, <span class="variable-name">V</span>} &lt;- lists:sublist(<span class="variable-name">SortedList</span>, 10)].

<span class="function-name">scan_chunk</span>(<span class="variable-name">List</span>) -&gt;<span class="function-name"> </span>scan_chunk_1(<span class="variable-name">List</span>, <span class="bold">dict:new</span>()).
<span class="function-name">scan_chunk_1</span>(<span class="variable-name">List</span>, <span class="variable-name">Dict</span>) -&gt;
    <span class="keyword">case</span> <span class="variable-name">List</span> <span class="keyword">of</span>
        [] -&gt;<span class="function-name"> </span><span class="variable-name">Dict</span>;
        <span class="string">"GET /ongoing/When/"</span>++[<span class="variable-name">_</span>,<span class="variable-name">_</span>,<span class="variable-name">_</span>,<span class="string">$x</span>,<span class="string">$/</span>,<span class="variable-name">Y1</span>,<span class="variable-name">Y2</span>,<span class="variable-name">Y3</span>,<span class="variable-name">Y4</span>,<span class="string">$/</span>,<span class="variable-name">M1</span>,<span class="variable-name">M2</span>,<span class="string">$/</span>,<span class="variable-name">D1</span>,<span class="variable-name">D2</span>,<span class="string">$/</span>|<span class="variable-name">Rest</span>] -&gt;
            <span class="keyword">case</span> match_until_space_newline(<span class="variable-name">Rest</span>, []) <span class="keyword">of</span>
                {<span class="variable-name">Rest1</span>, []} -&gt;<span class="function-name"> </span>
                    scan_chunk_1(<span class="variable-name">Rest1</span>, <span class="variable-name">Dict</span>);
                {<span class="variable-name">Rest1</span>, <span class="variable-name">Word</span>} -&gt;<span class="function-name"> </span>
                    <span class="variable-name">Key</span> = <span class="keyword">list_to_binary</span>([<span class="variable-name">Y1</span>,<span class="variable-name">Y2</span>,<span class="variable-name">Y3</span>,<span class="variable-name">Y4</span>,<span class="string">$/</span>,<span class="variable-name">M1</span>,<span class="variable-name">M2</span>,<span class="string">$/</span>,<span class="variable-name">D1</span>,<span class="variable-name">D2</span>,<span class="string">$/</span>, <span class="variable-name">Word</span>]),
                    scan_chunk_1(<span class="variable-name">Rest1</span>, <span class="bold">dict:update_counter</span>(<span class="variable-name">Key</span>, 1, <span class="variable-name">Dict</span>))
            <span class="keyword">end</span>;
        [<span class="variable-name">_</span>|<span class="variable-name">Rest</span>] -&gt;<span class="function-name"> </span>scan_chunk_1(<span class="variable-name">Rest</span>, <span class="variable-name">Dict</span>)
    <span class="keyword">end</span>.
    
<span class="function-name">match_until_space_newline</span>(<span class="variable-name">List</span>, <span class="variable-name">Word</span>) -&gt;
    <span class="keyword">case</span> <span class="variable-name">List</span> <span class="keyword">of</span>
        []     -&gt;<span class="function-name"> </span>{[],   []};
        [10|<span class="variable-name">_</span>] -&gt;<span class="function-name"> </span>{<span class="variable-name">List</span>, []};
        [<span class="string">$.</span>|<span class="variable-name">_</span>] -&gt;<span class="function-name"> </span>{<span class="variable-name">List</span>, []};
        [<span class="string">$ </span>|<span class="variable-name">_</span>] -&gt;<span class="function-name"> </span>{<span class="variable-name">List</span>, lists:reverse(<span class="variable-name">Word</span>)};
        [<span class="variable-name">C</span>|<span class="variable-name">Rest</span>] -&gt;<span class="function-name"> </span>match_until_space_newline(<span class="variable-name">Rest</span>, [<span class="variable-name">C</span> | <span class="variable-name">Word</span>])
    <span class="keyword">end</span>.
</pre>
<p>
</p><p>
<b>I also wrote a corresponding binary version, which is 2-3 times slower than the list version above on my machine, but the result may vary depending on how your Erlang/OTP was compiled and on your operating system. I will test it again when Erlang/OTP R12B is released, which is claimed to have optimized binary matching performance.</b></p><p>
The code:
</p><p>
</p><pre>
<span class="function-name">-module</span>(tbray3_bin).

<span class="function-name">-compile</span>([native]).

<span class="function-name">-export</span>([start/1]).

<span class="function-name">-define</span>(<span class="constant">BUFFER_SIZE</span>, (4096 * 10000)).

<span class="function-name">start</span>(<span class="variable-name">FileName</span>) -&gt;
    <span class="variable-name">Start</span> = now(),

    <span class="variable-name">Main</span> = <span class="keyword">self</span>(),
    <span class="variable-name">Collector</span> = <span class="keyword">spawn</span>(<span class="keyword">fun</span> () -&gt;<span class="function-name"> </span>collect_loop(<span class="variable-name">Main</span>) <span class="keyword">end</span>),

    {ok, <span class="variable-name">File</span>} = file:open(<span class="variable-name">FileName</span>, [raw, binary]),    
    read_file(<span class="variable-name">File</span>, <span class="variable-name">Collector</span>),

    <span class="comment-delimiter">%% </span><span class="comment">don't terminate, wait here, until all tasks done.
</span>    <span class="keyword">receive</span>
        stop -&gt;<span class="function-name"> </span>io:format(<span class="string">"Time: ~p ms~n"</span>, [timer:now_diff(now(), <span class="variable-name">Start</span>) / 1000])       
    <span class="keyword">end</span>.
    
<span class="function-name">collect_loop</span>(<span class="variable-name">Main</span>) -&gt;<span class="function-name"> </span>collect_loop_1(<span class="variable-name">Main</span>, dict:new(), undefined, 0).
<span class="function-name">collect_loop_1</span>(<span class="variable-name">Main</span>, <span class="variable-name">Dict</span>, <span class="variable-name">ChunkNum</span>, <span class="variable-name">ChunkNum</span>) -&gt;
    print_result(<span class="variable-name">Dict</span>),
    <span class="variable-name">Main</span> ! stop;
<span class="function-name">collect_loop_1</span>(<span class="variable-name">Main</span>, <span class="variable-name">Dict</span>, <span class="variable-name">ChunkNum</span>, <span class="variable-name">ProcessedNum</span>) -&gt;
    <span class="keyword">receive</span>
        {chunk_num, <span class="variable-name">ChunkNumX</span>} -&gt;<span class="function-name"> </span>
            collect_loop_1(<span class="variable-name">Main</span>, <span class="variable-name">Dict</span>, <span class="variable-name">ChunkNumX</span>, <span class="variable-name">ProcessedNum</span>);
        {dict, <span class="variable-name">DictX</span>} -&gt;
            <span class="variable-name">Dict1</span> = dict:merge(<span class="keyword">fun</span> (<span class="variable-name">_</span>, <span class="variable-name">V1</span>, <span class="variable-name">V2</span>) -&gt;<span class="function-name"> </span><span class="variable-name">V1</span> + <span class="variable-name">V2</span> <span class="keyword">end</span>, <span class="variable-name">Dict</span>, <span class="variable-name">DictX</span>),
            collect_loop_1(<span class="variable-name">Main</span>, <span class="variable-name">Dict1</span>, <span class="variable-name">ChunkNum</span>, <span class="variable-name">ProcessedNum</span> + 1)
    <span class="keyword">end</span>.

<span class="function-name">print_result</span>(<span class="variable-name">Dict</span>) -&gt;
    <span class="variable-name">SortedList</span> = lists:reverse(lists:keysort(2, dict:to_list(<span class="variable-name">Dict</span>))),
    [io:format(<span class="string">"~b\t: ~s~n"</span>, [<span class="variable-name">V</span>, <span class="variable-name">K</span>]) || {<span class="variable-name">K</span>, <span class="variable-name">V</span>} &lt;- lists:sublist(<span class="variable-name">SortedList</span>, 10)].
          
<span class="function-name">read_file</span>(<span class="variable-name">File</span>, <span class="variable-name">Collector</span>) -&gt;<span class="function-name"> </span>read_file(<span class="variable-name">File</span>, &lt;&lt;&gt;&gt;, 0, <span class="variable-name">Collector</span>).            
<span class="function-name">read_file</span>(<span class="variable-name">File</span>, <span class="variable-name">PrevTail</span>, <span class="variable-name">I</span>, <span class="variable-name">Collector</span>) -&gt;
    <span class="keyword">case</span> file:read(<span class="variable-name">File</span>, ?<span class="constant">BUFFER_SIZE</span>) <span class="keyword">of</span>
        eof -&gt;<span class="function-name"> </span>
            file:close(<span class="variable-name">File</span>),
            <span class="variable-name">Collector</span> ! {chunk_num, <span class="variable-name">I</span>};
        {ok, <span class="variable-name">Bin</span>} -&gt;<span class="function-name"> </span>
            {<span class="variable-name">Data</span>, <span class="variable-name">NextTail</span>} = split_on_last_newline(<span class="variable-name">Bin</span>),
            <span class="keyword">spawn</span>(<span class="keyword">fun</span> () -&gt;<span class="function-name"> </span><span class="variable-name">Collector</span> ! {dict, scan_chunk(&lt;&lt;<span class="variable-name">PrevTail</span>/binary, <span class="variable-name">Data</span>/binary&gt;&gt;)} <span class="keyword">end</span>),
            read_file(<span class="variable-name">File</span>, <span class="variable-name">NextTail</span>, <span class="variable-name">I</span> + 1, <span class="variable-name">Collector</span>)
    <span class="keyword">end</span>.

<span class="function-name">split_on_last_newline</span>(<span class="variable-name">Bin</span>) -&gt;<span class="function-name"> </span>split_on_last_newline(<span class="variable-name">Bin</span>, <span class="keyword">size</span>(<span class="variable-name">Bin</span>)).   
<span class="function-name">split_on_last_newline</span>(<span class="variable-name">Bin</span>, <span class="variable-name">Offset</span>) -&gt;
    <span class="keyword">case</span> <span class="variable-name">Bin</span> <span class="keyword">of</span>
        &lt;&lt;<span class="variable-name">Data</span>:<span class="variable-name">Offset</span>/binary,<span class="string">$\n</span>,<span class="variable-name">Tail</span>/binary&gt;&gt; -&gt;
            {<span class="variable-name">Data</span>, <span class="variable-name">Tail</span>};
        <span class="variable-name">_</span> <span class="keyword">when</span> <span class="variable-name">Offset</span> =&lt; 0 -&gt;<span class="function-name"> </span>
            {<span class="variable-name">Bin</span>, &lt;&lt;&gt;&gt;};
        <span class="variable-name">_</span> -&gt;<span class="function-name"> </span>
            split_on_last_newline(<span class="variable-name">Bin</span>, <span class="variable-name">Offset</span> - 1)
    <span class="keyword">end</span>.
    
<span class="function-name">scan_chunk</span>(<span class="variable-name">Bin</span>) -&gt;<span class="function-name"> </span>scan_chunk_1(<span class="variable-name">Bin</span>, 0, dict:new()).    
<span class="function-name">scan_chunk_1</span>(<span class="variable-name">Bin</span>, <span class="variable-name">Offset</span>, <span class="variable-name">Dict</span>) <span class="keyword">when</span> <span class="variable-name">Offset</span> &lt; <span class="keyword">size</span>(<span class="variable-name">Bin</span>) - 34 -&gt;
    <span class="keyword">case</span> <span class="variable-name">Bin</span> <span class="keyword">of</span>
        &lt;&lt;<span class="variable-name">_</span>:<span class="variable-name">Offset</span>/binary,<span class="string">"GET /ongoing/When/"</span>,<span class="variable-name">_</span>,<span class="variable-name">_</span>,<span class="variable-name">_</span>,<span class="string">$x</span>,<span class="string">$/</span>,<span class="variable-name">Y1</span>,<span class="variable-name">Y2</span>,<span class="variable-name">Y3</span>,<span class="variable-name">Y4</span>,<span class="string">$/</span>,<span class="variable-name">M1</span>,<span class="variable-name">M2</span>,<span class="string">$/</span>,<span class="variable-name">D1</span>,<span class="variable-name">D2</span>,<span class="string">$/</span>,<span class="variable-name">Rest</span>/binary&gt;&gt; -&gt;<span class="function-name"> </span>           
            <span class="keyword">case</span> match_until_space_newline(<span class="variable-name">Rest</span>, 0) <span class="keyword">of</span>
                {<span class="variable-name">Rest1</span>, &lt;&lt;&gt;&gt;} -&gt;<span class="function-name"> </span>
                    scan_chunk_1(<span class="variable-name">Rest1</span>, 0, <span class="variable-name">Dict</span>);
                {<span class="variable-name">Rest1</span>, <span class="variable-name">Word</span>} -&gt;<span class="function-name"> </span>
                    <span class="variable-name">Key</span> = &lt;&lt;<span class="variable-name">Y1</span>,<span class="variable-name">Y2</span>,<span class="variable-name">Y3</span>,<span class="variable-name">Y4</span>,<span class="string">$/</span>,<span class="variable-name">M1</span>,<span class="variable-name">M2</span>,<span class="string">$/</span>,<span class="variable-name">D1</span>,<span class="variable-name">D2</span>,<span class="string">$/</span>, <span class="variable-name">Word</span>/binary&gt;&gt;,
                    scan_chunk_1(<span class="variable-name">Rest1</span>, 0, dict:update_counter(<span class="variable-name">Key</span>, 1, <span class="variable-name">Dict</span>))
            <span class="keyword">end</span>;
        <span class="variable-name">_</span> -&gt;<span class="function-name"> </span>scan_chunk_1(<span class="variable-name">Bin</span>, <span class="variable-name">Offset</span> + 1, <span class="variable-name">Dict</span>)
    <span class="keyword">end</span>;
<span class="function-name">scan_chunk_1</span>(<span class="variable-name">_</span>, <span class="variable-name">_</span>, <span class="variable-name">Dict</span>) -&gt;<span class="function-name"> </span><span class="variable-name">Dict</span>.

<span class="function-name">match_until_space_newline</span>(<span class="variable-name">Bin</span>, <span class="variable-name">Offset</span>) -&gt;
    <span class="keyword">case</span> <span class="variable-name">Bin</span> <span class="keyword">of</span>
        &lt;&lt;<span class="variable-name">Word</span>:<span class="variable-name">Offset</span>/binary,<span class="string">$ </span>,<span class="variable-name">Rest</span>/binary&gt;&gt; -&gt;
            {<span class="variable-name">Rest</span>, <span class="variable-name">Word</span>};
        &lt;&lt;<span class="variable-name">_</span>:<span class="variable-name">Offset</span>/binary,<span class="string">$.</span>,<span class="variable-name">Rest</span>/binary&gt;&gt; -&gt;
            {<span class="variable-name">Rest</span>, &lt;&lt;&gt;&gt;};
        &lt;&lt;<span class="variable-name">_</span>:<span class="variable-name">Offset</span>/binary,10,<span class="variable-name">Rest</span>/binary&gt;&gt; -&gt;
            {<span class="variable-name">Rest</span>, &lt;&lt;&gt;&gt;};
        &lt;&lt;<span class="variable-name">_</span>:<span class="variable-name">Offset</span>/binary,<span class="variable-name">_</span>,<span class="variable-name">_</span>/binary&gt;&gt; -&gt;
            match_until_space_newline(<span class="variable-name">Bin</span>, <span class="variable-name">Offset</span> + 1);
        <span class="variable-name">_</span> -&gt;<span class="function-name"> </span>
            {&lt;&lt;&gt;&gt;, &lt;&lt;&gt;&gt;}
    <span class="keyword">end</span>. 
</pre>
<p></p>
          <br/>
          <span style="color:red;">
            <a href="http://dcaoyuan.javaeye.com/blog/133016#comments" style="color:red;">The discussion of this article is also worth reading; browse the comments >></a>
          </span>
          <br/><br/><br/>
          <br/><br/><br/>
          ]]>
        </description>
        <pubDate>Wed, 17 Oct 2007 07:45:57 +0800</pubDate>
        <link>http://dcaoyuan.javaeye.com/blog/133016</link>
        <guid>http://dcaoyuan.javaeye.com/blog/133016</guid>
      </item>
      <item>
        <title>Reading File in Parallel in Erlang (Was Tim Bray's Erlang Exercise - Round III)</title>
        <author>dcaoyuan</author>
        <description>
          <![CDATA[
          <br/>
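          Author: <a href="http://dcaoyuan.javaeye.com">dcaoyuan</a>&nbsp;
          Link: <a href="http://dcaoyuan.javaeye.com/blog/133017" style="color:red;">http://dcaoyuan.javaeye.com/blog/133017</a>&nbsp;
          Published: 2007-10-15
          <br/><br/>
          Notice: This article is an original blog post published on JavaEye. Reproduction by any website without the author's written permission is strictly prohibited; violators will be held legally liable.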
          <br/><br/>
          <p>
My <a href="http://blogtrader.net/page/dcaoyuan?entry=tim_bray_s_erlang_exercise">first solution</a> for Tim's exercise tried to read the file in parallel, but by reading the file module's source code I just realized that file:open(FileName, Options) returns a process rather than a plain IO device (unless the raw option is given). This has several consequences:
</p><ul>
<li>It's a process, so when you request more data from it, you actually send it a message. Since you only send two integers, the offset and the length, <b>sending</b> the message should be very fast. But then this process (File) waits to <b>receive</b> the data from disk/IO. Within one process, that receiving is sequential rather than parallel.</li>
<li>If we view Erlang processes as ActiveObjects that send and receive messages/data asynchronously, then, because receiving is sequential within one process, requesting from and waiting on a single process (or object) is almost always safe in parallel programs; you usually do not need to worry about locking and unlocking (except when dealing with the outside world).</li>
<li>We can open many File processes to read data in parallel; the limits are disk/IO throughput and the OS's resource limits.</li>

</ul>
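<p>
The first point can be checked with a minimal sketch like the one below. The module name file_dev_demo and the function kinds/1 are hypothetical names chosen for illustration and are not part of the original code. It assumes that, without the raw option, file:open/2 returns a pid, while with raw it returns a file descriptor record instead.
</p><pre>
-module(file_dev_demo).

-export([kinds/1]).

%% Hypothetical helper: open FileName once without and once with the
%% raw option, and report whether each returned IO device is a pid.
kinds(FileName) -&gt;
    {ok, Plain} = file:open(FileName, [read, binary]),
    {ok, Raw}   = file:open(FileName, [read, binary, raw]),
    Result = {is_pid(Plain), is_pid(Raw)},
    ok = file:close(Plain),
    ok = file:close(Raw),
    Result.
</pre>
<p>
Calling file_dev_demo:kinds/1 on any readable file should return {true, false}: only the non-raw device is a separate process that you talk to by message passing.
</p>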
<p>
I wrote some code to test reading a file in parallel. Leaving the disk cache aside, on my 2-core MacBook, reading the file with two processes is nearly twice as fast as with one process.
</p><p>
The code:
</p><pre>

<span class="function-name">-module</span>(file_pread).

<span class="function-name">-compile</span>([native]).

<span class="function-name">-export</span>([start/2]).

<span class="function-name">-include_lib</span>(<span class="string">"kernel/include/file.hrl"</span>).

<span class="function-name">start</span>(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>) -&gt;
    [start(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>, <span class="variable-name">Fun</span>) || <span class="variable-name">Fun</span> &lt;- [<span class="keyword">fun</span> read_file/3, <span class="keyword">fun</span> pread_file/3]].


<span class="function-name">start</span>(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>, <span class="variable-name">Fun</span>) -&gt;
    <span class="variable-name">Start</span> = now(),  

    <span class="variable-name">Main</span> = <span class="keyword">self</span>(),
    <span class="variable-name">Collector</span> = <span class="keyword">spawn</span>(<span class="keyword">fun</span> () -&gt;<span class="function-name"> </span>collect_loop(<span class="variable-name">Main</span>) <span class="keyword">end</span>),

    <span class="variable-name">Fun</span>(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>, <span class="variable-name">Collector</span>),
    
    <span class="comment-delimiter">%% </span><span class="comment">don't terminate, wait here, until all tasks done.
</span>    <span class="keyword">receive</span>
        stop -&gt;<span class="function-name"> </span>io:format(<span class="string">"time: ~10.2f ms~n"</span>, [timer:now_diff(now(), <span class="variable-name">Start</span>) / 1000]) 
    <span class="keyword">end</span>.

<span class="function-name">collect_loop</span>(<span class="variable-name">Main</span>) -&gt;<span class="function-name"> </span>collect_loop_1(<span class="variable-name">Main</span>, undefined, 0).
<span class="function-name">collect_loop_1</span>(<span class="variable-name">Main</span>, <span class="variable-name">ChunkNum</span>, <span class="variable-name">ChunkNum</span>) -&gt;<span class="function-name"> </span>
    <span class="variable-name">Main</span> ! stop;
<span class="function-name">collect_loop_1</span>(<span class="variable-name">Main</span>, <span class="variable-name">ChunkNum</span>, <span class="variable-name">ProcessedNum</span>) -&gt;
    <span class="keyword">receive</span>
        {chunk_num, <span class="variable-name">ChunkNumX</span>} -&gt;
            collect_loop_1(<span class="variable-name">Main</span>, <span class="variable-name">ChunkNumX</span>, <span class="variable-name">ProcessedNum</span>);
        {seq, <span class="variable-name">_Seq</span>} -&gt;
            collect_loop_1(<span class="variable-name">Main</span>, <span class="variable-name">ChunkNum</span>, <span class="variable-name">ProcessedNum</span> + 1)
    <span class="keyword">end</span>.

<span class="function-name">get_chunk_size</span>(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>) -&gt;
    {ok, #<span class="type">file_info</span>{size=<span class="variable-name">Size</span>}} = file:read_file_info(<span class="variable-name">FileName</span>),
    <span class="variable-name">Size</span> div <span class="variable-name">ProcNum</span>.

<span class="function-name">read_file</span>(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>, <span class="variable-name">Collector</span>) -&gt;
    <span class="variable-name">ChunkSize</span> = get_chunk_size(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>),
    {ok, <span class="variable-name">File</span>} = file:open(<span class="variable-name">FileName</span>, [raw, binary]),
    read_file_1(<span class="variable-name">File</span>, <span class="variable-name">ChunkSize</span>, 0, <span class="variable-name">Collector</span>).
    
<span class="function-name">read_file_1</span>(<span class="variable-name">File</span>, <span class="variable-name">ChunkSize</span>, <span class="variable-name">I</span>, <span class="variable-name">Collector</span>) -&gt;
    <span class="keyword">case</span> file:read(<span class="variable-name">File</span>, <span class="variable-name">ChunkSize</span>) <span class="keyword">of</span>
        eof -&gt;
            file:close(<span class="variable-name">File</span>),
            <span class="variable-name">Collector</span> ! {chunk_num, <span class="variable-name">I</span>};
        {ok, <span class="variable-name">_Bin</span>} -&gt;<span class="function-name"> </span>
            <span class="variable-name">Collector</span> ! {seq, <span class="variable-name">I</span>},
            read_file_1(<span class="variable-name">File</span>, <span class="variable-name">ChunkSize</span>, <span class="variable-name">I</span> + 1, <span class="variable-name">Collector</span>)
    <span class="keyword">end</span>.


<span class="function-name">pread_file</span>(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>, <span class="variable-name">Collector</span>) -&gt;
    <span class="variable-name">ChunkSize</span> = get_chunk_size(<span class="variable-name">FileName</span>, <span class="variable-name">ProcNum</span>),
    pread_file_1(<span class="variable-name">FileName</span>, <span class="variable-name">ChunkSize</span>, <span class="variable-name">ProcNum</span>, <span class="variable-name">Collector</span>).
       
<span class="function-name">pread_file_1</span>(<span class="variable-name">FileName</span>, <span class="variable-name">ChunkSize</span>, <span class="variable-name">ProcNum</span>, <span class="variable-name">Collector</span>) -&gt;
    [<span class="keyword">spawn</span>(<span class="keyword">fun</span> () -&gt;
                   <span class="comment-delimiter">%% </span><span class="comment">if it's the last chunk, read all bytes left, 
</span>                   <span class="comment-delimiter">%% </span><span class="comment">which will not exceed ChunkSize * 2
</span>                   <span class="variable-name">Length</span> = <span class="keyword">if</span>  <span class="variable-name">I</span> == <span class="variable-name">ProcNum</span> - 1 -&gt;<span class="function-name"> </span><span class="variable-name">ChunkSize</span> * 2;
                                true -&gt;<span class="function-name"> </span><span class="variable-name">ChunkSize</span> <span class="keyword">end</span>,
                   {ok, <span class="variable-name">File</span>} = file:open(<span class="variable-name">FileName</span>, [read, binary]),
                   {ok, <span class="variable-name">_Bin</span>} = file:pread(<span class="variable-name">File</span>, <span class="variable-name">ChunkSize</span> * <span class="variable-name">I</span>, <span class="variable-name">Length</span>),
                   <span class="variable-name">Collector</span> ! {seq, <span class="variable-name">I</span>},
                   file:close(<span class="variable-name">File</span>)
           <span class="keyword">end</span>) || <span class="variable-name">I</span> &lt;- lists:seq(0, <span class="variable-name">ProcNum</span> - 1)],
    <span class="variable-name">Collector</span> ! {chunk_num, <span class="variable-name">ProcNum</span>}.

</pre>
<p>pread_file/3 is parallelized: it opens a new File process for each reading process instead of sharing one opened File process among all reading processes. read_file/3 is not parallelized.
</p><p>
To evaluate (run each test at least twice to even out the effect of disk/IO caches):
</p><pre class="code">
$ erlc -smp file_pread.erl
$ erl -smp

1&gt; file_pread:start("o100k.ap", 2).
time:     691.72 ms
time:      44.37 ms
[ok,ok]
2&gt; file_pread:start("o100k.ap", 2).
time:      74.50 ms
time:      43.59 ms
[ok,ok]
3&gt; file_pread:start("o1000k.ap", 2).
time:    1717.68 ms
time:     408.48 ms
[ok,ok]
4&gt; file_pread:start("o1000k.ap", 2).
time:     766.00 ms
time:     393.71 ms
[ok,ok]
5&gt; 
</pre>
<p>Comparing the results for each file (taking the second run of each), the speedup is:
</p><ul>
<li>o100k.ap, 20M: 74.50 / 43.59 - 1 = 71%</li>
<li>o1000k.ap, 200M: 766.00 / 393.71 - 1 = 95%</li>
 </ul>
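<p>
The speedup figures here follow the formula sequential time / parallel time - 1. A minimal sketch of that arithmetic, assuming a hypothetical module speedup_demo that is not part of the original code:
</p><pre>
-module(speedup_demo).

-export([speedup_percent/2]).

%% Hypothetical helper: relative speedup of the parallel run over the
%% sequential run, in percent: (Sequential / Parallel - 1) * 100.
speedup_percent(SequentialMs, ParallelMs) when ParallelMs &gt; 0 -&gt;
    (SequentialMs / ParallelMs - 1) * 100.
</pre>
<p>
For example, speedup_demo:speedup_percent(766.00, 393.71) gives roughly 94.6, which is the 95% quoted for o1000k.ap.
</p>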
<p>
On another 4-CPU Debian machine, with 4 processes, the best result I got was:
</p><pre class="code">
4&gt; file_pread:start("o1000k.ap", 4).
time:     768.59 ms
time:     258.57 ms
[ok, ok]
5&gt;
</pre>
<p>
The speedup from parallelized reading is 768.59 / 258.57 - 1 = 197%.
</p><p>
I've updated my first solution according to this test, opening a new File process for each reading process instead of sharing the same File process. Of course, there are still the issues that I pointed out in <a href="http://blogtrader.net/page/dcaoyuan?entry=tim_bray_s_erlang_exercise1">Tim Bray's Erlang Exercise on Large Dataset Processing - Round II</a>.
</p><p>
Although the above result can also be achieved in other languages, I find that writing parallel code in Erlang is a pleasure.
</p><p></p>
          <br/>
          <span style="color:red;">
            <a href="http://dcaoyuan.javaeye.com/blog/133017#comments" style="color:red;">The discussion of this article is also worth reading; browse the comments >></a>
          </span>
          <br/><br/><br/>
          <br/><br/><br/>
          ]]>
        </description>
        <pubDate>Mon, 15 Oct 2007 19:56:03 +0800</pubDate>
        <link>http://dcaoyuan.javaeye.com/blog/133017</link>
        <guid>http://dcaoyuan.javaeye.com/blog/133017</guid>
      </item>
      <item>
        <title>Tim Bray's Erlang Exercise on Large Dataset Processing - Round II</title>
        <author>dcaoyuan</author>
        <description>
          <![CDATA[
          <br/>
          Author: <a href="http://dcaoyuan.javaeye.com">dcaoyuan</a>&nbsp;
          Link: <a href="http://dcaoyuan.javaeye.com/blog/133018" style="color:red;">http://dcaoyuan.javaeye.com/blog/133018</a>&nbsp;
          Published: October 15, 2007
          <br/><br/>
          <p>
<b>Updated Oct 09: </b>Added more benchmark results on Linux on other machines.<br />
<b>Updated Oct 07: </b>More concise code.<br />
<b>Updated Oct 06: </b>Fixed bugs: 1. Match "GET /ongoing/When/" instead of "/ongoing/When/"; 2. split_on_last_newline should not reverse Tail. 
</p><p>Back from a short vacation and sitting down in front of my computer, I'm thinking about Tim Bray's exercise again. 
</p><p>
As I realized, the most expensive procedure is splitting the dataset into lines. To benefit from multiple cores, we should parallelize this procedure, rather than only the reading of the file into binaries or only the matching.
</p><p>
In my <a href="http://blogtrader.net/page/dcaoyuan?entry=tim_bray_s_erlang_exercise">previous solution</a>, there are at least two issues:
</p><ul>
<li>File reading is already fast in Erlang, so parallelizing it does not help much.</li>
<li>The buffered_read can actually be merged with the buffered file reading.</li>
</ul>
<p>
Also, <a href="http://www.erlang.org/pipermail/erlang-questions/2007-September/029549.html">Per's solution</a> parallelizes only the process_match procedure. It is built on a really fast divide_to_lines, but relies on hacked binary matching syntax. 
</p><p>After a couple of hours of work, I finally got the second version of tbray.erl (with some code from Per's solution). 
</p><ul>
<li>Read the file in small binary chunks (about 4096 bytes each), then convert each chunk to a list.</li>
<li>Prepend the previous tail to each chunk, search the chunk backwards from its end for the last newline, and split it into a line-bounded data part and a tail part for the next chunk.</li>
<li>The steps above are difficult to parallelize. If we tried, it would take about 30 more LOC and be less readable.</li>
<li>Immediately spawn a new process to split the line-bounded chunk into lines, match each line, and update a dict.</li>
<li>This way, the file reading can go on without stopping.</li>
<li>A collect_loop receives the dicts from each process and merges them.</li>
</ul>
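<p>
To make the chunk/tail step concrete, here is a small standalone sketch of splitting on the last newline and carrying the tail over to the next chunk. It is a variant written with <code>lists:splitwith/2</code> for illustration, not the code of tbray.erl; the module name <code>chunk_split_demo</code> and the <code>demo/0</code> function are hypothetical.
</p><pre class="code">
-module(chunk_split_demo).
-export([split_on_last_newline/1, demo/0]).

%% Split a chunk (a list of characters) at its last newline.
%% Returns {CompleteLines, Tail}; the newline itself is dropped.
split_on_last_newline(Chunk) -&gt;
    NotNewline = fun (C) -&gt; C =/= $\n end,
    {RevTail, RevData} = lists:splitwith(NotNewline, lists:reverse(Chunk)),
    case RevData of
        []         -&gt; {[], Chunk};   % no newline: carry the whole chunk over
        [$\n|Rest] -&gt; {lists:reverse(Rest), lists:reverse(RevTail)}
    end.

%% Two consecutive "chunks" whose boundary falls in the middle of a line.
demo() -&gt;
    {Data1, Tail1} = split_on_last_newline("line1\nline2\nli"),
    {Data2, Tail2} = split_on_last_newline(Tail1 ++ "ne3\nline4\n"),
    {Data1, Tail1, Data2, Tail2}.
</pre>
<p>
Calling <code>chunk_split_demo:demo()</code> returns <code>{"line1\nline2","li","line3\nline4",[]}</code>: the partial line "li" is held back from the first chunk and completed by the second.
</p>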
<p>
What I like about this version is that it scales across multiple cores <b>almost linearly!</b> On my 2.0GHz 2-core MacBook, it took about <b>13.522</b> seconds without smp and <b>7.624</b> seconds with smp enabled (for a 200M data file, with about 50,000 processes spawned). The 2-core smp run is about <b>77%</b> faster than the non-smp run. I'm not sure how it will do on an 8-core computer, but we'll eventually hit a limit due to the un-parallelized procedures.
</p><p>
</p><p> The Erlang time results:
</p><pre class="code">
$ erlc tbray.erl
$ time erl -noshell -run tbray start o1000k.ap -s erlang halt &gt; /dev/null

real    0m13.522s
user    0m12.265s
sys     0m1.199s

$ erlc -smp tbray.erl
$ time erl -smp +P 60000 -noshell -run tbray start o1000k.ap -s erlang halt &gt; /dev/null

real    0m7.624s
user    0m13.302s
sys     0m1.602s

# For 5 million lines, 958.4M size:
$ time erl -smp +P 300000 -noshell -run tbray start o5000k.ap -s erlang halt &gt; /dev/null

real    0m37.085s
user    1m5.605s
sys     0m7.554s
</pre>
<p>
And Tim's original Ruby version:
</p><pre class="code">
$ time ruby tbray.rb o1000k.ap &gt; /dev/null

real    0m2.447s
user    0m2.123s
sys     0m0.306s

# For 5 million lines, 958.4M size:
$ time ruby tbray.rb o5000k.ap &gt; /dev/null

real    0m12.115s
user    0m10.494s
sys     0m1.473s
</pre>
<p>
Erlang time result on a 2-core 1.86GHz CPU RedHat Linux box, with kernel:<br />
Linux version 2.6.18-1.2798.fc6 (brewbuilder@hs20-bc2-4.build.redhat.com) (gcc version 4.1.1 20061011 (Red Hat 4.1.1-30)) #1 SMP Mon Oct 16 14:37:32 EDT 2006<br /> 
is 7.7 seconds.
</p><p>
Erlang time result on a 2.80GHz 4-CPU Xeon Debian box, with kernel:<br />
Linux version 2.6.15.4-big-smp-tidy (root@test) (gcc version 4.0.3 20060128 (prerelease) (Debian 4.0.2-8)) #1 SMP Sat Feb 25 21:24:23 CST 2006 
</p><p>
The smp result on this 4-CPU computer is questionable. It is only about 50% faster than non-smp, which is even worse than on my 2.0GHz 2-core MacBook. I also tested the <a href="http://www.franklinmint.fm/blog/archives/000792.html">Big Bang</a> on this machine, and its speedup was less than 50% too.
</p><pre class="code">
$ erlc tbray.erl 
$ time erl -noshell -run tbray start o1000k.ap -s erlang halt &gt; /dev/null 

real 0m22.279s 
user 0m21.597s 
sys  0m0.676s 

$ erlc -smp tbray.erl 
$ time erl -smp +S 4 +P 60000 -noshell -run tbray start o1000k.ap -s erlang halt &gt; /dev/null 

real 0m14.765s 
user 0m28.722s 
sys  0m0.840s 
</pre>
<p>
<b>Notice: </b>
</p><ul>
<li>All tests were run several times and the better result is reported, so the state of the disk/IO cache should be similar across runs.</li>
<li>You may need to compile tbray.erl into two different BEAM files, one for the smp version and one for the non-smp version.</li>
<li>If you'd like to process a bigger file, you can use +P processNum to allow more simultaneously alive Erlang processes. For BUFFER_SIZE=4096, set the +P argument to FileSize / 4096 or above. From Erlang's <a href="http://www.erlang.org/doc/efficiency_guide/advanced.html#7.2">Efficiency Guide</a>: <br />
<b>Processes</b><br />
The maximum number of simultaneously alive Erlang processes is by default 32768. This limit can be raised up to at most 268435456 processes at startup (see documentation of the system flag +P in the erl(1) documentation). The maximum limit of 268435456 processes will at least on a 32-bit architecture be impossible to reach due to memory </li>
</ul>
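<p>
The FileSize / 4096 rule of thumb can be sketched as a helper that counts the chunks, and therefore the worker processes, a file will produce. The module name <code>proc_limit</code> and its functions are hypothetical; <code>filelib:file_size/1</code> is the only library call used.
</p><pre class="code">
-module(proc_limit).
-export([chunks/2, chunks_for_file/2]).

%% Number of BufferSize-byte chunks needed to cover FileSize bytes,
%% rounded up. One worker process is spawned per chunk.
chunks(FileSize, BufferSize) when BufferSize &gt; 0 -&gt;
    (FileSize + BufferSize - 1) div BufferSize.

%% The same estimate, taking the size from a file on disk.
chunks_for_file(FileName, BufferSize) -&gt;
    chunks(filelib:file_size(FileName), BufferSize).
</pre>
<p>
For a 200M file, <code>proc_limit:chunks(200 * 1024 * 1024, 4096)</code> is 51200, which is in line with the roughly 50,000 processes and the +P 60000 setting used above. Leave some headroom over this number, since the collector and the system's own processes count towards the limit too.
</p>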
<p>
To evaluate with smp enabled (Erlang/OTP R11B-5 for Windows may not support smp yet):
</p><pre class="code">
erl -smp +P 60000
&gt; tbray:start("o1000k.ap").
</pre>
<p> 
The code: (pretty formatted by <a href="http://blogtrader.net/page/dcaoyuan?entry=erlybird_0_15_1_released">ErlyBird 0.15.1</a>)
</p><pre>
<span class="function-name">-module</span>(tbray).

<span class="function-name">-compile</span>([native]).

<span class="function-name">-export</span>([start/1]).

<span class="comment-delimiter">%% </span><span class="comment">The best Bin Buffer Size is 4096
</span><span class="function-name">-define</span>(<span class="constant">BUFFER_SIZE</span>, 4096). 

<span class="function-name">start</span>(<span class="variable-name">FileName</span>) -&gt;
    <span class="variable-name">Start</span> = now(),

    <span class="variable-name">Main</span> = <span class="keyword">self</span>(),
    <span class="variable-name">Collector</span> = <span class="keyword">spawn</span>(<span class="keyword">fun</span> () -&gt;<span class="function-name"> </span>collect_loop(<span class="variable-name">Main</span>) <span class="keyword">end</span>),

    {ok, <span class="variable-name">File</span>} = file:open(<span class="variable-name">FileName</span>, [raw, binary]),
    read_file(<span class="variable-name">File</span>, <span class="variable-name">Collector</span>),
    
    <span class="comment-delimiter">%% </span><span class="comment">don't terminate, wait here, until all tasks done.
</span>    <span class="keyword">receive</span>
        stop -&gt;<span class="function-name"> </span>io:format(<span class="string">"Time: ~10.2f ms~n"</span>, [timer:now_diff(now(), <span class="variable-name">Start</span>) / 1000])
    <span class="keyword">end</span>.

<span class="function-name">read_file</span>(<span class="variable-name">File</span>, <span class="variable-name">Collector</span>) -&gt;<span class="function-name"> </span>read_file_1(<span class="variable-name">File</span>, [], 0, <span class="variable-name">Collector</span>).
<span class="function-name">read_file_1</span>(<span class="variable-name">File</span>, <span class="variable-name">PrevTail</span>, <span class="variable-name">I</span>, <span class="variable-name">Collector</span>) -&gt;
    <span class="keyword">case</span> file:read(<span class="variable-name">File</span>, ?<span class="constant">BUFFER_SIZE</span>) <span class="keyword">of</span>
        eof -&gt;
            <span class="variable-name">Collector</span> ! {chunk_num, <span class="variable-name">I</span>},
            file:close(<span class="variable-name">File</span>);
        {ok, <span class="variable-name">Bin</span>} -&gt;<span class="function-name"> </span>
            {<span class="variable-name">Data</span>, <span class="variable-name">NextTail</span>} = split_on_last_newline(<span class="variable-name">PrevTail</span> ++ <span class="keyword">binary_to_list</span>(<span class="variable-name">Bin</span>)),
            <span class="keyword">spawn</span>(<span class="keyword">fun</span> () -&gt;<span class="function-name"> </span><span class="variable-name">Collector</span> ! {dict, scan_lines(<span class="variable-name">Data</span>)} <span class="keyword">end</span>),
            read_file_1(<span class="variable-name">File</span>, <span class="variable-name">NextTail</span>, <span class="variable-name">I</span> + 1, <span class="variable-name">Collector</span>)
    <span class="keyword">end</span>.

<span class="function-name">split_on_last_newline</span>(<span class="variable-name">List</span>) -&gt;<span class="function-name"> </span>split_on_last_newline_1(lists:reverse(<span class="variable-name">List</span>), []).
<span class="function-name">split_on_last_newline_1</span>(<span class="variable-name">List</span>, <span class="variable-name">Tail</span>) -&gt;
    <span class="keyword">case</span> <span class="variable-name">List</span> <span class="keyword">of</span>
        []         -&gt;<span class="function-name"> </span>{[], <span class="variable-name">Tail</span>};
        [<span class="string">$\n</span>|<span class="variable-name">Rest</span>] -&gt;<span class="function-name"> </span>{lists:reverse(<span class="variable-name">Rest</span>), <span class="variable-name">Tail</span>};
        [<span class="variable-name">C</span>|<span class="variable-name">Rest</span>]   -&gt;<span class="function-name"> </span>split_on_last_newline_1(<span class="variable-name">Rest</span>, [<span class="variable-name">C</span> | <span class="variable-name">Tail</span>])
    <span class="keyword">end</span>.

<span class="function-name">collect_loop</span>(<span class="variable-name">Main</span>) -&gt;<span class="function-name"> </span>collect_loop_1(<span class="variable-name">Main</span>, dict:new(), undefined, 0).
<span class="function-name">collect_loop_1</span>(<span class="variable-name">Main</span>, <span class="variable-name">Dict</span>, <span class="variable-name">ChunkNum</span>, <span class="variable-name">ChunkNum</span>) -&gt;
    print_result(<span class="variable-name">Dict</span>),
    <span class="variable-name">Main</span> ! stop;
<span class="function-name">collect_loop_1</span>(<span class="variable-name">Main</span>, <span class="variable-name">Dict</span>, <span class="variable-name">ChunkNum</span>, <span class="variable-name">ProcessedNum</span>) -&gt;
    <span class="keyword">receive</span>
        {chunk_num, <span class="variable-name">ChunkNumX</span>} -&gt;<span class="function-name"> </span>
            collect_loop_1(<span class="variable-name">Main</span>, <span class="variable-name">Dict</span>, <span class="variable-name">ChunkNumX</span>, <span class="variable-name">ProcessedNum</span>);
        {dict, <span class="variable-name">DictX</span>} -&gt;<span class="function-name"> </span>
            <span class="variable-name">Dict1</span> = dict:merge(<span class="keyword">fun</span> (<span class="variable-name">_</span>, <span class="variable-name">V1</span>, <span class="variable-name">V2</span>) -&gt;<span class="function-name"> </span><span class="variable-name">V1</span> + <span class="variable-name">V2</span> <span class="keyword">end</span>, <span class="variable-name">Dict</span>, <span class="variable-name">DictX</span>),
            collect_loop_1(<span class="variable-name">Main</span>, <span class="variable-name">Dict1</span>, <span class="variable-name">ChunkNum</span>, <span class="variable-name">ProcessedNum</span> + 1)
    <span class="keyword">end</span>.
    
<span class="function-name">print_result</span>(<span class="variable-name">Dict</span>) -&gt;
    <span class="variable-name">SortedList</span> = lists:reverse(lists:keysort(2, dict:to_list(<span class="variable-name">Dict</span>))),
    [io:format(<span class="string">"~p\t: ~s~n"</span>, [<span class="variable-name">V</span>, <span class="variable-name">K</span>]) || {<span class="variable-name">K</span>, <span class="variable-name">V</span>} &lt;- lists:sublist(<span class="variable-name">SortedList</span>, 10)].

<span class="function-name">scan_lines</span>(<span class="variable-name">List</span>) -&gt;<span class="function-name"> </span>scan_lines_1(<span class="variable-name">List</span>, [], dict:new()).
<span class="function-name">scan_lines_1</span>(<span class="variable-name">List</span>, <span class="variable-name">Line</span>, <span class="variable-name">Dict</span>) -&gt;<span class="function-name"> </span>
    <span class="keyword">case</span> <span class="variable-name">List</span> <span class="keyword">of</span>
        [] -&gt;<span class="function-name"> </span>match_and_update_dict(lists:reverse(<span class="variable-name">Line</span>), <span class="variable-name">Dict</span>);
        [<span class="string">$\n</span>|<span class="variable-name">Rest</span>] -&gt;
            scan_lines_1(<span class="variable-name">Rest</span>, [], match_and_update_dict(lists:reverse(<span class="variable-name">Line</span>), <span class="variable-name">Dict</span>));
        [<span class="variable-name">C</span>|<span class="variable-name">Rest</span>] -&gt;
            scan_lines_1(<span class="variable-name">Rest</span>, [<span class="variable-name">C</span> | <span class="variable-name">Line</span>], <span class="variable-name">Dict</span>)
    <span class="keyword">end</span>.

<span class="function-name">match_and_update_dict</span>(<span class="variable-name">Line</span>, <span class="variable-name">Dict</span>) -&gt;
    <span class="keyword">case</span> process_match(<span class="variable-name">Line</span>) <span class="keyword">of</span>
        false -&gt;<span class="function-name"> </span><span class="variable-name">Dict</span>;
        {true, <span class="variable-name">Word</span>} -&gt;<span class="function-name"> </span>
            dict:update_counter(<span class="variable-name">Word</span>, 1, <span class="variable-name">Dict</span>)
    <span class="keyword">end</span>.
    
<span class="function-name">process_match</span>(<span class="variable-name">Line</span>) -&gt;
    <span class="keyword">case</span> <span class="variable-name">Line</span> <span class="keyword">of</span>
        [] -&gt;<span class="function-name"> </span>false;
        <span class="string">"GET /ongoing/When/"</span>++[<span class="variable-name">_</span>,<span class="variable-name">_</span>,<span class="variable-name">_</span>,<span class="string">$x</span>,<span class="string">$/</span>,<span class="variable-name">Y1</span>,<span class="variable-name">Y2</span>,<span class="variable-name">Y3</span>,<span class="variable-name">Y4</span>,<span class="string">$/</span>,<span class="variable-name">M1</span>,<span class="variable-name">M2</span>,<span class="string">$/</span>,<span class="variable-name">D1</span>,<span class="variable-name">D2</span>,<span class="string">$/</span>|<span class="variable-name">Rest</span>] -&gt;<span class="function-name"> </span>
            <span class="keyword">case</span> match_until_space(<span class="variable-name">Rest</span>, []) <span class="keyword">of</span>
                [] -&gt;<span class="function-name"> </span>false;
                <span class="variable-name">Word</span> -&gt;<span class="function-name"> </span>{true, [<span class="variable-name">Y1</span>,<span class="variable-name">Y2</span>,<span class="variable-name">Y3</span>,<span class="variable-name">Y4</span>,<span class="string">$/</span>,<span class="variable-name">M1</span>,<span class="variable-name">M2</span>,<span class="string">$/</span>,<span class="variable-name">D1</span>,<span class="variable-name">D2</span>,<span class="string">$/</span>] ++ <span class="variable-name">Word</span>}
            <span class="keyword">end</span>;
        [<span class="variable-name">_</span>|<span class="variable-name">Rest</span>] -&gt;<span class="function-name"> </span>
            process_match(<span class="variable-name">Rest</span>)
    <span class="keyword">end</span>.
    
<span class="function-name">match_until_space</span>(<span class="variable-name">List</span>, <span class="variable-name">Word</span>) -&gt;
    <span class="keyword">case</span> <span class="variable-name">List</span> <span class="keyword">of</span>
        [] -&gt;<span class="function-name"> </span>[];
        [<span class="string">$.</span>|<span class="variable-name">_</span>] -&gt;<span class="function-name"> </span>[];
        [<span class="string">$ </span>|<span class="variable-name">_</span>] -&gt;<span class="function-name"> </span>lists:reverse(<span class="variable-name">Word</span>);
        [<span class="variable-name">C</span>|<span class="variable-name">Rest</span>] -&gt;<span class="function-name"> </span>match_until_space(<span class="variable-name">Rest</span>, [<span class="variable-name">C</span> | <span class="variable-name">Word</span>])
    <span class="keyword">end</span>.
</pre><p>
Lessons learnt:
</p><ul>
<li>Split a large binary into properly sized chunks, then convert them to lists for further processing</li>
<li>Parallelize the most expensive part (of course)</li>
<li>We need a new, or more complete, "Efficient Erlang" guide</li>
</ul>
          ]]>
        </description>
        <pubDate>Mon, 15 Oct 2007 15:37:06 +0800</pubDate>
        <link>http://dcaoyuan.javaeye.com/blog/133018</link>
        <guid>http://dcaoyuan.javaeye.com/blog/133018</guid>
      </item>
      <item>
        <title>It Will Be My First Attendance at NetBeans Day, Seattle</title>
        <author>dcaoyuan</author>
        <description>
          <![CDATA[
          <br/>
          Author: <a href="http://dcaoyuan.javaeye.com">dcaoyuan</a>&nbsp;
          Link: <a href="http://dcaoyuan.javaeye.com/blog/133019" style="color:red;">http://dcaoyuan.javaeye.com/blog/133019</a>&nbsp;
          Published: October 13, 2007
          <br/><br/>
<p>I will be there: <a href="http://developers.sun.com/events/techdays/2006/US_SEA.jsp">Sun Tech Days, Seattle</a>, Sep 6, 2006. As I'm now in Vancouver, it's about a two- or three-hour trip to Seattle. 
</p><p>I'm glad to have a chance to meet the great people who develop the NetBeans IDE and Platform. As you know, AIOTrade (formerly Humai Trader) is built on <a href="http://platform.netbeans.org/">NetBeans Platform</a> using <a href="http://www.netbeans.org">NetBeans IDE</a>.
</p><p>
I've also committed the re-packaged source code to the SVN repository on sourceforge.net, and I'm cleaning up the neural network code, which I hope to commit within a week.
</p><p>
The neural network module still needs a lot of UI work. I've begun hacking on the <a href="http://graph.netbeans.org/">Visual Library API of NetBeans</a>, and hope to apply it to visual neural network definition. 
</p><p></p>
          ]]>
        </description>
        <pubDate>Sat, 13 Oct 2007 14:46:55 +0800</pubDate>
        <link>http://dcaoyuan.javaeye.com/blog/133019</link>
        <guid>http://dcaoyuan.javaeye.com/blog/133019</guid>
      </item>
      <item>
        <title>Ruby IDE for NetBeans Almost Useful</title>
        <author>dcaoyuan</author>
        <description>
          <![CDATA[
          <br/>
          Author: <a href="http://dcaoyuan.javaeye.com">dcaoyuan</a>&nbsp;
          Link: <a href="http://dcaoyuan.javaeye.com/blog/133020" style="color:red;">http://dcaoyuan.javaeye.com/blog/133020</a>&nbsp;
          Published: October 13, 2007
          <br/><br/>
          <p>
Now that NetBeans IDE 6.0M7 has been released, I tried its Ruby module, and it's almost useful now.
</p><p>
To get and install it:<br />
<br />
1. Download NetBeans IDE 6.0M7 from:<br />
<a href="http://www.netbeans.info/downloads/dev.php" target="_blank">http://www.netbeans.info/downloads/dev.php</a><br />
Select 'Q-Build' and download the newest M7<br />
<br />
2. Update the Ruby modules:<br />
1) [Tools] -&gt; [Update Center]<br />
2) Select the Ruby folder (9 files will be selected)<br />
3) Follow the instructions. <br />
<br />
3. Set your Ruby environment:<br />
The default installation uses JRuby. If you want to use c-ruby instead, go to<br />
1) [Tools]-&gt;[Options]-&gt;Miscellaneous-&gt;Ruby Installation<br />
2) Point all the Ruby tools to your own installation<br />
<br />
4. Now set up your first Ruby on Rails Application: <br />
1) [File]-&gt;[New Project]-&gt;Ruby-&gt;Ruby on Rails Application<br />
2) If you have an existing project, copy it over the newly created project tree, overwriting the generated files.<br />
<p>
Want to take a look at a screenshot? Here it is:<br />
<a href="http://blogs.sun.com/tor/entry/netbeans_and_ruby_is_true">NetBeans + Ruby = True</a>
</p><p>
That's all. Have fun with great NetBeans.</p><p> 

<b>Notice: </b> If you are using c-ruby, don't try to run the project via NetBeans' "run main project" button, which may change your environment temporarily.
</p><p></p></p>
          ]]>
        </description>
        <pubDate>Sat, 13 Oct 2007 14:46:06 +0800</pubDate>
        <link>http://dcaoyuan.javaeye.com/blog/133020</link>
        <guid>http://dcaoyuan.javaeye.com/blog/133020</guid>
      </item>
      <item>
        <title>Erlang Editor for NetBeans - ErlyBird 0.10.1 released</title>
        <author>dcaoyuan</author>
        <description>
          <![CDATA[
          <br/>
          Author: <a href="http://dcaoyuan.javaeye.com">dcaoyuan</a>&nbsp;
          Link: <a href="http://dcaoyuan.javaeye.com/blog/133021" style="color:red;">http://dcaoyuan.javaeye.com/blog/133021</a>&nbsp;
          Published: October 13, 2007
          <br/><br/>
          <p>
<b>Update - Mar 29, 2007:</b> If you get the exception java.lang.reflect.InvocationTargetException when trying completion, please check the version number of your "Generic Languages Framework" module (Tools -&gt; Module Manager -&gt; Language Support). If the version number is lower than 1.70, go to <a href="http://sourceforge.net/projects/erlybird/">http://sourceforge.net/projects/erlybird</a>
to download and update to the newly built <b>org-netbeans-modules-languages.nbm</b>. 
</p><p>
</p><p>
I'm pleased to announce that ErlyBird 0.10.1, an Erlang Editor Module for NetBeans, has been released. 
</p><p>
Current features: <br />
</p><ul>
<li>Syntax checking; </li>
<li>Syntax highlighting; </li>
<li>Functions navigator; </li>
<li>Code-folding; </li>
<li>Indentation; </li>
<li>Built-in function completion.</li> 
</ul>
<p>
You can download ErlyBird from <a href="http://sourceforge.net/projects/erlybird/">http://sourceforge.net/projects/erlybird</a>
</p><p>
ErlyBird needs NetBeans IDE 6.0 M7+, which can be downloaded via:<br /> 
<a href="http://www.netbeans.info/downloads/dev.php">http://www.netbeans.info/downloads/dev.php page</a><br />
select Q-Build in 'Build Type'. 
</p><p>
 
After NetBeans IDE is installed, go to Tools-&gt;Update Center and fetch the "Generic Language Framework" module from the "Languages Support" category. 
</p><p> 
To install ErlyBird module, unzip the binary package first, then:<br />
</p><ol> 
<li>From menu: Tools -&gt; Update Center </li>
<li>In the "Select Location of Modules" pane, click "Install Manually Downloaded Modules(.nbm Files)", then "Next" </li>
<li>Click the [Add...] button and browse to the unzipped .nbm file. </li>
<li>Follow the instructions to install the updated modules. </li>
<li>Restart NetBeans. </li>
</ol>
<br />
It may not be stable yet, feedback and bug reports are welcome.
<p></p>
          ]]>
        </description>
        <pubDate>Sat, 13 Oct 2007 14:45:39 +0800</pubDate>
        <link>http://dcaoyuan.javaeye.com/blog/133021</link>
        <guid>http://dcaoyuan.javaeye.com/blog/133021</guid>
      </item>
      <item>
        <title>Go to Declaration of Function call and Var in Erlang Editor for Netbeans</title>
        <author>dcaoyuan</author>
        <description>
          <![CDATA[
          <br/>
          Author: <a href="http://dcaoyuan.javaeye.com">dcaoyuan</a>&nbsp;
          Link: <a href="http://dcaoyuan.javaeye.com/blog/133022" style="color:red;">http://dcaoyuan.javaeye.com/blog/133022</a>&nbsp;
          Published: October 13, 2007
          <br/><br/>
<p>I've got "Go to declaration of function call and var" working when the declarations are in the same module file, as well as "Highlighting for function call/function arguments".</p><p>
To go to the declaration of a function call or var, hold down "Ctrl", move the cursor over the function call or var name, then click on it. The editor will jump to the source position of the declaration.</p><p>
Getting cross-module "Go to declaration of function call" working will take much more work.
</p><p>BTW, Erlang project management is also under development. Until that feature is released, the only way to create a managed project in NetBeans is to create a Java project tree and use it.</p><p>
</p><p> 
Click on the picture to enlarge it
</p><p>
<a href="/resources/dcaoyuan/erlang_editor_070403.png"><img src="/resources/dcaoyuan/erlang_editor_070403.png" height="450" alt="nn" width="600" /></a>
</p>
          ]]>
        </description>
        <pubDate>Sat, 13 Oct 2007 14:45:15 +0800</pubDate>
        <link>http://dcaoyuan.javaeye.com/blog/133022</link>
        <guid>http://dcaoyuan.javaeye.com/blog/133022</guid>
      </item>
      <item>
        <title>Some Tips for Upgrading to Rails 1.2.x</title>
        <author>dcaoyuan</author>
        <description>
          