blog.xam.de

Max Völkel on Personal Knowledge Management, Wikis, Semantic Web, and personal ideas.

6.28.2010

Java copies files 80 times faster than commercial backup software

I cannot believe what I saw today. I bought a brand new hard drive, a 500 GB 2.5" USB hard disk. It's a Western Digital My Passport Essential. It was about 15 Euros more expensive than another 500 GB 2.5" USB drive from Western Digital - so the difference must be the software. I thought: Okay, let's try this backup software, I really need one.
So I installed everything and ran my first backup. After almost half an hour I took a look at the result. I was shocked. At this speed, the full backup would take days! I started a self-written Java program called syncr, whose sole purpose is to copy files fast. Syncr was over 80 times faster. Of course, the comparison is not entirely fair. The software from Western Digital, called WD SmartWare 1.2.0.8, Copyright 2009 by Western Digital, can do much more than my hand-written syncr. The table below lists all differences. Nevertheless, 80 times is really a lot.

Syncr uses very few tricks to achieve this speed:
  • It does not copy redundant information. However, the speedup was measured only on those files that were actually copied.
    • It skips redundant folders with these names: "Temp", "RECYCLER", "System Volume Information", "Google Desktop"
    • It always skips files named "hiberfil.sys", "pagefile.sys", "Thumbs.db", "autorun.inf", "UsrClass.dat", "UsrClass.dat.LOG", "ntuser.dat.LOG", "ntuser.dat", "parent.lock"
    • It does not copy files that are already present on the target drive with the same name and change date
  • It writes certain kinds of directories directly into a zip file on the target drive - this is much faster than copying first and zipping later, and even faster than not zipping at all, because many small files cost a lot of file system operations. By default, only Eclipse projects and Eclipse workspaces are zipped. Overall, not many files are zipped.
  • It uses Java NIO, which is faster than the old java.io.
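The skip rules and the "same name and change date" check from the list above can be sketched roughly as follows. This is not Syncr's actual source: the class and method names are made up for illustration, and it uses the java.nio.file API, which only arrived with Java 7 and is therefore newer than this post. Comparing change dates at whole-second granularity is also an assumption, chosen because file systems store timestamps with different precision.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Set;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of an incremental copy; not Syncr's real implementation. */
final class IncrementalCopySketch {

    // Folder and file names taken from the list in the post.
    static final Set<String> SKIPPED_DIRS = Set.of(
            "Temp", "RECYCLER", "System Volume Information", "Google Desktop");
    static final Set<String> SKIPPED_FILES = Set.of(
            "hiberfil.sys", "pagefile.sys", "Thumbs.db", "autorun.inf", "UsrClass.dat",
            "UsrClass.dat.LOG", "ntuser.dat.LOG", "ntuser.dat", "parent.lock");

    /** True if the target exists and has the same change date (to the second, an assumption). */
    static boolean isUpToDate(Path source, Path target) throws IOException {
        return Files.isRegularFile(target)
                && Files.getLastModifiedTime(source).to(TimeUnit.SECONDS)
                == Files.getLastModifiedTime(target).to(TimeUnit.SECONDS);
    }

    /** Copies sourceRoot into targetRoot and returns the number of files actually copied. */
    static int sync(Path sourceRoot, Path targetRoot) throws IOException {
        int[] copied = {0};
        Files.walkFileTree(sourceRoot, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
                    throws IOException {
                Path name = dir.getFileName();
                if (name != null && SKIPPED_DIRS.contains(name.toString())) {
                    return FileVisitResult.SKIP_SUBTREE; // never descend into skipped folders
                }
                Files.createDirectories(targetRoot.resolve(sourceRoot.relativize(dir)));
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Path target = targetRoot.resolve(sourceRoot.relativize(file));
                if (!SKIPPED_FILES.contains(file.getFileName().toString())
                        && !isUpToDate(file, target)) {
                    // COPY_ATTRIBUTES keeps the change date, so the next run can skip this file.
                    Files.copy(file, target, StandardCopyOption.REPLACE_EXISTING,
                            StandardCopyOption.COPY_ATTRIBUTES);
                    copied[0]++;
                }
                return FileVisitResult.CONTINUE;
            }
        });
        return copied[0];
    }
}
```

Calling `IncrementalCopySketch.sync(source, target)` twice in a row should copy everything on the first run and nothing on the second, which is where an incremental backup saves its time.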
All in all I cannot explain the slow speed of WD SmartWare nor the fast speed of Syncr. As a side remark: Syncr is also faster than the Windows Explorer (which is why Syncr was written in the first place).
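The zip trick from the list above (writing a folder straight into one archive on the target drive instead of creating many small files there) might look roughly like this. It is a minimal sketch using only java.util.zip and java.nio.file; the class name is hypothetical and the code is not taken from Syncr.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical sketch: stream a source folder directly into a zip file on the target. */
final class ZipDirectlySketch {

    /** Writes every regular file below sourceDir into zipFile and returns the entry count. */
    static int zipDirectory(Path sourceDir, Path zipFile) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(sourceDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(zipFile);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path file : files) {
                // Zip entry names use forward slashes regardless of platform.
                String entryName = sourceDir.relativize(file).toString().replace('\\', '/');
                zip.putNextEntry(new ZipEntry(entryName));
                Files.copy(file, zip); // source bytes go straight into the archive
                zip.closeEntry();
            }
        }
        return files.size();
    }
}
```

The target drive then sees a single sequential write instead of one create-and-close cycle per small file, which is the effect described in the post.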
    Because of all this, I released Syncr into the wild today. Enjoy.

    Quick facts about Syncr:

    Feature | WD SmartWare | Syncr
    Business model | Commercial | Open source (BSD)
    Front-end | graphical | Java source code, run from Eclipse
    Installer | Windows, Mac | None
    Languages | 28 | English only
    Usability | very easy | easy for Java developers of any level, impossible for non-developers
    Design | Aesthetic visualisation of current hard drive content | Accurate log messages
    Backup storage format | Special folder system that contains the original files plus additional .dcm files | Original folder structure with original files. Some folders are automatically stored as .zip archives.
    Product URL | http://www.wdc.com/en/products/wdsmartware/ | http://code.google.com/p/syncr/
    Versioning (can get back an older version of a file) | Yes | No
    Incremental backup | Yes | Yes
    Constant incremental background backup | Yes | No
    Experimental results
    CPU usage | 20-40% | 20-40%
    Files created to back up 100 files | ca. 200 | 100 or just one zip file
    Memory usage | 164 MB | 10-400 MB
    Files copied after first 25 minutes | 4200 files | 5152 files (1 resulting zip file counting as 1 file)
    Data copied after first 25 minutes | 136 MB | 11 GB
    Estimated time for a full backup (of my 88 GB) | 11 days, 5.6 hours | 3.4 hours
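The estimates in the last rows follow from a simple extrapolation of the 25-minute measurement. Here is a small sketch of that arithmetic, assuming 1 GB = 1000 MB and a constant copy rate (both assumptions, and the class name is made up):

```java
/** Back-of-the-envelope check of the table figures; not part of Syncr. */
final class BackupEstimateSketch {

    /** Hours a full backup takes if throughput stays at the measured rate. */
    static double hoursForFullBackup(double totalMb, double copiedMb, double minutes) {
        return totalMb / copiedMb * minutes / 60.0;
    }

    public static void main(String[] args) {
        double totalMb = 88_000; // 88 GB, assuming 1 GB = 1000 MB
        double minutes = 25;
        double wdHours = hoursForFullBackup(totalMb, 136, minutes);       // WD SmartWare: 136 MB
        double syncrHours = hoursForFullBackup(totalMb, 11_000, minutes); // Syncr: 11 GB
        System.out.printf("WD SmartWare: %.1f hours (%.1f days)%n", wdHours, wdHours / 24);
        System.out.printf("Syncr:        %.1f hours%n", syncrHours);
        System.out.printf("Speed-up:     %.0fx%n", wdHours / syncrHours);
    }
}
```

With the rounded values from the table this gives a bit over 11 days versus a little over 3 hours, a factor of roughly 80.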

    6.17.2010

    Unifying Slf4j, gwt-log, GWT, AppEngine, Part II

    After some attempts to GWTify the existing slf4j-api, it turns out that its approach of linking API and implementation will not work in GWT, because it relies on reflection.

    Hence, we use a primitive solution without any magic. The API provides the class LoggerFactory and a log implementation is required to register itself via a static method.
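A minimal sketch of what such a "no magic" registration could look like. Only the class name LoggerFactory comes from the post; the `Logger` and `LoggerProvider` interfaces, the `setProvider` method and the in-memory implementation are hypothetical names chosen for illustration, not the actual code base mentioned here.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal logging API; illustrative only, not the real slf4j interface. */
interface Logger {
    void info(String msg);
}

/** What a log implementation has to supply (hypothetical name). */
interface LoggerProvider {
    Logger getLogger(String name);
}

/** API entry point: no classpath scanning and no reflection. */
final class LoggerFactory {
    private static LoggerProvider provider;

    private LoggerFactory() {}

    /** Called once by the log implementation, e.g. from the application's entry point. */
    static void setProvider(LoggerProvider p) {
        provider = p;
    }

    static Logger getLogger(String name) {
        if (provider == null) {
            throw new IllegalStateException("No log implementation registered");
        }
        return provider.getLogger(name);
    }
}

/** Toy implementation that collects messages in memory. */
final class MemoryLoggerProvider implements LoggerProvider {
    final List<String> messages = new ArrayList<>();

    @Override
    public Logger getLogger(final String name) {
        return new Logger() {
            @Override
            public void info(String msg) {
                messages.add(name + ": " + msg);
            }
        };
    }
}
```

Application code would first call `LoggerFactory.setProvider(new MemoryLoggerProvider())` and from then on only `LoggerFactory.getLogger("...")`, so the binding is an ordinary static call that a compiler without reflection support can follow.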

    Finally I managed to write a first working code base for this. A description about how to use it can be found here.

    Unifying Slf4j, gwt-log, GWT, AppEngine, Part I

    I like the API of SLF4J very much and have used it in all my code so far.

    Can I re-use the same API for code that runs via GWT in the browser?


    Experiment:
    • Check out source code from http://github.com/ceki/slf4j from trunk on 2010-06-17
    • Add GWT dependencies in Maven, add gwt-maven plugin in POM
    • Create a dummy entry point and a gwt.xml file
    • Run mvn gwt:compile
    Result:

    [INFO] Compiling module org.slf4j.GWTTest
    [INFO]    Validating newly compiled units
    [INFO]       [ERROR] Errors in 'file:/O:/2010/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java'
    [INFO]          [ERROR] Line 28: The import java.net cannot be resolved
    [INFO]          [ERROR] Line 207: The method getClassLoader() is undefined for the type Class
    [INFO]          [ERROR] Line 217: URL cannot be resolved to a type
    [INFO]          [ERROR] Line 217: URL cannot be resolved to a type
    [INFO]       [ERROR] Errors in 'file:/O:/2010/slf4j-api/src/main/java/org/slf4j/helpers/MessageFormatter.java'
    [INFO]          [ERROR] Line 27: The import java.text cannot be resolved
    [INFO]       [ERROR] Errors in 'file:/O:/2010/slf4j-api/src/main/java/org/slf4j/MDC.java'
    [INFO]          [ERROR] Line 74: No source code is available for type java.lang.NoClassDefFoundError; did you forget to inherit a required module?
    [INFO]       [ERROR] Errors in 'file:/O:/2010/slf4j-api/src/main/java/org/slf4j/MarkerFactory.java'
    [INFO]          [ERROR] Line 53: No source code is available for type java.lang.NoClassDefFoundError; did you forget to inherit a required module?
    [INFO]       [ERROR] Errors in 'file:/O:/2010/slf4j-api/src/main/java/org/slf4j/helpers/BasicMDCAdapter.java'
    [INFO]          [ERROR] Line 47: No source code is available for type java.lang.InheritableThreadLocal; did you forget to inherit a required module?
    [INFO]       [ERROR] Errors in 'file:/O:/2010/slf4j-api/src/main/java/org/slf4j/helpers/NamedLoggerBase.java'
    [INFO]          [ERROR] Line 40: No source code is available for type java.io.ObjectStreamException; did you forget to inherit a required module?

    These errors show that SLF4j is not GWT-ready by itself. Now there are two options:
    • Create a drop-in replacement that implements some parts differently
    • Create a new SLF4j-like API which IS GWT-ready
    The second option is probably less work, but the first option would be easier to use. I'll keep you updated.

    5.17.2010

    2nd workshop on Personal Knowledge Management

    Call for papers: 2nd workshop on Personal Knowledge Management at Mensch und Computer 2010, September 12, Duisburg, Germany.
    • June 13, 2010 (Sun.) Deadline for paper submission
    • July 4, 2010 (Sun.)    Notification to authors
    • July 14, 2010 (Wed.)  Camera-ready copy due
    • September 12, 2010  Workshop

    5.13.2010

    Runtime optimisation on AppEngine for Java (GAE/J)

    1. Install AppStats from http://code.google.com/appengine/docs/java/tools/appstats.html
    It's basically a servlet provided by Google that intercepts GAE's inner workings and logs them to your application's data store. It comes with a servlet that is able to display the data in nice ways. Leave all settings as described.

    2. Log into the appstats servlet and keep the browser open

    3. Open the Firebug network tab

    4. Make a request to your appengine application

    5. In the response headers, you will see entries about the cost and CPU usage. The cost is for 1000 requests; the usage numbers are for this request. Sample:



    Google explains:
    If you access your site while signed in using an administrator account, App Engine includes per-request statistics in the response headers. The header X-AppEngine-Estimated-CPM-US-Dollars represents an estimate of what 1,000 requests similar to this request would cost in US dollars.

    6. Go into Appstats and click to see your last request. You get a timeline diagram showing each point at which you called the data store. Click on these points to see the stack trace that called them.

    7. Use Firebug to determine the page loading time from a user's point of view.

    Our appengine application remains active for ca. 20 seconds before the next request causes another cold start (clearly distinguishable by page loading times; times measured by hand with a stop watch). A cold start every 20 seconds means, in the worst case, 4320 cold starts per day. Given a cold start cost of $0.00377529 for 12 seconds of startup time, we end up with ca. $16 per day (4320 × $0.00377529), or roughly $490 per 30-day month. All other requests certainly hit a warm JVM and are hence much cheaper.

    There is also a little Firefox plugin to display the costs in the Firefox error console.

    3.21.2010

    Open Letter to Alfresco [update 2]

    Original Post (see updates at the end):

    Dear Alfresco,

      I am trying to buy something from you. I downloaded the source code from your site, and I am pleased with what I see. In order to embed JLAN in my product, I need to know the price of your product. I want to embed your JLAN product [1]. On the product site you write:

    Alfresco offers either an end user and developer license that allows companies to use Alfresco JLAN for internal deployments, or an OEM license that allows you to add Alfresco JLAN features and functionality to your products.  For more information please contact jlan@alfresco.com.

    Ok, so I wrote an email to that address, asking for a quote. No response.
    I also asked the same question at the German sales email contact as listed at [2], twice. Additionally, I called the German hotline number, twice, but only got an answering machine during normal business hours. Also no reaction from the US sales email, twice.

    After these seven (!) attempts to just get a price quote from you, I feel lost.
    But customer service can get even worse!
    Now I get emails like these:
    • Besuchen Sie Alfresco auf der CeBIT
    • Doing More with your Alfresco Trial. 30-Day Full Enterprise Download Trial.
    • Webinar: Alfresco-SAP-Anbindung in der Praxis, 18 März 2010 um 16:00
    • Upcoming Webinars: Alfresco Share Customization and SAP Integration
    Apparently you have neither an idea what language I speak, nor a clue what I want from you. I do not care about SAP integration.

    Why do you spam your potential customers instead of selling your software?

    [1] http://www.alfresco.com/products/aifs/
    [2] http://www.alfresco.com/about/contact/

    Regards,
    Max Völkel

    Update 1:
    On 22.03.2010 at 10 am, I got an email from Alfresco with various contact details. Whatever caused their hiccup, the Web 2.0 seems to have worked better than their internal structure, this time. Maybe more and more future communication will turn into Web 2.0 social network communication?

    Unfortunately, I cannot predict my future sales numbers, and I had to learn that OEM business does not work the way I expected. I expected a simple price table for 1-10 licenses, 11-100, 101-1000, and another per-license price for more than 1000. This is not the case. Alfresco wants to calculate my business case, determine how much of the value is contributed by their software, and then come up with a quote. This makes sense for them as a way to obtain a fair price for their software. On the other hand, for smaller players like me, this means I have no way to start using Alfresco software.

    As all the code is open source anyway, all they would need to do is set up a contract valid for, let's say, up to 10,000 licenses and let me pay by credit card for the number of licenses used at the end of each quarter.

    Right now I could secretly take the source code, use it, and get sued (rightly so!). In the future, I could pay for a number of licenses and, if I had not paid enough, get sued in the same way. So there is almost no overhead for them to earn more money. They can just set up a self-service web shop for licenses and continue to sue people without proper licenses.

    Does anybody know of another CIFS/Samba server written in Java that can be embedded for a reasonable license fee even for a small or unknown number of licenses?

    At the end of the week, Alfresco will tell me more details about licensing options with them. I'll keep you updated.

    Update 2:
    Today, 26.03.2010, we received our final verdict from Alfresco: We are too small to talk to. Of course, they did not use these words. But they are definitely not interested in negotiating a deal with unknown results, most likely because their negotiation process is too costly. Their product is really nice: they have an SMB/CIFS+WebDAV server in Java called JLAN, which fits our current APIs really well. But since it's either GPL (killing all our potential business models) or a minimum deal of ca. 10,000 Euros, we'll now need to go ahead and continue to implement our own WebDAV server. So SemFS will not have CIFS support in the near future, but we will open-source (BSD license) our WebDAV server. We looked at Slide (project is dead), Jackrabbit's WebDAV (really complicated OO-architecture with way too many interfaces), and others.

    3.03.2010

    Logging in the Google AppEngine for Java (GAE/J) with slf4j, log4j and jul - and using Maven

    Logging is an invaluable tool for debugging web applications, especially if you can't run them in debug mode. Developing an application on Google AppEngine lets you test and debug it locally prior to uploading it to the real server.
    On the real appengine, you cannot run JUnit tests (at least I don't know how. Maybe you can somehow.) and you cannot run a remote debug session. Therefore you need logging. It's your only way to debug things that are different on the real appengine compared to the local test stub environment. And there are differences.

    AppEngine uses java.util.logging (also often called j.u.l. or just jul). As the blog from Fred Sauer (the man behind gwt-log) shows, the AppEngine web console nicely displays the different log levels:
    • DEBUG
    • INFO
    • WARN
    • ERROR
    • CRITICAL
    AppEngine uses jul for logging, although ironically jul defines different log levels, namely:
    • FINEST (lowest value)
    • FINER
    • FINE
    • CONFIG
    • INFO
    • WARNING
    • SEVERE (highest value) 
    The mapping to the GAE levels is not completely clear to me yet.
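For reference, the ordering of the jul levels listed above can be checked directly against java.util.logging.Level. The sketch below only prints jul's own numeric values; it makes no claim about how AppEngine maps them to its five levels, and the class name is made up.

```java
import java.util.logging.Level;

/** Prints the standard jul levels from lowest to highest; illustrative only. */
final class JulLevelsSketch {

    static final Level[] LEVELS = {
        Level.FINEST, Level.FINER, Level.FINE, Level.CONFIG,
        Level.INFO, Level.WARNING, Level.SEVERE
    };

    /** True if each level in LEVELS has a higher numeric value than the one before it. */
    static boolean isAscending() {
        for (int i = 1; i < LEVELS.length; i++) {
            if (LEVELS[i].intValue() <= LEVELS[i - 1].intValue()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (Level level : LEVELS) {
            System.out.println(level.getName() + " = " + level.intValue());
        }
    }
}
```

A jul logger configured with a given level lets through every record whose numeric value is at least that of the configured level.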
    To add more confusion to the mix, most code projects that I start today use slf4j, the Simple Logging Facade for Java. Slf4j allows you to delegate to another logging framework and involves almost zero runtime overhead. Underneath I usually use log4j, an old, mature logging framework.

    Clickable hyperlinks in log output
    Log4j has nice things such as a PatternLayout. If you use "(%F:%L)" in your pattern, you get "(Customer.java:33)", which your Eclipse console parses and turns into a clickable link to the java file "Customer.java" in line 33. That is a very quick way from an unexpected log message to you fixing it: One click!
    Be aware that line numbers in source code are rather expensive to generate, so you should not use this pattern in a production environment. However, you can simply use another configuration file, how nice.
    There are other reasons for using a framework such as log4j instead of just jul.
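To see why the "(%F:%L)" location is expensive, here is a minimal sketch of one way a logging framework can find out where a log call came from: by creating a Throwable and reading its stack trace on every call. The class and method names are hypothetical, and this illustrates the general technique rather than log4j's actual code.

```java
/** Illustrative only: derive "(File.java:line)" of the caller from a stack trace. */
final class CallerLocationSketch {

    /** Returns the location of the method that called this one, in Eclipse-clickable form. */
    static String callerLocation() {
        // Element 0 is this method, element 1 is its caller.
        // Capturing the whole stack for every log message is what makes this costly.
        StackTraceElement caller = new Throwable().getStackTrace()[1];
        return "(" + caller.getFileName() + ":" + caller.getLineNumber() + ")";
    }

    public static void main(String[] args) {
        System.out.println("DEBUG: something happened at " + callerLocation());
    }
}
```

File name and line number are only available if the classes were compiled with debug information.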

    Situation
    In my project I have this (not uncommon) situation:
    • GAE uses jul
    • Jersey uses jul
    • Other bundled libraries and my code use slf4j
    Goals
    • Clickable hyperlinks for local testing
    • Nice mapping from log levels to GAE icons
    • Same dependencies for local testing and production deployment
    Failed Attempts
    • Use slf4j-jdk14 to route all log output to jul. This works and nicely uses the five different categories on the AppEngine, but at the cost of getting an ugly default pattern. Jul has no equivalent of log4j's PatternLayout.
    • Write a custom formatter for jul. This does not work, because a custom formatter can only be set on a known handler, such as the ConsoleHandler from jul. Unfortunately, the ConsoleHandler does not exist on AppEngine. Attempts to get the existing handlers by creating a logger and listing its handlers or parents failed: It's all empty on appengine.
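For comparison, this is roughly what the custom-formatter attempt looks like in a plain Java environment, where a ConsoleHandler is available. It is a sketch with a made-up class name and output pattern; as described above, this route did not work on AppEngine.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Formatter;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical one-line formatter for java.util.logging. */
final class OneLineFormatter extends Formatter {

    @Override
    public String format(LogRecord record) {
        // Level, message, then the class and method that logged it.
        return String.format("%-7s: %s  at %s.%s%n",
                record.getLevel().getName(),
                formatMessage(record),
                record.getSourceClassName(),
                record.getSourceMethodName());
    }

    public static void main(String[] args) {
        ConsoleHandler handler = new ConsoleHandler();
        handler.setFormatter(new OneLineFormatter()); // a formatter is always attached to a handler
        Logger logger = Logger.getLogger("demo");
        logger.setUseParentHandlers(false);           // avoid duplicate output via the root logger
        logger.addHandler(handler);
        logger.warning("disk almost full");
    }
}
```

The formatter itself is independent of the handler, but jul only ever calls it through a handler, which is where the attempt got stuck.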
    Chosen Solution
    • Use log4j with PatternLayout
    • Log4j writes log messages to System.out, which AppEngine in turn renders at INFO level.
    • Jersey and AppEngine still use all five levels and an ugly pattern.
    Configuration
    This is the main piece of this long post, because I didn't get it right at first.

    In  \src\main\webapp\WEB-INF\appengine-web.xml   you need to have

        <!-- Configure java.util.logging -->

        <system-properties>
            <property name="java.util.logging.config.file" value="WEB-INF/java-util-logging.properties"/>
        </system-properties>

    to tell GAE where jul is configured.
    In  \src\main\webapp\WEB-INF\java-util-logging.properties  you need

    .level = ALL


    or another of the jul level names, as you like (level names from other frameworks, e.g. 'TRACE', do not work).
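Whether a given name is a valid jul level can be checked locally with Level.parse, which throws an IllegalArgumentException for unknown names. A small sketch with a hypothetical helper class:

```java
import java.util.logging.Level;

/** Illustrative helper: tells whether jul accepts a given level name. */
final class JulLevelNameCheck {

    static boolean isValidJulLevel(String name) {
        try {
            // Accepts known names such as ALL, FINE or SEVERE (and also integer values).
            Level.parse(name);
            return true;
        } catch (IllegalArgumentException e) {
            return false; // e.g. TRACE or DEBUG, which are not jul level names
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"ALL", "FINE", "TRACE"}) {
            System.out.println(name + " valid: " + isValidJulLevel(name));
        }
    }
}
```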

    Put your log4j configuration in \src\main\resources\log4j.properties. (If you put the file, as usual, in \src\test\resources\log4j.properties, it won't be deployed to the production server and hence you will have no logging there.) In this file you will have

    log4j.rootLogger=ALL, console


    or a lower log level such as DEBUG, INFO or WARN.
    And you can customize your output like:

    log4j.appender.console=org.apache.log4j.ConsoleAppender

    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%-5p: %m  at %C.(%F:%L) on %d{ISO8601}%n

    You may not use FileAppenders, as GAE does not allow writing files.
    That's it. Here is an overview of how log messages will flow:
    • GAE
      • generates its own log messages and sends them to jul
      • GAE checks if this level should be displayed (java-util-logging.properties), if yes:
      • message appears at correct level with ugly pattern in GAE log
    • Jersey
      • generates log messages and sends them to jul
      • GAE checks if this level should be displayed (java-util-logging.properties), if yes:
      • message appears at correct level with ugly pattern in GAE log
    • Other libraries and your code
      • send log messages to slf4j
      • slf4j sends them to log4j
      • log4j checks if this package & level should log (log4j.properties), if yes:
      • log message is formatted nicely
      • and sent to System.out, which GAE interprets as a jul message at INFO level
      • GAE checks if INFO level should be displayed (java-util-logging.properties), if yes:
      • message is shown with nice pattern at INFO level in GAE log, real level represented as text. E.g. a complete log message then looks like this:
        03-02 02:05PM 23.948
        [myappname/1.340252283834355681].<stdout>: DEBUG: this is debug at org.example.app.Logtest.(Logtest.java:13) on 2010-03-02 22:05:23,947
    So in the end you have 3 barriers for a log message until you see it:
    • Log4j
    • GAE's jul
    • The drop-down box in AppEngine manager app (Show only requests will show everything including requests with no errors).
    I hope this helps other people struggling with logging and maybe somebody has an even nicer solution.