Friday, August 31, 2007

Drools vs JRules Performance and Future R&D

We have had fantastic feedback on Drools 4.0.1, showing that we have real enterprise class performance.

One user just reported on the mailing list that their application, that uses large number of field constraints (more than 4) in a pattern, is over 200x faster in Drools 4.0.1 than in 4.0.0 :)

Some of you may remember my previous blog titled "JBoss Drools vs ILog JRules - an anecdotal story"; where the user had 1300 rules executing in an high volume transaction environment. I've since had further feedback that even after guidance on optimising their JRules implementation, trying both "optimised" and"sequential" modes, that Drools out of the box using the same rules and data is 4 times faster than JRules' best efforts. They have sent me the following results:

Summary

Sequential Rete
Jrules Drools Jrules Drools
Average 16.45 3.53 14.53 3.71





Difference
Jrules/Drools
5 4

Timings were based on making 1000 rule execution calls.
  • There are 5 data set, randomly selected for each rule execution call.
  • Each data set contains about 100 elements.
  • For each execution call, up to 100 rules would fire.
  • Same rule set containing 1219 rules (simple rules) was used for all execution calls.
  • For JRules, the same instance of IlrContext was used and reset after each call. For Drools, a stateless session was created from the same RuleBase and was discarded after each call.
  • I am using JRules 6.6.1, with JIT enabled and calling IlrRuleSet().optimize with hasherGeneration set and wmModifiedByCode unset.
Prior to "optimised" mode our standard Rete was 27 times faster than their standard Rete. JRules "optimised" mode optimises the network, where possible, to execute sequential and with a static agenda - it does not allow dynamic runtime rule additions to the rulebase.

Sequential (seconds)
Drools
Jrules Drools
Jrules Drools
16.04 3.44
95.85 3.5





16.035/3.440 = 5
95.852/3.500 = 27

I must stress that this is just one users story and the information is provided anonymously, thus making it anecdotal, and other users mileage may vary :) But I tell you the story as is, engineer to engineer, with no slight of hand and we have done nothing to help them "tweak" Drools further than any other user gets out of the box. The academic, open and transparent nature to Drools ensures an openness in these sorts of discussions. To be balanced the author said Drools consumed a lot more memory during rule serialisation of rules than JRules. This is due to our reliance on Java serialisation which is not very efficient with graphs and storing byte[]s for class generation in memory, we will address this in the future. We did however consume less memory during runtime execution. Further to that Drools' management system needs to mature more, before a completion transition:
"The top reasons for us to select Drools over JRules is not performance: the #1 is the open API's that allows us to do what is required, #2 is the active community and the support. Exceptional performance is #3, but may gain more weight when time to convert existing projects. I believe both will exist in our company for the near future until Drools' management system matures."

To wet everyone's appetite I have detailed our future performance R&D to show that we are not done yet.

Lazy latch nodes
Parent nodes do not propagate unless there is something to join against in the child node. For small systems this has little impact, but for large, wide, systems it can save memory and provide an increase in performance.

Network disconnection for RuleFlow Groups and Agenda Groups
When data is asserted into the network all patterns for all rules are evaluated, regardless of whether that rule is able to fire. If a rule is not in a ruleflow-group or agenda-group that is currently in focus and active it should be disconnected from the network at the attachment points where there is no node sharing. For truly large systems it should also be possibly to page those detached rules to disk, until their ruleflow-group and/or agenda-group has the focus and is active.

Partial Network Paging to Disk
For really large networks it may be desirable not to have the entire network in memory. Instead the network can be paged to disk, and will only page into memory when data attempts to propagate to that part of the network. This can be applied to the "Network Disconnection" optimisation to page out entire ruleflow-groups and agenda-groups when they are not in-focus and not active. Java serialisation is not very efficient, especially for graphs, so this would need to be combined with a custom wire protocol for marshalling.

Lazy Latch Querries
For queries with no parameters we can add the DroolsQuery pattern to the end of the network for that Rule, instead of the beginning. We then find either the last unshared node or the root beta node and put in a "block" semaphore, that means no data will propagate that is specific to the query rule. When the DroolsQuery object is propagated to the JoinNode it signals the "block" semaphore to propagate all its data. This allows queries to share with existing rule data.

No Memory storage for Querries
As querries are effectively stateless we do not need to store any left input memory for the join nodes. The root part of the query should mark the Tuple as not requiring left input memory storage, so that left memory is not used during propagation. This increases query performance and uses less memory. We may even be able to modify sequential mode so that they both use the same technique to avoid left memory storage.

Isolation and Sharing of the Right Input for a BetaNode
Currently we either share, or or do not share, a whole beta node dependant on whether the entire parent chain is identical. If we can isolate the right input a little more we can, in some situations, allow sharing of that right input. Those situations would be where there is either no indexing, or the indexing is the same, for the beta node. Further to this, for situations with no indexing, we could probably look to just use the parent node as also the right input node. This reduces memory usage and slightly less needed network propagations.

Composite index Alpha Node Hashing
We already have composite index hashing for beta nodes, to a maximum depth of three columns. currently alpha node indexing only works on the a single literal value, composite would allow us to index multiple literal constraints combinations, again a maximum depth would need to be provided to avoid this becomes too expensive. Further to this there may be times when single indexing is preferably, as this is based on literals whether to composite or single should be deterministic at compile time, based on some level of cost analysis.

Node Collapsing
Currently each alpha node constraint is in it's own node, this is needed for dynamic rulebases where rules can be attached at a later date. If we know the rulebase is not going to change, like with stateless sessions, we can collapse similar shared node groups into single execution units; most likely with optional bytecode JIT. For standard Rete execution we will need a property, or method call, for the user to state the rulebase will not change so we can apply this optimisation.

Alpha Node Reordering
Alpha nodes can be re-ordered to maximise node sharing, this will make "Node Collapsing" optimisations even more efficient. Again we would need some indicator that the RuleBase building is complete and that the optimisation can now be applied.

Closed World Assumptions
Allow some data to be marked as "closed world". This means the data will only ever be created inside of the working memory. This way we can analyse the rules and handle the life cycle of the data and allow us to auto-unpropagate it back to the object type node when we know it will have no further impact on resulting conflict sets. This reduces the size of the working memory network and the number of wasteful cross product attempts.

Network Re-Writing via Analysis
I haven't put too much thought into this one, but it must be possible to recognise some patterns where the network can be re-organised to provide better execution, by failing the propagation earlier on. This is similar to network eager-filtering, where we can stop certain facts propagating at an earlier stage in the network, than the join later in the network dictates.

Sub-System threading
While this won't address parallel execution and evaluation of a rule engine (which I consider a very large R&D effort and out of scope for this blog, which focuses on low hanging fruit), it should be possible to identify some sub systems that would benefit from execution in their own thread, without too much complexity - the Agenda sub system is an idea candidate for this.

Whole Bucket Joining
I'm not entirely sure how or if this would work, but the basic idea is instead of joining each tuple with each fact on the right input to form a new tuple, the tuple instead just references the bucket for the matching index for the constraints. The actual tuple joining is delayed until we reach the terminal node, where those buckets are evaluated. We may only be able to apply this to some limited situations, but it will avoid a lot of wasted cross product join attempts if we can make it work.

Indexing for Number Ranges
Indexing number ranges is really only useful for environments where the data doesn't change that often, otherwise the re-balancing of the data structure can outweigh the benefits of indexing. For this reason it will probably need to be combined with some ability to annotate field constraints with the desire to apply the indexing for number ranges.

Reduce the Number of Generated Classes
For the Java dialect we currently generate a single Class for all the expressions and blocks in that rule. However, due to the need to use interfaces in the engine, we generate an Invoker Class for each expression and block, to allow the engine to call that expression or block via the interface. We can instead provide an integer index value per expression and block and allow that integer to be passed via the invoker interface. This means we can generate a single class that implements all expression and block interfaces using the integer value to determine which expression or block we actually call. A single switch statement can be used to determine this, which should have no noticeable impact on execution. This should also mean faster building times as less classes have to be resolved.

7 comments:

  1. From my own experiments with alpha node re-ordering the gains for 10K, and 20K rules with 5, 10, 15, 20, 25 alpha nodes per rule was less than 5%.

    the way I tested it was to generate 3 different sets of rules. the first had the conditions in most inefficient order. The second in optimal order. The third in random order.

    Collapsing the alpha nodes has been done before in previous research. For most cases it probably won't help. AFAIK, the only cases where it would help is with 50K+ rules that have no joins, dozens of conditions (aka alpha nodes) and large dataset over 100K rows.

    In my experiments, I tried 100K, 200K, 300K, 500K and 1million rows of data. In the case of 1million rows, the memory delta was still less than 5%.

    disconnecting parts of the network should provide a significant boost. The trick will be the efficiency of disconnecting and reconnecting the module. You're probably better off organizing the rules into multiple rulesets, then load each ruleset into a different engine. It's manual partitioning of the rules, but it has proven to be very effective.

    RETE analysis is probably the most bang for the buck in terms of time invested versus benefit. With good analysis, the tool can suggest ways of partitioning a large ruleset into smaller ones. With rule flows, that has a great potential. The best example I can think of is when the process splits/forks.

    the easiest way to do a lazy latch, is to just convert the join to a query. there's some literature on this, but I can't remember the paper at the moment. I've used the technique and so have the old timers from Inference and ART.

    ReplyDelete
  2. I'm curious to see the rules for JRules. If your anonymous poster would like some free JRules performance analysis they can contact me on dselman-at-ilog-dot-fr.

    Daniel Selman
    http://blogs.ilog.com/brms/author/dselman/

    ReplyDelete
  3. Greetings:

    It's been a while since I read this back in my Fair Isaac days. However, I do feel that reporting fantastic results for any kind of benchmark does require that the benchmark be repeatable - meaning that both data and rules must be published so that this can be checked by other, more independent verification organizations. Like, well, you know, KBSC or someone like that. :-)

    Seriously, Haley used to publish incredibly wild benchmark results for Miss Manners but they would NOT allow anyone to see the code nor to use their tools and/or code to double check the results.

    I do believe that the client reported exactly what Mark says. However, before Mark publishes something as wild as this he should allow someone to double check the client's work. Almost any set of rules will run one way with one rulebase and another with another rulebase and then completely change 180 degrees when the rules are restructured to favor the other vendor.

    IMHO

    SDG

    ReplyDelete
  4. Here is the rule set Haley used:

    (deftemplate guest
    (field name)
    (field sex)
    (field hobby)
    )
    (deftemplate seating
    (field number-assigned)
    (multi-field assignments)
    )
    (defrule AssignFirstTwoSeats
    (declare (salience -3))
    (guest (name ?guest1) (sex ?sex) (hobby ?hobby))
    (guest (name ?guest2) (sex ~?sex) (hobby ?hobby))
    =>
    (assert (seating (number-assigned 2) (assignments ?guest2 ?guest1)))
    )
    (defrule AssignNextSeat
    (declare (salience 0))
    (guest (name ?guest1) (sex ?sex) (hobby ?hobby))
    (guest (name ?guest2) (sex ~?sex) (hobby ?hobby))
    (seating (number-assigned ?n) (assignments $?others ?guest1&~:(member ?guest2 ?others)))
    =>
    (assert (seating (number-assigned =(+ ?n 1)) (assignments $?others ?guest1 ?guest2)))
    )
    (defrule AbandonAssignment
    (declare (salience -2))
    ?f <- (seating)
    =>
    (retract ?f)
    )
    (defrule FoundSolution
    (declare (salience 1))
    ?f <- (seating (numer-assigned 16) (assignments $?assignments))
    =>
    (printout t "Seatassignments: " $?assignments t)
    (halt)
    )

    As you can see, it bears no resemblance at all to Miranker's benchmark. As we all know, Daniel Miranker went out of his way to write the most inefficient rule set he could think up! He wanted to create a small, easily implementable rule set that would generate huge amounts of work. By contrast, Haley's rule set solves the same problem (seating arrangements) using a well optimised, efficient rule set. Hence, the comparison they do (with JRules, showing approx 15 seconds against Haley's .44 second for 128 guests) is totally meaningless. I would hope that this represents a genuine mistake on the part of Haley, and not a deliberate attempt to mislead.

    ReplyDelete
  5. Mark we are using Drools 4.0.3 for our sales application which is an Eclipse RCP based application. We are having around 10000-15000 rules to be loaded as part of this application. Currently we have tested the application with around 2000 rules. When we create a package and save it in a file, the file size goes nearly in the range of 25-26 MB. Can you suggest any way to optimize the size of the package since we want to scale upto 10000-15000 rules?
    Since this is an RCP application which is going to run in an offline mode (by sales guys) it is very important for us to reduce the loading time of these rules. Appreciate if you provide us some suggestion.

    ReplyDelete
  6. We don't tend to do support in the forums. Please use the user mailing list, or you can get direct priority support from sales@jboss.com

    http://www.jboss.org/drools/lists.html

    ReplyDelete
  7. Performance also needs to be designed into your BRMS integration solutions. i.e. If your data fetch routines are taking two seconds, the 30 millis it takes to execute a rule won't matter in the grander scheme of things. Likewise - if you write a rule that needs to evaluate too many possibilities it may never finish executing.

    Nigel DeFreitas

    ReplyDelete