Infinispan / ISPN-2156

Benchmark and blog about a fast method of loading data into Infinispan


Details

    • Type: Task
    • Resolution: Obsolete
    • Priority: Major
    • None
    • None
    • Components: Core, Documentation

    Description

      To summarise:
      When batch-loading data into a distributed cache, inserting batches of keys that all map to the same node should significantly increase performance.
      Why?
      During the prepare phase, each node receives the complete list of modifications in the transaction, not only the modifications pertaining to it.
      E.g. say we have the following key->node mapping:

      k1 -> A
      k2 -> B
      k3 -> C
      

      Where k1, k2 and k3 are keys; A, B and C are nodes.
      If Tx1 writes (k1, k2, k3), then during the prepare A, B and C will each
      receive the same package containing all the modifications, namely (k1,
      k2, k3). There are several reasons for this (apparently)
      unoptimized approach: the prepare is serialized only once, and
      recovery information is easier to handle.

      Now, if you group transactions/batches based on key distribution, the amount of redundant traffic is significantly reduced, and that translates into better performance, especially when the dataset you're inserting is large.
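      A minimal sketch of the grouping idea. The `ownerOf` function here is a stand-in assumption for the cluster's key-to-node mapping (in a real deployment this would come from Infinispan's consistent hash, not a plain `hashCode` mod); the point is only to show keys being bucketed by owner so that each batch/transaction touches a single node:

      ```java
      import java.util.*;

      public class GroupedBatchLoad {
          // Stand-in for the cluster's consistent hash: maps a key to one of
          // numNodes nodes. A real cluster would consult Infinispan's
          // distribution metadata instead.
          static int ownerOf(String key, int numNodes) {
              return Math.floorMod(key.hashCode(), numNodes);
          }

          // Group keys by owning node so each batch can be loaded in its own
          // transaction, avoiding prepare messages that carry modifications
          // for unrelated nodes.
          static Map<Integer, List<String>> groupByOwner(Collection<String> keys, int numNodes) {
              Map<Integer, List<String>> groups = new HashMap<>();
              for (String key : keys) {
                  groups.computeIfAbsent(ownerOf(key, numNodes), n -> new ArrayList<>()).add(key);
              }
              return groups;
          }

          public static void main(String[] args) {
              List<String> keys = List.of("k1", "k2", "k3", "k4", "k5", "k6");
              Map<Integer, List<String>> batches = groupByOwner(keys, 3);
              // Each batch would then be inserted in its own transaction, e.g.:
              //   tm.begin(); batch.forEach(k -> cache.put(k, valueFor(k))); tm.commit();
              batches.forEach((node, batch) ->
                  System.out.println("node " + node + " -> " + batch));
          }
      }
      ```

      With this grouping, a prepare for any one transaction carries only keys owned by a single node, instead of every node receiving the full modification list.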

      This JIRA is basically about benchmarking and blogging about this approach.
      An entry in the FAQ would be helpful as well.


          People

            Assignee: Unassigned
            Reporter: Mircea Markus (Inactive)
            Votes: 0
            Watchers: 3
