hashing function

  • collision resistance
    • attach SHA1/MD5 less than 2^(N/2), N bits of hash output.
  • uniform distribution

collision resolution

  • chaining with list/RBT/…
    • linkedlist vs RBT
  • open address (hash(key)+ di) mod m, ith times to collision
    • how to avoid cluster?

hash open address application in java api

  • ThreadLocalMap ```java

    private Entry getEntry(ThreadLocal<?> key) {
        int i = key.threadLocalHashCode & (table.length - 1);
        Entry e = table[i];
        if (e != null && e.get() == key)
            return e;
        else
            return getEntryAfterMiss(key, i, e); //
    }
    
  
# hash chaining application in java api 
    * LinkedHashMap
      - double-linked list
      - access order by move access node to last
      - default by insert order
    * Hashtable
      - chaining with single linked list
      - indexing
          - hashtable: int index = (hash & 0x7FFFFFFF) % tab.length;
          - hashmap:   int index = (hash ^ (hash>>>16)) % (tab.length-1);
    * HashMap
      - if binCount <6, untrieefy to list 
      - if binCount >8, (if table.length >64, trieefy to RBT; else resize())
      - if map.size > (threshold=capacity*loadfactor), resize()

```java

     /**
       * The maximum capacity, used if a higher value is implicitly specified
       * by either of the constructors with arguments.
       * MUST be a power of two <= 1<<30.
       */
      static final int MAXIMUM_CAPACITY = 1 << 30;

      /**
       * The load factor used when none specified in constructor.
       */
      static final float DEFAULT_LOAD_FACTOR = 0.75f;

      /**
       * The bin count threshold for using a tree rather than list for a
       * bin.  Bins are converted to trees when adding an element to a
       * bin with at least this many nodes. The value must be greater
       * than 2 and should be at least 8 to mesh with assumptions in
       * tree removal about conversion back to plain bins upon
       * shrinkage.
       */
      static final int TREEIFY_THRESHOLD = 8;

      /**
       * The bin count threshold for untreeifying a (split) bin during a
       * resize operation. Should be less than TREEIFY_THRESHOLD, and at
       * most 6 to mesh with shrinkage detection under removal.
       */
      static final int UNTREEIFY_THRESHOLD = 6;

      /**
       * The smallest table capacity for which bins may be treeified.
       * (Otherwise the table is resized if too many nodes in a bin.)
       * Should be at least 4 * TREEIFY_THRESHOLD to avoid conflicts
       * between resizing and treeification thresholds.
       */
      static final int MIN_TREEIFY_CAPACITY = 64;
      /**
       * The table, initialized on first use, and resized as
       * necessary. When allocated, length is always a power of two.
       * (We also tolerate length zero in some operations to allow
       * bootstrapping mechanics that are currently not needed.)
       * RBT Or LinkedList
       */
      transient Node<K,V>[] table;

      /**
       * The number of key-value mappings contained in this map.
       */
      transient int size;

      /**
       * The number of times this HashMap has been structurally modified
       * Structural modifications are those that change the number of mappings in
       * the HashMap or otherwise modify its internal structure (e.g.,
       * rehash).  This field is used to make iterators on Collection-views of
       * the HashMap fail-fast.  (See ConcurrentModificationException).
       */
      transient int modCount;

      /**
       * The next size value at which to resize (capacity * load factor).
       *
       * @serial
       */
      int threshold;

      /**
       * The load factor for the hash table.
       *
       * @serial
       */
      final float loadFactor;

Java author Joshua J. Bloch

  • the Java Collections Framework
  • http://www.javaworld.com/article/2073877/core-java/joshua-bloch–a-conversation-about-design.html
  • http://www.oracle.com/technetwork/articles/javase/bloch-effective-08-qa-140880.html

Distributed HashTable

  • DHT need to handle:
    • autonomy and decentralization
    • fault tolerance
    • scalability
    • load balance
    • data integrity
    • performance (routing/storage/retrieval)
  • Keyspace partitioning Task: removal or addition of one node changes only the set of keys owned by the nodes with adjacent IDs, and leaves all other nodes unaffected.
    • Consistent hashing hashing ring + tags
    • Rendezvous hashing client side: highest random weight hashing, selected = max((h(s,key) for s in serverlist)
    • Locality-preserving hashing efficient range query by ensure that similar keys are assigned to similar objects.

Bloomfilter

  • used to test whether an element is a member of a set.
  • False positive matches are possible, but false negatives are not
  • – in other words, a query returns either “possibly in set” or “definitely not in set”.
  • Elements can be added to the set, but not removed (though this can be addressed with a “counting” removed_filter, but add removed items back is not supported.);
  • the more elements that are added to the set, the larger the probability of false positives.
    public boolean add(T object) {
        byte[] state = encode(object);

        int hashIterations = this.hashIterations;
        long size = this.size;
        long[] indexes = hash(state, hashIterations, size);
        ...set bits...
    }

    private long[] hash(byte[] state, int iterations, long size) {
        long hash1 = LongHashFunction.xx_r39().hashBytes(state);
        long hash2 = LongHashFunction.farmUo().hashBytes(state);

        long[] indexes = new long[iterations];
        long hash = hash1;
        for (int i = 0; i < iterations; i++) {
            indexes[i] = (hash & Long.MAX_VALUE) % size;
            if (i % 2 == 0) {
                hash += hash2;
            } else {
                hash += hash1;
            }
        }
        return indexes;
    }

Reference

  • https://en.wikipedia.org/wiki/Poisson_distribution