    XPath: Make remove_duplicates generate stable order
    Arseny Kapoulkine authored
    Given an unsorted sequence, remove_duplicates would sort it using the
    pointer value of attributes/nodes and then remove consecutive
    duplicates.
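
    For context, the old approach amounted to roughly the following
    sketch (illustrative, not the original pugixml code): sort by raw
    pointer value, then drop adjacent duplicates.

        #include <algorithm>
        #include <functional>

        // Sketch of the previous approach: the surviving order depends on
        // where the allocator happened to place each node in memory.
        static size_t remove_duplicates_sorted(const void** data, size_t count)
        {
            std::sort(data, data + count, std::less<const void*>());
            return std::unique(data, data + count) - data;
        }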
    
    This was problematic because it meant that the result of XPath queries
    depended on the memory allocation pattern. While it's technically
    incorrect to rely on the order, this resulted in easy-to-miss bugs.
    
    This is particularly common when XPath queries use union operators,
    although remove_duplicates is also called in other cases.
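
    As a concrete example, a union query like the one below (using
    pugixml's public API; the document and query themselves are
    illustrative) used to return its node set in an allocation-dependent
    order once duplicates were removed:

        #include <cstdio>
        #include "pugixml.hpp"

        int main()
        {
            pugi::xml_document doc;
            doc.load_string("<root><a/><b/><a/></root>");

            // Both sides of the union match the same <a> elements, so the
            // combined set contains duplicates that must be removed.
            pugi::xpath_node_set set = doc.select_nodes("//a | //root//a");

            for (pugi::xpath_node n : set)
                std::printf("%s\n", n.node().name());
        }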
    
    This change reworks the code to use a hash set instead, using the same
    hash function we use for compact storage. To make sure it performs well,
    we allocate enough buckets for count * 1.5 elements (assuming all
    elements are unique); since each bucket is a single pointer, unlike
    xpath_node which is two pointers, we need somewhere between size * 0.75
    and size * 1.5 of temporary storage.
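
    A minimal sketch of this stable, hash-based deduplication is shown
    below; the power-of-two sizing, the pointer hash, and the helper
    names are assumptions for illustration, not the exact pugixml
    internals.

        #include <cstddef>
        #include <cstdint>
        #include <vector>

        // Illustrative pointer hash (pugixml reuses its compact-storage hash).
        static size_t hash_pointer(const void* ptr)
        {
            uintptr_t h = reinterpret_cast<uintptr_t>(ptr);
            h ^= h >> 16;
            h *= 0x85ebca6bu;
            h ^= h >> 13;
            return static_cast<size_t>(h);
        }

        // Removes duplicates in place, keeping the first occurrence of each
        // element; returns the new size.
        static size_t remove_duplicates_stable(const void** data, size_t count)
        {
            // Allocate at least count * 1.5 buckets, rounded up to a power of
            // two so the modulo becomes a mask; each bucket is one pointer.
            size_t buckets = 1;
            while (buckets < count * 3 / 2 + 1)
                buckets *= 2;

            std::vector<const void*> table(buckets, nullptr);
            size_t mask = buckets - 1;

            size_t write = 0;

            for (size_t i = 0; i < count; ++i)
            {
                const void* item = data[i];
                size_t slot = hash_pointer(item) & mask;

                // Linear probing: stop at a matching entry or an empty slot.
                while (table[slot] && table[slot] != item)
                    slot = (slot + 1) & mask;

                if (!table[slot])
                {
                    // First occurrence: record it and keep it, preserving the
                    // original order of the surviving elements.
                    table[slot] = item;
                    data[write++] = item;
                }
            }

            return write;
        }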
    
    The resulting filtering is stable - we remove elements that we have seen
    before without changing the order of the rest - and it is actually
    significantly faster than sorting was.
    
    With a large union operation, before this change it took ~56 ms per 100
    query invocations to remove duplicates, and after this change it takes
    ~20 ms.
    
    Fixes #254.