    5f996eba
    Do not emit surrounding whitespace for text nodes · 5f996eba
    Arseny Kapoulkine authored
    Previously we omitted extra whitespace for single PCDATA/CDATA children, but in
    mixed content there was extra indentation before/after text nodes.
    
    One problem is that saved text does not round-trip exactly: parsing the
    output with default flags yields text nodes that carry the extra indentation
    (parse_trim_pcdata helps).
    
    Another problem is that parse-format cycles do not have a fixed point for mixed
    content - the result grows with every cycle. Some XML libraries, like Python
    minidom, have the same issue, but it is still a problem.
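
The missing fixed point can be reproduced with Python's minidom, which the message names as having the same issue. This is a minimal sketch using only the standard library; the helper names `roundtrip` and `roundtrip_lengths` and the sample document are illustrative and not part of pugixml.

```python
from xml.dom import minidom


def roundtrip(xml_text):
    """Parse xml_text and pretty-print it again with minidom."""
    return minidom.parseString(xml_text).toprettyxml(indent="  ")


def roundtrip_lengths(xml_text, cycles):
    """Run several parse-format cycles and record the output length of each."""
    lengths = []
    for _ in range(cycles):
        xml_text = roundtrip(xml_text)
        lengths.append(len(xml_text))
    return lengths


# Mixed content: <p> holds both text nodes and an element child.
lengths = roundtrip_lengths("<p>Hello <b>big</b> world</p>", 4)
print(lengths)  # each cycle adds more whitespace around the text nodes
```

Each cycle wraps the text nodes in new indentation and newlines, which the next parse keeps as part of the text, so the output keeps growing.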
    
    Pretty-printing mixed content is hard. It seems that the only other sensible
    choice is to switch mixed content nodes to raw formatting. In a way the code in
    this change is a weaker version of that - it removes indentation around text
    nodes but still keeps it around element siblings/children.
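
The rule described above (no indentation next to text nodes, indentation kept between element siblings) can be sketched as follows. This is a Python illustration on top of minidom, not pugixml's implementation. It assumes whitespace-only text nodes are dropped, roughly as pugixml's default parse flags do, and the names `significant_children`, `write_element` and `format_xml` are hypothetical.

```python
from xml.dom import minidom, Node
from xml.sax.saxutils import quoteattr

TEXT_TYPES = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)


def significant_children(elem):
    """Element children plus text/CDATA children that are not whitespace-only.

    Assumption: whitespace-only text is dropped. Comments and other node
    types are ignored to keep the sketch short.
    """
    return [c for c in elem.childNodes
            if c.nodeType == Node.ELEMENT_NODE
            or (c.nodeType in TEXT_TYPES and c.data.strip())]


def write_element(elem, out, depth, indent="\t"):
    out.append("<" + elem.tagName)
    for name, value in elem.attributes.items():
        out.append(" %s=%s" % (name, quoteattr(value)))
    children = significant_children(elem)
    if not children:
        out.append("/>")
        return
    out.append(">")
    prev_is_text = False
    for child in children:
        if child.nodeType in TEXT_TYPES:
            # Text is written verbatim, with no whitespace around it.
            out.append(child.toxml())
            prev_is_text = True
        else:
            # Indent an element only when it does not follow a text node.
            if not prev_is_text:
                out.append("\n" + indent * (depth + 1))
            write_element(child, out, depth + 1, indent)
            prev_is_text = False
    # Put the closing tag on its own line only after an element child.
    if not prev_is_text:
        out.append("\n" + indent * depth)
    out.append("</" + elem.tagName + ">")


def format_xml(xml_text):
    doc = minidom.parseString(xml_text)
    out = []
    write_element(doc.documentElement, out, 0)
    return "".join(out) + "\n"


once = format_xml("<p>Hello <b>big</b> world<i/><u/></p>")
twice = format_xml(once)
print(once == twice)  # True: formatting the formatted output changes nothing
```

Because no whitespace is ever written next to a text node, the text content survives a parse-format cycle unchanged, and the only added whitespace sits between elements where the assumed parser discards it.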
    
    Thus we can switch to mixed-raw formatting at some point later, which will be
    a superset of the current behavior.
    
    To do this we have to either switch at the first text node (.NET XmlDocument
    does that), or scan the children of each element for a possible text node and
    switch before we output the first child.
    
    The former behavior seems non-intuitive (and a bit broken); unfortunately, the
    latter behavior can cost up to 20% of the output time for trees *without* mixed
    content.
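
The second strategy (scan the children first, then switch to raw output before writing the first child) might look like the sketch below. It is again a Python illustration on minidom under the assumption that whitespace-only text does not count as a text node. `has_text_child`, `format_node` and `format_document` are hypothetical names, and comments or processing instructions in non-mixed elements are ignored for brevity.

```python
from xml.dom import minidom, Node

TEXT_TYPES = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)


def has_text_child(elem):
    """Scan the direct children for a non-whitespace text or CDATA node."""
    return any(c.nodeType in TEXT_TYPES and c.data.strip()
               for c in elem.childNodes)


def format_node(elem, depth=0, indent="\t"):
    pad = indent * depth
    element_children = [c for c in elem.childNodes
                        if c.nodeType == Node.ELEMENT_NODE]
    # Mixed content (or a leaf): write the whole subtree raw on one line.
    if has_text_child(elem) or not element_children:
        return pad + elem.toxml() + "\n"
    # A shallow clone serializes as '<name attr="v"/>'; reuse it as the open tag.
    shallow = elem.cloneNode(False).toxml()
    parts = [pad + shallow[:-2] + ">\n"]
    for child in element_children:
        parts.append(format_node(child, depth + 1, indent))
    parts.append(pad + "</" + elem.tagName + ">\n")
    return "".join(parts)


def format_document(xml_text):
    return format_node(minidom.parseString(xml_text).documentElement)


once = format_document("<r><p>Hello <b>big</b></p><q><s/></q></r>")
print(once)
```

The `has_text_child` scan runs for every element, including those in trees with no mixed content at all, which is where the extra output cost mentioned above would come from.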
    
    Fixes #13.