Channel: Using Python Iterparse For Large XML Files - Stack Overflow

Answer by Stefan for Using Python Iterparse For Large XML Files

In my experience, iterparse, with or without element.clear (see F. Lundh and L. Daly), cannot always cope with very large XML files: it runs fine for a while, then memory consumption suddenly goes through the roof and either a memory error occurs or the system crashes. If you run into the same problem, the same solution may work for you: the expat parser. See also F. Lundh, or the following example, which uses the OP's XML snippet (plus two umlauts to check that there are no encoding issues):

import xml.parsers.expat
from collections import deque


def iter_xml(inpath: str, outpath: str) -> None:
    def handle_cdata_end():
        nonlocal in_cdata
        in_cdata = False

    def handle_cdata_start():
        nonlocal in_cdata
        in_cdata = True

    def handle_data(data: str):
        # Write character data only when directly inside <desc> and outside CDATA.
        if not in_cdata and open_tags and open_tags[-1] == 'desc':
            # Escape backslashes and newlines so each write stays on one output line.
            data = data.replace('\\', '\\\\').replace('\n', '\\n')
            outfile.write(data + '\n')

    def handle_endtag(tag: str):
        # Pop open tags until the one being closed has been removed.
        while open_tags:
            open_tag = open_tags.pop()
            if open_tag == tag:
                break

    def handle_starttag(tag: str, attrs: 'Dict[str, str]'):
        open_tags.append(tag)

    open_tags = deque()  # stack of currently open element names
    in_cdata = False
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = handle_data
    parser.EndCdataSectionHandler = handle_cdata_end
    parser.EndElementHandler = handle_endtag
    parser.StartCdataSectionHandler = handle_cdata_start
    parser.StartElementHandler = handle_starttag
    # Read the input as bytes so expat handles the encoding itself.
    with open(inpath, 'rb') as infile:
        with open(outpath, 'w', encoding='utf-8') as outfile:
            parser.ParseFile(infile)


iter_xml('input.xml', 'output.txt')

input.xml:

<root><item><title>Item 1</title><desc>Description 1ä</desc></item><item><title>Item 2</title><desc>Description 2ü</desc></item></root>

output.txt:

Description 1ä
Description 2ü
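For comparison, here is a minimal sketch of the iterparse-plus-clear pattern mentioned at the top, applied to the same snippet. It is only an illustration using the standard library's xml.etree.ElementTree, not the exact code from the linked articles; the function name iter_desc is made up for this example, and it assumes every record is an <item> directly under the root.

```python
import io
import xml.etree.ElementTree as ET


def iter_desc(source):
    """Yield the text of each <desc>, discarding finished <item> elements.

    `source` may be a filename or a binary file object.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "item":
            yield elem.findtext("desc", default="")
            # Drop the processed <item> from the root so the tree does not grow.
            root.clear()


if __name__ == "__main__":
    # Same snippet as input.xml above, held in memory for a quick check.
    sample = (
        "<root><item><title>Item 1</title><desc>Description 1ä</desc></item>"
        "<item><title>Item 2</title><desc>Description 2ü</desc></item></root>"
    ).encode("utf-8")
    for text in iter_desc(io.BytesIO(sample)):
        print(text)
```

If this version still runs out of memory on your data, that matches the behaviour described above, and the expat-based function is the one to try.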
