- {
- "cells": [
- {
- "cell_type": "markdown",
- "id": "ec5091c9",
- "metadata": {},
- "source": [
- "# VASP XML File Parsing Examples"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "61270d81",
- "metadata": {},
- "source": [
- "*Note: the data file used in this experiment has 23597933 lines.*"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "df7e3535",
- "metadata": {},
- "source": [
- "## Scenario 1: Add selected r rows (the 2nd and 4th) of each band set element-wise"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "383ca44e",
- "metadata": {},
- "source": [
- "### Method 1.1: Perl\n",
- "\n",
- "Code: `calc_add_2_4.pl`\n",
- "\n",
- "---\n",
- "```Perl\n",
- "if (/<set comment=\"band /) {\n",
- " while (<>) {\n",
- " # Collect at most the first four <r> rows of this band set\n",
- " push @r, [split(\" \", $1)] if @r<=3 && m{<r>\\s*(.*?)\\s*</r>};\n",
- " if (m{</set>}) {\n",
- " # Add the 2nd and 4th rows column by column\n",
- " print join \" \", map { sprintf \"%.4f\", $r[1][$_] + $r[3][$_] } 0..$#{$r[0]};\n",
- " undef @r;\n",
- " last;\n",
- " }\n",
- " }\n",
- "}\n",
- "```\n",
- "---\n",
- "\n",
- "Run:\n",
- "\n",
- "---\n",
- "```Bash\n",
- "perl -lnf calc_add_2_4.pl vasprun_large.xml > out_pl_calc_add_2_4.txt\n",
- "```\n",
- "---\n",
- "\n",
- "Elapsed time: 10 seconds"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "2cccabcf",
- "metadata": {},
- "source": [
- "## Scenario 2: Element-wise sum of all r rows under each band set"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0205dcf4",
- "metadata": {},
- "source": [
- "### Method 2.1: Perl + List::Util\n",
- "\n",
- "Code: `calc_sum.pl`\n",
- "\n",
- "---\n",
- "```Perl\n",
- "if (/<set comment=\"band /) {\n",
- " while (<>) {\n",
- " push @r, [split(\" \", $1)] if m{<r>\\s*(.*?)\\s*</r>};\n",
- " if (m{</set>}) {\n",
- " print join \" \", map {\n",
- " $j = $_;\n",
- " sprintf \"%.4f\", sum( map { $r[$_][$j] } 0..$#r )\n",
- " } 0..$#{$r[0]};\n",
- " undef @r;\n",
- " last;\n",
- " }\n",
- " }\n",
- "}\n",
- "```\n",
- "---\n",
- "\n",
- "Run:\n",
- "\n",
- "---\n",
- "```Bash\n",
- "perl -MList::Util=sum -lnf calc_sum.pl vasprun_large.xml > out_pl_calc_sum.txt\n",
- "```\n",
- "---\n",
- "\n",
- "Elapsed time: 3 minutes 10 seconds"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f9c57121",
- "metadata": {},
- "source": [
- "### Method 2.2: Perl + incremental accumulation loop\n",
- "\n",
- "Code: `calc_add.pl`\n",
- "\n",
- "---\n",
- "```Perl\n",
- "if (/<set comment=\"band /) {\n",
- " @r = ();\n",
- " while (<>) {\n",
- " if (m{<r>\\s*(.*?)\\s*</r>}) {\n",
- " @d = split(\" \", $1);\n",
- " if (@r) {\n",
- " $r[$_] += $d[$_] for 0 .. $#r;\n",
- " } else {\n",
- " @r = @d;\n",
- " }\n",
- " } else {\n",
- " print join \" \", map { sprintf \"%.4f\", $_ } @r;\n",
- " last;\n",
- " }\n",
- " }\n",
- "}\n",
- "```\n",
- "---\n",
- "\n",
- "Run:\n",
- "\n",
- "---\n",
- "```Bash\n",
- "perl -lnf calc_add.pl vasprun_large.xml > out_pl_calc_add.txt\n",
- "```\n",
- "---\n",
- "\n",
- "Elapsed time: 2 minutes 44 seconds"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "86efffbc",
- "metadata": {},
- "source": [
- "### Method 2.3: XQuery (BaseX implementation)\n",
- "\n",
- "Code: `xq_sum.xqy`\n",
- "\n",
- "---\n",
- "```XQuery\n",
- "for $band-set in //set[starts-with(@comment, \"band \")]\n",
- "return\n",
- " for $raw-record in $band-set\n",
- " let $rcount := count($raw-record/r),\n",
- " $data-line := for-each($raw-record/r, function($r) {\n",
- " for-each(tokenize(normalize-space($r), \"\\s+\"), xs:decimal(?))\n",
- " }),\n",
- " $vcount := count($data-line) idiv $rcount\n",
- " return string-join(for-each(1 to $vcount, function($iv) {\n",
- " format-number(sum(for-each(1 to $rcount, function($ir) {\n",
- " $data-line[$vcount * ($ir - 1) + $iv]\n",
- " })), \"0.0000\")\n",
- " }), \" \"\n",
- " )\n",
- "```\n",
- "---\n",
- "\n",
- "Run:\n",
- "\n",
- "---\n",
- "```Bash\n",
- "basex -i vasprun_large.xml xq_sum.xqy > out_basex_xq_sum.txt\n",
- "```\n",
- "---\n",
- "\n",
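- "Elapsed time: 2 minutes 39 seconds\n",
- "\n",
- "**Note:** The output file produced this way has no trailing newline `\\n` at end of file, so `wc -l` reports one line fewer than for the methods above."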
- ]
- },
- {
- "cell_type": "markdown",
- "id": "4b203335",
- "metadata": {},
- "source": [
- "### Method 2.4: Python + lxml\n",
- "\n",
- "Code: `lxml_query.py`\n",
- "\n",
- "---\n",
- "```Python\n",
- "import sys\n",
- "from lxml import etree\n",
- "\n",
- "tree = etree.parse(sys.argv[1])\n",
- "for spin in tree.xpath('//set[starts-with(@comment, \"spin\")]'):\n",
- " for kpoint in spin.findall(\"set\"):\n",
- " for band in kpoint.xpath('.//set[starts-with(@comment, \"band \")]'):\n",
- " print(\" \".join(map(lambda xs: format(sum(xs), \".4f\"),\n",
- " zip(*map(lambda line: map(float,\n",
- " line.strip().split()),\n",
- " band.xpath(\".//r/text()\"))))))\n",
- "```\n",
- "---\n",
- "\n",
- "Run:\n",
- "\n",
- "---\n",
- "```Bash\n",
- "python lxml_query.py vasprun_large.xml > out_python_lxml.txt\n",
- "```\n",
- "---\n",
- "\n",
- "Result: exceeded the memory limit"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.8.12"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
- }
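
Method 2.4 loads the whole document with `etree.parse` before querying it, which is where it exceeded the memory limit. A streaming parse is one possible way around that. The sketch below is not part of the original notebook and has not been timed on `vasprun_large.xml`: it swaps lxml for the standard library's `xml.etree.ElementTree.iterparse`, the helper name `sum_band_sets` is made up for illustration, and it assumes, as the methods above do, that each band set holds only `<r>` rows of whitespace-separated numbers.

```python
import xml.etree.ElementTree as ET


def sum_band_sets(source):
    """Yield one formatted line per <set comment="band ..."> element.

    `source` is a file name or a binary file object (hypothetical helper,
    not taken from the original notebook).
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "r":
            # Keep <r> text alive until the enclosing <set> has ended.
            continue
        if elem.tag == "set" and elem.get("comment", "").startswith("band "):
            rows = [[float(v) for v in r.text.split()]
                    for r in elem.findall("r")]
            if rows:
                # zip(*rows) walks the rows column by column.
                yield " ".join(format(sum(col), ".4f") for col in zip(*rows))
        # Drop text, attributes and children of every finished non-<r>
        # element so that already-processed data can be garbage collected.
        elem.clear()
```

To get the same output format as the other methods, print each yielded line, for example `for line in sum_band_sets("vasprun_large.xml"): print(line)`. Clearing finished elements is what should keep memory use low; `<r>` elements are skipped at that point because their text is still needed when the enclosing set ends.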