Glom | A Python library for formatting and cleaning data

This post introduces glom, Python's missing operator for nested objects and data.

If you're an easy sell, full API docs and tutorial are already available at glom.readthedocs.io.
Harder sells, this 5-minute post is for you.
Really hard sells, meet me at PyCon.

The Spectre of Structure

In the Python world, there's a saying: "Flat is better than nested."

Maybe times have changed or maybe that adage just applies more to code than data. In spite of the warning, nested data continues to grow, from document stores to RPC systems to structured logs to plain ol' JSON web services.

After all, if "flat" was the be-all-end-all, why would namespaces be one honking great idea? Nobody likes artificial flatness, nobody wants to call a function with 40 arguments.

Nested data is tricky though. Reaching into deeply structured data can get you some ugly errors. Consider this simple line:

value = target.a['b']['c']

That single line can result in at least four different exceptions, each less helpful than the last:

AttributeError: 'TargetType' object has no attribute 'a'
KeyError: 'b'
TypeError: 'NoneType' object has no attribute '__getitem__'
TypeError: list indices must be integers, not str
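If you want to see those failure modes for yourself, here is a small sketch that triggers each one in turn. TargetType, reach, and failure_name are hypothetical names made up for this illustration, and the exact message text varies between Python versions, so the sketch only records the exception type.

```python
class TargetType:
    """Hypothetical stand-in for whatever object `target` happens to be."""
    def __init__(self, **attrs):
        self.__dict__.update(attrs)


def reach(target):
    # The same access as the single line above.
    return target.a['b']['c']


def failure_name(target):
    """Return the name of the exception raised by reach(), or None on success."""
    try:
        reach(target)
    except Exception as exc:
        return type(exc).__name__
    return None


cases = {
    'no attribute a': TargetType(),
    'no key b': TargetType(a={}),
    'b is None': TargetType(a={'b': None}),
    'b is a list': TargetType(a={'b': ['x']}),
    'all present': TargetType(a={'b': {'c': 42}}),
}

results = {label: failure_name(target) for label, target in cases.items()}
for label, name in results.items():
    print(label, '->', name)
```

Two of the four cases surface as the same TypeError, which is part of why the traceback alone rarely tells you which level of the structure went missing.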

Clearly, we need our tools to catch up to our nested data.

Enter glom.

Restructuring Data

glom is a new approach to working with data in Python, featuring:

Path-based access for nested structures
Declarative data transformation using lightweight, Pythonic specifications
Readable, meaningful error messages
Built-in data exploration and debugging features

A tool as simple and powerful as glom attracts many comparisons.

While similarities exist, and are often intentional, glom differs from other offerings in a few ways:

Going Beyond Access

Many nested data tools simply perform deep gets and searches, stopping short after solving the problem posed above. Realizing that access almost always precedes assignment, glom takes the paradigm further, enabling total declarative transformation of the data.

By way of introduction, let's start off with space-age access, the classic "deep-get":

from glom import glom

target = {'galaxy': {'system': {'planet': 'jupiter'}}}
spec = 'galaxy.system.planet'

output = glom(target, spec)
# output = 'jupiter'
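For comparison, here is roughly what a hand-rolled deep-get looks like in plain Python. This is a minimal sketch and not how glom is implemented; deep_get and its LookupError message are assumptions made up for the illustration.

```python
_MISSING = object()  # sentinel so that None can be used as a real default


def deep_get(target, path, default=_MISSING):
    """Hypothetical helper: follow a dotted path through dicts and objects."""
    value = target
    for part in path.split('.'):
        try:
            if isinstance(value, dict):
                value = value[part]
            else:
                value = getattr(value, part)
        except (KeyError, AttributeError):
            if default is _MISSING:
                # Say which step of which path failed, not just the bare key.
                raise LookupError(
                    'could not access %r in path %r' % (part, path)
                ) from None
            return default
    return value


target = {'galaxy': {'system': {'planet': 'jupiter'}}}

found = deep_get(target, 'galaxy.system.planet')
fallback = deep_get(target, 'galaxy.system.moon', default=None)
```

Even this small helper has to decide between keys and attributes, what counts as missing, and how to report the failure. Those are the decisions a one-line spec lets you skip.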

Some quick terminology:

target is our data, be it dict, list, or any other object
spec is what we want output to be

With output = glom(target, spec) committed to memory, we're ready for some new requirements.

Our astronomers want to focus in on the Solar System and represent planets as a list. Let's restructure the data to make a list of names:

target = {'system': {'planets': [{'name': 'earth'}, {'name': 'jupiter'}]}}

glom(target, ('system.planets', ['name']))
# ['earth', 'jupiter']

And let's say we want to capture a parallel list of moon counts with the names as well:

target = {'system': {'planets': [{'name': 'earth', 'moons': 1},
                                 {'name': 'jupiter', 'moons': 69}]}}

spec = {'names': ('system.planets', ['name']),
        'moons': ('system.planets', ['moons'])}

glom(target, spec)
# {'names': ['earth', 'jupiter'], 'moons': [1, 69]}

We can react to changing data requirements as fast as the data itself can change, naturally restructuring our results, despite the input's nested nature. Like a list comprehension, but for nested data, our code mirrors our output.
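To make the list comprehension comparison concrete, here is the same restructuring written by hand, without glom. It is only a sketch for side-by-side reading, using the sample target from above.

```python
target = {'system': {'planets': [{'name': 'earth', 'moons': 1},
                                 {'name': 'jupiter', 'moons': 69}]}}

# Reach into the nested data once, then build each output list by hand.
planets = target['system']['planets']

output = {
    'names': [planet['name'] for planet in planets],
    'moons': [planet['moons'] for planet in planets],
}
print(output)
```

The comprehension version works, but the shape of the output is spread across the access code. In the spec, the keys of the result and the paths that feed them sit next to each other.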

And we're just getting started.

True Python-Native

Most other implementations are limited to a particular data format or pure model, be it jmespath or XPath/XSLT. glom makes no such sacrifices of practicality, harnessing the full power of Python itself.

Going back to our example, let's say we wanted to get an aggregate moon count:

target = {'system': {'planets': [{'name': 'earth', 'moons': 1},
                                 {'name': 'jupiter', 'moons': 69}]}}


glom(target, {'moon_count': ('system.planets', ['moons'], sum)})
# {'moon_count': 70}
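One way to read the tuple in that spec is as a pipeline, where each step receives the result of the step before it. The plain-Python sketch below mimics that reading. run_steps is a hypothetical helper written for this illustration, and the lambdas stand in for the 'system.planets' and ['moons'] steps.

```python
from functools import reduce

target = {'system': {'planets': [{'name': 'earth', 'moons': 1},
                                 {'name': 'jupiter', 'moons': 69}]}}


def run_steps(value, steps):
    """Hypothetical helper: feed each step's result into the next step."""
    return reduce(lambda acc, step: step(acc), steps, value)


steps = (
    lambda t: t['system']['planets'],                       # like 'system.planets'
    lambda planets: [planet['moons'] for planet in planets],  # like ['moons']
    sum,                                                    # any callable fits here
)

moon_count = run_steps(target, steps)
print({'moon_count': moon_count})
```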

With glom, you have full access to Python at any given moment. Pass values to functions, whether built-in, imported, or defined inline with lambda. But glom doesn't stop there.

Now we get to one of my favorite features by far. Leaning into Python's power, we unlock the following syntax:

from glom import T

spec = T['system']['planets'][-1].values()

glom(target, spec)
# dict_values(['jupiter', 69]) on Python 3; a plain list, ['jupiter', 69], on Python 2

What just happened?

T stands for target, and it acts as your data's stunt double. T records every key you get, every attribute you access, every index you index, and every method you call. And out comes a spec that's usable like any other.
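If the stunt-double idea feels like magic, the sketch below shows the general recording trick in miniature. It is a simplified illustration and not glom's actual T. Recorder and replay are hypothetical names, and the real thing handles far more cases.

```python
class Recorder:
    """Hypothetical, simplified illustration of a recording object."""

    def __init__(self, ops=()):
        self._ops = ops  # tuple of recorded operations, oldest first

    def __getitem__(self, key):
        return Recorder(self._ops + (('item', key),))

    def __getattr__(self, name):
        # Only reached for attributes that do not exist on the recorder itself.
        if name.startswith('_'):
            raise AttributeError(name)
        return Recorder(self._ops + (('attr', name),))

    def __call__(self, *args, **kwargs):
        return Recorder(self._ops + (('call', args, kwargs),))


def replay(recorder, target):
    """Apply the recorded operations, in order, to real data."""
    value = target
    for op in recorder._ops:
        if op[0] == 'item':
            value = value[op[1]]
        elif op[0] == 'attr':
            value = getattr(value, op[1])
        else:
            value = value(*op[1], **op[2])
    return value


target = {'system': {'planets': [{'name': 'earth', 'moons': 1},
                                 {'name': 'jupiter', 'moons': 69}]}}

R = Recorder()
recorded = R['system']['planets'][-1].values()  # touches no data yet

last_planet = list(replay(recorded, target))
```

Building recorded never looks at target, which is why nothing can fail at that point. Any error can only surface later, when the recording is replayed against real data.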

No more worrying if an attribute is None or a key isn't set. Take that leap with T. T never raises an exception, so worst case you get a meaningful error message when you run glom() on it.

And if you're ok with the data not being there, just set a default:

glom(target, T['system']['comets'][-1], default=None)
# None
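For a sense of what that default replaces, here is the usual hand-written equivalent in plain Python. It is a sketch of a common pattern and not what glom does internally. The target here is sample data with no 'comets' key.

```python
target = {'system': {'planets': [{'name': 'earth'}, {'name': 'jupiter'}]}}

try:
    value = target['system']['comets'][-1]
except (KeyError, IndexError, TypeError, AttributeError):
    # Any level of the structure may be missing or of the wrong type,
    # so every one of these has to be caught to be safe.
    value = None

print(value)
```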

Finally, null-coalescing operators for Python!

But so much more. This kind of dynamism is what made me fall in love with Python. No other language could do it quite like this.

That's why glom will always be a Python library first and a CLI second. Oh, didn't I mention there was a CLI?

Library first, then CLI

Tools like jq provide a lot of value on the console, but leave a dubious path forward for further integration. glom's full-featured command-line interface is only a stepping stone to using it more extensively inside application logic.

$ pip install glom
$ curl -s https://api.github.com/repos/mahmoud/glom/events \
  | glom '[{"type": "type", "date": "created_at", "user": "actor.login"}]'

Which gets us:

[
  {
    "date": "2018-05-09T03:39:44Z",
    "type": "WatchEvent",
    "user": "asapzacy"
  },
  {
    "date": "2018-05-08T22:51:46Z",
    "type": "WatchEvent",
    "user": "CameronCairns"
  },
  {
    "date": "2018-05-08T03:27:27Z",
    "type": "PushEvent",
    "user": "mahmoud"
  },
  {
    "date": "2018-05-08T03:27:27Z",
    "type": "PullRequestEvent",
    "user": "mahmoud"
  }
...
]

Piping hot JSON into glom with a cool Python literal spec, with pretty-printed JSON out. A great way to process and filter API calls, and explore some data. Something genuinely enjoyable, because you know you won't be stuck in this pipe dream.

Everything on the command line ports directly into production-grade Python, complete with better error handling and limitless integration possibilities.

Next steps

Never before glom have I put a piece of code into production so quickly.

Within two weeks of the first commit, glom had already proved worth its weight in gold, with glom specs replacing Django REST Framework code 2x to 5x their size, making the codebase faster and more readable. Meanwhile, glom's core is so tight that we're on pace to have more docs and tests than code very soon.

The glom() function is stable, along with the rest of the API, unless otherwise specified.

A lot of other features are baking or in the works. For now, we'll be focusing on the following growth areas:

Validation functionality, in the vein of schema and cerberus
CLI robustness, better error messages, etc.
Extension API, clean up some internal code, open up extensions
Automatic registration of default behaviors for co-installed packages (e.g., Django)

We'll be talking about all of this and more at PyCon, so swing by if you can. In either case, I hope you'll try glom out and let us know how it goes!
