Organize Your Data with Python Data Descriptors

Organizational data often presents itself as being neatly organized, well-behaved, and logical on the outside, much like a library bookshelf. But, let’s face it—databases can get messy.

As organizations add new data from separate sources into their data lakes, it can become challenging to maintain a consistent representation of that data. The developers responsible for integrating one source may be separated by space and time from those responsible for integrating another.

Further, representations of data in working memory and in a database might be quite different. When interacting with a common database, developers must manage the transformations of data formats to and from working memory as new applications emerge and old applications evolve.

While good development practices and documentation help keep representations aligned and systems functioning, some good software library design can alleviate much of the pain. While data sources might disagree on formatting of phone numbers or addresses and developers might not consistently apply the proper formatting logic, we can automate data representation to achieve effortless consistency. 

What makes this possible? A powerful but little-known feature of Python: the data descriptor. In this post, you’ll learn more about data descriptors and how you can use them to make your databases more consistent and easier to navigate. 

Modeling Data with Python 

Suppose you have a data system that pulls content from one source and stores information about people and their phone numbers and addresses. Your object model for a person might just store content directly, like:

class Person: 
   def __init__(self, name, phone_number, address): 
      self.name = str(name) 
      self.phone_number = str(phone_number) 
      self.address = str(address) 

While this code is correct and concise, it is not very robust or extensible. Of course, data is never quite in the right format, and we’ll need to convert phone numbers and addresses into a consistent format, and handle errors appropriately. If you’re creating a library of objects to be reused and subclassed, you may easily find yourself dealing with phone numbers in different objects (offices, for example) that require duplicate phone number processing code.  

Code specific to phone numbers and addresses might be written within the Person object, mixing responsibilities in a cumbersome and illogical fashion. Developers may end up with data in conflicting formats, resulting in bugs and technical debt down the road. Enter data descriptors, which will enable you to write more readable, reusable code. 

What are Data Descriptors? 

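A data descriptor is an object that defines __set__ (or __delete__), usually together with __get__, and is assigned to a class attribute. Once it is in place, Python routes reads and writes of that attribute on every instance through the descriptor's methods, which allows convenient under-the-hood formatting and validation. In short, data descriptors let you override attribute access and assignment.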
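As a minimal sketch of the mechanism, here is a toy data descriptor. The Upper and Tag names are hypothetical and exist only for illustration; they are not part of the Person model used in the rest of this post.

```python
class Upper:
    """Hypothetical data descriptor that stores strings in upper case."""

    def __set_name__(self, owner, name):
        # Remember which attribute name this descriptor manages.
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            # Accessed on the class itself: return the descriptor.
            return self
        return instance.__dict__[self.name]

    def __set__(self, instance, value):
        # Every assignment to the managed attribute passes through here.
        instance.__dict__[self.name] = str(value).upper()


class Tag:
    label = Upper()

    def __init__(self, label):
        self.label = label  # routed through Upper.__set__


t = Tag("draft")
print(t.label)   # DRAFT
t.label = "final"
print(t.label)   # FINAL
```

Note that the class using the descriptor contains no formatting logic at all; the assignment in __init__ and the later reassignment both go through the same __set__ method.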

Why would you want to do that? For one, it allows you to encapsulate data processing and formatting, which enables some powerful and useful programming patterns. Let’s start our exploration with a basic Python object representing a person in a database. 

class Person: 
   def __init__(self, name, phone_number, address): 
      self.name = str(name) 
      self.phone_number = str(phone_number) 
      self.address = str(address) 

This example just stores data directly. Of course, this does nothing to help manage formatting differences between databases. Using data descriptors for Name, Phone, and Address, you can rewrite the Person object as: 

class Person: 
   name = Name() 
   phone_number = Phone() 
   address = Address() 

   def __init__(self, name, phone_number, address): 
      self.name = name 
      self.phone_number = phone_number 
      self.address = address 

This is still very concise as we only add three lines. Up front in the class definition, we have declared that “name” is managed by the Name descriptor, phone_number is managed by the Phone descriptor, and address is managed by the Address descriptor. 

The Name, Phone, and Address objects contain the magic necessary to consistently process and format the input, managing every variable assignment and access for the class attributes managed by the descriptors. For example, the Phone descriptor manages self.phone_number, so its “set” and “get” routines are overridden. 
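To make that concrete, here is one way a Phone descriptor might look. This is only a sketch under stated assumptions: the article does not define Phone, so the ten-digit rule, the display format, and the Office class are all hypothetical choices made for illustration.

```python
import re


class Phone:
    """Hypothetical Phone descriptor: keeps only digits, expects 10 of them."""

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        digits = instance.__dict__[self.name]
        # Present the stored digits in a single display format (an assumption).
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

    def __set__(self, instance, value):
        # Drop everything that is not a digit, then validate the length.
        digits = re.sub(r"\D", "", str(value))
        if len(digits) != 10:
            raise ValueError(f"expected 10 digits, got {value!r}")
        instance.__dict__[self.name] = digits


class Office:
    phone_number = Phone()

    def __init__(self, phone_number):
        self.phone_number = phone_number


a = Office("555-666-6846")
b = Office("(555) 666 6846")
print(a.phone_number)                    # (555) 666-6846
print(a.phone_number == b.phone_number)  # True
```

Because the logic lives in the descriptor, any class that needs a phone number can declare the attribute in one line and get the same behavior.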

Now, Python does have getters and setters with the @property decorator pattern. These are implemented in Python as a special type of data descriptor! However, a deeper understanding of descriptors can allow you to take advantage of more powerful design patterns.
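You can check the claim about @property yourself with the standard library's inspect module. The Circle class below is a hypothetical stand-in used only for this check.

```python
import inspect


class Circle:
    def __init__(self, radius):
        self._radius = radius

    @property
    def radius(self):
        return self._radius

    @radius.setter
    def radius(self, value):
        if value < 0:
            raise ValueError("radius must be non-negative")
        self._radius = value


# The property object lives on the class, like any other descriptor.
prop = Circle.__dict__["radius"]
print(type(prop).__name__)             # property
print(inspect.isdatadescriptor(prop))  # True
print(hasattr(prop, "__get__"), hasattr(prop, "__set__"))  # True True
```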

In the rest of this article, we’ll examine data descriptors in more detail, demonstrating how they are defined and motivating their use with some examples.

Managed Attribute Formatting 

In the Person class, we want to associate a name, phone number, and address with an individual. This data might be supplied from several different databases, so a name might be passed in different formats, like: 

  • “Hájek, Petr” 
  • “Petr Hájek” 
  • (“HÁJEK”, “PETR”)

Or, the field might be None or something unexpected like 5730925. We could process the name field as follows: 

def __init__(self, name, phone_number, address): 
   # check if name is None or empty 
   if not name: 
      raise ValueError("name field is invalid") 

   # special handling if name is a string 
   if isinstance(name, str): 
      # split on first comma 
      names = [x.strip() for x in name.split(',', 1)] 
      if len(names) > 1: 
         self.name = " ".join(names[::-1]) 
      else: 
         self.name = names[0] 
   else: 
      # assume iterable collection of strings 
      names = [n.strip() for n in name] 
      # str.join takes a single iterable: given names first, family name last
      self.name = " ".join(names[1:] + [names[0]])

   self.name = self.name.title()  # enforce title-case naming

That’s an ugly mess for a piece of code that semantically just wants to pass an input into a “name” variable. You could refactor this by extracting the formatting code into a separate formatting function:

def format_name(name): 
   # check if name is None or empty 
   if not name: 
      raise ValueError("name field is invalid") 

   # special handling if name is a string 
   if isinstance(name, str): 
      # split on first comma 
      names = [x.strip() for x in name.split(',', 1)] 
      if len(names) > 1: 
         name = " ".join(names[::-1]) 
      else: 
         name = names[0] 
   else: 
      # assume iterable collection of strings 
      names = [n.strip() for n in name] 
      # str.join takes a single iterable: given names first, family name last
      name = " ".join(names[1:] + [names[0]])

   return name.title() 

... 

# in Person class:  

def __init__(self, name, phone_number, address): 
   self.name = format_name(name) 

That looks much better, from the perspectives of both the Person class and the data formatting logic. This solution separates the responsibilities, allowing developers to understand, develop, and debug the whole system more effectively. 

The format_name() function can also be reused, allowing other parts of the codebase to use the same formatting code. But it is still a brittle solution. For example:

p = Person("Jon Snow", "5556666846", "The Wall, Apt 2, The North") 
p.name = 42

When someone uses the Person object later, they can assign values to the attribute at any time and bypass the formatter function. Whoops. Using @property, an “off-the-shelf” descriptor, can help protect class attributes from being improperly used or assigned. 

class Person: 
   def __init__(self, name, phone_number, address): 
      self._name = None 
      self._phone_number = None 
      self._address = None 

      self.name = name 
      self.phone_number = phone_number 
      self.address = address 

   @property 
   def name(self): 
      return self._name.title() 

   @name.setter 
   def name(self, val): 
      self._name = format_name(val) 

The getter-setter pattern allows you to define how attributes are accessed and assigned anywhere, which helps with the previous Jon Snow example. It relies on creating private attributes such as self._name to store the data with the object. 

The repetition of having name and _name defined in __init__ seems unwieldy. Python best practices recommend defining all attributes in the __init__() method, although that isn’t strictly necessary in this case, so it would technically be correct to remove lines like self._name = None. That would only hide a subtler issue, however: there are now two ways to get and set the same variable within the Person object, through the name property and through the attribute self._name.

A developer down the road might interact directly with the private attributes since they are defined and available. Subclasses might be implemented with direct interactions with the private attributes. This can make for trouble. Further, if you are going to use the same access pattern within multiple objects, you will end up repeating a lot of getter and setter boilerplate. 
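Here is a short sketch of that failure mode. The setter body is a simplified stand-in for format_name(), not the full formatter from earlier.

```python
class Person:
    def __init__(self, name):
        self.name = name

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, val):
        # Stand-in for format_name(): require a non-empty string, title-case it.
        if not isinstance(val, str) or not val.strip():
            raise ValueError("name field is invalid")
        self._name = val.strip().title()


p = Person("jon snow")
print(p.name)   # Jon Snow

p._name = 42    # nothing stops this; the setter never runs
print(p.name)   # 42
```

The leading underscore is only a convention, so the validation in the setter protects p.name but not p._name.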

You can make your own data descriptor instead! This pattern isn’t described much in tutorials on the web, but Python’s docs have plenty of detail. 

A Simple Descriptor

Here’s what the Name descriptor could look like.

class Name: 
   def __set_name__(self, owner, name): 
      self.name = str(name) 

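   def __get__(self, instance, owner_type=None): 
      if instance is None: 
         # accessed on the class (e.g. Person.name): return the descriptor itself 
         return self 
      val = instance.__dict__.get(self.name) or "" 
      return str(val).title() 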

   @staticmethod 
   def format_name(val): 
      # check if name is None or empty 
      if not val: 
         raise ValueError("name is invalid") 

      # special handling if name is a string 
      if isinstance(val, str): 
         # split on first comma 
         names = [x.strip() for x in val.split(',', 1)] 
         if len(names) > 1: 
            name = " ".join(names[::-1]) 
         else: 
            name = names[0] 
      else: 
         # assume iterable collection of strings 
         names = [n.strip() for n in val] 
         # str.join takes a single iterable: given names first, family name last
         name = " ".join(names[1:] + [names[0]])

      return name 

   def __set__(self, instance, val): 
      instance.__dict__[self.name] = self.format_name(val) 

Let’s break it down. 

The first thing to keep in mind is that the Name object manages an attribute for another object. __set_name__ is a special method introduced for convenience in Python 3.6. Python calls it once, when the owning class (here, Person) is created, passing the owning class and the attribute name the descriptor was assigned to. That is how the Name descriptor learns which attribute it manages.
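The timing of __set_name__ is easy to observe. In this sketch, Traced is a hypothetical class that only implements __set_name__ (so it is not yet a full descriptor); it simply records what Python passes in.

```python
class Traced:
    """Hypothetical helper that records how it was bound to a class."""

    def __set_name__(self, owner, name):
        # Called once, when the owning class is created.
        self.owner = owner
        self.name = name


class Person:
    name = Traced()
    address = Traced()


# No Person instance exists yet, but both objects already know their names.
print(Person.__dict__["name"].name)             # name
print(Person.__dict__["address"].name)          # address
print(Person.__dict__["name"].owner is Person)  # True
```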

Let’s take a look at __set__ next, which overrides the attribute assignment operation. Here, instance is the object whose attribute is being assigned (a Person, in our case). The formatted value is stored with instance.__dict__[self.name], going through the managed object’s __dict__ so that the value lives on the instance under the attribute’s own name.

Then, __get__ is called when someone retrieves the value of the managed attribute. Again, the managed attribute is referenced by name with instance.__dict__. Note that calling get on a dictionary returns None by default if the key doesn’t exist, and the or "" fallback turns a missing or empty value into an empty string rather than raising an error.
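The following sketch shows where the value actually lives and what the fallback does. It uses a cut-down Name descriptor whose __set__ only strips whitespace, an assumption made to keep the example short.

```python
class Name:
    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner_type=None):
        if instance is None:
            return self
        # dict.get returns None for a missing key; "or" turns that into "".
        val = instance.__dict__.get(self.name) or ""
        return str(val).title()

    def __set__(self, instance, val):
        # Simplified stand-in for format_name(): strip whitespace only.
        instance.__dict__[self.name] = str(val).strip()


class Person:
    name = Name()


p = Person()
print(repr(p.name))   # '' because nothing has been assigned yet
p.name = "  petr hájek "
print(vars(p))        # {'name': 'petr hájek'}
print(p.name)         # Petr Hájek
```

The raw, stripped value is what sits in the instance's __dict__; the title-casing happens on every read.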

Let’s look back at how the descriptors were created: 

class Person: 
   name = Name() 
   phone_number = Phone() 
   address = Address() 

   def __init__(self, name, phone_number, address): 
      ... 

Note that the attributes are defined and associated with the descriptors outside of the __init__ method. This is a key point: one instance of a descriptor is created for a class; that instance of the descriptor is shared by all the instances of the class. That’s why you get the instance and owner arguments in the special methods. That opens up some interesting possibilities!
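A small sketch makes the sharing visible. Counted is a hypothetical descriptor that keeps a counter on itself while storing each value on the instance it was assigned to.

```python
class Counted:
    """Hypothetical descriptor that counts assignments across all instances."""

    def __init__(self):
        self.assignments = 0  # lives on the descriptor, so it is shared

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.__dict__[self.name]

    def __set__(self, instance, value):
        self.assignments += 1
        instance.__dict__[self.name] = value  # per-instance storage


class Person:
    name = Counted()

    def __init__(self, name):
        self.name = name


a, b = Person("Ada"), Person("Grace")
print(a.name, b.name)           # Ada Grace
print(Person.name.assignments)  # 2
```

Each Person keeps its own name, but both assignments were handled by the single Counted object attached to the class.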

Going Further: Maintaining Statistics

Descriptors open the door to many other possibilities. In the previous example, we didn’t really have to define a custom __get__ method. __set__ would be enough for that use case. But what if you’re interested in a value within the context of its community, not just a value in isolation? Let’s consider this basic example: 

class Height: 
   """Descriptor maintaining statistics about height of individuals in a population 
   """ 
   def __init__(self): 
      self.total = 0 
      self.n = 0 

   def __set_name__(self, owner, name): 
      self.name = str(name) 

   def __get__(self, instance, owner_type=None): 
      if instance is None: 
         # accessed on the class: return the descriptor itself 
         return self 

      val = instance.__dict__.get(self.name) 
      if val is None: 
         return None 

      avg = self.total / self.n 

      return { 
         'value': val, 
         'average': avg, 
         'delta': val - avg 
      } 

   def __set__(self, instance, val): 
      curr = instance.__dict__.get(self.name) 
      # compare against None so that a stored height of 0 is still replaced 
      if curr is not None: 
         self.total -= curr 
         self.n -= 1 

      instance.__dict__[self.name] = val 
      self.total += val 
      self.n += 1 


class Person: 
   """Simple class to track a "height" attribute managed by the Height descriptor 
   """ 
   height = Height() 

   def __init__(self, height=0): 
      self.height = int(height) 

if __name__ == "__main__": 
   # Demonstration: show height and deviation from average height as values are created. 
   # Note the difference between the delta of the first Person when it's added 
   # and the delta of the same Person after all the values are added. 
   people = [] 
   for i in range(20): 
      x = Person(height=i) 
      print(x.height['value'], x.height['delta']) 
      people.append(x) 

   print(people[0].height['delta']) 
   people[0].height = 42 
   print(people[0].height['delta']) 

The following is the output of this example:

0 0.0 
1 0.5 
2 1.0 
3 1.5 
4 2.0 
5 2.5 
6 3.0 
7 3.5 
8 4.0 
9 4.5 
10 5.0 
11 5.5 
12 6.0 
13 6.5 
14 7.0 
15 7.5 
16 8.0 
17 8.5 
18 9.0 
19 9.5 
-9.5 
30.4 

As elements are added in height order, note how the “delta” (the difference from the running average) grows. On the second-to-last line, we have requested the updated delta for the first entry, which is now -9.5 instead of 0.0. The last line shows the delta after that entry is reassigned to 42: changing a single value updates the average of the entire population. This is a contrived example, but it can be useful to consider the relationships between values beyond just the values on their own.

The situation becomes even more interesting when considering the descriptor within the context of asyncio. Suppose we have a web scraper collecting data from several independent web sources. To maximize the efficiency of such a Python scraper, you can yield control to the event loop while waiting on a networked resource and resume processing when the data is available. This allows other code to run in the meantime.

With asyncio, this programming model is quite efficient. Maintaining running statistics at the descriptor level can greatly reduce the effort needed to collate the data from different scrapers, essentially offering the combined result “for free.”
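Here is a rough sketch of that idea. Everything in it is hypothetical: RunningMean, Listing, the source names, and the prices are invented for illustration, and asyncio.sleep stands in for a real network request.

```python
import asyncio


class RunningMean:
    """Hypothetical descriptor keeping a running mean over all instances."""

    def __init__(self):
        self.total = 0.0
        self.n = 0

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.__dict__.get(self.name)

    def __set__(self, instance, val):
        prev = instance.__dict__.get(self.name)
        if prev is not None:
            # Replace the old contribution rather than counting it twice.
            self.total -= prev
            self.n -= 1
        instance.__dict__[self.name] = val
        self.total += val
        self.n += 1

    @property
    def mean(self):
        return self.total / self.n if self.n else None


class Listing:
    price = RunningMean()


async def scrape(source, delay, price):
    # asyncio.sleep stands in for a network request; awaiting it
    # yields control so the other scrapers can make progress.
    await asyncio.sleep(delay)
    item = Listing()
    item.price = price
    return source, item


async def main():
    return await asyncio.gather(
        scrape("site-a", 0.02, 10.0),
        scrape("site-b", 0.01, 20.0),
        scrape("site-c", 0.03, 30.0),
    )


results = asyncio.run(main())
print(Listing.price.mean)  # 20.0
```

Since asyncio runs these coroutines on a single thread and __set__ contains no await, each update to the shared totals completes before another scraper resumes. Code that shares a descriptor across threads would need its own locking.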

What’s Next?

This article merely scratches the surface of the possibilities afforded by data descriptors. They can give your organization better control over your data while teams develop code in parallel, and they can enable new computations and interesting code patterns. So, how will you use them? 
